How to Deal With the Challenges of Unleashing the Four Horsemen of Big Data
Sidestep the stampede and grab the reins.
Once you have tapped your data source for your project, be very afraid. You have unleashed the four horsemen of Big Data.
You have Volume, Variety, Veracity, and Velocity, not Death, Famine, War, and Conquest.
Each aspect has challenges, and you need to address them when you work with your data.
Let’s look at each one individually.
For the hoard!
If you have ever walked into a hoarder's house, nothing prepares you for the sheer amount of "stuff" shoved into every crevice of the home.
Newspapers and magazines may be stacked against the walls. Some doors may not open because of the weight of unknown objects piled behind them. The floors may be covered in debris, making navigation almost impossible. You have no idea how many dumpsters it would take to clear the place out.
You feel overwhelmed.
This pile of precious things that nobody ever threw away is just like the first "V" of Big Data - Volume.
Volume is a problem when dealing with data. AI needs a huge quantity of it, and it can come from all sorts of different locations. Think about all the data you generate every day on your phone, your laptop, and your desktop computer.
Large organizations may need terabytes or even petabytes of data for AI, and their data will continue to grow as their customers and clients grow. This leads to data storage issues: storage becomes complicated, and the data may grow too fast for the company to handle effectively.
Data retrieval becomes unwieldy once a source grows too large, and querying vast amounts of data is troublesome. Another problem is that the data may not sit at headquarters at all; it may be distributed throughout the organization. Managing those collections gets increasingly challenging when many locations need access.
Once you have reached each location, the data still has to be integrated, which can be a chore.
Volume is a harbinger of the next horseman - Variety.
Your own computer desktop is a window to the soul
If your computer desktop is anything like mine, it has random files in all sorts of different formats.
There's a funny cat video. There's a PDF of instructions, a .JPG of a meme, an Excel file with our budget, and a few AVI files. Every time I get a new program, it seems to bring its own proprietary file format. Sometimes I have to look up a file extension because I don't even know what format it is.
This smorgasbord of files may not bother us, but dealing with different types of data in an AI project can generate some real problems.
This hodgepodge of file formats is the next horseman of Big Data - Variety.
Most data in an organization is unstructured. This includes email, documents (PDFs, Word files, etc.), images (JPGs, PNGs), video, messages, and voicemail (anyone save these?).
About 10% is semi-structured data. This includes things like log files, JSON files, and API messages.
The last bit is structured data, which consists of Excel files, SQL databases, and tabular data, such as the information found in reservation systems. Tables play very nicely with data retrieval - that's why querying them is so fast.
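If it helps to picture the difference, here's a toy Python sketch (the office name, fields, and table are all invented for illustration) showing the same piece of information in all three shapes: unstructured text, semi-structured JSON, and a structured SQL row.

```python
import json
import sqlite3

# Unstructured: free text in an email or document - no schema at all.
unstructured = "Hi team, the Chicago office needs 40 extra licenses by Friday."

# Semi-structured: JSON has tags and nesting, but no fixed table schema.
semi_structured = json.dumps({
    "office": "Chicago",
    "request": {"item": "licenses", "quantity": 40, "due": "Friday"},
})

# Structured: rows and columns with fixed types - easy and fast to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (office TEXT, item TEXT, quantity INTEGER)")
conn.execute("INSERT INTO requests VALUES ('Chicago', 'licenses', 40)")
print(conn.execute("SELECT quantity FROM requests WHERE office = 'Chicago'").fetchone())
```

The point isn't the code - it's that the structured row is the only one you can query directly, which is exactly why the rest of your data takes so much more work.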
The challenge comes when the data you need lives in many places and has to be combined. The quality can vary, too - some videos may be 4K, while others are low-res.
When you have all of this variety, you may run into problems querying the data.
The number of data types also tends to grow with every new app that gets released. More types get created or added, and each one then needs to be integrated.
Once you have dealt with Variety, it's time for the next rampaging bringer of chaos - Veracity.
Trust but verify your… data?
As a project manager, one of my principles is "Trust but verify." I can't tell you how many times I was burned before I learned to ingrain that mantra into my work habits.
But what about verifying your data? You can't sit down and ask it questions, or send it a Teams message saying, "Hey, just following up on this…" Remember another old saying - "Garbage in, garbage out."
If your data is bad, you will get bad results, which brings us to the third horseman of Big Data - Veracity.
First off, Veracity is a weird word. I always think it's referring to the small dinosaurs in Jurassic Park. They are intelligent, cunning, and have a lot of pointy teeth. If your data is of low quality, it can bite you in the a$$ like one of those little menaces. You need to watch out for this sneaky bugger.
Veracity is about the quality of your data.
When you have a lot of data, you never know what state it's all in. When you're pressed for time on a project, you may not get the chance to check whether it's the best quality.
When you start combing through the data, it may seem like it came from a back alley somewhere - how do you know if you can trust it? You need to know the sources, but some may be unfamiliar.
If you are combining data from different sources, consistency issues could arise. The files from your office in Taiwan may not match those from your office in the United States.
Maybe the data is old - your team may be using version 3 of the spec, and another team is using version 2, which may have accuracy problems (which is why it was updated!).
Other problems may pop up, too - noise, bias, or anomalies can creep into the dataset, and you don't want those baked into your project.
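If you're wondering what "checking veracity" actually looks like in practice, here's a minimal pandas sketch (the columns, values, and spec versions are hypothetical) that flags a few of the issues above: missing values, duplicates, anomalies, and stale versions.

```python
import pandas as pd

# Hypothetical dataset combining records from two offices.
df = pd.DataFrame({
    "order_id":     [101, 102, 102, 103, 104],
    "office":       ["Taipei", "Chicago", "Chicago", "Chicago", "Taipei"],
    "amount":       [250.0, 99.0, 99.0, None, -5.0],  # None = missing, -5.0 = anomaly
    "spec_version": [3, 3, 3, 2, 3],                   # version 2 rows may be stale
})

# Missing values: how much of each column can we actually trust?
print(df.isna().mean())

# Duplicates: the same order recorded twice inflates every downstream number.
print(df.duplicated(subset="order_id").sum(), "duplicate order(s)")

# Anomalies: negative amounts almost certainly indicate bad data.
print(df[df["amount"] < 0])

# Consistency: rows still on an old spec version need review before merging.
print(df[df["spec_version"] < 3])
```

None of this is fancy, but even checks this basic catch the kind of problems that bite you later.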
After sweating out if your data was reliable, let’s look at the last horseman - Velocity.
Drink from the firehose
When I was a kid, if you got thirsty playing outside and didn't want to go in, you drank from the hose.
A wiseass friend would step on the hose when you drank from that plastic fountain of goodness. The water would blast you in the face when they took their foot off.
The same experience may await you when you deal with the fourth horseman of Big Data - Velocity.
Data is constantly changing and moving quickly from one place to another, and when it arrives in massive amounts, it's hard to process accurately.
Live data doesn't arrive at predictable rates, speeds, or volumes, so predicting how it will flow through your data pipelines is difficult. Even if you think you're on top of monitoring it, glitches and anomalies can always happen.
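To make that unpredictability concrete, here's a small, self-contained Python sketch (the event source and batch size are made up for illustration) that reads a bursty stream in micro-batches - one common way pipelines cope with data arriving at uneven rates.

```python
import random
import time
from collections import deque

def bursty_event_source(seconds=5):
    """Simulate events arriving at wildly uneven rates - quiet one moment, a burst the next."""
    end = time.time() + seconds
    while time.time() < end:
        for _ in range(random.choice([0, 1, 1, 2, 50])):  # occasional burst of 50
            yield {"ts": time.time(), "value": random.random()}
        time.sleep(0.2)

buffer = deque()
BATCH_SIZE = 25  # hypothetical micro-batch size

for event in bursty_event_source():
    buffer.append(event)
    if len(buffer) >= BATCH_SIZE:
        batch = [buffer.popleft() for _ in range(BATCH_SIZE)]
        print(f"processed batch of {len(batch)} events")  # a real pipeline would transform/load here

# Whatever trickles in after the last full batch still needs handling.
if buffer:
    print(f"processed final partial batch of {len(buffer)} events")
```

Real pipelines use streaming platforms for this, but the problem is the same: the arrival rate is out of your control, so the processing side has to absorb the bursts.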
Customers expect instant recommendations and predictions, and the demand only accelerates because everyone wants data instantaneously. The ever-increasing quantities of data are pushing the systems that handle it to the brink.
The problem is that few data scientists and engineers know how to handle, process, and analyze Big Data effectively. This is why these individuals make eye-popping salaries and are in significant demand.
Working with the heart-stopping pace of Big Data is daunting.
It takes experience, and finding the people who can help your project succeed is a constant search.
Final thoughts
Dealing with Big Data comes with serious challenges. I hope this article helps you see some of the critical aspects. These four concepts are key when dealing with lots of data.
If your team has data scientists and engineers who look stressed, they've got a lot to deal with at work!
Make sure to include them on your next coffee run.
They'll need all the caffeine to keep tabs on the four data horsemen.
AI-Driven Tools for PMs
Superhuman - The fastest email ever made.
Brilliant - A platform for learning about LLMs (think ChatGPT) and how to use them.
PDF Flex - Ask questions about a PDF's content, and get immediate detailed responses.
AI News PMs Can Use
The CEO of Zoom wants AI clones in meetings
AI in the Workplace: Slack Study Points at Rapid Growth and Mixed Emotions
Cool ChatGPT Prompt for PMs
Prompts to customize ChatGPT for various project methodologies
Create a set of ChatGPT prompts tailored for Scrum project management, focusing on sprint planning and backlog management.
Design a series of ChatGPT prompts for use in a Waterfall project management environment, emphasizing milestone tracking and documentation.
Generate a collection of ChatGPT prompts customized for Kanban project management, with a focus on workflow visualization and task prioritization.