Summary: We searched the web to find all of the most common myths and misconceptions about Big Data. There were a lot more than we thought. Here’s what we found. Part 1.
The web is just chockablock with good articles that claim to dispel the most common misconceptions or myths about Big Data. Since even today a large percentage of companies, particularly SMBs, still haven't embraced Big Data or advanced analytics, I thought it would be interesting to try to catalogue these misconceptions.
It turned out not to be that easy. Reviewing just a dozen or so of these articles yielded 68 myths to be debunked! Wow. Yes, there was some duplication (6 votes for ‘It’s a Fad’, 5 votes for ‘It’s the Solution to All Our Problems’, and so forth). But many were unique or had unique perspectives on common misunderstandings.
With a little artful editing I’ve condensed these into 14 major categories (myths or misconceptions). Since there are so many we’ll divide this into two articles. Watch for Part 2.
(Just Hype. Just a Fad. Just Doing More of What You’re Already Doing Today. Value is Always 5 Years Away.)
If it’s a fad, it’s a pretty durable one: it’s been around for over 15 years and has been heavily commercialized for at least the last 10. There are dozens of public companies that have built multi-billion-dollar valuations by capturing the newly available value of Big Data. If your internal conversations about adopting Big Data focus on making the reports run faster or meeting existing SLAs with even more data, then you are missing the point. Big Data is about releasing the value in semi-structured and unstructured data and data-in-motion, specifically with an eye to predicting future actions (predictive analytics), not making your EDW slightly faster (descriptive analytics). If you don’t adopt, you will be left behind.
(Big Data is Good. Big Data is Magic. Big Data Always Leads to Big Changes. Big Data Offers Concrete and Precise Solutions.)
There’s a difference between having a lot of data and having a lot of good and meaningful data. Understanding what is good and valuable is a critical step. None of this happens with the wave of a magic wand. You’ll need people with the skills and experience to make use of these new insights and a culture that is data-driven and willing to accept these new findings. Without executive support and active programs that support data-driven initiatives the value of Big Data drops to near zero and change will not occur.
As for concrete, precise solutions, that will depend on the skills of your data scientists, analysts, and managers. Even so, many of the analytic techniques that rely on Big Data offer only directional guidance. Guidance from predictive models, for example, can under some circumstances be highly accurate in guiding actions, with accuracy in the range of, say, 70% to 90%. Other techniques like recommenders or sentiment analysis are only intended to give you advance notice that a trend is developing that you can investigate and exploit.
Big Data can be messy, noisy, and inconsistent. Taking the time to extract the value is not trivial. But without it you won’t stand a chance of discovering the insights it reveals.
(It’s all about size!)
I hope that, as a reader of data science material, you can immediately spot the giant error in this statement. It’s unfortunate that the whole Big Data movement ever got that name, since it leads the uninitiated to focus on ‘big’. To be ‘big’ in Big Data you need many petabytes of data, and very few of us will ever accumulate or analyze data sets of that size. The real magic of Big Data is in its other two characteristics: 1) the ability to store and retrieve unstructured and semi-structured data that our legacy RDBMS systems didn’t like at all, and 2) the wholly new arena of data in motion (velocity) that’s at the heart of IoT and streaming analytics.
(More Data Equals Better Decisions. More Data Eliminates Uncertainty.)
Truly a myth. In fact, as the amount of data increases, the difficulty of correctly evaluating it grows right alongside it. More data can mean an increased likelihood of drawing an inaccurate conclusion. We are looking for a smart or statistically significant amount of data, which may be much smaller than everything we are able to collect.
Even when a trend or correlation is correctly identified, especially where human behavior such as buying is involved, it will never be completely accurate. Systematic risk from information and analysis is never fully eliminated. The amount of data you collect does not protect you; only your skill in analyzing the data does.
(The More Data the Better. Data is Inherently Valuable. We Must Capture Everything in Order to Analyze It. There’s a Better Result from Simple Analytics on Big Data than from More Sophisticated Analytics on Small Data.)
There’s a lot of chest thumping by the vendors of Big Data platforms that can ingest and perform advanced analytics on huge data sets. In practice, however, being efficient with our data science time generally dictates that we start with small data (a subset of the whole) and conduct our analysis there. Only when we’ve decided on the most successful methods of performing that analysis are we inclined to try to incrementally improve results by using larger and larger samples. The returns on these incrementally larger samples often don’t warrant the time invested.
There are certainly exceptions to this observation, especially when real-time Big Data is streaming into dashboards, visualizations, or real-time scoring. However, the first step of establishing the most desirable analytic approach can almost always be built on a small subset of a very large data set.
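To make the small-data-first point concrete, here is a minimal Python sketch (not from the article; all names and numbers are illustrative) showing how a modest sample already recovers an estimate that a 10x-larger sample barely improves:

```python
import random

# Hypothetical illustration: prototype the analysis on a small sample first,
# and scale up only if the larger sample meaningfully changes the answer.
random.seed(42)
# Stand-in for a large data set: one million values with a true mean of 100.
full_data = [random.gauss(100, 15) for _ in range(1_000_000)]

def estimate_mean(sample):
    """The 'analysis' in this toy example is just estimating the mean."""
    return sum(sample) / len(sample)

small = random.sample(full_data, 10_000)    # 1% sample
large = random.sample(full_data, 100_000)   # 10% sample

# Both estimates land very close to the true mean of 100; the 10x-larger
# sample buys only a marginal reduction in error for 10x the compute.
print(estimate_mean(small), estimate_mean(large))
```

The same workflow applies when the "analysis" is model selection rather than a simple estimate: choose the approach on the subset, then rerun on more data only if the incremental accuracy justifies the cost.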
Pete Eppele, SVP, Products and Science, at Zilliant says, “From my perspective, the biggest misconception is that there’s more benefit and better ROI available from simple [descriptive] analytics on big data than from more sophisticated, predictive analytics on little(r) data.”
“Managers often think that the next wave of benefit from data-driven decision making comes from amassing and analyzing a huge volume of data from a variety of different internal and external sources. They then start down the path of applying simple [descriptive] analytics to the data set to see what sales or marketing insights they can glean. Simple analytics, particularly when they are backward looking in nature and don’t have the ability to separate true signal from noise, can be confusing and, worse yet, misleading.”
Data has no inherent value without analysis and converting that analysis to action within the company.
(Big Data is Made for Big Businesses. Big Data Doesn’t Apply to Me.)
In the beginning of Big Data there was a sense that only the largest companies produced enough data to warrant adoption of this tech. The flaw in this thinking is wrapped up in the word ‘big’ since the real technological advantage is much more about the variety of data that can be processed (structured, unstructured text, voice, images) and about data in motion (velocity).
Companies of all sizes have recognized the advantages of Big Data tech in improving company decision making and even for the creation of wholly new data products from data that was previously thought to have little or no value. The competitive superiority of Big Data adopters has proven a powerful motivator for companies of all sizes.
(Big Data is Complicated. Big Data is Expensive to Implement and Maintain. Big Data is Too Hard to Use. You Can’t Find People With the Necessary Skills.)
Like many technologies, when Big Data was first commercialized around 2006 the early adopters tended to be larger companies. Resources were scarce, as were the skilled individuals to implement and maintain the new tech. Today, however, Big Data is available with cloud deployment and with greatly enhanced management tools, and its cost and complexity are greatly reduced.
Acquiring a Hadoop implementation from one of the big three distributors is mostly a low-five-figure investment and significantly lower if you go with the cloud from the likes of Amazon, Microsoft, or Google. This route eliminates almost all the cost of physical servers.
A hybrid on-prem/cloud strategy that’s very popular today is to put DEV, TEST, and backup in the cloud while leaving sensitive production systems on-prem.
While there is a learning curve with NoSQL databases, the fact that they are late-schemaed (schema on read) greatly reduces or outright eliminates most DBA requirements and makes changes to applications much easier and more rapid.
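A minimal Python sketch of what "schema on read" means in practice (the record fields and store here are invented for illustration): records are kept as raw JSON with no fixed schema, and structure is imposed only when a particular analysis reads them.

```python
import json

# Hypothetical raw store: each record is a JSON string. New or missing
# fields require no ALTER TABLE and no DBA intervention.
raw_store = [
    '{"user": "ann", "event": "click", "page": "/home"}',
    '{"user": "bob", "event": "purchase", "amount": 42.5}',  # extra field: fine
    '{"user": "cal", "event": "click"}',                     # missing field: also fine
]

def query_clicks(store):
    """Apply a schema at read time: pull only the fields this analysis needs."""
    for line in store:
        record = json.loads(line)
        if record.get("event") == "click":
            yield record["user"], record.get("page", "(unknown)")

print(list(query_clicks(raw_store)))  # [('ann', '/home'), ('cal', '(unknown)')]
```

Contrast this with schema on write, where every record must match a predeclared table structure before it can be stored at all; here the purchase record with its extra `amount` field coexists with clicks untouched until some query cares about it.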
Be sure to watch for Part 2 where we’ll debunk these misconceptions:
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: