Summary: Continuing from our last article, we searched the web to find the most common myths and misconceptions about Big Data. There were a lot more than we thought. Here’s what we found. Part 2.
If you caught Part 1 of this article, you know that we set out to catalogue all the common misconceptions and myths about Big Data. In all we identified 68! After eliminating some overlap, while trying to retain the nuances of different ways to explain the same myth, we were able to group these into 14 major categories (myths or misconceptions).
In Part 1 we covered:
Here in Part 2 we’ll share seven more interesting and misleading myths and misconceptions.
(Tackling Big Data means throwing out everything and starting new. Can’t I just analyze Big Data with my existing software tools? Big data will eliminate data integration. Big Data analytics platforms will replace the data warehouse. Data lakes will replace the data warehouse. No point using a data warehouse for advanced analytics.)
Some of the resistance to Big Data is a lack of understanding about how this technology can be integrated into the existing technology stack. No one is suggesting that you throw out your hard-earned and well-designed data warehouse. Big Data technology and architectures play well with existing RDBMS structures and are supplemental.
NoSQL DBs are different from RDBMS and have substantial advantages for the purposes for which they were designed. This includes rapid ingest of whole new types of data without having to define a schema in advance and making that data available for easy investigation increasingly without the direct involvement of IT staff.
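To make the "no schema in advance" point concrete, here is a minimal sketch in plain Python. It is not the API of any particular NoSQL product; the event records and the `ingest` and `find` helpers are hypothetical names invented for illustration. It simply shows records of different shapes being stored as-is and interpreted only when queried.

```python
import json

# Hypothetical raw events with different shapes (illustrative data only).
raw_events = [
    '{"user": "a17", "action": "click", "page": "/pricing"}',
    '{"user": "b42", "action": "purchase", "amount": 19.99, "currency": "USD"}',
    '{"device": "sensor-3", "temperature_c": 21.5}',
]


def ingest(lines):
    """Store every record as-is; no schema is declared up front."""
    return [json.loads(line) for line in lines]


def find(store, **criteria):
    """Interpret fields only at query time; absent fields simply do not match."""
    return [
        doc for doc in store
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


store = ingest(raw_events)
purchases = find(store, action="purchase")
sensor_readings = find(store, device="sensor-3")
```

A new kind of record can be added to `raw_events` without touching `ingest`, which is roughly the flexibility being described. A real document store adds indexing, distribution, and durability on top of this idea.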
A Data Lake is conceived as a temporary repository of unstructured data (similar in concept to a data mart) where analysts can explore the new data, combine it with existing structured data, and see what value can be extracted. Data Lakes are not meant to be permanent repositories. Once data has been identified as valuable, more formal procedures for its capture, storage, and retrieval (typical SLAs) need to be created, albeit for storage in a NoSQL DB like Hadoop, not in your EDW or any other RDBMS.
These NoSQL stores are supplemental to your EDW, although some companies find they can reduce the load on their EDWs by offloading data to NoSQL DBs.
At the risk of complicating this explanation, there is also the whole new arena of NewSQL DBs, which are essentially RDBMS databases with the advantages of NoSQL, particularly MPP and the use of commodity hardware. There are now offerings, especially among cloud-based NewSQL DBs, that can in fact replicate existing EDWs at much lower cost in both hardware and maintenance manpower.
Can you make your existing RDBMS or EDW do what NoSQL can do? Can you just continue to use your existing tools? Not by any rational standard.
And now that you have different data sources, at least EDW and a NoSQL store plus external data sources, the concept of data blending becomes much more important. Fortunately there is a whole subset of the Big Data industry devoted to making blending faster, easier, more accurate, and more repeatable than ever before.
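As a rough illustration of what blending means in practice, the sketch below joins structured rows, of the kind an EDW query might return, with semi-structured documents from a NoSQL store. All of the data, the field names, and the `blend` function are hypothetical, and commercial blending tools do far more (type reconciliation, fuzzy matching, lineage tracking) than this toy example.

```python
# Hypothetical structured rows, as they might come back from an EDW query.
customers = [
    {"customer_id": 1, "name": "Acme", "segment": "enterprise"},
    {"customer_id": 2, "name": "Birch", "segment": "smb"},
]

# Hypothetical semi-structured documents; not every one carries the same fields.
tickets = [
    {"customer_id": 1, "text": "Invoice was wrong", "sentiment": -0.5},
    {"customer_id": 1, "text": "Still not fixed", "sentiment": -1.0},
    {"customer_id": 2, "text": "How do I export data?"},
]


def blend(customers, tickets):
    """Attach per-customer ticket counts and average sentiment to each row."""
    tickets_by_customer = {}
    for ticket in tickets:
        tickets_by_customer.setdefault(ticket["customer_id"], []).append(ticket)

    blended = []
    for customer in customers:
        docs = tickets_by_customer.get(customer["customer_id"], [])
        scores = [doc["sentiment"] for doc in docs if "sentiment" in doc]
        blended.append({
            **customer,
            "ticket_count": len(docs),
            # None when no document for this customer carries a sentiment score.
            "avg_sentiment": sum(scores) / len(scores) if scores else None,
        })
    return blended


blended = blend(customers, tickets)
```

Note how the missing `sentiment` field on one document is tolerated rather than treated as an error. Handling that kind of irregularity repeatably is a large part of what makes blending harder than an ordinary table join.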
(There’s a one-size fits all solution for Big Data. We need to jump into Big Data with both feet because everyone’s ahead of us. There’s just one path to value from Big Data.)
Not even close. As with all technologies, companies need to start with a business strategy that answers the question: what is most important to know, and how will we use it once we know it? From this you develop a unique set of Big Data and little data technology requirements, identify the gaps against what you have today, and build an implementation plan that is strategically and financially rational.
At one level all NoSQL Big Data stores are very similar. But once you start looking at the details you’ll find a significant amount of variation in capabilities and tools that will impact your decisions. There are actually quite a number of alternate architectures depending on the size of the data, how the unstructured and semi-structured data is to be stored and retrieved (typically JSON or XML), and whether your architecture needs to support streaming data (increasingly true) or simply batch.
There are a great many companies looking at Big Data and advanced analytics. By some counts as many as 75%. But actual implementation is heaviest in the Global 1000 group and still surprisingly light in the SMB market, perhaps only 15% to 20% actual adoption.
The message is there is still time to gain competitive advantage in your marketplace. But you should definitely take an informed look at your direct competitors and see if they are among the early adopters.
(Big Data is Only for the IT Department. Big data is just for the folks in IT. They’re the ones who get it first, and the only ones who know how to deal with it, right? Big data is just technology.)
You know who is most likely to be clamoring for Big Data? Not IT. Most likely it’s sales, marketing, pricing, logistics, and production forecasting. All areas that tend to reap outsize rewards from better forward views of the business.
Big Data is the new Enterprise IT and needs support and visibility from top executive leadership. Paul Sonderegger, Chief Data Strategist at Oracle, says, “Many senior executives labor under the misconception that big data is a project or a new silo in enterprise IT. In fact, big data is the new enterprise IT. The phenomenon behind big data is the datafication of everything – the capture and use of more data in more daily activities. Every activity in personal, private, and public life is being digitized and datafied. Every aspect of enterprise computing has to expand to keep pace. You can’t install a Hadoop cluster and expect to get big returns any more than you could add 100 pounds of muscle to a high-schooler and expect to get a Division I linebacker.”
Big Data is not just a technology issue, it’s a business issue.
(You absolutely, positively need Hadoop. Big data is about Hadoop.)
Just a few years back this may have been a myth but no longer. Hadoop has won. Big Data is almost entirely about storage on Hadoop. During the early adolescence of NoSQL from about 2006 to 2012 there were a number of proprietary NoSQL formats that worked in a similar manner to Hadoop and were promoted by the majors such as Microsoft and Oracle, and even some successful independents like MongoDB. Ultimately however the power of having Hadoop as an open source Apache project has won out. All the previously non-conforming versions of NoSQL have moved rapidly to become conformant or compatible with Hadoop. This is a good thing.
(You absolutely, positively need data scientists. Big data is reserved for data scientists. Data Scientists drive Big Data.)
This statement requires a more nuanced answer. First, Big Data is a catch phrase that ought to be ‘Big Data and Advanced Analytics’ because it is the advanced analytics that extract the value from Big Data that become operational changes and competitive advantage. So you need someone to perform the advanced analytics.
We most frequently associate the title ‘data scientist’ with the person who performs advanced analytics. The challenge is that there is no standardized agreement about the skills a person must have to carry that title or even to differentiate among senior and junior practitioners. The important thing to keep in mind here is that when we speak of advanced analytics we are talking about “predictive analytics” (what will or should happen in the future), not “descriptive analytics” (the standard fare for analysts producing reports from your EDW that reflect past sales, production, profits, etc.).
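The difference between the two kinds of analytics can be shown with a toy example. The sales figures and function names below are hypothetical, and the straight-line fit is only the simplest possible stand-in for the far richer techniques real predictive modeling uses.

```python
# Hypothetical quarterly sales figures, oldest first (illustrative only).
sales = [100.0, 110.0, 120.0, 130.0]


def describe(history):
    """Descriptive analytics: summarize what has already happened."""
    return {"total": sum(history), "mean": sum(history) / len(history)}


def forecast_next(history):
    """Predictive analytics: fit a least-squares line and extrapolate one period."""
    n = len(history)
    periods = range(n)
    x_mean = sum(periods) / n
    y_mean = sum(history) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(periods, history))
    sxx = sum((x - x_mean) ** 2 for x in periods)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # The next period has index n (periods are numbered 0 .. n-1).
    return intercept + slope * n


summary = describe(sales)
prediction = forecast_next(sales)
```

`describe` only reports on the past, which is what a standard EDW report does. `forecast_next` makes a claim about a period that has not happened yet, and that claim can turn out to be wrong, which is why validating and monitoring models is such a large part of the data scientist's job.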
Data science is the body of knowledge that teaches us how to conduct advanced predictive analytics. The techniques range from predictive modeling and forecasting, to recommenders, natural language processing and IoT, to deep learning including text, speech, and image recognition, including all the elements of AI now being developed.
You are not going to turn your existing data analysts loose on Big Data and expect them to perform these tasks. In this sense you do indeed need data scientists.
What level of skills you will need, and whether you must hire the hard-to-find and expensive data scientist or can to some extent grow your own, is a topic of intense conversation today. The toolset for advanced predictive analytics is also becoming increasingly automated, allowing less experienced data scientists to operate at the skill level of much more senior data scientists of just a few years ago. You are going to need data science skills, and it’s likely that person will be called a data scientist. How you find them or grow them is an open topic.
(Imperfect data quality must mean that Big Data is worthless. Big Data benefits are marred by “bad” data. There's so much data, little flaws don't matter.)
It is true that some unstructured and semi-structured data, and also streaming data can be noisy, messy, missing, or flawed.
To some extent the volume of data will reduce the impact of individual data flaws but there are more flaws now because there is more data. Gartner says, "Therefore, the overall impact of poor-quality data on the whole dataset remains the same. In addition, much of the data that organizations use in a big-data context comes from outside, or is of unknown structure and origin. This means that the likelihood of data quality issues is even higher than before. So data quality is actually more important in the world of big data."
First, there are analytic techniques to solve these problems, allowing the extraction of valuable insights from even less than perfect data. Second, once Big Data sources have been identified as important, companies should also apply the standard methods and procedures of data quality, management, and governance to correct these flaws. You do need to be aware of data quality levels and follow the core principles of data quality assurance. These are not reasons to abandon Big Data any more than similar issues are reasons to abandon your structured data.
Even before Big Data, social scientists and ethicists were raising questions about whether, in revealing so much about our detailed lives, we were giving away more privacy than might be good either personally or for society as a whole. Certainly the detail contained in Big Data may amplify these concerns.
In some industries, such as healthcare (under HIPAA) or financial services, there are real regulations that require both technological and process protections to ensure compliance. Outside of the realm of regulation you may have your corporate conscience to address.
Still the benefits of Big Data to our society have been enormous. Big Data has been the core technology underlying huge advancements in medical science, and time and dollar savings in travel (auto navigation), and even in shopping (trips and time avoided and better selection achieved via ecommerce). The list goes on. It’s fair to say the Genie isn’t going back in the bottle.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.