When I first heard the term Big Data a few years ago, I didn’t think much of it. Soon after, Big Data started appearing in many conversations with my tech friends. So I started asking a very simple question: ‘What is Big Data?’ I kept asking that question of various folks, and no two people gave me the same answer. ‘Oh, it’s a lot of data.’ ‘It’s the variety of data.’ ‘It’s how fast the data is piling up.’ Really? I thought to myself, but was afraid to ask more questions. As none of it made much sense to me, I decided to dig into it myself. Obviously, my first stop was Google.
When I typed ‘Big Data’ into the search box at the time, this is what showed up.
Ah, it all made sense right away. None of the people I was talking to really knew much about Big Data; they were talking about it anyway because everyone else was talking about it.
So What Really Is Big Data?
I turned to my trusted old friend Wikipedia and it said:
Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
So Wikipedia’s definition focuses on the ‘volume of data’ and the ‘complexity of processing that data’. A good start, but it doesn’t answer what volume threshold makes data Big Data. Is it 100 GB? A petabyte? And what are the ‘on-hand database management tools’? Are they referring to relational database systems from vendors like Oracle and IBM? Most likely, but the definition doesn’t say.
Then I turned to O’Reilly Media, which everyone told me popularized the term. According to O’Reilly Media:
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
To some extent, the Wikipedia and O’Reilly definitions are similar in that both refer to ‘processing capacity’ and ‘conventional database systems’, but O’Reilly adds a new twist by mentioning ‘too big’ and ‘moves too fast’. Hmm, animals like elephants and cheetahs started running through my head.
My next stop was Doug Laney of Gartner, who is credited with the 3 ‘V’s of Big Data. Gartner defines Big Data as:
High-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
Gartner is referring to the size of the data (volume), the speed at which the data is being generated (velocity), and the different types of data (variety), and this seemed to align with the combined definitions of Wikipedia and O’Reilly Media.
I thought I was getting somewhere.
Not so fast, said Mike Gualtieri of Forrester, who argued that Gartner’s 3 ‘V’s are just measures of data and insisted that Forrester’s definition is more actionable:
Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.
Let us try to digest this together. Forrester seems to be saying that any data beyond a firm’s current reach (i.e., its frontier) to store (large volumes of data), process (needing innovative processing), and access (new ways of getting at that data) is Big Data. So the questions are: What is the ‘frontier’? Who defines it?
I kept searching for those answers. I looked at McKinsey’s definition:
Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
Well, similar to all of the above, but still not specific enough for me to decide when data becomes Big Data.
Then I came across an article from the University of Wisconsin that gave some specificity to ‘Volume’. The article said ‘some have defined big data as an amount that exceeds a petabyte – one million gigabytes.’ Thank you, Wisconsin, for clearing that up. But the same article mentioned the numerous ‘V’s being added to the 3 ‘V’s that Gartner originally came up with.
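That petabyte figure is just a unit conversion, and it is easy to check for yourself. A quick sketch using decimal (SI) units, where a gigabyte is 10^9 bytes and a petabyte is 10^15 bytes:

```python
# Decimal (SI) units: 1 GB = 10^9 bytes, 1 PB = 10^15 bytes
GIGABYTE = 10**9
PETABYTE = 10**15

# How many gigabytes fit in one petabyte?
print(PETABYTE // GIGABYTE)  # 1000000 -- i.e., one million gigabytes
```

(If you use binary units instead – 1 PiB = 2^50 bytes, 1 GiB = 2^30 bytes – the ratio is still 2^20, roughly a million.)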
IBM added ‘Veracity’, referring to the quality and trustworthiness of data. See the picture below with the 4 ‘V’s.
And other ‘V’s kept getting added to the Big Data definition by other people…
Variability refers to the changing nature of data.
Visualization refers to the art of turning data into visual stories – graphs and charts that transform data into information, information into insight, and insight into knowledge.
Value refers to the fact that businesses need to turn all this data into valuable decisions.
So what did I learn?
Even though there is no single, universally accepted definition of Big Data, there are some common concepts that almost everyone seems to converge on:
- Big Data is large in volume (by some definitions, more than a petabyte).
- Big Data is not of a single type; it is a mix of structured, semi-structured, and unstructured data.
- Big Data is generated at a much faster rate than data in the past, from all kinds of sources, including social media.
- Big Data requires newer ways to store, process, analyze, visualize, and integrate.
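For fun, the converged criteria above can be sketched as a toy checklist. This is purely illustrative: only the ‘more than a petabyte’ volume figure comes from the definitions discussed; the variety and velocity thresholds here are my own assumptions.

```python
PETABYTE = 10**15  # bytes, decimal (SI) units

def looks_like_big_data(size_bytes, data_types, arrival_rate_gb_per_day):
    """Return which of the converged criteria a hypothetical dataset meets.

    Thresholds are illustrative assumptions, not standards:
    only the > 1 PB volume figure appears in the definitions above.
    """
    criteria = []
    if size_bytes > PETABYTE:           # volume: more than a petabyte
        criteria.append("volume")
    if len(set(data_types)) > 1:        # variety: mixed structured/unstructured
        criteria.append("variety")
    if arrival_rate_gb_per_day > 1000:  # velocity: assumed "fast" threshold
        criteria.append("velocity")
    return criteria

# A 2 PB mixed-type dataset arriving at 5 TB/day meets all three 'V's
print(looks_like_big_data(2 * PETABYTE, ["structured", "unstructured"], 5000))
```

Of course, real firms draw these lines (Forrester’s ‘frontier’) based on what their own tools can handle, not fixed numbers.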
Hope this article was helpful. If it was, feel free to ‘Follow’ me or connect on LinkedIn, or follow me on Twitter @rkdontha1.
Sources (please click the links for the articles):
McKinsey definition and University of Wisconsin article