
How Many "V's" in Big Data? The Characteristics that Define Big Data

Summary:  We’ve scoured the literature to bring you a complete listing of possible definitions of Big Data, with the goal of being able to determine what is a Big Data opportunity and what is not.  Our conclusion is that Volume, Variety, and Velocity still make the best definitions, but none of these stands on its own in separating Big Data from not-so-big data.  Understanding these characteristics will help you analyze whether an opportunity calls for a Big Data solution, but the key is to recognize that this is really about breakthrough changes in the technology of storing, retrieving, and analyzing data, and then finding the opportunities that can best take advantage of them.

 

What Business Users Want to Know

Conversations with business users invariably start with the question “what is Big Data?”  Implicit in the question is that if it can be defined, then they can understand where it currently exists, where the opportunities to be exploited may lie, and when and how they will need to deal with it.

Sounds simple enough, but as we observed in a prior posting, there are many different characteristics of Big Data on which data scientists agree, yet none that by itself can be used to say that this example is Big Data and that one is not.  Happily, almost everyone who has weighed in on this conversation has chosen descriptors that begin with “V”, hence the name of this article.  Most commonly you will hear Volume, Variety, and Velocity.  These may be the most common, but they are by no means the only descriptors that have been used.

You would think this would be settled by now, but a scan of the literature says otherwise.  In fact, we were able to find eight, count them, eight different characteristics claimed for Big Data.

Volume

Volume always seems to head each list.  There is general agreement that if volume is in the gigabytes it is probably not Big Data, but at the terabyte and petabyte level and beyond it may very well be.  Volume is a key contributor to the problem of why traditional relational database management systems (RDBMS, data warehouses as we know them today) fail to handle Big Data.  Underlying that failure are more complex issues of cost, reliability, long query times, and their inability to handle new sources of unstructured or semi-structured data like text.

Big companies are no strangers to Big Data.  As early as the 1980s, UPS began to capture and track data on package movements; it now handles 16.3 million packages per day, responds to 39.5 million tracking requests per day, and stores over 16 petabytes of data.[i]  Wal-Mart records more than 1 million customer transactions per hour, generating more than 2.5 petabytes of data.[ii]  And in one survey, 17% of companies report currently managing more than a petabyte of data, with an additional 22% reporting hundreds of terabytes.[iii]

So if close to 40% of companies report already managing hundreds of terabytes of data or more, what’s changed?  What’s changed is the desire to unleash the knowledge contained in transactional stores and external data sources through analysis, and when that happens the new NoSQL storage and retrieval architectures and tools become important.
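
To make the divide-and-conquer idea behind those new architectures concrete, here is a minimal sketch (not any platform’s actual API) of the map/shuffle/reduce pattern that systems such as Hadoop and Spark apply across whole clusters; the chunked transaction records and store IDs below are hypothetical.

```python
# Toy map/shuffle/reduce sketch of how platforms such as Hadoop or Spark
# aggregate very large transaction logs by splitting the work into
# independent chunks.  The chunked records and store IDs are hypothetical.
from collections import Counter
from functools import reduce

def map_chunk(records):
    """Map step: count transactions per store within one chunk of the log."""
    counts = Counter()
    for store_id, _amount in records:
        counts[store_id] += 1
    return counts

def merge_counts(a, b):
    """Reduce step: merge partial counts produced by independent workers."""
    a.update(b)
    return a

# Hypothetical partitions of a much larger transaction log.
chunks = [
    [("store_1", 19.99), ("store_2", 5.00)],
    [("store_1", 3.50), ("store_3", 42.00), ("store_2", 7.25)],
]

partials = [map_chunk(c) for c in chunks]          # run in parallel at scale
totals = reduce(merge_counts, partials, Counter())
print(totals)  # Counter({'store_1': 2, 'store_2': 2, 'store_3': 1})
```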

Variety

Different Types:  Variety describes different formats of data that do not lend themselves to storage in structured relational database systems.  These include a long list of data such as documents, emails, social media text messages, video, still images, audio, graphs, and the output from all types of machine-generated data from sensors, devices, RFID tags, machine logs, cell phone GPS signals, DNA analysis devices, and more.  This type of data is characterized as unstructured or semi-structured and has existed all along.  In fact it’s estimated by some studies to account for 90% or more of the data in organizations. 

Different Sources: Variety is also used to mean data from many different sources, both inside and outside of the company.  What’s changed is the realization that through analysis it can yield new and valuable insights not previously available.

There are two primary challenges here.  First, storing and retrieving these data types quickly and cost efficiently.  Second, during analysis, blending or aligning data types from different sources so that all types of data describing a single event can be extracted and analyzed together.
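
To make that second challenge concrete, here is a minimal blending sketch, under assumed field names and a made-up JSON event layout, that aligns structured transaction rows with semi-structured clickstream events on a shared customer key so both can be analyzed together.

```python
# Minimal blending sketch: align structured transaction rows with
# semi-structured JSON clickstream events on a shared customer key so
# everything describing one interaction can be analyzed together.
# Field names and the JSON layout are illustrative assumptions.
import json
import pandas as pd

# Structured side: transaction rows as they might come from an RDBMS.
transactions = pd.DataFrame(
    {"order_id": [101, 102], "customer_id": ["c1", "c2"], "amount": [25.0, 80.0]}
)

# Semi-structured side: raw JSON clickstream events for the same customers.
raw_events = [
    '{"customer_id": "c1", "page": "/shoes", "dwell_seconds": 34}',
    '{"customer_id": "c2", "page": "/jackets", "dwell_seconds": 12}',
]
events = pd.DataFrame([json.loads(e) for e in raw_events])

# Join both sources on the customer key so behavior and spend sit side by side.
blended = transactions.merge(events, on="customer_id", how="left")
print(blended)
```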

Then there is the interaction of variety with volume.  Unstructured data is growing much more rapidly than structured data.  Gartner estimates that unstructured data doubles every three months and offers the example that there are seven million web pages added each day.

In terms of opportunity, Variety is seen by business users as the major focus of new Big Data initiatives.  Companies have been handling large volumes of data for many years and view that process as incremental and business as usual.  But the new and unique opportunity to add unstructured data to the analytic mix is seen by many as a game changer.[iv]

Velocity

Data-In-Motion:  Data scientists like to talk about data-at-rest and data-in-motion.  One meaning of Velocity describes data-in-motion, for example, the stream of readings taken from a sensor or the web log history of page visits and clicks by each visitor to a web site.  This can be thought of as a fire hose of incoming data that needs to be captured, stored, and analyzed.  Consistency and completeness of fast-moving streams of data are one concern.  Matching them to specific outcome events, a challenge raised under Variety, is another.  Velocity also incorporates the characteristic of timeliness, or latency: is the data being captured at a rate, or with a lag time, that makes it useful?
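
As a toy illustration of working with data-in-motion, the sketch below maintains a rolling aggregate over a stream of incoming sensor readings rather than waiting for the data to come to rest; the readings and the window size are invented for illustration.

```python
# Toy sketch of handling data-in-motion: keep a rolling aggregate over a
# stream of incoming sensor readings instead of waiting for data-at-rest.
# The readings and the five-reading window are illustrative assumptions.
from collections import deque

window = deque(maxlen=5)  # only the most recent readings are retained

def on_reading(value):
    """Called for every arriving reading; returns the current rolling mean."""
    window.append(value)
    return sum(window) / len(window)

for reading in [21.0, 21.4, 22.1, 35.0, 21.9, 21.7]:
    print(f"reading={reading:5.1f}  rolling_mean={on_reading(reading):.2f}")
```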

Lifetime of Data Utility: A second dimension of Velocity is how long the data will be valuable.  Is it permanently valuable, or does it rapidly age and lose its meaning and importance?  Understanding this dimension of Velocity in the data you choose to store will be important in discarding data that is no longer meaningful and may in fact mislead.
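
One simple way to act on this dimension is a retention window that expires stale records; the sketch below assumes a 30-day cutoff and made-up records purely for illustration.

```python
# Sketch of enforcing a lifetime of utility: drop records whose age exceeds
# a retention window so stale data can no longer mislead an analysis.
# The 30-day cutoff and record shape are illustrative assumptions.
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)
as_of = datetime(2014, 3, 5)

records = [
    {"event": "price_check", "ts": datetime(2014, 1, 2)},   # stale
    {"event": "price_check", "ts": datetime(2014, 3, 1)},   # still useful
]

fresh = [r for r in records if as_of - r["ts"] <= MAX_AGE]
print(fresh)  # only the 2014-03-01 record survives
```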

Real Time Big Data Analytics: The third dimension of Velocity is the speed with which data must be stored and retrieved.  This is one of the major determinants of the NoSQL storage, retrieval, analysis, and deployment architecture that companies must work through today.  When you visit a sophisticated content web site such as Yahoo or the Huffington Post, the ads that pop up have been selected specifically for you based on the capture, storage, and analysis of your current web visit, your prior web site visits, and a mash-up of external data stored in a NoSQL platform like Hadoop and added to the analytics.  When you sign on to Amazon or Netflix and see recommended purchases or views just for you, the same process has taken place.  The architecture of capture, analysis, and deployment must support real-time turnaround (in this case fractions of a second) and must do this consistently over thousands of new visitors each minute.  Real Time Big Data Analytics (RTBDA) is one of the main frontiers of development in Big Data today.
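
The sketch below illustrates, in miniature, the serving side of that pattern: the expensive analysis happens ahead of time, and the real-time path is reduced to a fast lookup.  The plain Python dictionary stands in for whatever fast store a production system would use, and all visitor and ad identifiers are hypothetical.

```python
# Minimal sketch of the real-time serving pattern described above: the heavy
# analysis runs ahead of time and writes per-visitor choices into a fast
# key-value store, so the page request itself needs only a quick lookup.
# The dict below stands in for that store; visitor and ad IDs are made up.
import time

# Output of an offline/batch scoring job: visitor -> pre-selected ad.
precomputed_ads = {
    "visitor_123": "ad_running_shoes",
    "visitor_456": "ad_rain_jackets",
}

def select_ad(visitor_id):
    """Real-time path: constant-time lookup with a safe default on a miss."""
    return precomputed_ads.get(visitor_id, "ad_generic_brand")

start = time.perf_counter()
ad = select_ad("visitor_123")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"served {ad} in {elapsed_ms:.3f} ms")
```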

What’s changed?  The data was always there but the ability to capture, analyze, and act on it in (near) real time is indeed a brand new feature of Big Data technology.

Value

Although Value is frequently shown as the fourth leg of the Big Data stool, Value does not differentiate Big Data from not so big data.  It is equally true of both big and little data that if we are making the effort to store and analyze it then it must be perceived to have value.

Big Data, however, is perceived as having incremental value to the organization, and many users report having found actionable relationships in Big Data stores that they could not find in small stores.  Certainly it is true that if in the past we were storing data about groups of customers and are now storing data about each customer individually, then the granularity of our findings is much finer and we approach that desired end goal of offering each customer a personalization-of-one in their experience with us.

Another take on Value is that Big Data tends to have low value density, meaning that you have to store a lot of it to extract findings.[v]  This is likely true, but since new Big Data storage and retrieval technologies are so much less expensive than their predecessors, low value density should not be a hurdle that prevents us from searching for those valuable kernels.

Finally, there is at least one reviewer who goes to philosophical extremes, quoting Sartre’s “existence precedes essence”, by which he means that we may choose to store Big Data before even understanding exactly what use we have for it.[vi]  We’re not entirely sure about this.  We still encourage business users to work backwards from the desired outcome before deciding exactly what Big Data to capture.

There are at least four additional characteristics that pop up in the literature from time to time.  All of these share the same definitional problem as Value: they may describe data in general, but not Big Data uniquely.

Veracity:  What is the provenance of the data?  Does it come from a reliable source?  Is it accurate and, by extension, complete?

Variability: There are several potential meanings for Variability.  Is the data consistent in terms of availability or interval of reporting?  Does it accurately portray the event reported?  When data contains many extreme values, it presents the statistical problem of deciding what to do with these ‘outlier’ values and whether they contain a new and important signal or are just noise (a simple screening sketch follows this list).

Viscosity:  This term is sometimes used to describe the latency or lag time in the data relative to the event being described.  We found that this is just as easily understood as an element of Velocity.

Virality:  Defined by some users as the rate at which the data spreads; how often it is picked up and repeated by other users or events.
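
As promised under Variability, here is a simple screening sketch for extreme values; the 1.5 × IQR rule and the sample numbers are illustrative choices of ours, not a prescription.

```python
# Illustrative outlier screen: flag extreme values with the common
# 1.5 * IQR rule so they can be reviewed as signal or noise rather than
# silently discarded.  The sample values are made up.
import statistics

values = [10, 12, 11, 13, 12, 11, 95, 10, 12, 11]

q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < low or v > high]
print(outliers)  # [95] is flagged for review
```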

I’ve been working with the US Department of Commerce National Institute of Standards and Technology (NIST) working group developing a standardized "Big Data Roadmap" since the summer of 2013.  Reaching a common definition of Big Data was one of the first tasks we tackled.  That grand qualifier from our college philosophy classes, ‘is the characteristic BOTH necessary and sufficient?’, turns out to be extremely useful.

In fact, we elected to stick with Volume, Variety, and Velocity and kicked the last five out of the Big Data definition as broadly applicable to all types of data.  Unfortunately, as you may know if you’ve grappled with explaining this yourself, Volume, Variety, and Velocity don’t quite pass the necessary-and-sufficient test, since not all Big Data opportunities demonstrate all three characteristics.  One suggestion was to call it Big Data if it met two out of three, but even that didn’t completely pass muster.

Variety comes close when speaking narrowly of unstructured data, since storage and retrieval techniques for these data types have really been revolutionized by new NoSQL tools and techniques, including blending these with traditional structured data.  Likewise, Velocity comes close when talking about Real Time Big Data Analytics, for the same reason.

We argued in a previous post that Big Data is not so much about the data itself as it is about a whole new NoSQL / NewSQL technology.  Big Data is about this new set of tools and techniques in search of appropriate problems to solve.  Each business application may be different, and it is becoming apparent that real solutions in real companies are frequently hybrids of NoSQL and traditional RDBMS and analytic tools.  These definitions may help narrow down opportunities at a high level, but before proceeding, each opportunity needs to be carefully analyzed for realistic business value and realistic technology applications.

About the author:  Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]

The original blog can be viewed at:

http://data-magnum.com/how-many-vs-in-big-data-the-characteristics-...



[i] Big Data in Big Companies, Thomas H. Davenport and Jill Dyche, SAS Institute Inc. May 2013

[ii] Open Data Center Alliance: Big Data Consumer Guide

[iii] 2013 Big Data Opportunities Survey, By Joseph McKendrick, Research Analyst, Produced by Unisphere Research, a Division of Information Today, Inc., May 2013, Sponsored by SAP

[iv] Big Data in Big Companies, Thomas H. Davenport and Jill Dyche, SAS Institute Inc. May 2013

[v] Big Data: Business Opportunities, Requirements and Oracle’s Approach, Richard Winter, December 2011

[vi] Big Data, George O. Strawn, NITRD.gov, NITRO Big Data.pdf April 2013

 

 

 
