As I studied the subject, the following three terms stood out in relation to Big Data.
Variety, Velocity and Volume.
In marketing, the 4Ps define all of marketing using only four terms.
Product, Promotion, Place, and Price.
I claim that the 3Vs above define big data in much the same fashion.
These three properties describe the expansion of a data set along several fronts to the point where it merits being called big data: an expansion that is accelerating, generating ever more data of ever more types.
The plot above, using three axes, helps to visualize the concept.
The size of available data has been growing at an increasing rate. This applies to companies and to individuals. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes.
More sources of data are added on a continuous basis. For companies, in the old days, all data was generated internally by employees. Currently, the data is generated by employees, partners, and customers. For a growing group of companies, the data is also generated by machines. For example, hundreds of millions of smartphones send a variety of information to the network infrastructure. This data did not exist five years ago.
More sources of data with a larger size of data combine to increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
Petabyte data sets are common these days, and exabyte data sets are not far away.
Large Synoptic Survey Telescope (LSST).
“Over 30 thousand gigabytes (30TB) of images will be generated every night during the decade-long LSST sky survey.”
72 hours of video are uploaded to YouTube every minute.
There is a corollary to Parkinson’s law that states: “Data expands to fill the space available for storage.”
This is no longer true since the data being generated will soon exceed all available storage space.
Initially, companies analyzed data using a batch process. One takes a chunk of data, submits a job to the server and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With the new sources of data such as social and mobile applications, the batch process breaks down. The data is now streaming into the server in real time, in a continuous fashion and the result is only useful if the delay is very short.
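The contrast between the two modes can be sketched with a running average: the batch version waits for the whole chunk before computing anything, while the streaming version has an up-to-date answer after every record. This is a minimal illustrative sketch with simulated data, not any particular vendor's API:

```python
import random

def batch_mean(values):
    """Batch: submit the whole chunk, wait, get one result at the end."""
    return sum(values) / len(values)

class StreamingMean:
    """Streaming: update the result as each record arrives."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, x):
        self.count += 1
        self.total += x
        return self.total / self.count  # a fresh result after every record

random.seed(42)
data = [random.gauss(0, 1) for _ in range(10_000)]

sm = StreamingMean()
for x in data:
    latest = sm.update(x)

# Both approaches agree on the final answer; the streaming version
# simply never had to wait for the whole data set to arrive.
assert abs(latest - batch_mean(data)) < 1e-9
```

The streaming version also uses constant memory per record, which matters when the incoming rate never stops.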
140 million tweets per day on average (more in 2012).
I have not yet determined how data velocity may continue to increase, since real time is as fast as it gets. The delay for the results and analysis, however, will continue to shrink until it, too, reaches real time.
From Excel tables and databases, data has lost its structure and gained hundreds of formats: pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, and so on. One no longer has control over the input data format. Structure can no longer be imposed, as in the past, in order to keep control over the analysis. As new applications are introduced, new data formats come to life.
Google uses smartphones as sensors to determine traffic conditions.
In this application, it is most likely reading the speed and position of millions of cars to construct the traffic pattern and select the best routes for those asking for driving directions. This sort of data did not exist on a collective scale a few years ago.
The 3Vs together describe a set of data and a set of analysis conditions that clearly define the concept of big data.
So what is one to do about this?
So far, I have seen two approaches.
1- Divide and conquer using Hadoop
2- Brute force using an “appliance” such as the SAP HANA (High-Performance Analytic Appliance)
In the divide-and-conquer approach, the huge data set is broken down into smaller parts (HDFS) and processed in parallel (MapReduce) using thousands of servers.
As the volume of the data increases, more servers are added and the process runs in the same manner. Need a shorter delay for the result? Add more servers again. Given that, with the cloud, server power is practically unlimited, it is really just a matter of cost: how much is it worth to get the result in a shorter time?
One has to accept that not ALL data analysis can be done with Hadoop; other tools are still required.
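The split-then-combine idea behind HDFS and MapReduce can be sketched in miniature. This is a single-process Python sketch of a hypothetical word-count job; in Hadoop, each mapped part would run on a different server and the shuffle would move pairs across the network:

```python
from collections import defaultdict
from itertools import chain

# "HDFS" step: the data set arrives already split into smaller parts.
documents = ["big data is big", "data velocity and data volume"]

# Map step: each part is processed independently of the others,
# so in Hadoop this runs in parallel across thousands of servers.
def map_words(doc):
    return [(word, 1) for word in doc.split()]

mapped = [map_words(doc) for doc in documents]

# Shuffle + Reduce step: pairs that share a key are brought together
# and combined into one result per key.
counts = defaultdict(int)
for word, n in chain.from_iterable(mapped):
    counts[word] += n

print(dict(counts))
# → {'big': 2, 'data': 3, 'is': 1, 'velocity': 1, 'and': 1, 'volume': 1}
```

Adding more servers shortens the map step because the parts shrink; the reduce step stays the same shape regardless of scale.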
For the brute-force approach, a very powerful server with terabytes of memory is used to crunch the data as one unit. The data set is compressed in memory. For example, for a Twitter data flow that is pure text, the compression ratio may reach 100:1, so a 1TB IBM SAP HANA appliance can load a 100TB data set in memory and run analytics on it.
IBM has a 100TB unit for demonstration purposes.
Many other companies are filling the gap between these two approaches by releasing all sorts of applications that address different steps of the data-processing sequence, plus management and system configuration.
There is another V of Big Data that statisticians care about: Variability.
Big Data, because it can cover the full range of human (and machine) experience, almost always displays more variance than smaller datasets.
While the H&H boys (hardware & Hadoop) are focused on the 3Vs of Big Data processing, the Data Scientist tries to explain the Variability in Big Data. The problem is that many algorithms that are focused on variability do not scale to Big Data sizes, either because of software inefficiency or hardware limitations.
I like 4Vs much better than 3Vs. (It sounds better.)
Could you please give an example of Variability, something we can relate to? I found this definition at Investopedia, but an example would be a lot more helpful.
Also, is Variability increasing as time progresses? Or is it just that the treatment of Variability becomes hard with large sets of data? In other words, is it one of the defining properties, like the other three?
“The extent to which data points in a statistical distribution or data set diverge from the average or mean value. Variability also refers to the extent to which these data points differ from each other. There are four commonly used measures of variability: range, mean, variance and standard deviation.”
A quick scan of the web shows a consensus on the 3Vs as units for the axes, and a consensus that 4Vs sounds better, but the jury is still out on the choice between Complexity and Variability.
So it will be either 4Vs or 3Vs & C.
Anyone with a real-life example of how data complexity is changing, please step forward; thanks.
The long tail is much longer when you have Big Data. Thus bigger datasets will have more interesting and outlying cases than a dataset that is merely a sample of Big Data.
When you quantify and explain variability, you are helping the consumer of the information distinguish between a random rare event and a true finding that demands attention. If all you observe are rare random events that cannot be explained, and yet you try to concoct an explanation (as most humans try to do), you are suffering from apophenia. Apophenia is the experience of seeing meaningful patterns or connections in random or meaningless data.
One important job of the Data Scientist in the age of Big Data is to help distinguish between apophenia and meaningful phenomena.
I found articles from Nov 2011 where
"Forrester added variability. McKinsey Global Institute next threw in value as a fifth ‘V’ descriptor."
Other people throw in 12 dimensions. At that point, there is confusion between what defines big data and what one can derive from analyzing it.
Variability is logical to add, since big data is more a collection of random points from various sources, as opposed to previous data sets that came as one coherent structure from a single source.
I would argue against value as a defining dimension; it is more a reason why one should be interested in the analysis.
The McKinsey study is here:
Very educational for those interested.
The Economist also had a very good special report on the subject, where they introduce terms such as “data exhaust”
and “data-centred economy”. They handle the definition of big data in their own way.
In this special report:
-Data, data everywhere
-All too much
-A different game
-Clicking for gold
-The open society
-Needle in a haystack
-New rules for big data
-Handling the cornucopia
The next step was to create a list of companies that have created tools to manipulate and process big data, but someone beat me to that.
These are companies that supply tools to:
One should figure out the required operation before starting the search for the company that offers the required tool(s).
Hi Diya, Great piece. For those interested in where the "3Vs" of Big Data originated, here's a link to the Gartner piece I wrote first defining them in 2001: 3-D Data Management: Controlling Data Volume, Velocity and Variety. Cool to see the marketplace finally adopt them, albeit 11 years later! :-) Since that time, Gartner has expanded on the definition (including 12 distinct dimensions). --Doug Laney, VP Research, Gartner, @doug_laney
IBM defines the fourth V to be "veracity". I personally like to define the fifth V to be "visualization".