
As I studied the subject, the following three terms stood out in relation to Big Data.

Variety, Velocity and Volume.

In marketing, the 4Ps define the entire discipline using only four terms:
Product, Promotion, Place, and Price.


I claim that the 3Vs above define big data in a similar fashion.
These three properties describe the expansion of a data set along different fronts to the point where it deserves to be called big data, an expansion that keeps accelerating and generating ever more data of ever more types.

The plot above, using three axes, helps to visualize the concept.

Data Volume:
The size of available data has been growing at an increasing rate. This applies to companies and to individuals. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes.
More sources of data are added on a continuous basis. For companies, in the old days, all data was generated internally by employees. Currently, the data is generated by employees, partners and customers. For some companies, the data is also generated by machines. For example, hundreds of millions of smart phones send a variety of information to the network infrastructure. This data did not exist five years ago.
More sources of data combined with larger data sizes increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
Petabyte data sets are common these days, and the exabyte is not far away.
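
As a rough illustration of how quickly machine-generated sources add up, here is a back-of-envelope sketch in Python; the device count, report size and report rate are assumptions chosen only to make the arithmetic concrete, not measured figures.

    # Back-of-envelope estimate of machine-generated data volume.
    # All figures below are illustrative assumptions, not measurements.
    phones = 500_000_000          # smartphones reporting to the network
    bytes_per_report = 2_000      # ~2 KB per status/telemetry report
    reports_per_day = 24          # one report per hour

    daily_bytes = phones * bytes_per_report * reports_per_day
    print(f"{daily_bytes / 1e12:.1f} TB per day")          # -> 24.0 TB per day
    print(f"{daily_bytes * 365 / 1e15:.1f} PB per year")   # -> 8.8 PB per year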

Large Synoptic Survey Telescope (LSST).
http://lsst.org/lsst/google
“Over 30 thousand gigabytes (30TB) of images will be generated every night during the decade-long LSST sky survey.”

https://www.youtube.com/t/press_statistics/?hl=en
72 hours of video are uploaded to YouTube every minute

There is a corollary to Parkinson’s law that states: “Data expands to fill the space available for storage.”
http://en.wikipedia.org/wiki/Parkinson’s_law

This is no longer true since the data being generated will soon exceed all available storage space.
http://www.economist.com/node/15557443

Data Velocity:
Initially, companies analyzed data using a batch process: take a chunk of data, submit a job to the server and wait for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is still useful despite the delay. With the new sources of data such as social and mobile applications, the batch process breaks down: the data now streams into the server in real time, in a continuous fashion, and the result is only useful if the delay is very short.
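
To make the contrast concrete, here is a minimal sketch in plain Python (no particular streaming framework is implied): a batch job that waits for the whole chunk versus a streaming loop that updates its answer as each record arrives.

    # Batch style: collect a chunk, submit, wait for the result.
    def batch_average(records):
        return sum(records) / len(records)

    # Streaming style: update the running result as each record arrives,
    # so an answer is available with minimal delay at any moment.
    def streaming_average(record_stream):
        total, count = 0.0, 0
        for value in record_stream:
            total += value
            count += 1
            yield total / count   # current answer after every new record

    # The same data, two very different latencies for a useful answer.
    data = [3, 5, 7, 9]
    print(batch_average(data))               # one answer, after all data is in
    for running_avg in streaming_average(iter(data)):
        print(running_avg)                   # an answer after every record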

http://blog.twitter.com/2011/03/numbers.html
140 million tweets per day on average (more in 2012).

I have not yet determined how data velocity could increase further, since real time is as fast as it gets. The delay between data arriving and results being delivered, however, will continue to shrink until it too reaches real time.

Data Variety:
From Excel tables and databases, data has changed to lose its structure and to add hundreds of formats: pure text, photo, audio, video, web pages, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, and so on. One no longer has control over the input data format. Structure can no longer be imposed, as it was in the past, in order to keep control over the analysis. As new applications are introduced, new data formats come to life.
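
One common response is "schema on read": instead of forcing every source into one fixed table up front, the structure is interpreted at analysis time. A minimal sketch, with made-up records purely for illustration:

    import json

    # Incoming records no longer share one fixed structure.
    raw_inputs = [
        '{"user": "a12", "lat": 45.5, "lon": -73.6}',   # JSON from a mobile app
        'CALL ME RE ORDER 7731',                        # free-form SMS text
        '{"sensor": "t-04", "temp_c": 21.7}',           # machine/sensor reading
    ]

    def interpret(record):
        """Decide the structure at read time rather than at load time."""
        try:
            return {"kind": "structured", "data": json.loads(record)}
        except json.JSONDecodeError:
            return {"kind": "text", "data": record}

    for r in raw_inputs:
        print(interpret(r))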

Google uses smart phones as sensors to determine traffic conditions.

http://www.wired.com/autopia/2011/03/cell-phone-networks-and-the-fu...
In this application they are most likely reading the speed and position of millions of cars to construct the traffic pattern in order to select the best routes for those asking for driving directions. This sort of data did not exist on a collective scale a few years ago.
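
In spirit the aggregation is simple, even if the scale is not: group anonymous (segment, speed) readings by road segment and compare the average against a free-flow speed. A toy sketch, where the segment names and the congestion threshold are assumptions:

    from collections import defaultdict

    # (road_segment, observed_speed_kmh) readings reported by many phones.
    readings = [
        ("I-95-north-mile-12", 38), ("I-95-north-mile-12", 42),
        ("I-95-north-mile-12", 35), ("route-1-exit-4", 88),
        ("route-1-exit-4", 92),
    ]

    speeds = defaultdict(list)
    for segment, speed in readings:
        speeds[segment].append(speed)

    FREE_FLOW_KMH = 80   # assumed threshold for "traffic is flowing"
    for segment, values in speeds.items():
        avg = sum(values) / len(values)
        status = "congested" if avg < FREE_FLOW_KMH else "flowing"
        print(f"{segment}: avg {avg:.0f} km/h -> {status}")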

The 3Vs together describe a set of data and a set of analysis conditions that clearly define the concept of big data.

 

So what is one to do about this?

So far, I have seen two approaches:
1. Divide and conquer using Hadoop.
2. Brute force using an "appliance" such as SAP HANA (High-Performance Analytic Appliance).

In the divide and conquer approach, the huge data set is broken into smaller parts (HDFS) and processed in parallel (MapReduce) across thousands of servers.
http://www.kloudpedia.com/2012/01/10/hadoop/
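
The canonical illustration is word count in the map/reduce style: the framework splits the input across many workers, runs a map step on each split, groups the intermediate key/value pairs by key, and runs a reduce step on each group. The sketch below simulates that flow locally in Python; it is a simplified illustration of the idea, not actual Hadoop code.

    # Word count in the map/reduce style, simulated on one machine.
    from itertools import groupby

    def map_phase(line):
        for word in line.split():
            yield (word.lower(), 1)

    def reduce_phase(word, counts):
        return (word, sum(counts))

    lines = ["big data is big", "data about data"]

    # Map, then shuffle/sort by key (what the framework does between phases).
    pairs = sorted(kv for line in lines for kv in map_phase(line))
    results = [reduce_phase(word, (c for _, c in group))
               for word, group in groupby(pairs, key=lambda kv: kv[0])]
    print(results)   # [('about', 1), ('big', 2), ('data', 3), ('is', 1)]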

As the volume of data increases, more servers are added and the process runs in the same manner. Need a shorter delay for the result? Add more servers again. Given that, with the cloud, server capacity is effectively unlimited, it is really just a matter of cost: how much is it worth to get the result in a shorter time?

One has to accept that not ALL data analysis can be done with Hadoop. Other tools are always required.

For the brute force approach, a very powerful server with terabytes of memory is used to crunch the data as one unit, with the data set compressed in memory. For example, for a Twitter data flow that is pure text, the compression ratio may reach 100:1. A 1TB IBM SAP HANA appliance can then load a 100TB data set in memory and run analytics on it.
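
The arithmetic behind that claim is simple: effective capacity is physical memory times the compression ratio, and the 100:1 figure only holds for highly repetitive data such as short text messages. A quick sketch:

    # Effective in-memory capacity = physical RAM x compression ratio.
    ram_tb = 1                 # physical memory of the appliance, in TB
    compression_ratio = 100    # optimistic ratio quoted for pure-text data

    effective_capacity_tb = ram_tb * compression_ratio
    print(f"{effective_capacity_tb} TB of raw data can be held in memory")
    # With a more typical 5-10x ratio, the same box holds only 5-10 TB.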

IBM has a 100TB unit for demonstration purposes.
http://www.ibm.com/solutions/sap/us/en/landing/hana.html

Many other companies are filling the gap between these two approaches by releasing all sorts of applications that address different steps of the data processing pipeline, plus its management and system configuration.


Replies to This Discussion

The most important V is Value.

Vincent:

I disagree with you.  Value is not something that is inherent in data, but must be extracted from it -- if it is there.  The Large Hadron Collider generates TBs of data but most of it is thrown out, in favor of the pieces that are the real "signal." 

The analytics applied to that larger dataset told them where the value lies -- it needed to be extracted.

Of the remaining three Vs, the importance depends upon what your situation is.

Finally, all this talk of Hadoop and hardware belies the fact that while you need new software and hardware to house and access Big Data, there is nothing like a good analyst to figure out the value.  Most adults I know can drive a car, but darn few can race in the Indy 500.

Steve,

I can't agree more - the value needs to be derived - and as you rightly say, if it is there at all.

In the case of most "general" or "average" organisations, complexity (represented for example by variety, variability and veracity, if you have to drag the Vs in) is a much bigger issue than raw volume and velocity. Your "average" insurer or retailer does not deal with data on the scale that a Yahoo, eBay or LinkedIn does; those are very special data-driven organisations. I'll go so far as to say the average organisation doesn't even need special Big Data technology to get some analytical value from its unstructured or social media data: just some clever thinking and wise use of what's at its disposal, definitely to start off anyway.

IBM (and others) is either confused about what big data is, or is merely avoiding giving Gartner credit for the original 3Vs from 12 years ago. "Veracity" is certainly not a definitional characteristic of Big Data. In fact, it is inversely related to my original 3Vs. Increases of volume, variety and velocity tend to reduce the overall veracity of the data. As far as "visualization" goes, that's a use-case, not a characteristic. --Doug Laney, VP Research, Gartner, @doug_laney 

Michael Malak said:

IBM defines the fourth V to be "veracity".  I personally like to define the fifth V to be "visualization".

Well said Steve. Value is most certainly *not* a defining characteristic of Big Data. In fact as data's volume, velocity and variety increase, the potential or realized value of any instance of data actually decreases. The data set's overall value may (or may not) increase -- it's mutually exclusive to the original 3Vs I defined in 2001. --Doug Laney, VP Research, Gartner, @doug_laney 


Over the past few years, I have seen many attempts to add more Vs.

I ran into the best candidate this weekend: V for Value.

Big V of Big Data – Big Value

http://paper.li/rajivmah/1434788984

I totally agree, and it states the obvious: we are all doing big data, since we are trying to extract the value hidden within all those data streams.

In this write-up, I referred to it as dark data:

https://www.linkedin.com/pulse/dark-data-fuel-warp-speed-growth-iot...

The value is locked inside until someone figures out how to release it by any means.

After having implemented several data lakes with Hadoop and HBase, I have completely soured on Hadoop. While it looks like a lovely abstraction for addressing a growing set of unstructured information, the batch attributes of the approach make it difficult to integrate properly into high-performance business processes. Secondly, the inefficiency of Hadoop becomes quite a distraction for both software development and operational management. Thirdly, the complexity of the run-time fights with what the infrastructure can do better, certainly for a divide-and-conquer approach like MapReduce. Fourth, most teams now use a SQL interface to get access to the data, but Hive isn't SQL, and if it ever does become SQL I doubt it will be competitive with a real SQL engine designed for that purpose. Fifth, our go-to target these days for adding value from the data in the data lake is knowledge graphs of some kind, and we are trying to find scalable mechanisms to link those together; Hadoop and HDFS are not helpful here. And finally, sixth, parallel file systems with different levels of erasure coding are delivering more performance, in a more flexible way, at a lower cost than a Hadoop cluster.

Looking at the power of horizontally scalable SQL engines, such as MemSQL, VoltDB, and NuoDB, that offer transactions, ACID guarantees, and dial-in performance levels, you gain application development efficiency AND operational efficiency as the number of moving parts becomes dramatically smaller.

We are done with Hadoop.

Oh, and the fifth McKinsey V, value, is key to any business process that funds the big data effort, so I would say it is the fundamental attribute of Big Data: all big data efforts can only exist if they have found a proper value to pursue.
