
The "Big Data" marketing hype obscures the fact that more actionable, valuable insights are often found in the right, smaller "Smart Data" sets than in very large data sets.

While the term "Big Data" is properly defined as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time, the marketing hype promises that simply collecting, storing, and crunching huge amounts of data will yield value and competitive advantage.

As many organizations are now learning, it is very difficult to get value out of large data sets without clear goals, sophisticated data science techniques (e.g., machine learning algorithms), and the right data processing and analytical technologies. Extracting meaning requires methods that separate signal from noise, along with substantial computing power.

While large data sets may provide great value in specific situations, the savvy professional data scientist knows the right combination and variety of "Smart Data" is usually more important than "Big Data" and is more likely to add significant value. One of the most important roles of data scientists is to select the appropriate variety of small data sets for a specific goal versus collecting and storing huge volumes of data.

There are a number of reasons for prioritizing the selection and collection of the right data and different varieties of "Smart Data" over "Big Data". One major reason is the curse of big data: simply put, you will find more "statistically significant" relationships in larger data sets. Statistical significance is an assessment of whether an observed pattern is unlikely to be due to chance alone; it says nothing about whether the pattern is meaningful. The larger the data set, the more "statistically significant" yet meaningless relationships it will contain, creating greater opportunity to mistake noise for signal.
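The curse is easy to demonstrate with a simulation. The sketch below (illustrative only; the sample size, variable count, and the approximate critical correlation of 0.361 for n = 30 at p &lt; 0.05 are my choices, not from the original post) generates columns of pure random noise and then tests every pair for correlation. Hundreds of pairs clear the conventional significance threshold even though, by construction, no real relationship exists:

```python
import random
import itertools

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(42)
n_samples, n_vars = 30, 100
# 100 columns of pure noise: no variable is related to any other
data = [[random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_vars)]

# Approximate critical |r| for p < 0.05 (two-tailed) with n = 30
r_crit = 0.361
pairs = n_vars * (n_vars - 1) // 2   # 4950 pairwise tests
hits = sum(1 for a, b in itertools.combinations(data, 2)
           if abs(pearson_r(a, b)) > r_crit)
print(f"{hits} of {pairs} pairs look 'significant' despite being pure noise")
```

With a 5% false-positive rate per test, roughly 250 of the 4,950 pairs will look "significant" by chance alone; add more variables and the count of spurious findings grows quadratically.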

Thus, "Big Data" produces more correlations and patterns between data - yet also produces much more noise than signal, and the number of false positives rises significantly. In other words, it yields more correlation without causation, creating an illusion of insight.
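The arithmetic behind the rise in false positives is worth making explicit. A quick sketch (my own illustration, not from the original post) of the family-wise error rate under many independent tests, and of the standard Bonferroni correction that counters it:

```python
# Probability of at least one false positive grows quickly with the
# number of independent hypothesis tests run at alpha = 0.05
alpha, m = 0.05, 1000
p_any_false = 1 - (1 - alpha) ** m          # essentially 1.0 for m = 1000

# Bonferroni correction: test each hypothesis at alpha / m instead,
# which caps the family-wise error rate at roughly alpha
alpha_corrected = alpha / m
p_any_false_corrected = 1 - (1 - alpha_corrected) ** m

print(round(p_any_false, 4))                # ~1.0: a false hit is near-certain
print(round(p_any_false_corrected, 4))      # back below 0.05
```

The trade-off is that a corrected threshold this strict also discards weak real signals, which is another argument for testing a small number of well-chosen hypotheses on the right data rather than dredging everything.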

Big data makes it harder to find the needle (actionable, valuable insights) in a larger and larger haystack. The danger is that we will increasingly be fooled by randomness in big data and, believing noise to be signal, make bad decisions as a result.

I suggest valuing the right "Smart Data" over "Big Data" and focusing on carefully selecting a variety of data sets relevant to a specific goal to maximize the probability of obtaining meaning and value from data.

See: http://bit.ly/1ow9EVF


Comment by Michael Kremliovsky on September 22, 2014 at 7:59am

Michael, excellent term. Very well said. I was looking to shape it into something very compact and sound. You did. 

Comment by Ralph Winters on September 21, 2014 at 10:43am

If we had the same level of tools for "Big Data" as we do for "Small Data", we could do better. But you are correct, "Big Data" is forcing us into a "Forest beyond the Trees" type of scenario, in which the data we are analyzing has no context. Only when it is pared down into something manageable can we begin to do meaningful work.

© 2019 Data Science Central ®