The tech marketing hype machine shouting "Big Data" has drowned out the fact that actionable, valuable insights are more likely to be found in small data sets than in large ones. There are a number of reasons for this, but a major one is the curse of big data. "Big data" means large data sets that have different properties from small data sets, that require special data science methods to differentiate signal from noise and extract meaning, and that require special compute systems and power.
The curse of big data has been described by Vincent Granville. Put simply, you will find more "statistically significant" relationships in larger data sets. "Statistically significant" means that a statistical test judges an observation to reflect a pattern rather than chance alone; such a result may or may not be meaningful. The larger the data set, the more "statistically significant" relationships it will contain that have no meaning - creating greater opportunity to mistake noise for signal. "Signal" means a meaningful interpretation of data based on science that may be transformed into scientific evidence and knowledge. "Noise" means a competing interpretation of data not grounded in science that may not be considered scientific evidence. Yet noise can still be turned into a form of knowledge: knowing what does not work.
So big data produces more correlations and patterns in the data - yet it also produces much more noise than signal. The number of false positives will rise significantly. In other words, there will be more correlations without causation, leading to an illusion of reality.
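To make the false-positive claim concrete, here is a minimal Python sketch. It assumes that a "larger" data set means one with more variables, fills every column with pure random noise, and counts how many pairs of columns still look "statistically significant". The function names, the 100-row sample size, and the approximate 5% cutoff of 1.96 / sqrt(n) are illustrative assumptions, not something taken from Granville's article.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_false_positives(n_vars, n_rows=100, seed=0):
    """Count 'significant' correlations among columns of pure noise.

    Returns (hits, pairs). Every hit is a false positive by construction,
    because the columns are generated independently.
    """
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]
    # Assumption: large-sample approximation of the two-sided 5% cutoff for |r|.
    threshold = 1.96 / math.sqrt(n_rows)
    hits = sum(
        1
        for a, b in itertools.combinations(cols, 2)
        if abs(pearson(a, b)) > threshold
    )
    pairs = n_vars * (n_vars - 1) // 2
    return hits, pairs


for n_vars in (5, 20, 80):
    hits, pairs = count_false_positives(n_vars)
    print(f"{n_vars:3d} noise variables: {hits} of {pairs} pairs look significant")
```

Because every column is noise, each flagged pair is a false positive. Roughly 5% of pairs are expected to clear the cutoff, and the number of pairs grows with the square of the number of variables, so the count of meaningless "relationships" climbs quickly as the data set widens.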
"Correlation" means any of a broad class of statistical relationships involving dependence. "Spurious correlation" means a correlation between two variables that does not result from any direct relation between them but from their relation to other variables. "Causation" means the relationship between cause and effect backed by scientific evidence (e.g. relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first). "Correlation does not imply causation" is a phrase used in science and statistics to emphasize that a correlation between two variables does not necessarily imply that one causes the other.
Yet humans are hardwired by evolution to see patterns. This is a necessary quality for survival in the jungle, but it serves us poorly in many forms of abstract thinking - especially when we mistake randomness in data for meaning. Put another way, we mistake noise for signal.
Big data makes it harder to find the needle (actionable, valuable insights) in a larger and larger haystack. The danger is that we will increasingly be tricked by randomness in big data and, believing noise is signal, make bad decisions as a result.
I suggest that one good strategy for solving the "curse of big data" problem - in many (but not all) cases - is the intentional and purposeful breakdown of large data sets into smaller data sets. Creating smaller data sets from big data should be done strategically, not randomly. It is easier to analyze and test small data sets to differentiate signal from noise and extract meaning.
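One way to read this strategy, sketched in Python below, is to split the data along a meaningful key and keep only the relationships that show up in every subset. Everything here is an assumption for illustration: the synthetic data, the hypothetical `segment`, `signal`, `outcome` and `noise_*` fields, and the rule that a correlation must be significant with the same sign in every segment. It is one possible test, not a prescribed method.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def make_rows(n_per_segment=200, n_noise=30, seed=1):
    """Synthetic rows: one real driver of 'outcome' plus many noise fields."""
    rng = random.Random(seed)
    rows = []
    for segment in ("north", "south", "east", "west"):  # hypothetical split key
        for _ in range(n_per_segment):
            signal = rng.gauss(0, 1)
            row = {
                "segment": segment,
                "signal": signal,
                "outcome": 0.5 * signal + rng.gauss(0, 1),
            }
            for i in range(n_noise):
                row[f"noise_{i}"] = rng.gauss(0, 1)
            rows.append(row)
    return rows


def replicated_features(rows, target="outcome", key="segment"):
    """Keep features whose correlation with the target is significant,
    with the same sign, inside every subset defined by `key`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    features = [name for name in rows[0] if name not in (target, key)]
    kept = []
    for name in features:
        signs = set()
        for part in groups.values():
            r = pearson([p[name] for p in part], [p[target] for p in part])
            # Assumption: approximate two-sided 5% cutoff for |r|.
            if abs(r) <= 1.96 / math.sqrt(len(part)):
                signs = set()  # fails in this subset, so reject the feature
                break
            signs.add(r > 0)
        if len(signs) == 1:
            kept.append(name)
    return kept


print(replicated_features(make_rows()))
```

A real relationship should tend to reappear in each subset, while a chance correlation in the pooled data rarely survives all of them. The choice of split key matters, which is why the split should be made deliberately rather than at random.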
Beware of the curse of big data and avoid mistaking noise for signal. Small data is indeed very beautiful.

Comment by Michael Walker on December 16, 2013 at 10:14am

Ken: No worries - please feel free to re-blog my posts. I respectfully ask that you include attribution with my name and links. Thank you for the kind words.

Comment by Ken Karnes on December 13, 2013 at 9:24am

Michael, on a recent Google search for information I found your blog... and I ultimately re-blogged it - wanted to let you know this in case you don't want it re-blogged - please just let me know. I am enjoying reading all your posts - as a non-technical user of data! Thank you,

Comment by Fritz Broeze on April 15, 2013 at 9:03pm

 Agree. Big data runs the risk of creating models as large and complicated as the system they are modeling. Breaking it down into smaller pieces, thereby creating smaller, more manageable models, is a sensible solution. Caution should be taken to ensure that the assumptions made in breaking down the data are reasonable and continuously tested.

© 2014   Data Science Central
