
The tech marketing hype machine shouting "Big Data" has drowned out the fact that actionable, valuable insights are more likely to be found in small data sets than in large ones. There are several reasons for this, but a major one is the curse of big data. "Big data" means large data sets that have different properties from small data sets, that require special data science methods to differentiate signal from noise and extract meaning, and that require special compute systems and power.
The curse of big data has been described by Vincent Granville. Put simply, you will find more "statistically significant" relationships in larger data sets. "Statistically significant" means that a statistical test judges an observed pattern unlikely to have arisen by chance alone; such a pattern may or may not be meaningful. The larger the data set, the more of its "statistically significant" relationships will have no meaning - creating greater opportunity to mistake noise for signal. "Signal" means a meaningful interpretation of data based on science that may be transformed into scientific evidence and knowledge. "Noise" means a competing interpretation of data that is not grounded in science and may not be considered scientific evidence. Yet noise can still be turned into a form of knowledge: knowledge of what does not work.
So big data produces more correlations and patterns in the data - yet it also produces much more noise than signal. The number of false positives rises significantly. In other words, big data yields more correlations without causation, which creates an illusion of reality.
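To make the rise in false positives concrete, here is a minimal sketch in Python. It fills a table with pure random noise, so no real relationship exists, and counts how many pairs of columns still test as "statistically significant". The function names, the 50 observations per column, the 0.05 threshold, and the Fisher z approximation used for the p-value are all my assumptions for illustration, not a method taken from Granville's article.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n_obs):
    """Two-sided p-value for r using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n_obs - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_vars, n_obs=50, alpha=0.05, seed=0):
    """Count 'significant' pairwise correlations among pure-noise columns.

    Every column is independent random noise, so every hit is a false positive.
    Returns (number of hits, number of pairs tested).
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(columns, 2))
    hits = sum(
        1 for x, y in pairs if approx_p_value(pearson_r(x, y), n_obs) < alpha
    )
    return hits, len(pairs)


if __name__ == "__main__":
    for n_vars in (10, 50, 100):
        hits, pairs = count_false_positives(n_vars)
        print(f"{n_vars} noise variables: {hits} of {pairs} pairs look significant")
```

The share of false positives stays near the chosen threshold, but the number of pairs grows roughly with the square of the number of variables, so the absolute count of meaningless "findings" climbs quickly as the data set gets wider.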
"Correlation" means any of a broad class of statistical relationships involving dependence. "Spurious correlation" means a correlation between two variables that does not result from any direct relation between them but from their relation to other variables. "Causation" means the relationship between cause and effect backed by scientific evidence (e.g. relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first). "Correlation does not imply causation" is a phrase used in science and statistics to emphasize that a correlation between two variables does not necessarily imply that one causes the other.
Yet evolution has hardwired humans to see patterns. This is a necessary quality for survival in the jungle, but it serves us poorly in many forms of abstract thinking - especially when we mistake randomness in data for meaning. Put another way, we mistake noise for signal.
Big data makes it harder to find the needle (actionable, valuable insights) in a larger and larger haystack. The danger is that we will increasingly be fooled by randomness in big data and, believing noise is signal, make bad decisions as a result.
I suggest that one good strategy for solving the "curse of big data" problem - in many (but not all) cases - is the intentional and purposeful breakdown of large data sets into smaller ones. Creating smaller data sets from big data should be done strategically, not randomly. It is easier to analyze and test small data sets to differentiate signal from noise and extract meaning.
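One possible reading of "strategically" is to split the data along a field that is meaningful to the business and then check whether a pattern found in the pooled data survives inside each smaller set. The Python sketch below does that on synthetic records. The "segment" field, the record layout, and the helper names are hypothetical, chosen only to illustrate the idea.

```python
import math
import random
from collections import defaultdict


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def make_records(n_per_segment=1000, seed=2):
    """Synthetic data: x and y are unrelated within a segment,
    but both run higher in segment B than in segment A."""
    rng = random.Random(seed)
    records = []
    for segment, center in (("A", 0.0), ("B", 4.0)):
        for _ in range(n_per_segment):
            records.append(
                {"segment": segment,
                 "x": rng.gauss(center, 1),
                 "y": rng.gauss(center, 1)}
            )
    return records


def pooled_correlation(records):
    """Correlation of x and y over the whole (big) data set."""
    return pearson_r([r["x"] for r in records], [r["y"] for r in records])


def correlation_by_segment(records, key="segment"):
    """Split records on a meaningful key and correlate x and y per subset."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return {
        name: pearson_r([r["x"] for r in rows], [r["y"] for r in rows])
        for name, rows in groups.items()
    }


if __name__ == "__main__":
    data = make_records()
    print(f"pooled: r = {pooled_correlation(data):.3f}")
    for name, r in sorted(correlation_by_segment(data).items()):
        print(f"segment {name}: r = {r:.3f}")
```

In this synthetic case the pooled data shows a strong correlation that is absent within either segment, so the pooled pattern reflects segment membership rather than a relationship between x and y. Real data will rarely be this clean, and choosing the split still requires domain knowledge.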
Beware of the curse of big data and avoid mistaking noise for signal. Small data is indeed very beautiful.