
Throw them all out, I say. Big Data is really defined by just one letter: D, for Dimensionality.

(Or, at most, 2 letters: BD. Big Dimensionality.)

To use 3 (V)s, when 1 (D) is sufficient, is truly unforgivable.

Barty Crouch Jr. (as Alastor Moody): "But first, which of you can tell me how many Unforgivable Curses there are?"

Hermione: "Three, sir."

Barty Crouch Jr. (as Alastor Moody): "And they are so named?"

Hermione: "Because they are unforgivable. The use of any one of them will [...]"

Barty Crouch Jr. (as Alastor Moody): "Earn you a one-way ticket to Azkaban. Correct. The Ministry says you are too young to see what these curses do. I say different! You need to know what you're up against. You need to be prepared [...]"


In the world of Predictive Modeling and Big Data, there are two curses that really stand out.

1) The Curse of Dimensionality

The term was coined by that original professor of the Dark Arts, Richard E. Bellman, in his work on dynamic programming. It refers to the fact that as dimensionality increases, the data become sparse: a fixed number of observations covers an ever smaller share of the space. I frequently run into this problem when trying to build models using traditional data sources. The challenge there is that the majority of business processes and practices involve human discretion and judgment applied to a limited set of actions, leading decision makers to repeatedly "do what they've always done." In an experiment, you would cast a wider net to determine how things behave under different circumstances. Because of the homogeneity intrinsic to traditional, conservative decision-making, large parts of these problem domains remain under-sampled and poorly understood. This can derail efforts to develop robust statistical models.
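To make the sparsity point concrete, here is a minimal sketch, not anything from Bellman's work. It assumes points drawn uniformly from the unit hypercube and a grid with 4 bins per axis. The function name `occupied_fraction` and its parameters are made up for illustration. It counts what fraction of grid cells contain at least one of a fixed number of sample points as the number of dimensions grows.

```python
import random


def occupied_fraction(n_points, n_dims, bins_per_axis=4, seed=0):
    """Fraction of grid cells that contain at least one sample point.

    Hypothetical helper for illustration: points are uniform on [0, 1)^n_dims
    and each axis is cut into bins_per_axis equal bins.
    """
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Map each uniform coordinate in [0, 1) to a bin index on its axis.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(n_dims))
        occupied.add(cell)
    # The number of cells grows exponentially with the number of dimensions.
    return len(occupied) / bins_per_axis ** n_dims


if __name__ == "__main__":
    for d in (1, 2, 3, 5, 10):
        print(f"dims={d:2d}  cells={4 ** d:8d}  occupied={occupied_fraction(1000, d):.6f}")
```

With 1,000 points, the sample can never touch more than 1,000 cells, while ten dimensions already give 4^10 (about a million) cells. Most of the domain is simply never observed, which is the under-sampling described above.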

2)  The Curse of Big Data

As Vincent points out, this curse leads to spurious correlations: the problem of seeing things that are not real, and missing things that are real. Ghosts. Hallucinations. Other magical things.
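A rough way to see these ghosts is sketched below. It is an illustrative assumption, not Vincent's method. It generates columns of pure independent Gaussian noise, so every true correlation is zero, and reports the largest absolute Pearson correlation found among all pairs. The names `pearson` and `max_abs_correlation`, the 20 observations, and the seed are all arbitrary choices for the example.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_abs_correlation(n_obs, n_vars, seed=0):
    """Largest |correlation| among all pairs of independent noise columns."""
    rng = random.Random(seed)
    # Every column is independent noise, so any strong correlation is spurious.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    return max(
        abs(pearson(columns[i], columns[j]))
        for i in range(n_vars)
        for j in range(i + 1, n_vars)
    )


if __name__ == "__main__":
    for p in (5, 20, 100):
        pairs = p * (p - 1) // 2
        print(f"variables={p:3d}  pairs={pairs:5d}  max |r|={max_abs_correlation(20, p):.3f}")
```

The number of pairs grows roughly with the square of the number of variables, so with few observations and many variables, some pair of unrelated columns will look strongly correlated purely by chance.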




Comment by Vincent Granville on February 16, 2015 at 10:01am

The curse of dimensionality is a rather old statistical concept related to the sparsity of data (even big data) in high dimensions, and thus to the failure of methods based on, e.g., nearest neighbors: in high dimensions, neighbors are so far away and isolated that they are irrelevant.

The new problem is the curse of big data: there are so many variables and cross-correlations that many are bound to be spurious, overshadowing true but undetected correlations. In short, the signal is buried under tons of noise that somehow seems to exhibit patterns. The fix is described here.

