
Big Data or Not Big Data: What is <your> question?

Before jumping on the Big Data bandwagon, I think it is important to ask whether the problem you have actually requires much data. That is, it's important to determine when Big Data is relevant to the problem at hand.

The question of relevancy is important for two reasons: (i) if the data are irrelevant, you can't draw appropriate conclusions (collecting more of the wrong data leads absolutely nowhere); (ii) understanding the mismatch between the problem statement, the underlying process of interest, and the data in question is critical if you are going to distill any great truths from your data.

Big Data is relevant when you see evidence of a non-linear or non-stationary generative process that varies with time (or at least with sampling time), somewhere on the spectrum from random drift to full-blown chaotic behavior. Non-stationary behaviors can arise from complex (often 'hidden') interactions within the underlying process generating your observable data. If you observe non-linear relationships with underlying stationarity, the problem reduces to a sampling issue. Big Data also becomes relevant when we are dealing with processes embedded in a high-dimensional context (i.e., what's left after dimension reduction): with higher embedding dimensions, we need more and more well-distributed samples to understand the underlying process. For problems where the underlying process is both linear and stationary, we don't necessarily need much data at all.
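To make the embedding-dimension point concrete, here is a small numpy sketch (my own illustration, not from the original post): with the sample size held fixed, points drawn uniformly from a hypercube drift apart as the dimension grows, so ever more samples are needed to cover the space at the same density.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_distance(n_samples, dim):
    """Mean nearest-neighbor distance among `n_samples` points drawn
    uniformly from the unit hypercube in `dim` dimensions."""
    x = rng.uniform(size=(n_samples, dim))
    # Pairwise Euclidean distances; mask out self-distances.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

# With n fixed, neighbors spread apart as dimension grows:
# the same sample size covers the space ever more sparsely.
for dim in (1, 2, 10, 50):
    print(f"dim={dim:3d}  mean NN distance={nn_distance(500, dim):.3f}")
```

The growth of the nearest-neighbor distance is one way to see why a fixed dataset that is "big" in one dimension can be hopelessly sparse in fifty.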

Note: The size of the circles (in the accompanying figure) does not reflect the frequency of observing any particular type of data. Complex (small, non-linear, non-stationary) but under-sampled data are not rare. However, for complex processes you need more samples to capture the underlying variability and higher-order statistical structure, so the "need" for Big Data is greater. Whether you actually "have" sufficient data is a different issue. Likewise, for a simple linear stationary process, you need very little data.

The wrench here is in knowing when you are dealing with a non-linear or non-stationary process.  So, a little thoughtfulness and discovery work can tell you whether you have a Big Data problem, and then you can go about finding ways of actually collecting the required Big Data.  Knowing whether you have a Big Data problem (or not) informs your approach for actually learning from the data.
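As a starting point for that discovery work, a crude screen for non-stationarity is to check whether summary statistics drift across segments of the series. The numpy-only sketch below is a hypothetical heuristic of my own (the function name and `tol` threshold are assumptions, not an established method); a formal test such as the augmented Dickey-Fuller test would be the rigorous route.

```python
import numpy as np

def drift_check(x, n_splits=4, tol=4.0):
    """Heuristic non-stationarity screen: split the series into segments
    and flag it when the segment means spread by more than `tol` pooled
    standard errors. A rough screen, not a formal stationarity test."""
    segments = np.array_split(np.asarray(x, dtype=float), n_splits)
    means = np.array([s.mean() for s in segments])
    ses = np.array([s.std(ddof=1) / np.sqrt(len(s)) for s in segments])
    spread = means.max() - means.min()
    return bool(spread > tol * ses.mean())

rng = np.random.default_rng(1)
noise = rng.normal(size=1000)                # stationary white noise
trend = noise + np.linspace(0.0, 3.0, 1000)  # mean drifts upward
print(drift_check(noise), drift_check(trend))
```

A series that fails such a screen is a candidate for the "needs more data" quadrant; one that passes may reduce to a simpler, small-data modeling problem.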

This is an especially important question for Data Architects/Strategists to think about when building their roadmaps against the kinds of challenges they hope to tackle. Not all paths to good Data Strategy lead to Hadoop. I think a good guiding principle of design is to "design with the end user in mind." In this context, the end "user" is actually the algorithm learning from the data. While in principle it is empowering to believe that one can learn anything from an infinite box of all kinds of data (let's call it a 'universe box'), in practice you will want to reduce the data to the essential subset that lets you do something meaningful with it. Just because an algorithm can learn anything does not mean it should.

There's a really intuitive paper on Generative vs. Discriminative classifiers by Ng and Jordan that has stayed with me since grad school.

Ng, A. Y., & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Advances in Neural Information Processing Systems 14.

I like this paper for the intuition it provides.
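To illustrate the paper's theme, here is a self-contained numpy sketch (a toy setup of my own, not the paper's experiments) pitting a generative classifier (Gaussian naive Bayes) against a discriminative one (logistic regression fit by batch gradient descent) on synthetic 2-D data. The paper's point is that the generative model typically approaches its asymptotic error with fewer training samples, while the discriminative model often has the lower asymptotic error when the generative assumptions are violated.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Two Gaussian classes in 2-D; a toy setup, not the paper's datasets."""
    y = rng.integers(0, 2, size=n)
    offset = np.where(y[:, None] == 1, 1.5, -1.5) * np.array([1.0, 0.5])
    return rng.normal(size=(n, 2)) + offset, y

def nb_fit(x, y):
    # Generative: per-class feature means, variances, and class priors.
    return {c: (x[y == c].mean(0), x[y == c].var(0) + 1e-9, (y == c).mean())
            for c in (0, 1)}

def nb_predict(params, x):
    scores = []
    for c in (0, 1):
        mu, var, prior = params[c]
        ll = -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var)).sum(1)
        scores.append(ll + np.log(prior))
    return (scores[1] > scores[0]).astype(int)

def lr_fit(x, y, lr=0.5, steps=2000):
    # Discriminative: logistic regression by batch gradient descent.
    xb = np.hstack([x, np.ones((len(x), 1))])  # append bias column
    w = np.zeros(xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        w -= lr * xb.T @ (p - y) / len(y)
    return w

def lr_predict(w, x):
    xb = np.hstack([x, np.ones((len(x), 1))])
    return (xb @ w > 0).astype(int)

x_test, y_test = make_data(5000)
for n in (20, 1000):
    x_tr, y_tr = make_data(n)
    nb_err = (nb_predict(nb_fit(x_tr, y_tr), x_test) != y_test).mean()
    lr_err = (lr_predict(lr_fit(x_tr, y_tr), x_test) != y_test).mean()
    print(f"n={n:4d}  NB error={nb_err:.3f}  LR error={lr_err:.3f}")
```

On this toy data the naive Bayes assumptions actually hold, so both models converge to similar error; the interesting regime in the paper is small n, where the generative model's stronger assumptions pay off.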

Aside from their performance characteristics, a preference for discriminative vs. generative models to some extent reveals one's beliefs, and must inform one's design choices. 

Models can reveal something about ourselves, i.e., our sense of existentialism. I would think an engineer is more apt to believe that there is a truth, that it has a mechanistic basis, and that we can identify the model (see control theory). Key word: identify. Statisticians infer. Key word: infer. This almost suggests (at least in my mind) that somewhere in their heart of hearts, statisticians don't believe that an exact (perhaps noisy) mechanism actually exists; that humans just like to see mechanistic explanations where there are none to be found. I think statisticians are more comfortable with randomness, and even embrace it. Would an engineer be comfortable saying, for instance, that "all models are wrong"? I think engineers have to believe that some models are actually right.

Some tangents to think about.




Comment by Pradyumna S. Upadrashta on February 18, 2015 at 2:50am

You are still asking a question: "Given my data, what is the universe of possible questions that may be asked?" A hypothesis generator.

Comment by Hassine Saidane on February 18, 2015 at 1:15am

Interesting and great to know. Another question, though: what if I don't have a question? I am just looking for a hidden treasure, so my question is where the treasure, the gold nugget, is. Or what if I want the data to ask and answer the question? Is that Big Data or not Big Data? What models are appropriate to this quest?

Thanks for the insights.


© 2019 Data Science Central