
Big Data or Not Big Data: What is <your> question?

Before jumping on the Big Data bandwagon, I think it is important to ask whether the problem you have actually requires much data. That is, I think it's important to determine when Big Data is relevant to the problem at hand.

The question of relevancy is important for two reasons: (i) if the data are irrelevant, you can't draw appropriate conclusions (collecting more of the wrong data leads absolutely nowhere), and (ii) you need to understand any mismatch between the problem statement, the underlying process of interest, and the data in question if you are going to distill any great truths from your data.

Big Data is relevant when you see some evidence of a non-linear or non-stationary generative process that varies with time (or at least, sampling time), anywhere on the spectrum from random drift to full-blown chaotic behavior. Non-stationary behaviors can arise from complex (often 'hidden') interactions within the underlying process generating your observable data. If you observe non-linear relationships with underlying stationarity, the problem reduces to a sampling issue. Big Data implicitly becomes relevant when we are dealing with processes embedded in a high-dimensional context (i.e., what's left after dimension reduction). With higher embedding dimensions, we need more and more well-distributed samples to understand the underlying process. For problems where the underlying process is both linear and stationary, we don't necessarily need much data at all.
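To make the point about embedding dimension concrete, here is a minimal sketch of how the number of samples needed for even coverage grows with dimension. It assumes a deliberately crude model that is not from the original post: each axis is cut into a fixed number of bins, and we want at least a few samples in every resulting cell. The function names and the `per_cell` parameter are hypothetical, chosen only for illustration.

```python
def cells_to_cover(bins_per_axis: int, dim: int) -> int:
    """Count the cells in a grid that splits each of `dim` axes into `bins_per_axis` bins."""
    if bins_per_axis < 1 or dim < 0:
        raise ValueError("bins_per_axis must be >= 1 and dim must be >= 0")
    return bins_per_axis ** dim


def min_samples(bins_per_axis: int, dim: int, per_cell: int = 1) -> int:
    """Lower bound on samples needed to put `per_cell` samples in every cell.

    Assumes samples are spread perfectly evenly, which real data never are,
    so the true requirement is higher.
    """
    return per_cell * cells_to_cover(bins_per_axis, dim)


if __name__ == "__main__":
    # Same resolution per axis, increasing embedding dimension.
    for dim in (1, 2, 5, 10):
        print(dim, min_samples(10, dim))
```

The growth is exponential in the dimension, which is why reducing the dimension first (and asking what is left) matters so much before deciding how much data you need.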

Note: The sizes of the circles do not reflect the frequency of observing any particular type of data. Complex (small, nonlinear, nonstationary) but under-sampled data are not rare. However, for complex processes, you need more samples to capture the underlying variability and higher-order statistical structure, so the "need" for big data is greater. Whether you actually "have" sufficient data is a different issue. Likewise, for a simple linear stationary process, you need very little data.

The wrench here is in knowing when you are dealing with a non-linear or non-stationary process.  So, a little thoughtfulness and discovery work can tell you whether you have a Big Data problem, and then you can go about finding ways of actually collecting the required Big Data.  Knowing whether you have a Big Data problem (or not) informs your approach for actually learning from the data.
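As one example of the kind of discovery work meant here, the sketch below compares the mean of the first half of a series with the mean of the second half, scaled by the overall spread. This is only a rough heuristic I am assuming for illustration, not a formal stationarity test and not a method from the original post; the function name `split_half_drift` is hypothetical, and any cutoff you apply to its output would be a judgment call.

```python
from statistics import mean, pstdev


def split_half_drift(series):
    """Return the shift in mean between the two halves of `series`,
    expressed in units of the overall (population) standard deviation.

    Values near 0 are consistent with a stable mean; large values hint at
    drift. This checks only the mean, so it can miss other non-stationarity
    (e.g., changing variance).
    """
    n = len(series)
    if n < 4:
        raise ValueError("need at least 4 observations")
    half = n // 2
    first, second = series[:half], series[half:]
    spread = pstdev(series)
    if spread == 0:
        # A constant series has no drift by this measure.
        return 0.0
    return abs(mean(second) - mean(first)) / spread


if __name__ == "__main__":
    stable = [0, 1] * 50           # oscillates around a fixed mean
    trending = list(range(100))    # mean drifts steadily upward
    print(split_half_drift(stable))
    print(split_half_drift(trending))
```

A check like this is cheap to run before committing to infrastructure; if it flags drift, that is a cue to look more carefully with proper statistical tests.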

This is an especially important question for Data Architects/Strategists to think about when building their roadmaps against the kinds of challenges they hope to tackle. Not all paths to good Data Strategy lead to Hadoop. I think a good guiding principle of design is to "design with the end user in mind." In this context, the end "user" is actually the algorithm learning from the data. In principle, it is empowering to believe that one can learn anything from an infinite box of all kinds of data (let's call it a 'universe box'). In practice, you will want to reduce the data to the essential subset that lets you do something meaningful with it. Just because an algorithm can learn anything does not mean it should.

There's a really intuitive paper on Generative vs. Discriminative classifiers by Ng and Jordan that has stayed with me since grad school.

Ng, A. Y., & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A comparison of Logistic Regression and Naive Bayes. Neural Information Processing Systems.

I like this paper for the intuition it provides.

Aside from their performance characteristics, a preference for discriminative vs. generative models to some extent reveals one's beliefs, and must inform one's design choices. 

Models can reveal something about ourselves, i.e., our sense of existentialism. I would think an engineer is more apt to believe that there is a truth, that it has a mechanistic basis, and that we can identify the model (see Control theory). Key word: Identify. Statisticians infer. Key word: Infer. This almost suggests (at least in my mind) that somewhere in their heart of hearts, statisticians don't believe that an exact (perhaps noisy) mechanism actually exists, and that humans just like to see mechanistic explanations where there are none to be found. I think they are more comfortable with randomness, and even embrace it. Would an engineer be comfortable saying, for instance, that "all models are wrong"? I think they have to believe that some models are actually right.

Some tangents to think about.


Comment by Hassine Saidane on February 18, 2015 at 5:54am

Yes, but... Additional intelligence can be added to answer the "why" question by automatically explaining the discovered "unusual" patterns.

Comment by Pradyumna S. Upadrashta on February 18, 2015 at 5:42am

The trivial solution would be to automatically ask Why? after every step, which is what children do until adults eventually give up in frustration, saying "I don't know!" But when an informed adult asks the question Why?, it is usually after being presented with information contrary to their expectation (a value system), as opposed to trivially asking Why? just because they can.

Comment by Pradyumna S. Upadrashta on February 18, 2015 at 5:37am

Well, technically, the question Why? is generated by you observing the result of the analysis. The data themselves do not actually suggest a question. At the end of an unsupervised process, you end up with some new set of reduced observations. The algorithm wouldn't ask the question "Why?" There is something about a human observer looking at the data that leads to the question "Why?" That is what I was driving at. So, you are right, unsupervised systems do indeed generate questions, but whether those are meaningful, valuable questions can only be determined on the basis of some value system intrinsic to the observer. The unsupervised system is indifferent to the distinction between beer and diapers; it sees them as product A and product B. To a human observer, there are contexts and values associated with beer and diapers, which raises the question "Why?"

Comment by Hassine Saidane on February 18, 2015 at 5:24am

Unsupervised data mining (association, clustering/segmentation) may produce insights that are "golden nuggets". For example, discovery of markets/segments where a particular product sells unusually well. Then the question generated by this insight/nugget is why? Sociographic analysis will give the answer. Another example: a product that often sells with other products, as in the classic case of beer and diapers on Thursdays. Why? The data can generate the question and the answer.

Comment by Pradyumna S. Upadrashta on February 18, 2015 at 4:30am

Even the acquisition of language by human beings is a problem in which, as a baby, you are exposed to all kinds of seeming gibberish, and over the course of time, your brain learns language from this gibberish. The value system is functional, such that you have learned once the majority of people understand what you are saying, and in turn, you can respond to their queries. So, let's remove the feedback a moment, and ask the question: can I learn language, grammar, etc. without any feedback whatsoever? The best way I could approach this would be to apply an unsupervised approach, but I would have to provide some metric that values the correlation (and higher-order statistics) for a sequence of words. So, any new sentences I would form could then be judged, as to whether or not they had merit, on the basis of sentences that I learned out of the gibberish. Would my sentences necessarily be meaningful? I don't know -- have to think about that one.

Comment by Pradyumna S. Upadrashta on February 18, 2015 at 4:14am

I have to ask, do data exist if you are not there to observe them? ;-)  Is there a search, in the absence of a metric?

Comment by Pradyumna S. Upadrashta on February 18, 2015 at 3:36am

If Data represents the answers, then the reduced basis is the set of questions that give rise to those answers. In other words, the distribution of the data along the reduced basis tells you something about the nature of the process you are observing. But defining what constitutes an appropriately reduced basis once more points back to your value system. Is there a reduced set of questions that do not overlap with one another, or whose answers do not constrain one another in any way, that gives rise to all the data we observe?

Comment by Pradyumna S. Upadrashta on February 18, 2015 at 3:24am

If we approach this from an information theoretic perspective, we can define the information content of a signal, but still have no conception of meaning or importance of the signal.

Comment by Pradyumna S. Upadrashta on February 18, 2015 at 3:14am

Even dimension reduction techniques such as PCA or ICA require that you have some value system that you impose on the problem, e.g., orthogonality, or independence.  If you choose a wavelet approach, you have a different set of values.  So, I think even if you don't pose an explicit question, your values regarding the solution reveal the objective of your search. So, in some sense, wherever you go, there *you* are.

Comment by Pradyumna S. Upadrashta on February 18, 2015 at 2:52am

Also, you can't find a gold nugget unless you first define what gold and nugget are in the context you are searching for them. Someone who doesn't understand the value of a gold nugget would likely throw it out as a mere piece of rock. So, in any search process you are defining the cost function, the objective function, or as in this case, a value function.
