Best solution to a problem: data science versus statistical paradigm

The definition of 'best' depends on which school you follow. Data science and classic statistical science are at the opposite ends of the spectrum. So let's clarify what 'best solution' means in these two opposite contexts:

'Best', according to statistical science:

  • It usually means the global maximum of a mathematical optimization problem
  • The objective function involved is usually a maximum likelihood function, KS, c-statistics, or some function derived from statistical models
  • Emphasis is on obtaining unbiased (model-unbiased, not data-unbiased) estimates
  • Techniques are arcane and the jargon looks obscure to the uninitiated (example below), so this science seems to be the privilege of some intellectual elite
  • Lift is measured using a confusing assortment of metrics, many of them sensitive to outliers

'Best', according to data science:

  • An approximate (not exact) solution to a mathematical optimization problem is OK: data is not perfect anyway. Example: detecting a good set of features that have strong predictive power. This is a combinatorial problem, but in most cases you can get a good approximation in a couple of hours; there is no need to run your code for months to gain 5% additional accuracy, lose stability, and burn corporate money in the process
  • Combining multiple approximate solutions, as opposed to one global (possibly unstable) maximum
  • A small bias is OK
  • Fast algorithms, simple source code, and fast deployment are preferred over perfection
  • Stability of solutions, scalability, and ease of interpretation are critical
  • ROI and efficiency (after factoring in the cost of data science) are paramount
  • Anti-elitist: the focus is on making decision makers understand what we do, in simple English, and on helping them optimize business processes and decisions
  • Lift is measured using a few simple outlier-resistant metrics, such as the predictive power of the retained subset of features, or the misclassification rate
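To make the feature selection example concrete, here is a minimal Python sketch of one common approximation, greedy forward selection, compared with an exhaustive search over all subsets. It is only an illustration under stated assumptions, not the method used by the author: the synthetic data, the nearest-centroid classifier, and all function names (`make_data`, `misclassification_rate`, `greedy_forward_selection`, `exhaustive_selection`) are hypothetical choices made for this sketch.

```python
import random
from itertools import combinations


def make_data(n=200, n_features=8, informative=(0, 3), seed=0):
    """Synthetic binary data (an assumption): only `informative` columns carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 2.0 * label  # shift informative features by class
        X.append(row)
        y.append(label)
    return X, y


def misclassification_rate(X, y, features):
    """Error of a nearest-centroid classifier restricted to `features`.

    Trains on even-indexed rows and evaluates on odd-indexed rows.
    """
    features = sorted(features)  # make the result independent of feature order
    train = range(0, len(X), 2)
    test = range(1, len(X), 2)
    centroids = {}
    for c in (0, 1):
        rows = [X[i] for i in train if y[i] == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in features]
    errors = 0
    for i in test:
        dist = {
            c: sum((X[i][j] - m) ** 2 for j, m in zip(features, centroids[c]))
            for c in (0, 1)
        }
        pred = min(dist, key=dist.get)
        errors += pred != y[i]
    return errors / len(test)


def greedy_forward_selection(X, y, k):
    """Approximate search: add one feature at a time, keeping the best addition."""
    selected = []
    remaining = list(range(len(X[0])))
    while len(selected) < k:
        best = min(
            remaining,
            key=lambda j: misclassification_rate(X, y, selected + [j]),
        )
        selected.append(best)
        remaining.remove(best)
    return selected, misclassification_rate(X, y, selected)


def exhaustive_selection(X, y, k):
    """Exact search: try every size-k subset (cost grows combinatorially)."""
    best = min(
        combinations(range(len(X[0])), k),
        key=lambda s: misclassification_rate(X, y, s),
    )
    return list(best), misclassification_rate(X, y, best)


if __name__ == "__main__":
    X, y = make_data()
    print("greedy:    ", greedy_forward_selection(X, y, 2))
    print("exhaustive:", exhaustive_selection(X, y, 2))
```

With 8 features and subsets of size 2 the exhaustive search is still cheap, but its cost grows combinatorially with the number of features, while the greedy search grows roughly linearly per added feature. The greedy result can never beat the exhaustive one on this metric; the point of the trade-off described above is that it is often close enough.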

There are other differences. An interesting and recent discussion by Andrew Gelman (one of the most famous statisticians) generated some criticism against data science:

  • My model-free, jargon-free, simple confidence interval computation technique was deemed wrong. Interestingly, it provides the exact same results as model-based statistical confidence intervals in many contexts: read the peer-reviewed article in detail
  • One commenter (whose name is R) wrote that this methodology is indeed a bootstrap-like approach without re-sampling that happens to be estimating the distribution of the wrong quantity. Since it produces the same results as classic stats, does it mean that classic stats are also wrong? And man, what a lot of jargon used in R's sentence, whoever this guy is. Not only jargon, but statistical science not found in any standard statistics textbook. So how is the layman supposed to know about it?
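As a rough illustration of how a model-free interval can agree with a model-based one, the sketch below compares a classic normal-theory 95% confidence interval for a mean with a percentile bootstrap interval. Note the assumptions: the percentile bootstrap is a standard resampling technique swapped in for illustration, and it is not the author's technique (which, according to the commenter, does not use re-sampling). The function names and the simulated sample are hypothetical.

```python
import random
import statistics


def normal_theory_ci(sample, z=1.96):
    """Model-based 95% interval for the mean, assuming approximate normality."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - z * se, m + z * se


def percentile_bootstrap_ci(sample, n_boot=2000, alpha=0.05, seed=0):
    """Model-free interval: resample with replacement, then take empirical
    quantiles of the resampled means."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)
    )
    lo = means[round(n_boot * alpha / 2)]
    hi = means[round(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated data (an assumption): 200 draws from a normal distribution.
    sample = [rng.gauss(10.0, 2.0) for _ in range(200)]
    print("normal theory:", normal_theory_ci(sample))
    print("bootstrap:    ", percentile_bootstrap_ci(sample))
```

On well-behaved data like this, the two intervals should nearly coincide. They can diverge on small or heavily skewed samples, which is where the choice of method starts to matter.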

In the end, there is no such thing as a real data scientist or statistician. It's all about a personal feeling reflecting your career. Some statisticians are more of a data scientist than I am. You can say the same thing about any profession. Some have tried to create laws about appellations, but that's the wrong approach. In the new language that I promote, called New English, anyone can call herself lawyer, doctor, married, data scientist, bank - you name it. Just do your due diligence before hiring or talking to someone who claims to have some credentials, whatever these credentials might be.

I will finish this article with the following statement, which epitomizes the data science approach, as well as the reluctance of a few change-averse people to try it:

Earth gravitates around the Sun, not the other way around (Galileo Galilei, 1614). Statistical models gravitate around data, not the other way around (Vincent Granville, 2014).


Comment by Dominic Lusinchi on November 10, 2016 at 8:44am

Box, G. E. P., and Draper, N. R., (1987), Empirical Model Building and Response Surfaces, John Wiley & Sons, New York, NY.

"...all models are wrong, but some are useful." (424)

Comment by Alex Sánchez on December 28, 2014 at 8:02am

Nice post but, IMHO, it sets an excessively narrow-minded view of statistics and statisticians against an overly optimistic view of data science and data scientists.

Probably we would all like to have optimal solutions to the problems we work with, but statistics has plenty of tools to deal with sub-optimal situations that have not even been mentioned.

Being a statistician myself, I have been binarized too many times --probabilist vs statistician, frequentist vs bayesian, applied vs theoretical-- although I, like many others, have always tried to take the best of every world when needed.

I think that what is needed right now is to look for -and to show- those aspects that Statistics and Data Science share and how they can complement each other.

This will no doubt help future generations of ...whatever we call them more than yet another unnecessary binarization.

Comment by Martin Chen on December 22, 2014 at 9:56am

The data science you talked about in this article is just data mining, which is a branch of statistics. I am afraid to say that the statistical science you talked about in this article is too classic.

Personally I think the comparison is not meaningful. We shouldn't differentiate between data science and statistics so much. Data science can be considered as using knowledge and techniques from many subjects. For example, a good data scientist should have a good knowledge of databases, programming languages, data mining methodologies, etc. For many data scientists right now, statistical knowledge will probably be the most important one.

As a statistician myself, I can see that statistical science has evolved during the past years. Statistics is about getting insights from data, not necessarily the most optimized or unbiased solution. 

Comment by Pradyumna S. Upadrashta on December 19, 2014 at 7:16pm

A statistician likes their data neat, with their olives in the middle "just so". A data scientist likes their data on the rocks, shaken but not stirred. The decision makers like their kool-aid, until their shareholders tell them otherwise. IT likes to play the bartender that keeps the drinks flowing and the tab running. Meanwhile, the Big Data guys are like the bouncers who determine who gets in.

For the record, I will have water, no ice, with a wedge of lime, and keep on modelin'... someone has to drive.

Comment by Randolph Abelardo on December 19, 2014 at 4:36pm

George E. Box wrote that, "essentially, all models are wrong, but some are useful" in his book on response surface methodology with Norman R. Draper.
- http://en.m.wikipedia.org/wiki/George_E._P._Box

Comment by Mitchell A. Sanders on December 18, 2014 at 10:19am

Another iconoclastic redefining article from Dr. Granville that's ahead of the curve on what data science is all about. Appreciate the "anti-elitist" call-out. I must say I have a bias for quality models myself, but practicality and fast turn-around is what business is all about. Nice work again Dr. Granville for saying what many of us think already. Appreciate your leading voice. 

Comment by Martin Jetton on December 15, 2014 at 10:46am

Could it be said that models are abstractions of the underlying "reality"? And that data are just a by-product of that "reality"? And that, being abstractions and by-products, both can only come up short of reality?

I'm more of an empiricist, models be darned.   I still want to predict what I want to predict without my panties getting in a bunch of a model.  But that's me and tends to get me in trouble from time to time with those model hunting theorist types.

Comment by Vincent Granville on December 15, 2014 at 9:53am

Models rest upon data. Data rests upon the data collection mechanism. The relationships are hierarchical. If the leading layer (data collection mechanism, including fields to be identified, captured and coded) is poor, no amount of data, and no amount of modeling, will fully make up for the deficiency (though partial recovery is sometimes possible on badly collected data, if data is huge and/or ad-hoc models are used that detect and correct data biases).

Comment by Martin Jetton on December 15, 2014 at 9:04am

"Data is only as good as the tool that collects it." (Jetton, 2014) Be careful of giving too much credit to the collection of data for conclusion purposes. Yes, reliability and validity still have to be considered with big data.

Plus as a brilliant mind once said, "Models are always wrong, but many are very handy" (who said this?)

© 2017 Data Science Central