The definition of 'best' depends on which school you follow. Data science and classic statistical science sit at opposite ends of the spectrum. So let's clarify what 'best solution' means in these two opposite contexts:
'Best', according to statistical science:
- It usually means the global maximum of a mathematical optimization problem
- The objective function involved is usually a maximum likelihood function, the KS statistic, the c-statistic, or some function derived from a statistical model
- Emphasis is on obtaining unbiased (model-unbiased, not data-unbiased) estimates
- Arcane techniques and jargon look obscure to the uninitiated (example below), and the discipline can seem to be the privilege of an intellectual elite
- Lift is measured using a confusing array of metrics, many of them sensitive to outliers
'Best', according to data science:
- An approximate (not exact) solution to a mathematical optimization problem is OK: the data is not perfect anyway. Example: detecting a good set of features with strong predictive power. This is a combinatorial problem, but in most cases you can get a good approximation in a couple of hours; there is no need to run your code for months to gain 5% additional accuracy, lose stability, and burn corporate money in the process
- Combining multiple approximate solutions, as opposed to one global (possibly unstable) maximum
- A small bias is OK
- Fast algorithms, simple source code, and fast deployment are preferred over perfection
- Stability of solutions, scalability, and ease of interpretation are critical
- ROI and efficiency (after factoring in the cost of data science) are paramount
- Anti-elitist: the focus is on helping decision makers understand what we do, in simple English, and on getting them to optimize business processes and decisions
- Lift is measured using a few simple, outlier-resistant metrics, such as the predictive power of the retained subset of features or the misclassification rate
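The feature-selection point above can be sketched with a greedy forward search: a fast approximation to the combinatorial subset problem, scored here by misclassification rate. The toy dataset, the majority-vote scoring rule, and the stopping criterion are illustrative assumptions, not a method prescribed in this article.

```python
from collections import Counter, defaultdict

def misclassification_rate(rows, labels, features):
    """Error rate of a majority-vote rule on groups defined by the selected features."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[tuple(row[f] for f in features)].append(y)
    errors = sum(len(ys) - Counter(ys).most_common(1)[0][1] for ys in groups.values())
    return errors / len(labels)

def greedy_select(rows, labels, n_features):
    """Greedy forward selection: a good approximate answer in quadratically many
    scoring passes, instead of the exponential cost of exhaustive subset search."""
    selected, remaining = [], set(range(n_features))
    best = misclassification_rate(rows, labels, selected)
    while remaining:
        f, rate = min(((f, misclassification_rate(rows, labels, selected + [f]))
                       for f in remaining), key=lambda t: t[1])
        if rate >= best:  # stop as soon as adding a feature no longer helps
            break
        selected.append(f)
        remaining.discard(f)
        best = rate
    return selected, best

# Toy data: the label equals feature 0; features 1 and 2 are noise.
rows = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
labels = [r[0] for r in rows]
print(greedy_select(rows, labels, 3))  # ([0], 0.0): one feature is enough
```

The greedy search may miss the global optimum (for instance, on XOR-like interactions), which is exactly the trade-off described above: a stable, cheap approximation instead of a costly exact search.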
There are other differences. A recent and interesting discussion by Andrew Gelman (one of the most famous statisticians) generated some criticism of data science:
- My model-free, jargon-free, simple confidence interval computation technique was deemed wrong. Interestingly, it produces exactly the same results as model-based statistical confidence intervals in many contexts: read the peer-reviewed article in detail
- One commenter (whose name is R) wrote that this methodology is "indeed a bootstrap-like approach without re-sampling that happens to be estimating the distribution of the wrong quantity". Since it produces the same results as classic statistics, does that mean classic statistics are also wrong? And man, what a lot of jargon in R's sentence, whoever this guy is. Not only jargon, but statistical science not found in any standard statistics textbook. So how is the layman supposed to know about it?
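For readers unfamiliar with the terminology, here is what a standard model-free interval looks like: a percentile bootstrap confidence interval for the mean. This is the classic resampling technique the commenter alludes to, not Granville's resampling-free variant (which is described in the peer-reviewed article, not here); the sample data, seed, and parameters are illustrative.

```python
import random

def bootstrap_ci(sample, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: model-free, no normality assumption needed."""
    rng = random.Random(seed)
    n = len(sample)
    # Re-estimate the statistic on n_boot resamples drawn with replacement.
    stats = sorted(stat([rng.choice(sample) for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = [9.8, 10.2, 10.1, 9.9, 10.0, 10.3, 9.7, 10.4, 9.6, 10.0]
lo, hi = bootstrap_ci(data)
print(f"95% CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

The whole procedure is a dozen lines of code with no distributional model, which is the spirit of the model-free approach being debated: in many contexts it matches the intervals produced by classic model-based formulas.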
In the end, there is no such thing as a real data scientist or statistician. It's all a personal feeling reflecting your career path. Some statisticians are more of a data scientist than I am. You can say the same about any profession. Some have tried to create laws about appellations, but that's the wrong approach. In the new language that I promote, called New English, anyone can call herself a lawyer, doctor, married, data scientist, bank - you name it. Just do your due diligence before hiring or talking to someone who claims to have some credentials, whatever those credentials might be.
I will finish this article with the following statement, which epitomizes the data science approach, as well as the reluctance of a few change-averse people to try it:
Earth gravitates around the Sun, not the other way around (Galileo Galilei, 1614). Statistical models gravitate around data, not the other way around (Vincent Granville, 2014).