# Data science without statistics is possible, even desirable

I will start with a controversial statement: data science barely uses statistical science and techniques. The truth is actually more nuanced, as explained below.

## 1. Data science heavily uses new statistical science

But the new statistical science in question is not regarded as statistics by many statisticians. I am not sure what to call it; "new statistical science" is a misnomer, because it is not all that novel. And statisticians regard it as dirty data processing, not elegant statistics.

It includes topics such as:

• Hidden decision trees: blending multiple simple scoring/clustering engines to take advantage of sparsity in big data
• Predictive power for data reduction and feature selection
• Combinatorial feature selection
• Jackknife regression: unifies hundreds of confusing regression methods, providing a simple, easy-to-interpret, robust predictive technique based on approximate, low-dimensional solutions, avoiding the curse of dimensionality, avoiding over-fitting, and producing slightly biased estimates (a terrible sin for statisticians!)
• Random number generation: a new kind of simulator, high-quality and non-periodic
• Pattern recognition: detection of structures and signal, identifying true signal in an ocean of spurious correlations
• Clustering big data without n × n matrices, using hash tables and MapReduce instead
• Real-time architectures for predictive modeling
• Model-free confidence intervals based on statistics computed across multiple similar data bins
• Synthetic metrics such as the bumpiness coefficient or scale-independent variance (another big sin for statisticians!)
• Data bucketization, with interpolation and extrapolation at the bucket level
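
To make one of these ideas concrete, here is a minimal sketch of a model-free confidence interval: split the data into many similar bins, compute the statistic of interest in each bin, and read the interval off the empirical percentiles of the bin-level values. The function name and the bin-splitting scheme below are illustrative, not a reference implementation:

```python
import random

def bin_confidence_interval(data, stat, n_bins=20, level=0.90):
    """Model-free confidence interval: compute `stat` on many
    random bins of the data, then read the interval off the
    empirical percentiles of the bin-level statistics."""
    data = list(data)
    random.shuffle(data)
    size = len(data) // n_bins
    values = sorted(stat(data[i * size:(i + 1) * size]) for i in range(n_bins))
    lo = values[int(((1 - level) / 2) * n_bins)]
    hi = values[int(((1 + level) / 2) * n_bins) - 1]
    return lo, hi

# Example: 90% interval for the mean of noisy data centered on 10
random.seed(42)
sample = [random.gauss(10, 2) for _ in range(2000)]
mean = lambda xs: sum(xs) / len(xs)
print(bin_confidence_interval(sample, mean))
```

No distributional assumption is needed: the bins play the role that a parametric model plays in a classical interval.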

I have sometimes used the term *rebel statistics* to describe these methods.

While I consider these topics to be statistical science (I contributed to many of them myself, and my background is in computational statistics), most statisticians I have talked to do not see them as statistical science. And calling this material statistics only creates confusion, especially for hiring managers.

Some people call it statistical learning. One of the precursors of these methods is Trevor Hastie, co-author of one of the first data science books, *The Elements of Statistical Learning*.

## 2. Data science uses a bit of old statistical science

Including the following topics, which, curiously enough, are not all found in standard statistics textbooks:

• Time series, ARMA (auto-regressive moving average) processes, correlograms
• Spatial and cluster processes
• Survival models
• Markov processes
• Goodness-of-fit techniques
• Experimental design 101 (not the advanced techniques used in clinical trials)
• A/B and multivariate testing, but without traditional tests of hypotheses
• Simulation, Markov chain Monte Carlo (MCMC) methods
• Some Bayesian methods, such as hierarchical models
• Rank statistics, percentiles, outlier detection (preferably not model-based)
• The concept of statistical significance (but not p-values or power)
• Cross validation
• Imputation techniques (dealing with missing data)
• Exploratory data analysis (to be automated with tools such as a data dictionary; once automated, it won't be old statistics anymore)
• Sampling
• Some statistical distributions
• Random variables
• Some asymptotic results, although I encourage Monte-Carlo simulations to obtain limiting distributions, rather than theoretical principles which may not apply to real, modern data
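
As an illustration of that last point, here is a minimal sketch of obtaining a limiting (sampling) distribution by Monte-Carlo simulation rather than asymptotic theory. The function and parameter names are my own, chosen for illustration:

```python
import random
import statistics

def simulated_sampling_distribution(draw_sample, stat, n_sim=5000):
    """Approximate a statistic's sampling distribution by brute-force
    simulation instead of relying on an asymptotic (theoretical) result."""
    return sorted(stat(draw_sample()) for _ in range(n_sim))

# Sampling distribution of the median of 50 exponential observations
random.seed(0)
draw = lambda: [random.expovariate(1.0) for _ in range(50)]
medians = simulated_sampling_distribution(draw, statistics.median)

# Empirical 95% range of the sample median (true median is ln 2 ≈ 0.693)
print(medians[int(0.025 * len(medians))], medians[int(0.975 * len(medians))])
```

The same three lines work for any statistic and any data-generating mechanism, including ones where no closed-form asymptotic result exists.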

These techniques can be summarized in one page, and time permitting, I will write that page and call it the "statistics cheat sheet for data scientists". Interestingly, of a typical 600-page statistics textbook, about 20 pages are relevant to data science, and those 20 pages can be compressed into a quarter of a page. For instance, I believe you can explain the concepts of random variable and distribution (at least what you need to understand to practice data science) in about four lines rather than 150 pages. The idea is to explain them in plain English with a few examples, defining a distribution either as a model-based expectation or as the limit of a frequency histogram.
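
As a small illustration of that histogram-limit definition, the snippet below simulates die rolls and shows the frequency histogram converging to the uniform distribution (1/6 per face) as the number of trials grows:

```python
import random
from collections import Counter

# A random variable is a quantity whose value varies from trial to trial;
# its distribution is the limit of the frequency histogram as the number
# of trials grows. For a fair die, the limit is 1/6 for each face.
random.seed(1)
deviations = {}
for n in (100, 10_000, 1_000_000):
    rolls = Counter(random.randint(1, 6) for _ in range(n))
    deviations[n] = max(abs(rolls[face] / n - 1 / 6) for face in range(1, 7))
    print(n, round(deviations[n], 4))
```

The largest gap between observed frequency and 1/6 shrinks as n grows, which is the whole definition in action.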

Funny fact: some of these classic stats textbooks still feature tables of statistical distributions in an appendix. Who still uses such tables for computations? Not a data scientist, for sure. Most programming languages offer libraries for these computations, and you can even code them yourself in a couple of lines. A book such as *Numerical Recipes in C++* can prove useful, as it provides code for many statistical functions; see also our source code section on DSC, where I plan to add more modern implementations of statistical techniques, some even available as Excel formulas.
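
For instance, the normal CDF values that fill several appendix pages reduce to one line using the error function available in most standard libraries (shown here in Python; the helper name is mine):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal cumulative distribution function via math.erf --
    no printed table required."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(round(normal_cdf(1.96), 4))  # 0.975, the classic table value
```

The same one-liner pattern covers most of the distributions those appendix tables were built for.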

## 3. Data science uses some statistical science from operations research

In particular, OLS (ordinary least squares), Monte Carlo techniques, mathematical optimization, the simplex algorithm, and inventory and pricing management models.
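
As a minimal example of the first of these, the closed-form OLS solution for a single predictor fits in a few lines; the function name is illustrative:

```python
def ols_fit(xs, ys):
    """Ordinary least squares for one predictor: the closed-form
    slope and intercept minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# The data follow y = 2x + 1 exactly, so OLS recovers the coefficients
slope, intercept = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```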

These techniques are not considered statistical science; they are often referred to as analytics or decision science.

## 4. Data science does not use most old statistical science

Some of these techniques have been heavily criticized. Here are examples of old statistical techniques that I have rarely, if ever, used in my recent data science career:

• Maximum likelihood estimation (modern data produces complex likelihood functions with many local optima, or even non-continuous likelihood functions)
• Regression
• Naive Bayes
• ANOVA
• Tests of hypotheses
• General linear model

Now don't get me wrong: there are still plenty of people doing naive Bayes and linear or logistic regression, these methods work on many simple data sets, and because progress is slow, you'll get a job more easily if you know them than if you don't. But the future lies in uniting these techniques under a single methodology that is simple, robust, easy to interpret, available as a black box to non-experts, and easy to automate. This project (I'm working on it, and some computer science people at Cambridge University are working on it too) is sometimes referred to as the automated statistician.

But just to give an example: naive Bayes (old stats, still widely used, unfortunately) is terrible at detecting spam and categorizing email because it wrongly assumes that its rules (features) are independent, while a modern variant called hidden decision trees (new stats), combined with pattern recognition, has been very successful at identifying massive botnets. Some modern techniques such as recommendation engines sometimes fail (unable to detect fake reviews) because they still rely on old, poor statistical techniques rather than modern data science. The fix to this issue, though, is reworking the business model rather than improving the data science algorithms.

Finally, old statistics uses a top-down approach, from model and theory to data, while new statistics, or data science, uses a bottom-up approach, from data to model or algorithm.

## Conclusions

Based on what many statisticians think statistical science is, and is not, I am tempted to say that modern data science barely uses statistical science. Instead, it mostly relies on statistical principles that are not considered statistical science by most people who call themselves statisticians, because of their rigid perception of what statistics is, and their inability to adapt to change.

On the contrary, for non-statisticians (computer scientists, engineers, and so on), it is clear that data science has a strong statistical component. In my heart, I too believe that new statistics is a core component of data science. Yet when talking to hiring managers, I tell them that statistics is another animal, because in their minds, statistics is old statistics, and old statistics is barely used anymore in modern data science. Likewise, when talking to statisticians, I tell them that data science is not statistics, so as not to upset them or waste my time in fruitless argumentation.


Comment by Myles Gartland on December 10, 2014 at 7:22pm

I take this to mean DS "without classical inferential statistics", which I would agree with. We are usually asking different questions. I teach Stats 101 and Stats 102 and am also a practicing DS. I rarely use anything from 101 or 102 in my work. That said, I still feel I am using statistics, just not the kind we all learned (or maybe not applied in the same way).

Comment by Renato P. dos Santos on December 10, 2014 at 12:03pm

Well said, Vincent.

Comment by Vincent Granville on December 10, 2014 at 9:36am

Some people asked what my logic is, to categorize techniques as old or new statistics. My answer is:

20 years of corporate experience across multiple industries, working with various forms of data with various teams, in environments ranging from start-up founder to consultant with eBay, Visa, Wells Fargo, Microsoft, and NBC, to projects with government agencies (EPA), plus my post-docs at Cambridge University and the National Institute of Statistical Sciences, patents, papers published in top statistical journals, a Wiley book, etc.

I have found people of Chinese origin to be more opposed to my school of data science than Westerners. Chinese people tend to receive a very mathematical and theoretical education (though I did too; I even published in the Journal of Number Theory), and they tend to stick to it and love it, in contrast with other cultures.

Comment by Javier Cano on December 10, 2014 at 2:13am

Be careful: no math, no science.

Comment by Renato P. dos Santos on December 10, 2014 at 12:46am

Great article, Vincent.

What can we do to change this view when managers, interviewers and even textbooks (and, of course, teachers) still rely on these old methods?