# 11 Modern Statistical Concepts Discovered by Data Scientists

You sometimes hear from old-fashioned statisticians that data scientists know nothing about statistics, and that they - the statisticians - know everything. Here we show that it is actually the exact opposite: data science has its own core of statistical science research, in addition to data plumbing, statistical APIs, and business / competitive intelligence research. Below we highlight 11 major data science contributions to statistical science. I am not aware of any statistical science contribution to data science, but if you know of one, you are welcome to share it.

Here's the list:

1. Clustering using tagging or indexation methods (see section 3 after clicking on the link), allowing you to cluster text (articles, websites) much faster than any traditional statistical technique, with a scalable algorithm very easy to implement
2. Bucketization - the science and art of identifying the right homogeneous data buckets (millions of buckets among billions of observations), to provide highly localized (or segment-targeted) predictions, or to smooth regression parameters across similar buckets, with strong statistical significance. It is equivalent to joint (not sequential) binning in multiple dimensions, which is a combinatorial optimization problem. While decision trees also produce some bucketization, the data science approach is more robust, simple, scalable and model-free. It does not directly produce decision trees, and it leads to easy interpretation (each data bucket corresponds to a specific type of fraud, in a fraud detection problem). A related problem is bucket clustering, via standard hierarchical clustering techniques.
3. Random number generation, a 3,000-year-old problem, has benefited from data science advances: for instance, using the digits of irrational numbers such as Pi or SQRT(2), produced with very fast algorithms, to simulate randomness.
4. Model-free confidence intervals, getting rid of p-values, hypothesis testing, asymptotic analysis, errors due to poor model fitting or outliers, and a bunch of obscure, old-fashioned statistical concepts
5. Variable / feature selection and data reduction, without using L2-based, model-based techniques such as PCA, which are potentially numerically unstable, sensitive to outliers, and difficult to interpret
6. Hidden decision trees, a hybrid technique combining some sort of averaged decision trees and Jackknife regression. It is more accurate, and far easier to code, implement, and interpret than either logistic regression or traditional decision trees. It is not subject to over-fitting, unlike its ancestor statistical techniques.
7. Jackknife regression, a universal, simplified regression technique that is easy to code and to integrate into black-box analytical products. Traditional statistical science offers hundreds of regression techniques, and nobody but statisticians knows which one to use and when, which is obviously a nightmare in production environments.
8. Predictive power and other synthetic metrics designed for robustness rather than for mathematical elegance
9. Identification of true signal in data subject to the curse of big data (spurious correlations)
10. New data visualization techniques - in particular using data video to display insights
11. Better goodness-of-fit and yield metrics, based on robust L1 rather than outlier-sensitive L2 metrics.
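
To make item 11 concrete, here is a minimal sketch of why L1-type error metrics are less sensitive to outliers than L2-type ones. The data, the function names (`rmse`, `mae`, `median_abs_error`) and the choice of metrics are illustrative assumptions, not the specific metrics the article has in mind.

```python
import math
import statistics

def rmse(actual, predicted):
    """L2-type metric: root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(statistics.fmean(squared))

def mae(actual, predicted):
    """L1-type metric: mean absolute error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return statistics.fmean(absolute)

def median_abs_error(actual, predicted):
    """L1-type metric based on the median, which ignores extreme errors."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return statistics.median(absolute)

# Hypothetical observations and predictions; every prediction is off by 1.
actual = [10, 12, 11, 13, 12, 11, 10, 12]
predicted = [11, 11, 12, 12, 13, 10, 11, 11]

# Same data with a single corrupted observation (12 recorded as 112).
corrupted = actual[:-1] + [112]

for label, observed in [("clean", actual), ("one outlier", corrupted)]:
    print(
        label,
        "RMSE =", round(rmse(observed, predicted), 2),
        "MAE =", round(mae(observed, predicted), 2),
        "median |error| =", median_abs_error(observed, predicted),
    )
```

On the clean data all three metrics agree. With one corrupted observation the squared error dominates the L2 metric, the mean absolute error grows far less, and the median absolute error does not move at all.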


Comment by Sander Stepanov on June 20, 2016 at 8:22am

Thank you very much, really great material!!

Only, may you please share the material from #2 Bucketization? The link seems to be broken.

Comment by Dr Vincent Micali on July 30, 2015 at 9:51pm

Well and diplomatically said, Prof Hart.

Dear Vincent,

Having served as a Statistician analysing Data for 40 years, I find there is a touch of arrogance in the statement that Statisticians "... they know everything". For the lucid Data Scientists (if you permit, safely considering myself one), it is indeed a contradiction, since Statisticians (whether Bayesians or non-Bayesians) know very well that the probability of "knowing everything" is zero. So, perhaps I should take the prior of Carlos to inform my predictive probability that you made those statements with "tongue in cheek" to spur a discussion. If the conditional is true, then well done: you succeeded; if it's false, then you should seriously revisit your statements and back them up with scientific evidence. Perhaps a good starting point is the works of Sir Harold Jeffreys.

Cheers and take care

Dr Vincent Micali

MSc (Warwick), PhD (UFS)

Comment by Carlos Aya on March 6, 2015 at 2:10pm

Dear Vincent,

"unaware of any statistical science contribution to data science" -> Tongue in cheek, right :) ?

I do acknowledge that data abundance and (more importantly) the universal availability of computers have posed a tremendous challenge to mathematics and statistics. Anyone can modify an existing algorithm that fails, and "make it work" for his/her particular case - or even invent "new" ones.

But this does not mean this is science... in the sense that showing that it works, even in many situations, does not explain why it works, and how it could fail (i.e. what underlying assumptions are required for it to work).

Yes, it is published as research - but, believe me, it is more an _unsolved problem_ for a mathematician or statistician than well founded finished research.

If you allow me the analogy, it is like how in ancient Babylon it was common knowledge among builders that certain Pythagorean triangles existed - but it required formal geometry to explain what was really going on.

So, you want recent contributions? Here is one: search for "functional data analysis" in Google Scholar ... enjoy :)

Anyway, data scientists and everybody ... yes, keep using computers, your challenges are welcome (but even better if you join the "theoretical" camp and help to explain why...)

Kind regards

Carlos

Comment by George F. Hart on March 2, 2015 at 8:52am

G'day Vincent:

as always I find your comment both interesting and insightful. I have always regarded myself as a data-analyst – even before the modern idea was invented [I'm 80]. I have also applied statistical reasoning for the past 60 years to data analysis. I do agree that data scientists have contributed significantly to statistics in the sense I understand the field. However, to be 'unaware of any statistical science contribution to data science' must have been ghost-written. It is not you as I have read you over the past few years! I assume you are being provocative simply to get something going – which is my own method of teaching.

Mathematical methods have nothing to do with it. The underlying concepts of statistics are what statistical science has contributed to data analysis.

“What is the chance of a random sample taken from a location being simply a variant within the population of interest, or alternatively, that it is from a totally different population”.

Or the application of basic functions:

“mean(), median(), sd(), var(), min(), max(), range(), summary() , sort(), order(), rank() , exp(), log(), sin(), cos(), tan() [radians] , length() , rev() , sum(), cumsum(), prod(), cumprod(), round(), ceil(), floor(), signif() , which(), which.max() , any(), all(), and mode()”

Or the basic model:

“Y = (something) + (error of measurement), where Y is said to be the dependent variable that is being measured, and (something) is some relationship among the so-called independent variables that control or predict Y”

Or

“rejecting the hypothesis vs failing to reject the hypothesis”.

“To judge the reliability of any experimental result it must be compared with an estimate of its error, i.e. a test of significance. The test of significance separates the subjective guess from fact [more correctly the failure to reject a hypothesis pertaining to a fact].”

Or:

“The innate control of error by multiple replication”. This provides a major advantage to, and is a principal reason for, the success of modern 'big data' analysis. It leads to the idea that in data analysis we are dealing with the total population, not a statistical sample [we both know that is not true, but it suffices to justify what is done].

I could go on, but you and most of your readers know this stuff already. Data analysis has grown, but it still has the underpinnings of statistical analysis. For the future of statistical analysis I advise keeping a close eye on deep learning methods.

Luv and kisses as always,

George Hart,

Professor emeritus,

LSU.

Comment by Sione Palu on February 23, 2015 at 7:45am

There's a good article on random number generation by Prof. Cleve Moler from MathWorks here:

http://www.mathworks.com/tagteam/9674_randomthoughts.pdf
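
On the same topic, item 3 of the list mentions using the digits of SQRT(2) to simulate randomness. The article does not say which algorithm it uses, so the following is only a minimal sketch under my own assumptions: it gets the digits from exact integer arithmetic with Python's `math.isqrt` (Python 3.8+), and the helper names `sqrt2_digits` and `digits_to_uniforms` are hypothetical. A sequence like this is fully reproducible and is not suitable for cryptography.

```python
import math
from collections import Counter

def sqrt2_digits(n):
    """Return the first n decimal digits of sqrt(2) after the decimal point."""
    # isqrt(2 * 10**(2n)) equals floor(sqrt(2) * 10**n), computed exactly.
    # Note: recent Python versions cap int-to-str conversion at about 4,300
    # digits by default, so this simple version is meant for small n.
    scaled = math.isqrt(2 * 10 ** (2 * n))
    return [int(ch) for ch in str(scaled)[1:]]  # drop the leading "1"

def digits_to_uniforms(digits, block=6):
    """Group digits into blocks and map each block to a number in [0, 1)."""
    uniforms = []
    for start in range(0, len(digits) - block + 1, block):
        chunk = digits[start:start + block]
        value = int("".join(map(str, chunk)))
        uniforms.append(value / 10 ** block)
    return uniforms

digits = sqrt2_digits(1000)
print("first digits:", digits[:10])
print("digit frequencies:", sorted(Counter(digits).items()))
print("first uniforms:", digits_to_uniforms(digits)[:3])
```

A quick look at the digit frequencies is only a crude sanity check; judging whether such a sequence is good enough for simulation would need proper statistical tests of randomness.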
