# Data Science Has Been Using Rebel Statistics for a Long Time

Many of those who call themselves statisticians just won't admit that data science heavily relies on (heretical, rule-breaking) statistical science; or they don't recognize the true statistical nature of data science techniques (some of which are 15 years old); or they oppose the modernization of their statistical arsenal. They already missed the train when machine learning, a discipline also heavily based on statistics, became popular more than 15 years ago. Now machine learning professionals, who are statistical practitioners working on problems such as clustering, far outnumber statisticians.

[Image: statisticians borrowing the Hadoop picture; see source]

Many times, I have interacted with statisticians who think that anyone who does not call himself a statistician knows little or nothing about statistics. See my recent bio published here, or visit the LinkedIn profiles of many data scientists, to debunk this myth.

Any statistical technique that is not in their old books is considered heretical at best, dismissed as non-statistics at worst, or, most of the time, simply not understood. New statistics or fake data science textbooks are published every week, but with the exact same technical content: KNN clustering, logistic regression, naive Bayes, decision and boosted trees, SVM, Bayesian statistics, centroid clustering, and linear discrimination, just as in the early eighties, applied to tiny data such as Fisher's iris data set. Rarely do they include anything new (that is, less than 10 years old) such as ensemble methods, Lasso and ridge regression, or Bayesian networks. Some new statistics textbooks include analyses of small Twitter data using R or sometimes Python, or talk about association rules or recommendation engines, but they are still far away from real applied statistical data science.

1. The Decline of Old-Fashioned Statistics

If you compare traffic statistics (Alexa rank) from top traditional statistics websites with data science websites, the contrast is striking. In the table below, the lower the rank, the higher the traffic volume, with a rank of 50,000 getting about twice as much traffic as a rank of 100,000:

Popular statistics websites:

• American Statistical Association: rank = 142,000
• Operations Research Society: rank = 51,000
• Andrew Gelman's website: rank = 127,000
• Statistics.com: rank = 75,000
• SimplyStatistics: rank = 113,000

Popular data science websites:

• Data Science Central: rank = 31,000
• Analyticbridge: rank = 73,000

Google keyword trends tell the same story.
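Under the inverse relationship described above (a rank of 50,000 drawing about twice the traffic of a rank of 100,000), the quoted Alexa ranks translate into rough relative traffic figures. A minimal sketch, assuming traffic is simply proportional to 1/rank:

```python
# Approximate relative traffic under the assumption traffic ~ 1/rank,
# consistent with "rank 50,000 gets about twice the traffic of rank 100,000".
# Ranks are the Alexa figures quoted in the text.
ranks = {
    "American Statistical Association": 142_000,
    "Operations Research Society": 51_000,
    "Andrew Gelman's website": 127_000,
    "Statistics.com": 75_000,
    "SimplyStatistics": 113_000,
    "Data Science Central": 31_000,
    "Analyticbridge": 73_000,
}

baseline = ranks["Data Science Central"]  # best-ranked site in the list
for site, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    # Relative traffic vs the top site, under the 1/rank assumption.
    print(f"{site}: {baseline / rank:.2f}x")
```

By this crude measure, the American Statistical Association site draws roughly a fifth of the traffic of Data Science Central.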

These numbers are based on Alexa rankings, which are notoriously inaccurate, though Alexa has improved its statistical methods for measuring and filtering Internet traffic over time. The numbers quoted here have been stable recently, showing the same trend for months, and are subject to a roughly 30% error rate (compared to a 100% error rate a few years ago, based on comparing Alexa variances over time for multiple websites that we own and for which we know exact traffic stats after filtering out robots). These numbers are for US traffic only, which represents between 25% and 55% of the traffic on these websites; but US traffic is the only traffic that can be easily and well monetized, thus representing 90% of all business value. TrafficEstimate.com is another free website that you can use to compare traffic statistics, and third-party vendors such as compete.com provide additional information for a very expensive fee.

2. Modern Statistical Techniques Used in Data Science

Modern statistical data science techniques are far more robust than traditional statistics, and designed for big data. I have implemented many of them over the last 15 years, with success stories to tell (see section 4 below), and have reached a point where I feel confident about automating data science, exploratory data analysis, and statistical science. I'm preparing a book on data science automation.

Here are a few of these techniques:

More (from other authors, in particular) will be published in our data science research center.

3. Old Statistical Principles Still Used in Data Science

Still, a number of old-fashioned techniques and principles are here to stay, and even experiencing growth, such as experimental design, sampling, and Monte Carlo simulations. Others, such as p-values and traditional hypothesis testing, will die or are already dead, to be replaced by simple, model-free techniques that everyone can easily understand and apply. Some are currently used recklessly and need to be significantly improved to make them robust and suitable for black-box, automated algorithms. I'm working on this, for instance transforming traditional regression into what I call Jackknife regression, to make it usable by non-experts.
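The classical jackknife idea underlying that name can be sketched as follows. Note that "Jackknife regression" as used in the text is the author's own method (detailed in his book); this is only an illustration of the textbook jackknife principle: fit a simple regression on each leave-one-out subsample and average the coefficients.

```python
def jackknife_regression(xs, ys):
    """Average simple least-squares fits over leave-one-out subsamples.
    A sketch of the classical jackknife idea, not the author's method."""
    def fit(x, y):
        # Ordinary least squares for a single predictor.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        b = sxy / sxx
        return my - b * mx, b  # (intercept, slope)

    n = len(xs)
    # One fit per leave-one-out subsample.
    fits = [fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) for i in range(n)]
    a = sum(f[0] for f in fits) / n
    b = sum(f[1] for f in fits) / n
    return a, b

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x
print(jackknife_regression(xs, ys))  # intercept near 0, slope near 2
```

Averaging over subsamples is what makes jackknife-style estimators more robust to any single outlying observation.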

A number of great old principles, here to stay, are listed in this SimplyStatistics.org article, and include:

• If the goal is prediction accuracy, average many prediction models together.
• When testing many hypotheses, correct for multiple testing.
• Unless you ran a randomized trial, potential confounders should keep you up at night.
• Define a metric for success up front.
• Problem first, not solution backward.
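The first principle above (averaging many prediction models) can be shown with a tiny sketch; the targets and model predictions below are invented for the example:

```python
# Averaging several imperfect models often beats any single one:
# the errors of the individual models partly cancel out.
targets = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical predictions from three noisy models.
model_preds = [
    [1.3, 1.8, 3.4, 3.7, 5.2],
    [0.8, 2.3, 2.7, 4.4, 4.9],
    [1.1, 1.9, 3.1, 3.8, 5.3],
]

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

# Pointwise average of the three models' predictions.
avg_pred = [sum(ps) / len(ps) for ps in zip(*model_preds)]
individual = [mse(p, targets) for p in model_preds]

print("individual MSEs:", [round(m, 3) for m in individual])
print("averaged MSE:   ", round(mse(avg_pred, targets), 3))
```

Here the averaged predictor has a lower mean squared error than every individual model, which is the usual outcome when the models' errors are not perfectly correlated.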

I would also add: always try to explain and reduce sources of variance, without over-fitting, using sound cross-validation and sensitivity analyses. I have even questioned the traditional definition of variance, and proposed a few alternatives, including L1 and scale-invariant variance (see also my book, pages 187-193), to be used when scale does not matter, though many statisticians consider this a taboo.
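As a sketch of what these alternatives might look like, here is one plausible reading: mean absolute deviation for the L1 version, and division by the mean for scale invariance. These are illustrative interpretations, not the exact definitions from the book.

```python
def variance(xs):
    # Classical (population) variance: mean squared deviation from the mean.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def l1_variance(xs):
    # One plausible L1 analogue: mean absolute deviation from the mean.
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def scale_invariant_dispersion(xs):
    # Dividing by the mean makes the measure unchanged when all
    # values are multiplied by the same positive constant c.
    m = sum(xs) / len(xs)
    return l1_variance(xs) / abs(m)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(data))                    # 4.0
print(l1_variance(data))                 # 1.5
print(scale_invariant_dispersion(data))  # 0.3
```

The L1 version is less sensitive to outliers (deviations are not squared), and the scale-invariant version gives the same answer whether the data are measured in dollars or cents.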

4. Big Data Success Stories, Leveraging Statistical Data Science

My Hidden Decision Trees methodology (a hybrid, approximate, very robust constrained logistic regression blended with a few hundred small, stable decision trees, relying on fast combinatorial feature selection to optimize a newly created metric called predictive power; see my book for details) has been used to process and score billions of clicks, IP addresses (sometimes in real time), and keywords, and to detect some of the largest botnets impacting the digital advertising industry. It is heavily statistical in nature, uses model-free, data-driven confidence intervals, and pretty much none of the statistical techniques described in any textbook other than mine. It is indeed data science. Most recently, it was used to detect large-scale criminal activity hiding behind Amazon AWS (click here and also here for references), not detected by Amazon, by analyzing ad network data together with data from Spamhaus, Barracuda, ProjectHoneyPot, StopForumSpam, Adometry, and other sources including social network data. Interestingly, it was not based on Amazon data, but instead leveraged external data sources exclusively.
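The general flavor of such a hybrid scorer, blending a logistic-style score with votes from many small rules, can be sketched as follows. This is not the actual Hidden Decision Trees algorithm (whose details are in the book); the features, weights, and rules below are invented for illustration.

```python
import math

def logistic_score(features, weights, bias=0.0):
    # Standard logistic (sigmoid) score in [0, 1].
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def tree_votes(features, rules):
    # Each "rule" stands in for a tiny decision tree:
    # (feature index, threshold, score contributed when the rule fires).
    return sum(score for i, thr, score in rules if features[i] > thr) / len(rules)

def hybrid_score(features, weights, rules, blend=0.5):
    # Blend the smooth regression score with the discrete rule votes.
    return (blend * logistic_score(features, weights)
            + (1 - blend) * tree_votes(features, rules))

# Hypothetical click-fraud-style features:
# [clicks_per_minute, distinct_ip_ratio, night_traffic_ratio]
features = [4.0, 0.2, 0.9]
weights = [0.8, -1.5, 1.0]
rules = [(0, 3.0, 1.0), (1, 0.5, 1.0), (2, 0.8, 1.0)]  # fire on suspicious values

print(round(hybrid_score(features, weights, rules), 3))
```

The appeal of this kind of blend is robustness: the small rules are stable and interpretable, while the regression component smooths over cases no rule covers.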

When I posted this fact in reaction to SimplyStatistics.org's article Why big data is in trouble: they forgot about applied statistics, my comment was quickly removed. It makes you wonder whether they are trying to obfuscate the use of statistics in data science, or want to stick to old statistics out of fear of progress, rigid thinking (tunnel vision), or an inability to adapt. In any case, many of the statistical techniques that I used to turn big data into value are described in the newly created data science research center, but you will never see them described, or even mentioned, in traditional statistical publications (except in my Wiley book).


Comment by Arnuld on August 24, 2018 at 7:13am

I just started to learn Data Science a few months ago and used to think that old-school statistics is very important for a Data Scientist; I was planning to buy a lot of books and do a lot of reading and practicing (e.g. from John A. Rice; Mood, Graybill & Boes; Seber G.A.F.; etc). In fact I have spent the last 2 months on college-level algebra MOOCs and other online courses, then Pre-Calculus, and an hour ago I just finished Derivatives. I have already read at least a thousand articles on the usefulness and importance of statistics in Data Science. But I never came across this sharp distinction between Big Data statistics and old-school statistics; not even one article mentioned it (forget about the implementation). It has changed my approach to learning Data Science. Thanks for this.

Comment by Steven Neumersky on May 25, 2015 at 2:49pm
Vincent,

Would you say the SQL Server Data Mining algorithms are now obsolete? Of course, its extensibility could still make it useful, and I know you have mentioned Predixion Software as being pretty good when used with PowerPivot.

Thanks.
Comment by Chris on June 7, 2014 at 10:03am

I tend to agree with Mirko, on account of Hindu documentation, Moroccan algebra, the Egyptians, and more...

Just came across this by sheer chance: http://www.deep-data-mining.com/2013/05/the-10-most-influential-peo...

Comment by Sumedha Sengupta on June 6, 2014 at 11:01am

A Deja Vu !

As a Statistician, among many others, I seem to have missed THAT boat. This 'movement' reminds me of another one back in the late 70's and early 80's, when Dr. Edward Deming called upon the Statisticians from academia to participate in the Quality movement. I was one of those who already had a background in Statistics and Reliability, underwent new Quality training, and entered the industrial world from a teaching and research environment. Statisticians who joined the call were to design Statistical Process Control courses, teach them in-house, and apply statistical techniques to resolve manufacturing process related problems. Most industries did not have positions titled 'Statistician', so we were given positions of 'Quality Engineer' or 'Quality Statistician'. Of course, we were also the ones who got "clobbered" first. Forgive my expression.

I am wondering if that is the intention of this movement. It seems the Big Data or Data Science movement needs Statistics, but does not want the Statisticians. It needs people with a solid background in Statistics for enormous data collection, analysis, and interpretation, which a trained Data Scientist might provide, because it seems the techniques for those are not yet quite fully established.

I can only say: isn't it a reality that after one visualizes, reduces, and does some clustering or pattern recognition on the whole Big Data volume, one has to resort to some specific methods to analyze and interpret a somewhat smaller set of data? Is it not the intention in the first place to cluster or mine important factors, to isolate such 'smaller' sets so some meaningful Statistical Analysis can be conducted on them?

I agree that maybe not all Statistical methods to handle that part of the analysis and interpretation have been fully developed yet. But there are many existing methods that are fully developed and might be very well applicable.

If so, where does one draw the line between a Statistician and a Data Scientist?

Is the objective here to have a group of 'Para-Statisticians', like paralegals or paramedics?

Sumedha


Livermore, CA

Comment by Mirko Krivanek on June 1, 2014 at 6:26pm

Vincent, you keep referring to something being 15 years old, as if some big event happened 15 years ago, like a massive virtual meteorite hitting Earth, the kind that killed the dinosaurs, creating disruption in the scientific community at large, including college education. What is this mysterious event?