Many of those who call themselves statisticians just won't admit that data science heavily relies on and uses (heretical, rule-breaking) statistical science, or they don't recognize the true statistical nature of these data science techniques (some are 15 years old), or they are opposed to the modernization of their statistical arsenal. They already missed the train when machine learning (also heavily based on statistics) became a popular discipline more than 15 years ago. Now machine learning professionals, who are statistical practitioners working on problems such as clustering, far outnumber statisticians.
[Figure: Statisticians borrowing the Hadoop picture; see source]
Many times, I have interacted with statisticians who think that anyone who does not call himself a statistician knows little or nothing about statistics; see my recent bio published here, or visit the LinkedIn profiles of many data scientists, to debunk this myth.
Any statistical technique that is not in their old books is considered heretical at best, or non-statistical at worst, and most of the time is simply not understood. New statistics or fake data science textbooks are published every week, but with the exact same technical content: KNN classification, logistic regression, naive Bayes, decision and boosted trees, SVM, Bayesian statistics, centroid clustering, linear discrimination - as in the early eighties, applied to tiny data such as Fisher's iris data set. Rarely do they include anything new (that is, less than 10 years old) such as ensemble methods, Lasso and ridge regression, or Bayesian networks. Some new statistics textbooks include analyses of small Twitter data using R or sometimes Python, or talk about association rules or recommendation engines, but they are still far away from real applied statistical data science.
1. The Decline of Old-Fashioned Statistics
If you compare traffic statistics (Alexa rank) for the top traditional statistics websites with those for data science websites, the contrast is striking. In the table below, the lower the rank, the higher the traffic volume, with a rank of 50,000 getting about twice as much traffic as a rank of 100,000:
Popular statistics websites:
Popular data science websites:
Google keyword trends tell the same story.
These numbers are based on Alexa rankings, which are notoriously inaccurate. Over time, however, Alexa has improved the statistical science it uses to measure and filter Internet traffic, and the numbers that I quote here have been stable recently, showing the same trend for months. They are subject to a small 30% error rate (compared to a 100% error rate a few years ago), an estimate based on comparing Alexa variances over time for multiple websites that we own and for which we know exact traffic stats after filtering out robots. These numbers are for US traffic only, which represents between 25% and 55% of the traffic on these websites, but US traffic is the only traffic that can be easily and well monetized - thus representing 90% of all business value. TrafficEstimate.com is another free website that you can use to compare traffic statistics, and there are third-party vendors such as compete.com that provide additional information for a very expensive fee.
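The rule of thumb quoted earlier (a rank of 50,000 gets about twice the traffic of a rank of 100,000) is consistent with traffic being roughly inversely proportional to rank. The short Python sketch below turns that reading into a comparison helper. The inverse-proportionality model is an assumption extrapolated from that single example, not a documented Alexa formula, and the function name `relative_traffic` is hypothetical.

```python
def relative_traffic(rank_a: int, rank_b: int) -> float:
    """Estimate how many times more traffic site A gets than site B.

    ASSUMPTION: traffic is inversely proportional to rank, which matches
    the one example in the text (rank 50,000 ~ 2x the traffic of rank
    100,000). Real rank-to-traffic curves may differ.
    """
    if rank_a <= 0 or rank_b <= 0:
        raise ValueError("ranks must be positive integers")
    return rank_b / rank_a


# The example from the text: rank 50,000 versus rank 100,000.
print(relative_traffic(50_000, 100_000))
```

Given the error rates discussed above, any ratio obtained this way is best read as an order-of-magnitude indication rather than a measurement.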
Despite being dissidents and disruptors in the statistical community, we attract more and more statisticians (you are welcome to join us), as well as mathematicians (many in the defense or financial industry), operations research practitioners, and even clinical trial experts. Many people don't really care about their job title: there are many different flavors of data scientists. For instance, my primary job titles are CFO, owner, entrepreneur, co-founder, and business hacker, and my secondary job titles are data scientist and consultant. Of course, I am a lean business data scientist, so being a business hacker is actually part of being a data scientist for me, as it is for many other business data scientists. This explains why we are growing much faster than competitors that have far more employees and overhead: to boost our growth, we leverage unusual (heretical) applied statistical techniques such as ISP segmentation to optimize eBlasts, RSS feed optimization and automated syndication (to optimize reach - a growth hacking technique), content mix optimization, reverse word-of-mouth advertising, computational marketing, mathematical optimization of spending (revenue spending, and value added to clients to increase revenue and re-invest in our business without external funding), and various business hacks.
2. Modern Statistical Techniques Used in Data Science
Modern statistical data science techniques are far more robust than traditional statistics, and designed for big data. I have implemented many over the last 15 years with success stories to tell (see section 4 below), and have reached a point where I feel confident about automating data science, exploratory data analysis and statistical science. I'm preparing a book on data science automation.
Here are a few of these techniques:
More (from other authors, in particular) will be published in our data science research center.
3. Old Statistical Principles Still Used in Data Science
Still, a number of old-fashioned techniques and principles are here to stay, and are even experiencing growth, such as experimental design, sampling, and Monte Carlo simulations. Some, such as p-values and traditional hypothesis testing, will die or are already dead, to be replaced by simple, model-free techniques that everyone can easily understand and apply. Others are currently used recklessly and need to be significantly improved to make them robust and suitable for black-box, automated algorithms. I'm working on this, for instance by transforming traditional regression into what I call Jackknife regression, to make it usable by non-experts.
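As one illustration of what a simple, model-free alternative to a classical test can look like, the sketch below computes a percentile bootstrap confidence interval for a mean using only the Python standard library. The bootstrap is a standard resampling technique chosen here for illustration; it is not claimed to be the specific method the author has in mind, and the function name `bootstrap_ci`, the sample values, and the parameter defaults are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000,
                 level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    No distributional model is assumed: the interval comes from
    recomputing the statistic on samples drawn with replacement.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    n = len(data)
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical measurements, for illustration only.
sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.2, 3.0, 2.7, 2.6]
low, high = bootstrap_ci(sample)
print(f"95% bootstrap interval for the mean: [{low:.2f}, {high:.2f}]")
```

The same function works for other statistics (for example `statistics.median`) by passing a different `stat`, which is part of what makes resampling approaches easy to automate.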
A number of great old principles, here to stay, are listed in this SimplyStatistics.org article, and include:
I would also add: always try to explain and reduce sources of variance, without over-fitting, using sound cross-validation and sensitivity analyses. That said, I have even questioned the traditional definition of variance and proposed a few alternatives, including L1 and scale-invariant variance (see also pages 187-193 of my book), to be used when scale does not matter - though many statisticians consider this taboo.
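To make the idea of L1 and scale-invariant alternatives to variance more concrete, here is a minimal Python sketch. The author's own definitions are in the book and are not reproduced here; the code uses standard stand-ins instead, namely the mean absolute deviation around the median as an L1 measure of spread, and that quantity divided by the mean absolute value of the data as a scale-free version. Both function names are hypothetical.

```python
import statistics


def l1_dispersion(values):
    """Mean absolute deviation around the median (an L1 measure of spread).

    ASSUMPTION: a standard stand-in, not the author's own definition.
    """
    center = statistics.median(values)
    return sum(abs(v - center) for v in values) / len(values)


def scale_free_dispersion(values):
    """L1 dispersion divided by the mean absolute value of the data.

    Multiplying every value by the same nonzero constant leaves the
    result unchanged, so the measure ignores the unit of measurement.
    """
    mean_abs = sum(abs(v) for v in values) / len(values)
    if mean_abs == 0:
        raise ValueError("scale-free dispersion is undefined for all-zero data")
    return l1_dispersion(values) / mean_abs


data = [1, 2, 3, 4, 5]
rescaled = [10 * v for v in data]  # same data in a different unit

print(l1_dispersion(data), l1_dispersion(rescaled))
print(scale_free_dispersion(data), scale_free_dispersion(rescaled))
```

The first printed pair differs by the rescaling factor, while the second pair is the same for both data sets, which is the property that matters when scale is irrelevant.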
4. Big Data Success Stories, Leveraging Statistical Data Science
My Hidden Decision Trees methodology (a hybrid: an approximate, very robust constrained logistic regression blended with a few hundred small, stable decision trees, relying on fast combinatorial feature selection to optimize a newly created metric called predictive power - see my book for details) has been used to process and score billions of clicks, IP addresses (sometimes in real time), and keywords, and to detect some of the largest botnets impacting the digital advertising industry. It is heavily statistical in nature, uses model-free, data-driven confidence intervals, and uses pretty much none of the statistical techniques described in any textbooks other than mine. It is indeed data science. Most recently, it was used to detect large-scale criminal activity hiding behind Amazon AWS (click here and also here for references) and not detected by Amazon, by analyzing ad network data and data from Spamhaus, Barracuda, ProjectHoneyPot, StopForumSpam, Adometry and some other sources, including social network data. Interestingly, it was not based on Amazon data, but instead leveraged external data sources exclusively.
When I posted this fact in reaction to SimplyStatistics.org's article "Why big data is in trouble: they forgot about applied statistics", my comment was quickly removed. It makes you wonder whether they are trying to obfuscate the use of statistics in data science, or want to stick to old statistics out of fear of progress, rigid thinking (tunnel vision), or an inability to adapt. Anyway, many of the statistical techniques that I used to turn big data into value are described in the newly created data science research center, but you will never see them described or even mentioned in traditional statistical publications (except in my Wiley book).