There are many topics that you won't learn in statistics classes. Some, such as U-Statistics, stochastic geometry, fractal compression, and stochastic differential equations, are post-graduate material, so it is OK not to find them in statistics curricula. Others, like computational complexity and L^1 metrics (to replace R-squared and other outlier-sensitive L^2 metrics such as traditional variance), should be included, in my opinion.
But the classic statistics curriculum is almost written in stone. Pick up any college textbook (hundreds are devoted to statistics, and dozens are published each year): they cover pretty much the same topics, and the curriculum has barely changed in decades, even though the theory was built well before the age of big data and modern computers.
What about stuff you won't learn in data science classes?
Likewise, there are many things that you won't learn in data science classes. While, unlike statistics, there is no standard curriculum for data science - the discipline being very new - I can speak about my own classes and certification. These classes (an apprenticeship in fact) don't contain material such as hypothesis testing (read here why I think it should not be taught), general linear models (see here why I discard this technique), advanced matrix algebra, naive Bayes (see here why I discard this technique), or maximum likelihood principles.
In my classes (presented as an apprenticeship), this kind of material has been replaced by the following unified approach:
- offering simple, efficient, almost math-free methods and techniques that are easy to use, fine-tune and understand even by the non-expert, leading to easy interpretation of results; some are SQL-friendly
- offering scalable and robust techniques fit for black-box / batch mode analytics and automated processes, with results easy to interpret, and applicable to big, unstructured data (with a technique - indexation, see below - to turn unstructured data into structured data)
- proposing data-driven, synthetic, robust metrics (for instance predictive power for feature selection, an L^1 version of R-squared, or L^1 variance, see below) when it makes sense, rather than outlier-sensitive L^2 formulas that are easily and elegantly expressed in mathematical language,
- and proposing fast algorithms rather than mathematical formulas, some with an implementation in Excel for people who do not code, though a cheat sheet can help you jump-start your programming skills.
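To make the contrast between L^1 and L^2 metrics concrete, here is a minimal sketch in Python. It assumes that the mean absolute deviation from the median is an acceptable stand-in for an L^1 variance; this is an illustrative choice, not necessarily the exact definition used in my curriculum, and the function names are hypothetical.

```python
import statistics


def l2_variance(values):
    """Classic population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def l1_dispersion(values):
    """Assumed L^1 analogue: mean absolute deviation from the median."""
    center = statistics.median(values)
    return sum(abs(v - center) for v in values) / len(values)


def inflation(metric, clean, contaminated):
    """How many times larger the metric gets once the data is contaminated."""
    return metric(contaminated) / metric(clean)


if __name__ == "__main__":
    clean = [1, 2, 3, 4, 5]
    contaminated = [1, 2, 3, 4, 500]  # same data, one value replaced by an outlier
    print("L^2 inflation:", inflation(l2_variance, clean, contaminated))
    print("L^1 inflation:", inflation(l1_dispersion, clean, contaminated))
```

On this toy data, a single outlier inflates the L^2 variance by a factor of roughly 19,800, versus roughly 84 for the L^1 measure.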
Important techniques you won't learn in most statistics classes
Here I focus on techniques, created or re-invented in our data science research lab, that are critical parts of my data science curriculum, yet are usually criticized and/or not taught by a number of statistics professors, either because they strongly deviate from the statistical norm, or because they are simply ignored, not valued, or not known by these academics. Keep in mind that I started my career as a research statistician, then became a business statistician, so I know both the statistical (model to data, or top-down) and data science (data to model, or bottom-up) paradigms very well.
For each of the techniques below, I quickly explain why I included it in my data science curriculum, and why some statisticians dislike or ignore it.
- Variance that is scale-independent. The L^p variance that I introduce (with 1 < p < 2) is the main reason to read this article. Sadly, for statisticians, scale-independent variance makes no sense: it is heresy. But what if you want a variance metric that does not change when you change the units?
- Data-driven, model-free confidence intervals. Some statisticians claim that without a model, without an underlying statistical distribution that needs to be estimated, it is impossible to build confidence intervals. Not only do I dispel this myth, but my confidence intervals are identical to classical ones when the number of observations is large and the data is well behaved (despite claims to the contrary by people who did not read my article in detail). Also, it's a generic approach to confidence intervals, while the classic statistical approach uses dozens of formulas depending on the underlying distribution. It can easily be coded in Excel or SQL. And I use the concept of confidence intervals to replace the obscure tests of hypotheses. More on this here. More generally, it is possible to do predictive analytics without predictive models.
- Use of a simple, robust, easy-to-interpret regression technique known as the Jackknife regression. Because it provides a slightly biased solution, it is regarded by some statisticians as heresy. In fact, it is not different from a classic statistical approach that would use penalized likelihood (or a Bayesian approach). This incredibly simple technique (easy to code even in Excel) works well even when the variables are strongly correlated, as shown in my article. It proves that a very simple solution can be nearly as accurate as a complex one, in nearly all contexts. Note that no advanced linear algebra is needed to implement this technique. Finally, this technique has nothing to do with bootstrapping, re-sampling, or Bradley Efron. Claims to the contrary are made by people who read only the title, not the article.
- Use of techniques to detect spurious correlations (such as comparing time series using correlogram analysis rather than correlations). These techniques are critical to detect signal in an ocean of noise, especially when computing correlations among millions of variables (that is, trillions of cross-correlations), yet they are not taught in many stats classes. Among trillions of correlations, many are bound to be very strong just by chance, due to the curse of big data, a concept foreign to some statisticians, even though most of them are aware of the curse of dimensionality (unrelated to the curse of big data). Likewise, some practitioners repeat the same statistical tests over and over, until they get results that they like. Again, because of the curse of big data, this always leads to erroneous conclusions, and to abuses such as p-hacking. Read this article about how to lie with statistics. In fact, most correlations are not causal.
- Hidden decision trees (HDT). A method that blends two techniques - Jackknife regression (the third item in this list) and decision trees - to score transactions. I developed it around 2002 in the context of credit card fraud detection in real time, for Visa, and it has been used to successfully score trillions of transactions since its inception, especially for click or keyword scoring, to assess the quality of web traffic from traffic sources with no or little history. The decision tree component does not require tree splitting and does not have hard-to-handle parameters such as the minimum node size. Much of the HDT efficiency comes from identifying the right compound metrics, the feature selection algorithm (to solve a combinatorial optimization problem), smartly exploiting the sparsity of big data (resulting in the automated creation of dozens of small, easy-to-interpret decision trees, each one corresponding for instance to a special type of fraud), and binning the metrics appropriately using bucketization algorithms. I haven't seen anything close to this in any statistics textbook. It is considered by many to be a machine learning technique.
- Automated analytics, machine-to-machine communications via APIs, for instance to bid on Google keywords or on the stock market. Also useful to automate statistical analyses or EDA using data dictionaries and frequency tables that are automatically parsed and summarized, especially when dealing with unstructured text and keyword data (NLP, that is, natural language processing). A number of statisticians claim that this level of automation is impossible, probably because it could jeopardize their jobs.
- Indexation algorithms (see Part 2 after clicking on this link). This technique must be at least 20 years old, and I am sure we re-invented the wheel here. But sometimes, re-inventing the wheel is the fastest and easiest way, and can lead to better solutions. This would definitely be classified as a machine learning technique, despite the fact that it is pure clustering, and thus a statistical technique. It is an incredibly fast clustering technique indeed: it does not require n x n memory storage, only n, where n is the number of observations. Also, it is easy to implement in distributed Map-Reduce or Hadoop environments. But it is not found in any stats textbook that I am aware of (I would love to be proved wrong though). It is a fundamental algorithm: the core algorithm used to build taxonomies, catalogs (see this article about Amazon), search engines, and enterprise search solutions. We've used it successfully in numerous contexts, including for our IoT automated growth hacking for digital publishing, to categorize our articles and boost them depending (among other things) on category, for maximum efficiency. Here's another illustration.
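The point made above about strong correlations arising purely by chance can be checked with a small simulation. The sketch below is only an illustration, not one of the detection techniques themselves: it generates independent Gaussian series, computes all pairwise Pearson correlations, and reports the strongest one. The function names, the 0.8 threshold, and the sizes (200 series of 10 observations) are assumptions chosen for the example.

```python
import itertools
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def scan_chance_correlations(n_series, length, threshold=0.8, seed=0):
    """Return (strongest |r|, number of pairs with |r| above threshold)
    among series that are independent by construction."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    series = [[rng.gauss(0.0, 1.0) for _ in range(length)]
              for _ in range(n_series)]
    strongest = 0.0
    strong_pairs = 0
    for a, b in itertools.combinations(series, 2):
        r = abs(pearson(a, b))
        strongest = max(strongest, r)
        if r > threshold:
            strong_pairs += 1
    return strongest, strong_pairs


if __name__ == "__main__":
    strongest, strong_pairs = scan_chance_correlations(200, 10)
    n_pairs = 200 * 199 // 2
    print(f"{n_pairs} pairs, strongest |r| = {strongest:.3f}, "
          f"{strong_pairs} pairs above 0.8")
```

None of these series is related to any other, yet with only 200 variables (19,900 pairs) some pairs already look strongly correlated. With millions of variables, the effect is far more pronounced.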
Who should use these techniques?
The practitioners who will benefit most include:
- start-up employees,
- statisticians, engineers or data scientists paid to develop proprietary solutions (in big or small companies, to gain a competitive advantage) especially where automated analytical systems and big streaming unstructured data are involved,
- analytics vendors / data solution providers that want to integrate powerful, simple and robust algorithms in the data science / data integration / data management products and services that they sell,
- non-statisticians (for instance software engineers) performing statistical analyses in a programming language such as Python or Java,
- and pretty much anyone working on analytical projects (even if just in Excel or SQL) who is not forced to use standard procedures from some statistical package.
Some of my techniques can be found in my Wiley book, and many more will be introduced in my upcoming Data Science 2.0 book.