You will find here nine interesting topics that you won't learn in college classes. Most have interesting applications in business and elsewhere. They are not especially difficult, and I explain them in simple English. Yet they are not part of the traditional statistical curriculum, and even many experienced data scientists with a PhD have not heard of some of these concepts.
1. Random walks in one, two and three dimensions
This is a well-known model, used as a base stochastic process to model the logarithm of stock prices, yet it has interesting properties (depending on the dimension) that few people know about. In one dimension, it is described as follows: You start at 0 (on the X-axis) and at each iteration, you increase by 1 with probability 0.5, and decrease by 1 with probability 0.5. In one or two dimensions, the probability that the walk eventually returns to any previously visited state is one. But this is not the case in three dimensions. Surprisingly, the most probable number of sign changes (crossings of the X-axis) in a walk is 0, followed by 1, then 2, and so on. The time spent above or below the X-axis (before a crossing) is modeled by the arc-sine law: Crossing the X-axis happens rarely. For self-correcting random walks, click here. Below is a simulation of a 2-D random walk; the video was produced with R.
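As a complement, here is a minimal Python sketch of the walk in any dimension (it is not the R code behind the video). The function names `random_walk`, `count_sign_changes` and `count_returns_to_origin` are my own choices for illustration, and I assume that in more than one dimension each step moves by one unit along a single, randomly chosen axis.

```python
import random


def random_walk(n_steps, dim=1, seed=None):
    """Simulate a simple symmetric random walk on the integer lattice.

    Assumption: at each step one axis is picked uniformly at random and the
    walk moves +1 or -1 along it, each with probability 0.5.
    Returns the list of visited positions (tuples), starting at the origin.
    """
    rng = random.Random(seed)
    position = [0] * dim
    path = [tuple(position)]
    for _ in range(n_steps):
        axis = rng.randrange(dim)
        position[axis] += rng.choice((-1, 1))
        path.append(tuple(position))
    return path


def count_sign_changes(heights):
    """Count how often a 1-D walk moves from one side of the X-axis to the other.

    Visits to 0 are ignored: only a switch between strictly positive and
    strictly negative values counts as a crossing.
    """
    changes = 0
    last_sign = 0
    for h in heights:
        sign = (h > 0) - (h < 0)
        if sign != 0:
            if last_sign != 0 and sign != last_sign:
                changes += 1
            last_sign = sign
    return changes


def count_returns_to_origin(path):
    """Count the number of times the walk comes back to its starting point."""
    origin = tuple([0] * len(path[0]))
    return sum(1 for p in path[1:] if p == origin)


if __name__ == "__main__":
    walk_1d = [p[0] for p in random_walk(1000, dim=1, seed=42)]
    print("sign changes in 1-D:", count_sign_changes(walk_1d))
    for d in (1, 2, 3):
        path = random_walk(10000, dim=d, seed=42)
        print(f"returns to origin in {d}-D:", count_returns_to_origin(path))
```

Running it for many seeds and tabulating the number of sign changes is an easy way to see how rarely long walks cross the X-axis.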
2. Estimation of the convex hull of a set of n points
In one dimension, this is just the estimation of an interval when points are uniformly distributed, using the minimum and maximum observations, and multiplying the observed length (max - min) by a factor (n+1)/(n-1) to remove the bias, since the expected observed length is (n-1)/(n+1) times the true length. In two dimensions, computing the convex hull is easy, and again you need to expand the shape a little to correct for bias. Convex hulls are used in clustering problems, where clusters are modeled by (possibly) overlapping convex domains: This is a non-parametric alternative to clustering algorithms based on the Gaussian distribution.
A potential application is estimating the shape of an oil field when drilling a number of test wells - some within the (unknown) oil field boundary, some (as few as possible) outside the boundary. It is also used to estimate the extent and shape of an underground contaminated area: It was used to determine whether nuclear waste from the Hanford nuclear reservation was leaking into the Columbia River, located a few hundred yards away, and whether the situation got worse over time, by measuring chromium levels in a number of wells.
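The one-dimensional case can be sketched in a few lines of Python. For n points uniform on an interval of length L, the expected value of (max - min) is L(n-1)/(n+1), so the sketch rescales the observed length by (n+1)/(n-1). The function name `estimate_interval` and the choice to expand the interval symmetrically around its midpoint are my assumptions for this illustration.

```python
def estimate_interval(sample):
    """Estimate (a, b) for points assumed uniform on an unknown interval [a, b]."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    lo, hi = min(sample), max(sample)
    observed_length = hi - lo
    # E[max - min] = (b - a) * (n - 1) / (n + 1), so rescale to remove the bias.
    corrected_length = observed_length * (n + 1) / (n - 1)
    # Assumption: spread the extra length equally on both sides.
    midpoint = (lo + hi) / 2
    half = corrected_length / 2
    return midpoint - half, midpoint + half


if __name__ == "__main__":
    print(estimate_interval([2, 4, 6]))
```

With the three observations 2, 4 and 6, the observed length 4 is doubled to 8, which is consistent with the fact that three uniform points on [0, 8] have expected minimum 2 and expected maximum 6.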
How about designing a fast algorithm to compute the convex hull of a set of points, in any dimension? This is a great exercise for a data scientist, but first you need to check the literature about existing algorithms. I implemented one when I was working on my PhD in computational statistics.
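As a starting point for that exercise, here is a sketch of Andrew's monotone chain, a standard O(n log n) algorithm from the literature. It handles the two-dimensional case only and is not the algorithm I implemented during my PhD; the function names are chosen for this illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o).

    Positive when o -> a -> b makes a counter-clockwise turn.
    """
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_2d(points):
    """Return the hull vertices in counter-clockwise order (monotone chain).

    Points lying on a hull edge (collinear points) are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    lower = []
    for p in pts:
        # Pop while the last two points and p do not make a left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    square_with_center = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
    print(convex_hull_2d(square_with_center))
```

Sorting dominates the cost; the two sweeps are linear because each point is pushed and popped at most once per chain.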
The first step to estimate this complex shape is to start with the convex hull (click here for details)
3. Constrained linear regression on unusual domains
Lasso and ridge regression are popular examples of constrained linear regression: Constraints are put on the regression coefficients to make the model more stable. For instance, the coefficient attached to an independent variable can be required to have the same sign as the correlation between that variable and the dependent variable. Such constraints are used for instance in the HDT algorithm, which is a hybrid regression / pseudo decision tree procedure.
In some cases, the constraints are dictated by the business problem itself. For instance, if a response depends on a mix of chemical ingredients (think about the taste of a beverage - how people like it or not) the weight or proportion attached to each ingredient is a regression coefficient: All these coefficients must be positive or zero, and they must add up to one. This is known as linear regression on the simplex domain. Click here for more similar problems (regression on a sphere and so on.)
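One possible way to fit a regression on the simplex is projected gradient descent: take a gradient step on the squared error, then project the coefficients back onto the set of non-negative weights that add up to one. The sketch below uses only the standard library. The function names, the default step size and the number of iterations are assumptions made for this example (the step size must be small enough for the data at hand), and other solvers such as quadratic programming would work as well.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w_i) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keeps the threshold for the largest valid i
    return [max(x - theta, 0.0) for x in v]


def simplex_regression(X, y, step=0.5, n_iter=500):
    """Least squares with coefficients constrained to the simplex.

    X is a list of rows, y a list of responses. Minimizes the mean squared
    error by projected gradient descent, starting from equal weights.
    """
    n, p = len(X), len(X[0])
    w = [1.0 / p] * p
    for _ in range(n_iter):
        residuals = [
            sum(xi * wi for xi, wi in zip(row, w)) - yi
            for row, yi in zip(X, y)
        ]
        gradient = [
            2.0 / n * sum(row[j] * r for row, r in zip(X, residuals))
            for j in range(p)
        ]
        w = project_to_simplex([wi - step * gj for wi, gj in zip(w, gradient)])
    return w


if __name__ == "__main__":
    # Toy data generated without noise from the weights (0.2, 0.3, 0.5).
    X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]]
    y = [0.2, 0.3, 0.5, 0.5, 0.8, 0.7]
    print(simplex_regression(X, y))
```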
4. Robust and scale-invariant variances
The traditional variance is impacted by erroneous data and outliers, and is thus not very robust. I proposed a new variance that is more robust, and always positive, just like the standard variance. The positivity is guaranteed by the Jensen inequality, and from a mathematical point of view, it lies between an L^1 and an L^2 version of the classical variance (L^2 yields the classical variance). Click here for details.
I am currently working on a variance that is scale-invariant (also described in the same article) and this is really a bizarre object, though useful when the variance should stay the same, whether your metric is measured in miles or kilometers. The next step is to design scale-invariant clustering algorithms, as the scale of each variable (the units used for measurement) sometimes has a bigger impact on the resulting clusters than the choice of the clustering algorithm itself.
5. Distribution of arrival times of extreme events
Most articles on extreme events focus on predicting the extreme values. Very little has been written about the arrival times of these events. This article fills the gap. Click here to read it. It comes with a pre-computed table of probabilities for the occurrences of extreme events. Think about floods: While it is important to correctly predict the maximum intensity of potential floods, predicting when they can happen, over a 1000-year time period, is equally important. Departure from the theoretical model means that patterns are changing, hence the need to work with statistical tables such as mine, as actuaries do.
6. The Tweedie distributions
In statistics, the Tweedie distributions are a family of probability distributions which include the purely continuous normal and gamma distributions, the purely discrete scaled Poisson distribution, and the class of mixed compound Poisson–gamma distributions which have positive mass at zero, but are otherwise continuous. Just like the exponential family of distributions, this family includes several popular distributions. These distributions are characterized by the following property: The variance is proportional to a power of the expectation. They have many applications, including modeling errors in signal processing, and even modeling departure from the asymptotic representation in some prime number functions. Click here for details, and to see the various applications, including actuarial studies, survival analysis, ecology, medical applications, meteorology and climatology, fisheries, cancer metastasis, genomic structure and evolution.
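To make the mixed compound Poisson–gamma case concrete, here is a small simulation sketch: draw a Poisson number of events, then add up that many gamma-distributed amounts. When no event occurs the total is exactly zero, which produces the point mass at zero. The function names and the parameter values (`lam`, `shape`, `scale`) are arbitrary choices for this illustration, and the Poisson sampler is Knuth's simple method, which is only suitable for small rates.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_compound_poisson_gamma(lam, shape, scale, rng):
    """Sum of a Poisson(lam) number of independent Gamma(shape, scale) amounts."""
    n_events = sample_poisson(lam, rng)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_events))


if __name__ == "__main__":
    rng = random.Random(0)
    lam, shape, scale = 1.0, 2.0, 1.5
    draws = [sample_compound_poisson_gamma(lam, shape, scale, rng)
             for _ in range(20000)]
    zero_fraction = sum(1 for d in draws if d == 0) / len(draws)
    mean = sum(draws) / len(draws)
    print("fraction of exact zeros:", zero_fraction, "theory:", math.exp(-lam))
    print("sample mean:", mean, "theory:", lam * shape * scale)
```

The fraction of exact zeros should be close to exp(-lam), and the sample mean close to lam * shape * scale.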
Another distribution with several practical applications is the Zipf distribution.
7. The arithmetic-geometric mean
This was initially designed to compute the mean of two numbers, and it comes with a very fast algorithm that converges to a value between the arithmetic and geometric means. It has a number of interesting mathematical properties, and has been used to compute the number Pi very efficiently (other very fast algorithms to compute Pi can be found here and here.)
To compute the arithmetic-geometric mean of two numbers, start with two initial estimates a(0) and b(0) equal respectively to the geometric and arithmetic mean. At each iteration k, compute a(k) as the geometric mean of a(k-1) and b(k-1), and compute b(k) as the arithmetic mean of a(k-1) and b(k-1). Both a(k) and b(k) converge very fast to the arithmetic-geometric mean. Click here for details.
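The iteration just described fits in a few lines of Python. The function name, the relative tolerance `tol` and the iteration cap `max_iter` are my choices for this sketch.

```python
import math


def arithmetic_geometric_mean(x, y, tol=1e-15, max_iter=100):
    """Arithmetic-geometric mean of two non-negative numbers."""
    if x < 0 or y < 0:
        raise ValueError("both numbers must be non-negative")
    a = math.sqrt(x * y)  # a(0): geometric mean
    b = (x + y) / 2       # b(0): arithmetic mean
    for _ in range(max_iter):
        if abs(b - a) <= tol * b:
            break
        # a(k) and b(k) are both computed from a(k-1) and b(k-1).
        a, b = math.sqrt(a * b), (a + b) / 2
    return (a + b) / 2


if __name__ == "__main__":
    print(arithmetic_geometric_mean(1.0, math.sqrt(2.0)))
    print(arithmetic_geometric_mean(1.0, 4.0))
```

The gap between a(k) and b(k) is roughly squared at each iteration, so a handful of iterations is enough to reach double precision.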
It has been generalized to any number of variables, see here. The picture below summarizes one of the most interesting generalizations, involving a bunch of interesting averaging functions, besides the arithmetic and geometric means.
8. Weighted version of the K-NN clustering algorithm
A key question when estimating the local or global intensity of a stochastic point process (a problem closely related to density estimation) is: How many neighbors should we use, and which weights should we put on these neighbors to get robust and accurate estimates? It turned out that putting more weight on close neighbors, and increasingly lower weight on far away neighbors (with weights slowly decaying to zero based on the distance to the neighbor in question) was the solution to the problem. I actually found optimum decaying schedules for the weights a(k) attached to the k-th nearest neighbor, as k tends to infinity. You can read the details here. Obviously this can also be used when implementing clustering techniques based on the well-known K-NN algorithm (k nearest neighbors).
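Here is a minimal sketch of the idea applied to classification. Each of the k nearest neighbors votes for its label with a weight that decreases with its rank. The default weight 1/rank is only a placeholder chosen for this example; it is not the optimum decaying schedule mentioned above, and the function name and signature are hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, k=5, weight=lambda rank, dist: 1.0 / rank):
    """Predict a label from the k nearest neighbors using decaying weights.

    train  : list of (point, label) pairs, where point is a tuple of floats
    weight : function of the neighbor's rank (1 = closest) and its distance;
             the default 1/rank is an arbitrary illustrative choice
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for rank, (point, label) in enumerate(neighbors, start=1):
        votes[label] += weight(rank, math.dist(point, query))
    return max(votes, key=votes.get)


if __name__ == "__main__":
    train = [((0.1, 0.0), "a"), ((1.0, 0.0), "b"), ((1.1, 0.0), "b")]
    query = (0.0, 0.0)
    # Decaying weights favor the single very close neighbor.
    print(weighted_knn_predict(train, query, k=3))
    # Equal weights reduce to the usual majority vote.
    print(weighted_knn_predict(train, query, k=3, weight=lambda rank, dist: 1.0))
```

In this toy example the closest point carries weight 1 while the two farther points together carry 1/2 + 1/3, so the weighted vote and the plain majority vote disagree.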
For another generalization of the K-NN classifier, based on graph theory, click here. This version of K-NN can also be used for variable reduction while preserving the dimension of the original data set.
9. Multivariate exponential distribution and storm modeling
Intensity and duration of storm cells have been traditionally modeled using Gaussian distributions. Bivariate exponential distributions with negative correlation provide more flexibility and a better representation of the real world, that is, superior goodness of fit with actual data. You can read more about this topic, and about how to simulate a multivariate exponential distribution with specific covariance matrix and known marginals, here (PDF document.)
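One simple way to get a pair of exponential variables with negative correlation is to feed the same uniform number U into one marginal and 1 - U into the other (antithetic variates). This is a generic construction, not necessarily the method used in the paper, and the function names are chosen for this sketch. For exponential marginals this coupling yields a correlation of 1 - π²/6, about -0.645, which is the most negative value achievable.

```python
import math
import random


def antithetic_exponential_pair(rng, rate_x=1.0, rate_y=1.0):
    """Draw (X, Y) with exponential marginals and strongly negative correlation."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    x = -math.log(u) / rate_x
    y = -math.log(1.0 - u) / rate_y
    return x, y


def pearson_correlation(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = [antithetic_exponential_pair(rng) for _ in range(50000)]
    xs, ys = zip(*pairs)
    print("sample correlation:", pearson_correlation(xs, ys))
    print("theoretical value :", 1 - math.pi ** 2 / 6)
```

Intermediate correlations can be obtained, for instance, by mixing this antithetic pair with an independent pair, although more refined constructions exist.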
There is a limit on how negative the coefficient of correlation of a bivariate exponential distribution can be, and this is stated in the theorem below (from the same paper):