In this article, we revisit the most fundamental theorem of statistics, in layman's terms. We investigate a special but interesting and useful case that is not discussed in textbooks, data camps, or data science classes. This article is part of a series on off-the-beaten-path data science and mathematics, offering a fresh, original and simple perspective on a number of topics. Previous articles in this series can be found here and also here.
The theorem discussed here is the central limit theorem. It states that if you average a large number of well-behaved observations or errors, then, once normalized appropriately, the average converges to a standard normal distribution. Although we deal here with a more advanced and exciting version of this theorem (involving the Liapounov condition), this article is very applied, and can be understood by high school students.
In short, we are dealing here with a not-so-well-behaved framework, and we show that even in that case, the limiting distribution of the "average" can be normal (Gaussian). More precisely, we show when it is, and when it is not, based on simulations and non-standard (but easy to understand) statistical tests.
Figure 1: The first three cases show convergence to the Gaussian distribution
1. A special case of the Central Limit Theorem
Let's say that we have n independent observations X(1),..., X(n) and we compute a weighted sum
S = a(1) X(1) + ... + a(n) X(n).
Under appropriate conditions to be discussed later, (S - E(S)) / Stdev(S) converges to a normal distribution with mean 0 and variance 1. Here E denotes the expectation and Stdev the standard deviation, that is, the square root of the variance.
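For uniform X(k) on [0, 1], E(S) and Stdev(S) have closed forms: each X(k) has mean 1/2 and variance 1/12, so E(S) = 0.5 (a(1) + ... + a(n)) and Stdev(S) = sqrt((a(1)^2 + ... + a(n)^2) / 12). Here is a minimal Python sketch of this normalization (the function name is mine, not from the article):

```python
import math
import random

def normalized_sum(a, x):
    """Return (S - E(S)) / Stdev(S) for uniform[0, 1] observations x
    with weights a, using the closed forms for Uni[0, 1]."""
    s = sum(ak * xk for ak, xk in zip(a, x))            # S = sum a(k) X(k)
    mean_s = 0.5 * sum(a)                               # E(X(k)) = 1/2
    std_s = math.sqrt(sum(ak * ak for ak in a) / 12.0)  # Var(X(k)) = 1/12
    return (s - mean_s) / std_s

random.seed(100)
a = [k ** -0.5 for k in range(1, 1001)]   # a(k) = k^c with c = -1/2
x = [random.random() for _ in a]
z = normalized_sum(a, x)                  # one normalized deviate
```

With a single weight a(1) = 1 and observation X(1) = 1/2, the normalized value is exactly 0, which is a handy sanity check.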
This is a non-basic case of the central limit theorem, as we are dealing with a weighted sum. The classic, basic version of the theorem assumes that all the weights a(1), ..., a(n) are equal. Furthermore, we focus here on the particular case where a(k) = k^c for a fixed exponent c <= 0 (see the source code in the appendix), so that the bulk of the weight falls on the first few observations.
The surprising result is that even with so much weight put on the first few observations, the limiting distribution can still be Gaussian, provided a(k) converges to 0 slowly enough.
About the context
You might wonder: how is this of any practical value in the data sets that I have to process in my job? Interestingly, I started to study this type of problem long ago, in the context of k-NN (nearest neighbors) classification algorithms. One of the questions, arising when estimating the local or global intensity of a stochastic point process (and also related to density estimation techniques), was: how many neighbors should we use, and which weights should we put on these neighbors to get robust and accurate estimates? It turned out that putting more weight on close neighbors, and increasingly lower weight on far away neighbors (with weights slowly decaying to zero based on the distance to the neighbor in question), was the solution to the problem. I actually found optimal decaying schedules for the a(k)'s, as k tends to infinity. You can read the details here.
2. Generating data, testing, and conclusions
Let's get back to the problem of assessing when the weighted sum S = a(1) X(1) + ... + a(n) X(n), after normalization, converges to a Gaussian distribution. By normalization, I mean considering (S - E(S) ) / Stdev(S) instead of S.
In order to solve this problem, we performed simulations as follows:
Simulations
Repeat m = 10,000 times (see the source code in the appendix):
- Generate n = 10,000 independent uniform deviates X(1), ..., X(n) on [0, 1].
- Compute the weighted sum S = a(1) X(1) + ... + a(n) X(n).
- Store the normalized value (S - E(S)) / Stdev(S).
Each of the above m iterations provides one value of the limiting distribution. In order to investigate the limiting distribution (associated with a specific set of weights), we just need to look at all these m values, and see whether they behave like deviates from a Gaussian distribution of mean 0 and variance 1. Note that we found n = 10,000 and m = 10,000 to be large enough to provide relatively accurate results. We tested various values of n before settling on n = 10,000, looking at what (little) incremental precision we got from increasing n, say from 500 to 2,000, and so on. Also note that the random number generator is not perfect, and due to numerical approximations made by the computer, indefinitely increasing n (beyond a certain point) is not the way to get more accurate results. That said, since we investigated 5 sets of weights, we performed 5 x n x m = 500 million computations in very little time. A value of m = 10,000 provides about two correct digits when computing the percentiles of the limiting distribution (except for the most extreme ones), provided n is large enough.
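As a cross-check of the Perl program in the appendix, the whole simulation can be sketched in a few vectorized NumPy lines (my own sketch; m and n are reduced here so it runs quickly):

```python
import numpy as np

rng = np.random.default_rng(100)
n, m = 2000, 2000                 # reduced from 10,000 for speed
c = -0.5                          # exponent in a(k) = k^c
a = np.arange(1, n + 1) ** c      # weights a(1), ..., a(n)

x = rng.random((m, n))            # m samples of n uniform deviates each
s = x @ a                         # m weighted sums S
mean_s = 0.5 * a.sum()            # E(S) for Uni[0, 1] observations
std_s = np.sqrt((a ** 2).sum() / 12.0)
z = (s - mean_s) / std_s          # m normalized deviates

# For c = -1/2 the empirical mean and standard deviation of z
# should be close to 0 and 1.
print(z.mean(), z.std())
```

Plotting a histogram of z (one histogram per set of weights) reproduces the curves shown in Figure 1.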
The source code is provided in the last section.
Analysis and results
We tested 5 sets of weights, all of the form a(k) = k^c; see Figure 1. In cases 4 and 5 the exponents are c = -1 and c = -2 (that is, a(k) = 1/k and a(k) = 1/k^2), while cases 1 to 3 use more slowly decaying weights, with case 1 corresponding to the classic equal-weight setting.
Note that by design, all normalized S's have mean 0 and variance 1.
We computed (in Excel) the percentiles of the limiting distributions for each of the five cases. Computations are found in this spreadsheet. We compared cases 2 to 5 with case 1, computing the differences (also called deltas) for each of the 100 percentiles. Since case 1 corresponds to a normal distribution, we actually computed the deltas to the normal distribution; see Figure 2. The deltas are especially large for the very top or very bottom percentiles in cases 4 and 5. Cases 2 and 3 show deltas close to zero (not statistically significantly different from zero), which is expected since these cases also yield a normal distribution. To assess the statistical significance of these deltas, one can use the model-free confidence interval technique described here: it does not require any statistical or probability knowledge to understand how it works. Indeed, you don't even need a table of the Gaussian distribution for testing purposes here (you don't even need to know what a Gaussian distribution is), as case 1 automatically provides one.
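The percentile comparison can also be reproduced outside Excel. Here is a NumPy sketch (my own; smaller n and m than in the article) that simulates two of the cases and computes the percentile deltas:

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 2000, 5000

def normalized_sums(c):
    """m normalized weighted sums S with weights a(k) = k^c, uniform X(k)."""
    a = np.arange(1, n + 1) ** float(c)
    s = rng.random((m, n)) @ a
    return (s - 0.5 * a.sum()) / np.sqrt((a ** 2).sum() / 12.0)

case1 = normalized_sums(0)    # equal weights: classical CLT, Gaussian limit
case4 = normalized_sums(-1)   # a(k) = 1/k: sum of squared weights is finite
pct = np.arange(1, 100)       # the 1st through 99th percentiles
deltas = np.percentile(case4, pct) - np.percentile(case1, pct)
# Deltas noticeably different from zero, especially at the extreme
# percentiles, flag a non-Gaussian limit.
```

The same function applied to any other exponent c gives the corresponding column of deltas.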
The Liapounov connection
For those interested in the theory, the fact that cases 1, 2 and 3 yield convergence to the Gaussian distribution is a consequence of the Central Limit Theorem under the Liapounov condition. More specifically, because the samples produced here come from uniformly bounded distributions (we use a random number generator to simulate uniform deviates), all that is needed for convergence to the Gaussian distribution is that the sum of the squares of the weights -- and thus Stdev(S) as n tends to infinity -- be infinite. This result is mentioned in A. Renyi's book Probability Theory (Dover edition, 1998, page 443).
Note that in cases 1, 2, and 3 the sum of the squares of the weights is infinite. In cases 4 and 5, it is finite, equal to Pi^2 / 6 and Pi^4 / 90 respectively (see here for details). I am very curious to know what the limiting distribution is for case 4. Can anyone provide some insights?
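The dividing line is easy to check numerically: for a(k) = k^c the sum of squared weights is the series of the k^(2c), which diverges exactly when 2c >= -1, that is, c >= -1/2. A quick sketch (partial sums only; divergence of the harmonic-type series is a mathematical fact, not something a finite computation proves):

```python
import math

def partial_sum_of_squares(c, n):
    """Partial sum of k^(2c) for k = 1, ..., n, i.e. sum of squared weights."""
    return sum(k ** (2 * c) for k in range(1, n + 1))

for c in (-0.5, -1.0, -2.0):
    # c = -0.5 gives the harmonic series (divergent, grows like log n);
    # c = -1 and c = -2 converge to pi^2/6 and pi^4/90 respectively.
    print(c, partial_sum_of_squares(c, 1_000), partial_sum_of_squares(c, 100_000))

print(math.pi ** 2 / 6, math.pi ** 4 / 90)   # the two finite limits
```

The partial sums for c = -1 and c = -2 match the closed forms quoted above to several digits, while the c = -1/2 partial sums keep growing with n.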
3. Generalizations
Here we discuss generalizations of the central limit theorem, as well as potential areas of research.
Generalization to correlated observations
One of the simplest ways to introduce correlation is to define a stochastic auto-regressive process using
Y(k) = p Y(k-1) + q X(k)
where X(1), X(2), ... are independent with identical distribution, with Y(1) = X(1), and where p, q are positive real numbers with p + q = 1. The Y(k)'s are auto-correlated, but clearly Y(k) is a weighted sum of the X(j)'s (1 <= j <= k), and thus S = a(1) Y(1) + ... + a(n) Y(n) is also a weighted sum of the X(k)'s, with higher weights on the first X(k)'s. Thus we are back to the problem discussed in this article, but convergence to the Gaussian distribution will occur in fewer cases due to the shift in the weights.
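Unrolling the recursion gives Y(k) = p^(k-1) X(1) + q (p^(k-2) X(2) + ... + p^0 X(k)), so the weights a(k) on the Y(k)'s can be converted mechanically into weights b(j) on the X(j)'s. A sketch of this conversion (function and variable names are mine):

```python
def effective_weights(a, p):
    """Convert weights a(k) on the AR(1) process Y(k) = p Y(k-1) + q X(k),
    with Y(1) = X(1) and q = 1 - p, into weights b(j) on the X(j)'s,
    so that sum a(k) Y(k) == sum b(j) X(j)."""
    q = 1.0 - p
    n = len(a)
    b = [0.0] * n
    for k in range(n):            # 0-based index of Y(k+1)
        for j in range(k + 1):    # X(j+1) contributes to Y(k+1)
            coef = p ** k if j == 0 else q * p ** (k - j)
            b[j] += a[k] * coef
    return b
```

For example, with n = 2 this returns b(1) = a(1) + p a(2) and b(2) = q a(2), which matches S = a(1) X(1) + a(2) (p X(1) + q X(2)) expanded by hand.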
More generally, we can work with more complex auto-regressive processes with a covariance matrix as general as possible, then compute S as a weighted sum of the X(k)'s, and find a relationship between the weights and the covariance matrix, to eventually identify conditions on the covariance matrix that guarantee convergence to the Gaussian distribution.
Generalization to non-random (static) observations
Is randomness really necessary for the central limit theorem to be applicable and provable? What about the following experiment: let each X(k) take only the values 0 and 1, and compute the normalized S for every one of the 2^n possible binary vectors (X(1), ..., X(n)).
Do these 2^n normalized values of S (generated via this non-random experiment) follow a Gaussian distribution as n tends to infinity? Ironically, one way to prove that this is the case (I haven't checked if it is the case or not, but I suspect that it is) would be to randomly sample m out of these 2^n values, and then apply the central limit theorem to the randomly selected values as m tends to infinity. Then by increasing m until m is as large as 2^n we would conclude that the central limit theorem also holds for the non-random (static) version of this problem. The limiting distribution definitely has a symmetric, bell-like shape, just like the Gaussian distribution, though this was also the case in our above "case 4" example -- yet the limit was not Gaussian.
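Under one concrete reading of this experiment (each X(k) restricted to the values 0 and 1, which is my assumption, since the article does not spell the construction out), the 2^n values of S can be enumerated exhaustively for small n:

```python
from itertools import product
from math import sqrt

n = 12
c = -0.5
a = [k ** c for k in range(1, n + 1)]        # a(k) = k^c

mean_s = 0.5 * sum(a)                         # E(X(k)) = 1/2 for X in {0, 1}
std_s = sqrt(sum(ak * ak for ak in a) / 4.0)  # Var(X(k)) = 1/4 for X in {0, 1}

# All 2^n values of the normalized weighted sum, one per binary vector.
values = [
    (sum(ak * xk for ak, xk in zip(a, bits)) - mean_s) / std_s
    for bits in product((0, 1), repeat=n)
]
```

By construction this empirical distribution has mean 0 and variance exactly 1, and it is symmetric around 0; whether its shape approaches a Gaussian as n grows is precisely the open question raised above.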
Other interesting stuff related to the Central Limit Theorem
There is a lot of interesting stuff on the Wikipedia entry, including about the Liapounov condition. But the most interesting things, at least in my opinion, were the following:
Another potential generalization consists of developing a central limit theorem that is based on L^1 rather than L^2 measures of centrality and dispersion, that is, the median and absolute deviations rather than the mean and variance. This would be useful when the observations come from a distribution that does not have a mean or variance, such as Cauchy.
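As a toy illustration of the L^1 idea (my own sketch, not from the article): for Cauchy deviates the sample mean never stabilizes, but the sample median and the median absolute deviation (MAD) do, so they are natural candidates for centering and scaling in an L^1 version of the theorem.

```python
import math
import random
import statistics

random.seed(7)

def cauchy():
    """A standard Cauchy deviate via the inverse-CDF method."""
    return math.tan(math.pi * (random.random() - 0.5))

sample = [cauchy() for _ in range(100_000)]

center = statistics.median(sample)                          # L^1 location
scale = statistics.median(abs(v - center) for v in sample)  # L^1 spread (MAD)

# For a standard Cauchy, the median is 0 and the MAD is 1
# (the quartiles sit at -1 and +1), even though no mean exists.
print(center, scale)
```

An L^1-normalized sum would then use (S - median) / MAD in place of (S - E(S)) / Stdev(S).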
Also, does the limit distribution in case 4 depend on the distribution of the X(k)'s -- in this case uniform -- or is it a universal distribution, the same regardless of the X(k)'s distribution? Unfortunately, it is not universal: after trying the square of uniform deviates for the X(k)'s, the limit distribution was not symmetric, and thus different from the one obtained with uniform deviates.
4. Appendix: source code and chart
Below is the source code (Perl) used to produce the simulations:
$seed = 100;
srand($seed);       # seed the random number generator
$c = -0.5;          # the exponent in a(k) = k^c
$n = 10000;
open(OUT, ">out.txt");
for ($m = 0; $m < 10000; $m++) {
  $den = 0;
  $num = 0;
  $ss  = 0;
  for ($k = 1; $k <= $n; $k++) {
    $r   = rand();                 # uniform deviate on [0, 1]
    $aux = exp($c * log($k));      # a(k) = k^c
    $num += $r * $aux;             # running weighted sum
    $den += $aux;                  # sum of weights
    $ss  += $aux * $aux;           # sum of squared weights
  }
  $dev  = $num / $den;                       # weighted average
  $std  = sqrt(1/12) * (sqrt($ss) / $den);   # its stdev; 1/12 for Uni[0,1]
  $dev2 = ($dev - 0.5) / $std;               # normalized deviate
  print OUT "$m\t$dev2\n";
}
close(OUT);
Also, Figure 2 below is referenced earlier in this article.
Figure 2: Two of the weight sets clearly produce non-Gaussian limit distributions
Comments
@Matthew - My answer is no based on my investigations, but it is worth double-checking.
What about a(k) = 1/k^(3/4)? Would that converge to Gaussian?
@Carlos, I know that the Cauchy distribution doesn't have any moments, and there are other exceptions. This does not invalidate my article; instead, it makes it even more interesting, as I propose to investigate a CLT where normalization is done via L1 metrics such as the median (the Cauchy distribution does have a median).
Also, I know that the NN (nearest neighbor) distance distribution is exponential. But you haven't read my paper: sums of exponentials, once normalized, converge to a Gaussian. I never claimed that distances to neighbors, for a homogeneous Poisson process, are Gaussian (that would imply that some distances are negative). Indeed, my paper is not even about homogeneous Poisson processes, but about processes with a radial intensity; the homogeneous case is a particular version, and in the non-homogeneous case, distances follow neither Gaussian nor exponential distributions.
Finally, as Mark wrote, independence and identical distribution are NOT necessary for a CLT.
@Carlos Aya - The classical CLT, which assumes existence of the mean and variance and independent, identically distributed variables, does require all those things. But there absolutely are CLTs which do not require the variables to be independent or identically distributed. In fact, there are CLTs for some cases in which not only does a mean not exist, but not even a (1 - epsilon) moment exists; I proved such a CLT as a homework assignment in graduate school.
That said, convergence to a Normal can be extremely slow for some of the more exotic CLTs. A sample size in the millions might be required to get as close an approximation as a sample size of 30 would give in a nice CLT case. And where is the approximation particularly poor? In the tails, which are what many of the most crucial probability calculations, such as risk, are based on.
The Fundamental Data Science Theorem: "Oversimplification and hype never go out of style." Approximately having a normal distribution of mean 0 and variance 1 is not the same as having a normal distribution of mean 0 and variance 1. Indeed, the approximation can be quite poor, especially in the tails, even for rather large values of n. Why should we care about the tails of the distribution? Because that is exactly what some of the most important life-and-death calculations are based on. Risk analysis, whether for a potentially catastrophic failure (for example, a nuclear power plant) or for the ability of a financial institution or other company to withstand adverse business or macroeconomic circumstances, depends crucially on the tails of the distribution. Oversimplified analysis based on a Normal approximation, not to mention failure to account for dependency among random variables, often results in underestimating serious-consequence risks by 2 or more orders of magnitude. So yes, high school kids can understand the Central Limit Theorem at some level, and data science internship and bootcamp graduates can understand it at some level: a level sufficient to lead them to make risk assessments that are off by orders of magnitude.
Big Data, Data Science and their tools continue to advance at a breathtaking rate. Spurious Correlations per second (SCps) increases with each new generation of tools and analysts. I hereby state, on the record, the existence of a Moore's Law for Spurious Correlations per second (SCps). Mark L. Stone - Dec 2016.
What are the applied aspects of your article?