This crash course features a new fundamental statistics theorem -- even more important than the central limit theorem -- and a new set of statistical rules and recipes. We discuss concepts related to determining the optimum sample size, the optimum k in k-fold cross-validation, bootstrapping, new re-sampling techniques, simulations, tests of hypotheses, confidence intervals, and statistical inference using a unified, robust, simple approach with easy formulas, efficient algorithms and illustration on complex data.
Little statistical knowledge is required to understand and apply the methodology described here, yet it is more advanced, more general, and more applied than standard literature on the subject. The intended audience is beginners as well as professionals in any field faced with data challenges on a daily basis. This article presents statistical science in a different light, hopefully in a style more accessible, intuitive, and exciting than standard textbooks, and in a compact format yet covering a large chunk of the traditional statistical curriculum and beyond.
In particular, the concept of p-value is not explicitly included in this tutorial. Instead, following the new trend after the recent p-value debacle (addressed here by the president of the American Statistical Association), it is replaced with a range of values computed on multiple sub-samples.
Our algorithms are suitable for inclusion in black-box systems, batch processing, and automated data science. Our technology is data-driven and model-free. Finally, our approach to this problem shows the contrast between the data science unified, bottom-up, and computationally-driven perspective, and the traditional top-down statistical analysis (see here) consisting of a collection of disparate results that emphasizes the theory.
1. Re-sampling and Statistical Inference
2. Generic, All-purposes Algorithm
We are dealing with a set of N (possibly multivariate) observations, called population. We want to split it into M subsets called sub-samples, each with n observations. In each sub-sample, observations may or may not be duplicated. The total number of data points in all the sub-samples is nM. Usually, nM is less than or equal to N, however here, we allow nM to be of any size, even larger than N. The purpose is to study the empirical distribution of some statistical quantities of interest, such as mean, variance, percentiles, correlations, mean squared error, and so on. In particular, we want to determine how large the sample size N must be, in order to achieve a pre-specified level of accuracy, typically measured by the width of some confidence intervals.
In each sub-sample, observations are drawn from the population, either with or without replacement. Whenever possible, drawing without replacement is the preferred method as it leads to maximum variance reduction, with no duplicated observations. Interesting cases include:
If you want to measure some quantity T, say the mean value, you can do it using the entire population with N observations, or using the M sub-samples. The formula below is fundamental to solve most statistical inference problems in this context.
1.1. Main Result
If V denotes the variance of some estimated quantity T using the entire population, and W the variance using the M sub-samples, then:
Here s is a positive constant, and the weight w(k) represents the multiplicity of the k-th observation across the M sub-samples. It is assumed that the N observations are independently and identically distributed, and that T can be computed as a combination of sums over the observations: this is the case for the estimated mean, variance, correlation, and many other estimators. Even when these assumptions are violated, the above results can still be used as a rule of thumb, thanks to the central limit theorem. An example is discussed in section 3.
We are mostly interested in the ratio F. In particular, if sampling is done without replacement, then F = SQRT(N / (nM)). The quantity F represents the ratio of the confidence interval widths, when comparing the re-sampled estimate (numerator) versus the population estimate (denominator). Thus F is a precision indicator used to determine ideal sample sizes. The smaller F, the better. Note that F is always larger or equal to 1. Also, the ratio of two F's associated with two different re-sampling schemes can be used to determine which one is the most efficient.
1.2. Sampling with or without Replacement
We have two cases:
In both cases, the weights can be explicitly and efficiently computed at once for multiple values of n, using the algorithm in section 2. Thus, it is easy to compute F. In addition, we have the following results:
If sampling without replacement, then
If sampling with replacement, on average we have:
Also, in that case, the expected number of distinct (non duplicate) observations across the M sub-samples, is equal to
These results are easy to prove. The proof is left as an exercise, see also here.
Here we illustrate the computation with the following simple example, with N = 5, M = 2, and n = 3:
In this case, w(1) = 0, w(2) = 2, w(3) = 0, w(4) = 3, w(5) = 1.
1.4. Optimum Sample Size
The F, V and W statistics can be used to determine the minimum sample size needed to achieve the desired level of accuracy for the estimator T. The width of your confidence interval being proportional to 1 / SQRT(N), by multiplying N by a factor 4, you increase accuracy by a factor 2.
Another option to determine the sample size, especially when some of the assumptions are violated (the fact that the observations must be identically and identically distributed) consists in extrapolating the variance V or W as N increases, to find when N is large enough so that (say) V is small enough. This is discussed in section 3.
1.5. Optimum K in K-fold Cross-Validation
The number of sub-samples used in cross-validation is usually denoted as K (rather than M) and one of the main problems is to determine the optimum K. Typically, K is small, between 2 and 20. The statistics T of interest, in this case, is a goodness-of-fit metric, for instance the mean squared error for predictions computed on the control sub-sample. The model is trained on K-1 sub-samples, called training sets. One approach to this problem is to plot T as a function of K, and check when an increase in K stops producing a significant improvement (error reduction) in T. Then you reached the ideal K. This can be automated using our elbow rule algorithm.
1.6. Confidence Intervals, Tests of Hypotheses
Confidence intervals (CI) for an estimate T are easy to obtain. The first step consists of computing the empirical percentiles for T, based on M sub-samples, each with n observations. Then, a 90% CI is defined as [T.05, T.95] with
To narrow the width T.95 - T.05 of the 90% confidence interval, one must increase n and N.
To test with a 90% confidence level whether an estimate T is equal to a particular, pre-specified value t, one has to check whether t is in the 90% confidence interval [T.05, T.95].
All of this is illustrated in section 3.
2. Generic, All-purposes Algorithm
In this section we share a generic algorithm that performs most of the computations described earlier.
2.1. Re-sampling Algorithm with Source Code
This algorithm generates the M samples, each with n observations, based on the original data set with N observations. It can perform re-sampling with or without replacement, and covers all cases. In the case of re-sampling without replacement, it sequentially browses the list of N observations, incrementally building the sub-samples by adding one new observation each time. Thus each sub-sample consists of a block of n consecutive observations from the original data set. This feature is useful when processing time series data. The sub-samples contain duplicate observations if and only if nM is larger than N.
The algorithm computes the estimate T for each sample and each value of n, as well as the weights w(1), w(2), and so on, for each n. Here, T is the mean. In section 3, a different version computes the correlation for bivariate data, instead of the mean. The computations are done efficiently: the computational complexity is O(nM).
See picture below, or download the text file version here.
The above version performs re-sampling without replacement. For re-sampling with replacement, replace the line
In section 2.2. we discuss sampling without replacement, picking up observations randomly rather than sequentially.
2.2. Alternative Algorithm
If the N observations in the original population are somewhat clustered, it is better not to create sub-samples consisting of blocks of successive observations, when sampling without replacement. In this case, one can proceed iteratively as follows:
The number of trials required at iteration k, to find an observation that has not been picked up already in a previous iteration, is equal to N / (N - k + 1) on average. Thus the computational complexity of this procedure is
2.3. Using a Good Random Number Generator
Modern programming languages and even Excel provide reliable pseudo-random number generators, capable of generating up to 10^15 distinct values. This limit is due to machine precision, but it is more than enough for our purpose, especially since re-sampling methods are particularly useful for relatively small data sets.
However, Perl is a notable exception, capable of generating only 32,767 distinct pseudo-random numbers; see here for details. This is why we created our own generator. With our generator, the k-th deviate is equal to the fractional part of k^b log(k), with b = (1 + SQRT(2)) / 7. These deviates are uniformly distributed on [0, 1] and exhibit no auto-correlation. Proving the random character of these deviates is an interesting and difficult number theory problem, beyond the scope of this tutorial.
We also used our generator to create the original data set with N observations, in section 2.1.
In this section, we use a more complex data set to illustrate the concepts discussed earlier, to obtain new theoretical results, and to create new statistical recipes. The framework described here could be called statistics 2.0. One amazing feature discussed in section 3.4 is a recipe to design an estimator more accurate than the best possible estimator available for a fixed value of N, without increasing N (the number of observations) in essence seemingly working with much more information than the data set actually contains, as if N was larger than it actually is.
3.1. A Challenging Data Set
The data set consists of N bi-variate observations (x(k), y(k)) with k = 1, 2, ..., n. It is built as follows: x(k) is the fractional part of b1 * k, and y(k) is the fractional part of b2 * k, with b1 = - 1 + SQRT(5) / 2 and b2 = 2 / SQRT(5). The data (the two variables) is stored in two arrays a1 and a2, using the code below:
The data set is a realization of a bi-variate perfect process with N = 100,000 points. These processes are studied here and here. In particular, each of the two variables exhibits strong, long-range (indeed, infinite range) auto-correlations. The exact values of these auto-correlations are known, making this data set a good candidate to benchmark statistical tests. The cross-correlation T between the two variables is also known and equal to
In particular, with the values of b1 and b2 chosen here, the cross-correlation, as N tends to infinity, is equal to T = 1 / 20, see here. Pretty close to zero, but distinct from zero. The brackets in the above formula represent the fractional part function.
We make statistical inference about the cross-correlation T, using M = 20 sub-samples, each containing up to n = 5,000 points. Sampling is done without replacement, and nM = N. So there is no duplicate observation in the sub-samples. The source code, adapted from section 2.1., becomes
The source code is available in text format, here.
3.2. Results and Excel Spreadsheet
We computed the cross-correlation T for all sub-sample sizes n between n = 2 and n = 5,000, for the M = 20 sub-samples, using the code in section 3.1. We then computed (in Excel), for each n, the percentiles T.05 and T.95, as well as the width L = T.95 - T.05 of the confidence intervals, across the M sub-samples. The results are available in this spreadsheet. You can change the percentile thresholds (0.05 and 0.95) in the spreadsheet to interactively visualize the impact on the charts. These thresholds are stored in cells AC2 and AD2. The two charts of interest are shown below.
Figure 2: Upper and lower bounds of the confidence interval for T, as a function of n
The true, theoretical value of T (when N is infinite) is T = 1 / 20 = 0.05. The charts speak for themselves: they provide the confidence intervals and suggest that indeed, based on our computations, a value of 0.05 for T is highly plausible, and that T is clearly not equal to 0 (it would be zero if the two irrational bases b1 and b2 used to build our data set, were not related.) But there is much more to that, as we shall see in section 3.3. and 3.4.
Note that we did not use any statistical theory to arrive to our conclusions, not even the concept of random variable, statistical distribution, or the central limit theorem.
3.3. A New Fundamental Statistics Theorem
In the example in section 3.2, the assumptions of the central limit theorem are severely violated, in particular, the observations are not independent at all. In other cases, observations may have different variances and are not identically distributed. Yet, the width L of the confidence interval (L = T.95 - T.05, using M = 10 or M = 20) is very well approximated by a power function of n as illustrated in Figure 1. Actually, this is also true with L = T.90 - T.10, and indeed, with any percentile thresholds (that is, with any confidence level, to use the standard statistical terminology.)
In our example in Figure 1, the relationship is L = 8.781 / n^0.894 which becomes more and more accurate as n increases (see the dotted line.) Generally speaking, the relationship is L = A / n^B, where A and B are two constants that depend on your data set. This leads to the following theorem:
Theorem: The width L of any confidence interval is asymptotically equal (as n tends to infinity) to a power function of n, namely L = A / n^B where A and B are two positive constants depending on the data set, and n is the sample size. We discuss here the conditions required for the theorem to be valid.
The exponent B bears some resemblance with the Hurst exponent in time series, see here and here. The standard case, when the data is well behaved and satisfies the assumptions of the central limit theorem (CLT), yields B = 1/2. The constant A is linked to the estimated variance attached to a single observation, and is further discussed in this article. Any departure from B = 1/2 indicates that the data has some patterns, and that the CLT assumptions are violated. The same is true with the Hurst exponent. In some sense, the above theorem is a generalization of the CLT. Other generalizations exist, see for instance here, but the one featured in our theorem is of an entirely different nature. It can be used to very accurately determine the value of n (and thus N) needed to achieve a specific level of accuracy for an estimator T, even when the CLT assumptions are severely violated.
In our example, we picked up T.05 and T.95, as opposed to (say) T.25 and T.75, because the thresholds 0.05 and 0.95 provide a better fit with the power function, and especially, a more symmetrical confidence interval. Finally, L is the difference between two values of the empirical percentile distribution for T. This distribution is known to converge to that of a normal distribution under certain conditions, and this fact can be used if you are interested in digging into the mathematical details. Note that T.95 and T.05 are not independent random variables.
You can use our theorem on smaller data sets, for instance with M = 10 and n = 2,000, that is, N = 20,000 observations.
3.3. Some Statistical Magic
With N between 98,000 and 100,000 observations, the value for the cross-correlation estimator T discussed in section 3.1 and 3.2 is quite stable, and oscillates between 0.049962093 and 0.050143847, with an average of 0.050039804. The exact value (when N is infinite) is 0.050000000. There is an intuitive way to get a much more precise estimate without increasing the sample size, and this applies to any data set and any estimator.
Let us look at the value computed on each of the 20 sub-samples, when n is between 2,500 and 5,000, using increments of 1,000 in n. In particular, let us focus on the 5% and 95% percentiles (T.05 and T.95). The median value computed on these 52 percentile data points is 0.049992500.
Note that in practice, the sample size N is determined in advance. You have to assume that the best possible estimate is obtained by taking all the 100,000 observations into account. In our case, this yields -- by pure chance -- an unexpectedly very good value of 0.050011877, more accurate than (say) with N = 99,999 or 99,998, but less accurate than with N = 98,900 (with N = 98,900 the estimated value is 0.050002867 and it is one of the very best that you can get from the data set.) Of course, in practice, on a real data set, there is no way to known that N = 98,900 yields a more accurate value than N = 100,000, and you won't know if N = 100,000 works better or not than N = 99,999. By contrast, the technique described in the previous paragraph is replicable, and also provides an incredibly accurate value.
Let us summarize our findings:
The error reduction factor with our trick is | 0.050039804 - 0.050000000 | / | 0.049992500 - 0.050000000 | = 5.3. If you were to get that kind of improvement simply by increasing the sample size, you would need a sample size about 5.3 x 5.3 = 28 times bigger. Our trick essentially gives you one extra digit of accuracy, without increasing N.
For a related method that accomplishes similar results, download this PDF presentation (originally posted here) by Nathaniel E. Helwig, entitled Bootstrap Confidence Intervals, and look for second-order accurate intervals starting at slide 29.
How does this work?
We have simply smoothed out little variations in the estimated value around N = 100,000, using an high-pass filter to sharpen the signal. This is similar to using an high-pass filter in image processing to remove noise and increase sharpness. This type of filter (in image processing) also uses medians rather than averages. Averages are actually used to do the opposite effect: blur the image, and the filter is then called a low-pass filter. Also, because at N around 100,000, the estimated values are mostly above the exact value, using medians that include T.05 allows you to correct for the tiny bias. This would also work if the opposite was true, that is, if the values were mostly below the exact value, thanks to using T.95 as well. And this feature (the tiny bias) is present in any data set, when looking at short windows such as N between 98,000 and 100,000.
Does this contradict entropy principles?
It does not contradict entropy principles, not more than accuracy boosts obtained by removing outliers (thus reducing the sample size) that do better than increasing the sample size.
The basic principle is that the insights you get from a data set are based on the amount of information that it contains. If you use all the information in your data set (and the standard estimator based on the N observations does that) then there is no way you can get anything better unless you increase the sample size. This is true in theory, but not always in practice: removing errors is a counter-example, with an error-free data set seemingly containing more information than if errors were added to it.
However, in compliance with the entropy principle, over sub-sampling (creating sub-samples with duplicated observations, resulting in nM larger than N) in hopes of getting more accurate estimators, would not solve the problem. We have established this fact in section 1.1, proving that F is always above 1, or equal to 1 if and only if all the information in the data set is used. Whether it is used only once or duplicated, does not change the fact that F can not be lower than 1. Thus the confidence level can not be improved that way.
It is sometimes said that data science needs statistics to make things work. Here it is the other way around, with statistics benefiting from mathematical discoveries arising from applied data science research, to improve existing statistical methods.
In this article, we discussed a new way to process, analyze and extrapolate data sets, leading to a new fundamental statistical theorem, and a way to find more information in a data set, than what traditional entropy and statistical theory suggest. It allows you to increase the accuracy of statistical estimators without increasing the sample size. The boost in accuracy is equivalent to increasing the sample size by a factor 25. The magic in this technique is not more spectacular than the magic used to enhance blurred images and make them look perfect. Indeed, these two extrapolation techniques are closely related.
Under the umbrella of re-sampling, many statistical problems are solved in a simple way, ranging from optimizing cross-validation experiments to designing sound tests of hypotheses when traditional assumptions imposed on the data set are severely violated.
All of this is discussed without using even basic statistical concepts such as random variable, p-value, or statistical distribution, making the material not only accessible to the layman, but also easy to integrate in black-box machine learning systems.
To not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on on LinkedIn, or visit my old web page here.