.

# Confidence Intervals Without Pain

We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations available in your data set. In addition we propose a mechanism  to sharpen the confidence intervals, to reduce their width by an order of magnitude. The methodology works with any estimator (mean, median, variance, quantile, correlation and so on) even when the data set violates the classical requirements necessary to make traditional statistical techniques work. In particular, our method also applies to observations that are auto-correlated, non identically distributed, non normal, and even non stationary.

No statistical knowledge is required to understand, implement, and test our algorithm, nor to interpret the results. Its robustness makes it suitable for black-box, automated machine learning technology. It will appeal to anyone dealing with data on a regular basis, such as data scientists, statisticians, software engineers, economists, quants, physicists, biologists, psychologists, system and business analysts, and industrial engineers.

Power curve fitting: see here

In particular, we provide a confidence interval (CI) for the width of confidence intervals without using Bayesian statistics. The width is modeled as L = A / n^B and we compute, using Excel alone, a 95% CI for B in the classic case where B = 1/2. We also exhibit an artificial data set where L = 1 / (log n)^Pi. Here n is the sample size.

Despite the apparent simplicity of our approach, we are dealing here with martingales. But you don't need to know what a martingale is to understand the concepts and use our methodology.

Contents

1. Principle

2. Examples

• Estimator used in nearest neighbors clustering
• Weighted averages when dealing with outliers
• Correlation coefficient estimated via re-sampling
• Auto-correlated time series, U-statistics

3. Counterexamples

4. Estimating A

5. Estimating B

• Getting more accurate values
• Getting even more accurate values

5. Theoretical Background

• Connection with the re-scaled range and the Hurst exponent
• General case
• Another approach to building confidence intervals

6. Conclusions

1. Principle

We have tested our methodology in cases that are challenging when using traditional methods, such as a non-zero correlation coefficient for non-normal bi-variate data. Our technique is based on re-sampling and on the following, new fundamental theorem:

Theorem: The width L of any confidence interval is asymptotically equal (as n tends to infinity) to a power function of n, namely L = A / n^B where A and B are two positive constants depending on the data set, and n is the sample size. The requirements for the theorem to be valid are discussed in section 6.

The standard (textbook) case is when B = 1/2. Typically, B is a function of the intrinsic, underlying variance attached to each observation, in your data set. Any value of B larger than 1/2 results in confidence intervals converging faster than what traditional techniques are able to offer. In particular, it is possible to obtain B = 1. Departure from B = 1/2 can be caused by unusual data (see sections 2.3, 2.4, and 6.3) or by a-typical estimators (see section 2.1 and 2.2.) Indeed, testing whether B = 1/2 can be useful to check if your data has hidden patterns such as uneven variances.

Our new technique is described in details (with source code, spreadsheet and illustrations) in our previous article, here. When reading that article, you may skip section, 1, and focus on section 2, and especially section 3, where all the results are presented.

2. Examples

We provide here a few examples where the exponent B is different from 1/2.

2.1. Estimator used in nearest neighbors clustering

An example of a non-standard case is the following. In the context of supervised classification, one sometimes uses a function of the distances between a point x outside of the training set, and its n nearest neighbors in the training set, to decide which cluster x should be assigned to. This function uses decaying weights, with the highest weights attached the the closest neighbors within a same cluster. Depending on how fast these n weights decay, the resulting cluster density estimators measured at location x, may have an exponent B different from 1/2. Also, if the confidence intervals attached to two or more clusters overlap, it means that x could belong to any of these clusters.

2.2. Weighted averages when dealing with outliers

One way to eliminate or reduce the impact of outliers, when estimating the mean T, is to use weighted averages with positive weights, the weight attached to each observation being a decreasing function of the distance between the median and the observation in question. It is defined as follows:

We assume here that the n independently and identically distributed observations are ordered according to their distance to the median, with the closest one corresponding to the first term in the sum. If b < 1/2, then B = 1/2 as in the standard case. An example is when all weights are identical (b = 0). If b is in ]0.5, 1[ then it is easy to prove (see here) that B = 1 - b.The same is true for other estimators, not just the mean.

The general rule is as follows. If you use weights w(k) decaying more slowly than 1 / SQRT(k), then B = 1/2. If the weights decay faster than 1 / SQRT(k) but more slowly than 1 / k, then you still end up with a power function, but B is strictly between 0 and 1/2. More specifically, with the notation

the exponent B is equal to

In particular, s(n) is minimum and equal to 1 / SQRT(n) if all the weights w(k) are identical. In that case, B = 1/2.

2.3. Auto-correlated time series, U-statistics

We provide here an example where B = 1. In a time series X(1), X(2), ..., the most extreme case (producing the highest value for B) is when X(k + 1) depends solely on X(k). This is actually the case in the example discussed in section 6.3.

For instance, in the artificial case where X(k + 1) = a X(k) + b with |a| < 1, the mean (or any other estimator) is a function of X(1) and the constants a and b alone. The variance of the mean is

and thus B = 1. The same upper bound B = 1 can be achieved with actual (non-degenerate) stationary auto-regressive time series, see the answer to a question I asked on CrossValidated.com, here

The problem with time series is that if you re-shuffle the observations, you lose the auto-correlation structure, and thus B may revert back to B = 1/2. Can you find an estimator that keeps a value of B higher than 1/2 even if you reshuffle the observations? The answer is positive if you consider U-statistics: they provide estimators that can converge faster (B > 1/2) to the true value, than standard estimators. See this paper (originally posted here), especially the sentence above example 3.11 on page 3, and this article. The most well-known U-statistic is Gini's mean difference, defined as

with p = 1. While more efficient than traditional dispersion estimators, its B exponent is also 1/2.  The interesting case is when p tends to infinity: then G(n) is the range (maximum minus minimum observation). If the observations are uniformly and independently distributed on [0, 1], then the range has a Beta(n - 1, 2) distribution, thus its variance is 2(n - 1) / [(n + 2) (n + 1)^2] and B = 1. It would be interesting to see what happens if the observations have a fat tail distribution instead. If the distribution is exponential, B = 0 (see here for the proof.) If the distribution is Gaussian, B = 1/2 (see here.)

2.4. Correlation coefficient estimated via re-sampling

Below is an illustration for the correlation coefficient using a bi-variate artificial data set that simulates somewhat random observations. The illustration in this section is based on re-sampling, using the approach discussed in sections 2 and 3 in this article.

The resulting value of B is 0.46. The boosting technique (to improve B) has not been used here, but it is described and illustrated in section 3.3, in the article in question. The power function mentioned in our above theorem, fitted to this particular data set, is represented by the red, dotted curve, in the top chart below.

The simulated data set was built as follows:

More examples with more details, can be found here. One way to get more accurate values for A and B is to re-do the same computations using 10 different re-ordering of the data set, by randomly shuffling the observations, then averaging A and B across these 10 sets: see discussion in section 5.

3. Counterexamples

The theorem is very general, and applies to most data sets. Exceptions consist of odd, artificially manufactured data sets that are not found in business applications, or non-stationary processes such as Brownian motions (see also chapter 2, here.)

For instance, the data set created in my new spreadsheet (download it here) consists of 10,000 observations: the first 5,000 are assigned to sample x[1], and the remaining ones to sample x'[2] (respectively column A and D in the spreadsheet). Observations in each sample are independently and uniformly distributed, and the correlation between the two samples is zero. Yet the confidence intervals for the mean have a width L(n) of the form A / (log n)^B, here with A = 1 and B = Pi: see column J. You can change the fitting curve (column K) as well as the values of the observations in x[1], and keep everything else unchanged, and get whatever curve you want for L(n), even a trigonometric function. This is possible only because the data has been created to make this happen: while the first-order cross-correlation between the two samples is nil, the second-order cross-correlations are extremely high, see cell P9.

Usually, with these counterexamples, if you randomly sort the data set, re-compute the estimator and its width L on the reshuffled data for various n, the fitted curve for the width of the interval will be a totally different function. That's actually how you recognize that these data sets are artificial. However, even with real-life data, you are sometimes faced with data glitches that produce the same issues, and that are hard to detect as in my above example.

Yet there are some estimators that truly do not have a power function for the width of their confidence interval. An example is provided in my article on Poisson processes. It is about a local estimator of the intensity of a Poisson process based on the distances to the n nearest neighbors. The width L is asymptotically equivalent to log n / SQRT(n). See here, or download the article: this estimator is described page 118, just above Remark 1.

Our main theorem can be generalized as follows to cover even more cases, using a second order approximation:

The constant C may be positive or negative. Even then, the logarithm of the width L is asymptotically equivalent  to a curve with only two parameters: log A and B.

4. Estimating A

The constant A attached to the power function (see theorem), is related to the intrinsic variance present at the individual observation level, in your data. We provide its value for common estimators (up to a factor that depends only on the confidence level), in the ideal case when observations are independently and identically distributed with an underlying normal distribution. These values can still provide a good approximation in the general case.

The formula for the median is a particular case of the p-th quantile with p = 0.5. All the values in the second column represent the unknown theoretical value of the quantities involved. In practice, these values are replaced by their estimates computed on the data set. These estimates converge to the true values, so using one or another does not matter, as far as the correctness of the table is concerned.

The exact formula for the correlation when it is not zero and the normal assumption is not satisfied, is very complicated. See for instance this article. By contrast, dealing with any correlation (or even more complicated estimators such as regression parameters) is just as easy as dealing with the mean, if you use our methodology. Even if none of the standard assumptions is satisfied.

Finally, A, B and n provide more useful information about your estimator, than p-values.

5. Estimating B

We denote as L(n) the value of the interval width computed on a sample with n observations. The standard way to fit L(n) with the power curve A / n^B is not very efficient, resulting in high volatility. Here we offer strategies to get much more accurate and stable values for A and B. All the illustrations in this section are based on re-sampling, using the approach discussed in sections 2 and 3 in this article.

5.1. Getting more accurate values

The following strategies significantly improve the accuracy when estimating B:

• Use 10 different re-ordering of the data set, by randomly reshuffling the observations, then average B across these 10 sets
• Focus on small values of the sample size (less than n = 10,000.)
• When fitting L(n) with A / n^B, do not use all the values of n, but only a fraction of them, that are unequally spaced: for instance n = 4, 9, 16, 25, 36, 49, 64 and so on. Cubes work even better if your samples are large enough.
• Large confidence intervals, with lower and upper bounds equal to the 2.5 and 97.5 empirical percentiles, work better than smaller ones. Also, M = 20 or M = 10 (see algorithm in section 3 in this article) work better than M = 2 or M = 5. Large values of M (say M = 50) do not offer extra benefits.

We applied these strategies to compute a confidence interval for B, on a data set consisting of 4 non-overlapping samples (M = 4), each with n = 2,500 observations, drawn from a population of N = 10,000 observations. We repeated this procedure 10 times, by re-shuffling the 10,000 observations 10 times in the original data set. The value obtained for B is 0.484, while the theoretical value (with an infinite sample) would be 0.500. The data set, computations, and results are available in this spreadsheet (10 MB.)

5.2. Getting even more accurate values

The estimated values for B are very volatile due to the fact that L(n) is computed recursively based on embedded samples of increasing sizes. Indeed, { L(n) } is a martingale. Instead of trying to fit L(n) with a power curve, you can fit the much smoother integrated L(n), denoted as J(n), with an appropriate curve, then estimate B using J(n) rather than L(n). It turns out that J(n) is also well approximated by a power curve with the same B, at least as a first order approximation. If for the theoretical (exact) value, L(n) = A / B^n, then we have:

Note that B is in ]0, 1[, and in most cases B = 1/2. If you ignore the term B / N in the above formula, then you also have a power curve for J(n). If you include that term, your estimation will be more accurate, but the model fitting technique (to find B) is a tiny bit more tricky.

Instead of using J(n), you could use the median value of L(n) computed on the first n observations in your sample. This median is denoted as Q(n). It provided the most accurate results for B, among the four methods tested.

We tested the four methods -- L(n), approximated J(n), bias-corrected J(n) denoted as J*(n), and Q(n) -- using M = 2 samples each with up to n = 10,000 observations, reshuffling the samples 50 times to obtain the 50 x 4 values of B shown in the above figure. The estimated value of B, computed over the 50 tests using Q(n), is 0.49. The data set was designed so that the theoretical B should be the classic 1/2 value as n tends to infinity. The above figure gives you an idea of the confidence intervals for B. Note that the method based on L(n) is by far the worst, with bias and high volatility, including 5 outliers (very low values of B) not shown in the figure.  All the details, with additional comments, are found in this spreadsheet.

6. Theoretical Background

We first compare the asymptotic formula for the re-scaled range, based on the Hurst exponent H, with our asymptotic formula for L(n), based on the B exponent. We then explain how the formula L(n) = A / B^n can be derived under general assumptions, using some heuristics. All the illustrations in this section are based on re-sampling, using the approach discussed in sections 2 and 3 in this article.

6.1. Connection with the re-scaled range and the Hurst exponent

We assume here that the number of samples is M = 2. Given a sample with n observations, the re-scaled range, in its simplest form, is defined as the range of the observations (the difference between the maximum and the minimum) divided by the standard deviation computed on the n observations, see here (or alternatively here) for details.

Also, if M = 2, the width of the confidence interval L(n) is proportional to |T'(n) - T''(n)| where T'(n) and T''(n) are the values of the estimator of interest computed for each sample of size n. Let us denote as p the proportion in question, measuring the level of the confidence interval, so that

Under some general conditions, T'(n) and T''(n) have a Gaussian distribution. Thus, L(n) has a folded normal distribution, and its expectation is

Now, let's introduce the following notations:

For instance, if the estimator T in question is the mean, then Z'(k) is the k-th observation in the first sample. Now we can write the well known asymptotic expansion for the re-scaled range, as

where H is the Hurst exponent. In the above formula, if you replace V'(n) by Var[T'(n)] and R(n) by a constant R, it becomes

Since the two samples are independent, Var[T''(n)] ~ Var[T'(n)] and thus H = B

6.2. General case

We assume here that we have M samples, say M = 20 (a small number.) Let us assume that T'(n), T''(n) and so on, computed on each sample, are also independent with asymptotically (as n tends to infinity) the same variance denoted as Var[T(n)]. Then the (say) 95% confidence interval for T has a width L(n) proportional to the range of its values computed on the M samples. Let p be the proportion in question, depending on the confidence level. In short, asymptotically, E[L(n)] is the expectation of the range of M independent Gaussian variables with same mean E[T(n)],  and same variance Var[T(n)], multiplied by p. Thus, E[L(n)] ~ A / n^B is still proportional to SQRT(Var[T(n)]).

Intuitively, when the observations have strong, long-range auto-correlations, the variance of any estimator -- which is itself an increasing function of the variance in the observations -- is small, and thus B is high. See this spreadsheet for such an example, with B = 0.8.

6.3. Another approach to building confidence intervals

This framework suggests yet another potential approach to estimating B and obtaining confidence intervals for T:

• Compute T(n) and its empirical percentiles on M independent samples of size n, with increasing values of n. Instead of independent samples, you could use M reshuffled versions of one sample if you don't have much data, but you need to be careful about biases.
• Compute the mean, empirical percentiles, and variance of T(n) across the M samples, for several values of n. The mean and empirical percentiles will give you confidence intervals (CI) of any level for T, for different values of n. You can extrapolate these CI's to any value of n, using model fitting techniques. If you do the same with the empirical variance of T(n), fitting it with a power curve A / n^B, you then get an estimate for A and B.

If you use this approach, you can use sample sizes that are smaller, but the number M of samples must be large enough, at least M = 20. If your samples come from just one sample set that you reshuffle M times, you still need to use a large sample size but focus on nested sub-samples of growing sizes that are not too large, as all your estimated values across all samples, will be identical once you reach the full sample size.

The computations and results, using this approach, are found in this spreadsheet. The data set used here also has an high B exponent, above 0.8. You would expect to find patterns in the data with such a high B, and this is the case here: look at column Y in the Details tab. Below are the model-free 90% confidence intervals for various values of n. The estimator T investigated here is the mean. Its true theoretical value is 0.5. Standard statistical techniques would not work here due to the long-range auto-correlations in the data. In particular, convergence to the theoretical value is much faster than with standard techniques applied to standard data corresponding to B = 1/2.

The iterative computation of the 10,000 medians Q(n) is very slow. If you clear column S in the Details tab in the spreadsheet, computations will run much faster, but you will miss the estimate of B based on these medians. Finally, if you use one sample from the previous spreadsheet, and reshuffle it M times to produce M samples (as opposed to using non-overlapping samples as in the previous spreadsheet) then you kill many of the patterns present in that data set, and as a result your estimated B is much closer to 1/2, and the speed at which confidence intervals converge to the true theoretical value, will be slower, that is, more "normal". This is illustrated in this spreadsheet.

7. Conclusions

Some frameworks can handle unclean data with lack of independence between observations, lack of stationarity, non-Gaussian behavior and so on. For instance, in time series, the Hurst exponent H plays a role very similar to our exponent B. Both lie in [0, 1], take on similar values depending on whether the data is very chaotic (H and B < 1/2), unusually smooth (H and B > 1/2) or well behaved like a standard Brownian motion (H = B = 1/2).

In the theory of martingales (we are actually dealing with martingales here), there is a generalization of the central limit theorem, known as the martingale CLT, stating under which assumptions B = 1/2, even in cases where auto-correlations are strong.

However, I could not find any general framework that deals with accurately extrapolating confidence intervals beyond the size of your data set, allowing you to perform robust statistical inference with all sorts of estimators applied to messy data, without using any statistical model. Traditional re-sampling techniques based on empirical percentiles are of some use. The novelty of our approach is to bring these re-sampling techniques to a whole new level, solving problems thought to be unsolvable, for instance getting confidence intervals much sharper than those obtained with traditional methods, or getting sharp estimates for B, even in the non-standard case when B is smaller or larger than 1/2.

To not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on on LinkedIn, or visit my old web page here.

Views: 15341

Comment