Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation both in Excel and Perl, and discuss our popular model-free confidence interval technique introduced in our original Analyticbridge article, as part of our (open source) intellectual property sharing. This technique has the following advantages:
This is part of our series on data science techniques suitable for automation, usable by non-experts. The next one to be detailed (with source code) will be our Hidden Decision Trees.
Figure 1: Confidence bands based on our CI (bold red and blue curves) - Comparison with traditional normal model (light red anf blue curves)
Figure 1 is based on simulated data that does not follow a normal distribution : see section 2 and Figure 2 in this article. Classical CI's are just based on 2 parameters: mean and variance. With the classical model, all data sets with same mean and same variance have same CI's. To the contrary, our CI's are based on k parameters - average values computed on k different bins - see next section for details. In short, they are better predictive indicators when your data is not normal. Yet they are so easy to understand and compute, you don't even need to understand probability 101 to get started. The attached spreadsheet and Perl scripts have all computations done for you.
1. General Framework
We assume that we have n observations from a continuous or discrete variable. We randomly assign a bin number to each observation: we create k bins (1 ≤ k ≤ n) that have similar or identical sizes. We compute the average value in each bin, then we sort these averages. Let p(m) be the m-th lowest average (1 ≤ m ≤ k/2, with p(1) being the minimum average). Then our CI is defined as follows:
The confidence level represents the probability that a new observation (from the same data set) will be between the lower and upper bounds of the CI. Note that this method produces asymetrical CI's. It is equivalent to designing percentile-based confidence intervals on aggregated data. In practice, k is chosen much smaller than n, say k = SQRT(n). Also m is chosen to that 1 - 2m/(k+1) is as close as possible to a pre-specified confidence level, for instance 0.95. Note that the higher m, the more robust (outlier-nonsensitive) your CI.
If you can't find m and k to satisfy level = 0.95 (say), then compute a few CI's (with different values of m), with confidence level close to 0.95. Then inperpolate or extrapolate the lower and upper bounds to get a CI with 0.95 confidence level. The concept is easy to visualize if you look at Figure 1. Also, do proper cross-validation: split your data in two; compute CI's using the first half, and test them on the other half, to see if they still continue to have sense (same confidence level, etc.)
CI's are extensively used in quality control, to check if a batch of new products (say, batteries) have failure rates, lifetime or other performance metrics that are within reason, that are acceptable. Or if wine advertised with 12.5% alcohol content has an actual alcohol content reasonably close to 12.5% in each batch, year after year. By "acceptable" or "reasonable", we mean between the upper and lower bound of a CI with pre-specified confidence level. CI are also used in scoring algorithms, to provide CI to each score.The CI provides an indication about how accurate the score is. Very small confidence levels (that is, narrow CI's) corresponds to data well understood, with all sources of variances perfectly explained. Converserly, large CI's mean lot's of noise and high individual variance in the data. Finally, if your data is stratified in multiple heterogeneous segments, compute separate CI's for each strata.
That's it, no need to know even rudimentary statistical science to understand this CI concept, as well as the concept of hypothesis testing (derived from CI) explained below in section 3.
When Big Data is Useful
If you look closely at Figure 1, it's clear that you can't compute accurate CI's with a high (above 0.99) level, with just a small sample and (say) k=100 bins. The higher the level, the more volatile the CI. Typically, an 0.999 level requires 10,000 or more observations to get something stable. These high-level CI's are needed especially in the context of assessing failure rates, food quality, fraud detection or sound statistical litigation. There are ways to work with much smaller samples by combining 2 tests, see section 3.
An advantage of big data is that you can create many different combinations of k bins (that is, test many values of m and k) to look at how the confidence bands in Figure 1 change depending on the bin selection - even allowing you to create CI's for these confidence bands, just like you could do with Bayesian models.
2. Computations: Excel, Source Code
The first step is to re-shuffle your data to make sure that your observations are in perfect random order: read A New Big Data Theorem section in this article for an explanation why reshuffling is necessary (look at the second theorem). In short, you want to create bins that have the same mix of values: if the first half of your data set consisted of negative values, and the second half of positive values, you might end up with bins either filled with positive or negative values. You don't want that; you want each bin to be well balanced.
Reshuffling Step
Unless you know that your data is in an arbitrary order (this is the case most frequently), reshuffling is recommended. Reshuffling can easily be performed as follows:
Note that we use 100,000 + INT(10,000*RAND()) rather than just simply RAND() to make sure that all random numbers are integers with the same number of digits. This way, whether you sort alphabetically or numerically, the result will be identical, and correct. Sorting numbers of variable length alphabetically (without knowing it) is a source of many bugs in software engineering. This little trick helps you avoid this problem.
If the order in your data set is very important, just add a column that has the original rank attached to each observation (in your initial data set), and keep it through the res-shuffling process (after each observation has been assigned to a bin), so that you can always recover the original order if necessary, by sorting back according to this extra column.
The Spreadsheet
Download the Excel spreadsheet. Figures 1 and 2 are in the spreadsheet, as well as all CI computations, and more. The spreadsheet illustrates many not so well known but useful analytic Excel functions, such as: FREQUENCY, PERCENTILE, CONFIDENCE.NORM, RAND, AVERAGEIF, MOD (for bin creations) and RANK. The CI computations are in cells O2:Q27 in the Confidence Intervals tab. You can modify the data in column B, and all CI's will automatically be re-computed. Beware if you change the number of bins (cell F2): this can screw up the RANK function in column J (some ranks will be missing) and then screw up the CI's.
For other examples of great spreadsheet (from a tutorial point of view), check the Excel section in our data science cheat sheet.
Simulated Data
The simulated data in our Excel spreadsheet (see the data simulation tab), represents a mixture of two uniform distributions, driven by the parameters in the orange cells F2, F3 and H2. The 1,000 original simulated values (see Figure 2) were stored in column D, and were subsequently hard-copied into column B in the Confidence Interval (results) tab (they still reside there), because otherwise, each time you modify the spreadsheet, new deviates produced by the RAND Excel function are automatically updated, changing everything and making our experiment non-reproducible. This is a drawback of Excel, thought I've heard that it is possible to freeze numbers produced by the function RAND. The simulated data is remarkably non-Gaussian, see Figure 2. It provides a great example of data that causes big problems with traditional statistical science, as described in our following subsection.
In any case, this is an interesting tutorial on how to generate simulated data in Excel. Other examples can be found in our Data Science Cheat Sheet (see Excel section).
Comparison with Traditional Confidence Intervals
We provide a comparison with standard CI's (available in all statistical packages) in Figure 1, and in our spreadsheet. There are a few ways to compute traditional CI's:
As you can see in Figure 1, traditional CI's are very narrow. Note that inflating the traditional CI's by a factor SQRT(k), that is, replacing $F$6+R3 by $F$6+SQRT($F$2)*R3 in cell S3 in our spreadsheet (and similar adjustments in all cells in columns S and T), leads to similar CI's. Indeed, traditional CI's have been designed for the mean, while ours are designed for bin averages (that is, batch averages in quality control), or even individual values (when n=k).This explains most of the discrepancy. Finally, our methodology is better when n (the number of observations) is small (n < 100), or for high confidence levels (> 0.98) or when your data has outliers.
Perl Code
Here's some simple source code to compute CI for given m and k:
$k=50; # number of bins
$m=5;
open(IN,"< data.txt");
$binNumber=0;
while ($value=<IN>) {
$value=~s/\n//g;
$binNumber = $n % $k;
$binSum[$binNumber] += $value;
$binCount[$binNumber] ++;
$n++;
}
if ($n < $k) {
print "Error: Too few observations: n < k (choose a smaller k)\n";
exit();
}
if ($m> $k/2) {
print "Error: reduce m (must be <= k/2)\n";
exit();
}
for ($binNumber=0; $binNumber<$k; $binNumber++) {
$binAVG[$binNumber] = $binSum[$binNumber]/$binCount[$binNumber];
}
$binNumber=0;
foreach $avg (sort { $a <=> $b } @binAVG) { # sorting bins numerically
$sortedBinAVG[$binNumber] = $avg;
$binNumber++;
}
$CI_LowerBound= $sortedBinAVG[$m];
$CI_UpperBound= $sortedBinAVG[$k-$m+1];
$CI_level=1-2*$m/($k+1);
print "CI = [$CI_LowerBound,$CI_UpperBound] (level = $CI_level)\n";
Exercise: write the code in R or Python.
3. Application to Statistical Testing
Rather than using p-values and other dangerous concepts (about to become extinct) that nobody but statisticians understand, here is an easy way to perform statistical tests. The method below is part of what we call rebel statistical science.
Let's say that you want to test, with 99.5% confidence (level = 0.995), whether or not a wine manufacturer consistently produces a specific wine that has a 12.5% alcohol content. Maybe you are a lawyer, and the wine manufacturer is accused of lying on the bottle labels (claiming that alcohol content is 12.5% when indeed it is 13%), maybe to save some money. The test to perform is as follows: check out 100 bottles from various batches, and compute an 0.995-level CI for alcohol content. Is 12.5% between the upper and lower bounds? Note that you might not be able to get an exact 0.995-level CI if your sample size n is too small (say n=100), you will have to extrapolate from lower level CI's, but the reason here to use a high confidence level is to give the defendant the benefit of the doubt rather than wrongly accusing him based on a too small confidence level. If 12.5% is found inside even a small 0.50-level CI (which will be the case if the wine is truly 12.5% alcohol), then a fortiori it will be inside an 0.995-level CI, because these CI's are nested (see Figure 1 to understand these ideas). Likewise, if the wine truly has a 13% alcohol content, a tiny 0.03-level CI containing the value 13% will be enough to prove it.
One way to better answer these statistical tests (when your high-level CI's don't provide an answer) is to produce 2 or 3 tests (but no more, otherwise your results will be biased). Test whether the alcohol rate is
4. Miscellaneous
We include two figures in this section. The first one is about the data used in our test and Excel spreadsheet, to produce our confidence intervals. And the other figure shows the theorem that justifies the construction of our confidence intervals.
Figure 2: Simulated data used to compute CI's: asymmetric mixture of non-normal distrubutions
Figure 3: Theorem used to justify our confidence intervals
Comment
Regarding to shuffeling, how to insure the difference between bins is good (there is no similar bins, so bins are very different), if number of bins is 2^n , then hadamard matrix may be used.
https://en.wikipedia.org/wiki/Hadamard_matrix
But if not...
I am very suprising , no body pays attention to this problem, or may be there is one ?
This has been re-posted (+/-), and so brought back to my attention ... and my immediate reaction was "why bother (doing this by a new method"). Which my immediate reaction the first time around too. So this time I will post some thoughts.
If one is doing work for other people, often/usually you do a walk-through of the methodology/data/limitations, before you get to the fun stuff, aka the Results. The Results are where you want to spend your time, and it's where the customer wants to spend time. The time it takes to explain "C.I.s by binning" is a diversion. Moreover, it brings in two risks: the low quant part of the audience may not understand it (given your fast description), and thus begin to have lingering doubts about your work, whereas the high quant part of your audience will understand it, but possibly not agree with your choice to use an unconventional approach when others are readily available, and thus begin to have lingering doubt about your work.
So I see downsides of using this approach, but no compelling upside. The binning method means that each observation can be in one and only one bin. So it's sampling without replacement. I'm not the greatest of theoretical statisticians but I have it in mind that sampling with replacement is generally accepted as more likely to root out the true variability of the data in hand. Data that lives in Excel by my definition cannot be "Big Data," and it's easy to kick it out to R or some other s/w that makes bootstrapping easy.
Why is this preferable to doing an empirical estimation of confidence intervals by bootstrap resampling, or monte carlo resampling? (For estimating CI's, I would say this difference between the two is that bootstrapping grabs samples with replacement that approximate the size of the original sample, while monte carlo would tend to use smaller samples.)
I can see where, for really really big data, binning in one pass could be materially faster than bootstrapping. (Especially if you eliminate the need to do a randomized sort of all the data before assigning bins.) But is the time saved worth it for datasets of moderate size, where a bootstrapping or monte carlo approach of say 1000 iterations might take only take a couple of minutes at worst case ?
I work with health care data, none of which ever seems to have a normal distribution, and routinely use the monte carlo approach to put a range of fuzziness on what would otherwise be point estimates. It's not at all difficult, and not very time-coming, measured in either human or computer effort.
Please help me understand under what conditions this binning approach would be a better choice to use.
This is just a terminology issue. When I present a result I cannot say this is "confidence interval" because of people assume this as a regular statistical CI. Nor I don't want to explain this methods to an audience. I'd prefer to calculate CI using a regular method for a presentation.
This method is good at practice - fast iterating through hypothesis and data sets. There is no doubting the fact that data scientists need more methods like this.
Dmitry and Thomas: How can you claim it's wrong when it produces the exact same results as "standard" theory, asymptotically, for normal distributions? Actually it is easy to explain why it produces the same results: you just have to use some limit theorems to prove the validity.
I agree that this approach is not correct form a statistical point of view.
Why this method can be interesting? Because it is easy to implement in SQL\NoSQL\Hadoop and run on top of terabytes data set. I use one simple hypothesis testing method (population proportion) which is not "statistically correct" but easy to implement in SQL\NoSQL. I use the method a lot and probably I should write a blog post about it.
These "not correct" methods are good in Ad Hoc analytics projects with large data sets. In these projects you should iterate through many hypotheses. The iteration time is crucial. When problem is localized and dataset is reduced I use correct statistical methods.
Dear Dr Granville,
This article is more than 6 months old.
And totally wrong, from a statistical point of view.
BTW I do love the "rebel statistical science" you are referring to, it made my day.
Best
T
I agree, Peter. The best practice is to use a tool for the purpose it is actually designed for. Progress will come from learning the deficiencies in the existing tools and making improvements in the same spirit as Data Science Central is advocating for.
Hi Khurram,
Complexities will kick in, as skewness will induce heteroscedasticity. No problem when using the right model but creating chaos when left unaddressed. Time series data will raise challenges like non-stationarity. Vincent's approach is refreshing but just scratching the surface. I'm new to Data Science Central but am happy watching at the sideline for now. My point here is that we should not throw away the achievements of classic descriptive statistics and modelling just because classic inferential statistics (P-values) is under attack. The benefit is in joining the old and the new...
Thanks Peter for highlighting important points here. I also had the same impression as indicated here. Also, as for the comment that "Our methodology is better when n (the number of observations) is small (n < 100), or for high confidence levels (> 0.98) or when your data has outliers", it is irrelevant for the big data problems where sample size is large. The central limit theorem will kick in for moderately large data sets, rendering the Gaussian CIs robust against outliers as well. However, If the task is to obtain prediction intervals for new data, one can rely on nonparametric density estimation techniques. Nonetheless, it will be interesting to see how this method compares with existing approaches when data arises from highly skewed multimodel populations.
To really feed your method with some realistically simulated data I suggest to generate lognormal data with a range of different sigma's and see how you compete with the traditional techniques. A good link to theory, application and generality of the lognormal is here.
© 2016 Data Science Central Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central