Dual Confidence Regions: A Simple Introduction

This tutorial explains how to build confidence regions (the 2D version of a confidence interval) using as little statistical theory as possible. I also avoid traditional terminology and notation such as α, Z_1-α, critical value, significance level and so on. These can confuse beginners and professionals alike.

Instead, I use simulations and only two key terms: confidence region and confidence level. The purpose is to explain the concept using a framework that will appeal to machine learning professionals, software engineers and non-statisticians. My hope is that you will gain a deep understanding of the technique, without headaches. I also introduce an alternative type of confidence region, called the dual confidence region. It is asymptotically equivalent to the standard definition. In my opinion, it is more intuitive.

Example

This example comes from a real-life application. In this section I provide the minimum amount of material necessary to illustrate the methodology. The full problem is described in the last section, for the curious reader. In its simplest form, we are dealing with independent bivariate Bernoulli trials. The data set has n observations. Each observation consists of two measurements (u_k, v_k), for k = 1, …, n. Here u_k = 1 if some interval B_k contains no point (otherwise u_k = 0). Likewise, v_k = 1 if the same interval contains exactly one point (otherwise v_k = 0).

The interval B_k can contain more than one point, but of course it cannot simultaneously contain no point and exactly one point, so u_k and v_k are never both equal to 1. The probability that B_k contains no point is p; the probability that it contains exactly one point is q, with 0 < p + q < 1. The goal is to estimate p and q. The estimators (proportions computed on the observations) are denoted as p₀ and q₀.

Since we are dealing with Bernoulli variables, the standard deviations are σ_p = [p(1-p)]^(1/2) and σ_q = [q(1-q)]^(1/2). Also, the correlation between the two components of the observation vector is ρ_p,q = -pq / (σ_p σ_q). Indeed, the probability of observing (0, 0) is 1-p-q, the probability of observing (1, 0) is p, the probability of observing (0, 1) is q, and the probability of observing (1, 1) is zero, so the covariance between the two components is 0 - pq = -pq.
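The formulas above can be checked numerically. The following is a minimal sketch, assuming illustrative values p = 0.3 and q = 0.4; the function names are hypothetical and not part of the original material. It computes the closed-form quantities, then recomputes the correlation directly from the four outcome probabilities.

```python
import math


def bernoulli_moments(p, q):
    """Return (sigma_p, sigma_q, rho_pq) from the closed-form formulas."""
    if not (p > 0 and q > 0 and p + q < 1):
        raise ValueError("need p > 0, q > 0 and p + q < 1")
    sigma_p = math.sqrt(p * (1 - p))
    sigma_q = math.sqrt(q * (1 - q))
    rho_pq = -p * q / (sigma_p * sigma_q)
    return sigma_p, sigma_q, rho_pq


def correlation_from_outcomes(p, q):
    """Recompute the correlation from the probabilities of the four outcomes."""
    # (u, v) -> probability; the outcome (1, 1) is impossible.
    outcomes = {(0, 0): 1 - p - q, (1, 0): p, (0, 1): q, (1, 1): 0.0}
    mean_u = sum(u * w for (u, v), w in outcomes.items())
    mean_v = sum(v * w for (u, v), w in outcomes.items())
    cov = sum((u - mean_u) * (v - mean_v) * w for (u, v), w in outcomes.items())
    var_u = sum((u - mean_u) ** 2 * w for (u, v), w in outcomes.items())
    var_v = sum((v - mean_v) ** 2 * w for (u, v), w in outcomes.items())
    return cov / math.sqrt(var_u * var_v)


# Illustrative values only (assumption, not taken from the real data set).
sigma_p, sigma_q, rho_pq = bernoulli_moments(0.3, 0.4)
rho_check = correlation_from_outcomes(0.3, 0.4)
print(sigma_p, sigma_q, rho_pq, rho_check)
```

Both routes should give the same negative correlation, up to floating-point rounding.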

Shape of the Confidence Region

A confidence region of level γ is a domain of minimum area that contains a proportion γ of the potential values of your estimator (p₀, q₀), based on your n observations. When n is large, (p₀, q₀) approximately has a bivariate normal distribution (also called Gaussian), thanks to the central limit theorem. The covariance matrix of this normal distribution is specified by σ_p, σ_q and ρ_p,q measured at p = p₀ and q = q₀, and scaled by 1/n. For a fixed γ, the optimum shape (the one with minimum area) necessarily has a boundary that is a contour level of the distribution in question. In our case, that distribution is bivariate Gaussian, and thus contour levels are ellipses.
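For reference, the covariance matrix mentioned above can be written out explicitly. The symbol Σ_n is introduced here for convenience only; the entries follow from the standard deviations and the correlation given in the Example section, with p and q evaluated at p₀ and q₀ in practice.

```latex
\Sigma_n = \frac{1}{n}
\begin{pmatrix}
\sigma_p^2 & \rho_{p,q}\,\sigma_p\sigma_q \\
\rho_{p,q}\,\sigma_p\sigma_q & \sigma_q^2
\end{pmatrix}
= \frac{1}{n}
\begin{pmatrix}
p(1-p) & -pq \\
-pq & q(1-q)
\end{pmatrix}.
```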

Let us define

H_n(x,y,p,q)=\frac{2n}{1-\rho_{p,q}^2}\cdot \Big[\Big( \frac{x-p}{\sigma_p}\Big)^2 -2\rho_{p,q}\Big(\frac{x-p}{\sigma_p}\Big)\Big(\frac{y-q}{\sigma_q}\Big) + \Big(\frac{y-q}{\sigma_q}\Big)^2\Big].

Setting H_n equal to a constant gives the general elliptic form of the contour line. When n is large, the distribution of H_n, evaluated at the estimator, essentially does not depend on n, p, q. The standard confidence region is then the set of all (x, y) satisfying H_n(x, y, p₀, q₀) ≤ G_γ. Here you choose G_γ to guarantee that the confidence level is γ. Replace ≤ by = to get the boundary of that region.
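As an illustration, here is a minimal sketch of H_n and of the membership test for the standard confidence region. The function names are hypothetical, and the threshold passed as g_gamma is a placeholder: its actual value has to be calibrated, for instance by simulation.

```python
import math


def h_n(x, y, p, q, n):
    """Evaluate H_n(x, y, p, q) as defined above, for sample size n."""
    sigma_p = math.sqrt(p * (1 - p))
    sigma_q = math.sqrt(q * (1 - q))
    rho = -p * q / (sigma_p * sigma_q)
    a = (x - p) / sigma_p
    b = (y - q) / sigma_q
    return 2 * n / (1 - rho ** 2) * (a * a - 2 * rho * a * b + b * b)


def in_standard_region(x, y, p0, q0, n, g_gamma):
    """True if (x, y) lies in the standard confidence region of (p0, q0)."""
    return h_n(x, y, p0, q0, n) <= g_gamma


# Hand-checkable case: p = 0.5, q = 0.25 gives rho^2 = 1/3, so with n = 1000
# the factor 2n / (1 - rho^2) equals 3000, and H_n = 3000 * 0.02^2 = 1.2.
value = h_n(0.51, 0.25, 0.5, 0.25, 1000)
print(value)

# The threshold 9.0 is an arbitrary placeholder, not a calibrated G_gamma.
print(in_standard_region(0.51, 0.25, 0.5, 0.25, 1000, g_gamma=9.0))
```

H_n is zero at the center (x, y) = (p, q) and grows as (x, y) moves away from it, so the region is the inside of an ellipse centered at the estimate.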

In this case G_γ is a quantile of the Hotelling distribution. In the simulation section, I show how to compute G_γ. The simulations apply to any setting, whether G_γ is a Hotelling quantile, a Fisher quantile or any other quantile, and whether or not the limit distribution of your estimator (p₀, q₀) is Gaussian as the sample size n increases. These simulations provide a generic framework to compute confidence regions.

Dual Confidence Region

The dual confidence region is obtained simply by swapping the roles of (x, y) and (p, q) in H_n(x, y, p, q), with (p, q) replaced by (p₀, q₀). It is thus defined as the set of (x, y) satisfying H_n(p₀, q₀, x, y) ≤ H_γ. Again, you choose H_γ to guarantee that the confidence level is γ. This is no longer the equation of an ellipse, because the standard deviations and the correlation are now evaluated at (x, y) and vary with it. In practice, both confidence regions are very similar. Also, H_γ is almost identical to G_γ. The interpretation is as follows. A point (x, y) is in the dual confidence region of (p₀, q₀) if and only if (p₀, q₀) is in the standard confidence region of (x, y). We use the same n and confidence level γ for both regions. You can use the same principle to define dual confidence intervals.
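The duality can be sketched in a few lines. This is a hedged illustration, not the author's implementation: the function names, the values p₀ = 0.3, q₀ = 0.4, the test point and the threshold are all assumptions chosen for the example.

```python
import math


def h_n(x, y, p, q, n):
    """Evaluate H_n(x, y, p, q); sigma and rho are computed at (p, q)."""
    sigma_p = math.sqrt(p * (1 - p))
    sigma_q = math.sqrt(q * (1 - q))
    rho = -p * q / (sigma_p * sigma_q)
    a = (x - p) / sigma_p
    b = (y - q) / sigma_q
    return 2 * n / (1 - rho ** 2) * (a * a - 2 * rho * a * b + b * b)


def in_standard_region(x, y, p0, q0, n, threshold):
    """(x, y) is in the standard region of (p0, q0)."""
    return h_n(x, y, p0, q0, n) <= threshold


def in_dual_region(x, y, p0, q0, n, threshold):
    """(x, y) is in the dual region of (p0, q0): the arguments are swapped."""
    return h_n(p0, q0, x, y, n) <= threshold


# Illustrative values (assumptions): estimate, sample size, candidate point.
p0, q0, n = 0.3, 0.4, 20000
x, y = 0.305, 0.396
threshold = 9.0  # placeholder, not a calibrated G_gamma or H_gamma

g_value = h_n(x, y, p0, q0, n)  # standard form, covariance evaluated at (p0, q0)
h_value = h_n(p0, q0, x, y, n)  # dual form, covariance evaluated at (x, y)
print(g_value, h_value)

# (x, y) is in the dual region of (p0, q0) exactly when
# (p0, q0) is in the standard region of (x, y).
print(in_dual_region(x, y, p0, q0, n, threshold),
      in_standard_region(p0, q0, x, y, n, threshold))
```

For a large n such as this one, the two values printed on the first line differ only slightly, which is why the two regions look so similar in practice.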

Figure 1 shows an example based on simulations.

Figure 1: Example of 90% dual confidence region for (p, q)

Simulations

The simulations consist of generating N data sets, each with n observations, from the bivariate Bernoulli model described in the Example section. The purpose is to create data sets that have the same statistical behavior as your observations. In particular, use p₀ and q₀ as the parameters of the bivariate Bernoulli model.

For each simulated data set, compute the proportions, standard deviations and correlation. They are denoted as x, y, σ_x, σ_y and ρ_x,y (one set of values per data set). Use the standard formulas from this article: for instance, σ_x = [x(1-x)]^(1/2). Also compute G(x, y) = H_n(x, y, p₀, q₀) and H(x, y) = H_n(p₀, q₀, x, y) for each data set. Put the results in a table with N rows and 7 columns. Proceed as follows.

Standard confidence region: sort the table by G(x, y).
Dual confidence region: sort the table by H(x, y).

The first γN rows in your sorted table determine your confidence region of level γ. All the (x, y) in those rows belong to your confidence region. In the first γN rows, the last value of H(x, y), if sorted by H(x, y), is H_γ. Likewise, the last value of G(x, y), if sorted by G(x, y), is G_γ. See the example in Figure 1, with N = 10,000 and n = 20,000. As N increases, your simulations yield regions closer and closer to the theoretical ones. The spreadsheet with these simulations is available on my GitHub repository, here.
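The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a port of the spreadsheet: the function names, the seed, the values p₀ = 0.3, q₀ = 0.4, and the small sizes N = 500 and n = 1000 (chosen so that the pure-Python loop runs quickly) are all hypothetical.

```python
import math
import random


def h_n(x, y, p, q, n):
    """Evaluate H_n(x, y, p, q); sigma and rho are computed at (p, q)."""
    sigma_p = math.sqrt(p * (1 - p))
    sigma_q = math.sqrt(q * (1 - q))
    rho = -p * q / (sigma_p * sigma_q)
    a = (x - p) / sigma_p
    b = (y - q) / sigma_q
    return 2 * n / (1 - rho ** 2) * (a * a - 2 * rho * a * b + b * b)


def simulate_proportions(p0, q0, n, rng):
    """Simulate one data set of n bivariate Bernoulli trials; return (x, y)."""
    zeros = ones = 0
    for _ in range(n):
        r = rng.random()
        if r < p0:
            zeros += 1  # observation (1, 0): the interval contains no point
        elif r < p0 + q0:
            ones += 1   # observation (0, 1): the interval contains one point
    return zeros / n, ones / n


def build_table(p0, q0, n, num_sets, seed=0):
    """Return N rows of (x, y, sigma_x, sigma_y, rho_xy, G, H)."""
    rng = random.Random(seed)
    table = []
    for _ in range(num_sets):
        x, y = simulate_proportions(p0, q0, n, rng)
        sigma_x = math.sqrt(x * (1 - x))
        sigma_y = math.sqrt(y * (1 - y))
        rho_xy = -x * y / (sigma_x * sigma_y)
        g = h_n(x, y, p0, q0, n)  # standard: covariance evaluated at (p0, q0)
        h = h_n(p0, q0, x, y, n)  # dual: covariance evaluated at (x, y)
        table.append((x, y, sigma_x, sigma_y, rho_xy, g, h))
    return table


def region(table, column, gamma):
    """Sort by one column, keep the first gamma*N rows, return rows and cutoff."""
    rows = sorted(table, key=lambda row: row[column])
    kept = rows[: round(gamma * len(rows))]
    return kept, kept[-1][column]


table = build_table(p0=0.3, q0=0.4, n=1000, num_sets=500, seed=42)
standard_rows, g_gamma = region(table, column=5, gamma=0.9)  # sorted by G
dual_rows, h_gamma = region(table, column=6, gamma=0.9)      # sorted by H
print(g_gamma, h_gamma)
```

The (x, y) pairs in standard_rows and dual_rows are the simulated points that fall inside each region, and plotting them gives a picture similar to Figure 1. As a rough sanity check, under the Gaussian approximation H_n as defined here behaves like twice a chi-square variable with two degrees of freedom, so for γ = 0.9 both cutoffs should land in the neighborhood of -4 ln(0.1), about 9.2, with noticeable noise when N is this small.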

The Original Problem

The original problem consisted of estimating the two parameters of a perturbed lattice point process. These stochastic processes have applications in sensor and cell network optimization. Rather than a direct estimation, I used proxy statistics p, q for the estimator. This method, called minimum contrast estimation, requires a one-to-one mapping between the original parameter space and the proxy space.

The point count statistic discussed earlier measures the number of points of this process that are in a specific interval B_k. I used n non-overlapping intervals B₁, …, B_n, each one yielding one observation vector. The observation vectors are almost identically and independently distributed across the intervals. However, the first and second components of the vectors are negatively correlated. This explains the choice of the bivariate Bernoulli distribution for the model. The topic is discussed in detail in my upcoming book, here.

About the Author

Vincent Granville is a pioneering data scientist and machine learning expert, founder of MLTechniques.com and co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.