While the Bernoulli and binomial distributions are among the first ones taught in any elementary statistical course, the Poisson-Binomial is rarely mentioned. It is however one of the simplest discrete distributions, with applications in survey analysis, see here. In this article, we are dealing with experimental / probabilistic number theory, leading to a more efficient detection of large prime numbers, with applications in cryptography and IT security.
This article is accessible to people with minimal math or statistical knowledge, as we avoid jargon and theory, favoring simplicity. Yet we are able to present original research-level results that will be of interest to professional data scientists, mathematicians, and machine learning experts. The data set explored here is the set of numbers, and thus accessible to anyone. We also explain computational techniques, even mentioning online tools, to deal with very large integers that are beyond what standard programming languages or Excel can handle.
1. The Poisson-Binomial Distribution
We are all familiar with the most basic of all random variables: the Bernoulli. If Y is such a variable, it is equal to 0 with probability p, and to 1 with probability 1 - p. Here the parameter p is a real number between 0 and 1. If you run n trials, independent from each other, and each with the same potential outcome, then the number of successes, defined as the number of times the outcome is equal to 1, is a Binomial variable of parameters n and p.
Source for picture: here
If the trials are independent but a different p is attached to each of them, then this time the number of successes has a Poisson-binomial distribution. In short, let's say that we have n independent Bernoulli random variables Y1, ..., Yn respectively with parameter p1, ..., pn, then the number of successes X = Y1 + ... + Yn has a Poisson-binomial distribution of parameters p1, ..., pn and n. The exact probability density function is cumbersome to compute as it is combinatorial in nature, but a Poisson approximation is available and will be used in this article, thus the name Poisson-binomial.
The first two moments (expectation and variance) are as follows:
The exact formula for the PDF (probability density function) involves an exponentially growing number of terms as n becomes large. For instance, P(X = n - 2) which is the probability that exactly two out of n trials fail, is given by the following formula:
For this reason, whenever possible, approximations are used.
1.1. Poisson approximation
When the parameters pk are small, say pk < 0.1, then the following Poisson approximation applies. Let λ = p1 + ... + pn. Then for m = 0, ..., n, we have:
When n becomes large, we can use the Central Limit Theorem to compute more complicated probabilities such as P(X > m), based on the Poisson approximation. See also the Le Cam theorem for more precise approximations.
2. Case study: Odds to observe many primes in a random sequence
The 12 integers below were produced with a special sequence described in the second example in this article. It quickly produces a large volume of numbers with no small divisors. How likely it is to produce such a sequence of numbers just by chance? The numbers q, q, q, q have divisors smaller than 1,000 and the remaining eight numbers have no divisor smaller than N = 15,485,863. Note that N (the one-millionth prime) is the largest divisor that I tried in that test.
Here is the answer. The probability for a large number x to be prime is about 1 / log x, by virtue of the Prime Number Theorem. The probability for a large number x to have no divisor smaller than N is
where the product is over all primes p < N and γ = 0.577215… is the Euler–Mascheroni constant. Here ρN ≈ 0.033913. See here for an explanation of the equality on the left side. The right-hand formula is known as the Mertens theorem. See also here. The symbol ~ represents asymptotic equivalence. Thus the probability to observe 4 large numbers out of 12 having no divisor smaller than N is
For the above numbers q,⋯,q, the probability in question is not small. For instance, it is equal to 0.47, 0.36 and 0.23 respectively for q, q and q. Other sequences producing a high density of prime numbers are discussed here and here.
2.1. Computations based on the Poisson-Binomial distribution
Let us denote as pk the probability that q[k] is prime, for k =1, ...,12. As discussed earlier in section 2, pk = 1 / log q[k] is small, and the Poisson approximation can be used when dealing with the Poisson-binomial distribution. So we can use the formula in section 1.1. with λ = p1 + ... + pn and n = 12.
Thus, λ = 0.11920 (approx.) Now we can compute P(X = m) for m = 8, 9, 10, 11,12:
The chance that 8 or more large numbers are prime among q,⋯,q is the sum of the 5 probabilities in the above table. It is equal to 9.1068 / 10^13. That is, less than one in a trillion.
2.2. Technical note: handling very large numbers
Numbers investigated in this research have dozens and even hundreds of digits. The author has routinely worked with numbers with millions of digits. Below are some useful tools to deal with such large numbers.
3. Cryptography application
Many cryptography systems rely on public and private keys that feature the product of two large primes, typically with hundreds or thousands of binary digits. Producing such large primes was not an easy task until efficient algorithms were created to check if a number is prime or not. These algorithms are known as primality tests. Some are very fast but only provide a probabilistic answer: the probability that the number in question is a prime number, which is either zero or extremely close to one. These algorithms rely on sampling a large number of primes to identify prime candidates, and then determine their status (prime or not prime) with an exact but more costly test.
Remember that the probability for a random, large integer p to be prime, is about 1 / log p. So if you test 100,000 numbers close to 10^300, you'd expect to find 145 primes. Not a very efficient strategy. One way to improve these odds by an order of magnitude, is to pick up integers belonging to sequences that are prime-rich: such sequences can contain 10 times more primes than random sequences. This is where the methodology discussed here becomes handy. Such sequences are discussed in two of my articles: here and here.
About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent's articles and books, here.