We illustrate pattern recognition techniques applied to an interesting mathematical problem: The representation of a number in non-conventional systems, generalizing the familiar base-2 or base-10 systems. The emphasis is on data science rather than mathematical theory, and the style is that of a tutorial, requiring minimum knowledge in mathematics or statistics. However, some off-the-beaten-path, state-of-the-art number theory research is discussed here, in a way that is accessible to college students after a first course in statistics. This article is also peppered with mathematical and statistical oddities, for instance the fact that there are units of information smaller than the bit.
You will also learn how the discovery process works, as I have included research that I thought would lead me to interesting results, but did not. In all scientific research, only final, successful results are presented, while actually most of the research leads to dead-ends, and is not made available to the reader. Here is your chance to discover these hidden steps, and my thought process!
The topic discussed is one of active research with applications to Blockchain or strong encryption. It is of interest to agencies such as the NSA or private research labs working on security issues, which explains why it is not easy to find many references; some are probably classified documents. As far as I know, it is not part of any university curriculum either. Indeed, the fear of reversibility (successful attacks on cryptographic keys using modern computer networks, new reverse-engineering algorithms, and distributed architecture) has led to the development of quantum algorithms and quantum computers, as well as Blockchain.
All the results in this article were obtained without writing a single line of code, and replicable as I share my Excel spreadsheets. See also applications to Fintech in the last section.
1. General Framework
All standard number representation systems have a few universal components and properties. Here, for simplicity, we assume that the number x to be represented (in base b, b integer or not, or in other systems such as the logistic map) is in [0, 1]. The number x is referred to as a seed. The details are found here, and summarized in the table below.
Components of number representation systems
The three basic components are:
General properties of these systems
Some of the general properties include:
Examples of number representation systems
We describe the specifics of two systems that are further investigated in the next section. As usual, n starts with n = 1, and x(1) = x. We assume that x is in [0, 1]. In particular, a(n) represents the n-th digit.
The base-b system (with b an integer here) is characterized by:
The Logistic Map system is characterized by:
The brackets in these formulas represent the integer part function.
Examples of patterns in digits distribution
Below are the digits of x = SQRT(2) - 1 in base b = Pi. This number x is supposed to be a good seed in that system, yet its digits exhibit strong patterns. However all good seeds exhibit the same exact patterns in that system. It means that the equilibrium distribution is not uniform, and that in general, the digits are auto-correlated (with the same auto-correlation structure regardless of x, if x is a good seed.) Indeed, lack of auto-correlation, or uniform distribution of digits, for a given seed x, would mean that x is a bad seed, in that system.
To the contrary, in base 2, the artificial number whose n-th digit is equal to 1 if n * Pi - INT(n * Pi) is less than 1/3, and equal to 0 otherwise, has a known digit distribution; the digits are not auto-correlated in that system; there is no periodicity in the digits, yet it is a bad seed. There is no pattern in the digits except that the digit 0, on average, occurs twice as frequently as the digit 1. Actually, this is true if you replace Pi by any good seed in base 2. So there are as many bad seeds as good seeds, but the set of bad seeds has measure 0, while the set of good seeds has measure 1. More about this paradox is discussed here.
Purpose of this research
The purpose is to create a new number representation system, one where the digits of Pi or 0 or any number, would all have a uniform distribution, and appear random. One limitation of such a system is that some configuration of digits, those with a non-uniform distribution, can not exist by definition. Thus such a system must have infinitely many "holes", unlike the other systems discussed so far, that have no hole. We will explore this in section 3 and 4.
But first, let's start with revisiting an old system, and introducing some applied pattern analysis techniques. This is the topic of the next section.
2. Defects found in the Logistic Map system
For a long time, I thought that the logistic map system was the best one, with uniform digit distribution, and no auto-correlation between successive digits. It has some advantages in cryptographic applications, over base-2 or base-10: It is almost impossible to reconstruct the seed x if you only know its digits, or one value of x(n) for some large n; see here however to reconstruct x if you know x(n) for a rather small n. While all this is true, after further investigations, I've found some subtle departures from pure randomness in the logistic map. I mention it here for two reasons:
It is also the first time that this research result is published.
The Logistic Map system is defined in the previous section. I analyzed the distribution of blocks of digits (5 consecutive digits) and found that, contrarily to my expectations, the distribution was not uniform: See picture below. I tested 10,000 numbers (seeds) and compared the results with the base-2 system. In the picture below,
Since there are 32 possible blocks of 5 binary digits, and 10,000 seeds in the test, on average each block should occur about 312.5 times. The block distribution, whether uniform (base-2) or not (Logistic Map) does not depend on the seed, as long as you pick up good seeds (pretty much all numbers are good seeds.) By increasing the number of seeds from 10,000 to (say) 20,000, 50,000 or 100,000, you could build confidence intervals confirming that the uniform distribution hypothesis does not hold for the logistic map.
It is surprising that despite the uniform distribution for single digits, and the lack of auto-correlation, we end up with a non uniform distribution for blocks of digits. Another possible test consists of considering the sequence of digits { a(n) } as a Markov chain, and computing the observed (or theoretical!) transition probabilities of moving from one state -- say a(n) = 0 -- to another one -- say a(n+1) = 1. Yet another test consists of analyzing the gaps between successive identical digits, or successive identical blocks of digits. If the distribution is uniform, these gaps have a geometric distribution: See here how this test works, on a practical example, in a very similar context: SQRT(2) in base 10.
You can download the spreadsheet with detailed computations, here
3. First step in designing a new system
This is a step that usually no scientist would publish anywhere (it would be rejected in scientific journals), as it led me to a dead-end, with no practical value in the business context that I am interested in. However, I publish it here for the following reasons:
The purpose, as mentioned in the introduction, is to create a number representation system that has no bad seed. All standard systems have plenty of bad seeds. No one knows if Pi is one of those bad seeds, in any number representation system, be it base-2 or base-10. It is widely believed that it is not. In my new system described here, all numbers, including Pi (or zero for that matter) have digits that appear as if they were randomly generated, following some known statistical distribution.
This is part of a general research interest: Trying to prove that numbers such as Pi or SQRT(2) have digits that look random in standard systems, and trying to design better systems for Blockchain or cryptographic applications. By creating this new system, I can say that I have found one in which the digits of Pi have a random distribution compatible with the equilibrium distribution of the system in question, thus debunking the myth that it is impossible to prove or disprove, regardless of the system, that digits of popular mathematical constants are randomly distributed. However this new system is not suitable for business applications, as we will soon see. The real value is in systems studied in section 4.
First version of new number representation system
The new system is defined as follows, and it involves a parameter c (just like the base-b system involves a parameter b) which in this case must be a positive irrational number:
Thus this system also has an exact formula for x(n), just like the base-2 system. Also all digits of any seed x, are binary, equal to either 0 or 1, and x(1) = x. The equilibrium distribution is uniform on [0, 1] regardless of x or c (c irrational). You don't need to solve the stochastic integral equation attached to that system to figure this out, it is a consequence of the equidistribution theorem if c is irrational. An immediate consequence is that all seeds are good seeds. The behavior of the system depends on c (for the auto-correlation structure) but not on x. So you might as well study it using only x = 0. For instance, if c = log(2), the first few digits of x = 0 are as follows:
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0
All numbers (regardless of c) have 50% of digits equal to 0, and 50% equal to 1. As discussed earlier, this system must have holes. For instance, no number x, not even zero, regardless of c (assuming c is irrational) has periodicity in its digit distribution, unlike all other systems. It is a sound system? Can two different numbers have the same digits? It seems that it is a sound system, though I still have some doubts. However the most spectacular feature of this system is incredibly strong autocorrelations both in the { x(n) } and { a(n) } sequences, no matter which values of c or x you test. For a specific c, autocorrelations do not depend on x (so you might as well look at x =0 alone.) This is a consequence of having an equilibrium distribution. Also, there are incredibly strong conditions on { a(n) } to make it a valid digit sequence, and thus, you could say that holes are everywhere, very obvious, and indeed occupy most of the space just like empty space in the universe.
If you want to check that { x(n) } has a uniform distribution, using data science techniques, you can compute empirical percentiles for x(n) for the first 10,000 values, maybe using x = 0, and find that they perfectly match those of a uniform distribution. This is illustrated in the section 4, and we will do it just using Excel!
Holes, autocorrelations, and entropy (information theory)
The new number representation systems introduced is this section, is full of holes, and actually mostly consists of holes, with respect to the distribution of the digits, as previously stated. Here we explore this feature in more details. One striking way to look at it is as follows.
I computed the first 13 digits of more than 10,000 random numbers, for c = log(6) / log(10). I only found 26 = 2 x 13 possible (valid) configurations, listed below:
As an exercise, you can try with different values of c, or with more than 13 digits. If instead you do the same exercise in base 2, you would find that all 2^13 = 8,192 digit configurations are represented. This brings us to the concept of information. How much information does one digit provide about the number x? In base b = 1,000, one digit obviously provides three times as much information as in base b = 10. In base b = 2, one digit provides little information (just one bit), compared to base b = 10. In base b = 1.3 (yes, the base does not need to be an integer, see table in section 1) a digit provides even less information. In my new system, a digit provides even far less information. Indeed, we have this surprising fact: One digit in my system carries much less than one bit of information, proving that the bit, contrarily to popular belief, is not the smallest unit of information. Just like atoms are not the smallest particle of matter. Note that in the above table with the 26 digit configurations, each digit a(1), ... a(13) is equal to 0 exactly 13 times, and equal to 1 exactly 13 times. What a superb pattern! It sounds like a fractional factorial table used in designs of experiments or clinical trials.
Finally, autocorrelations in the { x(n) } or { a(n) } sequences, are very strong in my new system. However, these autocorrelations depend on c (but not on x.) Out of curiosity, I tried to find some values of c that yield very little correlation (lag-1 autocorrelation) between x(n) and x(n+1). Indeed, such c's are easy to find, but in all cases, it results in a strong lag-2 autocorrelation (correlation between x(n) and x(n+2)). The findings are summarized in the chart below.
Scatterplot: Lag-1 (X-axis) vs Lag-2 (Y-axis) autocorrelations in the { x(n) } sequence, for 500 values of c
Yes, this is really a scatterplot, produced using about 500 values of c, all with x = 0. Yet all points are concentrated on the special blue curve. The details are in this spreadsheet.
Note that some points in the chart seem to be a little off the main curve. This is due to the fact that any irrational value of c that is very close to a simple rational number, is causing stability issues. The 500 values of c were generated using the function RAND() in Excel, thus some are bound to be very close to simple fractions such as 7/2. Also, these computed correlations are approximations, computed on the first 1,000 terms of the sequence { x(n) }.
The lag-d autocorrelations R(d) (d = 1, 2, and so on) can be efficiently computed on the first n terms using the formula
This formula actually applies to any number representation system that has a uniform equilibrium distribution on [0, 1]. The theoretical lag-d auto-correlation can be obtained using the above formula and letting n tends to infinity. Its value depends on c and d (but not on x.) For d = 1,we have:
where g is the function that characterizes the system, in this case g(x) = x + c - INT(x + c). The same applies to the system studied in the next section, but using a slightly different g.
Finally, another drawback of this system is that it won't map onto other systems, because of these holes that standard systems do not have. To the contrary, there is a mapping between the Logistic Map and the base-2 systems, see here.
4. Towards a more general, better, hybrid system
One would think that by combining the base-2 system, together with the system described in the previous section, you would get a system with many desirable properties.This system is defined as follows:
where the parameter b plays the same role as in the base-b system, and the parameter c is the same as in section 3. If b = 1, this is identical to the system described in section 3. If b = 2 and c = 0, it is identical to the base-2 system. I am not sure whether a simple, exact formula is available for x(n).
I have found that when b = 2, based on 5,000 test seeds, all 64 possible combinations are represented for the first 6 binary digits of x. This means that this system probably has no hole, but unfortunately, it must have bad seeds: You can not win both ways. On the plus side, it also means that one digit carries at least one bit of information, usually a desirable property in business applications. The equilibrium distribution for x(n) is also the uniform distribution on [0, 1] assuming b is an integer. The sequence { x(n) } still has auto-correlations (depending on c), but that is also true for the base-2 system.
From now on, we focus exclusively on the case b = 2. Some values of c, for instance c = log(6) / log(10), yield a lag-1 autocorrelation very close to 0 (a desirable property.) By contrast, in the standard base-2 system (corresponding to c = 0) the lag-1 autocorrelation is equal to 1/2. We discuss here two data science techniques that are useful in this context.
Faulty digits, ergodicity, and high precision computing
All the systems investigated here are ergodic, in the sense that they do have a unique equilibrium distribution that does not depend on the (good) seed. In such systems, there are two approaches to investigate statistical properties, for instance to analyse auto-correlations in the { x(n) } or { a(n) } sequences:
The first approach can result in numbers (digits) that are entirely wrong, beyond n = 40, if you use standard arithmetic, due to cumulative errors spreading exponentially fast. This was the case here when using a value of b that is an even integer, due to some dynamics in the way arithmetic computations are performed in Excel (after n = 40, and strangely, only for even integer values of b, all x(n)'s suddenly vanished -- the implicit "re-seeding" mechanism failed.) The same may happen in any programming language. As a result, the second approach is favored.
However, there is another workaround: It consists of using high precision computing, and this is described here. After all, how do you expect, for a given number, to get 5,000 correct digits when machine accuracy (with standard arithmetic) is limited to 15 decimals or so?
Finding the equilibrium distribution with the percentile test
You can always find the equilibrium distribution (at least a very good approximation) by solving the stochastic integral equation associated with your system, and it typically has only one solution. It is actually not that hard to solve, even accessible to college students in their first year. It is illustrated in this article. I solved quite a few of them (with exact solution), but in each case, I started with empirical results to "guess" the equilibrium distribution, before confirming the result with a theoretical proof. We discuss this empirical approach here, which relies on percentile computations and fitting the observed (empirical) percentiles distribution with a few models. The details are available in my Excel spreadsheet (see last sub-section.)
In this case, for b = 2 and c = log(6) / log(10), I proceeded as follows:
Percentiles of x(n) computed across 5,000 seeds: n = 5 is close to uniform, n = 20 is almost uniform
Note that the fast convergence is due to using b = 2. If you use b = 1 instead (corresponding to the system described in section 3) you will need to use a much larger n to notice convergence, which brings again the issue of numerical accuracy. However, when b = 1, the system is numerically stable, you may as well use one seed (x = 0 will do, see section 3) and a large n.
Central limit theorem, random walks, Brownian motions, stock market modeling
The famous central limit theorem applies in this context in two different ways:
Data set and Excel computations
All the data (simulations) and Excel computations (formulas) related to this section, can be downloaded here.
Related articles
For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on on LinkedIn.
DSC Resources
Comment
I came across a very technical paper entitled Combinatorial and probabilistic properties of systems of numeration (37-pages long, published in 2016 in Ergodic Theory and Dynamical Systems, see here -- a PDF version is available here.) However, despite my PhD in math (I even published in Journal of Number Theory) I could not make sense of it: It is mostly a list of complex theorems with long proofs and many, many equations. I am wondering what is the overlap between that paper, and my findings here. Clearly, they use dynamical systems for their framework, and so do I (the logistic map is usually studied in the context of dynamical systems.) Also they spend a lot of time discussing what their call the invariant measure of these systems, which I assume is equivalent to my "equilibrium distribution." They focus on uniqueness and existence of this measure, and on ergodic properties. I do too here, but from an applied point of view (though I also use the word ergodic at one point.) Some of my systems do not seem to be covered by their general theory (they say the number zero always has all its digits equal to zero in their framework, but my systems do not have this kind of limitation.) They also discuss slow and fast growing systems: My example in section 3 illustrates slow growth, while the example in section 4 illustrates fast growth. Many of my systems (including base-b, even if b is not an integer, continued fractions, or nested square roots) implicitly use what is called the greedy algorithm to compute the digits; they also mention the greedy algorithm throughout their article. Can someone help me understand what they are actually talking about (compared to what I am doing here?)
The same team (or at least researchers benefiting from the same funding source) wrote another interesting yet technical document entitled Fractals arising from numeration and substitutions, see here. It is the first time that I see the base-b system, with b not an integer, mentioned, besides my own research. I also mention fractals in my article (see section on Brownian motions) but none of these authors mention the stochastic integral equation, and none of them truly discuss probabilistic features of these systems. Anyway, I could not resist posting their nice fractal from their PDF document, see below:
Hi Emil,
To answer your question, my first reaction is to say that they only have disadvantages. However, I could see them useful in some encryption systems, in the sense that they add some kind of random noise to your message, and each digit provides little information if the base is smaller than 2. It comes with massive auto-correlations, but no matter what your message is, the auto-correlation structure is the same, and thus possibly not a big weakness for hackers trying to decipher your message. It comes with a cost: it requires additional computational power to encrypt / decrypt etc.
If you use a transcendental number for the base, it could make the task of hackers much harder.
Vincent
What is the advantage of a number system that uses a non-integer or even a transcendental base?
© 2018 Data Science Central ® Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central