# We illustrate pattern recognition techniques applied to an interesting mathematical problem: The representation of a number in non-conventional systems, generalizing the familiar base-2 or base-10 systems. The emphasis is on data science rather than mathematical theory, and the style is that of a tutorial, requiring minimum knowledge in mathematics or statistics. However, some off-the-beaten-path, state-of-the-art number theory research is discussed here, in a way that is accessible to college students after a first course in statistics. This article is also peppered with mathematical and statistical oddities, for instance the fact that there are units of information smaller than the bit.

You will also learn how the discovery process works, as I have included research that I thought would lead me to interesting results, but did not. In all scientific research, only final, successful results are presented, while actually most of the research leads to dead-ends, and is not made available to the reader. Here is your chance to discover these hidden steps, and my thought process!

The topic discussed is one of active research with applications to Blockchain or strong encryption. It is of interest to agencies such as the NSA or private research labs working on security issues, which explains why it is not easy to find many references; some are probably classified documents. As far as I know, it is not part of any university curriculum either. Indeed, the fear of reversibility (successful attacks on cryptographic keys using modern computer networks, new reverse-engineering algorithms, and distributed architecture) has led to the development of quantum algorithms and quantum computers, as well as Blockchain

All the results in this article were obtained without writing a single line of code, and replicable as I share my Excel spreadsheets. See also applications to Fintech in the last section.

1. General Framework

All standard number representation systems have a few universal components and properties. Here, for simplicity, we assume that the number x to be represented (in base b, b integer or not, or in other systems such as the logistic map) is in [0, 1]. The number x is referred to as a seed. The details are found here, and summarized in the table below. Components of number representation systems

The three basic components are:

• An iterative algorithm x(n+1) = g(x(n)) defined by a function g that maps [0, 1] onto [0, 1]. The starting point is x(1) = x (the number to be represented, or seed, assumed to be in [0, 1] in most systems.)
• Digits of x, denoted as a(n) for the n-th digit (starting with n =1) defined by a simple function h, as follows: a(n) = h(x(n)).
• An equilibrium distribution, solution of P(X  <  y) = P(g(X)  <  y). This is a stochastic integral equation in disguise, and the exact solutions are known for all standard systems. It is used to find the typical statistical distribution of digits in a particular system, applicable to most x's in that system. The observed data for the underlying random variable X consists of the points x(1), x(2), and so on, essentially regardless of the seed. The equilibrium distribution of a system applies to almost all x's, except for very few x's known as bad seeds for the system in question.

General properties of these systems

Some of the general properties include:

• All systems have bad seeds, but the set of bad seeds, even though infinite, is extremely small: Its Lebesgue measure is equal to 0. Even most bad seeds have a statistical distribution for the digits (most of the time different from the equilibrium distribution, see digits of the bad seed x = 1/3 in base 10), though some bad seeds do not have a statistical distribution at all, for the digits (see here).
• In some cases, an exact formula is always available for x(n) and a(n), regardless of n, without having to compute previous values, including for the base-2 or base-10 systems, and for the logistic map system.
• In some cases, an exact formula is available to compute x, based on its digits only. This is the case for the base-2 and base-10 systems, but not for the logistic map system.
• Two different seeds have two different representations (sets of digits.)
• A small increase or decrease in x results in a small increase or decrease in x(n), regardless of n.
• You can retrieve x if you only know n and x(n) for a specific n (any n.) It may not always be easy from a computational point of view, but theoretically feasible.
• In most systems, the equilibrium distribution is not a uniform distribution, and the successive digits are auto-correlated. Base-2, base-10, or any integer base, are among the few exceptions. Non uniform systems can be rescaled (by transforming g) to make them uniform.
• Identifying the bad seeds is extremely hard. If it was easy, we would know by now whether the digits of Pi in base 10, have a uniform distribution or not.
• For a given system, if all the digits, across all seeds, cover all potential sequences { a(n) } without any gap -- which is the case for all standard systems -- then obviously the system in question always has an infinite number of bad seeds. In any system, the number x with digits equal to 0 if n is odd, and equal to 1 if n is even, is always a bad seed. Same with any number that has only a finite number of digits equal to 0. In short, good seeds (the immense majority of seeds) are numbers with digits that look like they are randomly distributed, though the distribution is not necessarily uniform. In some binary systems, even for good seeds, there are more 0's than 1's in the digit sequence, and autocorrelation (patterns) are obvious even to the naked eye. Yet the distribution is that of a true (and usually, simple, well known) statistical distribution -- just like correlated time series can still be modeled with statistical distributions, yet are random nevertheless. See above table for examples.

Examples of number representation systems

We describe the specifics of two systems that are further investigated in the next section. As usual, n starts with n = 1, and x(1) = x. We assume that x is in [0, 1]. In particular, a(n) represents the n-th digit.

The base-b system (with b an integer here) is characterized by: The Logistic Map system is characterized by: The brackets in these formulas represent the integer part function.

Examples of patterns in digits distribution

Below are the digits of x = SQRT(2) - 1 in base b = Pi. This number x is supposed to be a good seed in that system, yet its digits exhibit strong patterns. However all good seeds exhibit the same exact patterns in that system. It means that the equilibrium distribution is not uniform, and that in general, the digits are auto-correlated (with the same auto-correlation structure regardless of x, if x is a good seed.) Indeed, lack of auto-correlation, or uniform distribution of digits, for a given seed x, would mean that x is a bad seed, in that system. To the contrary, in base 2, the artificial number whose n-th digit is equal to 1 if n * Pi - INT(n * Pi) is less than 1/3, and equal to 0 otherwise, has a known digit distribution; the digits are not auto-correlated in that system; there is no periodicity in the digits, yet it is a bad seed. There is no pattern in the digits except that the digit 0, on average, occurs twice as frequently as the digit 1. Actually, this is true if you replace Pi by any good seed in base 2. So there are as many bad seeds as good seeds, but the set of bad seeds has measure 0, while the set of good seeds has measure 1. More about this paradox is discussed here

Purpose of this research

The purpose is to create a new number representation system, one where the digits of Pi or 0 or any number, would all have a uniform distribution, and appear random. One limitation of such a system is that some configuration of digits, those with a non-uniform distribution, can not exist by definition. Thus such a system must have infinitely many "holes", unlike the other systems discussed so far, that have no hole. We will explore this in section 3 and 4.

But first, let's start with revisiting an old system, and introducing some applied pattern analysis techniques. This is the topic of the next section.

2. Defects found in the Logistic Map system

For a long time, I thought that the logistic map system was the best one, with uniform digit distribution, and no auto-correlation between successive digits. It has some advantages in cryptographic applications, over base-2 or base-10: It is almost impossible to reconstruct the seed x if you only know its digits, or one value of x(n) for some large n; see here however to reconstruct x if you know x(n) for a rather small n. While all this is true, after further investigations, I've found some subtle departures from pure randomness in the logistic map. I mention it here for two reasons:

• If you design a strong encryption system based on the logistic map, you need to be aware of this problem, as it will make your system a little bit weaker.
• This is our first opportunity in this article to discuss pattern recognition techniques based on actual data (I share my spreadsheet at the bottom of this section.)

It is also the first time that this research result is published.

The Logistic Map system is defined in the previous section. I analyzed the distribution of blocks of digits (5 consecutive digits) and found that, contrarily to my expectations, the distribution was not uniform: See picture below. I tested 10,000 numbers (seeds) and compared the results with the base-2 system. In the picture below,

• The leftmost block on the X-axis corresponds to all 5 digits equal to 0, and it is clearly substantially under-represented for the logistic map (but not for base-2)
• The rightmost block corresponds to all 5 digits equal to 1.

Since there are 32 possible blocks of 5 binary digits, and 10,000 seeds in the test, on average each block should occur about 312.5 times. The block distribution, whether uniform (base-2) or not (Logistic Map) does not depend on the seed, as long as you pick up good seeds (pretty much all numbers are good seeds.) By increasing the number of seeds from 10,000 to (say) 20,000, 50,000 or 100,000, you could build confidence intervals confirming that the uniform distribution hypothesis does not hold for the logistic map.

It is surprising that despite the uniform distribution for single digits, and the lack of auto-correlation, we end up with a non uniform distribution for blocks of digits. Another possible test consists of considering the sequence of digits { a(n) } as a Markov chain, and computing the observed (or theoretical!) transition probabilities of moving from one state -- say a(n) = 0 -- to another one -- say a(n+1) = 1. Yet another test consists of analyzing the gaps between successive identical digits, or successive identical blocks of digits. If the distribution is uniform, these gaps have a geometric distribution: See here how this test works, on a practical example, in a very similar context: SQRT(2) in base 10. 3. First step in designing a new system

This is a step that usually no scientist would publish anywhere (it would be rejected in scientific journals), as it led me to a dead-end, with no practical value in the business context that I am interested in. However, I publish it here for the following reasons:

• It leads to more pattern recognition techniques that I will discuss in this section: It is useful for educational  purposes.
• It features one of the wildest mathematical creatures that I have seen (in this case, discovered) in a long time.
• It was my intermediate step to get to a better system (discussed in the next section), and thus it illustrates the thought process (rarely discussed in the literature or in classrooms) involved in scientific research.

The purpose, as mentioned in the introduction, is to create a number representation system that has no bad seed. All standard systems have plenty of bad seeds. No one knows if Pi is one of those bad seeds, in any number representation system, be it base-2 or base-10. It is widely believed that it is not. In my new system described here, all numbers, including Pi (or zero for that matter) have digits that appear as if they were randomly generated, following some known statistical distribution.

This is part of a general research interest: Trying to prove that numbers such as Pi or SQRT(2) have digits that look random in standard systems, and trying to design better systems for Blockchain or cryptographic applications. By creating this new system, I can say that I have found one in which the digits of Pi have a random distribution compatible with the equilibrium distribution of the system in question, thus debunking the myth that it is impossible to prove or disprove, regardless of the system, that digits of popular mathematical constants are randomly distributed. However this new system is not suitable for business applications, as we will soon see. The real value is in systems studied in section 4.

First version of new number representation system

The new system is defined as follows, and it involves a parameter c (just like the base-b system involves a parameter b) which in this case must be a positive irrational number: Thus this system also has an exact formula for x(n), just like the base-2 system. Also all digits of any seed x, are binary, equal to either 0 or 1, and x(1) = x. The equilibrium distribution is uniform on [0, 1] regardless of x or c (c irrational). You don't need to solve the stochastic integral equation attached to that system to figure this out, it is a consequence of the equidistribution theorem if c is irrational. An immediate consequence is that all seeds are good seeds. The behavior of the system depends on c (for the auto-correlation structure) but not on x. So you might as well study it using only x = 0.  For instance, if c = log(2), the first few digits of x = 0 are as follows:

0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0

All numbers (regardless of c) have 50% of digits equal to 0, and 50% equal to 1. As discussed earlier, this system must have holes. For instance, no number x, not even zero, regardless of c (assuming c is irrational) has periodicity in its digit distribution, unlike all other systems. It is a sound system? Can two different numbers have the same digits? It seems that it is a sound system, though I still have some doubts. However the most spectacular feature of this system is incredibly strong autocorrelations both in the { x(n) } and { a(n) } sequences, no matter which values of c or x you test. For a specific c, autocorrelations do not depend on x (so you might as well look at x =0 alone.) This is a consequence of having an equilibrium distribution. Also, there are incredibly strong conditions on { a(n) } to make it a valid digit sequence, and thus, you could say that holes are everywhere, very obvious, and indeed occupy most of the space just like empty space in the universe.

If you want to check that { x(n) } has a uniform distribution, using data science techniques, you can compute empirical percentiles for x(n) for the first 10,000 values, maybe using x = 0, and find that they perfectly match those of a uniform distribution. This is illustrated in the section 4, and we will do it just using Excel!

Holes, autocorrelations, and entropy (information theory)

The new number representation systems introduced is this section, is full of holes, and actually mostly consists of holes, with respect to the distribution of the digits, as previously stated. Here we explore this feature in more details. One striking way to look at it is as follows.

I computed the first 13 digits of more than 10,000 random numbers, for c = log(6) / log(10). I only found 26 = 2 x 13 possible (valid) configurations, listed below: As an exercise, you can try with different values of c, or with more than 13 digits. If instead you do the same exercise in base 2, you would find that all 2^13 = 8,192 digit configurations are represented. This brings us to the concept of information. How much information does one digit provide about the number x? In base b = 1,000, one digit obviously provides three times as much information as in base b = 10. In base b = 2, one digit provides little information (just one bit), compared to base b = 10. In base b = 1.3 (yes, the base does not need to be an integer, see table in section 1) a digit provides even less information. In my new system, a digit provides even far less information. Indeed, we have this surprising fact: One digit in my system carries much less than one bit of information, proving that the bit, contrarily to popular belief, is not the smallest unit of information. Just like atoms are not the smallest particle of matter. Note that in the above table with the 26 digit configurations, each digit a(1), ... a(13) is equal to 0 exactly 13 times, and equal to 1 exactly 13 times. What a superb pattern! It sounds like a fractional factorial table used in designs of experiments or clinical trials.

Finally, autocorrelations in the { x(n) } or { a(n) } sequences, are very strong in my new system. However, these autocorrelations depend on c (but not on x.) Out of curiosity, I tried to find some values of c that yield very little correlation (lag-1 autocorrelation) between x(n) and x(n+1). Indeed, such c's are easy to find, but in all cases, it results in a strong lag-2 autocorrelation (correlation between x(n) and x(n+2)). The findings are summarized in the chart below. Scatterplot: Lag-1 (X-axis) vs Lag-2 (Y-axis) autocorrelations in the { x(n) } sequence, for 500 values of c

Yes, this is really a scatterplot, produced using about 500 values of c, all with x = 0. Yet all points are concentrated on the special blue curve. The details are in this spreadsheet.

Note that some points in the chart seem to be a little off the main curve. This is due to the fact that any irrational value of c that is very close to a simple rational number, is causing stability issues. The 500 values of c were generated using the function RAND() in Excel, thus some are bound to be very close to simple fractions such as 7/2. Also, these computed correlations are approximations, computed on the first 1,000 terms of the sequence { x(n) }.

The lag-d autocorrelations R(d) (d = 1, 2, and so on) can be efficiently computed on the first n terms using the formula This formula actually applies to any number representation system that has a uniform equilibrium distribution on [0, 1]. The theoretical lag-d auto-correlation can be obtained using the above formula and letting n tends to infinity. Its value depends on c and d (but not on x.) For d = 1,we have: where g is the function that characterizes the system, in this case g(x) = x + c - INT(x + c). The same applies to the system studied in the next section, but using a slightly different g

Finally, another drawback of this system is that it won't map onto other systems, because of these holes that standard systems do not have. To the contrary, there is a mapping between the Logistic Map and the base-2 systems, see here.

4. Towards a more general, better, hybrid system

One would think that by combining the base-2 system, together with the system described in the previous section, you would get a system with many desirable properties.This system is defined as follows: where the parameter b plays the same role as in the base-b system, and the parameter c is the same as in section 3. If b = 1, this is identical to the system described in section 3. If b = 2 and c = 0, it is identical to the base-2 system. I am not sure whether a simple, exact formula is available for x(n).

I have found that when b = 2, based on 5,000 test seeds, all 64 possible combinations are represented for the first 6 binary digits of x. This means that this system probably has no hole, but unfortunately, it must have bad seeds: You can not win both ways. On the plus side, it also means that one digit carries at least one bit of information, usually a desirable property in business applications. The equilibrium distribution for x(n) is also the uniform distribution on [0, 1] assuming b is an integer. The sequence { x(n) } still has auto-correlations (depending on c), but that is also true for the base-2 system.

From now on, we focus exclusively on the case b = 2. Some values of c, for instance c = log(6) / log(10), yield a lag-1 autocorrelation very close to 0 (a desirable property.) By contrast, in the standard base-2 system (corresponding to c = 0) the lag-1 autocorrelation is equal to 1/2. We discuss here two data science techniques that are useful in this context.

Faulty digits, ergodicity, and high precision computing

All the systems investigated here are ergodic, in the sense that they do have a unique equilibrium distribution that does not depend on the (good) seed. In such systems, there are two approaches to investigate statistical properties, for instance to analyse auto-correlations in the { x(n) } or { a(n) } sequences:

• Look at just one good seed, and compute a large number of digits for the seed in question. All good seeds behave the same way. Even if the digits are wrong, say beyond n = 40, it has no impact on global statistical properties in general, as the error process is similar to re-starting the sequence with a new seed every now and then.
• Look at a large number of good seeds, compute the first few digits, and average across those seeds.

The first approach can result in numbers (digits) that are entirely wrong, beyond n = 40, if you use standard arithmetic, due to cumulative errors spreading exponentially fast. This was the case here when using a value of b that is an even integer, due to some dynamics in the way arithmetic computations are performed in Excel (after n = 40, and strangely, only for even integer values of b, all x(n)'s suddenly vanished -- the implicit "re-seeding" mechanism failed.) The same may happen in any programming language. As a result, the second approach is favored.

However, there is another workaround: It consists of using high precision computing, and this is described here. After all, how do you expect, for a given number, to get 5,000 correct digits when machine accuracy (with standard arithmetic) is limited to 15 decimals or so?

Finding the equilibrium distribution with the percentile test

You can always find the equilibrium distribution (at least a very good approximation) by solving the stochastic integral equation associated with your system, and it typically has only one solution. It is actually not that hard to solve, even accessible to college students in their first year. It is illustrated in this article. I solved quite a few of them (with exact solution), but in each case, I started with empirical results to "guess" the equilibrium distribution, before confirming the result with a theoretical proof. We discuss this empirical approach here, which relies on percentile computations and fitting the observed (empirical) percentiles distribution with a few models. The details are available in my Excel spreadsheet (see last sub-section.)

In this case, for b = 2 and c = log(6) / log(10), I proceeded as follows:

• Generate 5,000 seeds, assumed to be good seeds, using the RAND() function in Excel. I chose to have those seeds NOT uniformly distributed on [0, 1], to see how fast the convergence to the uniform equilibrium distribution occurs, as n increases. The seed x is also (by construction) equal to x(1).
• I computed the percentiles of x(n) for n = 1 (non uniform by construction), n = 5, and n = 20, using the percentile function in Excel. It is obvious even to the naked eye, that when n = 20, we have reached a very stable distribution, which turns out to be uniform (as expected.) The only percentile chart that corresponds to a uniform distribution is when your curve is a diagonal, and this is the case here (see red line below). The result is shown in the chart below. Percentiles of x(n) computed across 5,000 seeds: n = 5 is close to uniform, n = 20 is almost uniform

Note that the fast convergence is due to using b = 2. If you use b = 1 instead (corresponding to the system described in section 3) you will need to use a much larger n to notice convergence, which brings again the issue of numerical accuracy. However, when b = 1, the system is numerically stable, you may as well use one seed (x = 0 will do, see section 3) and a large n.

Central limit theorem, random walks, Brownian motions, stock market modeling

The famous central limit theorem applies in this context in two different ways:

• The sequences { x(n) } and { a(n) }, when properly scaled, can be used as building blocks to design random walk models and Brownian stochastic processes (see here), with applications to Fintech. More on this here (read the section on "Connection with Brownian Motions" in that article.) The connection to Brownian motions is a consequence of the classic Central Limit Theorem, with some cumulative distributions converging to a Gaussian distribution regardless of the initial distribution.
• If you look at the above chart, we started with x = x(1) being distributed according to some arbitrary (yet not uniform) distribution. Regardless of the arbitrary distribution used to sample the seed x, the limit distribution is always the same for good seeds. Not a Gaussian one, but uniform in this case. In many number representation systems (see table in section 1) the limit distribution -- called equilibrium distribution in this article -- is not a uniform one (and never a Gaussian one either since x is in [0, 1]) yet is does not depend on the distribution of the seed. You could consider this as the Central Limit Theorem for number representation systems.

Data set and Excel computations

All the data (simulations) and Excel computations (formulas) related to this section, can be downloaded here

Related articles

DSC Resources

Views: 7828

Comment

Join Data Science Central Comment by Vincent Granville on April 30, 2018 at 8:10am

I came across a very technical paper entitled Combinatorial and probabilistic properties of systems of numeration (37-pages long, published in 2016 in Ergodic Theory and Dynamical Systems, see here -- a PDF version is available here.) However, despite my PhD in math (I even published in Journal of Number Theory) I could not make sense of it: It is mostly a list of complex theorems with long proofs and many, many equations. I am wondering what is the overlap between that paper, and my findings here. Clearly, they use dynamical systems for their framework, and so do I (the logistic map is usually studied in the context of dynamical systems.) Also they spend a lot of time discussing what their call the invariant measure of these systems, which I assume is equivalent to my "equilibrium distribution." They focus on uniqueness and existence of this measure, and on ergodic properties. I do too here, but from an applied point of view (though I also use the word ergodic at one point.) Some of my systems do not seem to be covered by their general theory (they say the number zero always has all its digits equal to zero in their framework, but my systems do not have this kind of limitation.) They also discuss slow and fast growing systems: My example in section 3 illustrates slow growth, while the example in section 4 illustrates fast growth. Many of my systems (including base-b, even if b is not an integer, continued fractions, or nested square roots) implicitly use what is called the greedy algorithm to compute the digits; they also mention the greedy algorithm throughout their article. Can someone help me understand what they are actually talking about (compared to what I am doing here?)

The same team (or at least researchers benefiting from the same funding source) wrote another interesting yet technical document entitled Fractals arising from numeration and substitutions, see here. It is the first time that I see the base-b system, with b not an integer, mentioned, besides my own research. I also mention fractals in my article (see section on Brownian motions) but none of these authors mention the stochastic integral equation, and none of them truly discuss probabilistic features of these systems. Anyway, I could not resist posting their nice fractal from their PDF document, see below: Comment by Vincent Granville on April 26, 2018 at 5:25pm

Hi Emil,

To answer your question, my first reaction is to say that they only have disadvantages. However, I could see them useful in some encryption systems, in the sense that they add some kind of random noise to your message, and each digit provides little information if the base is smaller than 2. It comes with massive auto-correlations, but no matter what your message is, the auto-correlation structure is the same, and thus possibly not a big weakness for hackers trying to decipher your message. It comes with a cost: it requires additional computational power to encrypt / decrypt etc.

If you use a transcendental number for the base, it could make the task of hackers much harder.

Vincent Comment by Emil M Friedman on April 26, 2018 at 4:50pm

What is the advantage of a number system that uses a non-integer or even a transcendental base?