# Here we describe well-known chaotic sequences, including new generalizations, with applications to random number generation, highly non-linear auto-regressive models for time series, simulation, random permutations, and the use of big numbers (libraries available in programming languages to work with numbers with hundreds of decimals), as standard computer precision almost always produces completely erroneous results after a few iterations -- a fact rarely if ever mentioned in the scientific literature, but illustrated here, together with a solution. It is possible that all scientists who have published on chaotic processes used faulty numbers because of this issue.

This article is accessible to non-experts, even though we solve a special stochastic equation for the first time, providing an unexpected exact solution, for a new chaotic process that generalizes the logistic map. We also describe a general framework for continuous random number generators, and investigate the interesting auto-correlation structure associated with some of these sequences. References are provided, as well as fast source code to process big numbers accurately, and even an elegant mathematical proof in the last section.

While all the sequences investigated here are deterministic, we treat them as stochastic processes (after all, they are supposed to simulate chaos and randomness!) and we use probabilistic and statistical tools to handle them. To the best of my knowledge, this is the first time that continuous random number generators are described under a general unified framework, at least in the statistical literature. Similar methodology has been applied in several contexts pertaining to physics and mathematics. I hope that this article will jump-start some further research on this topic, and can be understood by a broad audience.

This article is also a useful read for participants in our analytical competition (\$2,000 award) as it addresses a similar stochastic integral equation problem, also with exact solution, in the related context of self-correcting random walks - another kind of memory-less process.

The approach used here starts with traditional data science and simulations for exploratory analysis, with empirical results confirmed later by mathematical arguments in the last section. For simplicity, some advanced topics required to make this framework completely rigorous are not discussed here, for instance the stochastic fixed-point theorem for non-linear integral operators (the fixed point being a statistical distribution, not a static value), including conditions for convergence and uniqueness of solutions.

1. General Framework

We are interested in the sequence of real numbers defined iteratively by

X(k) = g(X(k - 1)) for k > 0,

with s = X(0) in [0, 1] being called the seed, and g being a mapping (a real-valued function, usually non-invertible) from [0, 1] onto [0, 1]. The successive values X(k) can be seen as deviates from a distribution X, used for instance to simulate random numbers.

While this concept of using a statistical distribution to model deterministic values is intuitive to non-statisticians, it is not for some statisticians. So let's discuss X in more detail, as it is central to this article. For the statistician, the distribution of X is the limit of the empirical distribution computed on the X(k)'s, just as if the X(k)'s were observations from some experiment involving sampling. The distribution X may or may not exist (or can be singular) depending on the function g and on the seed.
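The iteration X(k) = g(X(k - 1)) is easy to experiment with. Below is a minimal Python sketch (the helper names are mine, not from the article) that generates the X(k)'s for the classic logistic map discussed in this section, and inspects the empirical mean:

```python
# Minimal sketch of the general framework X(k) = g(X(k-1)).
# The helper names (iterate, logistic) are mine, for illustration only.
import math

def iterate(g, seed, n):
    """Return X(1), ..., X(n) for the map g started at the seed."""
    x, out = seed, []
    for _ in range(n):
        x = g(x)
        out.append(x)
    return out

logistic = lambda x: 4 * x * (1 - x)      # the classic logistic map
xs = iterate(logistic, math.pi / 11, 10000)

# For the logistic map, X has the arcsine distribution, with mean 1/2
print(sum(xs) / len(xs))
```

The empirical mean approaches 1/2, the mean of the arcsine distribution, even though (as discussed in section 2) double-precision values are individually faulty after a few dozen iterations.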

Desirable Properties

• Each X(k) has the same distribution as X on [0, 1] regardless of the seed (except possibly for rare "bad seeds".)
• X(k) is always in [0, 1] and the successive values eventually fill the whole interval in some random order.

The X(k)'s may or may not be correlated. General conditions for convergence and uniqueness are beyond  the scope of this article. Suffice it to say, for the initiated, that X is the fixed-point of a mathematical operator defined by g, and should not depend on the seed (but only on g) except possibly for a set S of "bad seeds" that yield periodicity or other non-random behavior, with the Lebesgue measure of this set S equal to 0. This is the case in all the examples discussed here.

The simplest case exhibiting the desired properties is the logistic map, defined by g(X) = 4X(1 - X). It has been investigated thoroughly, used in many applications, and generalized; exact formulas for X(k) and the distribution of X are known: click here for details.

General Solution

In all cases, the distribution of X is solution of the following equation:

P(X  <  x) = P(g(X)  <  x).

While this sounds like a simple equation, it is actually a stochastic integral equation in disguise, and exact solutions are known only in very few cases. We will solve this equation later in this article, for a special case that at first glance seems more complicated than the logistic map. What makes this equation challenging is the fact that the function g, in all practical examples, is non-invertible on [0, 1].

Usage

Besides solving this functional equation, we are interested here in transforming the X(k)'s to obtain uniform deviates on [0, 1], for instance to produce random numbers on [0, 1]. While the X(k)'s and X also have [0, 1] as support domain, generally the distribution is not uniform, and thus a mapping is necessary to "uniformize" them. When no exact solution is known, an approximate solution obtained via interpolation on the X(k)'s empirical percentiles can be used. Indeed, that's what we did in the case that we solved (see next section), and that's how we discovered that it indeed had an exact solution.

Figure: another example of a self-replicating process (fractals)
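The percentile-based "uniformization" can be sketched as follows. This is a toy reconstruction of the fitting step (my code, not the author's spreadsheet procedure), applied to the LM 0.5 map studied in the next sections: each X(k) is mapped to its empirical percentile, and the result is compared with the candidate mapping 1 - SQRT(1 - x) that is eventually proved exact.

```python
# Sketch: map each X(k) to its rank-based empirical percentile,
# then compare with the candidate exact mapping 1 - sqrt(1 - x).
import math

x, xs = math.pi / 11, []
for _ in range(5000):
    x = math.sqrt(4 * x * (1 - x))   # LM 0.5: g(X) = SQRT(4X(1 - X))
    xs.append(x)

# empirical percentile (rank-based) of each value
order = sorted(range(len(xs)), key=lambda i: xs[i])
u = [0.0] * len(xs)
for rank, i in enumerate(order):
    u[i] = (rank + 0.5) / len(xs)

# compare with the exact mapping discovered via this fitting process
gap = max(abs(u[i] - (1 - math.sqrt(1 - xs[i]))) for i in range(len(xs)))
print("max gap between empirical and exact percentiles:", gap)
```

The gap shrinks as more iterations are used, which is how an exact-looking fit can be spotted from the percentiles alone.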

2. Examples of Chaotic Sequences, and Big Number Programming

Because of the nature of the recursion involved to compute X(k), rounding errors propagate very fast, to the point that after 40 or so iterations, all computed values of X(k) are completely wrong. It is possible that all scientists who published on chaotic processes used faulty numbers because of this issue. In some cases it matters, and in some it does not. See the examples below for details. Usually it is not a real issue, because it is like re-starting the sequence with a new seed every 40 iterations or so. So the fact that the conclusions made by prior authors are sound despite using totally faulty values is an artifact of the process itself, which yields the same distribution X regardless of the seed.

Interestingly, I computed the values for a generalized logistic map (referred to as LM 0.5 below) both with big numbers -- that is, with extreme precision -- and with traditional arithmetic including Excel, and noticed the issue, with both sequences being totally different after 50 iterations. I then computed and used the empirical percentiles from both sequences (big numbers and Excel versions), theoretically representing the same exact process, to fit them with known distributions. I discovered that P(X < p x - q x^2) = x with p close to 2 and q close to 1 in both cases. However, the faulty sequence produced p and q even closer to 2 and 1, so close and with such a great fit, that I decided to check whether indeed, it was the exact solution. Had I been working only with the big numbers, I might not have found that this was indeed the exact solution, as proved in the last section.

It is possible that re-seeding the process periodically, which is what the faulty sequence does, improves the overall estimation for a number of statistics, to the point that it outperforms using correct values computed on a single seed. The source code using big numbers is provided later in this article.

Examples of Sequences

Here are a few examples. I tested them all, mostly with the seed s = Pi / 11.

• FRACT: Uses g(X) = mX - INT(mX), where m is the multiplier and INT represents the integer part function. This is one of the most random, and thus best, sequences, with all auto-correlations equal to 0, and distributed uniformly on [0, 1] without the need for any mapping. This is a consequence of the equidistribution theorem. Computations run very fast, even with big numbers. However, with traditional arithmetic, it gets stuck at 0 forever, after 10 iterations, if m = 32. This is not the case if m = 31: it then produces a faulty sequence, but one that appears random enough for practical purposes. If you avoid using a power of 2 for the multiplier m, and use a seed that is not too close to a simple rational number, you should be fine.
• ABSIN: Uses g(X) = | sin(mX) | where m is a multiplier as in FRACT. My main comment is that with big numbers, it runs very slowly. Also, it is not free from auto-correlations.
• LM 1: Uses g(X) = 4X(1 - X). This is the classic logistic map, with all auto-correlations equal to 0. An exact solution is known. Also, there are infinitely many "bad seeds" to avoid, but these seeds are known and extremely rare. It runs fast even with big numbers. More about LM 1 can be found here.
• LM 0.5: Uses g(X) = SQRT(4X (1 - X)). It has a curious auto-correlation structure, with exponential decay and alternating signs. The first auto-correlation (lag-1) is equal to -1/2. So auto-correlations are quickly vanishing, having no impact on convergence. The sequence can be modified to remove the auto-correlations, using Z(k) = sin(256 X(k)) instead of X(k). The implementation runs fast with big numbers, if you use the SQRT function rather than the power function to compute square roots. Despite the appearance, LM 0.5 is totally different from LM 1, and no smooth mapping exists to map one sequence onto the other. An exact solution for the distribution of X is available, see next section.
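The four sequences above can be sketched in a few lines of Python (the naming and structure are mine, and the multiplier m = 31 for FRACT and ABSIN is one of the safe choices mentioned above):

```python
# Sketch of the four example sequences: FRACT, ABSIN, LM 1, LM 0.5.
import math

def make_sequence(name, seed, n, m=31):
    g = {
        "FRACT": lambda x: (m * x) - int(m * x),   # fractional part of m*x
        "ABSIN": lambda x: abs(math.sin(m * x)),
        "LM1":   lambda x: 4 * x * (1 - x),
        "LM05":  lambda x: math.sqrt(4 * x * (1 - x)),
    }[name]
    x, out = seed, []
    for _ in range(n):
        x = g(x)
        out.append(x)
    return out

seed = math.pi / 11
for name in ("FRACT", "ABSIN", "LM1", "LM05"):
    xs = make_sequence(name, seed, 1000)
    print(name, min(xs), max(xs))
```

All four maps send [0, 1] into [0, 1], so every generated value stays in the unit interval regardless of rounding.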

Discrete, Continuous Sequences and Generalizations

Finally, it would be interesting to see if a similar framework could be applied to discrete random number generators. Pretty much all random number generators in use today are discrete ones, in the sense that each X(k) generated is an integer. Most of them are periodic, though some have a very long period, see here. Continuous random number generators like those presented here are non-periodic, have nice properties, and might be easier to study. However, because of finite memory and machine precision, for all practical purposes they become periodic when implemented, though the period can be immense.

All these sequences, whether made of integers or real numbers, are also stationary, in the sense that X(k) depends only on X(k-1) but not explicitly on k. A possible generalization consists of sequences of the form

X(k) = g(X(k-1), X(k-2))

with two seeds s = X(0) and t = X(1).

Also of interest is the sequence LM p, defined by g(X) = {4X (1 - X)}^p where p is a positive parameter. The factor 4 that appears in the formula for g guarantees that it maps [0, 1] onto [0, 1], as X(1 - X) has a maximum value equal to 1/4. The parameter p must be in a certain range for the sequence to exhibit the desired randomness properties; p = 1 and p = 0.5 work, but p = 0.2 or p = 1.1 do not. Low values of p result in significant auto-correlations, which eventually, when p is low enough, start impacting convergence. High values of p result in the sequence quickly converging to 0.
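The two failure modes are easy to see by simulation. The sketch below (my code; the choice p = 2 as a "high" value is mine, for illustration) contrasts a working value of p with one that collapses to 0:

```python
# Sketch of the LM p family: g(X) = (4X(1-X))^p.
# p = 0.5 is chaotic and fills [0, 1]; a high value like p = 2
# drives the sequence toward 0, as described in the text.
import math

def lmp(p, seed, n):
    x, out = seed, []
    for _ in range(n):
        x = (4 * x * (1 - x)) ** p
        out.append(x)
    return out

seed = math.pi / 11
good = lmp(0.5, seed, 2000)   # works: values keep filling [0, 1]
bad = lmp(2.0, seed, 2000)    # collapses: 0 is super-attracting
print(max(good), max(bad[-100:]))
```

For p = 2, once x is small we have g(x) ≈ 16 x^2, so the sequence rushes to 0 super-exponentially; for p = 0.5 the values keep visiting the whole interval.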

3. The Case LM 0.5, Auto-Regressive Time Series, Source Code

We investigate the case LM 0.5 defined by g(X) = SQRT(4X (1 - X)), that is, X(k) = SQRT( 4 X(k-1) (1 - X(k-1)) ). As discussed earlier, we have empirically obtained a solution, and proved that it was an exact solution, see next section for the proof. This solution is P(X  <  x) = 1 - SQRT(1 - x). As a result, Z(k) = 1 - SQRT(1 - X(k)) has a uniform distribution on [0, 1]. Whether there exists a simple exact formula for X(k), remains an open question. Note that for the case LM 1, an exact formula exists for X(k), see here.
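This exact solution is easy to check empirically (a sketch, my code): generate the LM 0.5 sequence in plain double precision -- which, per section 2, is faulty value-by-value but still follows the same distribution -- and verify that Z(k) = 1 - SQRT(1 - X(k)) has the mean and variance of a uniform variable on [0, 1], namely 1/2 and 1/12.

```python
# Empirical check: Z(k) = 1 - sqrt(1 - X(k)) should be uniform on [0, 1]
# if P(X < x) = 1 - sqrt(1 - x) for the LM 0.5 sequence.
import math

x = math.pi / 11
zs = []
for _ in range(100000):
    x = math.sqrt(4 * x * (1 - x))      # LM 0.5 iteration
    zs.append(1 - math.sqrt(1 - x))     # transform to (supposedly) uniform

mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / len(zs)
print(mean, var)   # uniform on [0, 1] has mean 1/2 and variance 1/12
```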

Auto-correlation Structure

We have E[X] = 2/3, Var[X] = 4/45, E[g(X)] = E[X], Var[g(X)] = Var[X], and thus Covar[X, g(X)] = E[X g(X)] - 4/9 = -2/45, and the lag-1 autocorrelation is equal to Covar[X, g(X)] / Var[X] = -1/2.

A good approximation for the lag-n autocorrelation is (-1/2)^n. This might actually be the exact value, and it is indeed the case for n = 1 and n = 2. One might be able to find the exact value by deriving a recurrence formula to compute R(n) = E[ g(g(g(...g(X)...))) ] where the symbol g appears n times inside the expectation operator E, and using the change of variable y = g(x) in the integral. The lag-n autocorrelation is equal to (R(n) - E^2[X]) / Var[X].
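These values are easy to confirm by simulation. The sketch below (my code, in plain double precision, which per section 2 preserves the distribution and correlation structure) estimates the mean, variance, and the first two autocorrelations of LM 0.5:

```python
# Empirical check of the moments and auto-correlation structure of LM 0.5:
# expect mean 2/3, variance 4/45, and lag-n autocorrelation near (-1/2)^n.
import math

x = math.pi / 11
xs = []
for _ in range(200000):
    x = math.sqrt(4 * x * (1 - x))
    xs.append(x)

n = len(xs)
mean = sum(xs) / n
var = sum((v - mean) ** 2 for v in xs) / n

def autocorr(lag):
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var

print(mean, var)                  # expect about 2/3 and 4/45
print(autocorr(1), autocorr(2))   # expect about -0.5 and 0.25
```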

There are several ways to decorrelate time series, but here we simply used Z(k) = sin(256 X(k)), as it fits our purpose of generating sequences with increased randomness (the opposite of what statisticians do in their daily life), and it works pretty well.

Auto-regressive, Non-linear Time Series

The LM 0.5 sequence can be used to simulate non-linear auto-regressive time series with an auto-correlation structure that is decaying exponentially fast. Indeed, it can be used as an alternate base model for auto-regressive processes. Each seed provides a different sequence, yet all sequences have the same distribution. Also, all sub-sequences of a given sequence have the same distribution. In short, re-seeding a sequence now and then, does not alter the distribution, as the sequence emulates a memory-less stochastic process or time series.

Related Sequences and Literature

I tried to find other articles about the LM 0.5 sequence, but could not find any. There is a lot of literature on generalized logistic maps, and my previous article includes references to the 2-D logistic map and applications to image encryption and random number generators. See also the Wolfram Encyclopedia entry on this subject. This is a topic of active research.

• Route to chaos in generalized logistic map (2017) using g(X) = r X^p (1 - X^q). Authors: Rafał Rak, Ewa Rak. Accepted for publication in Acta Physica Polonica A.
• On some generalized discrete logistic maps (2012) using the same function g as in the previous reference. Published in Journal of Advanced Research. Author: Ahmed G. Radwan.
• A generalization of the logistic map equation (1999), master's thesis, 57 pages. See also here. The focus is on identifying cases with an exact solution. Authors: Valerie Poulin (Carleton), Hugo Touchette (McGill.)
• A Generalization of the Logistic Map, and Its Applications In Generating Pseudo-random Numbers (2006, PDF document.) Authors: Mahesh Shastry et al. See also a similar paper by Samar M. Ismail et al., published in the proceedings of the IEEE 2015 International Conference on Science and Technology, here.
• Importance of Generalized Logistic Distribution in Extreme Value Modeling (2013, available here.) Published in Scientific Research. Authors: K. Nidhin, C. Chandran.
• Design of Positive, Negative, and Alternating Sign Generalized Logistic Maps  (2015, PDF document), using g(X) = r X (a + b X). Published in Discrete Dynamics in Nature and Society. Authors: Wafaa S. Sayed et al.
• Modified Logistic Maps for Cryptographic Application (2015.) Published in Applied Mathematics. Authors: Shahram Etemadi Borujeni, Mohammad Saeed Ehsani.
• Implementation of the Logistic Map with FPGA using 32 bits fixed point standard (2017, accessible here.) Authors:  Diego A. Silva et al.
• Generalized Smooth Transition Map Between Tent and Logistic Maps (2017, click here -- it will cost you \$35) published in International Journal of Bifurcation and Chaos. Authors: Wafaa S. Sayed et al.

Didier Gonze (2015) wrote an interesting tutorial on the subject, see here or here. It describes the same generalizations of the logistic map, the Lyapunov exponent which measures how chaotic the system is, the universal Feigenbaum chaos constant associated with these processes, and the case of time-continuous time series, that is, when k is a real number rather than an integer, resulting in a simple differential equation. The context is to model population dynamics. A more comprehensive reference on the subject is the textbook Chaotic Dynamical Systems (Robert L. Devaney, 2003), where the logistic map is referred to as the quadratic map. For a more recent textbook for beginners, read Non Linear Dynamics and Chaos (Steven H. Strogatz, 2015, 2nd Edition.)

Source: Attractors for various quadratic maps

None of these authors mention LM 0.5 nor its exact solution. The fact that all computations, if based on standard arithmetic, are erroneous, is not mentioned either. It is not clear to me which conclusions are wrong if any, in these studies, because of this issue.

Source Code with Big Numbers, and Model Fitting with Excel

The source code below (in Perl) uses big numbers for LM 0.5. It is interesting to compare the results with the faulty sequence obtained with traditional arithmetic. Results are in this spreadsheet, which also features the data science technique used for model fitting, based on the empirical distribution and percentiles analysis.

Big number arithmetic also produces faulty sequences, but after many more iterations. This is not an issue, as long as X does not depend on the seed. The loss of accuracy with standard computer arithmetic becomes dramatic after 50 iterations or so. Think about this: for the case LM 1 for instance, X(k) is a polynomial (function of the seed s) of degree 2^k, since each iteration doubles the degree. How do you expect to get a value correct to even one decimal, using standard arithmetic, for k as small as 50?

We encourage the reader to post a Python version (with big number arithmetic) in the comment section below. The Python libraries that I've found handle big integers, but not decimal numbers with hundreds of decimals. It might be possible to just use big integers; after all, the computations in the code below rely implicitly on big integers, via \$z=int(0.5+\$bigint*\$z)/\$bigint.

use strict;
use bignum;

my \$pi=4*atan2(1,1);
my \$bigint= 10**50;
my \$seed=\$pi/11;

my \$k;
my \$z=\$seed;
my \$zp;

open(OUT,">randgen.txt");

for (\$k=1; \$k<10000; \$k++) {

\$z=4*\$z*(1-\$z);
\$z=sqrt(\$z);

# Note: \$z=\$z**0.5 is much slower than \$z=sqrt(\$z)

\$z=int(0.5+\$bigint*\$z)/\$bigint;
\$zp=sprintf("%.10f",\$z); # get first 10 decimals

print OUT "\$k\t\$zp\n";

print "\$k -- \$zp\n";
select()->flush();       # you can delete this instruction
}

close(OUT);

The variable \$bigint in the code is used to set the precision, here to 50 decimals. Without it, the code is incredibly slow and becomes much slower at each iteration, producing values with more and more decimals.
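Regarding the Python version requested above: Python's standard decimal module does support a user-set precision (here 50 significant digits), so a tentative translation of the Perl script is sketched below. This is my code, not the author's; the seed is the double-precision approximation of pi/11 (the Perl code derives pi from atan2). It also runs the same recursion in plain floats alongside, to show how quickly rounding errors propagate:

```python
# A possible Python translation of the Perl script, using the standard
# decimal module for high precision. A plain double-precision copy of the
# recursion is run alongside to exhibit the divergence discussed above.
import math
from decimal import Decimal, getcontext

getcontext().prec = 50                   # 50 significant digits

seed = Decimal(math.pi) / 11             # pi/11, from double-precision pi
z, zf = seed, float(seed)
gap_at = None                            # first iteration with a visible gap

for k in range(1, 201):
    z = (4 * z * (1 - z)).sqrt()         # LM 0.5, high precision
    zf = math.sqrt(4 * zf * (1 - zf))    # LM 0.5, double precision
    if gap_at is None and abs(float(z) - zf) > 0.01:
        gap_at = k

print("sequences diverge around iteration", gap_at)
```

Consistent with the Java experiment reported in the comments, the two sequences part ways after a few dozen iterations.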

4. Proof of Main Results

We prove the following results, for the LM 0.5 sequence, with x in [0, 1]:

• P(X  <  x) = 1 - SQRT(1 - x)
• Let Y = 1 - SQRT(1 - X). Then Y has a uniform distribution on [0, 1].
• P(X < 2x - x^2) = x.

To prove the first result, we plug the candidate distribution into the stochastic functional equation P(X < x) = P(g(X) < x), using the fact that g is a two-to-one concave function on [0, 1], and we check that the stochastic equation is satisfied. The second result follows immediately from the probability integral transform. The third result is left as an exercise.

The proof of the first result proceeds as follows. Since g(X) = SQRT(4X (1 - X)), the event g(X) < x is equivalent to 4X(1 - X) < x^2, that is, X < (1 - s)/2 or X > (1 + s)/2, where s = SQRT(1 - x^2). Thus we have:

P(g(X) < x) = P(X < (1 - s)/2) + P(X > (1 + s)/2) = 1 - SQRT((1 + s)/2) + SQRT((1 - s)/2),

using P(X < x) = 1 - SQRT(1 - x), and the stochastic equation to verify becomes, after re-arrangement:

SQRT((1 + s)/2) - SQRT((1 - s)/2) = SQRT(1 - x).

After squaring both sides, the left-hand side becomes 1 - SQRT(1 - s^2) = 1 - SQRT(x^2) = 1 - x, as most terms cancel out; the equality thus reduces to 1 - x = 1 - x, which shows that the stochastic equation is satisfied, thus completing the proof.
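The key identity can also be checked numerically. Since g is two-to-one, g(X) < x means X < (1 - s)/2 or X > (1 + s)/2 with s = SQRT(1 - x^2); plugging in F(x) = 1 - SQRT(1 - x) gives P(g(X) < x) = 1 - SQRT((1 + s)/2) + SQRT((1 - s)/2). The snippet below (my code) verifies on a grid that this equals F(x):

```python
# Numerical sanity check of the identity behind the proof:
# with F(x) = 1 - sqrt(1 - x) and s = sqrt(1 - x^2),
# 1 - sqrt((1 + s)/2) + sqrt((1 - s)/2) must equal F(x) on [0, 1].
import math

def F(x):
    return 1 - math.sqrt(1 - x)

def P_gX_less(x):
    s = math.sqrt(1 - x * x)
    return 1 - math.sqrt((1 + s) / 2) + math.sqrt((1 - s) / 2)

worst = max(abs(F(x) - P_gX_less(x)) for x in
            [i / 1000 for i in range(1001)])
print(worst)   # should be at machine-precision level
```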

Note that there are automated tools for this type of tedious computation; they are usually referred to as symbolic math software. Some are available online.

Comments

Comment by Matthew Boerner on November 14, 2017 at 6:23pm

I posted a Java version of the algorithm at https://www.datasciencecentral.com/forum/topics/question-how-precis... using the same seed. The sequence diverges from the correct value at iteration #30 using my Java translation.

Comment by Vincent Granville on November 13, 2017 at 1:49pm

A Python version for the source code is now available here. I haven't checked it yet though.

Comment by Vincent Granville on November 13, 2017 at 9:42am

Hi Muhammad, I don't, but if you find one (especially with high precision as in my Perl script) feel free to post it here too.

Comment by Muhammad Anees on November 12, 2017 at 2:12am

Dear Vincent, do you have the code in R version for this? I might happen to apply Chaotic Time Series in one of an academic project, please.

Comment by Jorge Reyes-Spindola on November 9, 2017 at 7:07am

My apologies, Vincent, I didn't read carefully enough.

Comment by Vincent Granville on November 8, 2017 at 1:59pm

Hi Jorge, yes it says it is in Perl, and that I am looking for a Python implementation with big numbers. Do you know how to get big numbers to work in Python, if yes please let us know, this would be a great addition to this article. Thank you!

Comment by Jorge Reyes-Spindola on November 8, 2017 at 1:31pm

You do realize that the code you posted is in Perl, not Python, right?