.

There is no statistical test that assesses whether a sequence of observations, time series, or residuals in a regression model, exhibits independence or not. Typically, what data scientists do is to look at auto-correlations and see whether they are close enough to zero. If the data follows a Gaussian distribution, then absence of auto-correlations implies independence. Here however, we are dealing with non-Gaussian observations. The setting is similar to testing whether a pseudo-random number generator is random enough, or whether the digits of a number such as π behave in a way that looks random, even though the sequence of digits is deterministic. Batteries of statistical tests are available to address this problem, but there is no one-fit-all solution.

Here we propose a new approach. Likewise, it is not a panacea, but rather a set of additional powerful tools to help test for independence and randomness. The data sets under consideration are specific mathematical sequences, some of which are known to exhibit independence / randomness or not. Thus, it constitutes a good setting to benchmark and compare various statistical tests and see how well they perform. This kind of data is also more natural and looks more real than synthetic data obtained via simulations.

**1. Definition of random-like sequences**

Since we are dealing with deterministic sequences (*xn*) indexed by *n* = 1, 2, and so on, it is worth defining what we mean by *independence* and *random-like*. These two elementary concepts are very intuitive, but a formal definition may help. You may skip this section if you have an intuitive understanding of the concepts in question, as the layman does. Independence in this context is sometimes called *asymptotic independence*, see here. Also, for all the sequences investigated here, *xn* ∈ [0,1].

**1.1. Definition of random-like and independence**

A sequence (*xn*) with *xn* ∈ [0,1] is *random-like* if it satisfies the following property. For any finite index family *h*1,…, *hk* and for any *t1*,…, *tk* ∈ [0,1], we have

The probabilities are empirical probabilities, that is, based on frequency counts. For instance,

where χ(*A*) is the indicator function (equal to 1 if the event *A* is true, and equal to 0 otherwise). Random-like implies independence, but the converse is not true. A sequence is *independently distributed* if it satisfies the weaker property

Random-like means that the *xn*'s all have the same underlying uniform distribution on [0, 1], and are independently distributed.

**1.2. Definition of lag- k autocorrelation**

Again, this is just the standard definition of auto-correlations, but applied to infinite deterministic sequences. The lag-*k* auto-correlation ρ*k* is defined as follows. First define ρ*k*(*n*) as the empirical correlation between (*x*1,…, *xn*) and (*xk*+1,… ,*xk*+*n*). Then ρ*k* is the limit (if it exists) of ρ*k*(*n*) as *n* tends to infinity.

**1.3. Equidistribution and fractional part denoted as { }**

The fractional part of a positive real number *x* is denoted as { *x* }. For instance, { 3.141592 } = 0.141592. The sequences investigated here come from number theory. In that context, concepts such as random-like and identically distributed are rarely used. Instead, mathematicians rely on the weaker concept of *equidistribution*, also called equidistribution modulo 1. Closer to independence is the concept of equidistribution in higher dimensions, for instance if two successive values (*xn*, *xn*+1) are equidistributed on [0, 1] x [0, 1].

A sequence can be equidistributed yet exhibits strong auto-correlations. The most famous example is the sequence *xn* = { *αn* } where *α* is a positive irrational number. While equidistributed, it has strong lag-*k* auto-correlations for every strictly positive integer *k*, and it is anything but random-like. A sequence that looks perfectly random-like is the digits of π: they can not be distinguished from a realization of a perfect Bernouilli process. Such random-like sequences are very useful in cryptographic applications.

**2. Testing well-known sequences**

The sequences we are interested in are *xn* = { *α n*^*p* }** ** where { } is the fractional part function (see section 1.3), *p* > 1 is a real number and *α* is a positive irrational number. Other sequences are discussed in section 3. It is well known that these sequences are equidistributed. Also, if *p* = 1, these sequences are highly auto-correlated and thus the terms *xn*'s are not independently distributed, much less random-like; the exact theoretical lag-*k* auto-correlations are known. The question here is what happens if *p* > 1. It seems that in that case, there is much more randomness. In this section, we explore three statistical tests (including a new one) to assess how random these sequences can be depending on the parameters *p* and *α*. The theoretical answer to that question is known, thus this provides a good case study to check how various statistical tests perform to detect randomness, or lack of it.

**2.1. The gap test**

The gap test (some people may call it run test) proceeds as follows. Let us define the binary digit *dn* as *dn* = ⌊2*xn*⌋. The brackets represent the integer part function. Say *dn* = 0 and *dn*+1 = 1 for a specific n. If *dn* is followed by *G* successive digits *dn*+1,…, *dn*+*G* all equal to 1 and then *dn*+*G*+1 = 0, we have one instance of a gap of length *G*. Compute the empirical distribution of these gaps. Assuming 50% of the digits are 0 (this is the case in all our examples), then the empirical gap distribution converges to a geometric distribution of parameter 1/2 if the sequence *xn* is random-like.

This is best illustrated in chapter 4 of my book *Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems, *available here.

**2.2. The collinearity test**

Many sequences pass several tests yet fail the collinearity test. This test checks whether there are *k* constants *a*1, ..., *ak* with *ak* not equal to zero, such that *xn*+*k* = *a*1 *xn*+*k-1* + *a*2 *xn*+*k*-2 + ... + *ak* *xn* takes only on a finite (usually small) number of values. In short, it addresses this question: are *k* successive values of the sequence *xn* always lie (exactly, approximately, or asymptotically) in a finite number of hyperplanes of dimension *k* - 1? This test has been used to determine that some congruential pseudo-random number generators were of very poor quality, see here. It is illustrated in section 3, with *k* = 2.

Source code and examples for *k* = 3 can be found here.

**2.3. The independence test**

This may be a new test: I could not find any reference to it in the literature. It does not test for full independence, but rather for random-like behavior in small dimensions (*k* = 2, 3, 4). Beyond *k* = 4, it becomes somewhat unpractical as it requires a number of observations (that is, the number of computed terms in the sequence) growing exponentially fast with *k*. However, it is a very intuitive test. It proceeds as follows, for a fixed *k*:

- Let
*N*> 100 be an integer - Let
*T*be a*k*-uple (*t*1,...,*tk*) with*tj*∈ [0,1] for*j*= 1, ...,*k.* - Compute the following two quantities, with χ being the indicator function as in section 1.2:

- Repeat this computation for
*M*different*k*-uples randomly selected in the*k*-dimensional unit hypercube

Now plot the *M* vectors (*PT, QT*), each corresponding to a different *k*-uple, on a scatterplot. Unless the *M* points lie very close to the main diagonal on the scatterplot, the sequence *xn* is not random-like. To see how far away you can be from the main diagonal without violating the random-like assumption, do the same computations for 10 different sequences consisting this time of truly random terms. This will give you a confidence band around the main diagonal, and vectors (*PT, QT*) lying outside that band, for the original sequence you are interested in, suggests areas where the randomness assumption is violated. This is illustrated in the picture below, originally posted here:

**Figure 1**

As you can see, there is a strong enough departure from the main diagonal, and the sequence in question (see same reference) is known not to be random-like. The X-axis features *PT*, and the Y-axis features *QT*. An example with known random-like behavior, resulting in an almost perfect diagonal, is also featured in the same article. Notice that there are fewer and fewer points as you move towards the upper right corner. The higher *k*, the more sparse the upper right corner will be. In the above example, *k* = 3. To address this issue, proceed as follows, stretching the point distribution along the diagonal:

- Let
*P*T*= (- 2 log*PT*) /*k*and*Q***T*= (- 2 log*QT*) /*k*. This is a transformation leading to a Gamma(*k*, 2/*k*) distribution. See explanations here. - Let
*P****T*=*F*(*P***T*) and*Q****T*=*F*(*Q***T*) where*F*is the cumulative distribution function of a Gamma(*k*, 2/*k*) random variable.

By virtue of the inverse transform sampling theorem, the points (*P****T*, *Q****T*) are now uniformly stretched along the main diagonal.

**3. Results and generalization**

Let's get back to our sequence *xn* = { *α n*^*p* } with *p* > 1 and *α* irrational. Before showing and discussing some charts, I want to discuss a few issues. First, if *p* is large, machine accuracy will quickly result in erroneous computations for *xn*. You need to detect when loss of accuracy becomes a critical problem, usually well below *n* = 1,000 if *p* = 5. Working with double precision arithmetic will help. Another issue, if *p* is close to 1, is the fact that randomness does not kick in until *n* is large enough. You may have to ignore the first few hundreds terms of the sequence in that case. If *p* = 1, randomness never occurs. Also, we have assumed that the marginal distributions are uniform on [0, 1]. From the theoretical point of view, they indeed are, and it will show if you compute the empirical percentile distribution of *xn*, even in the presence of strong auto-correlations (the reason why is because of the ergodic nature of the sequences in question, but this topic is beyond the scope of the present article). So it would be a good exercise to use various statistical tools or libraries to assess whether they can confirm the uniform distribution assumption.

**3.1. Examples**

The exact theoretical value of the lag-*k* auto-correlation is known for all *k* if *p* = 1. See section 5.4 in this article. It is almost never equal to zero, but it turns out that if *k* = 1, *p* = 1 and *α* = (3 + SQRT(3))/6, it is indeed equal to zero. Use a statistical package to see if it can detect this fact, or ask your team to do the test. Also, if *p* is an integer, show (using statistical techniques) that for some *a*1, ..., *a*k, we have *xn*+*k* = *a*1 *xn*+*k-1* + *a*2 *xn*+*k*-2 + ... + *ak* *xn* takes only on a finite number of values as discussed in section 2.2, and thus, the random-like assumption is always violated. In particular, *k* = 2 if *p* = 1. This is also true *asymptotically* if *p* is not an integer, see here for details. Yet, if *p* > 1, the auto-correlations are very close to zero, unlike the case *p* = 1. But are they truly identical to zero? What about the sequence *xn* = { *α*^*n* } with say *α* = log 3? Is it random-like? Nobody knows. Of course, if *α* = (1 + SQRT(5))/2, that sequence is anything but random, so it depends on *α*.

Below are three scatterplots showing the distribution of (*xn*, *xn*+1) for a few hundreds value of *n*, for various *α* and *p*, for the sequence *xn* = { *α* *n*^*p* }. The X-axis represents *xn*, the Y-axis represents *xn*+1.

**Figure 2**: *p = SQRT(7), α = 1*

Even to the trained naked eye, Figure 2 shows randomness in 2 dimensions. Independence may fail in higher dimensions (k > 2) as the sequence is known not to be random-like. There is no apparent collinearity pattern as discussed in section 2.2, at least for *k* = 2. Can you run some test to detect lack of randomness in higher dimensions?

**Figure 3**: *p = 1.4, α = log 2*

To the trained naked eye, Figure 3 shows lack of randomness as highlighted in the red band. Can you do a test to confirm this? If the test is inclusive or provide the wrong answer, than the naked eye performs better, in this case, than statistical software.

**Figure 4**: *p = 1.1, α = log 2*

Here (Figure 4) any statistical software and any human being, even the layman, can identify lack of randomness in more than one way. As *p* gets closer and closer to 1, lack of randomness is obvious, and the collinearity issue discussed in section 1.2, even if fuzzy, becomes more apparent even in two dimensions.

**3.2. Independence between two sequences**

It is known that if *α* and *β* are irrational numbers linearly independent over the set of rational numbers, then the sequences { *αn* } and { *βn* } are not correlated, even though each one taken separately is heavily auto-correlated. A sketch proof of this result can be found in the Appendix of this article. But are they really independent? Test, using statistical software, the absence of correlation if *α *= log 2 and *β* = log 3. How would you do to test independence? The methodology presented in section 2.3 can be adapted and used to answer this question empirically (although not theoretically).

**About the author**: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent's articles and books, here.

© 2021 TechTarget, Inc. Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central