How to detect a pattern? Problem and solution.

Check the three charts below: only one shows no pattern and is truly random. Which one?

Chart #1


Chart #2


Chart #3


It is very clear that chart #3 exhibits a strong clustering pattern, unless you define your problem as points randomly distributed in an unknown domain whose boundary has to be estimated. So, the big question is: between chart #1 and #2, which one represents randomness? Look at these charts very closely for 60 seconds, then make a guess, then read on. Note that all three charts contain the same number of points - so there's no scaling issue involved here.

Let's assume that we are dealing with a spatial distribution of points over the entire 2-dimensional space, and that observations are seen through a small square window. For instance, points (observations) could be stars as seen in a picture taken through a telescope.

The first issue is that the data is censored: if you look at the distribution of nearest-neighbor distances to draw conclusions, you must take into account the fact that points near the boundary have fewer neighbors because some neighbors are outside the boundary. You can eliminate the bias by

  • Tiling the observation window to produce a mathematical tessellation
  • Mapping the square observation window onto the surface of a torus
  • Applying statistical bias-correction techniques
  • Using Monte-Carlo simulations to estimate what the true distribution would be (with confidence intervals) if the data were truly random
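As a rough illustration of the torus idea, here is a minimal sketch that computes nearest-neighbor distances with and without wrap-around. It assumes the observation window is the unit square and that the points are uniform on it; the function names (`torus_distance`, `nearest_neighbor_distances`) are made up for this example.

```python
import math
import random

def torus_distance(p, q, side=1.0):
    """Distance between p and q when the square window wraps around like a torus."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    dx = min(dx, side - dx)  # crossing the left/right edge may be shorter
    dy = min(dy, side - dy)  # same for the top/bottom edge
    return math.hypot(dx, dy)

def nearest_neighbor_distances(points, side=1.0, wrap=True):
    """Nearest-neighbor distance of every point, with or without wrap-around."""
    result = []
    for i, p in enumerate(points):
        best = float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = torus_distance(p, q, side) if wrap else math.dist(p, q)
            best = min(best, d)
        result.append(best)
    return result

# Assumption: 158 points, uniform on the unit square, as a stand-in for real data.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(158)]
plain = nearest_neighbor_distances(points, wrap=False)
wrapped = nearest_neighbor_distances(points, wrap=True)
print(sum(plain) / len(plain), sum(wrapped) / len(wrapped))
```

With wrap-around, a point near the left edge can find its nearest neighbor near the right edge, so no distance computed this way is ever larger than the plain one. Whether wrapping is appropriate depends on the data: it treats the pattern as if it repeated identically outside the window.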

Second issue: you need to use better visualization tools to see the patterns. The fact that I use a + rather than a dot symbol to represent the points helps: some points are so close to each other that if you represent them with dots, you won't see the double points (in our example, double points could correspond to double star systems, and these very small-scale point interactions are part of what makes the distribution non-random in two of our charts). But you can do much better: you could measure a number of metrics (averages, standard deviations, correlation between x and y, number of points in each sub-square, density estimates, etc.) and identify metrics showing that we are not dealing with pure randomness.

In these three charts, the standard deviation of either x or y should, in the case of pure randomness, be 0.290 plus or minus 0.005. Only one of the three charts passes this randomness test.
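One way to get a feel for this kind of tolerance is to simulate it. The sketch below assumes that "pure randomness" means coordinates uniform on [0, 1] (for which the theoretical standard deviation is 1/sqrt(12), about 0.289) and that each chart has 158 points, as stated in the recipe at the end of this article. The function name `simulated_sd_range` is hypothetical.

```python
import random
import statistics

def simulated_sd_range(n_points=158, n_sims=2000, seed=1):
    """Mean and central 95% range of the sample standard deviation
    of n_points uniform deviates, estimated over n_sims simulations."""
    rng = random.Random(seed)
    sds = sorted(
        statistics.stdev([rng.random() for _ in range(n_points)])
        for _ in range(n_sims)
    )
    lo = sds[int(0.025 * n_sims)]        # approximate 2.5th percentile
    hi = sds[int(0.975 * n_sims) - 1]    # approximate 97.5th percentile
    return statistics.mean(sds), lo, hi

print(simulated_sd_range())
```

The range printed here is empirical and depends on the number of points, so it is worth comparing against whatever tolerance you plan to use before rejecting a data set.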

Third issue: even if multiple statistical tests suggest that the data is truly random, it does not mean that it really is. For instance, all three charts show zero correlation between x and y, and have mean x and y close to 0.50 (a requirement to qualify as a random distribution in this case). However, only one chart exhibits randomness.

Fourth issue: we need a mathematical framework to define and check randomness. True randomness is the realization of a Poisson stochastic process, and we need to use metrics that uniquely characterize a Poisson process to check whether a point distribution is truly random or not. Such metrics could be, for example:

  • The inter-point distance distributions
  • Number of observations in sub-squares (these counts should be uniformly distributed over the sub-squares, and a Chi-square test could provide the answer; however in our charts, we don't have enough points in each sub-square to provide a valid test result)
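The sub-square count idea can be sketched as follows. This is only an illustration: it assumes points in the unit square and uses a coarse 4-by-4 grid so that the expected count per cell stays near 10 with 158 points. A grid that coarse will not detect small-scale patterns such as double points. The helper names are made up.

```python
import random

def subsquare_counts(points, k):
    """Count points in each cell of a k-by-k grid over the unit square."""
    counts = [0] * (k * k)
    for x, y in points:
        col = min(int(x * k), k - 1)  # clamp so x == 1.0 falls in the last cell
        row = min(int(y * k), k - 1)
        counts[row * k + col] += 1
    return counts

def chi_square_statistic(counts):
    """Pearson statistic against equal expected counts in every cell."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Assumption: uniform points as a stand-in for one of the charts.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(158)]
counts = subsquare_counts(points, k=4)
stat = chi_square_statistic(counts)
# Compare stat with a chi-square distribution on k*k - 1 = 15 degrees of freedom.
print(counts, stat)
```

A very large statistic points to clustering, while a statistic very close to zero points to counts that are too even, which is also a departure from randomness.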

Fifth issue: some of the best metrics (distances between kth-nearest neighbors) might not have a simple mathematical formula. But we can use Monte-Carlo simulations to address this issue: simulate a random process, compute the distribution of distances (with confidence intervals) based on thousands of simulations, and compare with the distances computed on your data. If the distance distribution computed on the data set matches the results from the simulations, the data is probably random. However, we would have to make sure that the distance distribution uniquely characterizes a Poisson process, and that no non-random process could yield the same distance distribution. This exercise is known as goodness-of-fit testing: you check whether your data supports a specific hypothesis of randomness.
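Here is a minimal sketch of such a Monte-Carlo check. To keep it short, it summarizes the distance distribution by a single number, the mean nearest-neighbor distance, which is a simplification of what is described above. It also assumes the null model is uniform points on the unit square and ignores edge correction. All function names are hypothetical.

```python
import math
import random

def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbor."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def monte_carlo_envelope(n_points, n_sims=200, level=0.95, seed=0):
    """Approximate central range of the statistic under simulated randomness."""
    rng = random.Random(seed)
    sims = sorted(
        mean_nn_distance([(rng.random(), rng.random()) for _ in range(n_points)])
        for _ in range(n_sims)
    )
    tail = (1.0 - level) / 2.0
    lo = sims[int(tail * n_sims)]
    hi = sims[int((1.0 - tail) * n_sims) - 1]
    return lo, hi

def looks_random(points, **kwargs):
    """True if the observed statistic falls inside the simulated range."""
    lo, hi = monte_carlo_envelope(len(points), **kwargs)
    return lo <= mean_nn_distance(points) <= hi

# Usage with stand-in data; in practice use your own points and more simulations.
rng = random.Random(42)
data = [(rng.random(), rng.random()) for _ in range(158)]
print(looks_random(data, n_sims=100))
```

Passing this check does not establish randomness: a non-random process can share the same mean nearest-neighbor distance, which is exactly the uniqueness concern raised above.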

Sixth issue: if you have a million points (and in high dimensions, you need many more than a million points due to the curse of dimensionality), then you have roughly half a trillion distinct pairwise distances to compute. Computing all of them by brute force, and then repeating the exercise on every simulated data set, is prohibitively expensive. So you need to pick 10,000 points at random, compute distances, and compare with equivalent computations based on simulated data. You need to run 1,000 simulations to get confidence intervals, but this is feasible.
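The arithmetic behind this is easy to check. The sketch below counts each pair of points once and shows a simple way to draw the random subsample. The helper names `pair_count` and `subsample` are invented for this example.

```python
import random

def pair_count(n):
    """Number of distinct pairwise distances among n points."""
    return n * (n - 1) // 2

print(pair_count(1_000_000))  # 499999500000
print(pair_count(10_000))     # 49995000

def subsample(points, size, seed=0):
    """Draw `size` points at random, without replacement."""
    return random.Random(seed).sample(points, size)
```

Note that a subsample is sparser than the full data set, so its distances must be compared with simulations of the same size (10,000 simulated points), not with simulations of a million points.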

Here's how the data (charts 1-3) was created:

  • Produce 158 random points [a(n), b(n)], n=1,...,158
  • Produce 158 random deviates u(n), v(n), n=1,...,158
  • Define x(n) as follows for n>1: if u(n) < r, then x(n) = a(n), else x(n) = s*v(n)*a(n) + [1-s*v(n)]*x(n-1), with x(1)=a(1) 
  • Define y(n) as follows for n>1: if u(n) < r, then y(n) = b(n), else y(n) = s*v(n)*b(n) + [1-s*v(n)]*y(n-1), with y(1)=b(1) 
  • Chart 1: x(n)=a(n), y(n)=b(n)
  • Chart 2: r=0.5, s=0.5
  • Chart 3: r=0.4, s=0.1
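The recipe above translates almost line for line into code. This sketch assumes that "random points" and "random deviates" mean independent values uniform on [0, 1], which the recipe does not state explicitly. The function name `make_chart` is made up, and since the original seed is unknown, the output will not reproduce the exact charts.

```python
import random

def make_chart(a, b, u, v, r, s):
    """Apply the recurrence to base points (a, b) using deviates (u, v)."""
    x, y = [a[0]], [b[0]]            # x(1) = a(1), y(1) = b(1)
    for n in range(1, len(a)):
        if u[n] < r:
            # With probability r, jump to a fresh random location.
            x.append(a[n])
            y.append(b[n])
        else:
            # Otherwise move only part of the way from the previous point.
            w = s * v[n]
            x.append(w * a[n] + (1 - w) * x[-1])
            y.append(w * b[n] + (1 - w) * y[-1])
    return list(zip(x, y))

# Assumption: all inputs are uniform on [0, 1]; the seed is arbitrary.
rng = random.Random(0)
N = 158
a = [rng.random() for _ in range(N)]
b = [rng.random() for _ in range(N)]
u = [rng.random() for _ in range(N)]
v = [rng.random() for _ in range(N)]

chart1 = list(zip(a, b))
chart2 = make_chart(a, b, u, v, r=0.5, s=0.5)
chart3 = make_chart(a, b, u, v, r=0.4, s=0.1)
print(chart2[:3])
```

Small values of s keep each new point close to the previous one, which is what produces the tight clusters in chart 3.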


  • The only chart exhibiting randomness is chart #1. Chart #2 has standard deviations for x and y that are significantly too low, too few points near the boundaries, and too many points that are very close to each other
  • Note that chart #1 (the random distribution) exhibits a little bit of clustering, as well as some point alignments: this is, however, perfectly expected from a random distribution. If the number of points in each sub-square were identical, the distribution would not be random, but would correspond to a situation where antagonistic forces make points stay as far away as possible from each other.
  • How would you test randomness if you had only two points (impossible to test), three points, or just 10 points?
  • Finally, once a pattern is detected (e.g. abnormal close proximity between neighboring points), it should be interpreted and/or leveraged, that is, it should lead to (say) ROI-positive trading rules if the framework is about stock trading, or the conclusion that double stars do exist (based on chart #2) if the framework is astronomy
  • See an example of random number generator at http://www.analyticbridge.com/profiles/blogs/new-state-of-the-art-r...



Tags: Data Analytics, Patterns, Visualization

