We have produced a 90-second video (click on this link to view the video) showing a 'random walk' (a particular case of a Markov process) evolving over 400,000 steps. Figure 1 below shows the last frame (out of 2,000 frames, each one with 200 new steps).

Figure 1: Last frame of the video: darker areas correspond to locations visited long ago

The video consists of 2,000 frames, each showing 200 new steps (represented by red dots), as the random walk progresses over time. Sometimes we seem to get stuck locally, sometimes we progress dramatically at high speed. Older frames stay on the video, but are shown in gray rather than red. The darkest areas are the oldest ones, visited at the beginning. This type of process, as seen in the video, generates interesting patterns. A total of 400,000 points (in various colors) are generated during this 90-second video.

1. Definitions, Usage

A basic, two-state (going up or down), one-dimensional Markov process is defined as follows: You start at time t=0, walking along the X-axis (representing time). At each iteration (also called step), you move up with probability p, and down with probability q, along the Y-axis. The Y-axis could represent gains/losses in a gamble (rolling a die), stock market gains, etc.

When p = q (same chance of moving up or down), this is called a random walk. In two dimensions, you add two additional potential moves: going left or right.
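To make the definition concrete, here is a minimal Python sketch of the basic one-dimensional process (the article's own code, shown in later sections, is in Perl and R; the function name below is mine):

```python
import random

def random_walk_1d(n_steps, p=0.5, seed=None):
    """One-dimensional walk: at each step, move up (+1) with
    probability p, down (-1) with probability 1 - p."""
    rng = random.Random(seed)
    y, path = 0, [0]
    for _ in range(n_steps):
        y += 1 if rng.random() < p else -1
        path.append(y)
    return path

# A symmetric walk (p = q = 0.5) over 1,000 steps:
path = random_walk_1d(1000, p=0.5, seed=42)
```

The two-dimensional version simply runs an independent copy of the same stepping rule along the X-axis.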

Markov processes are a modeling tool extensively used in data science. In particular, they are used

• In operations research (queueing theory).
• In Bayesian statistics (Markov Chain Monte Carlo inference, known as MCMC, to solve parameter estimation and forecasting problems in large, complex, semi-random systems involving numerous layers and interactions).
• In statistical physics, to model behavior of gas particles in a container: in this context, it is related to Brownian motion and fractals.
• In quantitative finance (Wall Street, high frequency trading). See section 3: a new fundamental, simple Wall Street theorem.

The purpose of this article is threefold:

1. To offer you the tools (R code) to replicate and enhance our results, and to teach you how to produce similar videos representing data evolving over time, using millions of colors,
2. To offer you the tools (script) to simulate random walks, and replicate our results.
3. To make you think about, and enjoy fascinating, challenging mathematical problems associated with these seemingly simple processes.

2. Challenging questions and curious facts about these random walks

You will need a very high-quality random number generator to produce billions of steps and study the long-term properties of these processes. Such random generators are discussed here (also, this related article is worth reading).

In Figure 1, we are dealing with a realization of a random walk starting at x=0, y=0 (top left corner). Even after 400,000 steps, we never go above y=0 or to the left of x=0, except at the very beginning. In short, we very quickly get stuck in the South-East quadrant. This is clearly shown in Figure 3. Figure 2 also shows that we make great progress in the South-East quadrant, reaching x values as high as 200, and y values as low as -150, throughout the 400,000 steps. Yet, after 200,000 steps, we seem stuck: no new extreme is reached. This has to do with the peculiar distribution of extreme values (or distribution of records).

In theory though - over trillions of trillions of trillions of trillions of steps, over an eternity indeed - you are going to visit all quadrants, each quadrant being visited infinitely many times, reaching arbitrarily large or negative x and y values. How weird!

In Figures 2 and 3 below, the horizontal axis represents time (measured in steps, from 0 to 400,000). The vertical axis represents the extremes (biggest maxima, minima) reached over time in Figure 1. Extremes are computed for both x and y values. At each step, we move in both directions (x and y) by a random quantity uniformly distributed over [-0.5, +0.5]. The random small moves (in both directions) are independent. The mechanics are easy to understand if you read the source code (see next section).

Figure 2: maximum x attained (blue) and minimum y attained (red) over 400,000 steps

Figure 3: maximum y attained (orange) and minimum x attained (green) over 400,000 steps
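The stepping rule just described (independent uniform moves in [-0.5, +0.5] in each direction, with running extremes recorded) can be sketched in Python as follows; the original simulation uses Perl and R, and the function name is mine:

```python
import random

def walk_2d_extremes(n_steps, seed=None):
    """Two-dimensional walk: at each step, x and y each move by an
    independent uniform amount in [-0.5, +0.5]. Returns the running
    extremes (max x, min x, max y, min y) reached over the walk."""
    rng = random.Random(seed)
    x = y = 0.0
    max_x = min_x = max_y = min_y = 0.0
    for _ in range(n_steps):
        x += rng.uniform(-0.5, 0.5)
        y += rng.uniform(-0.5, 0.5)
        max_x, min_x = max(max_x, x), min(min_x, x)
        max_y, min_y = max(max_y, y), min(min_y, y)
    return max_x, min_x, max_y, min_y

# Same number of steps as the video:
extremes = walk_2d_extremes(400_000, seed=1)
```

Plotting the four running extremes against the step count reproduces the general shape of Figures 2 and 3.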

Here are the challenging questions we recently asked, regarding this type of process:

• In a one-dimensional realization of a random walk, what is the probability of never crossing the X-axis?
• What is the average time between two successive crossings of the X-axis?
• Is the process (Y values) bounded or not? If not, it means that over time, you will reach any pre-specified, arbitrarily large integer value.
• What is the highest point M that you are expecting to reach, after n steps? What is the statistical distribution of M, as a function of n?
• In a 2-dimensional random walk, what is the probability of permanently getting stuck in one quadrant?
• Are you likely to spend most of your time on the same side (either above or below) of the X-axis, or do you expect to fluctuate evenly, in the long run spending the same amount of time above and below?
• For a random walk on a circle, are you going to fill the entire circle (visiting every point on the circle), or only a portion of it (what proportion), depending on how big your steps are relative to the radius? What proportion of visited points are going to be visited at least twice? Let's assume that the ratio of step size to circle radius is NOT a rational number.
• Same question if the circle is replaced by a sphere, or an n-dimensional hypersphere.
• In a 2-dimensional random walk, what is the chance of ever reaching your initial position again? And how many times would you expect to cross the origin?

More interesting facts

The figure below attempts to answer some of the questions in the one-dimensional, integer-valued case, where at each step, we move up by one unit (+1) with probability 0.5, or down by one unit (-1) with probability 0.5. We simulated 10,000 realizations of such random walks, each evolving over 5,000 steps (horizontal axis in Figure 4). We then computed the average, median, and 25th and 75th percentiles of the all-time maximum M(n) attained after exactly n steps (n = 1, ..., 5000; vertical axis in Figure 4), across the 10,000 random walk realizations. The expectation of the maximum is very well approximated by the function

E[M(n)] = SQRT(n/2).

Note that the average (green curve) is always above the median (red curve), just like for salary distributions. The computation of E[M(n)] would greatly benefit from using a distributed architecture such as Hadoop. With my 10,000 x 5,000 = 50 million data points, I was able to approximate E[M(n)], and in particular its factor SQRT(1/2), up to only 2 decimals. Using 1 million simulated random walks, a great random number generator, 50,000 steps for each simulated random walk, and Hadoop, one should obtain much more accurate results. But for practical purposes, my little experiment is good enough to conjecture that E[M(n)] = SQRT(n/2). Maybe this is indeed a known mathematical result.

Figure 4: expected maximum E[M(n)] (vertical axis) computed for 5,000 steps (1-D random walks). Horizontal axis represents n.

For those interested, the Perl script below performs the computations used to produce Figure 4. This is a direct application of our model-free confidence interval algorithm:

@ay=();       # current position of each random walk
@max_ay=();   # all-time maximum of each random walk
$nsteps=5000; # number of steps per walk
$nbins=10000; # number of simulated walks

open(OUT1,"> extremes1D.txt");
open(OUT2,"> rw1D.txt");
for ($step=0; $step<$nsteps; $step++) {
  for ($k=0; $k<$nbins; $k++) {
    $y=$ay[$k];
    $max_y=$max_ay[$k];
    if ($max_y eq "") { $max_y=0; }
    # move up or down by one unit, with probability 0.5 each
    $aux=rand();
    if ($aux<0.5) { $y++; } else { $y--; }
    $ay[$k]=$y;
    if ($y > $max_y) { $max_y=$y; }
    $max_ay[$k]=$max_y;
    print OUT1 "$max_ay[$k]\t";
    print OUT2 "$ay[$k]\t";
    if ($step % 1000 == 0) { print "$step ($k)> $ay[$k] | $max_ay[$k]\n"; }
  }
  print OUT1 "\n";
  print OUT2 "\n";
}
close(OUT2);
close(OUT1);

Figure 5: 10 of the 10,000 one-dimensional random walks used to produce Figure 4

3. A new fundamental, simple Wall Street theorem

The fundamental result from section 2, the fact that E[M(n)] = SQRT(n/2), has important consequences for Wall Street traders. Basically, it means that the highest value a stock, index or commodity will reach (on average), during a period of n time units, is

Max price = P * exp{c*SQRT(n)},

where c is a coefficient depending on the time unit and the commodity or stock in question (it's higher for volatile stocks), and P is the value of the commodity or stock in question at the beginning of the time period in question.
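Since log(price) is modeled as a 1-D random walk, its expected maximum over n time units exceeds log(P) by roughly c*SQRT(n); exponentiating gives the expected peak price. A tiny numeric illustration, with purely hypothetical values for P, c and n:

```python
import math

# All numbers below are hypothetical, for illustration only.
P = 100.0   # price at the beginning of the period
c = 0.02    # coefficient (higher for volatile stocks)
n = 252     # number of time units in the period

# log(price) behaves like a 1-D random walk, so its expected maximum
# exceeds log(P) by roughly c * SQRT(n); exponentiating gives the
# expected peak price over the period.
max_price = P * math.exp(c * math.sqrt(n))
# -> about 137.4 with these hypothetical numbers
```

The peak is always above the starting price, and it grows with the square root of the horizon, not linearly, which is the practical content of the result.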

In order for this result to be valid, the following must be satisfied:

• You must have a diversified portfolio with many stock clusters or indices that are (1) not too volatile and (2) un-correlated. Assumption (2) needs careful analysis. This is required, because the result essentially applies to portfolios, not individual stocks.
• The market is supposed to be neutral (not tanking or in bubble mode). However, if you look at Figure 5 (an example of a perfectly market-neutral simulated environment), we see many periods of steep, prolonged growth or decline, even collapse. I believe the real stock market is essentially neutral anyway (despite long bouts of bull or bear conditions), and behaves like my simulations: arbitrage, resulting from tons of traders with competing goals, tends to neutralize or equalize markets in the long term.
• Inflation is flat. The model still works for time periods of 1 or 2 years. But it can also easily be inflation-adjusted.
• The formula for Max price is derived from the fact that the logarithm of stock prices is very well modeled by 1-D random walks.

If we choose a small time unit (say 30 minutes), then daily increments in log(price) approximately follow a Gaussian distribution.
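This is just the central limit theorem at work: a daily increment is a sum of many small half-hour moves. A toy Python check, under my own illustrative assumption of 13 half-hour increments per trading day, each a unit up-or-down move:

```python
import random
import statistics

# Assumption (mine, for illustration): 13 half-hour increments per
# trading day, each a +/-1 move in units of half-hour volatility.
rng = random.Random(7)
STEPS_PER_DAY = 13
daily = [sum(rng.choice((-1, 1)) for _ in range(STEPS_PER_DAY))
         for _ in range(10_000)]

# By the central limit theorem, these sums are approximately Gaussian:
# mean near 0, standard deviation near SQRT(13), about 3.6.
m = statistics.mean(daily)
s = statistics.stdev(daily)
```

A histogram of `daily` would show the familiar bell shape, even though each half-hour move is as non-Gaussian as possible.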

Note that the maximum can occur, and typically does occur, early on. Once solidly in negative ROI territory, you might never recover during your lifetime, though if you wait a few thousand or million years, you will eventually recover. This is very well illustrated in Figure 1 and in the video.

How can you use this result?

For benchmarking purposes: compare your gain with what you could have potentially achieved with the Max price formula applied to your portfolio as a whole. And if you have significantly under-performed, find the root cause (rather than blaming the market), to do better next time.

4. The source code

All the source code (script to produce the data set used in the video, and R code to produce the video) can be found here.

For details about how the video was captured from the R GUI using an open-source screencast tool, click here.  At each iteration, the 200 points in the current frame are displayed in red (using col=rgb(1,0,0) in the R code), to help you visualize the progress on the screen.


Comment by Vincent Granville on October 21, 2014 at 9:54am

Also, E[M(n)] = SQRT(n/2) is not the law of the iterated logarithm: my result is about a max observed over n steps, not a lim sup. And it is not the expectation of the absolute value observed at step n either: that one uses a factor 2/pi, rather than 1/2.

Comment by Mirko Krivanek on October 13, 2014 at 7:33am

Regarding M(n), there is an interesting formula for P(M(n) = r), for any non-negative integer r. Denoting as S(n) the value observed at step n in a one-dimensional symmetrical random walk starting with S(0) = 0, moving by increments of +1 or -1 at each new step, we have

P(M(n) = r) = P(S(n) = r) + P(S(n) = r+1) = max{P(S(n) = r), P(S(n) = r+1)}.

(The sum equals the max because, by parity, at most one of the two probabilities is nonzero.)

Since S(n) is a sum of n binary random variables, its distribution is well known and approximated by a normal distribution with standard deviation SQRT(n).
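The identity above, combined with the binomial distribution of S(n), gives the exact distribution of M(n). A short Python sketch (function names mine):

```python
from math import comb

def p_S(n, r):
    """P(S(n) = r): binomial probability; zero unless n and r have
    the same parity (to end at r, you need (n + r)/2 up-moves)."""
    if (n + r) % 2 or abs(r) > n:
        return 0.0
    return comb(n, (n + r) // 2) / 2 ** n

def p_M(n, r):
    """P(M(n) = r), using the identity quoted in the comment above."""
    return p_S(n, r) + p_S(n, r + 1)

# Sanity check: M(10) takes values 0..10 with total probability 1.
total = sum(p_M(10, r) for r in range(11))
```

For example, p_M(3, 1) = P(S(3) = 1) + P(S(3) = 2) = 3/8 + 0 = 3/8.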

Details are found in this PDF document, page 19. Note that M(n) is equal to max{S(0), S(1), ... , S(n)}.