We have produced a 90-second video (click on this link to view the video) showing a 'random walk' (a particular case of a Markov process) evolving over 400,000 steps. Figure 1 below shows the last frame (out of 2,000 frames, each one with 200 new steps).
Figure 1: Last frame of the video: darker areas correspond to locations visited long ago
The video consists of 2000 frames, each showing 200 new steps (represented by red dots), as the random walk progresses over time. Sometimes we seem to get stuck locally, sometimes we are dramatically progressing at high speed. Older frames still stay on the video, but are shown in gray rather than red. The darkest areas are the oldest ones, visited at the beginning. This type of processes, as seen in this video, generates interesting patterns. A total of 400,000 points (in various colors) are generated during this 90-second video.
1. Definitions, Usage
A basic, two-state (going up or down), one-dimensional Markov process is defined as follows: You start at time t=0, walking along the X-axis (representing time). At each iteration (also called step), you move up with probability p, and down with probability q, along the Y-axis. The Y-axis could represent gain/losses in a gamble (throwing a dice), stock market gains etc.
When p = q (same chance of moving up or down), this is called a random walk. In two dimensions, you add two additional potential moves: going left or right.
Markov processes is a modeling tool extensively used in data science. In particular, it is used
The purpose of this article is three-fold:
2. Challenging questions and curious facts about these random walks
You will need a very high quality random number generator if you want to produce billions of steps, to learn the long-term properties of these processes. Such random generators are discussed here (also, this related article is worth reading).
In Figure 1, we are dealing with a realization of a random walk starting at x=0, y=0 (top left corner). Even after 400,000 steps, we never go above y=0 or below x=0, except at the very beginning. In short, we seem very quickly stuck in the South-East quadrant. This is clearly shown in figure 3. Figure 2 also shows that me make great progress in the South-East quadrant, reaching x values as high as 200, and y values as low as -150, throughout the 400,000 steps. Yet, after 200,000 steps, we seem stucked: no new extreme is reached. This has to do with the peculiar distribution of extreme values (or distribution of records).
In theory though - over trillions of trillions of trillions of trillions of steps, over an eternity indeed - you are going to visit all quadrants, each quadrant being visited infinitely many times, reaching any arbirary large or negative x and y values. How weird!
In figures 3 and 4 below, the horizontal axis represents the time (measured in steps, from 0 to 400,000). The vertical axis represents extremes (biggest maxima, minima) reached over time in figure1. Extremes are either for x or y values. At each step, we move in both directions (x and y) by a random quantity uniformly distributed over [-0.5, +0.5]. The random small moves (in both directions) are independent. The mechanics are easy to understand if you read the source code (see next section).
Figure 2: maximum x attained (blue) and minimum y attained (red) over 400,000 steps
Figure 3: maximum y attained (orange) and minimum x attained (green) over 400,000 steps
Here are the challenging questions we recently asked, regarding this type of processes:
More interesting facts
The figure below attempts at answering some of the questions in the one-dimensional, integer-value case, where at each step, we move up by one unit (+1) with probability 0.5,or down by one unit (-1) with probability 0.5.We simulated 10,000 realizations of such random walks, evolving over 5,000 steps (horizontal axis in figure 4). And we computed average, median, 25- and 75- percentiles for the all-time maximum M(n) for each of the 10,000 random walk realizations, attained after exactly n steps (n=1, ... , 5000; vertical axis in figure 4). The expectation for the maximum is very well approximated by the function
E[M(n)] = SQRT(n/2).
Note that the average (green curve) is always above the median (red curve), just like for salary distributions. The computation of E[M(n)] would greatly benefit from using a distributed architecture such as Hadoop. With my 10,000 x 5,000 = 50 million data points, I was able to approximate E[M(n)], and in particular its factor SQRT(1/2), up to only 2 decimals. Using 1 million simulated random wlaks, a great random generator, 50,000 steps for each simulated random walk, and Hadoop, one should obtain much more accurate results. But for practical purposes, my little experiment is good enough to conjecture that E[M(n)] = SQRT(n/2). Maybe this is indeed a known mathematical result.
Figure 4: expected maximum E[M(n)] (vertical axis) computed for 5,000 steps (1-D random walks). Horizontal axis represents n.
For those interested, see below the computations (script) performed to produce figure 4. This is a direct application of our model-free confidence interval algorithm:
@ay=();
@max_ay=();
$nsteps=5000;
$nbins=10000;
open(OUT1,"> extremes1D.txt");
open(OUT2,"> rw1D.txt");
for ($step=0; $step<$nsteps; $step++) {
for ($k=0; $k<$nbins; $k++) {
$y=$ay[$k];
$max_y=$max_ay[$k];
if ($max_y eq "") { $max_y=0; }
$aux=rand();
if ($aux<0.5) { $y++; } else { $y--; }
$ay[$k]=$y;
if ($y> $max_y) { $max_y=$y; }
$max_ay[$k]=$max_y;
print OUT1 "$max_ay[$k]\t";
print OUT2 "$ay[$k]\t";
if ($step % 1000 ==0) { print "$step ($k)> $ay[$k] | $max_ay[$k]\n"; }
}
print OUT1 "\n";
print OUT2 "\n";
}
close(OUT2);
close(OUT1);
Figure 5: 10 of the 10,000 one-dimensional random walks used to produce Figure 4
3. Application: Wall Street Trading
The fundamental result from section 2, the fact that E[M(n)] = SQRT(n/2), has important consequences for Wall Street traders. Basically, it means that the highest value a stock, index or commodity will reach (on average), during a period of n time units is
Max price = exp{c*SQRT(n)} / P,
where c is a coefficient depending on the time unit and the commodity or stock in question (it's higher for volatile stocks), and P is the value of the commodity or stock in question at the beginning of the time period in question.
In order for this result to be valid, the following must be satisfied:
If we choose a small time unit (say 30 minutes), then daily increments in log(price) approximately follows a Gaussian distribution.
Note that the maximum can and typically occur early, at the beginning. Once solidly in negative ROI territory, you might never recover during your lifetime, though if you wait a few thousand or million years, you will eventually recover. This is very well illustrated in figure 1 and in the video.
How can you use this result?
For benchmarking purposes, comparing your gain with what you could have potentially achieved with the Max price formula applied to your portfolio as a whole. And find the root cause if you have sigificantly under-performed (rather than blaming the market), to do better next time.
4. The source code
All the source code (script to produce the data set used in the video, and R code to produce the video) can be found here.
For details about how the video was captured from the R GUI using an open-source screencast tool, click here. At each iteration, the 200 points in the current frame are displayed in red (using col=rgb(1,0,0) in the R code), to help you visualize the progress on the screen.
For more information, check:
Other links
Comment
Also, E[M(n)] = SQRT(n/2) is not the law of the iterative algorithm: my result is about a max observed over n steps, not a lim sup. And it is not the expectation of the absolute value observed at step n either: that one uses a factor 2/pi, rather than 1/2.
Regarding M(n), there is an interesting formula for P(M(n) = r), for any non-negative integer r. Denoting as S(n) the value observed at step n in a one-dimensional symmetrical random walk starting starting with S(0) = 0, moving by increments or +1 or -1 at each new step, we have
P(M(n) = r) = P(S(n) = r) + P(S(n) = r+1) = max{P(S(n) = r), P(S(n) = r+1)}.
Since S(n) is a sum of n binary random variables, its distribution is well known and approximated by a normal distribution with standard deviation SQRT(n).
Details are found in this PDF document, page 19. Note that M(n) is equal to max{S(0), S(1), ... , S(n)}.
© 2019 Data Science Central ® Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
Most Popular Content on DSC
To not miss this type of content in the future, subscribe to our newsletter.
Other popular resources
Archives: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More
Most popular articles
You need to be a member of Data Science Central to add comments!
Join Data Science Central