Most of the articles on extreme events are focusing on the extreme values. Very little has been written about the arrival times of these events. This article fills the gap.
We are interested here in the distribution of arrival times of successive records in a time series, with potential applications to global warming assessment, sport analytics, or high frequency trading. The purpose here is to discover what the distribution of these arrival times is, in absence of any trends or auto-correlations, for instance to check if the global warming hypothesis is compatible with temperature data obtained over the last 200 years. In particular it can be used to detect subtle changes that are barely perceptible yet have a strong statistical significance. Examples of questions of interest are:
The answer to the first question is that it is very unlikely to happen just by chance..
Despite the relative simplicity of the concepts discussed here, and their great usefulness in practice, none of the material below is found in any statistics textbook, as far as I know. It would be good material to add to any statistics curriculum.
1. Simulations
I run a number of simulations, generating 100 time series each made up of millions of random, independent Gaussian deviates, without adding any trend up or down. The first few hundred points of one of these time series is pictured in Figure 1.
Figure 1: Example of time series with no periodicity and no trend (up or down)
I computed the median, 25- and 75-percentiles for the first few records, see Figure 2. For instance, the median time of occurrence of the first record (after the first measurement) is after 2 years, if your time unit is a year. The next bigger record is expected 8 years after the first measurement, and the next bigger one 21 years after the first measurement (see Figure 2.) Even if you look at the 25-percentile, it really takes a lot of years to beat the previous 4 or 5 records in a row. In short, it is nearly impossible to observe increasing records four years in a row, unless there is a trend that forces the observed values to become larger over time.
Figure 2: Time of arrivals of successive records (in years if you time unit is a year)
This study of arrival times for these records should allow you to detect even very tiny trends, either up or down, better than traditional models of change point detection hopefully. However it does not say anything about whether the increase is barely perceptible or rather large.
Note that the values of these records is a subject of much interest in statistics, known as extreme value theory. This theory has been criticized for failure to predict the amount of damage in modern cataclysms, resulting in big losses for insurance companies. Part of the problem is that these models are based on hundreds of years worth of data (for instance to predict the biggest flood that can occur in 500 years) but over such long periods of time, the dynamics of the processes at play have shifted. Note that here, I focus on the arrival times or occurrences of these records, not on their intensity or value, contrarily to traditional extreme value theory.
Finally, arrival times for these records do not depend on the mean or variance of the underlying distribution. Figure 2 provides some good approximations, but more tests and simulations are needed to confirm my findings. Are these median arrival times the same regardless of the underlying distribution (temperature, stock market prices, and so on) just like the central limit theorem provides a same limiting distribution regardless of the original, underlying distribution? The theoretical statistician should be able to answer this question. I didn't find many articles on the subject in the literature, though this one is interesting. In the next section, I try to answer this question. The answer is positive.
2. Theoretical Distribution of Records over Time
This is an interesting combinatorial problem, and it bears some resemblance to the Analyticbridge Theorem. Let R(n) be the value of the n-th record (n = 1, 2,...) and T(n) its arrival time.
For instance, if the data points (observed values) are X(0) = 1.35, X(1) = 1.49, X(2) = 1.43, X(3) = 1.78, X(4) = 1.63, X(5) = 1.71, X(6) = 1.45, X(7) = 1.93, X(8) = 1.84, then the records (highlighted in bold) are R(1) = 1.49, R(2) = 1.78, R(3) = 1.93, and the arrival times for these records are T(1) = 1, T(2) = 3, and T(3) = 7.
To compute the probability P(T(n) = k) for n > 0 and k = n, n+1, n+2, etc., let's define T(n, m) as the arrival time of the n-th record if we only observe the first m+1 observations X(0), X(1) ... X(m). Then P(Tn) = k) is the limit of P(T(n,m) = k) as m tends to infinity, assuming the limit exists. If the underlying distribution of the values X(0), X(1), etc. is continuous, then, due to the symmetry of the problem, computing P(T(n,m) = k) can be done as follows:
As a result, the distribution of arrival times, for the records, is universal: it does not depend on the underlying distribution of the identically and independently distributed observations X(0), X(1), X(2) etc.
It is easy (with or without using my above combinatorial framework) to find that the probability to observe a record (any record) at position k is 1 / (k+1) assuming again that the first position is position 0 (not 1). Also, it is easy to prove that P(T(n) = n) = 1 / (n+1)!. Now, T(1) = k if and only if X(k) is a record among X(0), ..., X(k) and X(0) is the largest value among X(0), ..., X(k-1). Thus:
P(T(1) = k) = 1 / { (k+1) * k }
This result is confirmed by my simulations. For the general case, recurrence formulas can be derived.
3. Useful Results
None of the arrival times T(n) for the records has a finite expectation. Figure 3 displays the first few values for the probability that the n-th record occurs at position T(n) = k, the first element in the data set being assigned to position 0. The distribution of of these arrival times does not depend on the underlying distribution of the observations.
Figure 3: P(T(n) = k) at the bottom, (k+1)! P(T(n) = k) at the top
These probabilities were computed using a small script that generates all (k+1)! permutations of (0, 1, ..., k) and checks, among these permutations, those having a record at position k: for each of these permutations, we computed the total number of,records. If N(n, k) denotes the number of such permutations having n records, then P(T(n) = k) = N(n,k) / (k+1)!.
Despite the fact that the above table is tiny, it required hundreds of millions of computations for its production.
Top DSC Resources
Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge
Comment
I did some unpublished work on this back in 2015 you might be interested in: https://htmlpreview.github.io/?https://github.com/dmarx/statistical...
I ran simulations for numerical approximations as well, but also developed some analytic theory. To summarize: you can model the arrival time of a new maximum with a geometric distribution, whose expectation is 1/p. For a continuous distribution with CDF F(x), the probability of observing an iid draw greater than a value m is 1-F(m). Therefore, if we want to know how many new draws j it will take to get a new maximum, we have E[j|m] = 1/(1-F(m)).
I think it's more useful if we instead parameterize this by the number of draws 'n' that we have made. I started getting into working out E[j|n] and was able to reason the heuristic that E[j|n] ~= n (i.e. 'm' is a roughly "one in n" outlier). This works for small n, but underestimates j as n grows. To get a better estimator, we need E[m|n], which allows us to calculate this as E[j|n] = 1/F(E[m|n]) . In the absence of an analytic estimator, simulation is slow for large n: instead, we can apply numerical integration to quickly achieve a high precision estimator for E[m|n].
Going a bit deeper down the rabbit hole, if we have n samples, the distribution of the max is given by F(x)^n. E[F(x)^n] can be approximated numerically by monte carlo simulation (above). Alternatively, we can approximate this expectation with a median which we can calculate almost exactly from the CDF and n. Call the median for the max after n draws k_n, and we have E[j|n]=1/F(k_n).
Aaron Brown posted the following comment:
There are two problems with applying this to temperature records. First is that global mean temperature is closer to a random walk than independent observations. It takes a long time for the earth to heat up or cool down (notice that local temperatures peak about six to eight weeks after the time of maximum sunlight, and that’s just local), moreover many of the drivers of temperature have multiyear periods (El Niño, for example, as well as orbital and solar cycles).
With a random walk, the chance that the last 16 years have all been hotter than any previous year since 1880 is one in 50. If the temperatures were independent draws, the odds would be 1 in 3*10^20. So it makes a big difference. Of course, global temperatures are not quite a random walk, so the probability without human influence is less than 0.02, but on the other hand, it’s not true that the last 16 years have all been hotter than any previous year. If you look at long term global temperature estimates, you’ll see that its reasonably common to have runs of 14 out of 16 last years to be among the 16 hottest in a 137 year run, before human activities rose to a scale likely to influence global climate.
The other problem is we do know something about the distribution of global mean temperatures. That means you are not restricted to reasoning from records, you can look at how much hotter the recent temperatures were than older temperatures.
To see what I mean, consider the baseball record for triples hit in a season. The modern record was set at 36 in 1912 by Chief Wilson. The second best any player has ever done is 26. The record for strikeouts in a season was set at 223 in 2009 by Mark Reynolds, and the runner up is Adam Dunn who struck out 222 times in 2012. Now which record do you think is likely to stand longer?
If you want to make an estimate of this probability for global mean temperatures both with and without anthropogenic climate change, you either need a model for evolution of global temperatures, or use a longer dataset for control.
© 2018 Data Science Central™ Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central