Predicting records (highs or lows) - how to do it right (and without statistics)

While everyone talks about unusual extreme weather events (heat waves, cold spells, floods, droughts), very few people, scientists included, have been able to make sound predictions for extreme events, whether in weather, in the stock market, or in any bubble. Here you will learn how to produce simple model-free confidence intervals for extreme events in Excel, how to simulate (correlated) stock market data and (uncorrelated) natural data such as an air pollution index, why extreme events are predictable in one case but not in the other, and how to simulate ad-hoc data. Extreme events are a big issue for insurance companies and anti-terrorist agencies: predicting expected loss is critical to setting premiums correctly.

We propose a solution to predict extreme events. It is based on simulations, and very different from extreme value theory (a statistical technique), which so far has proved useless. This article was first proposed as a DSC challenge of the week, but we believe it is important enough to make it a featured article and tutorial of its own.

Here, in our Excel spreadsheet (click on the link to download), we perform simulations to predict extreme values, with model-free confidence intervals, using a very simple methodology that does not involve any statistical knowledge or statistical models. We generated two types of data:

  1. Stock market simulations, where the value at time k depends on the value at time k-1 (random walk simulations)
  2. Air index simulations, where the value at time k is independent of the value at time k-1, and unbounded

In case #1, predicting extreme values is impossible (there is a lack of convergence), while in case #2, it is easy, as illustrated in the charts below. The simulations in case #2 try to reproduce a phenomenon where each new iteration can generate a value larger than all previously observed values. The record computed over the past k observations is denoted as R(k). For each of the two examples, we produced 10 simulations (time series), each with 10,240 data points, then used these 10 time series to compute a confidence interval for R(k), the highest value observed over the last k iterations, for various values of k.
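The record R(k) and an interval for it can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet's formulas: the function names are made up, the record is taken over the first k observations of each series, and the interval is assumed to be the lowest and highest R(k) across the simulated series. Three tiny hand-made series stand in for the 10 simulations of 10,240 points.

```python
from itertools import accumulate

def running_record(series):
    """Return R(1), ..., R(n): the highest value among the first k observations."""
    return list(accumulate(series, max))

def record_intervals(simulations, ks):
    """For each k, return the lowest and highest R(k) seen across the simulations.

    Assumption: the min/max across series is used as the model-free interval.
    """
    records = [running_record(series) for series in simulations]
    return {k: (min(r[k - 1] for r in records), max(r[k - 1] for r in records))
            for k in ks}

# Toy data (hypothetical), standing in for the simulated time series.
toy_simulations = [
    [1, 3, 2, 5],
    [2, 2, 4, 3],
    [0, 6, 1, 1],
]
intervals = record_intervals(toy_simulations, ks=[2, 4])
print(intervals)  # {2: (2, 6), 4: (4, 6)}
```

With more simulated series, empirical percentiles across the series could replace the min and max to get a narrower interval.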

Data Generation

To generate the stock market data, we used basic random walks (see the Excel spreadsheet). They produce values that are NOT (statistically speaking) independent. For the air pollution index A(k) at time k, we first generate a random deviate U on [0, 1], then set A(k) = -ln(U). The A(k) values are independent. The function used here, in this case -ln(U), must be chosen to provide a good fit with actual, observed data.
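For readers who prefer code to a spreadsheet, the two generators might look as follows in Python. This is a minimal sketch under stated assumptions: the random walk moves up or down by one unit at each step (the spreadsheet may use a different step distribution), and the function names and the seed are arbitrary.

```python
import math
import random

def random_walk(n, rng, step=1.0):
    """Simulated stock data: each value is the previous one plus or minus `step`.

    Assumption: symmetric +/- step moves starting from 0.
    """
    values = []
    level = 0.0
    for _ in range(n):
        level += step if rng.random() < 0.5 else -step
        values.append(level)
    return values

def air_index(n, rng):
    """Simulated air pollution index: independent values A(k) = -ln(U)."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1]
    # and the logarithm is always defined.
    return [-math.log(1.0 - rng.random()) for _ in range(n)]

rng = random.Random(42)           # fixed seed so the run is reproducible
walk = random_walk(10_240, rng)   # dependent values
air = air_index(10_240, rng)      # independent, unbounded values
```

Swapping -ln(U) for another transformation of U is the place to adapt the generator to real observed data.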

Even though our simulations clearly show that stock market highs are unpredictable, it is still possible to find patterns to make money in the stock market (insider trading is your best bet, more on this later, especially on how to use big data to beat insider traders). 


In the stock market example, we are unable to make predictions. We could start with a positive gain and then stay in negative territory forever, stay in positive territory forever, or oscillate forever.
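A quick way to see this lack of convergence is to simulate a few random walks and look at how far apart their records are. The sketch below assumes +/-1 steps and uses a hypothetical helper, `walk_record`; the values of k are arbitrary checkpoints. In runs like this, the gap between the lowest and highest record across series tends to keep widening as k grows.

```python
import random
from itertools import accumulate

def walk_record(n, rng):
    """Running maximum of a +/-1 random walk of length n (assumed step model)."""
    steps = (1 if rng.random() < 0.5 else -1 for _ in range(n))
    walk = accumulate(steps)             # cumulative sum = the walk itself
    return list(accumulate(walk, max))   # running maximum = the record R(k)

rng = random.Random(0)
records = [walk_record(10_240, rng) for _ in range(10)]

for k in (10, 160, 2_560, 10_240):
    values = [r[k - 1] for r in records]
    # lowest and highest record across the 10 simulated walks
    print(k, min(values), max(values))
```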

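In the case of predicting air pollution, extreme values are rare: with A(k) = -ln(U), the values follow an exponential distribution, whose tail decays quickly (though more slowly than a Gaussian tail). New records therefore exceed previous ones by small amounts, and this makes predicting extreme events easy.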
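Although the approach in this article is deliberately model-free, the choice A(k) = -ln(U) happens to have a simple closed form that shows why the record is so well behaved. The derivation below is a side note and applies only to this particular function, assuming the A(k) are independent and U is uniform on (0, 1):

```latex
\begin{align*}
P\bigl(A(k) > x\bigr) &= P\bigl(U < e^{-x}\bigr) = e^{-x}, \qquad x \ge 0, \\
P\bigl(R(k) \le x\bigr) &= \bigl(1 - e^{-x}\bigr)^{k}, \\
E\bigl[R(k)\bigr] &= \sum_{i=1}^{k} \frac{1}{i} \approx \ln k + 0.5772, \\
\operatorname{Var}\bigl[R(k)\bigr] &= \sum_{i=1}^{k} \frac{1}{i^{2}} < \frac{\pi^{2}}{6} \approx 1.645 .
\end{align*}
```

The expected record grows only like ln k (about 9.81 for k = 10,240), while its variance stays bounded no matter how large k is, so an interval for R(k) keeps a roughly constant width.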
