While everyone talks about unusual extreme weather events (heat waves, cold spells, floods, droughts), very few people, scientists included, have been able to make sound predictions for extreme events, whether in the weather, in stock market behavior, or in any bubble. Here you will learn how to produce simple model-free confidence intervals for extreme events in Excel, how to simulate (correlated) stock market data and (uncorrelated) natural data such as an air pollution index, and why extreme events are predictable in one case but not in the other. Extreme events are a big issue for insurance companies and anti-terrorist agencies: predicting expected loss is critical to setting premiums correctly.

We propose a solution to predict extreme events. It is based on simulations, and very different from extreme value theory (a statistical technique), which so far has proved useless. This article was first proposed as a DSC challenge of the week, but we believe it is important enough to make it a featured article and tutorial of its own.

Here, in our Excel spreadsheet (click on the link to download), we perform simulations to predict extreme values, with model-free confidence intervals, using a very simple methodology that does not involve any statistical knowledge or statistical models. We generated two types of data:

- Stock market simulations, where the value at time k depends on the value at time k-1 (random walk simulations)
- Air index simulations, where the value at time k is independent of the value at time k-1, and unbounded

In case #1, predicting extreme values is impossible (there is a lack of convergence), while in case #2 it is easy, as illustrated in the charts below. The simulations in case #2 try to reproduce a phenomenon where each new iteration can generate a value larger than all previously observed values. The record computed over the past k observations is denoted **R(k)**. For each of the two examples, we produced 10 simulations (time series), each with 10,240 data points. Based on these 10 time series, we then computed the confidence interval for R(k), the highest value observed over the last k iterations, for various values of k.
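The record R(k) and its spread across simulations can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet's method: the function names `record` and `record_interval` are hypothetical, and the interval is assumed to be the observed range (minimum to maximum) of R(k) across the simulated series, which may differ from the rule used in the spreadsheet. The demo data are placeholder independent values.

```python
import math
import random


def record(series, k):
    """R(k): highest value among the k most recent observations."""
    if not 1 <= k <= len(series):
        raise ValueError("k must be between 1 and len(series)")
    return max(series[-k:])


def record_interval(all_series, k):
    """Empirical (min, median, max) of R(k) across independent simulations.

    Assumption: the interval is the observed range across the simulated
    series; the spreadsheet may use a different rule.
    """
    records = sorted(record(s, k) for s in all_series)
    n = len(records)
    mid = n // 2
    median = records[mid] if n % 2 else 0.5 * (records[mid - 1] + records[mid])
    return records[0], median, records[-1]


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the run is repeatable
    n_series, n_points = 10, 10_240
    # Placeholder data: independent values -ln(U), with U uniform on (0, 1].
    sims = [[-math.log(1.0 - rng.random()) for _ in range(n_points)]
            for _ in range(n_series)]
    for k in (10, 40, 160, 640, 2_560, 10_240):
        low, med, high = record_interval(sims, k)
        print(f"k={k:>6}  R(k) in [{low:.2f}, {high:.2f}]  median={med:.2f}")
```

With only 10 series, the range is a crude interval; more simulations would allow percentile-based bounds instead.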

**Data Generation**

To generate the stock market data, we used basic random walks (see the Excel spreadsheet). They produce values that are NOT (statistically speaking) independent. For the air pollution index A(k) at time k, we first generate a random deviate U on [0, 1], then set A(k) = -ln(U). The A(k) values are independent. The transformation applied to U, here -ln(U), must be chosen to provide a good fit with actual, observed data.
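For readers who prefer code to a spreadsheet, the two generators might look as follows in Python. This is a minimal sketch under stated assumptions: the random walk uses standard Gaussian steps (the spreadsheet's step distribution may differ), and the names `simulate_stock`, `simulate_air_index` and `lag1_correlation` are hypothetical. The lag-1 correlation is included only to show the dependence in one series and its absence in the other.

```python
import math
import random


def simulate_stock(n_points, rng, start=0.0):
    """Random walk: each value is the previous value plus a random step.

    Assumption: steps are standard Gaussian; the spreadsheet may use
    a different step distribution.
    """
    values, level = [], start
    for _ in range(n_points):
        level += rng.gauss(0.0, 1.0)
        values.append(level)
    return values


def simulate_air_index(n_points, rng):
    """Independent values A(k) = -ln(U), with U uniform on (0, 1]."""
    # 1 - rng.random() lies in (0, 1], which avoids log(0).
    return [-math.log(1.0 - rng.random()) for _ in range(n_points)]


def lag1_correlation(values):
    """Sample correlation between consecutive values."""
    x, y = values[:-1], values[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so the run is repeatable
    stock = simulate_stock(10_240, rng)
    air = simulate_air_index(10_240, rng)
    print("lag-1 correlation, stock:", round(lag1_correlation(stock), 3))
    print("lag-1 correlation, air:  ", round(lag1_correlation(air), 3))
```

The stock series should show a lag-1 correlation close to 1, and the air index series a value close to 0.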

Even though our simulations clearly show that stock market highs are unpredictable, it is still possible to find patterns to make money in the stock market (insider trading is your best bet, more on this later, especially on how to use big data to beat insider traders).

**Results**

In the stock market example, we are unable to make predictions. We could start with a positive gain, then stay in negative territory forever, or in positive territory forever, or oscillate forever.

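In the air pollution case, the A(k) values are independent and, with A(k) = -ln(U), exponentially distributed. Large values are more frequent than under a Gaussian distribution, but the record R(k) grows slowly with k (roughly like ln k) and varies little from one simulation to the next, and this makes predicting extreme events easy.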
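A short calculation suggests why the independent case behaves so well. Assume the A(k) are exactly independent with A(k) = -ln(U), that is, exponential with mean 1 (an idealization of the simulation, not a claim about real pollution data). Then the record over k observations satisfies:

$$
P\big(R(k) \le x\big) = \big(1 - e^{-x}\big)^{k}, \qquad
E[R(k)] = \sum_{j=1}^{k} \frac{1}{j} \approx \ln k + 0.5772, \qquad
\mathrm{Var}[R(k)] = \sum_{j=1}^{k} \frac{1}{j^{2}} < \frac{\pi^{2}}{6}.
$$

For k = 10,240 this gives an expected record of about 9.81 with a standard deviation below 1.29, so the interval stays narrow no matter how large k gets. By contrast, for a driftless random walk with unit-variance steps (again an assumption about the step distribution), the Brownian approximation gives a maximum over k steps, measured from the start of the window, with mean

$$
E[R(k)] \approx \sqrt{\frac{2k}{\pi}}, \qquad
\mathrm{sd}[R(k)] \approx \sqrt{\Big(1 - \frac{2}{\pi}\Big)k},
$$

that is, roughly 81 and 61 for k = 10,240. The spread grows at the same rate as the record itself, which is consistent with the lack of convergence seen in the stock market simulations.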
