
The law of series: why 4 planes crashing in 6 months is a coincidence

Very short time periods (say 6 months) with several crashes, as well as long time periods (say 3 years) with no crashes, are expected. An even distribution of plane crashes is indeed NOT expected - it would look very suspicious, and definitely not random. Here we assume that all events (plane crashes) are independent. We also assume an average of two major plane crashes per year - which is realistic if you include all passenger airlines flying anywhere in the world, sometimes in dangerous weather or above dangerous locales.

We did our own simulation in this Excel spreadsheet. The password for the spreadsheet will be published in our Monday digest. If you don't receive the digest in your mailbox this Monday, check your promotions, social, or spam folders in your email client (look for an email with subject line Weekly digest - August 4). We simulated 10 time series with, on average, 2 crashes per year; each time series represents 10 years of simulated observations. That is, 20 data points (10 years x 2 crashes per year on average) for each of the 10 time series, with each data point representing a crash event, its time stamp simulated using the RAND function in Excel for random number generation.

Our conclusions are as follows

  • In three of the simulated time series (out of 10), we found 4 crashes occurring within a 4-month time period; one of the ten time series had 4 crashes within a month
  • In three of the ten time series, there was a 2.5 to 3 year time period with NO crash

Here is how we did our simulations

  1. Simulate 10 time series of random numbers in Excel using the RAND Excel function. Each time series is stored in a column. For each time series, generate 20 random numbers between 0 and 10 (each one being a time stamp representing a crash, with the time unit being a year, thus each time series having 2 crashes per year on average).
  2. Copy and paste the values ONLY (not the formulas) into another tab of the Excel spreadsheet.
  3. Delete the initial tab containing the RAND function (each time the sheet recalculates, new random numbers are generated, which would scramble the sorted numbers; see next step).
  4. In the final (remaining) tab, sort each column separately: you must perform 10 sorts. Keep in mind that each value (a number between 0 and 10 years) represents a time stamp when a crash occurred.
  5. Compute z = x(k+3) - x(k) for each column, where x(k) is the occurrence (time) of crash number k. Then compute r = min(z) over all rows in the column, for each of the 10 time series. The number r represents the shortest time period (within each 10-year time series) containing 4 crashes. Since we simulated 10 time series, we can easily compute confidence intervals for r.
  6. Compute y = x(k+1) - x(k) for each column, where x(k) is the occurrence (time) of crash number k. Then compute s = max(y) over all rows in the column, for each of the 10 time series. The number s represents the longest time period (within each 10-year time series) with no crash. Since we simulated 10 time series, we can easily compute confidence intervals for s.
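The six Excel steps above can be sketched in a few lines of Python - our own sketch, with random.uniform playing the role of Excel's RAND and the seed chosen arbitrarily for reproducibility:

```python
import random

random.seed(42)  # any seed works; fixed here for reproducibility

YEARS = 10      # length of each simulated time series
RATE = 2        # average crashes per year
N_SERIES = 10   # number of simulated time series

r_values = []   # shortest window containing 4 crashes, per series (step 5)
s_values = []   # longest crash-free gap, per series (step 6)

for _ in range(N_SERIES):
    # Steps 1-4: 20 uniform time stamps in [0, 10), sorted (RAND + sort)
    crashes = sorted(random.uniform(0, YEARS) for _ in range(YEARS * RATE))
    # Step 5: r = min over k of x(k+3) - x(k)
    r = min(crashes[k + 3] - crashes[k] for k in range(len(crashes) - 3))
    # Step 6: s = max over k of x(k+1) - x(k)
    s = max(crashes[k + 1] - crashes[k] for k in range(len(crashes) - 1))
    r_values.append(r)
    s_values.append(s)

print("shortest 4-crash windows (years):", [round(x, 2) for x in r_values])
print("longest crash-free gaps (years): ", [round(x, 2) for x in s_values])
```

Run it with a few different seeds and you should see the same qualitative pattern as in the spreadsheet: several series with 4 crashes packed into a few months, and several with multi-year crash-free stretches.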

Test our results with a mathematical model

We performed the Monte-Carlo simulations for you, but now we invite you to solve the problem using mathematical models. There is indeed an exact solution, easy to compute, for this problem. Let us know if your theoretical solution yields similar results. Here's how to proceed:

  1. The number N of events (crashes) follows a Poisson process with intensity v = 2 (2 crashes per year on average; one year = time unit). So the probability that exactly N = n events occur in a time period of length T = t is exp(-vt) * (vt)^n / n!
  2. The probability p that a given 4-month time period (one third of a year) has exactly n = 4 accidents is p = exp(-v/3) * (v/3)^4 / 4!
  3. The probability that at least one 4-month time period within a 10-year span has 4 crashes follows from a binomial distribution with parameters (30, p), where 30 = 10 years multiplied by three 4-month periods per year: it equals 1 - (1-p)^30. (This treats the thirty 4-month windows as disjoint; counting windows starting at any time would give a somewhat larger probability.)
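As a quick sanity check (not part of the original spreadsheet), the three steps above can be evaluated directly in Python:

```python
import math

v = 2.0        # crashes per year (Poisson intensity)
t = 1.0 / 3.0  # a 4-month window, in years
n = 4          # number of crashes in the window

# Step 2: Poisson probability of exactly n crashes in one 4-month window
p = math.exp(-v * t) * (v * t) ** n / math.factorial(n)

# Step 3: probability that at least one of the 30 disjoint 4-month windows
# in 10 years contains exactly 4 crashes (binomial complement)
p_at_least_one = 1 - (1 - p) ** 30

print(f"p per 4-month window: {p:.4f}")
print(f"P(at least one 4-crash window in 10 years): {p_at_least_one:.4f}")
```

With v = 2 this gives p roughly 0.0042 per window, and roughly a 12% chance of seeing at least one such 4-crash cluster over a decade - broadly consistent with the simulation finding clusters in a few of the ten series.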

Note

Even if there are many crashes in a particular small time period, it does not mean that the likelihood of a new crash in the next month is reduced, or increased. We are dealing here with memoryless processes.
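This memoryless property can be checked numerically. In a Poisson process, inter-arrival times are exponential, and for an exponential T we have P(T > s+t | T > s) = P(T > t) = exp(-vt). A minimal sketch of our own (the values of s and t are arbitrary):

```python
import math
import random

random.seed(1)
v = 2.0  # crashes per year; inter-arrival times are Exponential(v)

# Draw many inter-arrival times and compare the conditional survival
# probability with the unconditional one.
samples = [random.expovariate(v) for _ in range(200_000)]
s, t = 0.5, 0.25

survived_s = [x for x in samples if x > s]
cond = sum(x > s + t for x in survived_s) / len(survived_s)
uncond = math.exp(-v * t)

print(f"P(T > s+t | T > s) (empirical): {cond:.3f}")
print(f"P(T > t) (theoretical):         {uncond:.3f}")
```

The two numbers agree up to sampling noise: having already waited half a year without a crash tells you nothing about the next three months.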



Comment by Alexander Kashko on August 12, 2014 at 12:31am

Hmm.... I am a bit out of practice with full mathematical rigor (mortis?) but here are my thoughts

Assume the probability of a binary event happening per unit time is p. The probability of it not happening is q = 1-p.

Assume p is small (say 0.01); then q is high. This means for any value of N the probability of an empty sequence (i.e. the event not happening N times in a row) is q**N, and much higher than the probability of a full sequence of length N.

As a result, if a realisation of this process is drawn out as a linear graph, it would be dominated by large empty spaces and smaller spaces in which something would be happening. It would look as if events were happening in clusters, even though they are random, and a naive analysis would assume there was an underlying cause.

The interesting case is where p = q = 0.5, in which case I think it would look like an even distribution and be statistically symmetric between did and did not happen.

As to air crashes, I heard that 40% of all air journeys involve near misses. Clearly this means a crash now and then is inevitable. It is amazing that mid-air collisions do not happen very often.


I understand this is known as the inspection paradox. I read that it applies to the fact that buses may depart from a stop with an average ten-minute gap, but you always seem to have a long wait. This is because the arrival times are dominated by long waits. If you arrive randomly, you are likely to be in the middle of a long gap.

It would seem that the way to detect this, if the probabilities are unknown, would be to look at the distribution of intervals between the events (e.g. bus arrivals).
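The bus example in this comment can be checked numerically. Here is a minimal sketch of our own, assuming Poisson bus departures with a 10-minute mean gap: a rider arriving at a uniformly random time lands in a gap with probability proportional to that gap's length, so the average experienced wait is the full 10 minutes, not the naive 5.

```python
import bisect
import random

random.seed(0)
MEAN_GAP = 10.0      # minutes between buses, on average
HORIZON = 1_000_000  # total simulated minutes

# Poisson departures: exponential gaps with mean 10 minutes
times, t = [], 0.0
while t < HORIZON:
    t += random.expovariate(1.0 / MEAN_GAP)
    times.append(t)

# Riders arrive at uniformly random moments and wait for the next bus.
waits = []
for _ in range(100_000):
    arrival = random.uniform(0, HORIZON)
    nxt = times[bisect.bisect_right(times, arrival)]
    waits.append(nxt - arrival)

avg_wait = sum(waits) / len(waits)
print(f"average gap between buses: {MEAN_GAP:.1f} min")
print(f"average wait of a random rider: {avg_wait:.1f} min")
```

The simulated average wait comes out near 10 minutes - twice the naive guess of half a gap - which is exactly the inspection paradox the comment describes (and, for a Poisson process, also a direct consequence of memorylessness).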

Now to get back to work while thinking about the full analysis, which is probably on Wikipedia somewhere if I get the right search term.


© 2014   Data Science Central
