There is predictable data as far as the eye can see. Millions of variables quietly tracing the path we thought, and perhaps hoped, they would. Because there are so many, noticing when one of these variables does something unexpected is a task that is unsolvable by diligence alone. In order to spot these rare unexpected observations, we need an often-overlooked statistical analysis: anomaly detection.

~4 minute read.

Imagine for a moment that you invented a time machine. It is a shiny device that allows you to travel to the future and see what is about to happen. The first time you use it, you arrive in the near future with all the whirring, clanking, and smoke you would expect from time travel. As you squint through the haze, you become convinced the machine didn't actually work, because the near future looks the same as the time you left. You are shortly pulled back to your departure time and slowly realize that the future it showed you is exactly what is happening. Time travel wasn't as exciting as you had hoped because your immediate future was pretty much the same as your present. Turns out you could have predicted the near future without all that space-time continuum stuff. You could simply assume it will be like the present.

As it turns out, a lot of our world actually follows pretty regular patterns. Like your realization with the time machine, many are crestfallen when they realize that predictive analytics aren't often necessary to predict the future. One can usually produce a reasonably accurate prediction of the company's next quarterly income without the use of Poisson distributions or seasonal ARIMA models.

However, this predictable portion of the world still has at least one very serious problem that statistics can help with. This predictable landscape is incredibly vast. These fields of regularly repeating observations stretch as far as the eye can see. These predictable data streams are so numerous that it is impossible for people to pay attention to even a small fraction of them ... and sometimes, however rarely, things do go wrong.

The ability to notice unexpected observations in the middle of an ocean of expected ones is a huge benefit data science can provide that, somewhat ironically, is easy to overlook. The analysis that notices the unexpected is termed "anomaly detection". It allows one to find the observations that don't fit, at machine scale. Finding these unusual features in an enormous predictable landscape makes it easier for people to fix problems early and potentially discover unknown dynamics.

Anomalies are often defined in contrast to the normal pattern rather than by particular features. They are simply unexpected observations or patterns.

The analytical techniques used for prediction are actually very well suited for anomaly detection. They allow us to define what is expected so that we can see what is unexpected.

Here is one possible process for an anomaly detector:

- Develop Expectation: Use past data and patterns to predict a range that would almost always include the next observation or set of observations.
- Compare Observations and Expectations: Once you have measured the next observation, compare it with your expectations. If the observation matches your expectations, then it is a normal reading. However, if the observation diverges from the expectations, you have found an anomaly.
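The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used later in this post: it assumes the expected range is the mean of recent history plus or minus `k` standard deviations, and the function names, the `history` values, and the default `k=3.0` are all made up for the example.

```python
from statistics import mean, stdev

def expected_range(history, k=3.0):
    """Step 1: build a range that should almost always hold the next value.

    Assumption for this sketch: mean of past data +/- k standard deviations.
    """
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread

def is_anomaly(observation, history, k=3.0):
    """Step 2: compare the new observation with the expected range."""
    low, high = expected_range(history, k)
    return not (low <= observation <= high)

# Hypothetical past readings, invented for illustration.
history = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_anomaly(101, history))  # reading close to the past values
print(is_anomaly(130, history))  # reading far from the past values
```

A fixed mean-and-spread range like this ignores trend and seasonality, which is why a time series model is a better source of expectations for data that repeats on a yearly cycle.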

Here is a plot of the weekly search volume on Google for the term "puppy" over the last few years.

For the most part, this data follows a pretty regular repeating pattern. Each year looks similar to the year before. Let's see if we can build an anomaly detector of the type mentioned above for this series.

We can use a time series model to predict the next few steps in this series from past data. After a little analysis, a seasonal ARIMA(0, 1, 2)(1, 1, 0) model with a period of 52 weeks seems to fit pretty well. Let's use this to calculate prediction intervals around each data point given the data before that point.
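For readers who want the model spelled out, here is one common way to write it. The symbols are not from the original analysis: $B$ is the backshift operator ($B y_t = y_{t-1}$), $\Phi_1$ is the seasonal autoregressive coefficient, $\theta_1$ and $\theta_2$ are the moving-average coefficients, and $\varepsilon_t$ is the error term. The plus signs on the moving-average terms follow the convention used by many software packages; some textbooks use minus signs instead.

$$
(1 - \Phi_1 B^{52})\,(1 - B)\,(1 - B^{52})\, y_t = (1 + \theta_1 B + \theta_2 B^{2})\, \varepsilon_t
$$

Assuming roughly Gaussian errors with estimated standard deviation $\hat{\sigma}$, and assuming the interval is built one step ahead, a prediction interval for week $t$ given the data before it takes the form

$$
\hat{y}_{t \mid t-1} \pm z_{1-\alpha/2}\, \hat{\sigma}
$$

where $\hat{y}_{t \mid t-1}$ is the one-step-ahead forecast and $z_{1-\alpha/2}$ is the standard normal quantile for the chosen coverage (about 1.96 for 95%). The coverage level is a choice; this post does not state which one was used.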

Now that we have an idea what is normal for this series, let's see how well the data matches our expectations. We can do this by simply seeing which points fall outside our prediction interval.

This anomaly detector has marked several big spikes (and often the abrupt end of the spike) as anomalies. I believe this is the result of the Puppy Bowl that Animal Planet airs at the same time as the Super Bowl.

However, even among these spikes there is notable variation. For example, in 2014, a Super Bowl ad named "Puppy Love" was particularly popular and increased the Super Bowl bump above its already unusually high level.

As an example, I also included an artificial anomaly in this series. I made this one the type that is easy for humans to miss: the lack of a spike where there normally is one. For a week in January of 2014, I lowered the value where there would normally be a relatively high search volume. The anomaly detector spotted this one too.
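The missing-spike case can be reproduced on synthetic data. The sketch below is a simplified stand-in, not the ARIMA model from this post: it assumes each week is expected to look like the same week one year earlier plus the average year-over-year change, with a range of `k` standard deviations of those changes. The series, the spike in week 5, and the function name are invented for illustration.

```python
import random
from statistics import mean, stdev

SEASON = 52  # weeks per year, matching the seasonal period in the text

def walk_forward_anomalies(series, season=SEASON, k=3.0):
    """Flag points that fall outside a range built only from earlier data."""
    flagged = []
    # Start after two full years so a full year of changes is available.
    for t in range(2 * season, len(series)):
        # Year-over-year changes observed strictly before week t.
        diffs = [series[s] - series[s - season] for s in range(season, t)]
        expected = series[t - season] + mean(diffs)
        half_width = k * stdev(diffs)
        if abs(series[t] - expected) > half_width:
            flagged.append(t)
    return flagged

# Synthetic weekly series (made up): flat baseline, bounded noise,
# and a recurring spike in week 5 of every year.
rng = random.Random(0)
series = [50 + rng.uniform(-2, 2) + (40 if week % SEASON == 5 else 0)
          for week in range(4 * SEASON)]

# Artificial anomaly: remove the spike in the final year.
missing_week = 3 * SEASON + 5
series[missing_week] -= 40

flagged = walk_forward_anomalies(series)
print(missing_week in flagged)  # True: the missing spike is flagged
```

Because the expectation comes from the same week a year earlier, an ordinary-looking value is flagged when a spike was due. A simple global threshold would not catch this, since the lowered value sits well inside the overall range of the series.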

A large part of our world hums along in a pleasantly predictable way. My heartbeat is usually pretty predictable, as is my daily routine, and that of the companies and systems I often interact with. However, sometimes things do go wrong or change suddenly. Whether this is with my heartbeat, the quarterly sales report, or millions of other variables, I hope an anomaly detector is watching and can draw attention to these unusual observations.

This is a guest post by John Geer. If you like my work, consider connecting to me on LinkedIn.

© 2020 Data Science Central ®
