
In this article, Data Scientist Pramit Choudhary provides an introduction to both statistical and machine learning-based approaches to anomaly detection in Python.

Introduction: Anomaly Detection

This overview is intended for beginners in the fields of data science and machine learning. Almost no formal professional experience is needed to follow along, but the reader should have some basic knowledge of calculus (specifically integrals), the programming language Python, functional programming, and machine learning. 

Anomaly detection is a technique used to identify unusual patterns, called outliers, that do not conform to expected behavior. It has many applications in business, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.

This overview will cover several methods of detecting anomalies, as well as how to build a detector in Python using a simple moving average (SMA) or low-pass filter.

Before getting started, it is important to establish some boundaries on the definition of an anomaly. Anomalies can be broadly categorized as:

  1. Point anomalies: A single instance of data is anomalous if it's too far off from the rest. Business use case: Detecting credit card fraud based on "amount spent."
  2. Contextual anomalies: The abnormality is context specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
  3. Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is trying to copy data from a remote machine to a local host unexpectedly, an anomaly that would be flagged as a potential cyber attack.
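The difference between a point anomaly and a contextual anomaly can be sketched with the food-spending example above. This is a minimal illustration, not the article's own method: the function names `point_anomalies` and `contextual_anomalies`, the z-score rule, and the threshold of 3 are all assumptions chosen for the example.

```python
import numpy as np

def point_anomalies(values, threshold=3.0):
    """Return indices whose z-score against the whole series exceeds threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values identical: nothing can stand out.
        return np.array([], dtype=int)
    z_scores = np.abs(values - values.mean()) / std
    return np.flatnonzero(z_scores > threshold)

def contextual_anomalies(values, contexts, threshold=3.0):
    """Score each value only against the values sharing its context label."""
    values = np.asarray(values, dtype=float)
    contexts = np.asarray(contexts)
    flagged = []
    for label in np.unique(contexts):
        idx = np.flatnonzero(contexts == label)
        # Map positions inside the group back to positions in the full series.
        flagged.extend(idx[point_anomalies(values[idx], threshold)])
    return np.array(sorted(flagged), dtype=int)

# Hypothetical daily food spend: $20 on regular days, $100 during the holidays,
# plus a single $100 day (index 29) outside the holiday season.
spend = [20] * 29 + [100] + [100] * 10
season = ["regular"] * 30 + ["holiday"] * 10

print(point_anomalies(spend))
print(contextual_anomalies(spend, season))
```

Scored against the whole series, the $100 day on a regular date does not stand out, because holiday spending makes $100 a common value. Scored only against other regular days, it does.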

Anomaly detection is similar to, but not entirely the same as, noise removal and novelty detection. Novelty detection is concerned with identifying an unobserved pattern in new observations not included in training data, like a sudden interest in a new channel on YouTube during Christmas, for instance. Noise removal (NR) is the process of immunizing analysis from the occurrence of unwanted observations; in other words, removing noise from an otherwise meaningful signal.
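A simple moving average is one way to separate a meaningful signal from noise, and the same smoothing can drive a basic detector. The following is a rough sketch under stated assumptions rather than the article's implementation: the names `moving_average` and `detect_anomalies`, the edge padding, the odd-window restriction, and the three-sigma rule are all choices made for illustration.

```python
import numpy as np

def moving_average(series, window):
    """Simple moving average computed as a discrete linear convolution."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    series = np.asarray(series, dtype=float)
    half = window // 2
    # Repeat the edge values so the average is not pulled toward zero
    # at the start and end of the series.
    padded = np.pad(series, half, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

def detect_anomalies(series, window=5, n_sigma=3.0):
    """Flag indices whose distance from the moving average exceeds n_sigma
    standard deviations of the residual."""
    series = np.asarray(series, dtype=float)
    residual = series - moving_average(series, window)
    sigma = residual.std()
    if sigma == 0:
        return np.array([], dtype=int)
    return np.flatnonzero(np.abs(residual) > n_sigma * sigma)

# Hypothetical signal: a smooth sine wave with one injected spike at index 60.
t = np.arange(100)
signal = np.sin(2 * np.pi * t / 50)
signal[60] += 5.0

print(detect_anomalies(signal, window=5, n_sigma=3.0))
```

The window size trades responsiveness for smoothness: a wider window removes more noise but also blurs short-lived spikes into their neighbors, so both `window` and `n_sigma` would need tuning on real data.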

What you will find in the article:

  • Anomaly Detection Techniques
      - Simple Statistical Methods
      - Challenges
  • Machine Learning-Based Approaches
      - Density-Based Anomaly Detection
      - Clustering-Based Anomaly Detection
      - Support Vector Machine-Based Anomaly Detection
      - Building a Simple Detection Solution Using a Low-Pass Filter
      - Moving Average Using Discrete Linear Convolution
  • Conclusion

The full article covers all of this information, including source code and charts with explanations.
