This post is a summary of 3 different posts about outlier detection methods. You can find the original posts with detailed implementation in below links:
One of the challenges in data analysis in general and predictive modeling in particular is dealing with outliers. There are many modeling techniques which are resistant to outliers or reduce the impact of them, but still detecting outliers and understanding them can lead to interesting findings. We generally define outliers as samples that are exceptionally far from the mainstream of data.There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
There are several approaches for detecting Outliers. Charu Aggarwal in his book Outlier Analysis classifies Outlier detection models in following groups:
In this post we briefly discuss proximity based methods and High-Dimensional Outlier detection methods.
Proximity Based Methods
Proximity based methods can be classified in 3 categories: 1) Cluster based methods 2)Distance based methods 3) Density based methods
Cluster based methods classify data to different clusters and count points which are not members of any of known clusters as outliers. Distance based methods in the other hand are more granular and use the distance between individual points to find outliers.
Local Outlier Factor method discussed in this post is one of density based methods. Consider below figure:
Distance based approaches will have problem finding an outlier like point O2. Because the points in cluster C1 are less dense compare to cluster C2. If we chose a large threshold to capture an outlier like O2, many of the points in C1 will be counted as outliers.
Cluster based approaches have similar problems. Because they only consider the distance between point and centroid of cluster to calculate outlier score. The density based approaches and specially LOF approach discussed here are sensitive to densities and those approaches are more appropriate for calculating local outliers.
Below are main steps for calculating outlier score using LOF:
4.Then we calculate the local reachability density of each point using below formula:
5. Finally, we calculate LOF Scores using below formula:
You can find the complete implementation of LOF method in this post. The LOF score generated for regular points will be close to 1. The score for outliers will be far from 1. Below histogram shows results of application of this approach to famous diamonds data set:
High Dimensional Outlier Detection
Many real world data sets are very high dimensional. In many applications, data sets may contain hundreds or thousands of features. In those scenarios because of well known curse of dimensionality the traditional outlier detection approaches such as PCA and LOF will not be effective. High Contrast Subspaces for Density-Based Outlier Ranking (HiCS) method explained in this paper as an effective method to find outliers in high dimensional data sets.
LOF method discussed in previous section uses all features available in data set to calculate the nearest neighborhood of each data point, the density of each cluster and finally outlier score for each data point.
There is a detailed proof available in this paper that shows that as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor.In other word, contrast in distances to different data points becomes nonexistent. This basically means using methods such as LOF, which are based on nearest neighborhood, for high dimensional data sets will lead to outlier scores which are close to each other.
The HiCS method basically uses the following steps to deal with curse of dimensionality in outlier detection problem:
The complete implementation of the HiCS method is available in this post.