Outliers are patterns in data that do not confirm to the expected behavior. While detecting such patterns are of prime importance in Credit Card Fraud, Stock Trading etc. Detecting anomaly or outlier observations are also of importance when training any of the supervised machine learning models. This brings us to two very important questions: concept of a local outlier, and why a local outlier?
In a multivariate dataset where the rows are generated independently from a probability distribution, only using centroid of the data might not alone be sufficient to tag all the outliers. Measures like Mahalanobis distance might be able to identify extreme observations but won’t be able to label all possible outlier observations. The local outlier factor is a density-based outlier detection method derived from DBSCAN; the intuition behind the approach is that the density around an outlier object will be significantly different from the density around its neighbors.
In the figure above, the red dot (C) indicates the center of the data; a distance-based method will only be able to identify o3 and o4 as the outliers, while in reality o3 is an outlier while o4 is not. Also, o1 and o2 are two outlier observations, considering the local neighborhood while based on global data distribution; they are not classified as outliers.
Local outlier factor is a density-based method that relies on k-nearest neighbors. The LOF method scores each data point by computing the ratio of the average of densities of the neighbors to the density of point itself. The steps involved in computing the LOF are as below:
The LOF value of less than 1 is an indicator that the observation is not an outlier, while a value greater than 1 indicates an outlier. The cut-off can be higher if the data is sparse.
R and python both have implementation of LOF. In R the technique can be implemented using the packages DmWR and Rlof (parallel implementation). Scikit learn in python has the implementation of the algorithm. Below is an illustrative example of LOF technique on a simulated dataset.
Disadvantage of LOF:
Advancements to overcome the disadvantages:
The drawbacks of LOF can be overcome using the some of the latest advancements in algorithms and spark- and Hadoop-based infrastructure. Below are some algorithms that can be looked into from a scalability perspective:
FICO Xpress Insight offers an excellent platform to develop applications, involving both predictive and prescriptive analytics. To learn more, visit the website https://community.fico.com/s/optimization