Outliers are patterns in data that do not confirm to the expected behavior. While detecting such patterns are of prime importance in Credit Card Fraud, Stock Trading etc. Detecting anomaly or outlier observations are also of importance when training any of the supervised machine learning models. This brings us to two very important questions: concept of a local outlier, and why a local outlier?

In a multivariate dataset where the rows are generated independently from a probability distribution, only using centroid of the data might not alone be sufficient to tag all the outliers. Measures like Mahalanobis distance might be able to identify extreme observations but won’t be able to label all possible outlier observations. The local outlier factor is a density-based outlier detection method derived from DBSCAN; the intuition behind the approach is that the density around an outlier object will be significantly different from the density around its neighbors.

In the figure above, the red dot (C) indicates the center of the data; a distance-based method will only be able to identify o3 and o4 as the outliers, while in reality o3 is an outlier while o4 is not. Also, o1 and o2 are two outlier observations, considering the local neighborhood while based on global data distribution; they are not classified as outliers.

Local outlier factor is a density-based method that relies on k-nearest neighbors. The LOF method scores each data point by computing the ratio of the average of densities of the neighbors to the density of point itself. The steps involved in computing the LOF are as below:

- Calculate distance between the pair of observations.
- Find the kth nearest neighbor observation; calculate the distance between the observation and k-Nearest neighbor.
- Calculate Local Reachability Density (LRD): LRD is the most optimal distance in any direction from the neighbor to the individual point. Mathematically it’s the count of items in k’s nearest neighbor set divided by the max of the kth nearest neighbor of the point and the distance between the point and its adjoining point. The distance is referred as reach distance.
- Local Outlier Factor(LOF) as defined above will be the ratio of local reachability of o and o’s k-nearest neighbors
- The lower is the local reachability of o and the higher the reachability of k nearest neighbors of o higher the LOF.

The LOF value of less than 1 is an indicator that the observation is not an outlier, while a value greater than 1 indicates an outlier. The cut-off can be higher if the data is sparse.

R and python both have implementation of LOF. In R the technique can be implemented using the packages DmWR and Rlof (parallel implementation). Scikit learn in python has the implementation of the algorithm. Below is an illustrative example of LOF technique on a simulated dataset.

Disadvantage of LOF:

- Scalability is a big issue as the computations are memory-intensive and hence, a good computing infrastructure is required for bigger datasets
- The user needs to provide the input “k”, this input might not be optimal giving incorrect results

Advancements to overcome the disadvantages:

The drawbacks of LOF can be overcome using the some of the latest advancements in algorithms and spark- and Hadoop-based infrastructure. Below are some algorithms that can be looked into from a scalability perspective:

- DILOF: Effective and Memory Efficient Local Outlier Detection in Data streams
- Scalable Top-n local outlier detection: Uses pruning to eliminate most points from the set of potential outlier candidates without computing LOF or even calculating k nearest neighbors.

FICO Xpress Insight offers an excellent platform to develop applications, involving both predictive and prescriptive analytics. To learn more, visit the website https://community.fico.com/s/optimization

References:

- Breunig, M. M., Kriegel, H., Ng, R. T., & Sander, J. (2000).LOF: Identifying Density Based Local Outliers
- Yizhou Yan (Worcester Polytechnic Institute);Lei Cao (Massachusetts Institute of Technology);Elke Rundensteiner (Worcester Polytechnic Institute) Scalable Top-n Local Outlier Detection
- Gyoung S. Na, Donghyun Kim, Hwanjo Yu POSTECH (Pohang University of Science and Technology), South Korea {ngs00, kdh5377, hwanjoyu}@postech.ac.kr. DILOF: Effective and Memory Efficient Local Outlier Detection in Da...

© 2020 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Upcoming DSC Webinar**

- DataOps: How Bell Canada Powers their Business with Data - July 15

Demand for data outstrips the capacity of IT organizations and data engineering teams to deliver. The enabling technologies exist today and data management practices are moving quickly toward a future of DataOps. DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. Register today.

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Upcoming DSC Webinar**

- DataOps: How Bell Canada Powers their Business with Data - July 15

Demand for data outstrips the capacity of IT organizations and data engineering teams to deliver. The enabling technologies exist today and data management practices are moving quickly toward a future of DataOps. DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. Register today.

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central