<p><em>Shahram Abyari's Posts &ndash; Data Science Central</em></p>
<p><strong>Introduction to Outlier Detection Methods</strong><br/>Shahram Abyari &middot; 2016-01-19</p>
<p>This post is a summary of three posts about outlier detection methods. You can find the original posts, with detailed implementations, at the links below:</p>
<ul>
<li><a href="http://shahramabyari.com/2016/01/19/detecting-outliers-in-high-dimensional-data-sets/" target="_blank">Detecting Outliers In High Dimensional Data Sets</a></li>
<li><a href="http://shahramabyari.com/2015/12/30/my-first-attempt-with-local-outlier-factorlof-identifying-density-based-local-outliers/" target="_blank">Local Outlier Factor(LOF): Identifying Density Based Local Outliers</a></li>
<li><a href="http://shahramabyari.com/2015/12/25/data-preparation-for-predictive-modeling-resolving-outliers/" target="_blank">Outlier Detection Using Principal Component Analysis</a></li>
</ul>
<p></p>
<p>One of the challenges in data analysis in general, and predictive modeling in particular, is dealing with outliers. Many modeling techniques are resistant to outliers or reduce their impact, but detecting and understanding outliers can still lead to interesting findings. We generally define outliers as samples that are exceptionally far from the mainstream of the data. There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.</p>
<p style="text-align: center;"><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808309604?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808309604?profile=RESIZE_1024x1024" width="750" class="align-full"/></a></p>
<p>There are several approaches to detecting outliers. Charu Aggarwal, in his book <a href="http://www.amazon.com/gp/product/1461463955/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1461463955&linkCode=as2&tag=shahramabyari-20&linkId=HOUNEUUTGGMXYRG3" target="_blank">Outlier Analysis</a>, classifies outlier detection models into the following groups:</p>
<ul>
<li><strong>Extreme Value Analysis</strong>: This is the most basic form of outlier detection and is only suitable for one-dimensional data. These analyses assume that values which are too large or too small are outliers. The <a href="https://en.wikipedia.org/wiki/Z-test" target="_blank">Z-test</a> and <a href="https://en.wikipedia.org/wiki/Student%27s_t-test" target="_blank">Student’s t-test</a> are examples of such statistical methods. They are good heuristics for an initial analysis of the data, but they have little value in multivariate settings. They can, however, be used as a final step for interpreting the output of other outlier detection methods.</li>
<li><strong>Probabilistic and Statistical Models:</strong> These models assume a specific distribution for the data and estimate its parameters, typically with the expectation-maximization (EM) algorithm. They then calculate the probability of membership of each data point in the fitted distribution; points with a low probability of membership are marked as outliers.</li>
<li><strong>Linear Models</strong>: These methods project the data onto lower-dimensional subspaces using linear correlations. The distance of each data point from the plane that fits the subspace is then calculated and used to find outliers. PCA (Principal Component Analysis) is an example of a linear model for anomaly detection.</li>
<li><strong>Proximity-based Models</strong>: The idea behind these methods is to model outliers as points which are isolated from the rest of the observations. Cluster analysis, density-based analysis, and nearest-neighbor analysis are the main approaches of this kind.</li>
<li><strong>Information Theoretic Models</strong>: These methods exploit the fact that outliers increase the minimum code length required to describe a data set.</li>
<li><strong>High-Dimensional Outlier Detection</strong>: Specific methods designed to handle high-dimensional, sparse data.</li>
</ul>
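<p>As a concrete illustration of the first category, here is a minimal z-score filter in Python. This is a toy sketch, not a formal Z-test; the threshold of 2 is an arbitrary choice for this example:</p>

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Extreme value analysis: flag values whose z-score exceeds a threshold."""
    mu = mean(values)
    sigma = stdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]

# One obviously extreme value. Note that the outlier itself inflates the
# standard deviation, which is one reason these heuristics break down quickly.
print(zscore_outliers([10, 12, 11, 13, 12, 11, 10, 95]))  # -> [95]
```

<p>Even in this tiny example the extreme value pulls the mean and standard deviation toward itself, which is why such methods serve better as an initial heuristic than as a final detector.</p>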
<p></p>
<p>In this post we briefly discuss proximity-based methods and high-dimensional outlier detection methods.</p>
<p></p>
<p><strong>Proximity Based Methods</strong></p>
<p>Proximity-based methods can be classified into three categories: 1) cluster-based methods, 2) distance-based methods, and 3) density-based methods.</p>
<p>Cluster-based methods assign data points to clusters and count points which are not members of any known cluster as outliers. Distance-based methods, on the other hand, are more granular and use the distances between individual points to find outliers.</p>
<p>The Local Outlier Factor method discussed in this post is a density-based method. Consider the figure below:</p>
<p><a href="http://i2.wp.com/shahramabyari.com/wp-content/uploads/2015/12/local-outlier-factor.png?resize=768%2C598" target="_blank"><img src="http://i2.wp.com/shahramabyari.com/wp-content/uploads/2015/12/local-outlier-factor.png?resize=768%2C598&width=450" width="450" class="align-center"/></a></p>
<p style="text-align: center;"><a href="http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf" target="_blank">Reference</a></p>
<p>Distance-based approaches will have trouble finding an outlier like point O2, because the points in cluster C1 are less dense than those in cluster C2. If we choose a distance threshold large enough to capture an outlier like O2, many of the points in C1 will also be counted as outliers.</p>
<p>Cluster-based approaches have a similar problem, because they only consider the distance between a point and the centroid of its cluster when calculating the outlier score. Density-based approaches, and especially the LOF approach discussed here, are sensitive to local densities and are therefore more appropriate for finding local outliers.</p>
<p>Below are the main steps for calculating the outlier score using LOF:</p>
<ol>
<li>First we find the K nearest neighbors of each point in the data set. Selecting the right K is discussed in the paper.</li>
<li>We call the maximum of the distances to the K nearest points found in the previous step the K-distance. For example, if for the first point we used K=3 and found that the 3 nearest neighbors have distances of 1.2, 2.5 and 6.4, the K-distance for this point is 6.4.</li>
<li>Next, for a certain number of points (MinPts) we calculate the reachability distance:</li>
</ol>
<p></p>
<p><a href="http://s0.wp.com/latex.php?zoom=2&latex=reach-dist_%7Bk%7D%28p%2C+o%29+%3D+max+%5C%7B+k-distance%28o%29%2C+d%28p%2C+o%29+%5C%7D+&bg=ffffff&fg=000&s=0" target="_blank"><img src="http://s0.wp.com/latex.php?zoom=2&latex=reach-dist_%7Bk%7D%28p%2C+o%29+%3D+max+%5C%7B+k-distance%28o%29%2C+d%28p%2C+o%29+%5C%7D+&bg=ffffff&fg=000&s=0&width=450" width="450" class="align-center"/></a></p>
<p></p>
<p>4. <span>Then we calculate the local reachability density of each point using the formula below:</span></p>
<p></p>
<p><a href="http://s0.wp.com/latex.php?zoom=2&latex=lrd_%7BMinPts%7D%28p%29+%3D+1%2F%28%5Cfrac%7B%5Csum_%7Bo+%5Cin+N_%7BMinPts%7D%28p%29%7Dreach-dist_%7BMinPts%7D%28p%2Co%29%7D%7BN_%7BMinPts%7D%28p%29%7D%29+&bg=ffffff&fg=000&s=0" target="_blank"><img src="http://s0.wp.com/latex.php?zoom=2&latex=lrd_%7BMinPts%7D%28p%29+%3D+1%2F%28%5Cfrac%7B%5Csum_%7Bo+%5Cin+N_%7BMinPts%7D%28p%29%7Dreach-dist_%7BMinPts%7D%28p%2Co%29%7D%7BN_%7BMinPts%7D%28p%29%7D%29+&bg=ffffff&fg=000&s=0&width=450" width="450" class="align-center"/></a></p>
<p></p>
<p>5. <span>Finally, we calculate the LOF scores using the formula below:</span></p>
<p></p>
<p><a href="http://s0.wp.com/latex.php?zoom=2&latex=LOF_%7BMinPts%7D%28p%29%3D%5Cfrac%7B%5Csum_%7Bo+%5Cin+N_%7BMinPts%7D%28p%29%7D%5Cfrac%7Blrd_%7BMinPts%7D%28p%29%7D%7Blrd_%7BMinPts%7D%28o%29%7D%7D%7BN_%7BMinPts%7D%28p%29%7D+&bg=ffffff&fg=000&s=0" target="_blank"><img src="http://s0.wp.com/latex.php?zoom=2&latex=LOF_%7BMinPts%7D%28p%29%3D%5Cfrac%7B%5Csum_%7Bo+%5Cin+N_%7BMinPts%7D%28p%29%7D%5Cfrac%7Blrd_%7BMinPts%7D%28p%29%7D%7Blrd_%7BMinPts%7D%28o%29%7D%7D%7BN_%7BMinPts%7D%28p%29%7D+&bg=ffffff&fg=000&s=0&width=450" width="450" class="align-center"/></a></p>
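<p>The five steps above can be sketched in Python with NumPy. This is a brute-force toy implementation for small data sets, following the definitions in the Breunig et al. paper, with a single <code>k</code> playing the role of both K and MinPts; it is an illustration, not the optimized algorithm:</p>

```python
import numpy as np

def lof_scores(X, k=3):
    """Brute-force LOF: k-distance, reachability distance, lrd, then LOF score."""
    n = len(X)
    # Pairwise Euclidean distances between all points.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Steps 1-2: the k nearest neighbors of each point and its k-distance.
    knn = np.argsort(D, axis=1)[:, 1:k + 1]      # column 0 is the point itself
    k_dist = D[np.arange(n), knn[:, -1]]         # distance to the k-th neighbor
    # Step 3: reach-dist_k(p, o) = max{k-distance(o), d(p, o)}.
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    # Step 4: local reachability density = 1 / (mean reach-dist to neighbors).
    lrd = 1.0 / reach.mean(axis=1)
    # Step 5: LOF(p) = average of lrd(o) / lrd(p) over p's neighbors.
    return lrd[knn].mean(axis=1) / lrd

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), [[5.0, 5.0]]])  # one isolated point
scores = lof_scores(X, k=5)
print(scores[-1])   # the isolated point scores well above 1
```

<p>As expected, the cluster members come out with scores near 1 while the isolated point's score is far from 1.</p>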
<p></p>
<p>You can find the complete implementation of the <a href="http://shahramabyari.com/2015/12/30/my-first-attempt-with-local-outlier-factorlof-identifying-density-based-local-outliers/" target="_blank">LOF method in this post.</a> The LOF score for regular points will be close to 1, while the score for outliers will be far from 1. The histogram below shows the result of applying this approach to the well-known diamonds data set:</p>
<p><a href="http://i1.wp.com/shahramabyari.com/wp-content/uploads/2015/12/lof_outlier_score_histogram.png?w=800" target="_blank"><img src="http://i1.wp.com/shahramabyari.com/wp-content/uploads/2015/12/lof_outlier_score_histogram.png?w=800&width=750" width="750" class="align-center"/></a></p>
<p></p>
<p><strong>High Dimensional Outlier Detection</strong></p>
<p><span>Many real-world data sets are very high dimensional; in many applications, data sets contain hundreds or thousands of features. In those scenarios, because of the well-known curse of dimensionality, traditional outlier detection approaches such as </span><a href="http://shahramabyari.com/2015/12/25/data-preparation-for-predictive-modeling-resolving-outliers/" target="_blank">PCA</a><span> and </span><a href="http://shahramabyari.com/2015/12/30/my-first-attempt-with-local-outlier-factorlof-identifying-density-based-local-outliers/" target="_blank">LOF</a><span> are not effective. High Contrast Subspaces for Density-Based Outlier Ranking (HiCS), explained in </span><a href="http://www.ipd.kit.edu/~muellere/publications/ICDE2012.pdf" target="_blank">this paper</a><span>, is an effective method for finding outliers in high-dimensional data sets.</span></p>
<p>The LOF method discussed in the previous section uses all the features available in the data set to calculate the nearest neighborhood of each data point, the local density, and finally the outlier score of each point.</p>
<p>There is a detailed proof in <a href="http://www.loria.fr/~berger/Enseignement/Master2/Exposes/beyer.pdf" target="_blank">this paper</a> showing that as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor. In other words, the contrast between distances to different data points vanishes. This means that applying nearest-neighbor-based methods such as LOF to high-dimensional data sets will produce outlier scores that are all close to each other.</p>
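<p>This concentration effect is easy to reproduce empirically: draw uniform random points and compare the distances from a query point at the origin to its nearest and farthest neighbors as the dimension grows. A quick sanity check, not the formal proof in the paper:</p>

```python
import numpy as np

def distance_contrast(dim, n=500, seed=0):
    """Ratio of the farthest to the nearest neighbor distance from the origin.
    Values near 1 mean the distances have concentrated."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 1, (n, dim))
    d = np.linalg.norm(X, axis=1)
    return d.max() / d.min()

# The ratio shrinks toward 1 as the dimension grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 2))
```

<p>In low dimensions the nearest and farthest neighbors differ by an order of magnitude; by a thousand dimensions they are nearly indistinguishable, which is exactly why nearest-neighbor-based outlier scores lose their meaning there.</p>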
<p>The HiCS method uses the following steps to deal with the curse of dimensionality in outlier detection:</p>
<ol>
<li>First it finds high-contrast subspaces by comparing the marginal pdf and the conditional pdf of each candidate subspace.</li>
<li>Next it calculates an outlier score for each point in each of the high-contrast subspaces.</li>
<li>Finally it averages the scores generated in the previous step.</li>
</ol>
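<p>Step 1 is the heart of HiCS. As a much-simplified sketch of the idea (the paper uses an adaptive Monte Carlo scheme with proper statistical tests), one can repeatedly slice the data on all but one attribute of a candidate subspace and measure how far the conditional sample of the remaining attribute deviates from its marginal distribution, here with a hand-rolled two-sample Kolmogorov&ndash;Smirnov statistic:</p>

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs."""
    both = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), both, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), both, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def subspace_contrast(X, dims, n_iter=50, slice_frac=0.4, seed=0):
    """Average deviation between marginal and conditional distributions over
    random slices -- high values suggest a high-contrast subspace."""
    rng = np.random.default_rng(seed)
    dims = list(dims)
    n = len(X)
    total = 0.0
    for _ in range(n_iter):
        target = rng.choice(dims)              # attribute whose pdf we compare
        mask = np.ones(n, dtype=bool)
        for d in dims:
            if d == target:
                continue
            # Keep a random quantile window on each conditioning attribute.
            lo = rng.uniform(0, 1 - slice_frac)
            q_lo, q_hi = np.quantile(X[:, d], [lo, lo + slice_frac])
            mask &= (X[:, d] >= q_lo) & (X[:, d] <= q_hi)
        if mask.sum() > 5:
            total += ks_statistic(X[mask, target], X[:, target])
    return total / n_iter

rng = np.random.default_rng(1)
x = rng.uniform(size=1000)
correlated = np.column_stack([x, x + rng.normal(0, 0.05, 1000)])  # dependent pair
independent = rng.uniform(size=(1000, 2))                         # independent pair
print(subspace_contrast(correlated, [0, 1]), subspace_contrast(independent, [0, 1]))
```

<p>On the correlated pair the contrast estimate comes out much higher than on the independent pair, which is exactly the signal HiCS uses to rank subspaces before scoring outliers within them.</p>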
<p><a href="http://i0.wp.com/shahramabyari.com/wp-content/uploads/2016/01/HighContrast.png?w=1585" target="_blank"><img src="http://i0.wp.com/shahramabyari.com/wp-content/uploads/2016/01/HighContrast.png?w=1585" class="align-full"/></a></p>
<p></p>
<p>The complete implementation of the <a href="http://shahramabyari.com/2016/01/19/detecting-outliers-in-high-dimensional-data-sets/" target="_blank">HiCS method is available in this post.</a></p>
<p><strong>Resolving Skewness</strong><br/>Shahram Abyari &middot; 2015-12-25</p>
<p>A fundamental assumption in many predictive models is that the predictors are normally distributed. The normal distribution is unskewed, and an unskewed distribution is one which is roughly symmetric: the probability of a value falling to the right of the mean is equal to the probability of it falling to the left.</p>
<p><a href="http://shahramabyari.com/2015/12/21/data-preparation-for-predictive-modeling-resolving-skewness/" target="_blank">This article</a> outlines the steps to detect <a href="http://shahramabyari.com/2015/12/21/data-preparation-for-predictive-modeling-resolving-skewness/" target="_blank">skewness and resolve it</a> in order to build better predictive models. The article specifically discusses the following:</p>
<ul>
<li>Statistics for calculating the skewness of data</li>
<li>The Box-Cox transformation for resolving skewness</li>
<li>Sample Python and R code for the Box-Cox transformation and for calculating skewness</li>
</ul>
<p><span>Finding the right transformation to resolve skewness can be tedious. Box and Cox, in their 1964 paper, proposed a statistical method for finding it. They suggested using the family of transformations below and estimating λ:</span></p>
<p><span><a href="http://s0.wp.com/latex.php?zoom=2&latex=x%5E%7B%2A%7D+%3D+%5Cbegin%7Bcases%7D%5Cfrac%7Bx%5E%7B%5Clambda%7D-1%7D%7B%5Clambda%7D+%26+%5Clambda+%5Cneq+0%5C%5Clog%28x%29+%26+%5Clambda+%3D+0%5Cend%7Bcases%7D&bg=ffffff&fg=000&s=3" target="_blank"><img src="http://s0.wp.com/latex.php?zoom=2&latex=x%5E%7B%2A%7D+%3D+%5Cbegin%7Bcases%7D%5Cfrac%7Bx%5E%7B%5Clambda%7D-1%7D%7B%5Clambda%7D+%26+%5Clambda+%5Cneq+0%5C%5Clog%28x%29+%26+%5Clambda+%3D+0%5Cend%7Bcases%7D&bg=ffffff&fg=000&s=3&width=200" width="200" class="align-center"/></a></span></p>
<p><span>Notice that because of the log term, this transformation requires the x values to be positive. If there are zero or negative values, all values need to be shifted before applying this method.</span></p>
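<p>The family above can be implemented directly. As an illustration, a small Python sketch that grid-searches for the λ that minimizes the sample skewness of the transformed data; note that standard Box-Cox implementations (e.g. scipy's <code>boxcox</code>) estimate λ by maximum likelihood instead:</p>

```python
import numpy as np

def boxcox(x, lam):
    """Box-Cox family: (x^lam - 1)/lam, or log(x) when lam == 0. Requires x > 0."""
    x = np.asarray(x, dtype=float)
    if abs(lam) < 1e-8:                 # the lambda = 0 branch of the family
        return np.log(x)
    return (x ** lam - 1) / lam

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

def best_lambda(x, grid=np.linspace(-2, 2, 81)):
    """Grid-search the lambda that leaves the transformed data least skewed."""
    return min(grid, key=lambda lam: abs(skewness(boxcox(x, lam))))

rng = np.random.default_rng(0)
x = rng.lognormal(0, 1, 2000)           # strongly right-skewed, strictly positive
lam = best_lambda(x)
print(skewness(x), lam, skewness(boxcox(x, lam)))
```

<p>For lognormal data like this, the search should land near λ = 0, i.e. the log transform, and the transformed data ends up close to symmetric.</p>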
<p><span>You can find a sample <a href="http://shahramabyari.com/2015/12/21/data-preparation-for-predictive-modeling-resolving-skewness/" target="_blank">R and Python implementation of the Box-Cox transformation for resolving skewness in this post.</a></span></p>
<p><span><a href="http://i2.wp.com/shahramabyari.com/wp-content/uploads/2015/12/Skewness-BoxCox-Trans-Airline-Flight-Time-R.png?w=1260" target="_blank"><img src="http://i2.wp.com/shahramabyari.com/wp-content/uploads/2015/12/Skewness-BoxCox-Trans-Airline-Flight-Time-R.png?w=1260" class="align-center"/></a></span></p>