Dealing with Outliers is like searching for a needle in a haystack
This is a guest repost by Jacob Joseph.
An outlier is an observation or point that is distant from other observations or points. But how would you quantify the distance of an observation from the others to qualify it as an outlier? Outliers are also described as observations whose probability of occurring is low. But, again, what constitutes low?
Both parametric and non-parametric methods are employed to identify outliers. Parametric methods assume some underlying distribution, such as the normal distribution, whereas non-parametric approaches have no such requirement. Additionally, you could do a univariate analysis, studying a single variable at a time, or a multivariate analysis, studying more than one variable at the same time, to identify outliers.
The question arises: which approach and which analysis is the right one? Unfortunately, there is no single right answer. It depends on the end purpose of identifying such outliers. You may want to analyze the variable in isolation, or use it among a set of variables to build a predictive model.
Let’s try to identify outliers visually.
Assume we have the data for Revenue and Operating System for Mobile devices for an app. Below is the subset of the data:
How can we identify outliers in the Revenue?
We shall try to detect outliers using both a parametric and a non-parametric approach.
In the above plot, the x-axis represents the Revenues and the y-axis the probability density of the observed Revenue value. The density curve for the actual data is shaded in 'pink', the normal distribution is shaded in 'green' and the log-normal distribution is shaded in 'blue'. The probability density for the actual distribution is calculated from the observed data, whereas the densities for both the normal and log-normal distributions are computed from the observed mean and standard deviation of the Revenues.
Outliers could be identified by calculating the probability of occurrence of an observation, or by calculating how far the observation is from the mean. For example, under a normal distribution, observations more than 3 standard deviations above or below the mean could be classified as outliers.
In the above case, if we assume a normal distribution, there could be many outlier candidates, especially among observations with revenue beyond 60,000. The log-normal plot does a better job than the normal distribution, but only because the underlying actual distribution has the characteristics of a log-normal distribution. This cannot be assumed in general, since determining the distribution, or the parameters of the underlying distribution, beforehand (a priori) is extremely difficult. One could infer the parameters by fitting a curve to the data, but a change in the underlying parameters, such as the mean and/or standard deviation, due to new incoming data will change the location and shape of the curve, as observed in the plots below:
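As a rough illustration of the 3-standard-deviation rule, here is a minimal sketch using only Python's standard library. The function name three_sigma_outliers and the revenue figures are made up for illustration; they are not the article's data.

```python
import statistics

def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) > k * sd]

# Hypothetical revenues: 30 ordinary values plus one extreme value.
revenues = [100 + (i % 10) for i in range(30)] + [10_000]
outliers = three_sigma_outliers(revenues)
print(outliers)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so on small samples this rule can miss outliers it would catch on larger ones.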
The above plots show the shift in location or the spread of the density curve based on an assumed change in mean or standard deviation of the underlying distribution. It is evident that a shift in the parameters of a distribution is likely to influence the identification of outliers.
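To make the effect of a parameter shift concrete, the sketch below checks the same observation against two assumed normal distributions. All means, standard deviations and the observation itself are hypothetical values chosen for illustration; the helper names are not from the article.

```python
from statistics import NormalDist

def is_outlier(x, mean, sd, k=3.0):
    """Flag x if it lies more than k standard deviations from the mean."""
    return abs(x - mean) > k * sd

def upper_tail_probability(x, mean, sd):
    """Probability of seeing a value above x under Normal(mean, sd)."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(x)

x = 60_000  # hypothetical revenue observation

# Assumed parameters before new data arrives: x sits 4 sd above the mean.
flag_before = is_outlier(x, mean=20_000, sd=10_000)
# Assumed parameters after new data arrives: x sits about 2.3 sd above the mean.
flag_after = is_outlier(x, mean=25_000, sd=15_000)

print(flag_before, upper_tail_probability(x, 20_000, 10_000))
print(flag_after, upper_tail_probability(x, 25_000, 15_000))
```

The observation has not changed, yet it is an outlier under the first set of parameters and not under the second.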
Let’s look at a simple non-parametric approach like a box plot to identify the outliers.
In the box plot shown above, we can identify 7 observations, which could be classified as potential outliers, marked in green. These observations are beyond the whiskers.
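The points beyond the whiskers can also be found without drawing the plot. The sketch below assumes the common convention of placing the whiskers' limits 1.5 times the interquartile range beyond Q1 and Q3; the article does not state which convention its plot used, and the sample values are hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def boxplot_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical revenues, not the article's data.
revenues = [10, 12, 13, 15, 16, 18, 20, 22, 95]
print(iqr_fences(revenues))
print(boxplot_outliers(revenues))
```

Different tools compute quartiles slightly differently, so the exact set of flagged points may not match what a given plotting library draws.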
In the data, we have also been provided information on the OS. Would we identify the same outliers if we plot the Revenue based on OS?
In the above box plot, we are doing a bivariate analysis, taking 2 variables at a time, which is a special case of multivariate analysis. It seems that there are 3 outlier candidates for iOS, whereas there are none for Android. This is due to the difference in the distribution of Revenues for Android and iOS users. So, by analyzing the Revenue variable on its own (i.e., univariate analysis), we identified 7 outlier candidates, which dropped to 3 when a bivariate analysis was performed.
Both parametric and non-parametric approaches could be used to identify outliers, depending on the characteristics of the underlying distribution. If the mean accurately represents the center of the distribution and the data set is large enough, a parametric approach could be used, whereas if the median better represents the center of the distribution, a non-parametric approach is suitable.
Dealing with outliers in a multivariate scenario becomes all the more tedious. Clustering, a popular data mining technique and a non-parametric method, could be used to identify outliers in such a case.
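The same effect can be reproduced on toy data. In the sketch below, the box-plot rule (assuming the 1.5 * IQR convention) is applied once to all revenues pooled together and once per OS. The rows are invented for illustration and do not reproduce the article's counts of 7 and 3.

```python
import statistics
from collections import defaultdict

def boxplot_outliers(values, k=1.5):
    """Return the values more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def outliers_by_group(rows, k=1.5):
    """Apply the box-plot rule separately within each group label."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: boxplot_outliers(vals, k) for g, vals in groups.items()}

# Hypothetical (OS, revenue) rows: many low Android values, few high iOS values.
rows = [("Android", v) for v in range(10, 30)]
rows += [("iOS", v) for v in (50, 55, 60, 65, 70, 200)]

pooled = boxplot_outliers([v for _, v in rows])  # univariate view
per_os = outliers_by_group(rows)                 # bivariate view
print(pooled)
print(per_os)
```

In the pooled view every iOS revenue looks extreme, because the Android values dominate the quartiles; within the iOS group only the largest value stands out.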
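One possible way to use clustering for this, sketched below, is to run k-means and flag points that end up in very small clusters. This is only one heuristic among several and is not necessarily the method the author has in mind. The farthest-first seeding, the min_size threshold and the sample points are all assumptions made for this illustration.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then keep adding
    the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, labels

def cluster_outliers(points, k, min_size=2):
    """Flag points assigned to clusters with fewer than min_size members."""
    _, labels = kmeans(points, k)
    sizes = [labels.count(c) for c in range(k)]
    return [p for p, lab in zip(points, labels) if sizes[lab] < min_size]

# Hypothetical 2-D observations: two tight groups and one isolated point.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
flagged = cluster_outliers(points, k=3)
print(flagged)
```

The result depends on the choice of k and on min_size, which is itself an assumption in the same spirit as choosing a whisker length or a number of standard deviations.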
Comment
Please note that "The density curve for the actual data is shaded in ‘pink’" must be taken with a grain of salt. There is no density curve for actual data; you have density approximations using a number of nonparametric methods, usually kernel-based. So when you plotted your pink curve, you already used a nonparametric method.
Hi Jacob, thank you for the reality check. You are right, I bluntly rammed an open door here.
I am often running into the problem of people applying black-box methodology to problems while ignoring underlying assumptions and whether they make sense for their own problem.
I stand by my statement that without an assumption on the original distribution you cannot make sense of the output of the outlier detection method, which I think is often the pitfall of many such analysis techniques. In other words, whether a detected outlier is really an outlier for your data is more complex than thresholding above, say, Q3 + 1.5*IQR.
Hey Benjamin, as you know, parametric tests assume specific probability distributions, while non-parametric tests are called distribution-free because they are based on fewer assumptions. Consider popular non-parametric tests like Kolmogorov-Smirnov, etc. Even after you calculate the statistic, you still have to assume an alpha or significance level to draw conclusions. That alpha level, like 1%, 2% or 5%, is itself an assumption.
I agree with you that if you do not know the shape of the distribution you are sampling from, it is not possible to have a non-assuming non-parametric outlier detection method. But in practice, as in most other non-parametric tests, there is some assumption, though much less than in parametric ones.
Nice quick introduction to the field of outlier detection, though incorrect as far as the non-parametric approach goes. The definition of what constitutes an outlier in a "boxplot" approach is "whatever-falls-outside-of-the-whiskers". However, the location of the whiskers is based on conventions. I most often saw them defined as being 1.5 times the interquartile range above Q3 and below Q1 respectively. However, in some applications people will use the 9th and 91st percentiles or the 2nd and 98th percentiles. It is easy to see that this method is not a non-parametric one either.
Overall, if you do not know the shape of the distribution you are sampling from, it is not possible to have a non-assuming non-parametric outlier detection method.
The article seems a good general-purpose way to detect outliers. The post on detecting outliers with clustering, which can be read by clicking the hyperlink in the repost above, explains K-means clustering quite well, especially for beginners such as me.
© 2018 Data Science Central