Identify, describe, plot, and remove the outliers from the dataset with R (rstats)

In statistics, an outlier is an observation that lies far away from most of the other observations. Outliers are often the result of measurement error. Therefore, one of the most important tasks in data analysis is to identify and, if necessary, remove the outliers.

There are different methods to detect outliers, including the standard deviation approach and Tukey's method, which uses the interquartile range (IQR). In this post I use Tukey's method because I like that it does not depend on the distribution of the data. Moreover, Tukey's method ignores the mean and standard deviation, which are themselves influenced by extreme values (outliers).
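To make the contrast concrete, here is a minimal sketch on made-up toy data. The vector x and all variable names are assumptions for illustration and are not part of the original script. It computes Tukey's fences by hand with quantile() and compares them with a 3-standard-deviation rule; the 3-SD cutoff is an assumed convention, since the post does not specify one.

```r
# Toy data (made up for illustration): nine ordinary values and one extreme value.
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 100)

# Tukey's method: flag points more than 1.5 * IQR beyond the quartiles.
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
tukey_out <- x[x < lower | x > upper]

# Standard deviation approach (assumed cutoff: 3 SDs from the mean).
sd_out <- x[abs(x - mean(x)) > 3 * sd(x)]

print(tukey_out)  # values flagged by Tukey's fences
print(sd_out)     # values flagged by the 3-SD rule
```

On this toy vector the Tukey rule flags 100, while the 3-SD rule flags nothing, because the extreme value inflates both the mean and the standard deviation. Note that boxplot.stats() works with hinges rather than quantile() quartiles, so its fences can differ slightly on small samples.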

The Script

I developed a script to identify, describe, plot, and, if necessary, remove the outliers. To detect the outliers I use boxplot.stats()$out, which applies Tukey's method and flags values lying more than 1.5*IQR above the upper quartile or below the lower quartile. To describe the data I chose to show the number and percentage of outliers and the mean of the outliers in the dataset. I also show the mean of the data with and without the outliers. As for the plots, I think the boxplot and the histogram are the best for presenting outliers, so the script plots the data with and without the outliers.

outlierKD(dat, variable)
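The full source of outlierKD() is not reproduced in this post. As a rough, non-interactive sketch of the steps described above, the following helper detects outliers with boxplot.stats()$out, prints the summary figures, and returns the vector with the outliers replaced by NA. The name describe_outliers, the toy vector values, and the exact wording of the messages are assumptions for illustration; the plots and the yes/no prompt are left out.

```r
# Hypothetical helper, NOT the author's outlierKD(): summarises the outliers
# in a numeric vector and returns a copy with the outliers set to NA.
describe_outliers <- function(x) {
  out <- boxplot.stats(x)$out   # values beyond the 1.5 * IQR whiskers
  is_out <- x %in% out          # logical mask marking the outliers in x
  n_valid <- sum(!is.na(x))     # number of non-missing observations

  cat("Outliers identified:", sum(is_out), "\n")
  cat("Proportion (%) of outliers:", round(100 * sum(is_out) / n_valid, 1), "\n")
  cat("Mean of the outliers:", round(mean(out), 2), "\n")
  cat("Mean without removing outliers:", round(mean(x, na.rm = TRUE), 2), "\n")
  cat("Mean if we remove outliers:", round(mean(x[!is_out], na.rm = TRUE), 2), "\n")

  x[is_out] <- NA               # replace outliers with NA
  invisible(x)
}

# Toy data (made up): ten ordinary values and one extreme value.
values <- c(48, 50, 51, 49, 52, 50, 47, 53, 50, 51, 120)
cleaned <- describe_outliers(values)
```

Here cleaned has the same length as values, with the single flagged value (120) replaced by NA. Returning a new vector instead of modifying a data frame in place is a design choice of this sketch, not necessarily how outlierKD(dat, variable) behaves.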

Here is an example of the data description printed by outlierKD():

Outliers identified: 58

Propotion (%) of outliers: 3.8

Mean of the outliers: 108.1

Mean without removing outliers: 53.79

Mean if we remove outliers: 52.82

Do you want to remove outliers and to replace with NA? [yes/no]: y

Outliers successfully removed

You can read the full post and see the plots at DataScience+

Tags: r, rstats