Many predictive models assume that the predictors are normally distributed. A normal distribution is un-skewed, meaning roughly symmetric: the probability of a value falling to the right of the mean equals the probability of it falling to the left.

This article outlines the steps to detect skewness in data and resolve it in order to build better predictive models. Specifically, it discusses:

  • Statistics for calculating the skewness of data
  • The Box-Cox transformation for resolving skewness
  • Sample Python and R code for the Box-Cox transformation and for calculating skewness
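As a minimal sketch of the first bullet, the sample skewness can be computed with `scipy.stats.skew` (the Fisher-Pearson coefficient). The exponential data below is a hypothetical example chosen because it has a long right tail, so its skewness is positive:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample: exponential data has a long right tail
data = rng.exponential(scale=2.0, size=1000)

# Fisher-Pearson skewness: the mean of ((x - mu) / sigma)^3
g = skew(data)
print(g > 0)  # True: positive skewness indicates a right-skewed distribution
```

A value near 0 suggests symmetry; large positive values indicate right skew, large negative values left skew.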

Finding the right transformation to resolve skewness can be tedious. Box and Cox, in their 1964 paper, proposed a statistical method to find it. They suggested fitting the following family of transformations and estimating λ (typically by maximum likelihood):

    y(λ) = (x^λ − 1) / λ   if λ ≠ 0
    y(λ) = log(x)          if λ = 0

Notice that, because of the log term, this transformation requires the x values to be positive. If the data contain zeros or negative values, all values must be shifted (for example, by adding a constant) before applying this method.

You can find sample R and Python implementations of the Box-Cox transformation for resolving skewness.




Comment by Michael Emery on February 23, 2019 at 9:04am

Links are not working. Looks like the articles were taken down. 

Comment by leonardo auslender on December 28, 2015 at 9:54am

"The fundamental assumption in many predictive models is that the predictors have normal distributions."

Which methods? Trees, OLS and others do not require this assumption. Could you explain? Thanks.  

Comment by Shahram Abyari on December 26, 2015 at 12:40pm

Thanks Mark... I fixed the link...
