
Hello All,

I am new to data science and would like to know the best practices for data preparation.

For example, is it good practice to convert an integer variable into a categorical one?

Say the data contains an Age column. Is it good practice to group the ages into categories?

Please let me know if there are any articles I can refer to on this.



Tags: Best, Data, Practice, Science


Replies to This Discussion

The grouping you are talking about is called binning. See articles on this topic:

Dear Nitish,

Data preparation is the most tedious and the most essential part of the modeling phase.

It can be time-consuming when it is done properly. As the adage goes: "Garbage in, garbage out."

Before any data preparation, though, a data audit is crucial. It should include, but is not limited to:

  1. Percentage of missing values and of valid records
  2. Breakdown counts of invalid values
  3. Whether any observations are duplicated
  4. Number of unique values within each variable
  5. Distribution of each attribute
  6. If normally distributed (ND), compute the mean and the standard deviation
  7. If not normally distributed (NND), compute the median and the interquartile range (IQR)
  8. Detect outliers and extreme values
  9. Check association among categorical variables
  10. Correlation with the dependent variable, if one has been identified
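Several of the audit items above can be sketched in a few lines of pandas. This is only a rough illustration, not a complete audit: the `audit` function, the column names, and the toy data are all made up for the example, and the 1.5 x IQR rule is just one common convention for flagging outliers.

```python
from typing import Optional

import pandas as pd


def audit(df: pd.DataFrame, target: Optional[str] = None) -> pd.DataFrame:
    """Return a per-column summary covering part of the audit checklist.

    Hypothetical helper written for this example only.
    """
    summary = pd.DataFrame({
        "pct_missing": df.isna().mean() * 100,  # item 1
        "n_unique": df.nunique(),               # item 4 (NaN is not counted)
    })

    # Items 6 and 7: location and spread for numeric columns only.
    # Non-numeric columns are left as NaN by index alignment.
    numeric = df.select_dtypes(include="number")
    summary["mean"] = numeric.mean()
    summary["sd"] = numeric.std()
    summary["median"] = numeric.median()
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    summary["iqr"] = q3 - q1

    # Item 8: count values outside the 1.5 * IQR fences (an assumed convention).
    lower = q1 - 1.5 * (q3 - q1)
    upper = q3 + 1.5 * (q3 - q1)
    summary["n_outliers"] = ((numeric < lower) | (numeric > upper)).sum()

    # Item 10: linear correlation with the dependent variable, if one is given.
    if target is not None and target in numeric.columns:
        summary["corr_with_target"] = numeric.corrwith(numeric[target])

    return summary


# Toy data, invented for illustration.
df = pd.DataFrame({
    "age": [20, 30, 40, 50, 200],
    "city": ["a", "b", "a", None, "b"],
    "y": [1, 2, 3, 4, 5],
})

n_duplicates = int(df.duplicated().sum())  # item 3
summary = audit(df, target="y")
print("duplicated rows:", n_duplicates)
print(summary)
```

Items 2, 5 and 9 (invalid values, distribution shape, association among categorical variables) depend on what counts as valid in your data and are not covered by this sketch.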

Thank you, Nitish. I have been looking for something like this.

IMHO, although it is common in medical journals, it is never good practice to "convert" ratio data (e.g., age) into ordinal data ("categories," bins). Binning discards information that the ratio data inherently contains and its ordinal representation does not.

For instance, take age with bins of 0-5 years, 6-10 years, 11-15 years, etc. Ages 6 and 10 are treated as "equal," while ages 5 and 6 are treated as just as "different" as ages 1 and 10.




© 2020 Data Science Central