Hello All,

I am new to data science and would like to know the best practices for data preparation.

For example, is it good practice to convert an integer variable into categories?

E.g., if the data contains an Age column, is it good practice to group ages into different categories?

Please let me know if there are any articles I can refer to on this.



Tags: Best, Data, Practice, Science

Replies to This Discussion

The grouping you are talking about is called binning. See articles on this topic: https://www.datasciencecentral.com/page/search?q=binning

Dear Nitish,

Data preparation is often the most tedious and the most essential part of the modeling phase.

It can be time consuming when done properly. As the adage goes: "Garbage in, garbage out."

A data audit is crucial before data preparation. It should include, but is not limited to:

  1. Percentage of missing values and of valid records
  2. Breakdown of counts of invalid values
  3. Are observations duplicated?
  4. How many unique values are there within each variable?
  5. Distribution of each attribute
  6. If normally distributed (ND), compute the mean and the standard deviation (SD)
  7. If not normally distributed (NND), compute the median and the interquartile range (IQR)
  8. Detect outliers and extreme values.
  9. Check association among categorical variables.
  10. Check correlation with the dependent variable, if you have identified one.
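Several of the audit checks above can be sketched in a few lines of pandas. This is only a minimal illustration, not a complete audit: the toy DataFrame, its column names, and the helper names `audit` and `iqr_outliers` are all made up for the example, and the 1.5 x IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical toy data; the columns and values are illustrative only.
df = pd.DataFrame({
    "age": [29, 35, None, 41, 35, 120],
    "gender": ["F", "M", "M", None, "M", "F"],
    "income": [32000, 54000, 61000, 58000, 54000, 75000],
})


def audit(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, percentage missing, number of unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "pct_missing": frame.isna().mean() * 100,
        "n_unique": frame.nunique(dropna=True),
    })


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


summary = audit(df)                         # checks 1 and 4
n_duplicates = int(df.duplicated().sum())   # check 3: fully duplicated rows
age_outliers = iqr_outliers(df["age"].dropna())  # check 8

print(summary)
print("duplicated rows:", n_duplicates)
print("outlying ages:", df.loc[age_outliers[age_outliers].index, "age"].tolist())
```

`df.describe()` and `df["age"].skew()` are quick ways to look at the distribution checks (5 to 7) before deciding between mean/SD and median/IQR.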

Thank you Nitish. I have been looking for something like this.

IMHO, although it is common in medical journals, it is never good practice to "convert" ratio data (e.g., age) into ordinal data ("categories," bins). The ordinal representation loses information that is inherently contained in the ratio data.

For instance, take age with bins of 0-5 years, 6-10 years, 11-15 years, etc. Ages 6 and 10 are treated as "equal," while ages 5 and 6 are treated as just as "different" as ages 1 and 10.
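The information loss described here is easy to see with `pandas.cut`. The snippet below is a small sketch under assumed bin edges (chosen to reproduce the 0-5 / 6-10 / 11-15 example); the labels and the four sample ages are illustrative.

```python
import pandas as pd

ages = pd.Series([1, 5, 6, 10])

# Right-closed intervals (-1, 5], (5, 10], (10, 15] reproduce the bins above.
binned = pd.cut(ages, bins=[-1, 5, 10, 15], labels=["0-5", "6-10", "11-15"])
codes = binned.cat.codes  # ordinal position of each bin: 0, 1, 2

comparison = pd.DataFrame({"age": ages, "bin": binned, "bin_code": codes})
print(comparison)

# Ages 6 and 10 differ by 4 years but land in the same bin,
# while ages 5 and 6 differ by 1 year and land in different bins.
same_bin_6_10 = binned[2] == binned[3]
diff_bin_5_6 = binned[1] != binned[2]
```

After binning, a model only sees the bin code, so the 1-year gap between 5 and 6 and the 9-year gap between 1 and 10 both show up as the same one-step difference.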

Hi Nitish, 

Here are some best practices:

  1. Verify data types and formats: You will almost certainly face data from a variety of sources. In some cases, the data in the source data store may not be available in a convenient or supported format, so a format conversion may be required. You must also ensure that the data types used are accurate. The same type of information coming from different sources may arrive in different data types. Decide on a uniform type and transform the data from each source to it.
  2. Restructuring the source data: Restructuring source data like doing pivot operations may be necessary to transform data into a suitable form.
  3. Identifying the outliers: Outliers are data points that are out of whack with the rest of the data. They are either very large or very small values compared with the rest of the dataset. Outliers are problematic because they can seriously compromise statistics and statistical procedures. A single outlier can have a huge impact on the value of the mean. Because the mean is supposed to represent the center of the data, in a sense, this one outlier renders the mean useless.
  4. Dealing with missing values: Missing values are one of the most common data problems that one encounters. Dropping such records or filling missing values with a measure of central tendency are a few of the approaches for dealing with missing data.
  5. Avoid categorizing data: Data may be categorized at the report level, but it should be stored at the most granular level possible. From there, it can be categorized or consolidated to any level. For predictive analytics, most machine learning models can't be trained directly on raw categorical values (e.g., text labels). These usually must first be encoded in numerical form, for example with one-hot or ordinal encoding.
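To make points 3 to 5 concrete, here is a rough pandas sketch. The DataFrame, the column names, the city values, and the deliberately bad age of 400 are all invented for illustration; median imputation and one-hot encoding are just one reasonable choice among several, not the only correct approach.

```python
import pandas as pd

# Hypothetical data: one missing age, one missing city, one obviously bad age.
df = pd.DataFrame({
    "age": [25.0, 30.0, None, 35.0, 400.0],
    "city": ["Pune", "Delhi", "Pune", None, "Delhi"],
})

# Point 3: a single outlier drags the mean far more than the median.
mean_age = df["age"].mean()      # NaN is skipped: (25 + 30 + 35 + 400) / 4
median_age = df["age"].median()  # middle of 25, 30, 35, 400

# Point 4: fill missing values with a measure of central tendency
# (the median here, since it is less sensitive to the outlier).
df["age_filled"] = df["age"].fillna(median_age)
df["city_filled"] = df["city"].fillna("Unknown")

# Point 5: keep the granular column and encode categories numerically
# only when a model needs it (one-hot encoding in this sketch).
encoded = pd.get_dummies(df["city_filled"], prefix="city")

print(f"mean={mean_age}, median={median_age}")
print(pd.concat([df[["age_filled"]], encoded], axis=1))
```

Whether to drop, cap, or keep the outlying row depends on whether it is a data-entry error or a genuine extreme value, which is a judgment the code cannot make for you.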

Let me know if you have any doubts.

