*Originally posted by Vincent Ajayi.*

One of the most common challenges faced by data scientists (DS) and data analysts (DA) is missing data. Every day, both DA and DS spend several hours dealing with it. Why is missing data a problem? Analysts presume that every variable should have a value for every observation, and when a variable has no value, we refer to it as missing data. Missing data can have severe effects on a statistical model, and ignoring it may lead to biased estimates that invalidate the statistical results.

In this article, I will suggest ways to resolve the problem of missing data. Although different studies have suggested various methods to deal with missing data, I have noticed that none of these methods has theoretical or mathematical support to justify its process. I will walk through the nine essential steps a data scientist should follow to address the issue. The steps are based on my personal experience as a quantitative researcher and data scientist for more than 7 years.

**Basic steps for dealing with missing data**

**Aims and objectives:** Before jumping to any method of estimating missing data, we must know the motivation behind the project in order to identify the research problem. The aim of the project must be outlined so that the key variables likely to be relevant can be specified. You must be able to list the relevant data that can help answer the questions that define the objectives of the project.

**Check for the appropriate variable:** If you have been provided with the dataset, ask yourself a question: does the dataset contain all the relevant variables needed to address the research questions? For example, a data scientist may be interested in predicting inflation with a multivariate model, and the data received might not contain likely inflation indicators such as the consumer price index or the GDP deflator. To address the issue, you should ask your line manager or the data department to provide a dataset that contains the relevant variables.

**Visualise the data and check for the missing value:** If there is a missing value, check with the database. Remember that the best approach for finding a missing value is to look for the value at the source, because there may have been a problem with the extraction process.

**Variable substitution:** A straightforward way to deal with missing data is to substitute the variable with a similar indicator, especially if a large percentage of the data is missing. I strongly suggest using another indicator in place of the incomplete one, especially for continuous variables. For example, the GDP deflator could be used instead of the consumer price index to measure or forecast inflation. However, one needs to be careful in applying this method, because different proxies for the same variable may lead to different results.

**Mean/mode/median substitution:** This method can be applied if the percentage of missing values is small (e.g., less than 30%). For continuous variables, the missing value can be replaced by the mean or median of the variable. For categorical variables, the missing value can be replaced by the modal value. The limitation of this method is that it reduces the variability of your data.

**Delete the missing attribute:** If a large percentage of the data is missing (e.g., more than 30%), the affected rows or columns can be dropped, provided the variable is an independent variable that is not related to the dependent variable and is not relevant to the model. For example, if you want to use multiple regression to predict revenue and you have a product-number variable with missing values, the variable could be removed instead of filling in the missing values. Note that you may lose samples and important information, and the model may underfit.

**Evaluation and prediction:** You can use different statistical or theoretical models to estimate or predict the missing value. For instance, a statistical model fitted to the available data can be used to predict the values that are missing.

**Apply sophisticated statistical models that are robust in handling missing data without requiring imputation:** For example, if you have missing data, the XGBoost model can be applied for prediction instead of linear regression. XGBoost handles missing values by default: at each split it learns a default direction for observations whose value is missing, chosen to minimise the training loss, so no explicit imputation is required.

**Sample reduction:** This step applies to time-series data. If values are missing, the sample can be shortened to a period that has no missing values, and the estimation can be based on that reduced sample instead of searching for the missing values. Note that sample reduction can significantly affect the precision and accuracy of the results.
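The substitution and deletion steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy dataset, the column names, and the helper functions (`missing_fraction`, `impute`, `drop_column`, `handle_missing`) are hypothetical and not from the original article, and the 30% cut-off is the article's example figure rather than a fixed rule. The sketch uses the median for continuous variables and the mode for categorical ones, and it drops a column only because the example treats `product_no` as irrelevant to the model.

```python
from statistics import mean, median, mode

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"revenue": 120.0, "region": "north", "product_no": None},
    {"revenue": None, "region": "south", "product_no": None},
    {"revenue": 95.0, "region": None, "product_no": 17},
    {"revenue": 110.0, "region": "north", "product_no": None},
    {"revenue": 130.0, "region": "north", "product_no": 42},
]

THRESHOLD = 0.30  # illustrative cut-off taken from the article's "30%" example


def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def impute(rows, column, strategy):
    """Return new rows with missing values in `column` replaced by the
    mean, median, or mode of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


def drop_column(rows, column):
    """Return new rows without `column`."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def handle_missing(rows, categorical=()):
    """Drop columns above THRESHOLD, otherwise impute
    (mode for categorical columns, median for the rest).
    Assumes any dropped column is not relevant to the model."""
    for column in list(rows[0]):
        fraction = missing_fraction(rows, column)
        if fraction == 0:
            continue
        if fraction > THRESHOLD:
            rows = drop_column(rows, column)
        elif column in categorical:
            rows = impute(rows, column, "mode")
        else:
            rows = impute(rows, column, "median")
    return rows


cleaned = handle_missing(rows, categorical={"region"})
for row in cleaned:
    print(row)
```

Here `revenue` and `region` are each missing in one of five rows, so they are imputed, while `product_no` is missing in three of five rows and is dropped. In practice the same logic is usually written with a data-frame library, and whether to drop or impute should still be decided variable by variable, as the steps above describe.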

© 2020 Data Science Central ®
