Data scientists reportedly spend 80% of their time on data cleaning and exploratory analysis. What if you could automate most of this? What if data scientists spent most of their time on higher-level tasks that better justify their salaries? I explain here how to do it.
Every Data Set Looks Different
To the junior data scientist, each new dataset seems to come with its own set of challenges, and it looks as if data cleaning cannot be automated. To decision makers and stakeholders, the problem is so remote that they are unaware of the resources wasted on it. To them, automation seems the obvious way to go, but they may underestimate the challenges. Despite how much money it costs, data cleaning is usually not a high priority in many organizations.
Yet there are at most a few dozen issues that come with data cleaning: not thousands, not even hundreds. You can catalog them and address all of them at once with a piece of code that you reuse each time you face a new dataset. I describe the main issues here and how to address them. Automating the data cleaning step can save you a lot of time, and it eliminates boring, repetitive tasks, making your data scientists happier.
Creating a Snapshot of Your Data
First, create a summary table covering each feature separately, starting with its type (numerical, categorical, text, or mixed). For each feature, get the top 5 values with their frequencies. This can reveal a wrong or unassigned zip code such as 99999. Look for other special values such as NaN (not a number), N/A, incorrect date formats, missing values (blanks), or special characters. For instance, accented characters, commas, dollar and percentage signs can cause issues with text parsing and regular expressions. For numerical features, compute the minimum, maximum, median, and other percentiles.
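The snapshot described above can be automated in a few lines. Below is a minimal sketch using pandas; the function name `snapshot` and the toy dataset are my own illustration, not a fixed API.

```python
import pandas as pd

def snapshot(df, top_n=5):
    """Per-feature summary: type, missing count, top values, percentiles."""
    rows = []
    for col in df.columns:
        s = df[col]
        row = {
            "feature": col,
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
            # Top values and their frequencies; may expose 99999-style codes.
            "top_values": s.value_counts(dropna=False).head(top_n).to_dict(),
        }
        if pd.api.types.is_numeric_dtype(s):
            row.update(
                min=s.min(), p25=s.quantile(0.25), median=s.median(),
                p75=s.quantile(0.75), max=s.max(),
            )
        rows.append(row)
    return pd.DataFrame(rows)

# Toy example: one categorical feature with a missing value, one numerical.
df = pd.DataFrame({
    "zip": ["98052", "99999", "98052", None, "10001"],
    "amount": [12.5, 7.0, 99.0, 7.0, 3.2],
})
print(snapshot(df))
```

Running `snapshot` on every new dataset gives you the same one-page overview each time, which is the whole point: the report format never changes even though the data does.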
The next step is to look at interactions between features. Compute the cross-correlation between every pair of features, and check for redundant or duplicate features that you can ignore. Both steps are easy to automate.
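The pairwise check can be sketched as follows. This is an illustrative helper, assuming Pearson correlation and an arbitrary 0.95 threshold; both choices are mine, not prescribed by the article.

```python
import pandas as pd

def redundant_pairs(df, threshold=0.95):
    """Flag feature pairs whose absolute correlation exceeds a threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# x_copy is a rescaled duplicate of x; it should be flagged for removal.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x_copy": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
})
print(redundant_pairs(df))
```

Dropping one feature from each flagged pair shrinks the dataset without losing information; for categorical features, you would substitute a different association measure.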
The Main Issues
I mentioned a few issues that are immediately apparent (if you try to detect them) in the previous paragraph. Here I add more. I conclude with suggestions about what to do when facing these problems, again with automation.
- Check for values that are out-of-range: if possible, get the expected range from your client before starting your analysis.
- Use checksums if possible, for encrypted fields such as credit card numbers or ID fields.
- Look for duplicate or nearly identical IDs or keys. Also, two different IDs may correspond to the same individual. This could reveal typos in your data. Working with a table of common typos can help.
- Also, collect data using pre-populated fields in web forms whenever possible, as opposed to users manually typing in their information such as state, city, zip-code, or date.
- Check for misaligned fields. This happens frequently in NLP problems, where data such as URLs are parsed and stored in CSV files before being uploaded into databases.
- Are dates encoded in different ways? This happens when blending multiple data sets from different sources. Standardize dates to address this issue.
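Several of the checks above can be expressed as small reusable functions. Here is a hedged sketch covering three of them: a Luhn checksum for card-like IDs, an out-of-range test, and date standardization to ISO 8601. The format list in `standardize_date` is an assumption; extend it to match the sources you actually blend.

```python
from datetime import datetime

def luhn_ok(number: str) -> bool:
    """Luhn checksum: doubles every second digit from the right."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def in_range(value, lo, hi) -> bool:
    """Out-of-range check against bounds agreed upon with the client."""
    return lo <= value <= hi

def standardize_date(s: str) -> str:
    """Try several common encodings and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%Y%m%d"):
        try:
            return datetime.strptime(s, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {s!r}")

print(luhn_ok("4539 1488 0343 6467"))   # a well-known valid test number
print(standardize_date("07/04/2021"))
```

Each function is a one-time cost: once written, it applies unchanged to every new dataset, which is exactly the catalog-and-reuse strategy argued for above.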
After the cleaning process, you can standardize your data, then feed it into your machine learning system and process it with your standard algorithms.
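For numerical features, standardization often means z-scoring. A minimal sketch, assuming zero-mean, unit-variance scaling is the convention you want:

```python
import pandas as pd

def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize numeric columns to zero mean and unit variance."""
    num = df.select_dtypes("number")
    return (num - num.mean()) / num.std(ddof=0)

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
print(zscore(df))
```

In production you would fit the mean and standard deviation on training data only and reuse them at prediction time, to avoid leakage.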
Sometimes, the data has issues beyond your control. When I was working at Wells Fargo, internet session IDs generated by the Tealeaf software were broken down into multiple small sessions, resulting in wrong userIDs and very short Internet sessions. This happened when a user moved from one server to another (in other words, it happened all the time). Manually simulating such sessions and looking how they were tracked in the database, helped solve this mystery, leading to correct analyses.
Occasionally, the largest population segment is entirely missing from the database. For instance, in Covid data, people never tested who recovered on their own (the vast majority of the population in the early days) did not show up in any database, giving a lethality rate of 6% rather than the more correct 1%, with costly public policy implications. Use common sense and out-of-the-box thinking to detect such issues, and let stakeholders know about them. Also, use alternate data sources whenever possible. In this case, sewage data, a proxy dataset, provided the answer.
Addressing the Issues
Some issues are easy to fix. For instance, replacing NaN with a blank, if that is how you represent missing data, or creating a table that lists all the ways missing data is encoded. You can fix typos or misspelled names with a dictionary, or by matching cities with zip codes. You can build such a dictionary based on keyword frequencies and associations.
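These lookup-table fixes can be sketched as follows. The tables here are hypothetical placeholders; in practice you would populate them from the keyword frequencies and associations mentioned above.

```python
import pandas as pd

# Hypothetical lookup tables, for illustration only.
MISSING_TOKENS = ["NaN", "N/A", "n/a", "-", "?"]   # all encodings of "missing"
CITY_FIXES = {"Seatle": "Seattle", "Sann Francisco": "San Francisco"}

def clean_column(series: pd.Series) -> pd.Series:
    """Map every known missing-data encoding to blank, then fix known typos."""
    s = series.replace(MISSING_TOKENS, "")
    return s.replace(CITY_FIXES)

cities = pd.Series(["Seatle", "N/A", "Portland"])
print(clean_column(cities).tolist())
```

The tables grow over time as new typos are discovered, but the cleaning code itself never changes, which is what makes the step automatable.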
You can re-align misaligned fields using regular expressions or other techniques to detect split fields, caused for instance by a comma erroneously used as a field separator. Merge duplicate features and duplicate IDs. Errors are not always recoverable, however; in that case you may ignore part of the dataset, though this can introduce biases. Sometimes getting a new, corrected data dump from your provider is the way to go.
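One simple re-alignment heuristic: when a CSV row has exactly one field too many, assume a free-text field was split by a stray comma and merge the extra columns back. This is a sketch under that assumption; real data may need smarter detection.

```python
import csv
import io

EXPECTED_FIELDS = 3  # e.g. id, url, label

def realign(row):
    """Merge back a field split by a stray comma, if exactly one extra field."""
    if len(row) == EXPECTED_FIELDS:
        return row
    if len(row) == EXPECTED_FIELDS + 1:
        # Assume the middle (free-text) field was the one that got split.
        return [row[0], ",".join(row[1:3]), row[3]]
    raise ValueError(f"cannot realign row: {row}")

# First row: the URL contains a comma, so the naive parse yields 4 fields.
raw = "42,http://example.com/a,b?q=1,spam\n7,http://example.com/x,ham\n"
rows = [realign(r) for r in csv.reader(io.StringIO(raw))]
print(rows)
```

Rows that cannot be repaired raise an error rather than being silently guessed at, so you can count them and decide whether dropping them introduces bias.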
This quick overview is part of my new book entitled “Intuitive Machine Learning and Explainable AI”. You can get a copy of my book, here.
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, founder of MLTechniques.com and co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).
Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.