Practical guide to understand, build, and execute data quality & cleaning process
In the book ‘Bad Data Handbook’ Q Ethan McCallum has rightly said, “We all say we like data, but it’s not the data but the insights that we derive from it are what we care about.” Yet, a data analyst gets to dedicate only 20% of her time to the art and science of generating insights out of data. The rest of her time is spent in structuring and cleaning the data.
In order to minimize the time investment in data cleaning, there is a need of standardized frameworks and tools that work for the diverse data and business use cases across industries, functions, and domains.
This blog aims to equip you with the knowledge you need to build and execute such standardized data quality frameworks that work for your data and use cases. We will focus on:
Data Quality is the method of assessing the quality of the data by examining for inconsistent, inaccurate, incomplete, or unreasonable data values.
Data Cleaning is the method of correcting the data based on its quality.
Data quality and cleaning methods run in a loop till we get to the desired state of data quality. It is the use case(s) and the data properties that define all the components of this process: such as the number of data metrics or checks to measure, level of automation needed, sophistication of error handling, and more.
For example, the sales data of an organisation powers multiple dashboards for the sales, marketing and growth teams everyday. Each of these teams rely on the underlying data to be accurate. In this scenario, the process should incorporate all possible checks, maintain a data quality report which is shareable, and have run logs to trace back errors. Whereas if we are running a specific analysis with advertising data that will not be used by anyone else apart from us, then the data checks need to be applied only on the relevant columns of the table that are crucial for that analysis.
Data quality and cleaning both function on a common thread that we like to call Data Quality (DQ) Metrics. They lay the ground on which quality is assessed and cleaning measures are defined. Each metric is like a Q&A and each response has an associated action.
For example, the count of missing observations in column X is a DQ metric. The value of this metric helps in assessing the data quality and accordingly constructs the cleaning action, like replacing all missing values in X with the median value of X.
To build your data quality process, it is crucial to understand the following properties of data & associated use cases which determine your entire data quality and cleaning journey.
It begins with a simple question: what all can go wrong with this dataset?Schema, Data format, Data Context, Use case(s), and Output: These are the properties of data & use case(s) that will help in answering this question and build your exhaustive list of DQ metrics.
Based on our experience, we have bucketed DQ metrics into three levels:
Level 1: Generic DQ metrics that should be checked irrespective of the type of dataset type or the use case. Few example:
Level 2: Contextual checks which vary across different datasets and use cases. For example:
Level 3: Comparative checks where we compare the data table with an ideal or master data table.
Along with a list of DQ metrics, knowledge around properties such as Reusability, Repeatability, Scale, and the Tech Stack helps in executing the DQ metrics to assess quality and accordingly choose the right set of tools & technology to clean the data
To illustrate this better, we are taking three scenarios, similar to what we’ve encountered in the past, that had different combinations of these characteristics, to demonstrate different execution approaches & methods.
The data of the national electricity scheme was small monthly time series data of all districts in India in an .xlsx format (2 files with less than 25k rows & 15 columns). As it was being created for a one-time use case (for demonstration in a meeting), the tasks involved had no repeatability. However, it is preferable for the process to be reusable/reproducible anyway so as to take in account multiple iterations of data processing before reaching the final desired output.
The best suited tool to complete the task can either be a no code open source tool or a script in R (due to our familiarity with the language) that would generate a CSV that would in turn power a R shiny dashboard.
We used R to assess the data quality and generate clean structured files for our R shiny dashboard.
In this case, the scheme data is at a village level (6 lakh+ Indian villages) for 5 years along with monthly satellite imagery that is 10x more granular than the villages of India. Hence, the data size is big, input data formats are multiple — CSVs and geo-spatial raster (images), the output dashboard needs a shapefile with data tables to be able to plot maps, and we require the cleaning code to be reusable.
In this case, we use R as it can handle CSVs along with spatial vector and raster formats. We also prefer to use QGIS as it’s great for visual validation. Following are some of the additional checks, beyond ones mentioned in case study 1, that we run on satellite imagery and aggregated geo-tagged village data files.
In this case, the highlight is the automated ETL workflow that will power the live dashboard. Along with the checks we’ve covered in cases I and II, we also need to make right decisions and execute on the following data and use case properties:
List of additional actions taken to fulfil these properties-
Docker is an open-source tool that helps in making an application or set of scripts, machine-agnostic. Docker introduced the container concept that is a standardized unit of software that allows developers to isolate their app from its environment, solving the “but it works on my machine” headache. And as our ETL process was built on our system but was supposed to be running on client’s system, they needed to be dockerized.
Apache Airflow is the system that supports the auto-run of this data pipeline and provides a UI to the user to view logs, monitor the pipeline, etc. It is an open source workflow management tool that was started by Airbnb in 2008. Each scheme in our case is a separate workflow or a DAG (Direct Acyclic Graph) which is each composed of different nodes. Each node performs one step: fetching the data from API, storing the data, checking it, cleaning and transforming it and finally writing the final output to the specified destination. There can be multiple inputs and outputs coming and going to different sources. For each node we can track how much time it takes to run or how many times the run failed, etc. to evaluate the net time it will take for the data to update on our dashboard.
This setup also helps in an easy handover of the project to the other team who will take care of monitoring and updating of scheme’s data in the future. Also, the logs and alerts are crucial in this setup because they help in troubleshooting and identifying exactly at which step the issue is located. We can use simple, descriptive print statements and other logging libraries in both R and Python like log4r, logging, tryCatch.
Alternatively, today there are some interesting tools available in the market that can help you build a similar process much faster with lesser resources-
In this blog we have tried to demystify what data cleaning and quality actually entails so that you can make the most optimal use of your time. The exercise to list down data quality metrics and their associated data cleaning actions can help preempt data issues & optimise the cleaning effort before starting the analysis or modeling part. The knowledge of available tools & products helps you learn & pick the right ones compatible to your skillset and tech stack of your team.
Data, no matter how big, can only produce useful insights if it’s of good quality. Else, it simply amounts to garbage in, garbage out.
Post data cleaning and quality comes the most awaited part, the data analysis. Our next blog ‘Enroute to ML — Taking Statistics & Mathematics to the Next Level’ will cover the journey of statistics & mathematics to Machine Learning.
This is the second article in the series of “Data Science for analytical minds”. For more context to the series, please read through its introduction.
Data cleaning concept-
Error handling in R-
My co-author Ankita mathur and I have spent 6+ years collectively working on data projects ranging from social schemes and tracking dashboards to location intelligence and building alternative data APIs for international philanthropies, NGOs, state & central government bodies, and various corporate businesses.
Through these years, we discovered patterns in data quality and cleaning processes which helped us build frameworks & tools to minimise our data processing & cleaning time, and run these massive projects at SocialCops.
It was in early 2018 when we realised that these tools and frameworks can be used by data teams across the world. That’s when we started our journey with Atlan — building tools to help data teams reduce their time investment in data quality, cleaning, setting up ETL processes and spending more time in finding meaningful insights out of their data.
As part of our community efforts, we have helped the data teams around the world to build their data quality & cleaning process through blogs, courses, seminars, and hands-on workshops.
And special thanks to our brilliant editor on this series- Aishwarya Bedekar (with a promise to never start a sentence with “and” again)