When Good Data Goes Bad - DataScienceCentral.com


You’ve all heard the saying “Garbage in, garbage out” and probably have your own horror story about wrestling data to the ground over missing values, inconsistent codes, or variable formats. It is hard to argue that you don’t need better data quality to make better decisions, but is better data quality enough?

If the formats are right and the coding is consistent, should you just trust that your data can be reconciled to its sources? Can you balance the total monthly sales back to the source systems this month? If you expected a certain distribution of data, wouldn’t you want to know if it was different? If you normally get data from 50 states every day but only have 45 today, wouldn’t you be asking questions even if it passed all of your quality checks?

When assessing how reliable your data is, you need to evaluate it from multiple perspectives:

Does your data meet your quality standards?
Can your data be reconciled to its sources?
Has there been any drift in your data?

Like a three-legged stool, data reliability needs each leg to be stable. Take any one leg away and you could have a somewhat precarious balancing act. With all three, you can sit comfortably and make your data-driven decisions with confidence.

The First Leg – Data Quality

Do the right thing!

Data quality isn’t a new topic. Many organizations have been working hard at improving their data quality for years. Your organization may already have robust sets of data quality rules that are defined and maintained by a center of excellence and cover basic to sophisticated situations.

This raises the first question: Why do you still have data quality issues? The answer is simple: coverage. The number one problem, seen again and again, is that data quality rules are not applied across enough of your data. Often it is the most basic checks that are missed. Are there nulls where there shouldn’t be? Is the data in the wrong format? Are the values unique?
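As a rough illustration of those three basic checks (nulls, format, uniqueness), here is a minimal sketch in plain Python. The function name, the column names, and the sample records are all hypothetical, and a real platform would apply such rules across many tables rather than a single list of records.

```python
import re
from collections import Counter


def basic_quality_report(rows, required, unique_key, formats):
    """Run null, format, and uniqueness checks over a list of dict records."""
    issues = {"nulls": [], "bad_format": [], "duplicates": []}
    for i, row in enumerate(rows):
        # Null check: required columns must be present and non-empty.
        for col in required:
            if row.get(col) in (None, ""):
                issues["nulls"].append((i, col))
        # Format check: non-empty values must match the expected pattern.
        for col, pattern in formats.items():
            value = row.get(col)
            if value not in (None, "") and not re.fullmatch(pattern, str(value)):
                issues["bad_format"].append((i, col))
    # Uniqueness check: report key values that occur more than once.
    counts = Counter(row.get(unique_key) for row in rows)
    issues["duplicates"] = sorted(
        k for k, n in counts.items() if n > 1 and k is not None
    )
    return issues


# Hypothetical sample data: one empty state, one bad date, one repeated key.
rows = [
    {"order_id": "A1", "state": "TX", "order_date": "2024-01-05"},
    {"order_id": "A2", "state": "", "order_date": "2024-01-05"},
    {"order_id": "A2", "state": "CA", "order_date": "01/06/2024"},
]
report = basic_quality_report(
    rows,
    required=["order_id", "state"],
    unique_key="order_id",
    formats={"order_date": r"\d{4}-\d{2}-\d{2}"},
)
print(report)
```

Each issue is reported as a (row index, column) pair, so the same report can feed an alert or a dashboard.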

Without good coverage of the basics, you’ll always have data quality issues. You need a better, easier way to apply these data quality checks, which are essentially table stakes. Why can’t your data quality system just tell you what you need to apply? Why do you have to guess? Can’t the system do 80% of the work for you with 20% of the effort?

The second question to pop up: What about today? You know the quality was good yesterday, but what about today? Or tomorrow? Shouldn’t these independent checks just be part of the pipeline that supplies you with the data you need? If you can trust that you’ll be alerted when something is wrong, you can be comfortable that your data is right.

The Second Leg – Data Reconciliation

Someone checked this, right?

Data reconciliation is more than just knowing that you fed X rows into your pipeline and X rows were processed. It’s about knowing that business-relevant data can be tied back to its source, that aggregate sales figures tie to the transaction detail, that data in key fields didn’t get corrupted, that filters were appropriately applied. It’s about knowing all of that can be traced to today’s feed and yesterday’s feed as well as tomorrow’s feed.
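One way to picture tying aggregate sales back to transaction detail is the sketch below. It assumes amounts arrive as decimal strings and that both sides can be keyed by month; the function name, field names, and figures are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def reconcile_totals(source_rows, target_totals, key, amount,
                     tolerance=Decimal("0.00")):
    """Compare per-key sums of source detail with totals reported downstream."""
    source_totals = defaultdict(Decimal)  # Decimal() starts each sum at 0
    for row in source_rows:
        source_totals[row[key]] += Decimal(row[amount])

    mismatches = {}
    # Check every key seen on either side so missing periods are also caught.
    for k in sorted(set(source_totals) | set(target_totals)):
        expected = source_totals.get(k, Decimal("0"))
        actual = Decimal(target_totals.get(k, "0"))
        if abs(expected - actual) > tolerance:
            mismatches[k] = (expected, actual)
    return mismatches


# Hypothetical transaction detail from the source system.
transactions = [
    {"month": "2024-01", "amount": "100.50"},
    {"month": "2024-01", "amount": "49.50"},
    {"month": "2024-02", "amount": "75.00"},
]
# Hypothetical monthly totals as reported by the warehouse.
warehouse = {"2024-01": "150.00", "2024-02": "70.00"}

mismatches = reconcile_totals(transactions, warehouse, "month", "amount")
print(mismatches)  # only 2024-02 differs: 75.00 in detail vs 70.00 reported
```

Decimal is used instead of float so that currency sums compare exactly, and the same check can be rerun against each day’s feed.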

Most data quality platforms delegate this responsibility to the tooling or integration platforms that move the data, since it is seen as more operational. Those tools and platforms should build in the needed checks and balances to ensure proper operations. However, you shouldn’t underestimate the value of independent validation. Would you ask the accountant who posts all the credits and debits to the general ledger to also audit the books? The exercise of creating an independent reconciliation builds trust in your data and the processing of it.

Having an independent reconciliation doesn’t mean it shouldn’t be part of your overall pipeline. On the contrary, it should be triggered by your pipeline and should be a key decision point in the pipeline flow.

The Third Leg – Schema Drift and Data Drift

Something just doesn’t feel right!

Drift comes in two flavors: schema drift and data drift. Either could cause problems, or could be nothing to worry about. Yet either, if undetected, can undermine the third leg of the stool.

Schema drift can generally be defined as a change in the structure of your data. Perhaps a new column was added or an old one was dropped. Perhaps the precision or the format of a field was changed. When all your data sources were under your control, these types of changes were known events, but in today’s world of external data sources, schema changes happen all the time. Depending on how you’ve built your processing and analysis, this may or may not be an issue. However, finding out about a schema change before it starts cascading through your environment can save a lot of headaches. Consider what could happen if a vendor inserts an extra column in a daily CSV file that misaligns the columns on the file load to your cloud data warehouse. Your data quality checks should start catching this, but now you have to perform a lot of clean-up. Wouldn’t it be better to be alerted to the schema change before it had an impact?
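A schema check of this kind can run before the file is loaded. The sketch below assumes the vendor file has a header row and compares it with a list of expected column names; the function and column names are hypothetical.

```python
import csv
import io


def detect_schema_drift(expected_columns, csv_text):
    """Compare a CSV header row against the expected column list."""
    actual = next(csv.reader(io.StringIO(csv_text)))
    shared_expected = [c for c in expected_columns if c in actual]
    shared_actual = [c for c in actual if c in expected_columns]
    return {
        "added": [c for c in actual if c not in expected_columns],
        "removed": [c for c in expected_columns if c not in actual],
        # True when the columns both sides share appear in a different order.
        "reordered": shared_expected != shared_actual,
    }


expected = ["order_id", "state", "amount"]
# Hypothetical daily file in which the vendor has inserted a "region" column.
todays_file = "order_id,region,state,amount\nA1,West,CA,19.99\n"

drift = detect_schema_drift(expected, todays_file)
print(drift)  # {'added': ['region'], 'removed': [], 'reordered': False}

if drift["added"] or drift["removed"] or drift["reordered"]:
    print("Schema changed: hold the load and alert the data owner")
```

Running the comparison on the header alone is cheap, so it can act as a gate ahead of the warehouse load rather than a clean-up step after it.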

Data drift can generally be defined as a shift in the “shape” or distribution of your data. As counterintuitive as it seems, this data may not break your best data quality rules. In fact, it may successfully be reconciled across multiple tests and may still be of concern. Consider a few scenarios:

An insurer evaluates claims reserves based on a monthly claims feed that normally contains data from all 50 states, but this month it contains only 45.

The coding on all records is correct
All data was successfully processed and reconciled
But their evaluation of claims reserves may be wrong
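A completeness check for this scenario could be as small as the sketch below. It assumes you keep a reference set of the categories you expect in every feed; a five-state set stands in for the full list of 50, and the names are hypothetical.

```python
def missing_categories(expected, records, field):
    """Return the expected category values that never appear in the feed."""
    observed = {r[field] for r in records}
    return sorted(set(expected) - observed)


# Stand-in for the full list of 50 states (assumption for brevity).
expected_states = {"CA", "NY", "TX", "FL", "WA"}

# Hypothetical claims feed: every record is valid, but two states are absent.
claims = [
    {"claim_id": 1, "state": "CA"},
    {"claim_id": 2, "state": "TX"},
    {"claim_id": 3, "state": "CA"},
    {"claim_id": 4, "state": "NY"},
]

missing = missing_categories(expected_states, claims, "state")
print(missing)  # ['FL', 'WA']
```

Record-level quality rules cannot catch this, because the problem lies in the records that never arrived.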

A customer service department relies on an AI model for Next Best Action (NBA) in critical CSR-Client interactions. The model was trained on a certain set of demographic features.

Age is a critical feature used by the NBA model
The model yields great results when first deployed
Over time, the median age of callers shifts, and the model results drop off
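One common way to quantify a shift like this is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its current distribution. The sketch below is one possible approach, not a method the scenario prescribes; the bin edges, sample ages, and the 0.25 alert level (a widely used rule of thumb) are assumptions.

```python
import math


def population_stability_index(baseline, current, bin_edges, eps=1e-6):
    """PSI between two samples, using shared bin edges for both."""

    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


# Hypothetical caller ages at training time and today.
training_ages = [25, 30, 35, 40, 45, 50, 55, 60]
current_ages = [45, 50, 55, 60, 65, 70, 75, 80]
edges = [35, 50, 65]  # bins: under 35, 35-49, 50-64, 65 and over

psi = population_stability_index(training_ages, current_ages, edges)
if psi > 0.25:  # rule-of-thumb threshold, tune for your own data
    print(f"Age distribution has shifted (PSI={psi:.2f}); consider retraining")
```

A PSI of zero means the two binned distributions are identical; larger values mean more of the population has moved between bins.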

An online retailer prides itself on fast fulfillment of orders and has plenty of BI platforms reporting on the volume and breakdown of orders so they can plan accordingly.

Reacting quickly to changes in order distribution is critical to their business model
On a particular day, there is a shift in orders for a popular product in a certain region
The BI platforms record it, but only after the fact, and no one notices the change, resulting in a drop in fulfillment efficiency

In each case, there may be a perfectly valid reason for the drift in the distribution of the data. Perhaps, by some statistical fluke, there really were no claims in 5 states this month. Perhaps the AI model is still working optimally even with the shift in the age feature. Perhaps being aware of the shift in orders sooner wouldn’t have affected fulfillment. But wouldn’t it be better to know, and to be able to address the source or effect of these drifts before they become an issue? Wouldn’t it be better to get the complete claims data before resetting reserves, retrain the NBA model on current data, or be actively alerted to the shift in order distribution?

Finally, monitoring for data drift is great if you know which critical features to monitor, but what if you aren’t sure? Wouldn’t it be great if your data reliability system looked for anomalies in all of your data, essentially providing you with an early warning system, your own “canary in a coal mine”?
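A very simple version of such an early warning system scans every tracked metric and flags any whose value today sits far from its recent history. The sketch below uses a z-score with a threshold of 3; the metric names, history, and threshold are assumptions, and production systems typically use more robust methods.

```python
import statistics


def flag_anomalies(history, today, threshold=3.0):
    """Flag metrics whose value today is far from their historical mean."""
    flagged = {}
    for metric, values in history.items():
        mean = statistics.fmean(values)
        stdev = statistics.stdev(values)
        if stdev == 0:
            continue  # constant history: a z-score is undefined, so skip
        z = (today[metric] - mean) / stdev
        if abs(z) > threshold:
            flagged[metric] = round(z, 2)
    return flagged


# Hypothetical daily order counts per region over the last five days.
history = {
    "orders_west": [100, 104, 98, 101, 97],
    "orders_east": [200, 205, 195, 198, 202],
}
today = {"orders_west": 160, "orders_east": 201}

alerts = flag_anomalies(history, today)
print(alerts)  # only orders_west is flagged
```

Because the loop covers every metric in the history, you do not have to decide in advance which features matter.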

Sit Comfortably, Thanks to Data Reliability

With all three legs in place, you should be able to sit comfortably and rely on your data, but only if the legs are properly attached to the seat of the stool: your data reliability framework. Then you can have a coordinated view of each aspect of data reliability: Quality, Reconciliation, and Drift. Your framework needs to automate these processes so that reliability is part of your data foundation and built into your overall data observability, that is, your view of the data processing, data management, and data pipelines.