
The Noise in Modern Data Quality

  • Edwin Walker 

The need for high-quality, trustworthy data in our world will never go away.

With the growth in data, that need is greater than ever. Even though we have evolved from data silos to pipelines (ETL/ELT) to streaming to the modern data stack/warehouse, multi-cloud, and data mesh, we still face the age-old problem of trusting data: What data is good for what purpose? What data can be used where? How can we improve it? What data is sensitive? It begs the question of why this remains unanswered after so many decades and with so many disciplines of data management, such as Data Governance, Data Quality, Data Observability, Data Catalog, Master Data Management, Data Remediation, Data Discovery, Active Metadata Management, Data Privacy, Data Intelligence, Data Lineage, Reference Data Management, and on and on… We talk and do so much that the intent and purpose of why we started, namely trusting and improving data, gets lost. Trust me, I am with you on this. If you don’t believe me, see my hand-drawn picture of me.

[Hand-drawn illustration: noise-in-data-2]

For the brave data folks who dared to practice all those disciplines and are still hanging in there with me on this blog, this is my effort to cut through the noise and point out what is actually needed.

For the weak souls who are going after the next hype — bye!

First, let’s define some factors that help identify the noise in the Modern Data Quality space.

  1. Scale with an increase in data (Scale): With the rise in data, we need a platform that can scale to handle large datasets and diverse types of data and, more importantly, support different types of architectures. When you are dealing with petabytes of data and harvesting both metadata and data, speed and scale are critical, but so is the ability to scale with a no-code to low-code approach.
  2. Each organization is unique (Context): All organizations are not equal. Even within the same vertical, with similar technologies, architecture, and data, every organization is different because its processes, people, and customers are different. In other words, each organization has something unique in terms of context and culture, and the way it views, measures, and understands data differs accordingly. If we don’t understand the data in context, putting in any solution is pretty much useless. For example, the range of annual income and its treatment are very different in marketing vs. risk analysis vs. underwriting: same field, same values, but different views and interpretations (see the first sketch after this list).
  3. Support as organizations evolve (Maturity): Mergers and acquisitions are inevitable as an organization grows, and with them come a blend of complexities, a need to handle diverse environments, and a need to support different levels of maturity. Otherwise you are talking about a new platform every year, and that is not gonna work! The time it takes for a startup to turn into a unicorn has decreased tremendously, so the ability to support different maturity levels is a must.
  4. Relevance to business value (Impact): Sometimes we focus so much on the data that we forget the prime focus: business value. It is not about observing, monitoring, alerting, identifying nulls and blanks, or looking at various dimensions. Nor is it about detecting outliers purely from a data perspective; it is about how well we relate to the strategic initiatives across different functions. For example, a price reduction to drive retention in a tight economy, targeted toward specific demographics, may show up as an outlier even though it was intended per the business strategy (see the second sketch after this list). The last thing anyone wants is to get spammed, whether by email or Slack, or to spend hours on root cause analysis only to figure out everything is OK. It has to make business sense, resonate well, and work harmoniously with the missions and goals the business is solving for.
  5. Lack of time to value (Time & Cost): If you are implementing a heavyweight, process-laden platform or a heavy layer of governance with people, processes, and technology, forget it! You are wasting time and money. By the time you finish, your data landscape has most likely changed, and so have your regulatory needs and customer expectations. Time to value should be measured in days, not weeks, not months, and definitely not years!
  6. Support users of all types (Stewardship): In any organization, we have two types of users, business and technical. Both are relevant for improving quality and enabling a trusted framework; otherwise we are not speaking the same language and are just building more silos and barriers.
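
To make the Context point concrete, here is a minimal sketch in Python with pandas. The column names, contexts, and thresholds are illustrative assumptions, not a recommendation: the same annual income values pass or fail depending on which business context applies the rule.

```python
# Hypothetical sketch: the same annual_income column checked under
# context-specific rules. Thresholds and rule names are illustrative only.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "annual_income": [52_000, 0, 1_250_000, -10],
})

# Each business context interprets the same values differently.
context_rules = {
    # Marketing only needs a rough, positive figure for segmentation.
    "marketing": lambda s: s > 0,
    # Risk analysis also treats implausibly high values as suspect.
    "risk_analysis": lambda s: s.between(1, 500_000),
    # Underwriting requires a non-zero income within policy bounds.
    "underwriting": lambda s: s.between(15_000, 2_000_000),
}

for context, rule in context_rules.items():
    passing = rule(customers["annual_income"])
    failing_ids = customers.loc[~passing, "customer_id"].tolist()
    print(f"{context}: {passing.mean():.0%} pass, failing customer_ids={failing_ids}")
```

Same numbers, three different verdicts: that difference is the context the platform has to understand.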

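As a toy illustration of the Impact point, the sketch below (Python; the SKU, prices, and initiative table are invented for the example) shows how a plain statistical outlier check would flag a deliberate promotional price drop, and how cross-referencing a record of planned business initiatives keeps that from turning into alert spam.

```python
# Hypothetical sketch: a plain outlier check flags an intentional,
# strategy-driven price reduction; cross-checking against known business
# initiatives suppresses the false alarm. All names and numbers are made up.
import pandas as pd

prices = pd.DataFrame({
    "sku": ["A"] * 7,
    "week": [1, 2, 3, 4, 5, 6, 7],
    "price": [100.0, 101.0, 99.0, 100.0, 98.0, 102.0, 70.0],  # week 7: planned promotion
})

# Initiatives the business has already approved (e.g. a retention promo).
planned_initiatives = {("A", 7): "retention promotion for price-sensitive segment"}

# Simple z-score style check: anything more than 2 standard deviations away.
mean, std = prices["price"].mean(), prices["price"].std()
prices["is_outlier"] = (prices["price"] - mean).abs() > 2 * std

for row in prices.itertuples(index=False):
    if row.is_outlier:
        reason = planned_initiatives.get((row.sku, row.week))
        if reason:
            print(f"sku={row.sku} week={row.week}: expected change ({reason}), no alert")
        else:
            print(f"sku={row.sku} week={row.week}: unexplained outlier, raise alert")
```

The statistics alone would have paged someone; the business context is what decides whether the signal matters.
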
Welcome to the Modern Data Quality series. Now that we have identified the noise, in the next set of blogs we will see how each of the above-mentioned disciplines has either failed to keep its promises, or how the emerging schools of thought lack some of these attributes and are bound to fail.

My intent with this series of blogs is to help data leaders and influencers avoid ending up as victims of fads or broken ships.

Hang in there until the next part…ciao!