Does your database contain dirty data? Now, before anyone starts to get flustered and thinks that we're referring to stuff like "Sleazy Internet Gals and Guys" (dot com!!), rest assured, that's not the case. For that matter, we're also not referring to data that takes bribes, has been dragged through a septic tank, or is a nuclear bomb designed to spread the maximum amount of lethal radiation possible.

What we're actually talking about here is data that is out of date or flat-out wrong, hence "dirty". How much of a "thing" is dirty data? Let's take a look ...

Some Sobering Statistics
The article "6 Quick Dirty Data Stats" tells us that, according to sources such as Dun & Bradstreet, Lyris Technologies, and Sales & Marketing Institute, up to 20% of addresses and 18% of phone numbers change every year. Not only that, every year up to 21% of all CEOs change, and almost one-third of email addresses on a "house" file end up outdated.

"Ummm...when was the last time this data was checked out?"

Here's a final factoid: every year, up to two-thirds of people change job functions or companies.

If your job depends on leads and contacts, it is absolutely essential to have the correct information. If you're trying to woo the CEO of a company that you want to do business with, you'll start off on the colossally wrong foot if you get their name wrong or don't realize that the company has changed CEOs.

That dirty data can end up as egg on your face.

The Dirt Comes In All Forms
Dirty data can manifest itself in many ways. For instance, the data can be incorrect, inaccurate, redundant, inconsistent, or incomplete. Incidentally, the difference between incorrect and inaccurate is that inaccurate data uses correct concepts in the wrong way, like Paris, Italy. Both Paris and Italy are correct terms, but they don't belong together, because Paris is not in Italy. Incorrect data, by contrast, falls outside the valid range altogether, such as a date like January 35th.
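To make those categories a little more concrete, here is a minimal Python sketch that flags incomplete, incorrect, and inaccurate values in a single contact record. The field names (name, city, country, month, day), the tiny CITY_COUNTRY lookup table, and the find_problems function are all assumptions made up for illustration; real data would need its own rules and a proper reference dataset.

```python
from datetime import date

# Hypothetical lookup table; a real one would come from a reference dataset.
CITY_COUNTRY = {"Paris": "France", "Rome": "Italy"}


def find_problems(record):
    """Return a list of problem labels for one contact record (a dict)."""
    problems = []

    # Incomplete: a required field is missing or empty.
    for field in ("name", "city", "country"):
        if not record.get(field):
            problems.append(f"incomplete: {field}")

    # Incorrect: the value falls outside the valid range (e.g. January 35th).
    # Assumes month and day are integers; 2000 is a leap year, so Feb 29 passes.
    if "month" in record and "day" in record:
        try:
            date(2000, record["month"], record["day"])
        except ValueError:
            problems.append("incorrect: date")

    # Inaccurate: each value is valid on its own, but the pairing is wrong.
    expected = CITY_COUNTRY.get(record.get("city"))
    if expected and record.get("country") and expected != record["country"]:
        problems.append("inaccurate: city/country")

    return problems
```

Running find_problems on a record with city "Paris", country "Italy", month 1, and day 35 would report both the bad date and the mismatched pairing. Cities missing from the lookup table are simply not checked, which is one reason a sketch like this only catches part of the dirt.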

What Can Be Done?
Since there are no virtual bars of soap out there, what can be done to clean up dirty data? A lot of it depends on just how much data you have to work with. There are data management platforms that you can avail yourself of, in essence outsourcing the work to an outside company. That's a good solution if you have more money than time on your hands. Considering how much money is lost due to dirty data, it's a worthwhile investment.

If data cleansing is going to happen in-house, it's going to be a task and a half. The first step is to make sure that all of your data, regardless of dirt content, has been backed up, on the off-chance that attempts at cleaning the data somehow make the problem worse. This way, you can always go back to the point before you started cleaning.
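As a rough illustration of that first step, the sketch below copies a data file into a backup folder under a timestamped name before any cleaning begins. It assumes the data lives in a single file (a CSV export, say); the back_up function and the "backups" folder name are hypothetical, and a real database would normally be backed up with its own dump or snapshot tools instead.

```python
import shutil
from datetime import datetime
from pathlib import Path


def back_up(source, backup_dir="backups"):
    """Copy `source` into `backup_dir` under a timestamped name; return the new path."""
    source = Path(source)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed

    # Timestamp the copy so a later backup does not reuse the same name.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.stem}-{stamp}{source.suffix}"

    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

If a cleaning pass goes wrong, the timestamped copy is the point you go back to.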

Once backed up, the data should then be run through a spell-checker, which should catch a lot of misspellings and other obvious errors. In other words, anything that can be automated should be done first.

Then comes the hard part: actually going through the data and checking for things like extra spaces, inaccuracies, and the like. It's a lot of work, and best divided amongst members of a team. Many hands make light work.
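Some of that checking can be given a head start before the team digs in. The sketch below, which assumes records are dictionaries of strings with hypothetical name and email fields, trims extra spaces and drops records that become identical once spacing and capitalization are ignored. It is only one possible approach and does nothing about inaccuracies, which still need a human eye.

```python
import re


def normalize(value):
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return re.sub(r"\s+", " ", value).strip()


def dedupe(records, key_fields=("name", "email")):
    """Keep the first record for each normalized, case-insensitive key."""
    seen = set()
    unique = []
    for record in records:
        # Build a comparison key from the cleaned-up fields.
        key = tuple(normalize(record.get(f, "")).lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

For example, "  Jane   Doe " and "jane doe" with the same email address would be treated as one contact, and only the first record would be kept.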

Unfortunately, there's no fast, easy data cleaning tool that can be universally applied to every company's and individual's database. But ignoring the problem of dirty data could result in the loss of quite a chunk of change.


