A 40-year-old trick to clean data efficiently and perform fuzzy matching

While most data cleaning is performed before loading data into a database (especially for one-time, ad hoc analyses), there is a way to do it continuously (say, once a day or once a week) after the data is in its final database: add lookup tables that handle the messy fields.

When I was working at eBay, we had a database of clients from around the world. Some client names were in a foreign language and contained accents and special characters, which made some SQL joins very tricky. We created a lookup table of names that mapped the different spellings of a company name to a standardized name and client ID. Think of names such as M.I.T. and MIT, which represent the same entity but are spelled differently. The table also helps deal with duplicate records. This old trick lets you do fuzzy matching, and the size of the lookup tables (updated daily) remained manageable.
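To make the idea concrete, here is a minimal sketch in R (in the database itself, the equivalent is an ordinary join against the lookup table); the names, IDs, and amounts below are invented for illustration:

    # Hypothetical lookup table: every observed spelling maps to one standardized
    # name and one client ID (all values here are made up for the example).
    lookup <- data.frame(
      raw_name  = c("M.I.T.", "MIT", "Mass. Inst. of Technology"),
      std_name  = "Massachusetts Institute of Technology",
      client_id = 101L,
      stringsAsFactors = FALSE
    )

    # Messy incoming records that use different spellings of the same entity.
    transactions <- data.frame(
      raw_name = c("MIT", "M.I.T.", "MIT"),
      amount   = c(500, 300, 200),
      stringsAsFactors = FALSE
    )

    # Joining through the lookup table collapses the variant spellings onto one
    # entity, so duplicates disappear and aggregates land on the standardized ID.
    joined <- merge(transactions, lookup, by = "raw_name", all.x = TRUE)
    aggregate(amount ~ client_id + std_name, data = joined, FUN = sum)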

What do you think of this idea? Of course, the best solution is to combine this system with traditional cleaning techniques when possible. But in systems where data is uploaded and updated automatically every day, lookup tables are very helpful.

Replies to This Discussion

I think the general approach has merit; we do the same thing in Excel all the time, using VLOOKUP against a table to clean the data in a given column (i.e., field). One question on your diagram: you show a Union All as the first step. How do you accomplish that before the cleaning?

A few months ago I had a problem dealing with different names for the same products. My client had 27 different CRM databases (one for each Brazilian state), and those databases didn't communicate with each other. So, to create a national portfolio, I needed to identify which names corresponded to the same products, but without a reference list of correct names, because that was exactly what we were trying to build. The problem: they had more than 82,000 unique products and services registered across all those databases.

To reduce the problem, I created a new variable containing each name in lowercase with no accents, numbers, or special characters. This step cut the problem down to 65,000 names. Then I built a string distance matrix using the R package stringdist. It took a few hours to build this large matrix on my computer, but it was worth it; I also had to repeat the step a few times to find the best distance method. With all the distances calculated, I only had to apply hierarchical clustering to group similar names together. This step also took a while, because it was hard to find the best cut of the tree. The next day, after finding the best cut, I had reduced the problem to 3,000 unique names without compromising the portfolio.

The next problem was deciding which name would be the standard name for each cluster (product). We decided to keep, for each cluster, the original name that appeared most frequently in the databases.
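In case someone wants to reproduce this, here is a minimal sketch of the workflow in R using stringdist and hclust; the distance method, the cut height, and the product names below are just illustrative assumptions, not the exact settings I used:

    # Step 0: a few messy product names (invented examples).
    library(stringdist)

    products <- c("Cimento Portland 50kg", "CIMENTO PORTLAND 50 KG",
                  "cimento portland", "Tinta Acrílica Branca", "TINTA ACRILICA")

    # Step 1: normalize -- lowercase, strip accents, numbers and special characters.
    normalize <- function(x) {
      x <- tolower(iconv(x, to = "ASCII//TRANSLIT"))
      x <- gsub("[^a-z ]", "", x)
      trimws(gsub(" +", " ", x))
    }
    keys <- unique(normalize(products))

    # Step 2: pairwise string-distance matrix (Jaro-Winkler chosen as an example).
    d <- stringdistmatrix(keys, keys, method = "jw")
    rownames(d) <- colnames(d) <- keys

    # Step 3: hierarchical clustering, then cut the tree at a chosen height.
    hc <- hclust(as.dist(d), method = "average")
    clusters <- cutree(hc, h = 0.25)   # finding the right cut takes experimentation

    # Step 4: for each cluster, keep the most frequent original spelling as the standard.
    df <- data.frame(original = products,
                     cluster  = clusters[normalize(products)],
                     stringsAsFactors = FALSE)
    tapply(df$original, df$cluster,
           function(x) names(sort(table(x), decreasing = TRUE))[1])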

That's a creative approach! I have had a lot of experience with this sort of problem. There are various packages out there that "guarantee" they can de-dup your data (a common problem: multiple names for the same customer across regional ERPs). I looked at a couple but never thought they were worth paying for. Your approach is generalizable and portable.

Thanks, Blaine! If you google "R stringdist hclust", you will find some blog posts about this approach.
