
Address Data Matching Does Not Have to be a Resource-Draining Challenge. Here's How You Can Do it Better.

Address data is semi-structured, making it one of the most challenging components in a data matching activity. For a long time, manual data matching methods, including extensive SQL programming and spreadsheet formulas, have been used to match address lists. While this may have been workable and effective in the past, it is no longer a viable way to handle complex data from third-party sources.

In this quick post, I cover the key challenges of manual address data matching and how a point-and-click, self-service solution may be what your team needs to increase productivity and efficiency while obtaining accurate results.

Let's get started. 

What is Data Matching in the Address Standardization and Validation Process?  

Suppose you have two sets of customer data, A and B: Set A covers customers aged 20 – 35 and Set B covers customers aged 35 – 50. Many of the people on these lists share an exact address (members of the same family) or a similar one (residents of the same condo, for instance). You want to consolidate the two lists so you can send one newsletter to each household instead of three letters to three members of the same family.

Data matching is the process that lets you match these two sets on the basis of their address data and consolidate them into a final list of customers who share the same address.

Sounds easy?  

If only it were!

For instance, Set A may itself contain similar or exact duplicate addresses. This means you will first have to match records within Set A, dedupe them, and save the unique records as a new list. You will need to repeat the same process with Set B.

Next, you may discover that several records in the deduplicated versions of Set A and Set B are similar. So you run another data match between the two lists to create a third list that holds the consolidated information of Set A and Set B. But what about the original records? You might need to match them too!
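To make the sequence concrete, here is a minimal Python sketch of the steps just described: dedupe each set, then match the two deduplicated lists. The sample records and the `address_key` and `dedupe` helpers are made up for illustration, and the key used here (lowercase, punctuation removed) is a deliberately crude assumption rather than a real address standardization routine.

```python
def address_key(address):
    """Crude matching key: lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in address.lower()
    )
    return " ".join(cleaned.split())


def dedupe(records):
    """Keep the first (name, address) record seen for each address key."""
    unique = {}
    for name, address in records:
        unique.setdefault(address_key(address), (name, address))
    return unique


# Hypothetical sample data for Set A and Set B.
set_a = [
    ("Ann Lee", "12 Oak St., Apt 4"),
    ("Bob Lee", "12 oak st apt 4"),  # same household as Ann
    ("Cara Diaz", "98 Pine Ave"),
]
set_b = [
    ("Dan Lee", "12 Oak St Apt 4"),  # same household again
    ("Eve Moss", "7 Elm Rd"),
]

# Step 1: dedupe within each set.
unique_a = dedupe(set_a)
unique_b = dedupe(set_b)

# Step 2: match across the two deduplicated lists.
shared = sorted(unique_a.keys() & unique_b.keys())

# Step 3: consolidate into one list with a single entry per address.
consolidated = {**unique_b, **unique_a}

print(shared)             # ['12 oak st apt 4']
print(len(consolidated))  # 3
```

Even this toy version needs three passes over the data, which is why the manual equivalent in spreadsheets becomes so time-consuming.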

The process is mind-boggling and iterative in nature. Imagine having to do all this manually.  

This seemingly simple matching process would take days to accomplish. Users would first have to extract the data from a data source, which could be a CRM, an ERP or a data warehouse. The data would then be handed over to business users in the form of spreadsheet files, and that’s when the real work starts.

Business users would have to analyze the data for common errors, validate the information of each column, clean typos, and use Excel formulas to identify null or duplicated fields. This process is repeated for every data set that needs to be matched. Once the user is satisfied with the quality of the data set, they then start the matching process.
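As a rough idea of what those checks involve, the following Python sketch counts empty fields per column and fully duplicated rows. The rows and the `profile` function are hypothetical examples, and a real profiling pass would look at far more than this (formats, value ranges, typos).

```python
from collections import Counter

# Hypothetical extract; in practice this would come from a CRM or spreadsheet.
rows = [
    {"name": "Ann Lee", "street": "12 Oak St", "zip": "10001"},
    {"name": "Bob Lee", "street": "12 Oak St", "zip": ""},
    {"name": "Cara Diaz", "street": "", "zip": "60614"},
    {"name": "Ann Lee", "street": "12 Oak St", "zip": "10001"},
]


def profile(rows):
    """Count empty values per column and rows that repeat an earlier row."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or not str(value).strip():
                missing[column] += 1
    row_counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values())
    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicate_rows": duplicates,
    }


report = profile(rows)
print(report)
```

For the sample above, the report shows one missing street, one missing ZIP code and one duplicated row.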

In situations where business users are not involved, data matching is performed through extensive SQL queries. The downside is that business users have limited ability to truly analyze and understand the data. What if they want additional data, such as gender and occupation? They will have to communicate the request to IT, and the whole tedious process is revised or repeated to get a match.

Challenges with Data Matching for Addresses  

Data matching is a necessary function when working with tabular data, but it’s not an easy process.

Some of the key challenges our customers face with data matching include:  

  • Ensuring the accuracy of the data: Most companies either do not have a data cleansing system in place or have data managers using complex queries to perform basic data cleansing. Yet issues like negative spacing, fat-finger typos and accidental punctuation marks are not easily detected. Moreover, it takes ages to normalize and standardize address data, especially since it is the field most prone to errors.
  • Collecting data from disparate sources: Data isn’t usually readily available for matching. It usually has to be collected from several disparate sources, making it even more challenging to ensure accuracy. For instance, many companies have to collect data from vendors and third-party apps for analysis, but because these data sources differ from each other in terms of structure, it can be difficult to match them. This problem isn’t just limited to instances of external data. Businesses that are connected to multiple apps or use multiple platforms often find it difficult to consolidate their data for analysis or intelligence. 

  • Failing to measure data match scores: Unfortunately, companies usually do not measure data match scores. There are two common issues with data matching: false positives and false negatives. Both waste the company’s time and effort.

For instance, two data sources are matched to find duplicate addresses within a specific block. 8 out of 20 addresses come back as a match, indicating duplication as well as the use of one address for multiple people (such as members of a family). However, 4 of the 20 are false positives, meaning the addresses are predicted to match but do not belong to the same person. A missing value such as a house number may be the cause of a false positive match. Another 6 of the 20 are false negatives, meaning the addresses do match and do belong to the same person, but the system missed them because of variables like a missing or incomplete middle name or a missing ZIP code.

In both cases, teams will have to spend time manually verifying and validating information. Manual address data matching works best only when there are no inconsistencies in the data. But as we know, data, and modern data especially, is anything but consistent.
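One common way to turn counts like these into a match score is to compute precision and recall. The short Python sketch below does so for the example above, under one assumption that the example does not state outright: that the 4 false positives are among the 8 reported matches, leaving 4 true matches. The `match_quality` function is a hypothetical helper, not part of any particular tool.

```python
def match_quality(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a set of match results."""
    predicted = true_positives + false_positives  # matches the system reported
    actual = true_positives + false_negatives     # matches that really exist
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


reported_matches = 8
false_positives = 4  # assumed to be part of the 8 reported matches
false_negatives = 6  # real matches the system missed
true_positives = reported_matches - false_positives

precision, recall = match_quality(true_positives, false_positives, false_negatives)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.50 recall=0.40
```

Read this way, only half of the reported matches are correct, and the system finds just 4 of the 10 real matches, which is the kind of gap that goes unnoticed when match scores are never measured.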

The Solution - How Do You Get Accurate Address Data Matches?  

I could simply tell you to get a top-of-the-line data matching tool and that would be the solution to all your problems (manual effort, lack of SQL resources, etc.), but that’s not how it works.

Matching address data is a whole process:

  1. Integrating your data sources: If you’re collecting data from multiple sources, you’ll need to integrate it within a single platform before you can start matching. Integrating data isn’t easy, though. You’ll have to extract the data and load it into a CSV file or your database. In either case, you’ll need to prepare the data before you can move it around.

  2. Profiling the data: When you’re preparing the data, you’re basically profiling it to see if it contains errors, typos or missing values. If you skip this step, your matching results will suffer. You need accurate data to perform an effective match.

  3. Normalizing & cleansing the data: Should a city be written as NY, NYC or ny? Should all addresses have ZIP codes? Should dashes between numbers be removed? These may look like minor details, but they degrade the quality of your data. To perform an effective match, you’ll need to clean up these inconsistencies. That’s a challenging task in itself. By a commonly cited estimate, data scientists and analysts spend as much as 80% of their time just cleaning data.

  4. Validating address data: This is the tricky part. Did you know that many of the addresses you have are probably not even valid? People tend to enter incomplete, incorrect, or even fake addresses. If you’re going all out with your mailing activities without first validating and verifying your address data, you could be losing hundreds of thousands of dollars in mailing costs. This is why you need a CASS-certified solution (one from a USPS-certified vendor) to ensure your address list follows USPS guidelines.

  5. Finally, the matching: Now that your address data is clean and validated, it’s time to match. If you’re using a data match solution like DataMatch Enterprise, the process is simple and easy. You select the columns you want to match (whether across, between or within data sources), adjust the match criteria, and that’s it! You get results within minutes.
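For readers who want to see what the normalizing and matching steps look like underneath, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not how DataMatch Enterprise or any CASS-certified product works: the `ABBREVIATIONS` table is a tiny made-up subset, the 0.9 threshold is arbitrary, and `difflib.SequenceMatcher` stands in for whatever fuzzy matching algorithm a real tool would use. It also skips validation entirely, since that requires a postal reference database.

```python
import re
from difflib import SequenceMatcher

# Illustrative subset only; a real standardizer uses a much longer table.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "apartment": "apt"}


def normalize(address):
    """Lowercase, strip punctuation, and abbreviate common street words."""
    text = re.sub(r"[^\w\s]", " ", address.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


def similarity(a, b):
    """Similarity of two addresses after normalization, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match(list_a, list_b, threshold=0.9):
    """Return (a, b, score) for every pair scoring at or above the threshold."""
    pairs = []
    for a in list_a:
        for b in list_b:
            score = similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical sample lists.
list_a = ["12 Oak Street, Apt. 4", "98 Pine Avenue"]
list_b = ["12 oak st apt 4", "7 Elm Rd"]

matches = match(list_a, list_b)
print(matches)
# [('12 Oak Street, Apt. 4', '12 oak st apt 4', 1.0)]
```

The threshold is the "match criteria" knob in miniature: lowering it catches more typos but lets in more false positives, while raising it does the opposite. Note also that comparing every pair is fine for a handful of records but grows quadratically with list size.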

Without going through this process, it would be next to impossible to match your address data accurately. If you’re dabbling in big data, manual methods simply won’t scale. It's time to rely on automated, self-service solutions that let your team spend their time analyzing data, rather than being stuck in iterative tasks that are counter-productive and ineffective at delivering accurate match results.


Tags: address, data, management, quality

