Address data is semi-structured, which makes it one of the most challenging components of any data matching activity. For a long time, manual methods such as extensive SQL programming and spreadsheet formulas have been used to match address lists. While this may have been workable in the past, it is no longer a viable way to handle complex data from third-party sources.
In this quick post, I cover the key challenges of manual address data matching and how a point-and-click, self-service solution may be what your team needs to increase productivity and efficiency while obtaining accurate results.
Let's get started.
Suppose you have two sets of customer data, A and B, with Set A covering customers aged 20 – 35 and Set B covering customers aged 35 – 50. Many of the people on these lists share an exact address (members of the same family) or a similar one (residents of the same condo, for instance). You want to consolidate the two lists so you can send one newsletter per household instead of three letters to three members of the same family.
Data matching is the process that lets you match these two sets on the basis of their address data and consolidate them into a final list of customers who share the same address.
If only it were that simple!
For instance, Set A may itself contain exact or similar addresses. This means you first have to match records within Set A, dedupe them, and save the unique records as a new list. You then need to repeat the same process with Set B.
Next, you may discover that several records in the deduplicated versions of Set A and Set B are similar. So you run another match between the two lists to create a third list that holds the consolidated information from both. But what about the original records? You might need to match them too!
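To make those steps concrete, here is a minimal Python sketch of the dedupe-then-match loop described above. It is only an illustration under simple assumptions: the sample names and addresses are made up, the `ABBREVIATIONS` map and the `normalize`, `dedupe` and `consolidate` helpers are hypothetical, and records are matched on an exact normalized address rather than with any fuzzy logic.

```python
import re

# Hypothetical abbreviation map, for illustration only.
# Real address data needs a far larger, locale-aware list.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "apartment": "apt", "road": "rd"}


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, and shorten common words."""
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def dedupe(records):
    """Group the (name, address) records of one list by normalized address."""
    groups = {}
    for name, address in records:
        groups.setdefault(normalize(address), []).append(name)
    return groups


def consolidate(set_a, set_b):
    """Dedupe each set on its own, then merge groups that share an address."""
    merged = dedupe(set_a)
    for key, names in dedupe(set_b).items():
        merged.setdefault(key, []).extend(names)
    return merged


# Made-up sample data standing in for Set A and Set B.
set_a = [("Ann Lee", "12 Oak Street, Apt. 4"), ("Bo Lee", "12 Oak St Apt 4")]
set_b = [("Cy Lee", "12 oak st., apartment 4"), ("Di Roy", "9 Elm Avenue")]

households = consolidate(set_a, set_b)
for address, names in households.items():
    print(address, "->", names)
```

Even in this toy version, every spelling variant that should match has to be anticipated in the normalization step, which is exactly the part that gets out of hand with real-world data.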
The process is mind-boggling and iterative in nature. Imagine having to do all this manually.
This seemingly simple matching process would take days to accomplish. Users would first have to extract the data from its source, which could be a CRM, an ERP or a data warehouse. The data would then be handed over to business users in the form of spreadsheet files, and that’s when the real work starts.
Business users would have to analyze the data for common errors, validate the information of each column, clean typos, and use Excel formulas to identify null or duplicated fields. This process is repeated for every data set that needs to be matched. Once the user is satisfied with the quality of the data set, they then start the matching process.
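As a rough idea of what those spreadsheet checks amount to, the sketch below counts empty fields per column and lists rows that appear more than once. The sample rows and the `profile` helper are hypothetical and only illustrate the null and duplicate checks mentioned above; they do not cover typo cleaning or column validation.

```python
from collections import Counter

# Made-up rows standing in for an exported spreadsheet.
rows = [
    {"name": "Ann Lee", "address": "12 Oak St", "zip": "10001"},
    {"name": "Bo Lee", "address": "12 Oak St", "zip": ""},
    {"name": "Ann Lee", "address": "12 Oak St", "zip": "10001"},
]


def profile(rows):
    """Count empty values per column and list rows that occur more than once."""
    empty = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or not str(value).strip():
                empty[column] += 1
    # Rows are compared as sorted (column, value) pairs so they can be counted.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = [dict(key) for key, count in seen.items() if count > 1]
    return dict(empty), duplicates


empty_counts, duplicate_rows = profile(rows)
print("Empty fields per column:", empty_counts)
print("Duplicated rows:", duplicate_rows)
```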
In situations where business users are not involved, data matching is performed through extensive SQL queries. The downside is that business users have limited ability to truly analyze and understand the data. What if they want additional data such as gender and occupation? They will have to communicate the request to IT, and the whole tedious process is revised or repeated to get a match.
Data matching is a necessary function when working with tabular data, but it’s not an easy process.
Some of the key challenges our customers face with data matching include:
For instance, two data sources are matched to find duplicate addresses within a specific block. Eight out of 20 addresses are a match, indicating duplication as well as the use of one address for multiple people (such as members of a family). However, 4 of the 20 matches are false positives, meaning the addresses are predicted to match but do not belong to the same person. A missing value such as a house number may be the cause of a false positive match. Another 6 of the 20 are false negatives, meaning the addresses do match and do belong to the same person, but the system missed them because of variables such as a missing or incomplete middle name or a missing ZIP code.
In both cases, teams will have to spend time manually verifying and validating information. Manual address data matching works well only when there are no inconsistencies in the data. But as we know, data, especially modern data, is anything but consistent.
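One common way to summarize numbers like these is with precision and recall. The short sketch below applies them to the counts in the example above, on the assumption that the 8 matches are all true positives; the post itself does not say so, and the `precision_recall` helper is just an illustration.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Return (precision, recall) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)  # share of predicted matches that are correct
    recall = tp / (tp + fn)     # share of real matches that were found
    return precision, recall


# Counts from the example: 8 correct matches (assumed), 4 false positives, 6 false negatives.
precision, recall = precision_recall(tp=8, fp=4, fn=6)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.57
```

Read this way, roughly one in three predicted matches would need manual correction, and more than four in ten real matches would be missed entirely.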
I could simply tell you to get a top-of-the-line data matching tool and that would be the solution to all your problems (manual effort, lack of SQL resources, etc.), but that’s not how it works.
There is a whole process of matching address data.
Without going through this process, it would be next to impossible to match your address data accurately. If you're dabbling in big data, manual methods are simply not an option. It's time to rely on automated, self-service solutions that let your team spend their time analyzing data rather than being stuck in iterative tasks that are counter-productive and ineffective at delivering accurate match results.