This paper discusses techniques for merging data files where no key field exists between the files. The paper will illustrate an approach to resolve two issues that are common to most fuzzy matching techniques: 1) how to weight proxy identifier fields, and 2) how to measure the Type One and Type Two errors of the merge estimation algorithm.
A common requirement in analytics is to merge records in two or more large sets of information (e.g., thousands if not millions of records) where no exact key exists to match records between the data sets. When no exact key between the two data sets exists, a common merging solution is “fuzzy” matching, which uses proxy fields as substitute keys to match records between the two data files.
The name “fuzzy” comes from the uncertainty in the accuracy of the merge. With “fuzzy” matching, some linkages of records may be false and some valid matches may be missed. A specific “fuzzy” match can therefore be true or false, and the analyst cannot be certain which, i.e., the match is “fuzzy”. In statistical terms, fuzzy matches carry Type One (false positive) and Type Two (false negative) uncertainties.
As is often the case, an analyst may have multiple proxy keys to choose from for the “fuzzy” merge. When this occurs, a second issue must be addressed. Which proxy keys will optimize the merging of records in the two files? For example, suppose an analyst has the following proxy keys to choose from: last name, social security number, and residence zip code. Which combination of proxy keys would produce the optimal matches?
There are many approaches to these issues. However, all “fuzzy” techniques face the same two problems: a) what is the matching error, and b) which proxy data fields, and what weights on those fields, will maximize the accuracy of the “fuzzy” match. As a result, most “fuzzy” matching carries a high degree of uncertainty as to its accuracy.
In this paper, I will address these two issues by applying bootstrapping methods to “fuzzy” matching.
Bootstrapping is the practice of determining the properties of estimators (such as regression coefficients) by measuring those properties when sampling from an approximating distribution. It can also be used to construct hypothesis tests. It is often used as an alternative to inference based on parametric assumptions when the required sampling distributions cannot be derived analytically, as is the case with “fuzzy” matching.
The basic idea of bootstrapping is that the sample parameters are our best guess about the population parameters from which the sample was taken. As an example, assume we are interested in “fuzzy” matching two data files and determining the accuracy of the “fuzzy” matches and the proposed weights of the proxy matching fields. If we cannot obtain these estimators for the entire population, we might be able to obtain these estimators for a sample of this population. The key principle of the bootstrap is to provide a way to simulate repeated observations from an unknown population using the obtained sample as a basis.
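The bootstrap principle is easy to sketch. The paper's procedure uses SAS®, but the idea is language-independent; the sketch below uses Python with hypothetical sample values, resampling with replacement to approximate the sampling distribution of a simple estimator (here, the sample mean):

```python
import random
import statistics

random.seed(42)

# Hypothetical labeled sample: observed values from a small sample
# of the population of interest.
sample = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.58, 0.69, 0.77, 0.63]

# Draw repeated resamples (with replacement) from the sample to
# simulate repeated observations from the unknown population.
boot_means = []
for _ in range(2000):
    resample = [random.choice(sample) for _ in sample]
    boot_means.append(statistics.mean(resample))

boot_means.sort()
# A 95% percentile confidence interval for the population mean.
lo, hi = boot_means[int(0.025 * 2000)], boot_means[int(0.975 * 2000)]
print(f"point estimate: {statistics.mean(sample):.3f}")
print(f"95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

The same resampling logic applies when the estimator is a vector of regression coefficients rather than a mean, which is how it is used in the procedure below.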
For example, suppose the analyst can correctly match some of the records between the two data files. Many populations have unique identifiers for at least some records in the data file: Social Security numbers, Medicare IDs, Medicaid IDs, Veterans Affairs IDs, and so on. Suppose both of our data files carry the same unique identifier for some records. We can use these identifiers to find the records in the two data files that belong to the same individual. This correctly matched sample of the two populations can then be used to calculate both the proxy weights and the Type One and Type Two errors of the proposed “fuzzy” matched population.
A common problem with the “fuzzy” matching technique is how to validate the matched data. With a bootstrap sample, the correct matches are known by definition. Therefore, we can use the sample data to test the accuracy of the fuzzy match regression equation we have generated. In this paper, I recommend testing the probit regression model by assessing it against a portion of the sample data that was not used to create it. The bootstrap sample should be randomly split into two parts: one part is used to estimate the coefficients, and the second part (hereafter the holdout sample) is used to generate Type One and Type Two error rates.
Accuracy is measured by whether records were correctly classified in the holdout sample with the probit coefficients created by the Monte Carlo process. There are four possible classifications:
1. prediction of no match when the holdout sample record was not a match (True Negative)
2. prediction of no match when the holdout sample record was a match (False Negative, a Type Two error)
3. prediction of a match when the holdout sample record was not a match (False Positive, a Type One error)
4. prediction of a match when the holdout sample record was a match (True Positive)
The percent of correctly classified observations in the holdout sample is referred to as the assessed model accuracy. In addition, you can use the sample data to vary the magnitude of the Type One and Type Two errors by raising or lowering the probability cutoff that defines whether a pair of records matches. One can vary this cutoff from .50 to .75, to .90, to .95, or to .99 as desired. Asymmetry in the Type One and Type Two error rates may be relevant if avoiding one type of error is more important than the other. For example, suppose you are merging the medical records for a patient into a single record. In that case, it might be more important to avoid a Type One error (merging records that belong to different patients) and less important to avoid a Type Two error.
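The four classifications and the effect of moving the probability cutoff can be illustrated with a short sketch. This is Python with hypothetical predicted probabilities and true labels; the paper's actual estimation is done in SAS®:

```python
# Hypothetical holdout records: (predicted match probability, true label).
holdout = [(0.95, 1), (0.80, 1), (0.62, 0), (0.55, 1),
           (0.40, 0), (0.88, 1), (0.30, 0), (0.72, 0)]

def classify(records, cutoff):
    """Tally the four classifications at a given probability cutoff."""
    tp = fp = tn = fn = 0
    for p, truth in records:
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1          # Type One error
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1          # Type Two error
    return tp, fp, tn, fn

# Raising the cutoff trades Type One errors for Type Two errors.
for cutoff in (0.50, 0.75, 0.90):
    tp, fp, tn, fn = classify(holdout, cutoff)
    accuracy = (tp + tn) / len(holdout)
    print(f"cutoff {cutoff:.2f}: TP={tp} FP={fp} TN={tn} FN={fn} "
          f"accuracy={accuracy:.2f}")
```

Raising the cutoff converts False Positives into True Negatives at the cost of new False Negatives, which is exactly the asymmetric trade-off described above.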
STEPS TO THE BOOTSTRAP PROCEDURE
1. Correct all bad data in your sample's unique match key
2. Match the two data sets by proxy identifiers
3. Denote matched records with M=1
4. To create the nonmatching sample records, take the remaining records in the two data sets and match by all possible proxy matching indicators.
5. Delete all records which match in the two data sets by any of your proxy indicators. Note that many of the criteria may have missing values which are not considered a match.
6. The remaining nonmatching records will be denoted as unmatched with M=0
7. Divide the Matched records into two equal parts. Do the same with the Unmatched records.
8. Combine one-half of matched records with one-half of the unmatched records and label this as the Bootstrap sample to be used in calculating the regression parameters.
9. Combine the remaining data and designate those records as the Hold Out sample to be used to calculate Type One and Type Two errors of the model.
10. Combine Bootstrap and Hold Out Samples and denote as Sample data S=1.
11. Combine Sample data with the remaining population of records and denote these records as not in sample S=0.
12. Calculate Selection Sample Bias by identifying data fields that might be correlated with whether a record was in the sample.
13. I recommend using the SAS® QLIM procedure to calculate the sample selection bias, because PROC QLIM can produce the inverse Mills ratio as an option in the output file of a multivariate probit model. I used SAS® for my testing, but you could use other statistical software, such as R, which has similar procedures.
14. Next, make a series of indicator variables that signify if a data field value in one data set matched a data field in the other data set in the sample file.
15. Use these indicator fields from Step 14 and the inverse Mills ratio calculated in Step 13 to estimate the dependent variable M (1 = match, 0 = no match, as coded in Steps 3 and 6) using the SAS® LOGISTIC procedure.
16. I recommend the LOGISTIC procedure because it provides several useful features:
a. PROC LOGISTIC calculates a pseudo R-squared;
b. PROC LOGISTIC runs faster than PROC QLIM;
c. Your model may exhibit quasi-complete separation, which must be corrected before estimating the model. Quasi-complete separation happens when at least one group of the dependent variable has zero frequency for at least one category of an independent variable. It can cause convergence failures in the model, which in turn can bias the results. PROC LOGISTIC includes the Firth penalized likelihood option to correct for this issue (for more on this method, see Heinze and Schemper, Statistics in Medicine 2002, 21:2409-2419, and Heinze, Statistics in Medicine 2006, 25:4216-4226).
d. You can specify the use of the probit model instead of the logistic model.
17. Repeat Steps 13 through 16 with a new random sample from the full sample population. Continue resampling until the moving average of the regression coefficients stabilizes (reaches an asymptote).
18. Once the estimation process is complete, apply the model coefficients to the sample data which has been placed in the “holdout” file.
19. The hold-out estimates are used to calculate the power function of the model and the Type One and Type Two errors. With this information, you can optimize the probability of a match to minimize the Type One and Type Two errors.
20. Finally, apply the estimation model to the data file records that were not in the sample and determine how many matches the model estimates.
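The sample-construction portion of the procedure (Steps 1 through 11) can be sketched as follows. This is a minimal illustration in Python with hypothetical records and field names; the regression and selection-bias steps (12 through 20) are performed in SAS® and are not reproduced here.

```python
import random

random.seed(7)

# Hypothetical records: some carry a shared unique identifier (e.g., an
# SSN), which stands in for the sample unique match key of Step 1.
file_a = [{"id": i, "zip": f"{10000+i}", "name": f"A{i}"} for i in range(20)]
file_b = [{"id": i, "zip": f"{10000+i}", "name": f"A{i}"} for i in range(0, 30, 2)]

# Steps 2-3: pairs that share the unique key are known matches (M=1).
ids_b = {r["id"] for r in file_b}
matched = [{"id": r["id"], "M": 1} for r in file_a if r["id"] in ids_b]

# Steps 4-6: pair up the remaining records, then drop any pair that
# still agrees on a proxy field; what is left is known non-matches (M=0).
rest_a = [r for r in file_a if r["id"] not in ids_b]
rest_b = [r for r in file_b if r["id"] not in {x["id"] for x in file_a}]
unmatched = [{"a": a["id"], "b": b["id"], "M": 0}
             for a in rest_a for b in rest_b
             if a["zip"] != b["zip"] and a["name"] != b["name"]]

# Steps 7-9: split each group in half; one half (the bootstrap sample)
# estimates the model, the other half becomes the holdout sample used
# to measure Type One and Type Two error rates.
random.shuffle(matched)
random.shuffle(unmatched)
boot = matched[:len(matched)//2] + unmatched[:len(unmatched)//2]
holdout = matched[len(matched)//2:] + unmatched[len(unmatched)//2:]

print(f"bootstrap sample: {len(boot)} records, holdout: {len(holdout)} records")
```

In practice the proxy fields would be real identifiers (last name, ZIP code, etc.) with missing values handled as in Step 5, and the split-and-estimate cycle would be repeated as described in Step 17.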
The intent of this article is to demonstrate a statistically based method of performing fuzzy matches. The applicability of this approach depends on several criteria: 1) you need many records to work with, 2) you must have several alternate matching criteria (e.g., name, address, etc.) to match with, and 3) a reasonable number of records must have no missing information in your matching indicator fields. If your data meet these three criteria, then you may want to consider this approach for your fuzzy matching problem.
Your comments and questions are valued and encouraged. Contact the author at:
E-mail: [email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.