Fuzzy matching: comparing two matching engines

Hi Experts! I am new, so sorry if my topic is a bit "foreign" to the topics discussed here. It would be great, however, to hear your opinions on this one, since this topic is the one I am working on at the moment. The topic is: how to compare two record matching functionalities that are based on rather different algorithms and business rule definition capacities. For example, let's imagine we pass the same sample of input data to be matched via MS SQL's Fuzzy Lookup/Grouping, and then we do the same with similar functionality in SAS or SPSS (Oracle, SAP, etc.). What would be the "right way" to compare the "quality of matching"? What if two engines produce similar percentages of non-matches, is there a way to state that one engine matched the records better than the other one? We are considering comparisons of record counts per grouping as one of possible criteria. Would be interesting to hear other opinions.

 

Thank you!

 

Sergey

Views: 361

Reply to This

Replies to This Discussion

I've automated a quarterly pull of the SDN list with SSIS Fuzzy Matching successfully (MSSQL 2008R2 primarily).I also used a few other SOUNDEX and a JaroWinkler Functions to scan customers/partners for address and name matches on the SDN in the same SSIS package for branching the flow.  It is fairly easy to randomly add a few test cases for verification to ensure all test records are branched off and captured. 

http://www.treasury.gov/resource-center/sanctions/SDN-List/Pages/de...

Here's the research paper on algorithms used in SSIS's Fuzzy Matching Similarity and Confidence scores  (IDF, Tokenization):

http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

Haven't tried too many other tools except some R scripts with SSIS.

Reply to Discussion

RSS

Follow us

© 2013   Data Science Central

Badges  |  Report an Issue  |  Terms of Service