Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several source systems. The entities under consideration are most commonly people, places, publications or citations, consumer products, or businesses. The task is also known as record linkage, data linkage, entity resolution, object identification, or field matching.
A major challenge in data matching is the lack of common entity identifiers across the source systems to be matched. As a result, matching has to rely on attributes that contain partially identifying information, such as names, addresses, or dates of birth. However, such identifying information is often of low quality: it frequently suffers from typographical variations and other human errors, it can change over time, and it may be only partially available in the sources to be matched.
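To make the idea of matching on partially identifying attributes concrete, here is a minimal sketch in Python. It is not taken from the article: the field names, the sample records, and the 0.85 threshold are illustrative assumptions, and the standard library's `difflib.SequenceMatcher` stands in for the more specialized string comparison measures a real matching system would likely use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_score(rec_a: dict, rec_b: dict, fields: tuple) -> float:
    """Average field similarity, skipping fields that are missing in either record."""
    scores = [
        similarity(rec_a[f], rec_b[f])
        for f in fields
        if rec_a.get(f) and rec_b.get(f)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical attribute names and threshold, chosen only for illustration.
FIELDS = ("name", "address", "date_of_birth")
MATCH_THRESHOLD = 0.85

# Hypothetical records: a and b describe the same person with typographical
# variations; c is a different person.
a = {"name": "Jonathan Smith", "address": "12 Main Street", "date_of_birth": "1980-04-12"}
b = {"name": "Jonathon Smith", "address": "12 Main St", "date_of_birth": "1980-04-12"}
c = {"name": "Maria Lopez", "address": "7 Oak Avenue", "date_of_birth": "1975-11-30"}

for other in (b, c):
    score = record_score(a, other, FIELDS)
    label = "match" if score >= MATCH_THRESHOLD else "non-match"
    print(f"{other['name']}: {score:.2f} ({label})")
```

Averaging only over the fields present in both records is one simple way to tolerate partially available data. A single fixed threshold is a simplification, since in practice some attributes identify an entity far more reliably than others.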
In the past decade, significant advances have been achieved in many aspects of the data matching process, particularly in improving the accuracy of data matching and in scaling it to very large systems that contain many millions of records. This work has been conducted by researchers in various fields, including applied statistics, health sciences, data mining, machine learning, artificial intelligence, information systems, information retrieval, knowledge engineering, databases and data warehousing, and digital libraries.