To determine the optimal degree of standardization for matching variables and blocking variables in preprocessing.
Although preprocessing is generally regarded as essential for the success
of record linkage, comparatively little research has been done on it.
A major task in preprocessing is the standardization of
identifier values. For example, the German umlauts ä, ö, ü are typically
replaced by ae, oe, and ue. Another common operation is to remove
titles from surname fields. However, standardizing identifiers has an often
overlooked drawback: the removed or standardized part of an identifier
value may be exactly what differentiates between false positive and
true positive matches. That is, there is a balance between too little
and too much standardization in matching variables.
Additionally, this balance certainly differs when dealing with blocking