Research at the GRLC focuses on the security and computational speed of Privacy-preserving Record Linkage (PPRL) techniques. The following paragraphs give an overview of the current research topics central to the GRLC's research team.
Increasing the Security of Privacy-preserving Record Linkage Techniques
Although not a single real-world attack on research databases has been reported so far (Emam et al. 2011), academic research concentrates on attacks on research data within a linkage unit. Therefore, the resilience of PPRL encodings against all known cryptographic attacks is widely considered essential for the successful implementation of PPRL protocols in practice. However, most record linkage implementations used in real-world settings (for example, within cancer registries) do not withstand these kinds of attacks, because a simple alignment of the most frequent names with the most frequent encoded names would identify at least some records (Domingo-Ferrer/Muralidhar 2016). Hence, the GRLC is developing technical measures to prevent cryptographic attacks (Niedermeyer et al. 2014). The currently recommended parameter and encryption settings (Christen et al. 2017) yield encryptions that cannot be successfully attacked by any known method. Current research considers the effort required to break the recently developed encryptions (Schnell/Borgs 2016) more than sufficient to fulfill the EU criterion of de-facto anonymity.
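The frequency alignment mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical deterministic encoding (a keyed hash standing in for any scheme that maps equal names to equal codes) and made-up name lists; the function names `encode` and `frequency_alignment` are illustrative and not part of any GRLC software.

```python
from collections import Counter
import hashlib


def encode(name: str, key: str = "secret") -> str:
    """Hypothetical deterministic encoding: equal names yield equal codes."""
    return hashlib.sha256((key + name).encode("utf-8")).hexdigest()[:12]


def frequency_alignment(encoded_records, public_names, top_k=3):
    """Pair the k most frequent codes with the k most frequent public names."""
    top_codes = [code for code, _ in Counter(encoded_records).most_common(top_k)]
    top_names = [name for name, _ in Counter(public_names).most_common(top_k)]
    return dict(zip(top_codes, top_names))


# Made-up encoded file: the attacker sees only the codes.
names = ["Mueller"] * 5 + ["Schmidt"] * 3 + ["Schneider"] * 2 + ["Fischer"]
encoded = [encode(n) for n in names]

# Made-up public name list with a similar frequency ranking.
public = ["Mueller"] * 50 + ["Schmidt"] * 30 + ["Schneider"] * 20 + ["Fischer"] * 10

guesses = frequency_alignment(encoded, public)

# Count how many distinct names the attacker assigned correctly.
recovered = sum(1 for n in set(names) if guesses.get(encode(n)) == n)
print(f"{recovered} of {len(set(names))} distinct names re-identified")
```

The attacker never needs the key: matching frequency ranks is enough whenever the encoding is deterministic and the name distribution is skewed. This is why hardened encodings aim to flatten or randomize the frequency distribution of the encoded values.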
Enable Large-scale Linkage by Improving the Speed and Efficiency of Blocking Techniques
For large-scale applications, comparing all possible pairs of records from two or more databases during a linkage is not feasible. Therefore, reducing the search space to subsets of records is recommended. Techniques that do so are called blocking methods. Developing new blocking methods is an active field of research within (privacy-preserving) record linkage (Christen 2012). The current recommendation (Schnell 2016) for linking Bloom filter-based encryptions is the use of Multibit trees (Kristensen et al. 2010) with additional encrypted identifiers such as year of birth as blocking variable (Schnell 2014). Combining external blocks such as year of birth with Multibit trees allows for privacy-preserving linkage of two census-scale data sets within a few hours. For most applications, this solution is sufficient with regard to speed, accuracy and privacy (Brown et al. 2017). Therefore, this combination is provided with the record linkage software of the GRLC.
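The effect of an external blocking variable can be sketched as follows. This is a simplified illustration, not the GRLC implementation: it compares bigram Bloom filters with the Dice coefficient inside year-of-birth blocks and does not build a Multibit tree. The filter length `m`, the number of hash functions `k`, the threshold, and the plain (unencrypted) year of birth used as block key are all assumptions chosen for readability.

```python
import hashlib
from collections import defaultdict


def bigrams(name):
    """Padded character bigrams of a name."""
    s = f"_{name.lower()}_"
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bloom(name, m=256, k=4):
    """Bloom filter as the set of bit positions set by k hashes per bigram."""
    bits = set()
    for gram in bigrams(name):
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:4], "big") % m)
    return bits


def dice(a, b):
    """Dice coefficient of two sets of bit positions."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def link(file_a, file_b, threshold=0.8):
    """Compare only records that share a block key (here: year of birth)."""
    blocks = defaultdict(list)
    for rec_id, yob, bf in file_b:
        blocks[yob].append((rec_id, bf))
    matches, compared = [], 0
    for id_a, yob, bf_a in file_a:
        for id_b, bf_b in blocks.get(yob, []):
            compared += 1
            if dice(bf_a, bf_b) >= threshold:
                matches.append((id_a, id_b))
    return matches, compared


# Made-up records: (id, year of birth, Bloom filter of the surname).
file_a = [("a1", 1970, bloom("Meier")), ("a2", 1985, bloom("Schmidt"))]
file_b = [("b1", 1970, bloom("Meier")), ("b2", 1970, bloom("Schulz")),
          ("b3", 1985, bloom("Schmitt")), ("b4", 1990, bloom("Meier"))]

matches, compared = link(file_a, file_b)
print(f"{compared} comparisons instead of {len(file_a) * len(file_b)}")
```

In this toy example blocking reduces the work from 8 to 3 comparisons. The price is that records with differing block keys, such as a1 and b4, are never compared, so errors in the blocking variable lead to missed links.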
For many research topics, geographical distances between units are of importance. For example, the distance to the nearest hospital after a stroke or a heart attack is important for a positive outcome. Since including coordinates in a scientific-use file would cause privacy problems, methods that prevent the re-identification of geo-locations while preserving the distances between units are required (geo-masking methods). Researchers at the GRLC have developed two new geo-masking methods (Kroll/Schnell 2016).
Different Linkage Quality and Bias by Demographic Subgroups
The consequences of non-linked real-world entities have received little attention outside of the medical literature. Previous research has shown possible bias due to differential linkage (Ford et al. 2006). Since the linkage results seem to be dependent on demographic variables such as age, income and health, estimators from linked data sets could be biased (Harron et al. 2017). Research at the GRLC tries to identify possible factors determining linkage quality by demographic subgroups. With this information, different fine-tuned linkage strategies for each subgroup could be implemented. This way, linkage error, and possibly bias, can be reduced in settings where the demographic variables are available in the data sets.
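A first diagnostic for differential linkage is the share of linked records per demographic subgroup. The following minimal sketch assumes each record carries a subgroup label and a linked/not linked flag; the age groups and values are made up for illustration, and `linkage_rate_by_group` is a hypothetical helper, not part of any GRLC software.

```python
from collections import defaultdict


def linkage_rate_by_group(records):
    """Return the share of linked records for each subgroup.

    records: iterable of (group, is_linked) pairs.
    """
    totals = defaultdict(int)
    linked = defaultdict(int)
    for group, is_linked in records:
        totals[group] += 1
        linked[group] += int(is_linked)
    return {group: linked[group] / totals[group] for group in totals}


# Made-up example data: (age group, record was linked).
records = [
    ("18-39", True), ("18-39", True), ("18-39", False), ("18-39", True),
    ("65+", True), ("65+", False), ("65+", False), ("65+", True),
]

rates = linkage_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} linked")
```

Marked differences between subgroup rates would suggest that estimates based on the linked data could be biased, and that subgroup-specific linkage strategies might be worth testing.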
Brown, A.P., C. Borgs, S. M. Randall, R. Schnell (2017), Evaluating Privacy-Preserving Record
Linkage Using Cryptographic Long-Term Keys and Multibit Trees on Large Medical
Datasets. BMC Medical Informatics and Decision Making 17(1): 83.
Christen, P. (2012), Data Matching: Concepts and Techniques for Record Linkage, Entity
Resolution, and Duplicate Detection. Springer, Berlin.
Christen, P., R. Schnell, D. Vatsalan, T. Ranbaduge (2017), Efficient Cryptanalysis of Bloom
Filters for Privacy-Preserving Record Linkage. 628–640 in: J. Kim (ed.), The Pacific-Asia
Conference on Knowledge Discovery and Data Mining. Springer, Cham.
Domingo-Ferrer, J., K. Muralidhar (2016), New Directions in Anonymization: Permutation
Paradigm, Verifiability by Subjects and Intruders, Transparency to Users. Information
Sciences 337–338: 11–24.
Emam, K.E., E. Jonker, L. Arbuckle, B. Malin (2011), A Systematic Review of Re-Identification
Attacks on Health Data. PLoS One 6(12): e28071.
Ford, J. B., C. L. Roberts, L. K. Taylor (2006), Characteristics of Unmatched
Maternal and Baby Records in Linked Birth Records and Hospital Discharge
Data. Paediatric and Perinatal Epidemiology 20(4): 329–337.
Harron, K. L., J. C. Doidge, H. E. Knight, R. E. Gilbert, H. Goldstein, D. A. Cromwell,
J. H. van der Meulen (2017), A Guide to Evaluating Linkage Quality for the Analysis of Linked Data.
International Journal of Epidemiology 46(5): 1699–1710.
Kristensen, T. G., J. Nielsen, C.N.S. Pedersen (2010), A Tree-Based Method for the Rapid
Screening of Chemical Fingerprints. Algorithms for Molecular Biology 5.
Kroll, M., R. Schnell (2016), Anonymisation of Geographical Distance Matrices via Lipschitz
Embedding. International Journal of Health Geographics 15(1): 1–14.
Niedermeyer, F., S. Steinmetzer, M. Kroll, R. Schnell (2014), Cryptanalysis of Basic Bloom
Filters Used for Privacy Preserving Record Linkage. Journal of Privacy and
Confidentiality 6(2): 59–79.
Schnell, R. (2016), Privacy Preserving Record Linkage. 201–225 in: K. Harron, H. Goldstein,
C. Dibben (eds.), Methodological Developments in Data Linkage. Wiley, Chichester.
Schnell, R. (2014), An Efficient Privacy-Preserving Record Linkage Technique for Administrative
Data and Censuses. Journal of the International Association for Official Statistics 30:
Schnell, R., C. Borgs (2016), Randomized Response and Balanced Bloom Filters for Privacy
Preserving Record Linkage. In: 2016 IEEE 16th International Conference on Data Mining
Workshops (ICDM 2016), Barcelona, December 12th 2016 – Dec 15th 2016. IEEE Publishing.