05. Data linkage / 01. Probabilistic - sporedata/researchdesigneR GitHub Wiki

1. Use cases: in which situations should I use this method?

Probabilistic linkage is often used when two datasets share an identifier that is imperfect, for example because it is missing, mistyped, or recorded inconsistently. Deterministic (exact) matching links the same patient across datasets only when the exact same identifier is present in both, and in real-world datasets such perfect agreement is uncommon. Probabilistic linkage addresses this problem by assigning a probability that any two records in different datasets belong to the same individual, even if their identifiers do not match exactly. See Mortality Among Unsheltered Homeless Adults in Boston, Massachusetts, 2000-2009.
Probabilistic linkage can also be used to track patients over time to create a longitudinal cohort. See Accuracy of a probabilistic record-linkage methodology used to track blood donors in the Mortality Information System database.
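
As a rough illustration of how a probability can be attached to a candidate pair of records, the sketch below follows the Fellegi-Sunter style of match weights described in the probabilistic record linkage literature (for example reference [1]). It is a minimal sketch, not the method used in the studies cited above. The field names, the m- and u-probabilities, and the prior are hypothetical values chosen for illustration; in practice they would be estimated from the data.

```python
import math

# Hypothetical per-field parameters (illustrative values, not estimated from data):
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "sex": (0.99, 0.50),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum of log2 likelihood ratios over fields (Fellegi-Sunter style weight)."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def match_probability(weight, prior=0.001):
    """Convert a match weight to a posterior match probability.

    `prior` is an assumed proportion of candidate pairs that are true matches.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Two hypothetical records of the same patient with a misspelled surname.
record_a = {"surname": "silva", "birth_year": 1970, "sex": "F"}
record_b = {"surname": "sylva", "birth_year": 1970, "sex": "F"}

weight_exact = match_weight(record_a, record_a)
weight_typo = match_weight(record_a, record_b)
print(weight_exact, match_probability(weight_exact))
print(weight_typo, match_probability(weight_typo))
```

An exact comparison would reject the second pair outright, whereas the weight-based score still gives it a non-zero probability. Pairs are then typically accepted, rejected, or sent for manual review depending on thresholds chosen by the analyst.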

Two datasets with a common unique identifier, but where the linkage is imperfect or needs to be quality checked

Books
Articles
- Probabilistic Record Linkage [1].
- Probabilistic Linkage to Enhance Deterministic Algorithms and Reduce Data Linkage Errors in Hospital Administrative Data [2].
- synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control [3].

[1] Sayers A, Ben-Shlomo Y, Blom AW, Steele F. Probabilistic record linkage. International Journal of Epidemiology. 2016 Jun 1;45(3):954-64.

[2] Hagger-Johnson G, Harron K, Goldstein H, Aldridge R, Gilbert R. Probabilistic linkage to enhance deterministic algorithms and reduce data linkage errors in hospital administrative data. Journal of Innovation in Health Informatics. 2017 Jun 30;24(2):891.