05. Data linkage: 01. Probabilistic - sporedata/researchdesigneR GitHub Wiki
- Probabilistic linking is often used when two datasets share a common identifier that is imperfect, for example because of typos, missing values, or inconsistent formatting. Under deterministic (exact-match) linkage, the same patient is linked across datasets only if the exact same identifier is present in both, and in real-world datasets such perfect matching is uncommon. Probabilistic linking addresses this problem by assigning a probability that any two records in different datasets belong to the same individual, even if their identifiers do not match exactly - see Mortality Among Unsheltered Homeless Adults in Boston, Massachusetts, 2000-2009
- Probabilistic linkage can also be used to track patients over time when building a longitudinal cohort - see Accuracy of a probabilistic record-linkage methodology used to track blood donors in the Mortality Information System database
- Two datasets that share a common identifier, where linkage on that identifier is imperfect or needs to be quality checked
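
The idea of assigning a probability that two records refer to the same individual can be sketched with Fellegi-Sunter style match weights, one common formulation of probabilistic linkage. The example below is a minimal illustration and not the method used in the cited studies: the field names, the m- and u-probabilities, the prior, and the sample records are all hypothetical values chosen for demonstration.

```python
import math

# Hypothetical parameters for each comparison field (illustrative only):
#   m = P(field agrees | the two records are the same person)
#   u = P(field agrees | the two records are different people)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.02),
    "sex": (0.99, 0.50),
    "postcode": (0.90, 0.005),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum the log2 agreement/disagreement weights over all fields."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            # Assumption: a missing value contributes no evidence either way.
            continue
        if a == b:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def match_probability(weight, prior=0.001):
    """Turn a match weight into a posterior match probability.

    `prior` is the assumed share of candidate pairs that are true matches.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical records: one hospital record and two registry candidates.
hospital = {"surname": "silva", "birth_year": 1970, "sex": "F", "postcode": "27701"}
candidate_1 = {"surname": "silva", "birth_year": 1970, "sex": "F", "postcode": "27705"}
candidate_2 = {"surname": "souza", "birth_year": 1982, "sex": "F", "postcode": "27701"}

for name, cand in [("candidate_1", candidate_1), ("candidate_2", candidate_2)]:
    w = match_weight(hospital, cand)
    print(name, round(w, 2), round(match_probability(w), 4))
```

In practice the m- and u-probabilities are usually estimated from the data rather than fixed by hand, and pairs are classified as links, non-links, or candidates for clerical review by comparing the weight against chosen thresholds.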
- Books
- Articles
[1] Sayers A, Ben-Shlomo Y, Blom AW, Steele F. Probabilistic record linkage. International Journal of Epidemiology. 2016 Jun 1;45(3):954-64.
[2] Hagger-Johnson G, Harron K, Goldstein H, Aldridge R, Gilbert R. Probabilistic linking to enhance deterministic algorithms and reduce linkage errors in hospital administrative data. Journal of Innovation in Health Informatics. 2017 Jun 30;24(2):891.
[3] Nowok B. synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control. Technical report, Administrative Data Research Centre, Univ. of Edinburgh; 2016.