05. Data linkage / 04. Enrichment by machine learning - sporedata/researchdesigneR GitHub Wiki

1. Use cases: in which situations should I use this method?

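Enrichment by machine learning is used when two datasets come from the same population but do not share a common identifier, so their records cannot be linked directly, and a variable available in only one dataset needs to be transported (imputed) to the other.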

2. Input: what kind of data does the method require?

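Two datasets that belong to the same population but lack a common unique identifier. Both must contain a group of common predictors, and one of them must also contain the variable to be transported.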

3. Algorithm: how does the method work?

Describing in words

In contrast to statistical matching, a machine learning model is trained on the dataset that contains the variable to be transported (imputed), using a group of common predictors present in both datasets. The fitted model is then applied to the target dataset to predict the value of that variable for each record.
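
Describing in code

The two steps above (fit on the dataset that has the variable, predict in the dataset that lacks it) can be sketched as follows. This is a minimal illustration, not the SporeData template: the k-nearest-neighbors model is a stand-in for whatever learner is actually chosen, and the function names (`knn_predict`, `enrich`), the variable names (`age`, `bmi`, `sbp`), and the simulated data are all hypothetical.

```python
import math
import random
from statistics import mean


def knn_predict(train_rows, predictors, outcome, new_row, k=5):
    """Predict `outcome` for `new_row` as the mean over its k nearest training rows."""
    # Euclidean distance on the common predictors. In practice the predictors
    # would usually be standardized first so no single one dominates.
    nearest = sorted(
        train_rows,
        key=lambda r: math.dist(
            [r[p] for p in predictors], [new_row[p] for p in predictors]
        ),
    )[:k]
    return mean(r[outcome] for r in nearest)


def enrich(source, target, predictors, outcome, k=5):
    """Return a copy of `target` with `outcome` imputed from a model fit on `source`."""
    return [
        {**row, outcome: knn_predict(source, predictors, outcome, row, k)}
        for row in target
    ]


# Hypothetical toy data: both datasets record age and bmi, only `source` has sbp.
rng = random.Random(0)


def draw(with_outcome):
    age, bmi = rng.uniform(20, 80), rng.uniform(18, 35)
    row = {"age": age, "bmi": bmi}
    if with_outcome:
        # Assumed data-generating process, used only to simulate an outcome.
        row["sbp"] = 90 + 0.5 * age + 1.0 * bmi + rng.gauss(0, 3)
    return row


source = [draw(True) for _ in range(500)]   # has the variable of interest
target = [draw(False) for _ in range(100)]  # lacks it, shares the predictors

enriched = enrich(source, target, predictors=["age", "bmi"], outcome="sbp")
```

No record in `target` is ever matched to a record in `source`. Only the fitted relationship between the common predictors and the outcome is carried over. In a real analysis the model would typically come from an established library (for example scikit-learn in Python) and be validated on held-out rows of the source dataset before being applied to the target.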

Describing in images

[Image hosted on imgur; not reproduced here]

Data science packages

sdglinkage: Synthetic Data Generation for Linkage Methods Development.

Learning materials

Books
Articles
- Improved Correction of Misclassification Bias With Bootstrap Imputation [1].
- synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control [2].

4. SporeData-specific

Templates

Enrichment by machine learning

References

[1] van Walraven C. Improved correction of misclassification bias with bootstrap imputation. Medical Care. 2018 Jul;56(7):e39-e45.

[2] Nowok B. synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control. Technical report, Administrative Data Research Centre, Univ. of Edinburgh; 2016.