Idefix supplementary - molgenis/systemsgenetics GitHub Wiki
Idéfix: identifying accidental sample mix-ups in biobanks using polygenic scores
The method described here aims to identify sample mix-ups in biobanks using polygenic scores (PGSs). Sample mix-ups frequently occur in genetical genomics datasets generated in a research setting (Westra et al., 2011). This novel tool takes advantage of the relationship between PGSs and actual phenotypes to predict which samples have been mixed up. Details of this new method are described in a yet-to-be-published article. The GitHub repository can be cloned from here.
Repository overview
The repository contains a few main scripts, as well as a number of scripts that can be used for diagnostic purposes.
- Main scripts (./src):
  - install-packages.R: Installs required packages.
  - sample-swap-prediction.R: Main sample mix-up prediction script.
- Helper scripts (./src):
  - generate-sample-couplings.R: Introduces fake sample mix-ups for ROC calculations.
  - sum-plink-profiles.R: Sums polygenic scores from PLINK over chromosomes (more in Input - polygenic scores).
- Diagnostic scripts (./src/diagnostic-scripts):
  - plot-roc-figures.R: Plots ROC curves for the sex concordance check, the PGS-based sample swap prediction, and the combination of both. The corresponding confusion matrices are plotted as well.
  - polygenic-score-power-calculation.R, compare-runs.R, plot-pgs-predictive-power.R and plot-intermediate-figures.R: Contain functionality for plotting additional intermediate results.
- Lifelines-specific files (./src/lifelines, ./data/lifelines): Files specific to Lifelines phenotype processing. The files in ./data/lifelines can be used as a reference for generating files specific to other studies.
- Scripts for simulations (./src/lifelines/simulations):
  - simulate-data.R: A script that is used to simulate data from the Lifelines dataset.
  - compare-simulations.R: A script that is used to compare results for simulated datasets.
Usage
Obtaining a reliable ROC-AUC
To estimate the performance of Idéfix, we need a dataset in which it is known which samples are mix-ups and which samples are correct.
This can be achieved by introducing fake mix-ups into a dataset.
A reliable performance estimate requires a considerable number of mix-ups.
However, since Idéfix expects the majority of the sample mappings to be correct, introducing a large proportion of sample mix-ups might cause the performance to be underestimated.
Therefore, we suggest creating separate training and testing datasets. This can be done using the generate-sample-couplings.R and sample-swap-prediction.R scripts. For ease of use, the --split-prediction option in sample-swap-prediction.R can also be used. Here, the steps for the original method are shown.
1. Sample half of the available samples from the entire study using the ./generate-sample-couplings.R script in conjunction with the --sample-count option. Do not introduce fake mix-ups in this step.
2. Perform the sample mix-up prediction (sample-swap-prediction.R) using the sample coupling file obtained in step 1. Write the fitted models to a directory of choice using the --base-fit-model-path option.
3. Get the remaining samples from the study (./generate-sample-couplings.R). Use the --sample-coupling-file-exclude option to exclude the first half of the study from step 1, and introduce 50% mix-ups.
4. Perform the sample mix-up prediction (sample-swap-prediction.R) using the sample coupling file obtained in step 3 and the fitted models from step 2. The output file overallOutputStatistics.tsv contains an accurate AUC for the sample mix-up predictions.
5. The ROC curve can be compared with the sex concordance check and the combined predictive power using the ./diagnostic-scripts/plot-roc-figures.R script.
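The sampling in step 1 and the exclusion in step 3 amount to splitting the study into two disjoint halves. The following Python sketch illustrates that idea only; it is not the code of generate-sample-couplings.R (which is written in R), and the function name, the sample identifiers, and the in-memory list of pairs are all assumptions made for the example.

```python
import random

def split_in_half(couplings, seed=0):
    """Split (genotype_id, phenotype_id) pairs into two disjoint halves.

    Hypothetical helper: it mimics sampling half of the study (step 1) and
    then excluding those genotype IDs to obtain the rest (step 3).
    """
    rng = random.Random(seed)
    n_train = len(couplings) // 2
    # Step 1: sample half of the study, without introducing mix-ups.
    train = rng.sample(couplings, n_train)
    # Step 3: keep every sample whose genotype ID is not in the first half.
    excluded = {genotype_id for genotype_id, _ in train}
    test = [pair for pair in couplings if pair[0] not in excluded]
    return train, test

# Made-up identifiers, for illustration only.
couplings = [(f"geno_{i}", f"pheno_{i}") for i in range(10)]
train, test = split_in_half(couplings, seed=42)
print(len(train), len(test))
```

Models would be fitted on the first half only, so that the fake mix-ups introduced in the second half cannot influence the fit.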
Introducing fake mix-ups
./generate-sample-couplings.R is a script that introduces a number of fake mix-ups or downsamples the number of samples. This can be helpful when you are trying to obtain a reliable ROC. The script can be used as follows:
usage: ./generate-sample-couplings.R [-h]
                                     [--mix-up-percentage MIX_UP_PERCENTAGE]
                                     [--sample-count SAMPLE_COUNT] --out OUT
                                     [--sample-coupling-file-exclude SAMPLE_COUPLING_FILE_EXCLUDE]
                                     (--sample-coupling-file-include SAMPLE_COUPLING_FILE_INCLUDE
                                     | --phenotypes-file PHENOTYPES_FILE)

optional arguments:
  -h, --help            show this help message and exit
  --mix-up-percentage MIX_UP_PERCENTAGE
                        introduce mix-ups in link fileand phenotype sample ids
                        in the second column
  --sample-count SAMPLE_COUNT
                        number of samples to include in the coupling file
  --out OUT             path to output prefix
  --sample-coupling-file-exclude SAMPLE_COUPLING_FILE_EXCLUDE
                        file containing genotype sample ids in the first
                        column, and phenotype sample ids in the second
                        column. the samples in the genotype column will be
                        exluded.
  --sample-coupling-file-include SAMPLE_COUPLING_FILE_INCLUDE
                        file containing genotype sample ids in the first
                        columnand phenotype sample ids in the second
                        columnthese samples will be used as a starting point.
  --phenotypes-file PHENOTYPES_FILE
                        path to a tab-delimited file holding all processed
                        phenotype data.
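To make concrete what a fake mix-up means for a coupling file, here is a minimal Python sketch. It is an illustration under assumptions, not the script's actual logic: it assumes the percentage is given on a 0-100 scale, that phenotype IDs are unique, and that a mix-up is created by rotating phenotype IDs among a randomly chosen subset of samples. The function name and identifiers are hypothetical.

```python
import random

def introduce_mix_ups(couplings, mix_up_percentage, seed=0):
    """Return (genotype_id, phenotype_id, is_mix_up) rows with fake mix-ups.

    Assumptions: mix_up_percentage is on a 0-100 scale and phenotype IDs
    are unique. How the real script selects and permutes samples may differ.
    """
    rng = random.Random(seed)
    n_mixed = round(len(couplings) * mix_up_percentage / 100)
    if n_mixed < 2:
        # A single sample has nobody to swap with, so nothing is changed.
        return [(g, p, False) for g, p in couplings]

    # Pick the samples to mix up; rng.sample returns them in random order.
    selected = rng.sample(range(len(couplings)), n_mixed)

    # Rotate phenotype IDs along the selected samples, so every selected
    # sample receives the phenotype ID of a different selected sample.
    new_pheno = {
        idx: couplings[selected[(k + 1) % n_mixed]][1]
        for k, idx in enumerate(selected)
    }
    return [
        (g, new_pheno.get(i, p), i in new_pheno)
        for i, (g, p) in enumerate(couplings)
    ]

# Made-up identifiers, for illustration only.
couplings = [(f"geno_{i}", f"pheno_{i}") for i in range(10)]
mixed = introduce_mix_ups(couplings, mix_up_percentage=50, seed=1)
for row in mixed:
    print(row)
```

Keeping the true mix-up status next to each row is what later allows predictions to be scored against a known truth.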
Receiver operating characteristics
More ROC curves can be plotted using the ./src/diagnostic-scripts/plot-roc-figures.R script.
This includes ROC curves for a sex check as well as a combined ROC curve. Confusion matrices are also visualized.
The script can be executed as described below:
usage: ./diagnostic-scripts/plot-roc-figures.R [-h] --dir DIR
                                               --phenotypes-file
                                               PHENOTYPES_FILE

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR             path from where to read sample swap prediction
                        results.
  --phenotypes-file PHENOTYPES_FILE
                        path to a tab-delimited file holding all processed
                        phenotype data.
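For readers who want to see what the area under a ROC curve measures, the Python sketch below computes it directly from known mix-up labels and per-sample scores. It is a generic illustration, not the code of plot-roc-figures.R. It assumes each sample has a score for which higher values indicate a more likely mix-up; the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    The AUC equals the probability that a randomly chosen mix-up receives
    a higher score than a randomly chosen correct sample (ties count half).
    """
    mix_ups = [s for l, s in zip(labels, scores) if l]
    correct = [s for l, s in zip(labels, scores) if not l]
    if not mix_ups or not correct:
        raise ValueError("both mix-ups and correct samples are required")
    wins = sum((m > c) + 0.5 * (m == c) for m in mix_ups for c in correct)
    return wins / (len(mix_ups) * len(correct))

# Hypothetical example: 1 marks an introduced mix-up, 0 a correct sample.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 8 of the 9 mix-up/correct pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking, and 1.0 to a score that separates all mix-ups from all correct samples.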
Simulating data
We simulated data using the ./src/lifelines/simulations/simulate-data.R script.
This script is tailored to the Lifelines data. It generates datasets with an explained variance
of 50%, 100%, 150%, and 200% of the original explained variance, and with 10, 25, 50, 75, and 100 traits.
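The settings listed above can be enumerated as a grid. The Python sketch below assumes that every explained-variance level is combined with every trait count. The text does not state this explicitly, and the variable names are hypothetical rather than taken from simulate-data.R.

```python
from itertools import product

# Explained variance relative to the original (1.0 = 100%).
variance_multipliers = [0.5, 1.0, 1.5, 2.0]
# Number of traits in each simulated dataset.
trait_counts = [10, 25, 50, 75, 100]

# Assumption: the full grid of combinations is simulated.
settings = [
    {"variance_multiplier": v, "n_traits": t}
    for v, t in product(variance_multipliers, trait_counts)
]
print(len(settings))
```

Under that assumption, the grid contains 4 x 5 = 20 simulated datasets.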