Idefix supplementary - molgenis/systemsgenetics GitHub Wiki

Idéfix: identifying accidental sample mix-ups in biobanks using polygenic scores

The method described here aims to identify sample mix-ups in biobanks using polygenic scores (PGSs). Sample mix-ups frequently occur in genetic and genomic datasets generated in a research setting (Westra et al., 2011). The tool takes advantage of the relationship between PGSs and actual phenotypes to predict which samples are erroneous. Details of the method are described in a yet-to-be-published article. The GitHub repository can be cloned from here.

Detailed repository overview
Usage

Obtaining a reliable ROC-AUC
Simulating data

Repository overview

The repository contains a few main scripts, along with a number of scripts that can be used for diagnostic purposes.

Main scripts (./src):
- install-packages.R: Installs required packages.
- sample-swap-prediction.R: Main sample mix-up prediction script.
Helper scripts (./src):
- generate-sample-couplings.R: Introduces fake sample mix-ups for ROC calculations.
- sum-plink-profiles.R: Sum polygenic scores from PLINK over chromosomes. (More in input - polygenic scores).
Diagnostic scripts (./src/diagnostic-scripts):
- plot-roc-figures.R: Plots ROC curves for the sex concordance check, the PGS-based sample swap prediction, and the two combined. The corresponding confusion matrices are plotted as well.
- polygenic-score-power-calculation.R, compare-runs.R, plot-pgs-predictive-power.R and plot-intermediate-figures.R contain functionality for plotting additional intermediate results.
Lifelines-specific files (./src/lifelines, ./data/lifelines): Files specific to Lifelines phenotype processing. The files in ./data/lifelines can be used as a reference for generating files specific to other studies.
Scripts for simulations (./src/lifelines/simulations):
- simulate-data.R: A script that is used to simulate data from the Lifelines dataset.
- compare-simulations.R: A script that is used to compare results for simulated datasets.

Usage

Obtaining a reliable ROC-AUC

To estimate the performance of Idéfix, we require a dataset in which it is known which samples are mix-ups and which are correct. This can be achieved by introducing fake mix-ups into a dataset. A reliable performance estimate requires a considerable number of mix-ups. However, since Idéfix expects the majority of the sample mappings to be correct, introducing a large proportion of sample mix-ups might lead to an underestimate of the performance. Therefore, we suggest creating separate training and testing datasets. This can be done using the generate-sample-couplings.R and sample-swap-prediction.R scripts. For ease of use, the --split-prediction option in sample-swap-prediction.R can also be used. Here, the steps for the original method are shown.

1. Sample half of the available samples from the entire study using the ./generate-sample-couplings.R script in conjunction with the --sample-count option. Do not introduce fake mix-ups in this step.
2. Perform the sample mix-up prediction (sample-swap-prediction.R) using the sample coupling file obtained in step 1. Write the fitted models to a directory of choice using the --base-fit-model-path option.
3. Get the remaining samples from the study (./generate-sample-couplings.R). Use the option --sample-coupling-file-exclude to exclude the first half of the study from step 1, and introduce 50% mix-ups.
4. Perform the sample mix-up prediction (sample-swap-prediction.R) using the sample coupling file obtained in step 3 and the fitted models from step 2. The output file overallOutputStatistics.tsv contains an accurate AUC for the sample mix-up predictions.
The ROC curve can be compared with the sex concordance check and the combined predictive power using the ./diagnostic-scripts/plot-roc-figures.R script.
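The partitioning behind steps 1 and 3 can be illustrated with a small Python sketch. It is not the repository's R code: the function name `split_halves`, the seed handling, and the sample identifiers are hypothetical, and the actual scripts may draw samples differently. The sketch only shows that the second half is defined by excluding the first, so the two sets never overlap.

```python
import random


def split_halves(sample_ids, seed=0):
    """Split sample IDs into two disjoint halves (hypothetical helper).

    The first half mirrors step 1 (a random draw of half the study);
    the second half mirrors step 3 (everything not drawn in step 1).
    """
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    first_half = shuffled[: len(shuffled) // 2]
    excluded = set(first_half)
    # The remainder is defined by exclusion, so the halves cannot overlap.
    second_half = [s for s in sample_ids if s not in excluded]
    return first_half, second_half


# Made-up sample identifiers for illustration only.
samples = [f"S{i:03d}" for i in range(10)]
train_samples, test_samples = split_halves(samples, seed=42)
```

Models would be fitted on the first half, left free of fake mix-ups, and the second half would receive the fake mix-ups used to compute the AUC.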

Introducing fake mix-ups

./generate-sample-couplings.R is a script that introduces a number of fake mix-ups or downsamples the number of samples. This can be helpful when you are trying to obtain a reliable ROC. The script can be used as follows:

usage: ./generate-sample-couplings.R [-h]
                                     [--mix-up-percentage MIX_UP_PERCENTAGE]
                                     [--sample-count SAMPLE_COUNT] --out OUT
                                     [--sample-coupling-file-exclude SAMPLE_COUPLING_FILE_EXCLUDE]
                                     (--sample-coupling-file-include SAMPLE_COUPLING_FILE_INCLUDE 
                                       | --phenotypes-file PHENOTYPES_FILE)

optional arguments:
  -h, --help            show this help message and exit
  --mix-up-percentage MIX_UP_PERCENTAGE
                        introduce mix-ups in link file and phenotype sample ids
                        in the second column
  --sample-count SAMPLE_COUNT
                        number of samples to include in the coupling file
  --out OUT             path to output prefix
  --sample-coupling-file-exclude SAMPLE_COUPLING_FILE_EXCLUDE
                        file containing genotype sample ids in the first
                        column, and phenotype sample ids in the second
                        column. the samples in the genotype column will be
                        excluded.
  --sample-coupling-file-include SAMPLE_COUPLING_FILE_INCLUDE
                        file containing genotype sample ids in the first
                        column and phenotype sample ids in the second
                        column. these samples will be used as a starting point.
  --phenotypes-file PHENOTYPES_FILE
                        path to a tab-delimited file holding all processed
                        phenotype data.
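To make the idea of a fake mix-up concrete, the following Python sketch reassigns phenotype sample IDs for a given percentage of rows in a coupling table. It is an illustration under stated assumptions, not the logic of generate-sample-couplings.R: the function `introduce_mix_ups`, the cyclic rotation used to guarantee that every selected row receives a different phenotype ID, and the returned `is_mix_up` flag are all hypothetical.

```python
import random


def introduce_mix_ups(couplings, mix_up_percentage, seed=0):
    """Return (genotype_id, phenotype_id, is_mix_up) rows (hypothetical helper).

    couplings: list of (genotype_id, phenotype_id) pairs with unique IDs.
    mix_up_percentage: percentage of rows whose phenotype ID is reassigned.
    """
    rng = random.Random(seed)
    n_swap = round(len(couplings) * mix_up_percentage / 100)
    if n_swap == 1:
        raise ValueError("a mix-up needs at least two samples to exchange IDs")
    chosen = rng.sample(range(len(couplings)), n_swap)
    phenotypes = [couplings[i][1] for i in chosen]
    # Rotating by one position gives every chosen row another row's ID.
    rotated = phenotypes[1:] + phenotypes[:1]
    result = [(g, p, False) for g, p in couplings]
    for index, new_phenotype in zip(chosen, rotated):
        result[index] = (couplings[index][0], new_phenotype, True)
    return result


# Made-up identifiers for illustration only.
couplings = [(f"G{i}", f"P{i}") for i in range(10)]
mixed = introduce_mix_ups(couplings, mix_up_percentage=50, seed=1)
```

Because the flag records which rows were altered, the resulting table provides the ground truth needed to score predictions afterwards.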

Receiver operating characteristics

Additional ROC curves can be plotted using the ./src/diagnostic-scripts/plot-roc-figures.R script. This includes ROC curves for a sex check as well as a combined ROC curve. Confusion matrices are also visualized. The script can be executed as described below:

usage: ./diagnostic-scripts/plot-roc-figures.R [-h] --dir DIR
                                               --phenotypes-file
                                               PHENOTYPES_FILE

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR             path from where to read sample swap prediction
                        results.
  --phenotypes-file PHENOTYPES_FILE
                        path to a tab-delimited file holding all processed
                        phenotype data.
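For readers unfamiliar with the ROC-AUC reported by these scripts, the Python sketch below computes it directly from its definition: the probability that a randomly chosen mix-up receives a higher score than a randomly chosen correct sample, with ties counted as one half. The function `roc_auc` and the example scores are hypothetical, and the sketch assumes that higher scores indicate a more likely mix-up. The repository's scripts may compute the AUC differently.

```python
def roc_auc(scores, is_mix_up):
    """Area under the ROC curve via pairwise comparison (illustrative)."""
    positives = [s for s, y in zip(scores, is_mix_up) if y]
    negatives = [s for s, y in zip(scores, is_mix_up) if not y]
    if not positives or not negatives:
        raise ValueError("both mix-ups and correct samples are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # mix-up ranked above a correct sample
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up scores: two correct samples followed by two mix-ups.
example_auc = roc_auc([0.1, 0.4, 0.35, 0.8], [False, False, True, True])
```

In this example three of the four mix-up/correct pairs are ranked correctly, giving an AUC of 0.75. The pairwise loop is quadratic in the number of samples, which is fine for illustration but slow for a full biobank.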

Simulating data

We simulated data using the ./src/lifelines/simulations/simulate-data.R script, which is tailored to the Lifelines data. It generates datasets with an explained variance of 50%, 100%, 150%, and 200% of the original explained variance, and with 10, 25, 50, 75, and 100 traits.
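The simulation settings can be enumerated as a grid, as in the Python sketch below. It assumes that every explained-variance level is combined with every trait count, which the description suggests but does not state outright. The variable names are hypothetical, and this is not how simulate-data.R itself is organized.

```python
from itertools import product

# Explained variance relative to the original (1.0 = 100%).
VARIANCE_MULTIPLIERS = [0.5, 1.0, 1.5, 2.0]
# Number of traits per simulated dataset.
TRAIT_COUNTS = [10, 25, 50, 75, 100]

# Assumption: the two settings are fully crossed.
scenarios = [
    {"variance_multiplier": v, "n_traits": t}
    for v, t in product(VARIANCE_MULTIPLIERS, TRAIT_COUNTS)
]
```

Under that assumption the grid contains 4 × 5 = 20 simulated datasets.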