Process SNP data - genetics-of-dna-methylation-consortium/godmc_phase2 GitHub Wiki

MODULE STATUS

Developer: Developer's team

Scripts status: Ready

Prerequisite scripts: script 00 and 01

Data upload method: automated upload to sftp bristol

Imputed Data

Once the imputed genetic data is in the correct format and in the correct location, the following tasks need to be performed:

  • Extract relevant IDs
  • HWE and MAF QC
  • Genotype missingness QC
  • Remove sexcheck outliers if chr X is available
  • Re-code SNP IDs (chr:position_A1_A2 (ascii sorted order))
  • Remove SNPs with alternate coding
  • Remove duplicate SNPs
  • Remove poorly imputed SNPs
  • Construct kinship matrices
  • Remove PC outliers
  • Remove any cryptic relatedness for 'unrelated' data
  • Create a pedigree matrix for 'related' samples
  • Calculate principal components
  • Remove SNPs with mismatched alleles and discordant allele frequency

To do this, run the following script:

    ./02a-snp_data.sh

For a sample size of 100 this script takes less than 5 minutes to run on our computers or 10 minutes on 1800 samples. The relevant files should have been generated and saved in processed_data/genetic_data.

Note that for 'unrelated' samples we are generating PCs by removing long LD tracts, extracting HapMap3 SNPs, LD pruning, and then calculating PCs on the data after cryptic relateds have been removed. For 'related' samples we are using a general method implemented in the Rpackage 'GENESIS' for estimating PCs that accounts for any relatedness. This involves estimating a kinship matrix that is robust to population structure and relatedness (Manichaikil et al 2010), estimating principal components on an unrelated subset, and then projecting principal components onto the related subset based on estimated kinships (Conomos et al 2015).

After you have run this script please check:

cd results/02

IMPORTANT PLEASE CHECK YOUR RESULTS:

  • The pcaplot.pdf is a pca plot on the pruned genetic data. The blue dashed lines define the SD threshold. You can check the number of genetic outliers that are removed by:
grep outlier ./logs_a/log.txt

If you think the outlier threshold should be adjusted, you can change the sd threshold pca_sd in the config file. If you change the threshold you need to rerun ./01-check_data.sh and 02a-snp_data.sh.

  • The easyQC_hrc.rep shows the number of SNPs that are flipped because they were on the wrong strand as compared to HRC ("AlleleChange"). Please find an example here: (https://github.com/genetics-of-dna-methylation-consortium/godmc_phase2/wiki/images//easyQC_hrc.rep)

  • The easyQC_hrc.rep file also shows you the number of SNPs that are going to be removed. eg. allele mismatches ("AlleleMismatch") and also SNPs that have a discrepant allele frequency (>0.2) as compared to HRC ("AFCHECK.numOutlier").

  • The easyQC.multi.AFCHECK.png plot shows a comparison between allele frequencies of your study as compared to HRC. Only outlying SNPs that differ > 0.2 in terms of allele frequency from the reference frequency, will be plotted. Please read the link for explanation of this plot here. Please find two examples below. All the SNVs in red will be removed from the analysis. You will see an empty plot if there are no SNPs to remove.

Additionalfiles/easyQC.multi.AFCHECK.png.

Convert to OSCA format

Next, we need to generate the correct format for the vmQTL analysis software (OSCA). To do this run the following script form your home directory:

    ./02b-convert_snp_format.sh

This script will split the genetic dataset into a number of smaller chunks. The number of chunks is determined by the genetic_chunks variable in the config file. For example:

genetic_chunks="500"

is the default and will split all the SNP data into 500 subsets (e.g. for 8 million SNPs each chunk will have 16000 SNPs). The required files will be saved in processed_data/genetic_data/tabfile/.

This script takes a few minutes to run for sample size of 100 and 1 hour to run ~1800 samples.

Upload the results

To check that everything ran successfully, please run:

./check_upload.sh 02 check

This should tell you that Section 02 has been successfully completed!. Now please upload the results like this:

./check_upload.sh 02 upload

It will make sure everything looks correct and connect to the sftp server. It will request your password (this should have been provided to you along with your username). Once you have entered your password it will upload the results files from section 02.