Step 1. Genotype data quality control - ariadnacilleros/Cis-mQTL-mapping-protocol-for-methylome GitHub Wiki

Step 1.1. Quality control of genotype data

Before starting with the quality control, we must have your genotype data into a PLINK binary format file (.bim, .fam, .ped), if you don’t have it, have a look at this link for more information.

Now that we have the genotype in the correct binary format, we will start the pipeline by adding the sex of the samples to the fam file. Once done this, we are going to calculate the frequency of the SNPs and execute Will Rayner’s pre-imputation check script. This Perl script checks strand, alleles, position, Ref/Alt assignments, and frequency differences, and as output, it produces a set of PLINK commands to update or remove SNPs based on the previous check described. Therefore, to execute the Perl script you need three inputs: the binary file (-b), the frequency file (-f), and the reference panel (-r), which in this case is the Haplotype Reference Consortium panel (v1.1 HRC.r1-1.GRCh37.wgs.mac5.sites.tab). At this point, we will have removed and corrected the SNPs that didn’t fit with the reference panel. Finally, we will change the rsIDs of the SNPs to ‘chr:position’ format.

Step 1.1.1. Filter SNPs

For this step, we will create a new folder called qc and calculate the frequency and the missing call rate which will generate two files, one with the rates of the samples (.smiss) and another one with the rates of the SNPs (.vmiss) - we will use the latter in this step. Once this part has been done, we are going to filter the SNPs based on missing call rate, Hardy-Weinberg Equilibrium (HWE), and Minor Allele Frequency (MAF). We will remove the SNPs with a missing call rate > 5% (--geno), with HWE p-value smaller than 1x10-6 (--hwe), and a MAF < 1% (--maf).

Step 1.1.2. Filter samples

After filtering the SNPs, we will start filtering the samples by missing call rate, heterozygosity, sex discordancies, and relatedness. For this, we will need to calculate the heterozygosity, compute the sex of the samples, and the relatedness by Identity By Descendent (IBD). We don’t need to calculate the missing call rate because we already did that in the previous step (.smiss file).

Following the steps from the Wiki of the repository, first, we will calculate the heterozygosity of all the samples and use R to filter out those with a value > ± 4 x standard deviation (SD). Then we will remove the samples with a missing call rate > 3% and compute the genotype-based sex of the rest. If any of your samples disagrees with the sex computed using the genotype, we recommend you to talk with the researchers who obtained the samples and prepared the databases before removing those samples, to see if there could be a mistake. Depending on each case, you might want to reassign the sex of the sample in the .fam file with --make-just-fam flag or discard the sample using --remove flag from PLINK.

After this, we will calculate the relatedness between samples using the IBD approach. Once calculated, with the R code we will select those sample pairs with a PI-HAT higher than 0.185, and we will be discarding relatives ranging from twins to second-degree relatives. In those related sample pairs, we recommend you to look at the genotype call rate of each of the two samples (.smiss file) to select the one to be removed (--remove), keeping the one with the highest genotype rate.

As an extra step, we propose you to perform a Principal Component Analysis (PCA) to see how your samples are distributed.

Step 1.1.3. Prepare data for imputation

Once we have our samples and SNPs filtered, we will start preparing the data to the format required for imputation. Therefore, the next step is to convert your binary PLINK format files of the whole genome, into 22 VCF files, one per chromosome. Then, you need to sort them, change the VCF version (if it were the case), and compress and upload them into the Michigan Imputation Server.

Step 1.2. Impute SNP genotypes in the Michigan Imputation Server

To impute our genotype, we will use the Michigan Imputation Server (Genotype Imputation Minimac4), for this you will need to register on the website, and the parameters that we use are:

  • Reference Panel: HRC r1. 2016 (GRCh37/hg19)
  • Array Build: GRCh37/hg19
  • rsq Filter: off
  • Phasing: Eagle v2.4 (phased output)
  • Population: EUR
  • Mode: Quality Control & Imputation

We will also check AES 256 encryption.

Step 1.3. Post-imputation quality control

Once your chromosome files have been imputed, you will receive an email from the Server with the password to unzip them. Now that your genotype density has been improved, we will execute the post-imputation QC script by Will Rayner, which produces an html page for visual inspection of results.

From all the imputed SNPs, we only want those imputed with high accuracy. For this reason, we will filter the variants by Rsq or R2 parameter, which estimates the correlation between the imputed and the true genotypes. Finally, we will concatenate all the chromosomes into one VCF file and extract the list of samples that pass all the filters of the genotype quality control.