meQTL mapping analysis cookbook - molgenis/systemsgenetics GitHub Wiki
This is the cookbook for performing the meQTL analysis. This cookbook illustrates only a part of the capabilities of our software. See for other options the other parts of the wiki.
##Questions or suggestions? You can contact the authors of this manual and software at natalia.tsernikova @ gmail.com, bonder.m.j @ gmail.com, westra.harmjan @ gmail.com, dashazhernakova @ gmail.com or lude @ ludesign.nl for questions or suggestions regarding the software or the manual. This manual was written in cooperation with Marjolein Peters and Joyce van Meurs.
##Manual contents
- Step 1 - Preparation of methylation data
- Step 2 - Preparation of genotype data
- File Checklist
- Step 3 - Methylation normalization
- Step 4 - MixupMapper
- Step 5 - Determining PCs to remove
- Step 6 - Performing the cis-QTL analysis
- Step 7 - Performing the trans-QTL analysis
##Downloading the software and reference data You will need four programs to perform the QTL analysis.
- The software to process the methylation data. You can download the latest version of the 450K_DataProcessing R scripts here.
- The PLINK toolcase (v 1.07), to calculate MDS components.
- The software to process and convert the genotype data, please use version 1.4.15 available here: Genotype Harmonizer.
- The actual QTL mapping software. You can download version 1.2.4D of the software from the FTP, or you can find it here: eqtl-mapping-pipeline-1.2.4D-SNAPSHOT-dist.zip or eqtl-mapping-pipeline-1.2.4D-SNAPSHOT-dist.tar.gz
Additionaly you will need the following reference files:
- Giant genotype reference file, which can be downloaded here.
- The annotation file for the 450K probes, which can be found at the FTP.
##Before you start
In this manual we assume that you have basic knowledge on how to work with the java Virtual Machine (VM) and the command line. If you are not sure about working command line, please have a look at our introduction page and/or check the information about the Java VM (including the Java memory options). If you run into any problems with heapspace or out of memory errors please supply the -Xmx
and -Xms
command, see Java VM information. At each java command we supply the recomended settings.
##Uploading the results We use an FTP server to collect all the results and log files. You can obtain an account via bonder.m.j @ gmail.com. If you need any advice how to work with a FTP server feel free to ask for help. Please make sure to also upload the log files of the steps where asked.
#Step by step meQTL analysis This is a step by step guide through the meQTL mapping process. Our general method consists of seven steps described below:
- Step 1 - Preparation of methylation data
- Step 2 - Preparation of genotype data
- File Checklist
- Step 3 - Methylation data normalization
- Step 4 - MixupMapper
- Step 5 - Determining PCs to remove
- Step 6 - Performing the cis-QTL analysis
- Step 7 - Performing the trans-QTL analysis
##Definitions Througout the manual, references to different files/folders will be made. Here is an overview of these items:
- The methylation file will be referred to as
traitfile
- The methylation annotation file will be referred to as
annotationfile
- The full path of your genotype data will be referred to as
genotypedir
- The file linking methylation id to genotype id will be referred to as
genotypemethylationcoupling
- The file containing covariates will be defined as
covariatefile
Descriptions of each of these files and their usage is detailed below, and their formats are described in the data formats section which can be found in the general QTL mapping documentation.
##Step 1 - Preparation of methylation data As our software uses a nonparametric test by default, you can use virtually any continuous data as trait values to map a variety of QTL effects. However, currently the normalization tools provided with this package are focused on array based (Illumina) expression data, preprocessed RNA-seq data (e.g. transcript level quantified data), (GC)-RMA processed Affymetrix data and include several steps for the 450K methylation array. Therefore we first need to perform a part of the normalization in a separate program.
###Illumina 450K array based methylation data We chose to use "DASEN" normalization for the big cis- and trans- meQTL meta-analysis, we provide an R pipe-line to process the 450K data and normalize the data as outlined in the analysis plan.
Please start processing your data using the 450K_DataProcessing pipeline. Download all files using the option "Download ZIP".
Install all required packages.
Continue with the already designed user-friendly script. Comments in the script contain further descriptions of commands and information about running the script. Before you start the normalization, please read these instructions carefully.
450K data normalization script requires four separate folders:
- Folder with the raw methylation data;
- Folder for the normalized files created by the script;
- Folder with the source code (created automatically);
- Folder with the annotation files (created automatically).
After processing the data from the original .idat files using the 450K_DataProcessing pipeline to a'.txt' matrix with methylation values. The methylation data is now ready to be used in step 3, please continue with pre-processing the genotype data.
##Step 2 - Preparation of genotype data Our software is able to use both unimputed called genotypes, as well as imputed genotypes and their dosage values. The software, however, requires these files to be in TriTyper format. We provide Genotype Harmonizer to harmonize and convert your genotype files. However, we first need to calculate MDS components to correct the methylation data for population stratification.
###Step 2A - Correction for population stratification.
You will need to correct the expression data for 4 MDS components during the gene expression normalization step. First we need to get our data into the PLINK format, it is recommend to use the binary PLINK format and you should therefore have these files: mydata.bed, mydata.fam and mydata.bim.
📌 Note: this step should be performed using the non imputed autosomal data. If you run into problems feel free to contact bonder.m.j @ gmail.com for advice. It is also recommended to use pruned results. Please have a look at the PLINK manual (http://pngu.mgh.harvard.edu/~purcell/plink/summary.shtml#prune) on how to obtain this.
Because PLINK uses complete linkage agglomerative clustering, based on pairwise identity-by-state (IBS) distance we first need to execute this command:
plink --bfile mydata --genome
which creates the file plink.genome.
This file will have as many rows as there are unique pairs of individuals in the sample. For large samples with thousands of individuals, this file will take a considerable time to generate.
To obtain these MDS components, you can use a simple command using the PLINK tool to create a multidimensional scaling matrix. The command line will look like this:
plink --bfile mydata --read-genome plink.genome --cluster --mds-plot 4
This command creates the file plink.mds which contains one row per individual with the following fields:
- FID - Family ID;
- IID - Individual ID;
- SOL - Assigned solution code;
- C1 - Position on first dimension;
- C2 - Position on second dimension;
- C3 - Position on third dimension;
- C4 - Position on fourth dimension;
You can find more information on PLINK website
Note it is not recommended to exclude samples at this step. This can easily be done at a later stage using the genotype to expression coupling file. Having all samples will help in step 4 to more accurate resolve sample mixups should they exist.
✅ Please upload your log file to the central server.
###Convert data to the TriTyper fileformat. For the QTL mapping the data needs to be in TriTyper format. Using GenotypeHarmonizer one can harmonize and convert genotype formats. In this step we will directly harmonize, filter and convert the genotype data. The data is harmonized to be matched to the GIANT release of 1000G. You can either run this step per each chromosome separately or for all chromosomes at once.
❗ Important When you are using imputed genotypes, make sure to use as an input format a file that contains the probabilities or dosages and not just genotype calls. In case of doubt please contact bonder.m.j @ gmail.com
📌 Note: please make sure to use verion 1.4.15 of the Genotype Harmoinzer.
java -Xmx10g -Xms10g -jar ./GenotypeHarmonizer.jar -i {locationOfInputData} -I {InputType} -o {Outputlocation} -O TRITYPER -r {locationOf1000G-GiantVcfs} --refType VCF_FOLDER --update-id -cf 0.95 -hf 0.0001 -mf 0.01 -mrf 0.5 -ip 0
Input arguments: -i
location of the input data, -I
input type, -o
output location, -O
output type TRITYPER, -r
location of the reference data, --refType
is the type of reference data. See the Genotype Harmonization manual for further details on the input flags.
✅ Please upload your logs file to the FTP server.
If you ran the GenotypeHarmonizer per chromosome you now need to merge all TriTyper folders per chromosome to one TriTyper folder containig data from all chromosomes. By using the following command you merge the individual TriTyper folders.
java -Xmx2g -Xms2g -jar ./eqtl-mapping-pipeline.jar --imputationtool --mode concat --in {folder1;folder2;folder3;ETC} --out {OutputFolder}
Input arguments:--in
is a semicolon-separated list of input TriTyper Folders with information per chromosome, --out
is the output location of the merged TriTyper data. Note: in case of problems try putting quotes ("
) around your list of input folders
Great! You now have your genotype and methylation data ready to go!
#File Checklist Before continuing, now is a good time to check whether your files are in the correct format and whether you have all the required files ready. Please check the following:
- All required TriTyper files are in your
genotypedir
. - The methylation files are formatted properly. (Check the "results" directory, for the "ILLUMINA450K_Mval.txt" file).
- You have an
annotationfile
(you should have downloaded this from the FTP at the start). - The identifiers of samples are identicial in your methylation and genotype files. If not, you should make a file to link the proper samples. This file is called a
genotypemethylationcoupling
file, the format is described here: Genotype - phenotype coupling. This file is not necessary if the names are already matched.
##Step 3 - Methylation data normalization Now let's perform an additional normalization to normalize the methylation data to an optimal format for the meQTL mapping. For the methylation data we will perform the following steps:
- Quantile normalization
- Probe centering and scaling (Z-transformation): (MethylationProbe,Sample = MethylationProbe,Sample - MeanProbe) / Std.Dev.Probe
- Removal of covariates - Correct gene methylation data for first 4 MDS of the GWAS data - to remove possible population stratification
- We run a Generalized Linear Model with the probes as dependent variables, and the GWAS MDS as
- Principal component adjustment
- In the Principal Component Analysis, the software tries to convert the normalized and standardized expression results into a set of values of uncorrelated variables, called Principal Components (PCs). The number of PCs is equal to the number of samples, and the first PCs explain a higher variance than the last PCs. By adjusting the normalized and standardized expression data for a set of PCs (like we did with the removal of the covariates - use of a Generalized Linear Model), we try to remove batch effects in the data. By removing them in an incremental way, we try to find the optimal number of PCs to remove.*
Run these steps using the following command:
java -Xmx15g -Xms15g -jar eqtl-mapping-pipeline.jar --mode normalize --in {traitfile} --out {outdir} --qqnorm --centerscale --adjustcovariates --cov {covariatefile} --adjustPCA --maxnrpcaremoved 0 --stepsizepcaremoval 0 2>&1 | tee {outdir}/normalization.log
For details on the covariate file see this link. We want to use the Plink MDS results as confounders for our analysis. Before we can use the plink mds results we need to run this command: awk '{print $2,$4,$5,$6,$7}' < plink.mds > plinkMds.txt
. Please make sure that the samples names in this file correspond to the sample names in the expression data.
After these steps a tab-separated gzipped plaintext file is created with an identical number of rows and columns as the input file. This means these files can subsequently be used during meQTL mapping.
Use the command below to extract the first 2 principle components, please plot these components. At most you might observe a few clear outliers, please remove these samples from the analysis. In case of doubt feel free to contact bonder.m.j @ gmail.com for support, please provide a plot of the first component.
zcat ***.QuantileNormalized.ProbesCentered.SamplesZTransformed.PCAOverSamplesEigenvectors.txt.gz | awk 'BEGIN {OFS="\t"} NR == 1 {print "Sample","Comp1","Comp2"} NR > 1 {print $1,$2,$3}' > pc1_2.txt
✅ Please upload your log file to the FTP server.
###Check your data
Running the general normalization procedure yields a number of files in the directory of your traitfile
, or in the outdir
if you specified one. The default procedure will generate file suffixes listed below. Suffixes will be appended in the default order as described above. Selecting multiple normalization methods will add multiple suffixes. Please do not remove or replace ANY intermediate files as they may be used in subsequent steps.
Suffix | Description |
---|---|
QuantileNormalized | Quantile Normalized trait data. |
ProbesCentered | Probes were centered. |
SamplesZTransformed | Samples were Z-transformed. |
CovariatesRemoved | Methylation data was adjusted for covariates. |
PCAOverSamplesEigenvalues | Eigenvalues created during eigenvalue decomposition of the methylation sample correlation matrix (created from the Quantile Normalized, Z-transformed data) |
PCAOverSamplesEigenvectors | Eigenvectors created during eigenvalue decomposition of the methylation sample correlation matrix (created from the Quantile Normalized, Z-transformed data) |
PCAOverSamplesEigenvectorsTransposed | Eigenvectors transposed |
PCAOverSamplesPrincipalComponents | Principal Components describing the sample correlation matrix (created from the Quantile Normalized, Z-transformed data) |
##Step 4 - MixupMapper We have shown in a paper published in Bioinformatics (Westra et al.: MixupMapper: correcting sample mix-ups in genome-wide datasets increases power to detect small genetic effects), that sample mix-ups often occur in genetical genomics datasets (i.e. datasets with both genotype and methylation data). Therefore, we developed a method called MixupMapper, which is implemented in the QTL Mapping Pipeline. This program performs the following steps:
- At first a cis-meQTL analysis is conducted on the dataset:
- using a 250 kb window between the SNP and the mid-probe position
- performing 10 permutations to control the false discovery rate (FDR) at 0.05
- Calculate how well the methylation data matches the genotype data. For details how this exactly works, please have a look at the paper.
###Preparations/variables
- Note down the full path to your TriTyper genotype data directory. We will refer to this directory as
genotypedir
. - Determine the full path of the methylation data produced in step 3, the file ending with "CovariatesRemoved.txt.gz". We will refer to this path as
traitfile
. - Locate your
annotationfile
. Also note down theplatformidentifier
, which will be GPL13534 for the methylation 450K array. - (Optional) Locate your
genotypemethylationcoupling
if you have such a file. You can also use this file to test specific combinations of genotype and phenotype individuals. - Find a location on your hard drive to store the output. We will refer to this directory as
outputdir
.
###Commands to be issued The MixupMapper analysis can be run using the following command:
java -Xmx15g -Xms15g -XX:StringTableSize=319973 -jar eqtl-mapping-pipeline.jar --mode mixupmapper --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform GPL13534 --inexpannot {annotationfile} --snps {snpfile} (--testall) (--gte {genotypephenotypecoupling}) 2>&1 | tee {outdir}/mixupmapping.log
We want to test a subset of SNPs as testing a full 1000G imputed dataset will take a lot of time. To do so we can use this snpfile and adding this flag: --snps {snpfile}
(remember to use the full path).
If your genotypes and/or methylation data include more samples than those who are matched based on sample names, it is beneficial to test all possible combinations. This can be performed using the following command line switch --testall
.
If you are running the software in a cluster environment, you can specify the number of threads to use (nrthreads
) by appending the command above with the following command line switch --threads nrthreads
(nrthreads should be an integer).
Note that the --gte
, --threads
and --snps
flags are optional, --gte
only applies if you are using a genotypephenotypecoupling
file.
###Check your data
MixupMapper is a two stage approach. As such, the default procedure creates two directories in the outdir
you specified: cis-meQTLs and MixupMapping. Both folders contain a different set of output files, described below.
####cis-meQTLs directory This directory contains output from a default cis-meQTL mapping approach. However we are not interested in this folder, these files are just necessary to create the actual mixupmapping output which can be found in the MixupMapping directory.
####MixupMapping directory
File | Description |
---|---|
Heatmap.pdf | Visualization of overall Z-scores per assessed pair of samples. The genotyped samples are plotted on the X-axis, and the methylation samples are plotted on the Y-axis. The brightness of each box corresponds to the height of the overall Z-score, with lower values having brighter colours. Samples are sorted alphabetically on both axes. |
BestMatchPerGenotype.txt | This file shows the best matching trait samples per genotype: the result matrix (MixupMapperScores.txt) is not symmetrical. As such, scanning for the best sample per genotype may yield other results than scanning for the best sample per trait. |
BestMatchPerTrait.txt | This file shows the best matching genotype sample per trait sample. |
MixupMapperScores.txt | A matrix showing the scores per pair of samples (combinations of traits and genotypes). |
In the BestMatchPerGenotype.txt and BestMatchPerTrait.txt files, you can find the best matching trait sample for each genotyped sample and vice versa:
- 1st column = genotyped sample ID, or trait sample ID dependent on file chosen (see above)
- 2nd column = trait sample originally linked to genotype sample ID in column 1, or genotype sample originally linked with trait sample in column 1
- 3rd column = the MixupMapper Z-score for the link between the samples in column 1 and 2
- 4th column = best matching trait (for BestMatchPerGenotype.txt) or best matching genotype (for BestMatchPerTrait.txt)
- 5th column = the MixupMapper Z-score for the link between the samples in column 1 and 4
- 6th column = this column determines whether the best matching trait or genotype is identical to the sample found in column 2
Example of BestMatchPerGenotype.txt
Trait OriginalLinkedGenotype OriginalLinkedGenotypeScore BestMatchingGenotype BestMatchingGenotypeScore Mixup Sample1 Sample1 -11.357 Sample1 -11.357 false Sample2 Sample2 -15.232 Sample2 -15.232 false Sample3 Sample3 -3.774 Sample4 -6.341 true Sample4 Sample4 -3.892 Sample3 -12.263 true
In case of the example there are two mixups indentified. If you don't have any mix-ups you are done, otherwise you need to resolve or remove these mix-ups. Please check this separate document for instructions on fixing mixups.
Always rerun the Sample Mix-up Mapper after removing or changing sample IDs and check if you actauly fixed/removed mixups instead of creating more!
##Step 5 - Removing principal components Prior to QTL mapping, we would like to remove the first 25 components to increases power to detect cis- and trans-QTLs. As we have seen that some PCs explain genetic variation, we perform our analysis on data only adjusted for PCs which explain no genetic variation. To identify the PCs that have no genetic association, we perform a QTL mapping on the principal component eigenvector matrix (PCAOverSamplesEigenvectorsTransposed.txt.gz). We call PCs genetically associated when they have a FDR of 0 (thus selecting truly significantly affected components only).
###Preparations
- Note down the full path to your TriTyper genotype data directory. We will refer to this directory as
genotypedir
. - Determine the full path of the trait data you want to use, this will be the file created in step 3 before PC correction. File name will end with "CovariatesRemoved.txt.gz".
- (Optional) Locate your
genotypephenotypecoupling
if you have such a file. You can also use this file to test specific combinations of genotype and phenotype individuals. - Find a location on your hard drive to store the output. We will refer to this directory as
outputdir
.
###Commands to be issued To run the mapping on the eigenvectors and to subsequently remove the PCs which are not under genetic control, use the following command:
java -Xmx20g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --metamode nongeneticPcaCorrection --in {TriTyper location} --inexp {traitfile} (--gte {gtefile}) --out {outdir} --adjustPCA --stepsizepcaremoval 25 --maxnrpcaremoved 25 (--snpselectionlist {snpselectionfile}) --nreqtls 1000000 2>&1 | tee {outdir}/removeComponents.log
Note that the --gte
switch only applies if you want to use a genotypephenotype
coupling file.
✅ Please upload your log file to the FTP server.
###Check your data
After running the nongeneticPcaCorrection command, the outdir
will contain the output of a full QTL mapping on the eigenvectors for your dataset. At the location of your input methylation data the program has written a new PC corrected methylation file, which was corrected for all components but the once under genetic control (FDR 0). So if the process has finished succesfully there should be a new file which ends on "25PCAsOverSamplesRemoved-GeneticVectorsNotRemoved.txt.gz". This is the traitfile which we are going to use in the subsequent cis-QTL mapping.
##Step 6 - Perform the cis-QTL analysis
❗ Please note the following
- For the steps below make sure to use version 1.2.4D of the eQTL mapping software.
The cis-QTL mapping can be performed in a single command. Standard settings for the cis-QTL mapping are: HWEP > 0.0001, MAF > 0.01, and Call Rate > 0.95. For cis-QTL mapping, the maximum distance between the SNP and the middle of the probe is 1,000,000 bp. To control for multiple testing, we perform 10 permutations, thereby shuffling sample labels, to calculate the false discovery rate (FDR) controled at 0.05.
###Preparations
- Note down the full path to your TriTyper genotype data directory. We will refer to this directory as
genotypedir
. - Determine the full path of the trait data you want to use, created in step 5.We will refer to this path as
traitfile
. - Locate your phenotype annotation file:
annotationfile
. Also note down theplatformidentifier
. - (Optional) Locate your
genotypephenotypecoupling
if you have such a file. You can also use this file to test specific combinations of genotype and phenotype individuals. - Find a location on your hard drive to store the output. We will refer to this directory as
outputdir
.
###Commands to be issued
The command needed to be issued for this cis-QTL analysis:
java -Xmx50g -Xms25g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --mode metaqtl --in {genotypedir} --out {outdir} --inexp traitfile --inexpplatform GPL13534 --inexpannot {annotationfile} --cis --binary --perm 10 --maf 0.01 2>&1 | tee {outdir}/cisEQtlMapping.log
If you are running the software in a cluster or a multithreaded environment, you can specify the number of threads to use (nrthreads
) by appending the command above with the following command line switch --threads nrthreads
(nrthreads should be an integer).
Note that you should apply the --gte
switch if you are using a genotypephenotype
coupling file.
This section describes the binary output files generated using the --binary
command line switch.
File | Description |
---|---|
Dataset.dat | Zscore matrix for the data analysis. |
Dataset-ColNames.txt.gz | Column names for the Zscore matrix. |
Dataset-RowNames.txt.gz | Row names for the Zscore matrix. |
Dataset-PermutationRound-n.dat | Zscore matrix for the permuted data analysis |
Dataset-PermutationRound-n-ColNames.txt | Column names for the permuted Zscore matrix. |
Dataset-PermutationRound-n-RowNames.txt | Row names for the permuted Zscore matrix. |
excludedSNPsBySNPFilter.txt.gz | These SNPs are excluded by the user. |
excludedSNPsBySNPProbeCombinationFilter.txt.gz | There are no gene methylation probes located within 250.000bp of this SNP position. |
ProbeQCLog.txt.gz | This file describes which probes have been excluded from analysis. |
SNPQCLog.txt.gz | This file contains QC information for all tested SNPs, including MAF, HWE-pvalue and call-rate (which is always 1.0 for imputed SNPs). |
UsedSettings | This file the settings used for the meQTL mapping. |
✅ Please uploads all the above mentioned result files
###Checking SNPs with proxies within a probe To make sure that we can later validate that an meQTL is not an artefact of a proxy SNPs within the probe binding region we run the script below. The validCombinations file contains all SNP-probe combinations that are save to analyse.
java -jar -Xmx15g -Xms15g eqtl-mapping-pipeline.jar --metamode determineSnpProbList -dp 0.2 -r 0.2 -ri {genotypedir} -rt TRITYPER -pm 1 -i {annotationfile} -o {outdir}/validCombinations.txt 2>&1 | tee validCombinations.log
###Cis-meQTL mapping completed
🎉 You are done with the cis-meQTL analysis. Please upload your results, the validCombinations file and your logs and drop a mail to bonder.m.j @ gmail.com
##Step 7 - Perform the trans-QTL analysis :construction: Step 7 will be available once step 6 is finished for all datasets.