eQTLGen additional analyses - molgenis/systemsgenetics GitHub Wiki
This supplement to the eQTLGen cookbook gives instructions for additional analyses proposed within the framework of the eQTLGen Consortium. It is expanded whenever new initiatives are agreed upon.
1. Tissue specificity of eQTL effects
This analysis was proposed and is conducted by Holger Kirsten and Markus Scholz from the Leipzig LIFE study (Leipzig University).
The purpose of this part is to investigate the tissue specificity of the eQTLs identified in the eQTLGen Consortium and to replicate and investigate them in published tissue-specific eQTL datasets, such as GTEx. This will help to clarify the relationship between the expression level of a gene and the detected eQTL effects.
To achieve this, we will collect summary information about the expression levels of the genes measured by each cohort. The reports generated in these steps contain descriptive statistics for each probe in your expression dataset (such as mean, median and standard deviation) and the number of samples in which the gene was actually detected. Detection is determined using a detection P-value derived from "pseudocontrol" probes, which correspond to genes not expressed in blood according to public databases. The summary is collected only for the samples used in the final cis/trans-eQTL analyses, so several files from previous steps are needed to define this overlap.
Running this analysis requires R to be installed in your workspace, together with two packages (data.table and stringr) and their dependencies. To install them, open an R session in your working environment, run the following command and follow the instructions:
install.packages(c('data.table', 'stringr'))
If you are working in a UNIX cluster environment, you need to load the installed version of R with the module load command, as usual.
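Before running the extraction script, it can be useful to confirm that both packages actually load. The following is only a minimal sketch: the function name check_r_packages is hypothetical (it is not part of the eQTLGen scripts), and it assumes Rscript is available on your PATH.

```shell
# check_r_packages: return 0 only if every named R package can be loaded.
# Hypothetical helper; assumes Rscript is on PATH (after "module load" on a cluster).
check_r_packages() {
  for pkg in "$@"; do
    # library() raises an error for a missing package, so Rscript exits non-zero.
    if ! Rscript -e "suppressPackageStartupMessages(library('$pkg'))" >/dev/null 2>&1; then
      echo "Missing R package: $pkg" >&2
      return 1
    fi
  done
  echo "All required R packages are available"
}

# Example:
#   check_r_packages data.table stringr
```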
The script for extracting the expression information and the file with pseudocontrol information can be downloaded here:
Script: https://www.dropbox.com/s/o0pk3gxukhl2ojf/script_extract_expression_statistics.R?dl=0
Pseudocontrol file: https://www.dropbox.com/s/hjz2r59mud0s84s/blood_pseudocontrols_ENSEMBL71_added_20170505.txt?dl=0
1.1 Expression statistics from raw data
Necessary arguments:
- Raw, unprocessed expression matrix from which outlier samples have been removed. This file is generated in Step 3B of the main cookbook (default name ExpressionData.SampleSelection.txt.gz, located by default in the main expression data folder). If there were no expression outliers in your data, or you used an older version of the pipeline at that stage, the corresponding file may instead be named ExpressionData.txt.gz. If you are on a UNIX platform, you can use the gzipped version in the command below; otherwise please use an unzipped copy of this file. We refer to this file as raw_expression_data.
- Genotype-expression file used for the cis/trans-eQTL analyses. We refer to this file as gte_file. This file was needed in Step 5 of the main cookbook (PC-correction of expression matrix). If you have the same sample IDs in genotype and expression data, you can construct this file by supplying the same IDs twice. The format is explained here: https://github.com/molgenis/systemsgenetics/wiki/File%20descriptions#genotype---phenotype-coupling
- PhenotypeInformation.txt file, which is in the harmonized TriTyper folder (output of Step 2). We refer to this file as PhenotypeInformation.
- Pseudocontrol file, which contains negative "pseudocontrol" probes for each Illumina array type. These were collected from the GTEx and TiGER databases. File name: blood_pseudocontrols_ENSEMBL71_added_20170505.txt. We refer to this file as BloodPseudoControlFile; it is downloadable from here: https://www.dropbox.com/s/hjz2r59mud0s84s/blood_pseudocontrols_ENSEMBL71_added_20170505.txt?dl=0
- Argument defining your cohort name, in a format like: Fehrmann/LIFEa1/LIFEb3/ALSPAC/etc. We refer to this argument as CohortName.
- Argument defining your Illumina expression platform name, in the format: HT12v3, HT12v4, HT12v4_WGDASL, H8v2ConvToHT12. We refer to this argument as PlatformName.
- Argument defining the full path to your output directory. We refer to this argument as outdir.
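Two of the preparation steps above can be sketched in shell. The helper names make_gte and check_inputs, and the file name sample_ids.txt, are hypothetical and not part of the cookbook. The sketch assumes that the genotype-expression file is a header-less, tab-separated file with the genotype sample ID in the first column and the expression sample ID in the second (please verify this against the format description linked above), and that sample IDs contain no whitespace.

```shell
# make_gte: build a genotype-expression file from a list of sample IDs
# (one ID per line) by writing each ID twice, separated by a tab.
# Only applicable when genotype and expression sample IDs are identical.
make_gte() {
  awk 'NF { print $1 "\t" $1 }' "$1" > "$2"
}

# check_inputs: report every missing or empty input file;
# returns non-zero if at least one file is missing or empty.
check_inputs() {
  status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "Missing or empty input: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Example (hypothetical paths):
#   make_gte sample_ids.txt gte_file.txt
#   check_inputs ExpressionData.SampleSelection.txt.gz gte_file.txt \
#     PhenotypeInformation.txt blood_pseudocontrols_ENSEMBL71_added_20170505.txt
```

Since the command in this section pipes its log through tee into the output directory, that directory has to exist before the command starts; running mkdir -p on it beforehand is a safe precaution.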
Commands to run:
Rscript script_extract_expression_statistics.R \
{raw_expression_data} \
{gte_file} \
{PhenotypeInformation} \
{BloodPseudoControlFile} \
{CohortName} \
{PlatformName} \
{outdir} 2>&1 | tee {outdir}/ExpressionStatistics_raw.log
This command writes two files to the output directory:
[CohortName]_[PlatformName]_expression_statistics_raw.txt
ExpressionStatistics_raw.log
Please upload both files.
1.2 Expression statistics from normalized data
Next, we ask you to run the same command again, this time supplying the quantile-normalized and log-transformed expression matrix. By default this file is named ExpressionData.SampleSelection.QuantileNormalized.Log2Transformed.txt and is located in the same folder as the file used in the previous step.
If there were no expression outliers in your data, or you used an older version of the pipeline at that stage, the corresponding file may instead be named ExpressionData.QuantileNormalized.Log2Transformed.txt.gz.
Again, if you are on a UNIX platform, you can use the gzipped version in the command below; otherwise please use an unzipped copy of this file.
We refer to this file as quantile_log2_expression_data.
All other input arguments remain the same as in the previous step.
Commands to run:
Rscript script_extract_expression_statistics.R \
{quantile_log2_expression_data} \
{gte_file} \
{PhenotypeInformation} \
{BloodPseudoControlFile} \
{CohortName} \
{PlatformName} \
{outdir} 2>&1 | tee {outdir}/ExpressionStatistics_quantile_log2.log
This command writes two files to the output directory:
[CohortName]_[PlatformName]_expression_statistics_quantile_log2.txt
ExpressionStatistics_quantile_log2.log
Please upload both files.
1.3 Upload expression statistics
After extracting the expression statistics, please upload these 4 files into a new, separate folder (named expression_statistics) on your cher-ami sharing account.
The updated folder structure on the upload server is as follows (LIFEa1/LIFEb3 should be replaced with the name of your analysed cohort/dataset):
|--LIFEa1
   |--trans_eQTL_results
      |--trans_eQTL_results_PC_corrected
      |--trans_eQTL_results_PC_uncorrected
   |--PRS_eQTL_results
      |--PRS_eQTL_results_PC_corrected
      |--PRS_eQTL_results_PC_uncorrected
   |--expression_statistics
|--LIFEb3
   |--trans_eQTL_results
      |--trans_eQTL_results_PC_corrected
      |--trans_eQTL_results_PC_uncorrected
   |--PRS_eQTL_results
      |--PRS_eQTL_results_PC_corrected
      |--PRS_eQTL_results_PC_uncorrected
   |--expression_statistics
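Before uploading, it may be worth confirming that all four files from sections 1.1 and 1.2 are present in your output directory. The sketch below derives the expected names from the patterns given above; the helper names expected_files and check_outputs are hypothetical, and the sketch assumes cohort and platform names contain no whitespace.

```shell
# expected_files: print the four file names produced in sections 1.1 and 1.2
# for a given cohort name ($1) and platform name ($2).
expected_files() {
  cohort=$1
  platform=$2
  printf '%s\n' \
    "${cohort}_${platform}_expression_statistics_raw.txt" \
    "ExpressionStatistics_raw.log" \
    "${cohort}_${platform}_expression_statistics_quantile_log2.txt" \
    "ExpressionStatistics_quantile_log2.log"
}

# check_outputs: confirm each expected file exists and is non-empty in outdir ($3);
# returns non-zero if any file is missing or empty.
check_outputs() {
  outdir=$3
  missing=0
  for f in $(expected_files "$1" "$2"); do
    if [ ! -s "$outdir/$f" ]; then
      echo "Not found or empty: $outdir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (hypothetical output directory):
#   check_outputs LIFEa1 HT12v4 /path/to/outdir && echo "Ready to upload"
```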
The upload instructions are here: http://wiki.gcc.rug.nl/wiki/DataSharing
If the instructions below do not work for your cluster setup, please consult the extended manual: https://github.com/molgenis/systemsgenetics/wiki/Using-sharing-server-for-eQTLGen-analyses
## one possibility is to use lftp, if this is installed in your environment
## use these commands to upload/download data from cher-ami server
# go to the local folder where data is stored:
cd local/folder/with/your/data/
# start lftp
lftp
# connect with your guest account:
lftp :~> open -u [your_guest_accountname],none -p 22 sftp://cher-ami.hpc.rug.nl
# make additional directory for the corresponding dataset e.g:
lftp :~> mkdir LIFEa1/expression_statistics
# go to the folder where you upload the results. e.g:
lftp :~> cd LIFEa1/expression_statistics
# use put command for uploading the individual files and/or mirror command for uploading whole directories
lftp :~> put individual_result_file_in_local_server
lftp :~> mirror -R result_folder_in_local_server
# exit from the remote server once all necessary files have been uploaded:
lftp :~> exit
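The same steps can also be run in a single non-interactive lftp call, which is convenient in batch scripts. This is a sketch under assumptions rather than a tested recipe for the cher-ami server: the helper names lftp_commands and upload_stats are hypothetical, the local directory is assumed to contain no whitespace, and the connection settings simply mirror the interactive open command above.

```shell
# lftp_commands: print the lftp command sequence for uploading the
# expression statistics of one cohort ($1) from a local directory ($2).
lftp_commands() {
  cohort=$1
  localdir=$2
  printf 'mkdir -p %s/expression_statistics; ' "$cohort"
  printf 'cd %s/expression_statistics; ' "$cohort"
  printf 'lcd %s; ' "$localdir"
  printf 'mput *expression_statistics_*.txt ExpressionStatistics_*.log; bye\n'
}

# upload_stats: run that sequence in one lftp session.
# Arguments: guest account name, cohort name, local directory with the files.
upload_stats() {
  account=$1
  lftp -u "$account,none" -p 22 \
    -e "$(lftp_commands "$2" "$3")" \
    sftp://cher-ami.hpc.rug.nl
}

# Example (hypothetical account name and path):
#   upload_stats my_guest_account LIFEa1 /path/to/outdir
```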
Please also send an e-mail to urmo.vosa @ gmail.com once the expression statistics files have finished uploading.