Run variance meQTL analysis - genetics-of-dna-methylation-consortium/godmc_phase2 GitHub Wiki
MODULE STATUS
Developer: Xiaopu Zhang
Scripts status: Ready
Prerequisite scripts: scripts 00, 01, 02a-02b, 03a-03g, 04
Data upload method: automated upload to sftp to Bristol
Background
Gene-gene (GxG) and gene-environment (GxE) interactions lead to the increase of DNA methylation variance. A comprehensive search of genome-wide GxG or GxE interaction effects is impractical. Detecting genetic effects associated with DNA methylation variance (variance meQTL or vmeQTLs) is an alternative and efficient way to filter out candidate SNPs, which are likely to be involved in GxG or GxE interactions. See Yang et al (2012).
Strategy
vmeQTL detection will be a two-phase analysis, with a focus on EPIC methylation data and autosomal chromosome (module 07 scripts will ignore chr 23 as the default). Cohorts participated in vmeQTL analysis should be involved in both phases.
In phase 1, we will use BF(Levene's test with median value), DRM and SVLM tests to detect vmeQTLs. These methods were selected from performance comparison on simulated data (see Zhang & Bell (2024)). The current GoDMC pipeline only releases the code for the first phase.
In phase 2, we will send a list of SNP-CpG pairs and scripts to each cohort (scripts are in development).
1/ Cohorts will report the linkage disequilibrium between meQTLs (observed from 04 scripts) and vmeQTLs of each candidate CpGs (mandatory).
2/ Depending on the modifiers information each cohort has (eg. age, sex, cell type, ...), cohort will conduct interaction analysis (optional).
Analysis
We will use a new version of (OSCA) for this analysis.
We highly recommend cohorts change setting genetic_chunks
in config file to a small number, eg. 100 (default is 1000) to save their running time for module 07, because the number of array jobs is in the proportion of genetic_chunks
If you need to change genetic_chunks
here, please rerun 02b (takes ~2min) and then run module 07.
Please contact Xiaopu Zhang if you have any doubts before running.
Description of vmeQTL scripts
07a-vmeQTL_data_preparation.sh
The first script converts genetic data and DNA methylation data to OSCA input. Genetic data tabfiles (${home_directory}/processed_data/genetic_data/tabfile/data.tab.[1-100]
) are converted to plink format: (.bed
, .fam
, .bim
). DNA methylation data are divided into 22 chromosomes and then converted to OSCA format (.bod
, .opi
, .oii
) here: ${home_directory}/processed_data/methylation_data/vmeQTL/meth_vmeQTL_input.chr[1-22]
. The CpG annotations are updated by OSCA.
How to run script 07a:
bash 07a-vmeQTL_data_preparation.sh
Note:
- The option
–methylation-m
in script 07a does not mean that the DNA methylation data are M values. We use this option because the–methylation-beta
option requires input methylation data ranging from 0 to 1, otherwise it reports errors. The adjusted DNA methylation data from the pipeline does not meet this requirement, so we use the–methylation-m
option instead. - The provided annotation files
{home_directory}/resources/methylation/vmeQTL/EPIC_annotation_autochr.opi
include CpG annotations from EPIC array version1 and version2.
After running 07a, please check if you have plink format genetic data tabfiles here:
source config
ls ${home_directory}/processed_data/genetic_data/tabfile/data.tab.*.bed
ls ${home_directory}/processed_data/genetic_data/tabfile/data.tab.*.bim
ls ${home_directory}/processed_data/genetic_data/tabfile/data.tab.*.fam
Please check your data has been converted to OSCA format:
ls ${home_directory}/processed_data/methylation_data/vmeQTL/meth_vmeQTL_input.chr*.bod
ls ${home_directory}/processed_data/methylation_data/vmeQTL/meth_vmeQTL_input.chr*.opi
ls ${home_directory}/processed_data/methylation_data/vmeQTL/meth_vmeQTL_input.chr*.oii
07b-run_cis_vmeQTL.sh
To save storage and running time, we only identify cis-vmeQTLs. Three variables are required as input to 07b scripts - $1
is the vQTL methods (BF, SVLM or DRM), $2
is the chromosome (1-22), and $3
is the genetic chunk. Because we only run cis-vmeQTL, we submit 07b jobs by each chromosome, i.e. when $2
is 1, $3
will be the genetic chunk index which contains SNPs from chromosome 1.
To do this, you can run example code from here. You can find a detailed description about the code below.
for i in $(seq 1 ${genetic_chunks})
do
chr=`cut -f 1 ${tabfile}.tab.$i.bim | sort | uniq`
echo $i $chr >> ${section_07_dir}/tabfile.info
done
count=`awk 'BEGIN{FS=" "}{print NF}' ${section_07_dir}/tabfile.info | sort | uniq`
for i in $count
do
cut -d ' ' -f $i,1 ${section_07_dir}/tabfile.info >> ${section_07_dir}/tabfile.temp
done
awk 'BEGIN{FS=" "}{if($2>0) print}' ${section_07_dir}/tabfile.temp | sort -k1,1n | uniq > ${section_07_dir}/tabfile.info1
rm ${section_07_dir}/tabfile.info
rm ${section_07_dir}/tabfile.temp
The first two columns of ${section_07_dir}/tabfile.info1
represent the genetic chunk and the SNP chromosome. In the tabfile.info1
example below you find that SNPs from the genetic chunks 1-8 are on chr1 and SNPs from genetic chunks 8-9 are on chr2.
head ${section_07_dir}/tabfile.info1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
8 2
9 2
Then the 07b script submission can be:
cat ${section_07_dir}/tabfile.info1 | while read line;
do
genetic_chunk=`echo $line | cut -d " " -f 1`
chr=`echo $line | cut -d " " -f 2`
sbatch 07b-run_cis_vmeQTL.sh drm $chr $genetic_chunk
sbatch 07b-run_cis_vmeQTL.sh svlm $chr $genetic_chunk
sbatch 07b-run_cis_vmeQTL.sh BF $chr $genetic_chunk
done
Note:
- Running time for each job is ~1-2 hours with 500 samples.
- The total storage of summary statistics is expected to be <200G per cohort.
- There is a difference in the OSCA
cis-wind
option in the 07b script. The BF method (-cis-wind 2000
) was implemented with the old OSCA version and the-cis-wind
option representskb
. For the SVLM/DRM methods (-cis-wind 2000000
) we use a more recent OSCA version where the-cis-wind
option representsbp
. Please contact Xiaopu Zhang if your 07b did not finish within 2 days!
07c-run_cis_vmeQTL.sh
We will use the 07c script to check whether all 07b jobs have successfully run and to resubmit failed jobs.
Please run:
bash ./07c-check_vmeQTL_results.sh BF
bash ./07c-check_vmeQTL_results.sh drm
bash ./07c-check_vmeQTL_results.sh svlm
Please rerun 07c after all resubmitted jobs are finished to ensure you got output for all chunk jobs
using BF method to detect vmeQTLs of genetic chunk:[1-100] and chr: [1-22] is successful
using drm method to detect vmeQTLs of genetic chunk:[1-100] and chr: [1-22] is successful
using svlm method to detect vmeQTLs of genetic chunk:[1-100] and chr: [1-22] is successful
07d-tar_results.sh
We will use 07d-tar_results.sh
for tarring the results file. Results of each chromosome will be tarred separately. Please submit this scripts in parallel of all chromosomes or use scripts submission_scripts_template/godmc_07d_template.sh
directly.
sbatch ./07d-tar_results.sh [1-22]
All required scripts are finished here. Thanks for your efforts!
Now upload the results
To check that everything ran successfully, please run:
./check_upload.sh 07 check
This should tell you that Section 07 has been successfully completed!
. Now please go to the Results directory and check your results.
cd results/07
Now please go back to the godmc_phase2 directory and upload the results like this:
./check_upload.sh 07 upload
Please note it may take around 1-2 hours to upload the 07 module results
It will make sure everything looks correct and connect to the sftp server. It will request your password (this should have been provided to you along with your username). Once you have entered your password it will upload the results files from section 07
.