Run GWAS of age acceleration - genetics-of-dna-methylation-consortium/godmc_phase2 GitHub Wiki

MODULE STATUS

Developers: Dr Eilis Hannon and Siyi Wang

Scripts status: Ready

Prerequisite scripts: ./00-setup_folders.sh, ./01-check_data.sh, ./02-snp_data.sh, ./03a-methylation_variables.sh

Data upload method: Manual upload to GoogleDrive after completion of scripts in module 11

GWAS of age acceleration

We will perform a number of genome-wide association anlayses of biological age biomarkers where epigenetic clocks are used to derive measures of biological aging. We have selected three epigenetic clocks to use as proxies for biological again: Horvath’s clock, Levine’s clock, and DunedinPACE clock. We have selected one example of each generation of epigenetic clock. Given the vast number of clocks available, this could have been a never ending list so we have made a practical decision to limit the quantity of data we generate.

Generation of the biological aging measures will be performed by the pipeline. Please make sure you've run the ./01-check_data.sh, ./02-snp_data.sh and ./03a-methylation_variables.sh before you run the following script. For each cohort with sample size less than 1000, this script will take 10-30min to finish.

To initiate the analyses, execute the following script:

./10a-gwas_aar.sh -c /path/to/your/config/file

[!WARNING] There is a possibility of encountering a memory allocation error when loading the methylation data, especially if the dataset exceeds the memory capacity of the login or compute node. If the system fails to allocate enough memory, you may see an error message like the following:
Error：cannot allocate vector of size XX.XX Gb
Execution halted
To resolve this issue, please submit the script as a batch job using the SLURM. Below is an example of a job submission script:
#!/bin/bash
#SBATCH --export=ALL
#SBATCH --mail-type=END
#SBATCH --output=1011-%j.out
#SBATCH --job-name=GoDMC_10a
#SBATCH --nodes=1
#SBATCH --mem=100G
#SBATCH --ntasks=8
#SBATCH --time=0-5:00:00

./10a-gwas_aar.sh -c /path/to/your/config/file

In cohorts where age is a variable phenotype, there will be a total of six estimated "age acceleration" measures (two for each clock). These will be treated as phenotypes in separate GWAS. GCTA will be utilized for fastGWA computations.

The script will systematically process each phenotype, performing the following steps:

Re-calculate the epi smoking score and the epi age accelerations by using the same scripts from 03a. (This is a necessary step for user who don't want to re-run the 03a after the bug fixed.)
Compute a sparse genetic relationship matrix (GRM) for autosome.
Execute fastGWA.
Generate Manhattan plots, QQ-plots based on the GWAS result.
There will be two sets of GWAS results. One is ${section_10_dir}/${clock_name}.fastGWA and another one is ${section_10_dir}/${clock_name}_PCA.fastGWA.

[!IMPORTANT] Please check the following graphs under the path results/10:

age_prediction.pdf - shows correlation between predicted and actual ages.

age_prediction_density.pdf - shows the distribution across all six age accerlations and actual ages. If the cohort includes both males and females, the density plot of all six age accerlation based on the sex will be generated.

age_prediction_box_violin.pdf - show the distribution with some key statistics across all six age accerlations and actual ages.

age_prediction_correlation.png - shows the correlations among every predicted age, age acceleration residuals, and chronological age.

manhattan.pdf - This graph displays SNPs with a P-value less than 0.01. The y-axis(starts from 2) represents the -log10 of the P->value, while the x-axis indicates the position on the chromosome. Please make sure the manhatten plot did contains data points.

qqplot.png - Observed P-values for each SNP, sorted from largest to smallest are plotted against the expected values from a >theoretical χ2-distribution. The genomic lambda may vary from 0.86 and 1.07, if the value is outside of this range this may indicate inflation. Please >check the lambda score on the top of the each qq-plot. Here is the example:

Please also check the following files with descriptives:

age_prediction_stats.csv - contains summary statistics for each clock, smoking variable, and their corresponding age acceleration modules.

age_prediction_stats_corrsd.csv - provides correlation values and standard deviations for each pairwise comparison, as shown in age_prediction_correlation.png.

age_prediction_stats_models.csv - includes statistics derived from linear models for each clock.

age_prediction_stats_sex.csv - contains model statistics stratified by sex. This file is only generated if the cohort includes both males and females.

Hertability of age acceleration

Since the SNP heritability measures the proportion of phenotypic variance explained by all measured SNPs, accurate estimation of SNP heritability can help us better understand the degree to which measured genetic variants influence phenotypes.

To initiate the analyses, execute the following script:

./10b-heritability_aar.sh -c /path/to/your/config/file

For each phenotype, the result will be saved in the .hsq file under the results/10 folder. The script will take < 1 min. More details about .hsq file can be found in GCTA.

In the .hsp file, there are 10 parameters:

V(G) is the genetic variance;
V(e) is the environmental variance;
Vp is the phenotypic variation;
V(G)/Vp is the heritability;
logL is the log likelihood for the full model;
logL0 is the log likelihood for the residual model;
LRT is 2[logL - logL0] which is distributed as a mixture of 0;
df is the degreed of freedom for chi-squared;
Pval is the p value;
n is the sample size;

The .hsp file should looks like this: