Run GWAS of smoking - genetics-of-dna-methylation-consortium/godmc_phase2 GitHub Wiki

MODULE STATUS

Developers: Dr Eilis Hannon and Siyi Wang

Scripts status: Ready

Prerequisite scripts: 00-setup_folders.sh, 01-check_data.sh, 02-snp_data.sh, 03a-methylation_variables.sh, 10a-gwas_aar.sh, 10b-heritability_aar.sh

Data upload method: Manual upload to Google Drive

Run GWAS of smoking exposure

A Genome-Wide Association Study (GWAS) will also be conducted for smoking using a DNA methylation-derived phenotype for smoking exposure.

A smoking score that captures cumulative exposure to smoking will be estimated from the DNA methylation data and used as the phenotype in a GWAS. Analyses will be adjusted for cell composition, sex and age where these covariates vary across your sample. Please make sure you have run 01-check_data.sh, 02-snp_data.sh, 03a-methylation_variables.sh, 10a-gwas_aar.sh and 10b-heritability_aar.sh before you run the following scripts. The GWAS script might take 5-10 minutes to complete for a cohort with a sample size < 1000.

To run the GWAS, run the following script:

./11a-gwas_smoking.sh -c /path/to/your/config/file

The script will:

Confirm that cell composition has been estimated.
Generate the phenotype (smoking score adjusted for relevant covariates).
Perform fastGWA on the adjusted smoking score.
Generate Manhattan and QQ plots from the GWAS results.

Please check the following graphs:

results/11/gwas_smoking_manhattan.pdf - This graph displays SNPs with a P-value less than 0.01. The y-axis (which starts from 2) represents the -log10 of the P-value, while the x-axis indicates the position on the chromosome. Please make sure the Manhattan plot contains data points. Here is an example:

results/11/gwas_smoking_qqplot.png - This plot shows the observed P-value of each SNP, sorted from largest to smallest, plotted against the values expected under a theoretical χ² distribution.
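If the Manhattan plot looks empty, counting the SNPs that pass the plotting threshold can help tell a plotting problem from a results problem. The sketch below is only an illustration: the function name `count_snps_below` is hypothetical, and it assumes the GWAS results are a whitespace-delimited table with a header row containing a column named `P`, as fastGWA output normally has. The name and location of the results file produced by the pipeline are not stated here, so pass whichever file your run wrote.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count rows of a GWAS results table whose P-value is
# below a threshold (default 0.01, the cut-off used for the Manhattan plot).
# Assumes a header row with a column called "P"; the column is located by
# name, so its position in the table does not matter.
count_snps_below() {
    local results_file="$1" threshold="${2:-0.01}"
    awk -v thr="$threshold" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "P") col = i
            if (!col) { bad = 1; exit }      # no P column: report in END
            next
        }
        # Skip missing values, compare the rest numerically.
        $col != "NA" && ($col + 0) < (thr + 0) { n++ }
        END {
            if (bad) { print "no P column in header" > "/dev/stderr"; exit 2 }
            print n + 0
        }
    ' "$results_file"
}
```

Calling `count_snps_below /path/to/your/gwas/results` prints the number of SNPs with P < 0.01. A result of 0 would suggest that the plot is empty because no SNP passed the threshold.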

Heritability of smoking exposure

SNP heritability measures the proportion of phenotypic variance explained by all measured SNPs, so an accurate estimate helps us understand how strongly the measured genetic variants influence the phenotype. The script should take less than 30 seconds.

To run the heritability analyses, execute the following script:

./11b-heritability_smoking.sh -c /path/to/your/config/file

For each phenotype, the result will be saved in a .hsq file in the results/11 folder. More details about the .hsq file format can be found in the GCTA documentation.

In the .hsq file, there are 10 parameters:

V(G) is the genetic variance;
V(e) is the environmental (residual) variance;
Vp is the phenotypic variance;
V(G)/Vp is the SNP heritability;
logL is the log likelihood for the full model;
logL0 is the log likelihood for the reduced model;
LRT is 2[logL - logL0], which is distributed as a 50:50 mixture of 0 and a chi-squared distribution with 1 degree of freedom;
df is the degrees of freedom for the chi-squared test;
Pval is the P-value;
n is the sample size.
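To pull individual values out of an .hsq file without opening it by hand, something like the following could be used. It is a minimal sketch, not part of the pipeline: the function names `hsq_value` and `hsq_recompute` are hypothetical, and the code assumes the usual GCTA layout in which each row starts with the parameter label and the estimate sits in the second whitespace-separated column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the estimate for one row of a GCTA .hsq file.
# Assumes the row label is in column 1 and the value in column 2.
# Returns a non-zero status if the label is not present.
hsq_value() {
    local hsq_file="$1" label="$2"
    awk -v key="$label" '
        $1 == key { print $2; found = 1; exit }
        END { if (!found) exit 1 }
    ' "$hsq_file"
}

# Hypothetical helper: recompute V(G)/Vp and LRT = 2[logL - logL0] from the
# other rows, as a consistency check against the values in the file.
hsq_recompute() {
    local hsq_file="$1"
    awk '
        $1 == "V(G)"  { vg = $2 }
        $1 == "Vp"    { vp = $2 }
        $1 == "logL"  { l1 = $2 }
        $1 == "logL0" { l0 = $2 }
        END {
            if (vp + 0 == 0) { print "Vp missing or zero" > "/dev/stderr"; exit 1 }
            printf "h2=%.4f LRT=%.3f\n", vg / vp, 2 * (l1 - l0)
        }
    ' "$hsq_file"
}
```

For example, `hsq_value /path/to/your/file.hsq "V(G)/Vp"` prints the heritability estimate, and `hsq_recompute /path/to/your/file.hsq` prints the two recomputed quantities, which should agree with the V(G)/Vp and LRT rows up to rounding.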

Here is the example:

Check, compress and encrypt result files for upload

To check that everything ran successfully and to compress all the output files from scripts 10 and 11, please run:

./11c-check_compress_data.sh -c /path/to/your/config/file

When the script completes successfully, two files, namely AgeSmokGWAS2025_${study_name}.tgz.md5sum and AgeSmokGWAS2025_${study_name}.tgz.gpg, will have been generated and are ready for upload to Google Drive. Please follow the steps below to finish the upload:

Download both files to your local machine.
Upload AgeSmokGWAS2025_${study_name}.tgz.md5sum and AgeSmokGWAS2025_${study_name}.tgz.gpg via this link: https://drive.google.com/drive/folders/1CvrU4qDNJSS2J8a_MtlGm33UCDzVby5Y?usp=drive_link
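Before downloading, it may be worth confirming that both files exist and are not empty. The sketch below is an illustration under assumptions: the function name `check_upload_files` is hypothetical, and the directory in which the pipeline writes the two files is not stated here, so it is passed in as an argument together with your study name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that both upload files exist and are non-empty.
# Arguments: the directory that holds the files (an assumption; use wherever
# your run wrote them) and the study name used in the file names.
check_upload_files() {
    local dir="$1" study_name="$2" status=0 f
    for f in "AgeSmokGWAS2025_${study_name}.tgz.md5sum" \
             "AgeSmokGWAS2025_${study_name}.tgz.gpg"; do
        if [ -s "${dir}/${f}" ]; then
            echo "OK       ${dir}/${f}"
        else
            echo "MISSING  ${dir}/${f}" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running `check_upload_files /path/to/output/dir your_study_name` prints one line per file and returns a non-zero status if either file is absent or empty.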

Please note that this step replaces the need to use the check_upload.sh script.

Thank you so much. We hope you've enjoyed running our pipeline.