Case study: enrichment analysis in GWAS - xqwen/dap GitHub Wiki

Enrichment Analysis in GWAS

In this example, we study the enrichment of functionally annotated genetic variants associated with high density cholesterol (HDL). In this analysis, we use the z-scores of single-SNP association testing of HDL association (Pickrell, 2014) and functional annotations derived from ENCODE data by Gusev et al, 2014.

Sample Data Download

z-scores from single-SNP association testing: HDL.z_score.gz
SNP annotation file: 1000G.annot.gz

Input Data Format

Z-scores from single-SNP association testing

Z-scores from single-SNP association testing are organized in the following formt

   chr1:998395  Loc1  -0.178471
   chr1:1000156  Loc1  -0.169669
   chr1:1001177  Loc1  -0.247359
   chr1:1002932  Loc1  -0.240580
   chr1:1003629  Loc1  -0.169000
   chr1:1004957  Loc1  -1.145393
   chr1:1004980  Loc1  -1.145393
   chr1:1006223  Loc1  -1.174756

The first column denotes the SNP IDs, and the second column indicate the LD block of the corresponding SNP. Note, the LD blocks are defined based on the results of Berisa and Pickrell, 2015. The last column represents the z-scores.

SNP annotation file

The SNP annotation file contains SNP-level genomic annotations used by TORUS analysis. The annotation file uses a header to specify the number and the nature (categorical or continuous) of the anntations. For example,

SNP  annot_d
chr1:226580387 5
chr1:162736463 5
chr1:222359612 0
chr1:157255396 0
chr1:95166832 0
chr1:66857915 0
chr1:63432716 4
chr1:8640831 5
chr1:209894785 5

The first column with the header "SNP" represents the SNP name. The following columns represent specific annotations.For categorical/discrete annotations, the header should always have a suffix "_d"; whereas for continuous annotations, the header should ends with "_c". Note that, if a SNP is not annotated (i.e. not appeared) in the annotation file, the default category 0 (i.e., the baseline) is assigned. Nevertheless, we strongly recommend users to annotate all the candidate SNPs.

In this particular annotation file, the code for the categories represents: 1-coding SNP; 2-utr region; 3-promoter region; 4-DHS region; 5-Intron; 0-baseline/all others.

Finally, all input files should be gzipped.

Running Enrichment Analysis

The compiled binary executable torus is required to run the enrichment analysis. Use the following command to start the enrichment analysis

 torus -d HDL.z_score.gz -annot 1000G.annot.gz  -est --load_zval > HDL.enrichment.est

In particular,-est instructs TORUS to output the 95% confidence intervals for each estimated enrichment parameter; --load_zval informs TORUS that the input summary-statistics are z-scores (alternatively, Bayes factors can be pre-computed).

Output from enrichment analysis

The results for enrichment analysis is directly output to the screen, and can be re-directed to a file (in our example, "gtex_liver.enrichment.est"). The output has the following format

Intercept    -11.523        (-11.549, -11.497)
  annot.1      4.684        (2.010,     7.359)
  annot.2      4.585        (0.848,     8.321)
  annot.3      3.940        (2.323,     5.558)
  annot.4      1.688        (0.475,     2.902)
  annot.5      1.550        (0.959,     2.141)

The first column represents the annotation name and its corresponding level (for a categorical variable). The second column is the point estimate (MLE) of the log odds ratio. Columns 3-4 represent the 95% confidence interval for the corresponding point estimate. It is a feature of GWAS enrichment analysis that confidence interval can be much larger in comparison to molecular QTL mapping (due to relatively less strong association signals).