Case study: multi SNP fine mapping - xqwen/dap GitHub Wiki

Multi-SNP Fine-mapping Analysis

It is important to note that multi-SNP fine-mapping analysis should be performed for each locus, all the data files should be organized for each locus separately before running the analysis.

Sample Data Download

A sample data set of gene ENSG00000112799 from the GEUVADIS project is included in the sample_data directory.

Sample phenotype-genotype data file from the GEUVADIS project: ENSG00000112799.dat
Sample grid file for cross-population eQTL analysis: grid
Sample prior file (generated by torus) accounting for SNP position to TSS only: ENSG00000112799.null.prior
Sample prior file (generated by torus) accounting for both SNP position to TSS and CENTiPEDE annotation for TF binding: ENSG00000112799.binding.prior

Input Data Format

Genotype-Phenotype Data File (Required)

This data file contains individual-level phenotype and genotype information required by genetic association analysis, and in plain text format. The data file allows information multiple subgroups (e.g. in a meta-analytic setting). Importantly, each data file for multi-SNP fine-mapping analysis should (ideally) be corresponding to a single locus.

The phenotype section contains lines starting with the keyword "pheno". Each line has the following format

  pheno  pheno_id subgroup_id pheno_ind_1 pheno_ind_2 ... pheno_ind_n

The pheno_id field contains a character string that denotes the name of the phenotype. The group_id field is a character string that uniquely labels a specific group. Note that the group IDs should be consistently used in the data file. The following entries are numerical values of phenotypes for individual 1 to n in the indicated subgroup. Note, because each subgroup can have different number of individuals, the length of each line can differs.

The genotype section directly follows the phenotype section. Each line contains the genotype information of a SNP for the samples in a subgroup, i.e.,

  geno snp_id  group_id geno_ind_1 geno_ind_2 ... geno_ind_n

The leading "geno" is a keyword that indicates the line encodes genotypes. The snp_id field contains a character string that denotes the ID of a SNP. The additional group_id field indicates the particular subgroup in which genotypes are measured. The remaining entries are genotypes of the target SNP for individual 1 to n in the subgroup coded in dosage format (i.e, 0,1 or 2, or any fractional numbers between [0,2] if the genotype is imputed).

In addition, variables to be controlled for can be added into input file. The format for covariates input is

  controlled covariate_id group_id value_ind_1 value_ind_2 ... value_ind_n

The keyword "controlled" indicates that the variable is a convariate to be controlled for.

Note that if the data only contains a single group, the group_id becomes nuisance. Nevertheless, it is still required.

Missing values in phenotype and genotype data are allowed, and they should be represented by "NA".

Effect Size Grid File (Required)

The grid file contains information on prior effect size specifications for causal variants. In the fine-mapping analysis, the effect size of each causal variant is treated as a nuisance parameter and integrated out. In our model, the effect size is defined on the scale of signal-noise ratio, and therefore unit-less. For the general meta-analytic setting, two (hyper-)parameters, phi and omega, are required: phi represents the heterogeneity of a genetic effect across multiple subgroups, and omega describes the average effect size. For commonly-used fixed-effect meta-analysis, simply specify phi=0. Alternatively, omega^2/(omega^2 + phi^2) represents the correlation of effects between subgroups, and (omega^2 + phi^2) represents the overall genetic effect which can be linked to heritability (for a SNP with frequency f, the hertiability explained by the single SNP association is roughly 2*f*(1-f)*(omega^2+phi^2)/[1+2*f*(1-f)*(omega^2+phi^2)]. In practice, we use a grid of (phi, omega) combinations to describe the long-tail behavior of the genetic effects observed in practice. If the data only contains a single study, only omega needs to be specified and phi can be conveniently set to 0. The mathematical details on the effect size prior specification can be found in Wen and Stephens, 2014 and Flutre et al,2013.

The grid file contains a two-column data matrix: the first column specifies phi and the second column specifies omega the average effect size (hyper)-parameter (omega). Each row of the grid data matrix provides a unique combination. A sample grid file for fixed-effect meta-analysis is given below.

0.0000  0.1000
0.0000  0.2000
0.0000  0.4000
0.0000  0.8000
0.0000  1.6000

The sample file used in cross-population cis-eQTL analysis allows some effect size heterogeneity for eQTLs across population groups.

Prior File (Optional)

The prior file specifies SNP-level prior for each candidate SNP. This file is optional, if not specified, the analysis assumes that prior for each SNP is 1/p, where p is the number of candidate SNPs in the locus. Note that, this default prior implies that the prior expected number of causal SNPs is 1 in the locus. For loci identified in the QTL discovery step, this default prior is likely conservative. Nevertheless, we highly recommend that users specify the SNP-level priors, especially when relevant genomic annotations are used, and their enrichment levels are quantified in the enrichment analysis step.

The prior file takes the following simple format

chr6.6488577  3.033e-04
chr6.6488581  3.033e-04
chr6.6488609  3.033e-04
chr6.6488817  3.033e-04
chr6.6488818  3.033e-04
chr6.6489565  2.575e-04
chr6.6489624  2.575e-04
chr6.6489755  2.575e-04
chr6.6489774  2.575e-04

The first column represents the SNP name and the second column represents the prior probability that the SNP is associated.

Automatic Prior Computation by TORUS

The SNP-level priors can be computed by the executable torus in the process of enrichment analysis. Essentially, the prior is computed using an empirical Bayes approach by plugging in the point estimate of the enrichment parameters. Use the command-line option "dump_prior direcotory_name" in either enrichment or QTL discovery analysis. Once the option specified, torus outputs prior files for each locus in the specified directory. e.g.,

torus -d geuv.summary.bf.gz --load_bf -smap geuv.snp.map.gz -gmap geuv.gene.map.gz -annot geuv.annot.gz  -est -dump_prior gene_prior > geuv.enrichment.est

Running Multi-SNP Fine-mapping Analysis

Running multi-SNP fine-mapping analysis requires binary executable dap. To start the analysis, type the following command

dap -d ENSG00000112799.dat -g grid -t 8 -it 0.05 -prior ENSG00000112799.binding.prior> ENSG0000112799.fm.rst

Important Command-line Options

The minimum required command-line options for running the analysis are "-d" (followed by phenotype-genotype data file name) and "-g" (followed by grid file name). Other options are all optional, however important for efficient computation. Here are some options are particularly important for efficiency:

-t thread_number: specify number of parallel threads in computation. Using parallel processing can greatly improve the efficiency of the fine-mapping analysis, DAP is implemented using OpenMP library and can take advantage of multi-thread computation. Because of the computational intensity, we strongly recommend running DAP in a multi-core environment. By default, the thread_number is set to 1. On an eight-core computer, running the sample data set using 8 parallel thread takes about 21 seconds real time; the single-thread execution takes about 100 seconds.
-it inclusion_threshold: the inclusion_threshold is a parameter that DAP algorithm used to screen marginally or conditionally noteworthy candidate SNPs. It is defined in a probability scale ranging from 0 to 1. If it is set to 0, DAP regards all candidate SNPs as noteworthy, and carries out the exact calculation (without approximation), which is computationally most expensive and typically not desired; when it is set to a relatively large value, e.g., 0.25, the behavior of the DAP becomes similar to commonly applied conditional analysis, which is fast but does not fully account for LD or explore the model space sufficiently. The default value is set to 0.01. However if a locus contains large number of SNPs and/or the LD pattern within a locus is complicated, the running time could be long. in such cases, we recommend to set a modest inclusion_threshold at 0.05, with which we observe very insignificant precision loss (comparing to the default 0.01 value).

Interpretation of Results

The output from the multi-SNP analysis contains full posterior information on possible association models and the inclusion probabilities of each individual SNP. A sample output is provided for reference.

The first portion of the output ranks the high-probability association models


    1   1.7496e-01    3     47.584   [chr6.6589277] [chr6.6626263] [chr6.6643657]
    2   1.1132e-01    3     47.388   [chr6.6589277] [chr6.6626263] [chr6.6640640]
    3   6.9947e-02    3     47.186   [chr6.6589277] [chr6.6626263] [chr6.6646966]
    4   4.7707e-02    3     47.020   [chr6.6589277] [chr6.6626263] [chr6.6641035]
    5   4.2509e-02    3     46.970   [chr6.6587194] [chr6.6626263] [chr6.6643657]
    6   3.8585e-02    3     46.928   [chr6.6588881] [chr6.6626263] [chr6.6643657]
    7   3.5041e-02    3     46.886   [chr6.6585976] [chr6.6626263] [chr6.6643657]
    8   3.2896e-02    3     46.858   [chr6.6589277] [chr6.6626263] [chr6.6644538]

The second column indicates the posterior probability of each association model, the third column represents the size (i.e., the number of member SNPs) of the corresponding model. Column 4 represents the un-normalized posterior score at the log10 scale, i.e., log10(Prior) + log10(BF), and the remaining of the row lists the members of the corresponding association model. An important feature to note is that the posterior probabilities on association models are typically spread out, e.g., the absolute value for the maximum-a-posteriori model is generally not very high. This is mostly because high LD SNPs are roughly "interchangeable": e.g., in the above example, SNPs chr6.6643657, chr6.6640640 and chr6.6646966 are in perfect LD, the posterior probability difference among the top three models are only due to the annotation difference.

The second portion of the output provides some useful summary posterior statistics

Posterior expected model size: 3.039 (sd = 0.200)
LogNC = 111.30984 ( Log10NC = 48.341 )

The posterior expected model size indicates the posterior expected number of independent association signals within the locus of interest. In the above example, this summary statistics indicates that there are roughly 3 independent association signals co-existing in the locus. logNC (and log10NC) gives estimated normalizing constant.

The last section of the output provides posterior inclusion probabilities for highly ranked SNPs

    1 chr6.6626263   9.99999e-01      6.125
    2 chr6.6589277   4.60582e-01     38.218
    3 chr6.6643657   3.96754e-01      4.253
    4 chr6.6640640   2.59461e-01      2.543
    5 chr6.6646966   1.59648e-01      1.734
    6 chr6.6587194   1.18782e-01     38.022
    7 chr6.6641035   1.08210e-01      2.412

The first and the second column gives the rank and the id of the SNPs, the third column represents the posterior inclusion probability (PIP) marginalized from the posterior model probabilities. The last column gives log10 Bayes factor of the corresponding SNP in single-SNP testing. By default, the output only contains SNPs whose PIPs > 1e-3. The user can opt to output PIPs for all the candidate SNPs by setting the command-line option "-all".

Utilities Aiding Posterior Interpretation

Although the output files contain the full posterior information on the fine-mapping analysis, those results sometimes can still be difficult to digest. One of the frequently asked questions is how to identify independent signal clusters and member SNPs for each cluster. We provide a few utility scripts in utility directory to facilitate posterior interpretations.

cluster\_pip.pl: this perl script takes dap output and uses an ad-hoc algorithm (described in Wen et al 2015 supplementary text section S.4) to cluster SNPs into user defined number of independent signal clusters. Internally, the perl script invokes an R script cluster\_pip.R to perform hierarchical clustering. The path of cluster_pip.R should be specified in the perl script or on the command line. The syntax for running the automatic clustering script is:

cluster_pip.pl -d dap_output_file -c cluster_number

An sample output is shown below

   chr6.6589277   0.46010  1   1.03732   38.218
   chr6.6587194   0.11850  1   1.03732   38.022
   chr6.6588881   0.10760  1   1.03732   38.022
   chr6.6585976   0.09394  1   1.03732   38.218
   chr6.6583181   0.08269  1   1.03732   38.218
   chr6.6586605   0.05558  1   1.03732   38.022
   chr6.6582862   0.04892  1   1.03732   38.022
   chr6.6582001   0.04681  1   1.03732   38.218
   chr6.6581770   0.01715  1   1.03732   37.830
   chr6.6659340   0.00603  1   1.03732    2.342
   chr6.6626263   0.99900  2   0.99900    6.125
   chr6.6643657   0.39640  3   0.99767    4.253
   chr6.6640640   0.25920  3   0.99767    2.543
   chr6.6646966   0.15940  3   0.99767    1.734
   chr6.6641035   0.10810  3   0.99767    2.412
   chr6.6644538   0.07457  3   0.99767    4.072

The first and second columns represent the SNP name and individual SNP PIP value, respectively. The 3rd and 4th columns provide cluster id and cluster-level PIPs (by summing up the SNP-level PIPs within the cluster), and the last column represents log10 Bayes factor from the single-SNP association testing. Note that as an ad-hoc algorithm, there is no guarantee that cluster-level PIPs are bounded by 1, but they are not expected to be much larger than 1 if the number of clusters specified is reasonable. This clustering algorithm does not use LD information, one should also examine the genotype LD between SNPs within each cluster to confirm the validity of the automatic clustering (also as an independent check).