Statistical test in snpsettest - HimesGroup/snpsettest GitHub Wiki
For set-based association tests, the snpsettest package employs the
statistical model described in VEGAS (versatile gene-based
association study) [1], which takes as input variant-level p
values and reference linkage disequilibrium (LD) data. Briefly, the test
statistic is defined as the sum of squared variant-level Z-statistics.
Letting $z_i$ denote the Z-score of an individual SNP $i$, for
$i = 1, \dots, p$ within a set $s$, the test statistic $Q_s$ is

$$Q_s = \sum_{i=1}^{p} z_i^2$$
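
Since the input is variant-level p values rather than Z-scores, a minimal sketch of computing the statistic might look like the following. It assumes the p values are two-sided, so that each squared Z-score can be recovered from the normal quantile of half the p value; the helper names `z_from_pvalue` and `set_statistic` are illustrative and are not part of the snpsettest API.

```python
from statistics import NormalDist


def z_from_pvalue(p):
    """Absolute Z-score for a two-sided p value (assumed convention)."""
    # Using the lower tail p/2 avoids the precision loss of 1 - p/2
    # when p is very small; the sign is irrelevant once squared.
    return -NormalDist().inv_cdf(p / 2.0)


def set_statistic(pvalues):
    """Sum of squared variant-level Z-scores for one SNP set."""
    return sum(z_from_pvalue(p) ** 2 for p in pvalues)


# Three hypothetical SNPs in one set.
q_s = set_statistic([0.05, 0.5, 0.01])
print(round(q_s, 3))
```

The sign of each Z-score does not matter here because only the squares enter the sum.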
Here, $Z = (z_1, \dots, z_p)'$ is a multivariate normal random vector with
mean vector $\mu$ and covariance matrix $\Sigma$, where $\Sigma$
represents LD among SNPs. To test a set-level association, we need to
evaluate the distribution of the test statistic $Q_s$. VEGAS uses
Monte Carlo simulations to approximate the distribution of $Q_s$ (directly
simulating $Z$ from the multivariate normal distribution) and thus computes
a set-level p value. However, its use is hampered in practice when set-based
p values are very small, because the number of simulations required to obtain
such p values is very large. The snpsettest package uses a
different approach to evaluate the distribution of $Q_s$ more
efficiently.
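
To make the limitation of simulation concrete, here is a small sketch of a Monte Carlo set-level p value in the spirit of the approach described above. It is not the VEGAS implementation: the function name `monte_carlo_set_pvalue`, the `(count + 1) / (n_sim + 1)` estimator, and the toy 2x2 LD matrix are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def monte_carlo_set_pvalue(q_obs, ld, n_sim=10_000, seed=0):
    """Estimate Pr(Q >= q_obs) under the null by simulating Z ~ N(0, ld)."""
    rng = np.random.default_rng(seed)
    p = ld.shape[0]
    z = rng.multivariate_normal(np.zeros(p), ld, size=n_sim)
    q_sim = (z ** 2).sum(axis=1)  # one simulated statistic per draw
    # Add-one estimator: never returns exactly zero.
    return (1 + np.sum(q_sim >= q_obs)) / (n_sim + 1)


ld = np.array([[1.0, 0.5],
               [0.5, 1.0]])  # toy LD (correlation) matrix

p_moderate = monte_carlo_set_pvalue(3.0, ld)
p_extreme = monte_carlo_set_pvalue(100.0, ld)
print(p_moderate, p_extreme)
```

With `n_sim` draws, the smallest p value this estimator can report is `1 / (n_sim + 1)`, so resolving a p value near a genome-wide threshold would require a correspondingly large number of simulations.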
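Let $Y = \Sigma^{-\frac{1}{2}} Z$ (instead of $\Sigma^{-\frac{1}{2}}$,
we could use any decomposition that satisfies $\Sigma = CC'$ with a
non-singular matrix $C$ such that $Y = C^{-1} Z$).
Then,

$$E(Y) = \Sigma^{-\frac{1}{2}} \mu, \qquad \mathrm{Var}(Y) = \Sigma^{-\frac{1}{2}} \Sigma \Sigma^{-\frac{1}{2}} = I_p, \qquad Y \sim N(\Sigma^{-\frac{1}{2}} \mu, I_p)$$

Now, we posit $Y = \Sigma^{-\frac{1}{2}} Z$ so that $Z = \Sigma^{\frac{1}{2}} Y$
and express the test statistic $Q_s$ as a quadratic
form:

$$Q_s = \sum_{i=1}^{p} z_i^2 = Z'Z = Y' \Sigma^{\frac{1}{2}} \Sigma^{\frac{1}{2}} Y = Y' \Sigma Y$$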
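With the spectral theorem, $\Sigma$ can
be decomposed as follows:

$$\Sigma = U \Lambda U', \qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$$

where $U$ is an orthogonal
matrix. If we set $X = U'Y$, $X$ is a vector of
independent normal variables with unit variance,
since

$$E(X) = U' E(Y) = U' \Sigma^{-\frac{1}{2}} \mu, \qquad \mathrm{Var}(X) = U' \mathrm{Var}(Y) U = U' I_p U = I_p$$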
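Under the null hypothesis, $\mu$ is assumed to
be $0$, so that $X \sim N(0, I_p)$.
Hence,

$$Q_s = Y' \Sigma Y = Y' U \Lambda U' Y = X' \Lambda X = \sum_{i=1}^{p} \lambda_i x_i^2$$

where $X = (x_1, \dots, x_p)'$.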
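
The chain of identities above can be checked numerically. The sketch below builds the symmetric inverse square root of a toy LD matrix from its eigendecomposition, transforms one simulated Z vector, and confirms that the three expressions for the statistic agree. The 3x3 matrix and all variable names are illustrative assumptions, and NumPy is assumed to be available.

```python
import numpy as np

# Toy positive-definite LD (correlation) matrix for three SNPs.
ld = np.array([[1.0, 0.6, 0.2],
               [0.6, 1.0, 0.4],
               [0.2, 0.4, 1.0]])

# Spectral decomposition: ld = U diag(lam) U'.
lam, U = np.linalg.eigh(ld)

# Symmetric inverse square root of the LD matrix.
sqrt_inv = U @ np.diag(lam ** -0.5) @ U.T

rng = np.random.default_rng(1)
z = rng.multivariate_normal(np.zeros(3), ld)  # one null draw of Z

y = sqrt_inv @ z   # whitened vector Y
x = U.T @ y        # rotated vector X

q_direct = z @ z                  # sum of squared Z-scores
q_quadratic = y @ ld @ y          # Y' Sigma Y
q_spectral = np.sum(lam * x ** 2) # sum of lambda_i * x_i^2

print(q_direct, q_quadratic, q_spectral)
```

For a correlation matrix the eigenvalues sum to the number of SNPs, which is also the null expectation of the statistic.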
Thus, the null distribution of $Q_s$ is a linear
combination of independent chi-square variables $x_i^2 \sim \chi^2_{(1)}$
(i.e., a central quadratic form in independent normal variables). For
computing the probability $Pr(Q_s > q)$ for a scalar $q$,
several methods have been proposed, such as numerical inversion of the characteristic function [2]. The snpsettest package uses the algorithm of Davies [3] or saddlepoint approximation [4] to obtain set-based p values.
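
As an illustration of the characteristic-function route, the sketch below evaluates the upper tail probability of a weighted sum of independent one-degree-of-freedom chi-square variables using Imhof's inversion formula for the central case, integrated with a plain truncated midpoint rule. It is neither Davies's algorithm nor the saddlepoint approximation used by the package. The function name `imhof_upper_tail`, the truncation point `upper`, and the step size `step` are assumptions chosen for illustration, and this simple quadrature should not be trusted for very small p values or for sets with fewer than two clearly positive eigenvalues. NumPy is assumed to be available.

```python
import numpy as np


def imhof_upper_tail(q, lam, upper=1000.0, step=2e-3):
    """Approximate Pr(sum_i lam_i * chi2_1 > q) via Imhof's formula.

    Pr(Q > q) = 1/2 + (1/pi) * integral_0^inf sin(theta(u)) / (u * rho(u)) du
    theta(u)  = 1/2 * sum_i arctan(lam_i * u) - q * u / 2
    rho(u)    = prod_i (1 + lam_i^2 * u^2)^(1/4)
    """
    lam = np.asarray(lam, dtype=float)
    n = int(upper / step)
    u = (np.arange(n) + 0.5) * step      # midpoints, so u = 0 is never hit
    lu = np.outer(u, lam)                # shape (n, number of eigenvalues)
    theta = 0.5 * np.arctan(lu).sum(axis=1) - 0.5 * q * u
    rho = np.prod((1.0 + lu ** 2) ** 0.25, axis=1)
    integral = np.sum(np.sin(theta) / (u * rho)) * step
    return 0.5 + integral / np.pi


# Sanity checks against closed forms: equal weights give a chi-square.
p_chi2_2 = imhof_upper_tail(4.0, [1.0, 1.0])            # exp(-2)
p_chi2_4 = imhof_upper_tail(6.0, [1.0, 1.0, 1.0, 1.0])  # 4 * exp(-3)
print(p_chi2_2, p_chi2_4)
```

In practice the weights passed as `lam` would be the eigenvalues of the LD matrix for the set, and `q` would be the observed sum of squared Z-scores.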
References
1. Liu JZ, Mcrae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A Versatile Gene-Based Test for Genome-wide Association Studies. Am J Hum Genet. 2010 Jul 9;87(1):139–45.
2. Duchesne P, De Micheaux P. Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Comput Stat Data Anal. 2010;54:858–62.
3. Davies RB. Algorithm AS 155: The Distribution of a Linear Combination of Chi-square Random Variables. J R Stat Soc Ser C Appl Stat. 1980;29(3):323–33.
4. Kuonen D. Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika. 1999;86(4):929–35.