6. Stratified Genomic SEM - GenomicSEM/GenomicSEM GitHub Wiki

Stratified Genomic SEM

There are three primary steps necessary to estimate enrichment of parameters in a Genomic SEM framework:

  1. Select and munge the summary statistics using the munge function.
  2. Select the functional annotations of interest and run multivariable Stratified LD-score regression (S-LDSC) using the s_ldsc function.
  3. Specify the model and estimate enrichment for a particular model parameter using the enrich function.

Step 1: Munge Summary Statistics

For a description and example of how to munge summary statistics, please refer to the Models without individual SNP Effects page. For this example, we use the same five summary statistics (bipolar, anxiety, PTSD, schizophrenia, major depression) used for the p-factor example from that same wiki page. Note that we do not substantively endorse this particular model, are using it as an example to maintain consistency across the wiki examples. The remainder of this page walks through steps 2 and 3.

Step 2: Run Multivariable Stratified LD-Score Regression

The second step in Stratified Genomic SEM is to run multivariable S-LDSC using the s_ldsc function in order to obtain zero-order genetic covariance (S0) matrices, and corresponding sampling covariance matrices (V0), partitioned across the chosen functional annotations. Functional annotations refer to genetic variants grouped together based on some shared characteristic, such as where these variants are expressed, their allele frequency, etc.

Functional annotations can be created using steps outlined by the original developers of LDSC on their GitHub (https://github.com/bulik/ldsc/wiki/LD-Score-Estimation-Tutorial). It is not necessary that you create your own annotations as a wide range of pre-created annotations are available through the same research group at: https://data.broadinstitute.org/alkesgroup/LDSCORE/. For the current example, we use their most recent set of 97 baseline annotations from the file: 1000G_Phase3_baselineLD_v2.2_ldscores.tgz.

It is also necessary to provide a folder of non-partitioned LD scores used for regression weights along with a frequency folder used to restrict to SNPs with MAF > 5%. Note that the annotation, non-partitioned LD score, and frequency folders will often contain multiple files split by across chromosomes. For this reason, you are providing the directory to the folder of files, and not an individual file. From the same link used to obtain the baseline annotations, we use the 1000G_Phase3_weights_hm3_no_MHC.tgz. folder of regression weights files and the 1000G_Phase3_frq.tgz folder of allele frequency files for restricting to MAF > 5% SNPs.

The s_ldsc function takes 6 necessary arguments and 4 optional arguments:

  1. traits: a vector of file names/paths to files which point to the munged sumstats.

  2. sample.prev: A vector of sample prevalences of length equal to the number of traits. Sample prevalence is calculated as the number of cases over the total number of participants (i.e., cases + controls). Possible range = 0-1. HOWEVER, if you have access to the sum of effective sample sizes then this should be entered for munge in the prior step and sample prevalence should be entered as 0.5. If the trait is continuous, the values should equal NA.

  3. population.prev: A vector of population prevalences. These estimates can be obtained from a number of sources, such as large scale epidemiological studies. Possible range = 0-1. Again, if the trait is continuous the values should equal NA.

  4. ld: A folder of partitioned LD scores used as the independent variable in S-LDSC. Multiple annotations, or folders of annotations, can be specified simultaneously but we strongly recommend that a baseline set of annotations always be included. Here we use: 1000G_Phase3_baselineLD_v2.2_ldscores/

  5. wld: A folder of non-partitioned LD scores used as regression weights. Here we use: 1000G_Phase3_weights_hm3_no_MHC/

  6. frq: A folder of allele frequency files used to restrict to MAF > 5%. Here we use: 1000G_Phase3_frq/

  7. trait.names: An optional argument specifying the trait names. This allows for model specification using the trait names in later steps, which can be useful for keeping track of results when the number of traits becomes large. If this argument is not specified, the function will automatically name the traits in the general form V1-VX.

  8. n.blocks: An optional argument specifying number of blocks used for the jackknife resampling procedure used to obtain standard errors. Default is set to 200.

  9. ldsc.log: An optional argument specifying how to name the .log file produced by the function.

  10. exclude_cont: An optional argument specifying whether continuous annotations should be excluded when estimating S-LDSC. The default is set to TRUE as continuous annotations require special consideration.

#specify the arguments
traits <- c("scz.sumstats.gz", "bip.sumstats.gz", "mdd.sumstats.gz", "ptsd.sumstats.gz", "anx.sumstats.gz")
sample.prev <- c(.39,.45,.35,.25,.32)
population.prev <- c(.01,.01,.16,.08,.29)
ld=c("1000G_Phase3_baselineLD_v2.2_ldscores/")
wld="1000G_Phase3_weights_hm3_no_MHC/"
frq="1000G_Phase3_frq/"
trait.names<-c("SCZ", "BIP", "MDD", "PTSD", "ANX")

#run multivariable s_ldsc
S_LDSCOutput<-s_ldsc(traits=traits,
               sample.prev=sample.prev,
               population.prev=population.prev ,
               ld=ld,
               wld=wld,
               frq=frq,
               trait.names = trait.names)

#save the output as an .RData object
save(S_LDSCOutput, file = “S_PFactor.RData”)

The output (named S_LDSCOutput here) is a list object that contains:

  1. S_LDSCOutput$S : the zero-order genetic covariance matrices for each annotation.
  2. S_LDSCOutput$V: the zero-order sampling covariance matrices for each annotation.
  3. S_LDSCOutput$S_Tau: the tau matrices for each annotation. The tau matrices differ from the zero-order matrices in that they control for overlap across the annotations. As the tau matrices require extremely large sample sizes, the default behavior of the enrich function is to use the zero-order matrices as input.
  4. S_LDSCOutput $V_Tau: the tau sampling covariance matrices for each annotation.
  5. S_LDSCOutput$I: the matrix of LDSC intercepts and cross-trait (i.e., bivariate) intercepts.
  6. S_LDSCOutput $N: the sample sizes (N) for the heritabilities and sqrt(N1N2) for the co-heritabilities.
  7. S_LDSCOutput$m: the number of SNPs used to construct the LD scores for each annotation.
  8. S_LDSCOutput$Prop; the proportional size of each annotation relative to the annotation containing all SNPs.
  9. S_LDSCOutput$Select: a data.frame that codes flanking window and continuous annotations as 2 and all other annotations as 1. This is used by the enrich function to exclude the flanking window and continuous annotations from enrichment estimates.

If you are going to be running a number of Stratified Genomic SEM models it may be useful to save the output of the multivariable LDSC function as an .RData object for later use so the function need not be run again. An example of this piece of R code is shown at the end of the script above. Running s_ldsc will also produce a .log file comparable to the .log file for regular ldsc.

Step 3: Estimate Enrichment of a Model Parameter

The enrich function works by estimating the model in the zero-order genetic covariance matrix from the genome-wide annotation (i.e., the annotation containing all SNPs), fixing all estimated regression paths (including factor loadings) from the genome-wide model, except for those specified as being under investigation for enrichment, and then estimating the model with these fixed paths in each annotation. This collectively produces the necessary components to estimate the so-called enrichment ratio. More specifically, point estimates and standard errors are obtained for a particular model parameter within functional annotations that is first divided by the point estimate in the genome-wide annotation, and then divided again by the proportion of SNPs within that annotation. The null for enrichment is 1. Note that enrichment will not be estimated for flanking window or continuous annotations and that these are removed from the output by default.

The enrich function takes three necessary arguments and seven optional arguments:

  1. s_covstruc: The output from the s_ldsc function.

  2. model: The overall model specified using lavaan syntax (see Models without SNP effects page for more details on lavaan syntax).

  3. params: The parameter(s) being estimated for enrichment, specified using lavaan syntax. When multiple parameters are simultaneously examined for enrichment (e.g., enrichment of multiple factors in a correlated factors model) the output is a list object, with each list containing the output for a single parameter. Here we examine enrichment of the variance of the common factor.

  4. fix: The types of parameters you want to be fixed from the model estimated in the genome-wide annotation. The default behavior is fixed = “regressions”, which will fix all regression estimates (including factor loadings), and freely estimate covariances and (residual) variances. This is the most appropriate setting if you are examining enrichment of latent factors. Other options include fixed = “covariances” or fixed = “variances”.

  5. std.lv: An optional argument specifying whether the user wants to use unit variance identification for any latent factors in the overall model (default = FALSE). This has the advantage of not having to manually fix the variance of latent variables to 1 in the model syntax. Note that if the parameter of interest is the enrichment of the variance of a latent variable, this will still be freely estimated within the functional annotations even when std.lv = TRUE.

  6. rm.flank: An optional argument specifying whether you want to automatically remove flanking window and continuous annotations from the output (default = TRUE).

  7. tau: An optional argument specifying whether you want to use the tau matrices instead of the zero-order matrices for enrichment estimates (default = FALSE).

  8. base: An optional argument specifying whether you want to include the full estimates from the baseline model in a separate list object included in the output (default = TRUE). This can be useful to inspect in order to determine whether the model estimated in the genome-wide annotation produced sensible results. This output includes a Fixed_Enrich column that clarifies what estimates were fixed vs. freely estimated when estimating the model in all annotations.

  9. toler: An optional argument specifying whether you want to manually set the tolerance used for matrix inversion. This is typically only necessary if you receive an error message about inversion of one of the matrices.

  10. fixparam: An optional argument specifying whether specific parameters should be fixed when estimating the model within annotations. For example, the default is for the program to freely estimate (residual) variances of variables even when those parameters may be fixed in your model. If you want these parameters to continue to be fixed across all annotations you would list that parameter for this argument.

#set s_covstruc argument to output from s_ldsc function 
s_covstruc<-S_LDSCOutput

#specify the model syntax for a common factor model
model<-"F1=~SCZ+BIP+MDD+PTSD+ANX
SCZ~~a*SCZ
a > .0001
BIP~~b*BIP
b > .0001
MDD~~c*MDD
c > .0001
PTSD~~d*PTSD
d > .0001
ANX~~e*ANX
e > .0001"

#specify the F1 factor variance as the parameter of interest
params<-c(“F1~~F1”)

#use unit variance identification
std.lv=TRUE

#estimate enrichment using the enrich function
penrich<-enrich(s_covstruc=s_covstruc,model=model,params=params,std.lv=std.lv)

As an initial run of the model produced warnings for some annotations that observed variable variances were estimated as negative, the constraints were added in the model syntax above for the residuals of each observed variable (e.g., ANX~~e*ANX; e > .0001). Running the above code creates an R object called penrich, the first 10 rows of which are shown directly below:

The enrichment and enrichment_SE column list the enrichment estimates and corresponding standard error. The enrichment_p_value column lists the 1-tailed p-value of (Enrichment-1)/SE. The error and warning columns list any errors and warnings raised by lavaan when running the model.

The Cov_Smooth and Z_smooth columns list the largest difference in pre- and post-smoothed values for the genetic covariance matrix and corresponding difference in Z-statistics, respectively. Particularly when including low powered traits or small annotations, there will likely be some amount of smoothing that occurs. You may consider removing these low powered traits from your model; this does not require re-running s_ldsc for a subset of traits as the enrich function will automatically remove any traits not named in your model. Alternatively, we recommend interpreting with extreme caution any annotations with a Z_smooth value > 1. As can be seen above in the first 10 rows of output, many of the annotations required smoothing that produced very large Z_smooth discrepancies. As PTSD and ANX were the two least powered traits in this example, we try removing these traits from the model and re-examine results in the code below.

#specify the modified model syntax removing PTSD and ANX
#all other arguments are the same as before
model2<-"F1=~SCZ+BIP+MDD
SCZ~~a*SCZ
a > .0001
BIP~~b*BIP
b > .0001
MDD~~c*MDD
c > .0001"

#estimate enrichment
penrich_sub<-enrich(s_covstruc=s_covstruc,model=model2,params=params,std.lv=std.lv)

This first 10 rows for penrich_sub are shown below:

While some smoothing still occurs, it is clear that removing the two least powered traits results in many annotations not being smoothed at all, and that the overall degree of smoothing is greatly reduced. However, there are still some annotations that may be removed from consideration due to the degree of smoothing (e.g., CTCF_Hoffman in row 4). This generally highlights the importance of choosing well powered traits for Stratified Genomic SEM analyses and S-LDSC more generally.

Optional Estimation of Full Model in a Specific Annotation

In some cases, you may be interested in estimating a model within a specific annotation. This would allow you to obtain point estimates, standard errors, and model fit indices for a given annotation, as opposed to an enrichment estimate like can be obtained from the enrich function outlined directly above. This can be done by saving the V and S matrix for a particular annotation into a list object, and then using that newly created list object as input to the usermodel function. Below we demonstrate how this would be done for the genome-wide annotation using the same model estimated for enrichment.

#create list object with V and S for genome-wide attention (called baseL2)
covstruc<-list(V_LD=s_covstruc$V$baseL2,S_LD=s_covstruc$S$baseL2)

#specify the modified model syntax removing PTSD and ANX
model2<-"F1=~SCZ+BIP+MDD
SCZ~~a*SCZ
a > .0001
BIP~~b*BIP
b > .0001
MDD~~c*MDD
c > .0001"

#estimate model
pfactor<-usermodel(covstruc=covstruc,model=model2,std.lv=TRUE)