3. Genome-wide Models

Genome-wide Models without Individual SNP Effects

There are three primary steps necessary to run a structural model that does not include SNP effects. The first is to munge the summary statistics for multivariable LD-Score regression. We note here that the user can reuse munged files previously produced by the original LD-Score regression package, but that our package is also capable of munging the summary statistics. When munging the summary statistics, it is also important to know that for case/control designs that were meta-analyzed across cohorts, we recommend using the sum of effective sample sizes. See Page 2.1 of the wiki for details. Finally, we note that you must use the same ancestral background across your summary statistics AND that this ancestral background must match that of the LD scores used for the ldsc function described below. Note that participant samples used to produce the summary statistics can range from entirely overlapping to entirely independent, and you DO NOT need to know the level of overlap in order to run Genomic SEM; this is one of the major advantages of the method.

Step 1: Munge the summary statistics

The first step in running a Genomic SEM model is to munge the summary statistics. The munge function works to convert the summary statistics to the format expected by LDSC (i.e., on a z-statistic metric). The summary statistics files input into the munge function at a minimum need to contain five pieces of information:

  1. The rsID of the SNP.
  2. An A1 allele column, with A1 indicating the effect allele.
  3. An A2 allele column, with A2 indicating the non-effect allele.
  4. Either a logistic or continuous regression effect.
  5. The p-value associated with this effect.

The package will automatically rename the columns based on commonly observed names, but may return an error if the file contains trait-specific column headers (e.g., RSID_SCHIZOPHRENIA). All traits can be munged at once, as in the example below. Here we use four traits: Anxiety (2019), Major Depression (2018), Alcohol Use Disorder (2018), and PTSD (2017). We have already done some reformatting of the summary statistics, including examples of how to appropriately calculate the sum of effective sample sizes, which we review on Page 2.1 of the wiki.

The munge function takes 6 arguments:

  1. files: The names of the summary statistics files

  2. hm3: The name of the reference file. Here we use HapMap 3 SNPs. This file can be obtained from https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v. Note that we previously used a reference file that removed the MHC (i.e., HLA) region, but upon further inspection found that the MHC region was also removed from the LD-score files used for estimating LD-Score regression in the following step, such that MHC is automatically excluded from analyses. You do NOT need to rerun analyses if you have been using the HapMap 3 SNPs with MHC removed, as you should obtain equivalent results.

  3. trait.names: The trait names that will be used to name the saved files

  4. N: The sample sizes associated with the traits. Note that for binary traits that reflect a meta-analysis across multiple cohorts this should reflect the sum of effective sample sizes across contributing cohorts, or the sample size column in the summary statistics should contain the SNP-specific sum of effective sample sizes. Here, effective sample size refers to 4v(1-v)n, where v is the sample prevalence (the proportion of cases); a worked sketch follows this list. When inputting the sum of effective sample sizes, the sample prevalence should then be entered as 0.5 when running ldsc to reflect the fact that effective sample size already corrects for sample ascertainment. If the input contains SNP-specific sample sizes for each row, then the munge function will only use this N if the user does not provide their own.

  5. info.filter: The INFO filter. The package default is to filter to SNPs with INFO > 0.9. If the summary statistics do not contain an INFO column the function can still be run, but results should be interpreted keeping in mind that this cleaning step was missed.

  6. maf.filter: The MAF filter. The package default is to filter to SNPs with MAF > 0.01.
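
As a quick illustration of the effective sample size formula referenced above, the following minimal sketch (the per-cohort case/control counts are made up for illustration) computes the sum of effective sample sizes across two contributing cohorts:

#hypothetical case/control counts for two contributing cohorts
cases<-c(15000,9000)
controls<-c(28000,31000)

#sample prevalence (v) and effective sample size 4v(1-v)n for each cohort
v<-cases/(cases+controls)
Neff<-4*v*(1-v)*(cases+controls)

#the sum of effective sample sizes to provide to munge via the N argument
sum(Neff)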

#load the GenomicSEM package
require(GenomicSEM)

#create vector of the summary statistics files
files<-c("MDD_withNeff.txt", "SORTED_PTSD_EA9_ALL_study_specific_PCs1.txt", "ANX_withNeff.txt","ALCH_withrsID.txt")

#define the reference file being used to align alleles across summary stats
#here we are using hapmap3
hm3<-"eur_w_ld_chr/w_hm3.snplist"

#name the traits 
trait.names<-c("MDD","PTSD","ANX", "ALCH")

#list the sample sizes. All but PTSD have SNP-specific sums of effective sample sizes,
#so only the PTSD sample size is listed here
N=c(NA,5831.346,NA,NA)

#define the imputation quality filter
info.filter=0.9

#define the MAF filter
maf.filter=0.01

#run munge
munge(files=files,hm3=hm3,trait.names=trait.names,N=N,info.filter=info.filter,maf.filter=maf.filter)

Step 2: Run multivariable LDSC

The second step is to run multivariable LD-Score regression to obtain the genetic covariance (S) matrix and corresponding sampling covariance matrix (V). This is achieved by running the ldsc function. We note that it would not be appropriate in this case to take output from runs of the original LD-Score regression package and construct the S and V matrices by hand, because the sampling covariances that occupy the off-diagonal elements of the V matrix could not be filled in. The ldsc function takes 5 necessary arguments:

  1. traits: A vector of file names/paths to the munged summary statistics (sumstats.gz) files.

  2. sample.prev: A vector of sample prevalences of length equal to the number of traits. Sample prevalence is calculated as the number of cases over the total number of participants (i.e., cases + controls). Possible range = 0-1. HOWEVER, if you have access to the sum of effective sample sizes then this should be entered for munge in the prior step and sample prevalence should be entered as 0.5. If the trait is continuous, the values should equal NA.

  3. population.prev: A vector of population prevalences. These estimates can be obtained from a number of sources, such as large scale epidemiological studies. Possible range = 0-1. Again, if the trait is continuous the values should equal NA.

4/5. ld and wld: A folder of LD scores used as the independent variable in LDSC (ld) and LDSC weights (wld). These are typically the same folder, and in the original LD score package it is called "eur_w_ld_chr". We use the same LD scores and weights for our application, though the user can supply their own if desired. Weights for the European population used here can be obtained by downloading the eur_w_ld_chr folder at the link below (note that these are the same weights provided by the original developers of LDSC): https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v If you receive an error when running ldsc about not being able to change your working directory, please be sure that you have specified the correct file path to the LD scores. Please note that the LD scores must match the ancestral background of the summary statistics provided (i.e., European summary statistics with European LD scores). It is not currently possible to run LD-Score regression with admixed ancestral backgrounds.

The ldsc function also takes six optional arguments:

  1. trait.names: An optional argument specifying the trait names. This allows for model specification using the trait names in later steps, which can be useful for keeping track of results when the number of traits becomes large. If this argument is not specified, the function will automatically name the traits in the general form V1-VX.

  2. chr: An optional argument specifying whether you are modeling genomic data from fewer than 22 chromosomes (e.g., for non-human populations).

  3. n.blocks: An optional argument specifying whether you want the function to use more than the 200 blocks used to produce the block jackknife standard errors. Note that an update was made on November 2nd, 2021 so that if > 18 traits are analyzed the function will automatically use > 200 blocks in order to produce accurate estimates of model fit in subsequent analyses.

  4. ldsc.log: An optional argument specifying how you want to name the ldsc.log file. The default is to name the log file using the file names of the munged summary statistics.

  5. stand: An optional argument specifying whether you want ldsc to also output the standardized S matrix (the genetic correlation matrix) and its sampling covariance matrix. Default = FALSE.

  6. select: An optional argument specifying whether you want ldsc to estimate using only odd (select = "ODD") or even (select = "EVEN") chromosomes. This can be helpful if you want to perform exploratory analyses on odd chromosomes and confirmatory analyses on even chromosomes as a hold-out sample. It can also be set to a vector of numbers, such as c(1,3,10), to run ldsc on a specific chromosome or chromosomes. Default = FALSE, in which case ldsc is estimated using all chromosomes. A sketch using these optional arguments follows the example below.

The following example takes the munged summary statistics for the p-factor indicators and runs multivariable LD-Score regression.

#vector of munged summary statistics
traits<-c("MDD.sumstats.gz","PTSD.sumstats.gz","ANX.sumstats.gz", "ALCH.sumstats.gz")

#enter sample prevalence of .5 to reflect that all traits were munged using the sum of effective sample sizes
sample.prev<-c(.5,.5,.5,.5)

#vector of population prevalences
population.prev<-c(.15,.08,.20,.159)

#the folder of LD scores
ld<-"eur_w_ld_chr/"

#the folder of LD weights [typically the same as folder of LD scores]
wld<-"eur_w_ld_chr/"

#name the traits
trait.names<-c("MDD","PTSD","ANX","ALCH")

#run LDSC
LDSCoutput<-ldsc(traits=traits,sample.prev=sample.prev,population.prev=population.prev,ld=ld,wld=wld,trait.names=trait.names)

#optional command to save the output as a .RData file for later use
save(LDSCoutput,file="LDSCoutput.RData")
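
The optional arguments described above can be added to the same call. For example, the following sketch (reusing the objects defined above; the object name LDSCoutput_odd is ours) also requests the standardized matrices and estimates on the odd chromosomes only, as might be done for an exploratory run:

#optional: output the standardized S matrix and estimate using odd chromosomes only
LDSCoutput_odd<-ldsc(traits=traits,sample.prev=sample.prev,population.prev=population.prev,ld=ld,wld=wld,trait.names=trait.names,stand=TRUE,select="ODD")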

The output (named LDSCoutput here) is a list with 5 named elements:

  1. LDSCoutput$S is the genetic covariance matrix (on the liability scale for case/control designs).
  2. LDSCoutput$V is the sampling covariance matrix in the format expected by lavaan.
  3. LDSCoutput$I is the matrix of LDSC intercepts and cross-trait (i.e., bivariate) intercepts.
  4. LDSCoutput$N contains the sample sizes (N) for the heritabilities and sqrt(N1N2) for the co-heritabilities. These are the sample sizes provided in the munging process.
  5. LDSCoutput$m is the number of SNPs used to construct the LD score.

If you are going to be running a number of Genomic SEM models, it may be useful to save the output of the multivariable ldsc function as an .RData object for later use so the function need not be run again; an example of this is shown at the end of the script above. If you want to output the standard errors of the LD-Score regression estimates in the order they appear in the genetic covariance (i.e., S) matrix, you can run the three lines of code below. These SEs are also listed in the .log file produced by ldsc.

#number of traits (rows/columns of the S matrix)
k<-nrow(LDSCoutput$S)
#empty matrix to hold the SEs
SE<-matrix(0, k, k)
#the diagonal of V holds the sampling variances of the lower triangle of S
SE[lower.tri(SE,diag=TRUE)]<-sqrt(diag(LDSCoutput$V))
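
Relatedly, the genetic correlation matrix implied by S can be obtained with base R even if the stand argument was not set to TRUE (a one-line sketch using the LDSCoutput object from above):

#convert the genetic covariance matrix to a genetic correlation matrix
cov2cor(LDSCoutput$S)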

Common Factor Model

The third step is to run the model using either one of the pre-packaged models or a user-specified model. We currently offer a pre-packaged common factor model, which we apply in this case. The common factor function is called commonfactor and takes only two arguments:

  1. covstruc: The output object from multivariable LD-Score regression.

  2. estimation: Whether you want to use Diagonally Weighted Least Squares (DWLS) or Maximum Likelihood (ML) estimation.

Below is the code to run a common factor model using the p-factor LDSC output with DWLS estimation. The GenomicSEM package automatically produces the code for the model, and uses unit variance identification (i.e., the variance of the latent factor is fixed to 1).

#To run using DWLS estimation#
CommonFactor_DWLS<- commonfactor(covstruc = LDSCoutput, estimation="DWLS")

#print CommonFactor_DWLS output#
CommonFactor_DWLS

$modelfit
      chisq df   p_chisq      AIC CFI       SRMR
df 1.283453  2 0.5263829 17.28345   1 0.03621694

$results
   lhs op  rhs Unstandardized_Estimate Unstandardized_SE Standardized_Est Standardized_SE      p_value
1   F1 =~  MDD             0.283806728        0.02100274       0.97326125      0.07202492 1.313524e-41
2   F1 =~ PTSD             0.181376110        0.03295498       0.45226835      0.08217452 3.717871e-08
3   F1 =~  ANX             0.445784774        0.03245611       0.92294074      0.06719626 6.265447e-43
4   F1 =~ ALCH             0.205225643        0.02432180       0.54969157      0.06514532 3.230092e-17
6  MDD ~~  MDD             0.004486552        0.01095668       0.05276254      0.12885241 6.821868e-01
7 PTSD ~~ PTSD             0.127932946        0.07187450       0.79545336      0.44689666 7.508430e-02
8  ANX ~~  ANX             0.034569541        0.03011034       0.14818043      0.12906629 2.509292e-01
9 ALCH ~~ ALCH             0.097270315        0.02589915       0.69783919      0.18580627 1.728339e-04

This produces an R object, CommonFactor_DWLS, that contains the two elements shown above:

  1. CommonFactor_DWLS$modelfit prints the model chi-square, degrees of freedom, the p-value for the chi-square test, Akaike Information Criterion (AIC), Comparative Fit Index (CFI), and Standardized Root Mean Square Residual (SRMR). The p-value for the chi-square test will oftentimes be significant, rejecting the null hypothesis that the model-implied covariance matrix does not significantly differ from the observed covariance matrix. However, this should not be cause for concern, as this particular test is highly sensitive for well-powered estimates of the S matrix, as is often the case when S and V are derived using LDSC of GWAS data from consortia or large biobanks. The chi-square may still be useful for formally comparing competing models, so long as they are mathematically nested.

  2. CommonFactor_DWLS$results prints the unstandardized and standardized results for the common factor model. For this example the variables are MDD = Major Depressive Disorder, PTSD = Post-traumatic Stress Disorder, ANX = Anxiety, ALCH = Alcohol Use Disorder. "F1 =~ V" lists the indicator loadings on the common factor. "V ~~ V" lists the residual variances of the indicators after removing variance explained by the common factor. In the standardized case of a common factor model, "V ~~ V" + ("F1 =~ V")^2 will sum to 1 for each indicator (a quick check follows this list).
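
As a quick numerical check of this property, using the MDD row of the standardized output above:

#squared standardized loading + standardized residual variance should be ~1
0.97326125^2 + 0.05276254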

These results can be inserted into a path diagram (figure not reproduced here).

One thing to watch for in your model output is what is sometimes called a Heywood case. This refers to instances when the standardized factor loading exceeds 1 and, therefore, the indicator has a negative residual variance. This is inappropriate both because it is not possible and because it will produce non-interpretable estimates of model fit (the model fit represents fit to something that could not exist in the population). In this instance, the user-specified function, outlined below, should be used to impose a model constraint that keeps the residual variance above 0. We note that if you were to run the same common factor model using the usermodel function, you would need to use the code directly below, where we specify unit variance identification by telling lavaan to fix the variance of the common factor to 1, and by letting lavaan know that it needs to freely estimate the loading of the first factor indicator by writing NA* before the variable name (e.g., NA*MDD). Note that you can alternatively set the std.lv argument to TRUE when running usermodel to achieve the same outcome of unit variance identification. A sketch of the corresponding usermodel call follows the model syntax below.

commonfactor.model<-'F1=~ NA*MDD + PTSD + ANX + ALCH
F1~~1*F1'
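
The model string can then be passed to the usermodel function described below (a minimal sketch using the LDSCoutput object from Step 2; the object name CommonFactor_user is ours):

#run the same common factor model via the user-specified function
CommonFactor_user<-usermodel(covstruc=LDSCoutput,estimation="DWLS",model=commonfactor.model)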

User-Specified Models

Exploratory Factor Analysis

As genomic methods continue to reveal new and often surprising insights, it will not always be the case that a particular genetic factor structure is theorized prior to examining a genetic covariance matrix. However, once this matrix is constructed, clear patterns of clustering may begin to emerge that warrant further investigation. For example, a heatmap of the genetic correlations among anthropometric traits indicates two clusterings of early-life and obesity-related traits. The code below specifies an EFA outside of Genomic SEM that is used to inform a follow-up CFA in Genomic SEM. The genetic covariance matrix is smoothed beforehand as it is slightly non-positive definite. Summary statistics for early-life traits can be obtained from: https://egg-consortium.org/. Summary statistics for obesity-related traits can be obtained from: https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files.

#munge the summary statistics
munge(c("SNP_gwas_mc_merge_nogc.tbl.uniq", "GIANT_2015_WHR_COMBINED_EUR.txt", "EGG_Obesity_Meta_Analysis_1.txt", "GIANT_2015_HIP_COMBINED_EUR.txt", "GIANT_2015_WC_COMBINED_EUR.txt", "GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt", "EGG_HC_DISCOVERY.v2.txt", "EGG-GWAS-BL.txt", "EGG_BW2_DISCOVERY.txt"), hm3 = "w_hm3.snplist",trait.names=c("BMI2015", "waisthip", "childobese", "hip", "waist", "height", "headcirc", "birthlength", "birthweight"), c(NA, NA, 13848, NA, NA, NA, NA, NA, 26836), info.filter = 0.9, maf.filter = 0.01)

#run multivariable LDSC to create the S and V matrices
ld <- "eur_w_ld_chr/"
wld <- "eur_w_ld_chr/"
traits<-c("BMI2015.sumstats.gz", "waisthip.sumstats.gz", "childobese.sumstats.gz", "waist.sumstats.gz","hip.sumstats.gz", "height.sumstats.gz", "headcirc.sumstats.gz", "birthlength.sumstats.gz", "birthweight.sumstats.gz")
sample.prev <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA)
population.prev <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA)
trait.names<-c("BMI","WHR","CO","Waist", "Hip", "Height", "IHC", "BL", "BW")
anthro<-ldsc(traits, sample.prev, population.prev, ld, wld, trait.names)

#smooth the S matrix for EFA using the nearPD function in the Matrix package. 
require(Matrix)
Ssmooth<-as.matrix((nearPD(anthro$S, corr = FALSE))$mat)

#run EFA with promax rotation and 2 factors using the factanal function in the stats package
require(stats)
EFA<-factanal(covmat = Ssmooth, factors = 2, rotation = "promax")

#print the loadings
EFA$loadings
Loadings:
       Factor1 Factor2
BMI     0.917         
WHR     0.760  -0.209 
CO      0.642         
Waist   0.991         
Hip     0.843   0.202 
Height  0.114   0.557 
IHC             0.566 
BL              0.907 
BW              0.836 

               Factor1 Factor2
SS loadings      3.547   2.248
Proportion Var   0.394   0.250
Cumulative Var   0.394   0.644
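
Before settling on two factors, it can also be informative to inspect the eigenvalues of the genetic correlation matrix implied by the smoothed S matrix, for example to see how many exceed 1 (a sketch using the Ssmooth object from above):

#eigenvalues of the genetic correlation matrix implied by Ssmooth
eigen(cov2cor(Ssmooth))$values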

Confirmatory Factor Analysis

The results of the exploratory factor analysis are consistent with a two-factor model, so we can proceed to fit that model in GenomicSEM using a form of confirmatory factor analysis. The advantage of running a confirmatory factor analysis is that it allows you to consider various alternative models (a single factor, 2 correlated factors, 2 uncorrelated factors) and test the difference in fit; a sketch of a single-factor comparison follows the output below. Below we fit a model with 2 correlated factors, where the indicator Hip loads on both F1 and F2.

#Specify the Genomic confirmatory factor model
CFAofEFA <- 'F1 =~ NA*BMI + WHR + CO + Waist + Hip
             F2 =~ NA*Hip + Height + IHC + BL + BW
F1~~F2
Waist ~~ a*Waist
a > .001'

#run the model
Anthro<-usermodel(anthro, estimation = "DWLS", model = CFAofEFA, CFIcalc = TRUE, std.lv = TRUE, imp_cov = FALSE)

#print the results
Anthro

[1] "The S matrix was smoothed prior to model estimation due to a non-positive definite matrix. The largest absolute difference in a cell between the smoothed and non-smoothed matrix was  0.000128450830733748 As a result of the smoothing, the largest Z-statistic change for the genetic covariances was  0.021512453008981 . We recommend setting the smooth_check argument to true if you are going to run a multivariate GWAS."

$modelfit
     chisq df p_chisq     AIC       CFI       SRMR
df 16014.2 25       0 16054.2 0.9578571 0.09489913

$results
      lhs op    rhs Unstand_Est          Unstand_SE STD_Genotype    STD_Genotype_SE      STD_All               p_value
1      F1 =~    BMI 0.338382741 0.00816779834317392 0.9537240588   0.02304977177952 0.9537241890              < 5e-300
2      F1 =~    WHR 0.167082768 0.00936778682929158 0.5602247492 0.0314034629617733 0.5602247517  3.72111467925339e-71
3      F1 =~     CO 0.447887819  0.0244242528374094 0.7083270589 0.0386406556861751 0.7083282563  4.12947640362578e-75
4      F1 =~  Waist 0.355603761 0.00862639544536458 1.0122127530   0.02444892860335 0.9995123850              < 5e-300
5      F1 =~    Hip 0.290344161 0.00943076094054226 0.7920112282 0.0257405341654102 0.7920113377 3.92076924816146e-208
6      F1 ~~     F1 1.000000000                     1.0000000000                    1.0000000000                  <NA>
7      F2 =~    Hip 0.151966783  0.0107884296650292 0.4146180238 0.0294438308532194 0.4146180811  4.62472746001644e-45
8      F2 =~ Height 0.377158724  0.0286329863341952 0.6418055680 0.0487323641256646 0.6418055791  1.26887799652867e-39
9      F2 =~    IHC 0.280118876  0.0367784521006306 0.5828681883 0.0765289575252805 0.5828670778  2.60877410484911e-14
10     F2 =~     BL 0.310782930  0.0270723443609925 0.7665143838 0.0667845847317848 0.7665146320  1.66820100019102e-30
11     F2 =~     BW 0.232900495   0.024723730383786 0.6895875218 0.0732166969417796 0.6895879134  4.50574880728176e-21
12     F2 ~~     F1 0.107437920  0.0321430808598458 0.1076180078 0.0321299434317587 0.1076180078  0.000830304106282924
13     F2 ~~     F2 1.000000000                     1.0000000000                    1.0000000000                  <NA>
14    BMI ~~    BMI 0.011181701 0.00213933204595532 0.0904101466 0.0169827743472532 0.0904101713  1.72538506415052e-07
15    WHR ~~    WHR 0.061024102 0.00509089305871616 0.6861482215 0.0572271089788417 0.6861482276  4.16116024398877e-33
16     CO ~~     CO 0.198742754  0.0428835116775018 0.4982693967  0.107377945463358 0.4982710813  3.57836757886072e-06
17  Waist ~~  Waist 0.000999107 0.00128351056380159 0.0009999272   0.01040864070421 0.0009749922     0.436322949988614
18    Hip ~~    Hip 0.017397494 0.00349411236139898 0.1301301710  0.026011239755197 0.1301302070  6.38845359891692e-07
19 Height ~~ Height 0.203091450  0.0224000658583501 0.5880855781 0.0648787840700999 0.5880855986  1.22838370118893e-19
20    IHC ~~    IHC 0.152542817  0.0425275759516715 0.6602684856  0.184102405008418 0.6602659696  0.000334612835958487
21     BL ~~     BL 0.067788689  0.0216857087319307 0.4124550518  0.131946786908232 0.4124553189   0.00177224791927804
22     BW ~~     BW 0.059815862  0.0177417036505079 0.5244679139  0.155562634680387 0.5244685097  0.000747644996784941

The code above runs multivariable LDSC and then uses the factanal function to run an EFA on the smoothed S matrix. The factor loadings from this EFA indicate a two-factor model, with a cross-loading of hip circumference on both factors. This is then used to construct a follow-up CFA model using Genomic SEM. In this example, waist circumference was originally estimated to have a negative residual variance (i.e., a Heywood case). In order to prevent this residual variance from being estimated below 0, we first write "Waist ~~ a*Waist". The "a*" portion labels the residual variance parameter so it can be referenced in the model constraint written later: "a > .001". The code overall, then, specifies a correlated factors model with unit variance identification, labels the residual variance of waist circumference "a", and constrains "a" to be greater than .001 (i.e., above 0). These results are supportive of two common genetic factors measuring early-life and obesity traits that are moderately correlated (rg = .11).

These results can be inserted into a path diagram (figure not reproduced here).
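
As noted above, one advantage of CFA in Genomic SEM is the ability to compare alternative models. As a sketch (the model string and the object name Anthro1 are ours), a single common factor alternative can be fit with the same function and compared on chi-square and AIC:

#specify a single common factor model, retaining the Waist residual variance constraint
OneFactor<-'F1 =~ NA*BMI + WHR + CO + Waist + Hip + Height + IHC + BL + BW
F1~~1*F1
Waist ~~ a*Waist
a > .001'

#run the model and inspect fit relative to the two-factor model above
Anthro1<-usermodel(anthro, estimation = "DWLS", model = OneFactor)
Anthro1$modelfit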

Smoothing Non-positive Definite Matrices in Genomic SEM

In certain cases, the S or V matrix may be non-positive definite. The GenomicSEM package automatically checks for these cases and will smooth the matrix to be positive definite using the nearPD function should they arise. If either of your matrices is smoothed, the package will print the largest difference between the smoothed and non-smoothed matrix, along with the consequent differences in Z-statistics pre- and post-smoothing. For the anthropometric traits, the differences in observed covariances and Z-statistics were both relatively minimal. If a difference > .025 arises (a cutoff chosen arbitrarily), it may be cause for concern and Genomic SEM will print a warning message. Results should then be interpreted with caution, and if these same variables are being used to run a multivariate GWAS we recommend setting the smooth_check argument to TRUE for the GWAS functions. We note that the need for smoothing often occurs when low-powered traits are included in the model. As the usermodel function automatically excludes from the S and V matrices any variables not listed in the model prior to smoothing, you may also consider re-running the model without the lower-powered traits to see if this fixes the issue. The magnitude of smoothing can also be checked by hand, as in the sketch below.
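
For example, using the anthro and Ssmooth objects from the EFA step above (a sketch):

#largest absolute difference between the smoothed and original S matrix
max(abs(Ssmooth - anthro$S))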

Unit Identification

We note that for user-specified models with latent factors, some form of identification is needed. In the p-factor and anthropometric traits examples we use unit variance identification by constraining the variance of the latent factors to 1. This can be done by setting std.lv = TRUE, which standardizes all latent variables, as we do here for the anthropometric traits. It can also be explicitly included in the model syntax (e.g., "F1~~1*F1"), in which case the user must also write "NA*" in front of the first factor indicator so that the model knows to freely estimate this parameter. If neither is provided, then the model automatically specifies unit loading identification by fixing the loading of the first indicator to 1. Both identification choices are contrasted in the sketch below.
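
As a minimal sketch contrasting the two identification strategies (reusing the p-factor indicator names for illustration):

#unit variance identification: factor variance fixed to 1, all loadings freely estimated
uvi.model<-'F1 =~ NA*MDD + PTSD + ANX + ALCH
F1~~1*F1'

#unit loading identification (the default): first loading fixed to 1, factor variance free
uli.model<-'F1 =~ MDD + PTSD + ANX + ALCH'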

A Note on Standardized Output in Genomic SEM

The Genomic SEM usermodel function produces two forms of standardized output. The first, reflected in the STD_Genotype and STD_Genotype_SE columns, is produced by re-running the specified model using standardized input (i.e., a genetic correlation matrix and its sampling correlation matrix). These results will, therefore, always be standardized with respect to the genetic variance in your phenotypes. However, these results will not be fully standardized if a latent factor is not itself specified to have a variance of 1.0. This specification is not straightforward if the latent variable is an endogenous variable in the model (i.e., it is itself an outcome). In this case, you will likely want to use the STD_All column to obtain fully standardized output. The STD_All output is obtained by taking the fully standardized output directly from lavaan, which will also standardize endogenous latent variables. Standard errors are not provided for this column as the components necessary to obtain sandwich-corrected standard errors are not currently available. For the anthropometric traits example above, the latent variables are exogenous (not outcomes) and specified to have variances of 1.0, such that the STD_All and STD_Genotype columns closely match one another, with the slight differences attributable to variability in model estimation. We note that the commonfactor function only produces a single standardized estimate because the STD_All and STD_Genotype columns would match, as above.
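
For instance, assuming the results element is a data frame as the printed output suggests, the standardized columns of the anthropometric model can be pulled out directly:

#subset the results to the standardized estimates
Anthro$results[,c("lhs","op","rhs","STD_Genotype","STD_Genotype_SE","STD_All")]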