5. Multivariate GWAS - GenomicSEM/GenomicSEM GitHub Wiki

GenomicSEM for User Specified Models with SNP Effects (Multivariate GWAS)

A powerful extension of Genomic SEM is to run user specified models that include SNP effects. This allows for the specification of a range of multivariate GWAS models. For example, a correlated factors model could be specified in which the SNP is specified to predict both factors. Estimation of a user specified model with SNP effects proceeds in 4 steps. The first three steps mirror that for commonfactorGWAS. However, we outline the steps again as we apply it to a second set of neuroticism traits for this example, which takes different arguments in the initial steps. Note that estimation of the GWAS models are all entirely independent of one another. This means that the summary statistics used as input to the multivariate GWAS functions can be arbitrarily split up and submitted to different jobs on a computing cluster to help with memory and run-time issues. Please also note that the sumstats function requires that the variables be listed in the same order as they are listed for the ldsc function.

Step 1: Munge the summary statistics

As in the non-GWAS example, and the common factor GWAS example, the summary statistics need to be munged in order to produce the S (genetic covariance) and V (sampling covariance) matrix that do not include the SNPs. All traits can be munged at once as in the example below. In order to use our package to prepare the summary statistics the munge function can be used to convert the summary statistics to the format expected by LDSC. The summary statistics files input into the munge function at a minimum need to contain five pieces of information:

  1. The rsid of the SNP.
  2. An A1 allele column, with A1 indicating the effect allele.
  3. An A2 allele column, with A2 indicating the non-effect allele.
  4. Either a logistic or continuous regression effect.
  5. The p-value associated with this effect.

The package will automatically rename the column based on commonly observed names, but may return an error if the file contains trait-specific column headers (e.g., RSID_SCHIZOPHRENIA). All traits can be munged at once as in the example below. Here we use 12 traits as indicators of neuroticism from round 1 of the Neale Lab UK Biobank GWAS results. The round 1 summary statistics are available for download from: http://www.nealelab.is/uk-biobank. Note that for these specific traits, the files do not include A1 and A2 columns separately, but are listed in conjunction with base pair and chromosome in the variant column. However, the summary statistics can be read into most software and A1/A2 readily subset out from this first column.

munge(c("mood.txt", "misery.txt", "irrit.txt", "hurt.txt", "fedup.txt", "nervous.txt", "worry.txt", "tense.txt", "embarras.txt", "nerves.txt", "lonely.txt", "guilt.txt"),hm3 = "w_hm3.snplist",trait.names=c("mood", "misery", "irrit", "hurt", "fedup", "nervous", "worry", "tense", "embarass", "nerves", "lonely", "guilt"), c(329428,331856,322668,327832,330549,328725,328717,327232,323766,325248,332263,328769), info.filter = 0.9, maf.filter = 0.01)

The munge function itself takes 6 arguments:

  1. files: The name of the summary statistics files

  2. hm3: The name of the reference file. Here we use Hapmap 3 SNPs. This file can be obtained from https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v. Note that we previously used a reference file that removed the MHC (i.e., HLA) region, but upon further inspection found that the MHC region was also removed from the LD-score files used for estimating LD-Score regression in the following step, such that MHC is automatically excluded from analyses. You do NOT need to rerun analyses if you have been using the Hapmap3 SNPs with MHC removed as you should obtain equivalent results.

  3. trait.names: The trait names that will be used to name the saved files

  4. N: The sample sizes associated with the traits. Note that this should reflect the sum of effective sample sizes across contributing cohorts for meta-analyses of case/control traits or the sample size column in the summary statistics should reflect the SNP-specific, sum of effective sample sizes. When inputting the sum of effective sample sizes, the sample prevalence should then be entered as 0.5 when running ldsc to reflect the fact that effective sample size already corrects for sample ascertainment. If the input contains SNP-specific sample sizes for each row, then the munge function will only use this N if the user does not provide their own. Here we input the total sample size as these reflect GWAS of case/control outcomes within a single cohort (UK Biobank).

  5. info.filter: The INFO filter. The package default is to filter to SNPs with INFO > 0.9. If the summary statistics do not contain an INFO column the function can still be run, but results should be interpreted keeping in mind that this cleaning step was missed.

  6. maf.filter: The MAF filter. The package default is to filter to SNPs with MAF > 0.01.

We note that our munge function will produce slightly different results, generally with less SNPs, then the munge function from the original LDSC package. This is because in addition to checking whether or not the A1 allele matches A1/A2 in the ref file, our munge function also checks whether A2 matches A1/A2 in the ref file. In addition, our munge function removes SNP with missing MAFs in the summary statistics file. Munge will produce a .log file that should be inspected to examine whether column names are being interpreted correctly by the function.

Step 2: Run multivariable LDSC

The second step is to run multivariable LD-Score regression to obtain the genetic covariance (S) matrix and corresponding sampling covariance matrix (V). This is achieved by running the ldsc function. We note that it would not be appropriate in this case to take output from runs of the original LD score regression to construct by hand the S and V matrix. This is because the sampling covariances that occupy the off-diagonal elements of the V matrix could not be filled in. The ldsc function takes the first 5 necessary arguments and some additional optional arguments:

  1. traits: a vector of file names/paths to files which point to the munged sumstats.

  2. sample.prev: A vector of sample prevalences of length equal to the number of traits. Sample prevalence is calculated as the number of cases over the total number of participants (i.e., cases + controls). Possible range = 0-1. HOWEVER, if you have access to the sum of effective sample sizes then this should be entered for munge in the prior step and sample prevalence should be entered as 0.5. If the trait is continuous, the values should equal NA.

  3. population.prev: A vector of population prevalences. These estimates can be obtained from a number of sources, such as large scale epidemiological studies. Possible range = 0-1. Again, if the trait is continuous the values should equal NA.

4/5. ld and wld: A folder of LD scores used as the independent variable in LDSC (ld) and LDSC weights (wld). These are typically the same folder, and in the original LD score package is called "eur_w_ld_chr". We use the same LD scores and weights for our application, though the user can supply their own if desired. Weights for the european population used here can be obtained from by downloading the eur_w_ld_chr folder (note that these are the same weights provided by the original developers of LDSC): https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v If you receive an error when running ldsc of not being able to change your working directory please be sure that you have specified the correct file path to the LD scores. Please note that the LD scores must match the ancestral background of the summary statistics provided (i.e., European summary statistics with European LD scores). It is not currently possible to run LD-score regression with ad-mixed ancestral backgrounds. The following example takes summary statistics for the p-factor and runs multivariable LD score regression.

  1. trait.names: An optional argument specifying the trait names. This allows for model specification using the trait names in later steps, which can be useful for keeping track of results when the number of traits becomes large. If this argument is not specified, the function will automatically name the traits in the general form V1-VX.

  2. chr: An optional argument specifying whether you are modeling genomic data from less than 22 chromosomes (e.g., for non-human populations).

  3. n.blocks: An option arugment specifying whether you want the function to use more than the 200 blocks used to produce the block jackknife standard errors. Note that an update was made on November 2nd, 2021 so that if > 18 traits are analyzed the function will automatically use > 200 blocks in order to produce accurate estimates of model chi-square.

  4. ldsc.log: An optional argument specifying how you want to name the ldsc.log file. The default is to name the log file using the file names of the munged summary statistics.

  5. stand: An optional argument specifying whether you want the ldsc to also output the genetic correlation matrix and its sampling covariance matrix. Default = FALSE.

  6. select: An optional argument specifying whether you want ldsc to estimate using only odd (select = "ODD") or even (select = "EVEN") chromosomes. This can be helpful if you want to perform exploratory analyses on odd chromosomes and confirmatory analyses on even chromosomes as a hold-out sample. It can also be set to a set of numbers, such as c(1,3,10), to run ldsc on a specific chromosome or chromosomes. Default = FALSE, in which case ldsc is estimated using all chromosomes.

traits <- c("mood.sumstats.gz","misery.sumstats.gz", "irrit.sumstats.gz","hurt.sumstats.gz", "fedup.sumstats.gz", "nervous.sumstats.gz", "worry.sumstats.gz", "tense.sumstats.gz", "embarass.sumstats.gz", "nerves.sumstats.gz", "lonely.sumstats.gz", "guilt.sumstats.gz")
sample.prev <- c(0.451,0.427,0.28,0.556,.406,.237,.568,.171,.478,.213,.177,.283)
population.prev <- c(0.451,0.427,0.28,0.556,.406,.237,.568,.171,.478,.213,.177,.283)
ld <- "eur_w_ld_chr/"
wld <- "eur_w_ld_chr/"
trait.names<-c("mood", "misery", "irritability", "hurt", "fedup", "nervous", "worry", "tense", "embarass", "nerves", "lonely", "guilt")
LDSCoutput <- ldsc(traits=traits, sample.prev=sample.prev, population.prev=population.prev, ld=ld, wld=wld, trait.names=trait.names)

Step 3: Prepare the summary statistics for GWAS

The sumstats function takes care of a few things prior to running multivariate GWAS. In order to run a multivariate GWAS, the SNPs, and corresponding SNP effects, need to be coded across phenotypes so that the same allele is the reference allele in all cases. In all cases, coefficients and their SEs are then further transformed such that they are scaled relative to unit-variance scaled phenotypes. In order to appropriately perform this standardization for different types of betas (e.g., logistic betas, odds-ratios), the sumstats function takes a number of arguments outlined below. The sumstats function then combines these newly scaled betas and SEs into a single, listwise deleted file. Note that it is possible to back out a beta and SE when only Z-statistics are provided as long as you have the total sample size for a continuous trait or the sum of effective sample sizes across the contributing cohorts for a binary trait. Confused about how to specify the arguments for this function? See point 3 on the important resources and key information page. Please also note that the sumstats function requires that the variables be listed in the same order as they are listed for the ldsc function above. The sumstats function takes 4 necessary arguments, along with 6 additional arguments whose default is NULL:

  1. files: The name of the summary statistics files. This should be the same as the name of the files used for the munge function in Step 1 above and the files should be in the same listed order used for the ldsc function in step 2.

  2. ref: The reference file used to calculate SNP variance across traits. We use 1000 genomes phase 3 that is available for download here: https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v.

  3. trait.names: The names of the traits in the order that they are listed for the files.

  4. se.logit: Whether the SEs are on a logistic scale. Please note that for many traits, the SEs will be reported on a logistic scale even when the regression effects are reported as odds ratios (ORs). This information can usually be found in the README file for the summary statistics. If this is listed as FALSE, and all other additional arguments are FALSE, then the function is expecting the standard error of the OR. The function will automatically determine whether the effect column is an OR or on a logistic scale based on the median effect size.

  5. OLS: Whether the phenotype was a continuous outcome analyzed using an observed least square (OLS; i.e., linear) estimator. If no information is provided the default is for the function to assume FALSE. We write "OLS=NULL" for this argument, and all other arguments whose default is NULL, so it is clear what the argument names are when needed; however, this need not be written out if the argument is in fact FALSE for all of your traits.

  6. linprob: Whether the phenotype was a dichotomous outcome for which there are only Z-statistics in the summary statistics file -or- it was a dichotomous outcome analyzed using an OLS estimator, as is the case UKB phenotypes such as the 12 used here that were analyzed using the Hail software. In this case, a form of standardization is performed where logistic betas are backed out from a signed effect column. If no information is provided the default is for the function to assume NULL.

  7. N: A user provided N listed in the order the traits are listed for the files argument. If no information is provided, the default is for the function to assume NULL. When backing out a logistic beta using the linprob argument this requires the sum of effective sample sizes across the cohorts contributing to the GWAS. For OLS transformations this should be the total sample size. If sample sizes are being provided for some traits, but not others, then NA can be used for traits that the user does not wish to provide sample sizes for. If the summary statistics file includes a sample size column, then the user can also list NA if they wish to use the SNP-specific sample sizes to perform the rescaling. However ,we note again that this should be the sum of effective sample sizes for dichotomous traits. For this particular example, all traits are binary and were analyzed using a linear function so we set linprob to TRUE and supply the effective sample size within the single UKB cohort. (Note that this example still needs to be updated to reflect the updated sample size recommendations).

  8. betas: This argument can be used to directly supply the column name in the GWAS summary stats of the betas for continuous traits if the user knows the GWAS was run on an an already standardized phenotype. In this case, the package will directly use these betas and will not perform any further transformations beyond aligning the sign to a reference allele. The main reason a user might use this argument is if MAF is not available in the summary statistics file for a continuous trait. More specifically, when in-sample MAF is not available, the function uses reference MAF to back out a partially standardized beta, and mismatch between in-sample and reference MAF may cause some minimal amount of bias in the relative scaling of the betas across summary statistics.

  9. info.filter: The INFO filter to use. The package default is 0.6.

  10. maf.filter: The MAF filter to use. The package default is 0.01.

  11. keep.indel: Whether insertion deletions (indels) should be included in the output. Default = FALSE.

  12. parallel: Whether the function should be run in parallel. Default = FALSE.

  13. cores: When running in parallel, whether you want the computer to use a certain number of cores. If parallel is set to true, default is to use either 1 less than the total number of cores available or when the number of available cores exceeds the number of traits to use as many cores as there are traits.

N<-c(326275, 324813, 260085, 323690, 318807, 237478, 322720, 185698, 323113, 218257, 193453, 266678)
se.logit <-c(F,F,F,F,F,F,F,F,F,F,F,F)
Hail<-c(T,T,T,T,T,T,T,T,T,T,T,T)
files <- list("mood.txt", "misery.txt", "irrit.txt", "hurt.txt", "fedup.txt", "nervous.txt", "worry.txt", "tense.txt", "embarras.txt", "nerves.txt", "lonely.txt", "guilt.txt")
ref <- "reference.1000G.maf.0.005.txt" 

neuroticism_sumstats <- sumstats(files=files,ref=ref,trait.names=c("mood", "misery", "irrit", "hurt", "fedup", "nervous", "worry", "ten", "embarras", "nerves", "lonely", "guilt"),se.logit=se.logit,OLS = NULL,linprob=Hail,N=N,betas=NULL, info.filter=.6, maf.filter=0.01,keep.indel=FALSE,parallel=TRUE,cores=NULL)

Step 4: Run the Multivariate GWAS

The userGWAS function can be used to specify any model in which the SNP is a parameter in the model. The new default behavior for this function is to fix the measurement model (i.e., the factor loadings and factor correlations) across all SNPs. This decreases run times and improves interpretability of the genome-wide results. The function will also now automatically calculate Q_SNP for each latent factor that is predicted by a SNP. Q_SNP is a chi-square distributed heterogeneity index that quantifies the degree to which the univariate GWAS results for the factor indicators (the traits that define the factor) deviate from the pattern implied by the factor loadings. Q_SNP can serve as a quality control metric to remove SNPs that do not operate via the factor and can also reflect an interesting result in its own right if the researcher is also interested in identifying trait-specific SNPs relative to the other traits that load on a factor. The various arguments for userGWAS are detailed below. Note that only covstruc, SNPs, and model are necessary arguments.

  1. covstruc: The output from LDSC.

  2. SNPs: The output from sumstats.

  3. estimation: Whether the models should be estimated using DWLS or ML estimation. The package default is DWLS.

  4. model: The model that is being estimated (written in lavaan syntax)

  5. printwarn: An optional argument specifying whether the user wants individual lavaan warnings and errors to be included in the output. As these warnings can take-up considerable memory, we give the user the option to not save the warnings by specifying false.

  6. sub: An optional argument specifying whether or not the user is requesting only specific components of the model output to be saved. As it is unclear what piece of the model the user is most interested in, the default is for the function to return all pieces of model output in separate list objects for each SNP. However, this can take-up considerable memory, and it is likely the case that only components of the model output are of interest. For example, in this case we are interested in the SNP effect on each of the two factors, so we write: sub = c("F1~SNP", "F2~SNP"). This will output two list objects, reporting the SNP effects on the first and second latent factor, respectively.

  7. cores: When running in parallel, an optional argument specifying whether or not the user wants to use a certain number of cores for parallel processing. If no number is provided, the default is for the function to use 1 less than the total number of cores available in the local environment.

  8. toler: What tolerance level to use for matrix inversions. This is only something that needs to be of concern if warning/error messages are produced to the effect of "matrix could not be inverted".

  9. SNPSE: An optional argument specifying whether you want to set the SE of the SNP to a lower or higher value than the default of .0005. This number is set low to reflect that this value is essentially be treating as fixed.

  10. parallel: An optional argument specifying whether you want the function to be run in parallel, or to be run serially. As most outputs from sumstats will be quite long (i.e., > 1 million SNPs), running in parallel is highly recommended in order to save time; the default is to run in parallel.

  11. GC: Level of Genomic Control (GC) you want the function to use. The default is 'standard' which adjusts the univariate GWAS standard errors by multiplying them by the square root of the univariate LDSC intercept. Additional options include 'conserv', which adjusts standard errors by multiplying with univariate LDSC intercept, and 'none' which does not correct the standard errors.

  12. MPI: Whether the function should use multi-node processing (i.e., MPI). Please note that this should only be used on a computing cluster on which the R package Rmpi is already installed.

  13. smooth_check: Whether the function should save the largest Z-statistic difference between pre- and post-smoothed genetic covariance matrices. We recommend setting this to TRUE to ensure that SNPs that require a high degree of smoothing (e.g., Z > 1.96) are removed from results.

  14. std.lv: Whether you want the latent variables to have variances of 1. The default is FALSE and this should not be set to TRUE if the researcher plans to calculate the implied sample size of the factors for downstream analysis.

  15. fix_measurement: Whether the measurement model (the factor loadings and the factor correlations) should be fixed across all SNPs. The default is TRUE.

  16. Q_SNP: Whether the function should automatically calculate Q_SNP for each factor. This result will only be listed for model output that reflects the SNP predicting a latent factor. The default is FALSE as the user may not be estimating a factor model when running userGWAS.

#specify the model
model<-"F1 =~ mood + misery + irritability + fedup + lonely + guilt
F2 =~ hurt + nervous + worry + tense + embarass + nerves
F1~~F2
F1 ~ SNP
F2 ~ SNP"

#run the multivariate GWAS using parallel processing
CorrelatedFactors<-userGWAS(covstruc = LDSCoutput, SNPs = neuroticism_sumstats, estimation = "DWLS", model = model, printwarn = TRUE, sub=c("F1~SNP", "F2~SNP"), cores = NULL, toler = FALSE, SNPSE = FALSE, parallel = TRUE, Output = NULL,GC="standard",MPI=FALSE,smooth_check=TRUE,fix_measurement=TRUE,Q_SNP=TRUE)

#note that the code written above specifies all arguments for completeness, but as many of these arguments
#are set to the package default it could also be written as below and produce the same result:
CorrelatedFactors<-userGWAS(covstruc = LDSCoutput, SNPs = neuroticism_sumstats, model = model, sub=c("F1~SNP", "F2~SNP"),smooth_check=TRUE,fix_measurement=TRUE,Q_SNP=TRUE)

The first five rows of CorrelatedFactors[1](/GenomicSEM/GenomicSEM/wiki/1) are shown below. The first 6 columns are pulled from the file created using sumstats and list the rsID, chromosome, BP, MAF from the reference file, A1, and A2. The "lhs", "op", and "rhs" column lists the outcome, operator, and predictor. In this case "F1 ~ SNP", indicates the first correlated factor regressed on the SNP. In CorrelatedFactors[2](/GenomicSEM/GenomicSEM/wiki/2), this will list "F2 ~ SNP", as this was the second piece of the sub argument passed above. The "est" column lists the corresponding effect. The "se_c" column lists the sandwich corrected standard error associated with this effect. The "Z_Estimate" column lists the ratio of est/se_c. The "Pval_Estimate" column lists the corresponding p-value of Z_Estimate.

The "chisq" column lists the model chi-square, the "chisq_df" column lists the degrees of freedom for the model, and the "chisq_pval" column lists the p-value associated with the chi-square. As the chi-square test is sensitive to sample size, and most GWAS are characterized by large samples, the chi-square will often be so significant that this column simply prints 0. The "AIC" column prints the Aikaike Information Criteria. Chisq, chisq_df, chisq_pval, and AIC are only provided if the argument modelchi = TRUE.

The Q_SNP column lists the value factor-specific, chi-square distributed heterogeneity index. Q_SNP_df lists the degrees of freedom for this test, which will be equivalent to one less than the number of factor indicators. Q_SNP_pval lists the p-value for the Q_SNP result. We recommend performing quality control on multivariate factor GWAS results by removing SNPs at some Q_SNP_pval threshold. For example, you might consider removing all SNPs at a Q_SNP_pval < 5e-8, along with any SNPs that are in LD with Q_SNP significant results so as to remove the larger, putative Q_SNP locus.

The error column indicates whether a particular run did not converge. The warning column will list whether a particular run raised any warnings in lavaan. This may include specific runs where the residual variance of an indicator is negative. The smooth_check column lists the largest Z-statistic change if the matrix required smoothing due to it being non-positive definite. For these SNPs, no smoothing is required, so that colunmn lists a series of 0s.

The figure directly below depicts as a path diagram an example of a correlated factors model run for SNP rs4552973. Note that the factor loadings and correlations are shown without standard errors because these are fixed by the fix_measurement argument. We highlight also that the factor loadings > 1 for factor 2 are possible when the factor is not standardized (i.e., the factor variance is < 1).

Calculating Sample Size for Factors Once the summary statistics for a common factor are produced, the user may wish to input those summary statistics into an outside program that requires a sample size (e.g., LDSC, LDpred). A description of one method for calculating effective sample size can be found in the online supplement of bioRxiv pre-print from Mallard et al. (2019): https://www.biorxiv.org/content/10.1101/603134v1.abstract. The equation described in the supplement can be written as below to produce the effective sample size for Factor 1. Note that the recommendation is also made to restrict the summary statistics to MAF limits of approximately 10% and 40% in order to produce more stable estimates.

##Calculate Implied Sample Size for Factor 1
#restrict to MAF of 40% and 10%
CorrelatedFactors1<-subset(CorrelatedFactors, CorrelatedFactors$MAF <= .4 & CorrelatedFactors$MAF >= .1)

N_hat_F1<-mean(1/((2*CorrelatedFactors1[1](/GenomicSEM/GenomicSEM/wiki/1)$MAF*(1-CorrelatedFactors1[1](/GenomicSEM/GenomicSEM/wiki/1)$MAF))*CorrelatedFactors1[1](/GenomicSEM/GenomicSEM/wiki/1)$SE^2))

Note that when the phenotype is a latent factor, the choice of scaling the factor will have a nontrivial effect on the estimate of N_hat. Here we scale the latent genetic factors with unit loading identification, such that N_hat can be intuitively interpreted as the expected sample size for the factor scaled in the heritability units of the standardized reference phenotype (i.e. the phenotype with the unstandardized loading fixed to 1.0, in this case MDD). If we were to scale the latent genetic factors with unit variance identification, N_hat would be interpreted relative to a factor that is 100% heritable, and N_hat would be unintuitively very small (because, all else being equal, highly heritable phenotypes require smaller sample sizes to detect genetic associations).

We note also that for LDSC, the sample size provided does not effect the univariate intercept or the intercept SE. Although the heritability estimate and SE are contingent on user provided sample size, the ratio of the heritability estimate over SE is retained regardless of the sample size provided and, therefore, the heritability z-statistic can still be calculated.

A note on serial versus parallel processing. Serial and parallel processing will return the same output, but are computationally implemented in a serial and parallel fashion, respectively. In our experience using high performance computing systems there were often memory issues when using parallel processing (unless you have access to large memory nodes), in which case serial processing may be preferred. However, on a personal laptop with multiple cores, the computing time will be reduced greatly when using parallel processing. We note that each run is independent of one another, so it is also appropriate to set-up your runs in separate batches. The function runs as many models as there are SNPs.