
GenomicSEM for Common Factor GWAS

A powerful extension of Genomic SEM is to run a multivariate GWAS in which a common factor defined by genetic indicators is regressed on each SNP. This produces a set of summary statistics for the common factor that represent the SNP effects on the common factor. Estimation of a common factor GWAS proceeds in 4 steps. Note that estimation of the GWAS models is entirely independent across SNPs. This means that the summary statistics used as input can be arbitrarily split up and submitted as different jobs on a computing cluster to help with memory and run-time issues.

Step 1: Munge the summary statistics

As in the non-GWAS examples, the summary statistics need to be munged in order to produce the S and V matrices that do not include the SNPs. All traits can be munged at once, as in the example below. The munge function converts the summary statistics to the format expected by LD-score regression. The summary statistics files input into the munge function need to contain, at a minimum, five pieces of information:

  1. The rsid of the SNP.
  2. An A1 allele column, with A1 indicating the effect allele.
  3. An A2 allele column, with A2 indicating the non-effect allele.
  4. Either a logistic or continuous regression effect.
  5. The p-value associated with this effect.

The package will automatically rename the columns based on commonly observed names, but may return an error if the file contains trait-specific column headers (e.g., RSID_SCHIZOPHRENIA). We refer the user to the wiki example on the 3. Models without Individual SNP effects page for details on how to munge the four traits used here: Anxiety, Major Depression, PTSD, and Alcohol Use Disorder.
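For reference, a minimal sketch of this step is shown below. It assumes the same four files and trait names used in Step 3; the HapMap 3 SNP list (w_hm3.snplist) and the NA sample sizes are placeholders that should be replaced as appropriate for your data.

require(GenomicSEM)

#file and trait names match those used for the sumstats function in Step 3 below
files<-c("MDD_withNeff.txt", "SORTED_PTSD_EA9_ALL_study_specific_PCs1.txt", "ANX_withNeff.txt","ALCH_withrsID.txt")
trait.names<-c("MDD","PTSD","ANX","ALCH")

#placeholder: the HapMap 3 SNP list used to filter the summary statistics
hm3<-"w_hm3.snplist"

#placeholder: NA uses the sample size column in each file, if present;
#otherwise supply total N (continuous traits) or sum of effective N (binary traits)
N<-c(NA,NA,NA,NA)

munge(files=files,hm3=hm3,trait.names=trait.names,N=N)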

Step 2: Run multivariable LDSC

The second step is to run multivariable LD-Score regression to obtain the genetic covariance (S) matrix and corresponding sampling covariance matrix (V). This is achieved by running the ldsc function. We note that it would not be appropriate in this case to take output from runs of the original LD score regression software and construct the S and V matrices by hand. This is because the sampling covariances that occupy the off-diagonal elements of the V matrix could not be filled in. The ldsc function, its arguments, and example code for the four traits can also be found on the 3. Models without Individual SNP effects page.
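For completeness, a sketch of the ldsc call for these four traits is shown below. The sample and population prevalences are placeholders only (the appropriate values are given on the page linked above), and the LD score files are the European 1000 Genomes scores commonly used with the package.

#munged files produced in Step 1
traits<-c("MDD.sumstats.gz","PTSD.sumstats.gz","ANX.sumstats.gz","ALCH.sumstats.gz")

#placeholder prevalences: replace with the correct values for each trait
sample.prev<-c(.5,.5,.5,.5)
population.prev<-c(.15,.08,.20,.12)

#folder of European 1000 Genomes LD scores, used for both the LD scores and the weights
ld<-"eur_w_ld_chr/"
wld<-"eur_w_ld_chr/"

trait.names<-c("MDD","PTSD","ANX","ALCH")

LDSCoutput<-ldsc(traits=traits,sample.prev=sample.prev,population.prev=population.prev,ld=ld,wld=wld,trait.names=trait.names)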

Step 3: Prepare the summary statistics for GWAS

The sumstats function takes care of a few things prior to running a multivariate GWAS. In order to run a multivariate GWAS, the SNPs, and corresponding SNP effects, need to be coded across phenotypes so that the same allele is the reference allele in all cases. Coefficients and their SEs are then further transformed such that they are scaled relative to unit-variance scaled phenotypes. In order to appropriately perform this standardization for different types of betas (e.g., logistic betas, odds ratios), the sumstats function takes a number of arguments outlined below. The sumstats function then combines these newly scaled betas and SEs into a single, listwise deleted file. Note that it is possible to back out a beta and SE when only Z-statistics are provided, as long as you have the total sample size for a continuous trait or the sum of effective sample sizes across the contributing cohorts for a binary trait. Confused about how to specify the arguments for this function? See point 3 on the important resources and key information page. Please also note that the sumstats function requires that the variables be listed in the same order as they are listed for the ldsc function above. The sumstats function takes 4 necessary arguments, along with 9 optional arguments that have package defaults:

  1. files: The name of the summary statistics files. This should be the same as the name of the files used for the munge function in Step 1 above and the files should be in the same listed order used for the ldsc function in step 2.

  2. ref: The reference file used to calculate SNP variance across traits. We use 1000 genomes phase 3 that is available for download here: https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v.

  3. trait.names: The names of the traits in the order that they are listed for the files.

  4. se.logit: Whether the SEs are on a logistic scale. Please note that for many traits, the SEs will be reported on a logistic scale even when the regression effects are reported as odds ratios (ORs). This information can usually be found in the README file for the summary statistics. If this is listed as FALSE, and all other additional arguments are FALSE, then the function is expecting the standard error of the OR. The function will automatically determine whether the effect column is an OR or on a logistic scale based on the median effect size.

  5. OLS: Whether the phenotype was a continuous outcome analyzed using an ordinary least squares (OLS; i.e., linear) estimator. If no information is provided, the function assumes FALSE. We write "OLS=NULL" for this argument, and for all other arguments whose default is NULL, so it is clear what the argument names are when needed; however, this need not be written out if the argument is in fact FALSE for all of your traits.

  6. linprob: Whether the phenotype was a dichotomous outcome for which there are only Z-statistics in the summary statistics file -or- it was a dichotomous outcome analyzed using an OLS estimator, as is the case for certain UKB phenotypes analyzed using the Hail software. In this case, a form of standardization is performed in which logistic betas are backed out from the signed effect column. If no information is provided, the default is NULL. Here, the alcohol use disorder phenotype is processed using linprob, as its summary statistics file includes only Z-statistics.

  7. N: A user-provided N listed in the order the traits are listed for the files argument. If no information is provided, the default is NULL. When backing out a logistic beta using the linprob argument, this should be the sum of effective sample sizes across the cohorts contributing to the GWAS. For OLS transformations, this should be the total sample size. If sample sizes are being provided for some traits, but not others, then NA can be used for the traits for which the user does not wish to provide sample sizes. If the summary statistics file includes a sample size column, then the user can also list NA in order to use the SNP-specific sample sizes to perform the rescaling. However, we note again that this should be the sum of effective sample sizes for dichotomous traits. For this particular example, all traits are case/control and already report odds ratios or logistic betas; in the case of alcohol use disorder, only Z-statistics are provided, but the sum of effective N is present in the GWAS summary data. Thus, this argument is listed as NULL for these traits.

  8. betas: This argument can be used to directly supply the column name in the GWAS summary statistics of the betas for continuous traits, if the user knows the GWAS was run on an already standardized phenotype. In this case, the package will directly use these betas and will not perform any further transformations beyond aligning the sign to a reference allele. The main reason a user might use this argument is if MAF is not available in the summary statistics file for a continuous trait. More specifically, when in-sample MAF is not available, the function uses reference MAF to back out a partially standardized beta, and mismatch between in-sample and reference MAF may cause some minimal amount of bias in the relative scaling of the betas across summary statistics.

  9. info.filter: The INFO filter to use. The package default is 0.6.

  10. maf.filter: The MAF filter to use. The package default is 0.01.

  11. keep.indel: Whether insertion deletions (indels) should be included in the output. Default = FALSE.

  12. parallel: Whether the function should be run in parallel. Default = FALSE.

  13. cores: When running in parallel, how many cores the function should use. If parallel is set to TRUE, the default is to use one less than the total number of available cores, or, when the number of available cores exceeds the number of traits, to use as many cores as there are traits.

files<-c("MDD_withNeff.txt", "SORTED_PTSD_EA9_ALL_study_specific_PCs1.txt", "ANX_withNeff.txt","ALCH_withrsID.txt")
ref= "reference.1000G.maf.0.005.txt"
trait.names<-c("MDD","PTSD","ANX", "ALCH")
#SEs are on a logistic scale for all traits except alcohol use disorder
se.logit=c(T,T,T,F)
#alcohol use disorder provides only Z-statistics, so linprob is TRUE for that trait
linprob=c(F,F,F,T)
info.filter=.6
maf.filter=0.01
#these remain NULL: no continuous traits, and effective N is read from the files
OLS=NULL
N=NULL
betas=NULL

PSYCH_sumstats <-sumstats(files=files,ref=ref,trait.names=trait.names,se.logit=se.logit,OLS=OLS,linprob=linprob,N=N,betas=betas,info.filter=info.filter,maf.filter=maf.filter,keep.indel=FALSE,parallel=FALSE,cores=NULL)

Step 4: Combine the summary statistics and LDSC output and run the common factor GWAS

Note that as of 4/6/21, we have combined the addSNPs and GWAS functions to save memory. The output from addSNPs can no longer be fed to the GWAS functions. In order to run a multivariate GWAS, the S and V matrices from LDSC and the newly prepared summary statistics files first need to be merged to create as many S and V matrices as there are SNPs. The commonfactorGWAS function does this by expanding the S and V matrices to include the SNP effects. In addition, the function transforms the GWAS summary statistics (i.e., regression coefficients) and their SEs into covariances and SEs of covariances by taking the product of the regression coefficient and the SNP variance from the reference file. Finally, the function will smooth S and V matrices that are non-positive definite when including the SNP covariances. After constructing the S and V matrices, the function will run the common factor model.
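To make the rescaling concrete, the arithmetic applied to each SNP looks like the sketch below. This happens inside commonfactorGWAS, so there is nothing for the user to run here; the MAF, beta, and SE values are purely hypothetical.

#hypothetical values: reference-file MAF and a regression coefficient with its SE
maf<-0.30
beta<-0.02
se<-0.004

#SNP variance under Hardy-Weinberg equilibrium: 2*MAF*(1-MAF)
snp_var<-2*maf*(1-maf)

#covariance between the SNP and a unit-variance phenotype = beta * var(SNP)
cov_snp_trait<-beta*snp_var
se_cov<-se*snp_var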

Note that the commonfactorGWAS function uses unit loading identification, such that the loading of the first indicator is fixed to 1 so that the model is identified. If you are receiving warnings related to model convergence, it can help to order your files in the ldsc and sumstats functions such that the phenotype with the largest, unstandardized loading on the common factor is listed first. This can be determined by running commonfactor on just the ldsc matrix to determine what the factor structure looks like when not including the effect of an individual SNP (a sketch of this check appears after the argument list below). The commonfactorGWAS function takes the following arguments:

  1. covstruc: The output from LDSC.

  2. SNPs: The output from sumstats.

  3. estimation: Whether the models should be estimated using "DWLS" or "ML" estimation. The package default is "DWLS".

  4. cores: How many computer cores to use in parallel processing. The default is to use 1 less core than is available in the local environment.

  5. toler: What tolerance level to use for matrix inversions. This is only something that needs to be of concern if warning/error messages are produced to the effect of "matrix could not be inverted".

  6. SNPSE: An optional argument specifying whether you want to set the SE of the SNP to a value higher than the default of .0005. This number is set low to reflect that the SNP variance is essentially being treated as fixed.

  7. parallel: An optional argument specifying whether you want the function to be run in parallel, or to be run serially. As most outputs from sumstats will be quite long (i.e., > 1 million SNPs), running the function in parallel is highly recommended in order to save time; the default is for the function to run in parallel unless you are using a Windows operating system.

  8. GC: Level of Genomic Control (GC) you want the function to use. The default is 'standard', which adjusts the univariate GWAS standard errors by multiplying them by the square root of the univariate LDSC intercept. Additional options include 'conserv', which adjusts standard errors by multiplying them by the univariate LDSC intercept, and 'none', which does not correct the standard errors.

  9. MPI: Whether the function should use multi-node processing (i.e., MPI). Please note that this should only be used on a computing cluster on which the R package Rmpi is already installed.

  10. TWAS: Whether you are running the function using output from FUSION to estimate a multivariate Transcriptome-wide association study (TWAS).

  11. smooth_check: Whether the function should save the largest Z-statistic difference between pre- and post-smoothed genetic covariance matrices.
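As noted above, a quick way to check which phenotype has the largest unstandardized loading, so that it can be listed first, is to fit the common factor model to the LDSC output alone. A minimal sketch of this check, using the LDSCoutput object from Step 2:

#fit the common factor model without individual SNP effects
CFmodel<-commonfactor(covstruc=LDSCoutput,estimation="DWLS")

#inspect the unstandardized factor loadings
CFmodel$results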

#run the multivariate GWAS using parallel processing
PSYCH_factor <- commonfactorGWAS(covstruc = LDSCoutput, SNPs = PSYCH_sumstats, estimation = "DWLS", cores = NULL, toler = FALSE, SNPSE = FALSE, parallel = TRUE, GC = "standard", MPI = FALSE, TWAS = FALSE, smooth_check = FALSE)

#note that the code written above specifies all arguments for completeness, but as many of these arguments
#are set to the package default it could also be written as below and produce the same result:
PSYCH_factor  <- commonfactorGWAS(covstruc = LDSCoutput, SNPs = PSYCH_sumstats)

The first five rows from this output are shown below:

The figure directly below depicts as a path diagram an example of the model run for SNP rs2073813:

The first 6 columns are pulled from the file created using sumstats and list the rsID, chromosome, BP, MAF from the reference file, A1, and A2. The "i" column is simply an ordered list of the runs. The "lhs", "op", and "rhs" columns list the outcome, operator, and predictor. In this case, "F1 ~ SNP" indicates the common factor regressed on the SNP. The user should check that these columns read as such to confirm the appropriate results are being pulled. The "est" column lists the effect of the SNP on the common factor. The "se_c" column lists the sandwich-corrected standard error associated with this effect. The "Z_Estimate" column lists the ratio est/se_c. The "Pval_Estimate" column lists the corresponding p-value for Z_Estimate.

The "Q" column lists the heterogeneity estimate for that SNP. This indexes the extent to which the SNP effect is not mediated by the common factor, with larger values indicating greater heterogeneity. This is a chi-square distributed test statistic, with degrees of freedom for a common factor model equal to the number of indicators - 1. The "Q_df" column lists the degrees of freedom for Q, and the "Q_pval" column lists the p-value associated with the Q statistic. The "fail" column indicates whether a particular run did not converge. The "warning" column will list whether a particular run raised any warnings in lavaan. This may include specific runs where the residual variance of an indicator is negative.
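For example, the checks described in the two paragraphs above might be run as in the sketch below; the Bonferroni threshold for Q is only one possible choice.

#confirm the factor-on-SNP regression is what is being pulled; should read "F1 ~ SNP"
unique(paste(PSYCH_factor$lhs, PSYCH_factor$op, PSYCH_factor$rhs))

#tabulate runs that failed to converge or raised lavaan warnings
table(PSYCH_factor$fail)
table(PSYCH_factor$warning)

#flag SNPs with significant heterogeneity (example: Bonferroni correction across all SNPs)
Q_hits<-subset(PSYCH_factor, PSYCH_factor$Q_pval < .05/nrow(PSYCH_factor))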

Calculating sample size for factors. Once the summary statistics for a common factor are produced, the user may wish to input those summary statistics into an outside program that requires a sample size (e.g., LDSC, LDpred). A description of one method for estimating the expected sample size associated with the factor can be found in the online supplement of the bioRxiv pre-print from Mallard et al. (2019): https://www.biorxiv.org/content/10.1101/603134v1.abstract. The equation described in the supplement can be written as below to produce the predicted sample size for the psychiatric factor. Note that the recommendation is also made to restrict the summary statistics to MAFs between approximately 10% and 40% in order to produce more stable estimates.

#restrict to MAF between 10% and 40%
PSYCH_factor2<-subset(PSYCH_factor, PSYCH_factor$MAF <= .4 & PSYCH_factor$MAF >= .1)
#calculate expected sample size (N_hat)
N_hat<-mean(1/((2*PSYCH_factor2$MAF*(1-PSYCH_factor2$MAF))*PSYCH_factor2$se_c^2))

Note that when the phenotype is a latent factor, the choice of scaling the factor will have a nontrivial effect on the estimate of N_hat. Here we scale the latent genetic factors with unit loading identification, such that N_hat can be intuitively interpreted as the expected sample size for the factor scaled in the heritability units of the standardized reference phenotype (i.e. the phenotype with the unstandardized loading fixed to 1.0, in this case MDD). If we were to scale the latent genetic factors with unit variance identification, N_hat would be interpreted relative to a factor that is 100% heritable, and N_hat would be unintuitively very small (because, all else being equal, highly heritable phenotypes require smaller sample sizes to detect genetic associations).

Note that for LDSC, the sample size provided does not affect the univariate intercept or the intercept SE. Although the heritability estimate and SE are contingent on the user-provided sample size, the ratio of the heritability estimate over its SE is retained regardless of the sample size provided and, therefore, the heritability z-statistic can be calculated irrespective of how the factor is scaled when computing N_hat.

A note on serial versus parallel processing. Serial and parallel processing return the same output; they differ only in how the computation is scheduled. In our experience on high performance computing systems, parallel processing often ran into memory issues unless large-memory nodes were available. If memory is a problem, serial processing may be preferred. On a personal laptop with multiple cores, however, parallel processing greatly reduces computing time: with 8 cores, a common factor GWAS with five indicators and 1 million SNPs took approximately 8 hours. Keep in mind that these functions are running as many models as there are SNPs.
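As noted at the top of this page, the models are independent across SNPs, so the output of sumstats can also be split into chunks and submitted as separate cluster jobs. A minimal sketch of this approach is below; the number of chunks is arbitrary, and the commented lines indicate what would run within and after the separate jobs (list_of_chunk_results is a placeholder for the collected per-chunk outputs).

#split the sumstats output into 10 row-wise chunks
n_chunks<-10
chunks<-split(PSYCH_sumstats, cut(seq_len(nrow(PSYCH_sumstats)), n_chunks, labels=FALSE))

#within job i, run the GWAS on chunk i only:
#PSYCH_factor_i<-commonfactorGWAS(covstruc=LDSCoutput, SNPs=chunks[[i]])

#after all jobs finish, row-bind the per-chunk results back together:
#PSYCH_factor<-do.call(rbind, list_of_chunk_results)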