7. Transcriptome wide SEM (T SEM) - GenomicSEM/GenomicSEM GitHub Wiki

Transcriptome-wide Structural Equation Modeling (T-SEM)

There are five primary steps necessary to perform T-SEM within the Genomic SEM framework:

  1. Select and munge the summary statistics using the munge function.
  2. Run FUSION to obtain the univariate TWAS estimates
  3. Run read_fusion to prepare the FUSION output for T-SEM
  4. Run ldsc to obtain multivariable LD-score regression estimates.
  5. Estimate the effect of imputed gene expression on any parameter in a model. For example, we demonstrate here how to estimate the association between gene expression and a common factor indexing genetic overlap across seven cognitive traits.

The example below utilizes the GWAS summary statistics for seven cognitive traits from UK Biobank, directly available here: https://datashare.is.ed.ac.uk/handle/10283/3756 Note that we remove the space in the Symbol Digit file name prior to running any analyses to avoid having to put the file name in quotes throughout the code below.

Step 1: Munge the Summary Statistics

As the summary statistics in the link above are already munged we do not perform this step here. For a description and example of how to munge summary statistics, please refer to the Models without individual SNP Effects page. We recommend retaining the MHC region (i.e., avoid using the hm3_noMHC.snplist file) when munging as the FUSION software developers provide a set of MHC-specific functional weights that will produce missing results if the region is excluded.

Step 2: Run FUSION

Prior to running Step 2 it is necessary to install the FUSION package along with the necessary package dependencies and files. The steps directly below for getting FUSION up and running are taken directly from the FUSION website (http://gusevlab.org/projects/fusion/). This is with the exception that the last line of code for installing the plink2R package has been modified so that it is compatible with more recent versions of R. • Download and unpack the FUSION software package: wget https://github.com/gusevlab/fusion_twas/archive/master.zip unzip master.zip cd fusion_twas-master • Download and unpack the 1000 Genomes LD reference data: wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2 tar xjvf LDREF.tar.bz2 • Download and unpack the plink2R library: wget https://github.com/gabraham/plink2R/archive/master.zip unzip master.zip • Launch R and install required libraries: install.packages(c('optparse','RColorBrewer')) library(devtools) devtools::install_github("carbocation/plink2R/plink2R", ref="carbocation-permit-r361")

For the purposes of this exercise we utilize the functional weights for the Amygdala tissue from GTEX v7. This is also directly available for download at http://gusevlab.org/projects/fusion/. FUSION takes the munged summary statistics created in Step 1 as input and has 6 necessary arguments:

  1. sumstats: The file names/paths to files which point to the munged sumstats.

  2. weights: The file names for the .pos file that included in the folder of pre-computed functional weights that the FUSION developers make available.

  3. weights_dir: The file path to where the functional weights are stored

  4. ref_ld_chr: The file path and general naming format for the LD-scores. For example, the 1000 Genomes LD-scores that are downloaded directly above are in the LDREF folder, and following the naming scheme of 1000G.EUR.1., 1000G.EUR.2., etc.

  5. chr: Which chromosome is being analyzed.

  6. out: What FUSION should name the output file.

The code below runs FUSION for all seven summary statistics using a basic bash for loop. In practice, this loop can be nested to, for example, run FUSION across all chromosomes or multiple sets of functional weights. For the purposes of this exercise only, we analyze only chromosome 17, but in practice would analyze all chromosomes. Note that this code is not run in R, but rather launches FUSION in R using the Rscript command. The code directly below can be run via the terminal on a MAC or using Cygwin on a PC.

for g in Memory.sumstats Matrix.sumstats Memory.sumstats RT.sumstats SymbolDigit.sumstats TMTB.sumstats Tower.sumstats VNR.sumstats;
do
Rscript FUSION.assoc_test.R \
--sumstats $g \
--weights GTEx.Brain_Amygdala.P01/Brain_Amygdala.P01.pos \
--weights_dir GTEx.Brain_Amygdala.P01/ \
--ref_ld_chr ./LDREF/1000G.EUR. \
--chr 17 \
--out Amygdala_CHR17_$g.dat;
done

Step 3: Prepare the FUSION summary statistics for T-SEM

The read_fusion function takes care of a few things prior to running multivariate TWAS. In order to run a multivariate TWAS, the TWAS coefficients and their SEs are transformed such that they are scaled relative to unit-variance scaled phenotypes. The read_fusion function then combines these newly scaled betas and SEs into a single, listwise deleted file. Please also note that the read_fusion function requires that the variables be listed in the same order as they are listed for the ldsc function described in Step 4 below. The read_fusion function takes 4 arguments:

  1. files: The name of the FUSION summary statistics files. Note that if FUSION is used to produce summary statistics across all 22 chromosomes and/or for multiple tissue types then we recommend that you combine the FUSION outputs within a phenotype prior to running read_fusion. That is, as all T-SEM runs are independent of one another, there is no need to run read_fusion separately for each chromosome or tissue (in fact this would be an unnecessarily cumbersome way of going through the remaining steps to perform T-SEM).

  2. trait.names: The names of the traits in the order that they are listed for the files argument.

  3. N: A user provided N listed in the order the traits are listed for the files argument. Note that for binary traits this should reflect the sum of effective sample sizes across contributing cohorts; if this is for a binary trait of a single cohort (e.g., UK Biobank analyses) then it should reflect the effective sample size for that cohort.

  4. binary: Whether the phenotype was a binary outcome

  5. perm: Whether the output should use the permutation test p-values that can be requested from FUSION (default = FALSE).

files<-list("Amygdala_CHR17_Matrix.sumstats.dat", "Amygdala_CHR17_Memory.sumstats.dat", "Amygdala_CHR17_RT.sumstats.dat","Amygdala_CHR17_SymbolDigit.sumstats.dat","Amygdala_CHR17_TMTB.sumstats.dat","Amygdala_CHR17_Tower.sumstats.dat", "Amygdala_CHR17_VNR.sumstats.dat")
trait.names=c("Matrix","Memory","RT","SD","TMTB","Tower","VNR")
N=c(11356,331679,330024,87741,78547,11263,171304)
binary=c(F,F,F,F,F,F,F)
gfactor_genes<- read_fusion(files=files,trait.names=trait.names,N=N,binary=binary)

Step 4: Run Multivariable LD-Score Regression

The fourth step is to run multivariable LD-Score regression to obtain the genetic covariance (S) matrix and corresponding sampling covariance matrix (V). This is achieved by running the ldsc function. We note that it would not be appropriate in this case to take output from runs of the original LD score regression package to construct, by hand, the S and V matrices. This is because the sampling covariances that occupy the off-diagonal elements of the V matrix could not be filled in. The following example takes summary statistics for the g-factor and runs multivariable LD score regression. The ldsc function takes 5 necessary arguments:

  1. traits: a vector of file names/paths to files which point to the munged sumstats.

  2. sample.prev: A vector of sample prevalences of length equal to the number of traits. Sample prevalence is calculated as the number of cases over the total number of participants (i.e., cases + controls). Possible range = 0-1. If the trait is continuous, the values should equal NA. If effective sample size was entered for munge for binary trait then this value should be 0.5.

  3. population.prev: A vector of population prevalences. These estimates can be obtained from a number of sources, such as large scale epidemiological studies. Possible range = 0-1. Again, if the trait is continuous the values should equal NA.

4/5. ld and wld: A folder of LD scores used as the independent variable in LDSC (ld) and LDSC weights (wld). These are typically the same folder, and in the original LD score package is called "eur_w_ld_chr". We use the same LD scores and weights for our application, though the user can supply their own if desired. Weights for the European population used here can be obtained by downloading the eur_w_ld_chr folder in the link below (Note that these are the same weights provided by the original developers of LDSC): https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v Note that these LD scores exclude the MHC (i.e., HLA) region due to the complex and long-range LD observed in this region of chromosome 6 that can unduly bias LDSC estimates. If you receive an error when running ldsc of not being able to change your working directory please be sure that you have specified the correct file path to the LD scores. Please note that the LD scores must match the ancestral background of the summary statistics provided (i.e., European summary statistics with European LD scores). It is not currently possible to run LD-score regression with ad-mixed ancestral backgrounds.

  1. trait.names: An optional argument specifying the trait names. This allows for model specification using the trait names in later steps, which can be useful for keeping track of results when the number of traits becomes large. If this argument is not specified, the function will automatically name the traits in the general form V1-VX.

  2. ldsc.log: An optional argument specifying how you want to name the ldsc.log file. The default is to name the log file using the file names of the munged summary statistics.

  3. stand: An optional argument specifying whether you want the ldsc to also output the genetic correlation matrix and its sampling covariance matrix. Default = FALSE.

  4. select: An optional argument specifying whether you want ldsc to estimate using only odd (select = "ODD") or even (select = "EVEN") chromosomes. This can be helpful if you want to perform exploratory analyses on odd chromosomes and confirmatory analyses on even chromosomes as a hold-out sample. It can also be set to a set of numbers, such as c(1,3,10), to run ldsc on a specific chromosome or chromosomes. Default = FALSE, in which case ldsc is estimated using all chromosomes.

ld <- "eur_w_ld_chr/"
wld <- "eur_w_ld_chr/"
traits <- c("Matrix.sumstats", "Memory.sumstats", "RT.sumstats", "SymbolDigit.sumstats", "TMTB.sumstats", "Tower.sumstats", "VNR.sumstats")
sample.prev <- c(NA,NA,NA,NA,NA,NA,NA)
population.prev <- c(NA,NA,NA,NA,NA,NA,NA)
trait.names<-c("Matrix","Memory","RT","SD","TMTB","Tower","VNR")
gfactor_LDSCoutput <- ldsc(traits, sample.prev, population.prev, ld, wld,trait.names=trait.names)

##optional command to save the ldsc output in case you want to use it in a later R session. 
save(gfactor_LDSCoutput, file="gfactor_LDSC.RData")

Step 5: Combine the FUSION summary statistics and LDSC output and run the common factor T-SEM

For reasons related to package maintenance, we avoid introducing an entirely new T-SEM function, but instead have modified the existing multivariate GWAS functions to accept a new TWAS argument that can be set to TRUE when running T-SEM analyses. We recognize that certain arguments (e.g., the SNPs argument) are technically speaking a misnomer in the context of T-SEM analyses, but we have endeavored to be as clear as possible in the instructions below how to specifically modify and think about these arguments when running T-SEM. For the current analyses, we analyze the effect of gene expression on the genetic g-factor using the commonfactorGWAS function, which takes the following arguments:

  1. covstruc: The output from LDSC.

  2. SNPs: In the context of T-SEM analyses, this will be output from read_fusion from Step 3 above.

  3. estimation: Whether the models should be estimated using "DWLS" or "ML" estimation. The package default is "DWLS".

  4. cores: How many computer cores to use in parallel processing. The default is to use 1 less core than is available in the local environment.

  5. toler: What tolerance level to use for matrix inversions. This is only something that needs to be of concern if warning/error messages are produced to the effect of "the matrix could not be inverted".

  6. SNPSE: An optional argument specifying whether you want to set the SE of the gene to a value higher than the default of .0005. This number is set low to reflect that this value is essentially be treating as fixed.

  7. parallel: An optional argument specifying whether you want the function to be run in parallel, or to be run serially. Unlike for multivariate GWAS, the number of rows in the FUSION output will generally be small enough that T-SEM results can be readily obtained on a personal laptop run in serial in as little as a few minutes or up to a few hours. That said, if you are running analyses for > 1,000 genes, running in parallel will still speed-up run times and can of course be utilized when preferred.

  8. GC: Level of Genomic Control (GC) you want the function to use. The default is 'standard' which adjusts the univariate TWAS standard errors by multiplying them by the square root of the univariate LDSC intercept. Additional options include 'conserv', which adjusts standard errors by multiplying with univariate LDSC intercept, and 'none' which does not correct the standard errors.

  9. MPI: Whether the function should use multi-node processing (i.e., MPI). We do not recommend using MPI for T-SEM analyses.

  10. TWAS: An argument that should be set to TRUE when running T-SEM, which tells the commonfactorGWAS function to read-in and use output from read_fusion.

gfactor_TSEM<commonfactorGWAS(covstruc=gfactor_LDSCoutput,SNPs=gfactor_genes,parallel=FALSE,TWAS=TRUE)

The first six rows from this output are shown below:

The first 3 columns are pulled from the file created using read_fusion and list the Gene name, the reference Panel used to estimate the results in FUSION, and the variance of the gene (HSQ) pulled directly from FUSION. The "i" column is simply an ordered list of the runs. The "lhs", "op", and "rhs" column lists the outcome, operator, and predictor. In this case "F1 ~ Gene", indicates the common factor (g) regressed on gene expression. The user should check that this column reads as such to confirm the appropriate results are being pulled. The "est" column lists the effect of gene expression on the common factor. The "se_c" column lists the sandwich corrected standard error associated with this effect. The "Z_Estimate" column lists the ratio of est/se_c. The "Pval_Estimate" column lists the corresponding p-value of Z_Estimate.

The Q column lists the heterogeneity estimate for that gene. This indexes the extent to which the gene expression effect is not mediated by the common factor, with larger values indicating greater heterogeneity. This is a chi-square distributed test statistic, with degrees of freedom for a common factor model equal to the number of indicators - 1. The "Q_df" column lists the degrees of freedom for Q, and the "Q_pval" column lists the p-value associated with the Q statistic. The fail column indicates whether a particular run did not converge. The warning column will list whether a particular run raised any warnings in lavaan. This may include specific runs where the residual variance of an indicator is negative.