2. Preparing the GWAS Data

This page describes the steps required to prepare GWAS data for use in GNA, including:

-Running LD-score regression to estimate the genetic covariances that enable subsequent estimation of the genome-wide network

-Preparing GWAS summary statistics to incorporate SNPs into the network using the gwasNET function

-Estimating imputed gene expression from GWAS data to incorporate genes into the network using the twasNET function

Running LDSC and formatting the GWAS summary statistics to incorporate SNPs into the network each require different reference files that are specific to a particular genetic ancestry. Reference data files for European and East Asian genetic ancestries are included in the GNA package and can each be extracted by:

# Load the GNA package
require(GNA)

#To extract the East Asian reference data files please run: 
refData("eas")

#To extract the European reference data files please run:
refData("eur")

This same refData function can also be used to load example data for which the prerequisite steps described below have already been performed. If you are interested in getting an initial understanding of how the GNA package works, you can go straight to the other help pages, each of which includes the code to load this example data at the top. Of course, for your own empirical GNA applications, you should return to this page for details on how to prepare and format your GWAS traits.

Estimating a genetic covariance structure

Step 1: Munge the summary statistics

The first step is to standardize the formatting of the GWAS summary statistics files across the included traits, including aligning to the same reference allele and providing GWAS results as Z-statistics. The munge function in the GenomicSEM package will do this reformatting. Details on installing the GenomicSEM package can be found here. The GWAS summary statistics files provided as input to the munge function need to contain five pieces of information:

  1. The rsID of the SNP.
  2. An A1 allele column, with A1 indicating the effect allele.
  3. An A2 allele column, with A2 indicating the non-effect allele.
  4. Either a logistic or continuous regression effect.
  5. The p-value associated with this effect.

The package will automatically rename the columns based on commonly observed names, but may return an error if the file contains trait-specific column headers (e.g., RSID_T2D). All traits can be munged at once, as in the example below. The GWAS data for the included traits can be downloaded here.

The munge function takes 6 arguments:

  1. files: The names of the GWAS summary statistics files.

  2. hm3: The name of the reference file. Here we use HapMap 3 SNPs.

  3. trait.names: The trait names that will be used to name the saved files.

  4. N: Vector containing the sample size of each trait. For continuous traits (e.g., BMI) this is entered as the total sample size. If the GWAS data contains a sample size column then the argument can be listed as NA, as is done for T2D, BMI, and SBP below. For FPG, TG, and HDL these sample sizes are directly provided.

For case-control traits that reflect a meta-analysis across multiple cohorts, the sample size should be entered as the sum of effective sample sizes across contributing cohorts -or- there should be a SNP-specific sum of effective sample size column in the GWAS data. When effective sample size is not a column in the GWAS data, it can be calculated within each cohort that contributes to the univariate GWAS meta-analysis as 4v(1-v)n, where v is the sample prevalence (cases over total cohort sample size) and n is the total cohort sample size; a short worked sketch follows this argument list. The sum of these cohort-specific effective sample sizes is then provided for this argument. For additional details regarding the use of the sum of effective sample size, see here. If the sum of effective sample size is used, then the sample prevalence argument should be input as 0.5 when running ldsc in Step 2 (described below). If the sumstats for a trait contain a sample size column, then the munge function will only use this column if NA is provided to the N argument for that trait.

  5. info.filter: The INFO filter. The package default is to filter to SNPs with INFO > 0.9. If the summary statistics do not contain an INFO column the function can still be run, but results should be interpreted keeping in mind that this cleaning step was missed.

  6. maf.filter: The minor allele frequency (MAF) filter. The package default is to filter to SNPs with MAF > 0.01.
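
As a worked illustration of the effective sample size calculation described for the N argument above, the short sketch below computes the sum of effective sample sizes across contributing cohorts; the case/control counts are hypothetical and purely illustrative.

#hypothetical case/control counts for three contributing cohorts
cases <- c(12000, 8500, 4300)
controls <- c(48000, 30000, 21000)

#sample prevalence v within each cohort (cases over total cohort N)
v <- cases/(cases + controls)

#cohort-specific effective N: 4v(1-v)n, where n is the total cohort N
n_eff <- 4*v*(1 - v)*(cases + controls)

#sum of effective N across cohorts; this is the value provided to the N argument
sum(n_eff)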

Example code:

#load Genomic SEM package
require(GenomicSEM)

#vector of the GWAS summary statistics files
files <- c("T2D_EAS_Metal_LDSC-CORR_Neff.v2.txt.gz","FPG_EastAsian288K_Meta.txt.gz", "GCST90278654_BMI.tsv.gz", "GCST90278668_SBP.tsv.gz", "TG_EastAsian288K_Meta.txt.gz","HDL_EastAsian288K_Meta.txt.gz")

#define the reference file being used to align alleles across summary stats
#here we are using hapmap3
hm3 <- "w_hm3.snplist"

#name the traits 
trait.names <- c("T2D","FPG","BMI","SBP","TG","HDL")

#list the sample size of each trait (effective N for case control traits; total N for continuous traits)
#as the sumstats for T2D contain an 'Neff' column, NA is entered for this trait
#as the sumstats for BMI and SBP also contain an 'N' column, NA is entered for these traits
N <- c(NA,288127,NA,NA,288127,288127)

#define the imputation quality filter
info.filter <- 0.9

#define the MAF filter
maf.filter <- 0.01

#run munge
munge(files=files, hm3=hm3, trait.names=trait.names, N=N, info.filter=info.filter, maf.filter=maf.filter)

Step 2: Run multivariable LDSC

The second step is to run multivariable LD-Score regression to obtain the genetic covariance (S) matrix and corresponding sampling covariance matrix (V). This is achieved by running the ldsc function from the GenomicSEM R package. The ldsc function takes the following required arguments (for details on additional, optional arguments, see here):

  1. traits: A vector of file names/paths to the munged sumstats files.

  2. sample.prev: A vector of sample prevalences of length equal to the number of traits. Sample prevalence is calculated as the number of cases over the total number of participants (i.e., cases + controls). Possible range = 0-1. If you have access to the sum of effective sample sizes then this should be entered for munge in the prior step and sample prevalence should be entered as 0.5, as is done for Type 2 Diabetes in the example below. This is because the transformation of total sample size to sum of effective sample size applies the ascertainment correction such that it now reflects the sample size that would have been achieved if the GWAS used a balanced design of 50% cases and 50% controls. You can find more information about calculating sum of effective sample size here. If the trait is continuous, the values should equal NA.

  3. population.prev: A vector of population prevalences. These estimates can be obtained from a number of sources, such as large scale epidemiological studies. Possible range = 0-1. If the trait is continuous the values should equal NA.

  4. ld: A folder of LD scores used as the independent variable in LDSC. Here we use the East Asian LD score reference panel estimated by the original LDSC developers using the HapMap3 SNPs from the 1000 Genomes Phase 3 sample. LD scores must match the ancestral background of the summary statistics provided (i.e., East Asian summary statistics with East Asian LD scores). We provide these LD scores (the eas_w_ld_chr folder) here. These LD scores are extracted to your working directory when using the refData function described at the top of this page.

  5. wld: A folder of LD scores used as the regression weights in LDSC. These are usually the same LD scores used for the ld argument.

  6. trait.names: Argument specifying the trait names. This is how the traits will be named in the genomic network. Names need to match the order of traits in the traits argument.

Example code:

#load Genomic SEM package
require(GenomicSEM)

#file names of munged sumstats created in step 1
traits <- c("T2D.sumstats.gz", "FPG.sumstats.gz", "BMI.sumstats.gz","SBP.sumstats.gz","TG.sumstats.gz","HDL.sumstats.gz")

#sample prevalence (note: entered as 0.5 for T2D as the munged sumstats include the sum of effective sample size)
sample.prev <- c(0.5,NA,NA,NA,NA,NA)

#population prevalence
population.prev <- c(0.1,NA,NA,NA,NA,NA)

#folder containing LD scores
#here we use East Asian LD scores from 1000 Genomes Phase 3
ld <- "eas_w_ld_chr/"
wld <- "eas_w_ld_chr/"

#vector of trait names
trait.names <- c("T2D","FPG","BMI","SBP","TG","HDL")

#run ldsc
LDSC_MET <- ldsc(traits=traits, sample.prev=sample.prev, population.prev=population.prev, trait.names=trait.names, ld=ld, wld=wld)

#optional code to save LDSC output for future use 
saveRDS(LDSC_MET, file="LDSC_MET.RDS")

The output (named LDSC_MET here) is a list that includes the following two key elements:

  1. LDSC_MET$S: the genetic covariance matrix (on the liability scale for case/control designs).
  2. LDSC_MET$V: the sampling covariance matrix in the format expected by lavaan.
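
If you would like a quick look at the corresponding genetic correlation matrix, base R's cov2cor function can be applied directly to the S matrix (a convenience step, not a required part of the workflow):

#convert the genetic covariance matrix to a genetic correlation matrix
round(cov2cor(LDSC_MET$S), 3)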

If you want to output the standard errors (SEs), Z-statistics, or p-values of the ldsc estimates in a matrix format mirroring the ordering of the covariance (i.e., S) matrix, you can run the code below. Note that the SE matrix is mirrored to its upper triangle before computing Z, so that the element-wise division is defined for the full symmetric S matrix.

#number of traits; SEs of the unique elements of S are on the diagonal of V
k <- nrow(LDSC_MET$S)
SE <- matrix(0, k, k)
SE[lower.tri(SE,diag=TRUE)] <- sqrt(diag(LDSC_MET$V))

#mirror the lower triangle so the element-wise division below is defined
SE[upper.tri(SE)] <- t(SE)[upper.tri(SE)]
Z <- LDSC_MET$S/SE
p <- 2*pnorm(abs(Z),lower.tail=FALSE)

Preparing data for a Network GWAS

The sumstats function prepares the univariate GWAS summary statistics so that they can be incorporated into the genomic network. This function will align GWAS effect sizes to the same reference allele and standardize the results relative to the variance in the phenotype.

To appropriately perform this standardization for different types of betas (e.g., logistic betas, odds ratios), the sumstats function takes a number of arguments, the key ones of which are described below. The function then combines these newly scaled betas and SEs into a single, listwise deleted file:

  1. files: The names of the univariate GWAS summary statistics files. Files should be listed in the same order as used for the ldsc function.

  2. ref: The reference file used to calculate SNP variance across traits. We use a 1000 Genomes Phase 3 reference file, which we make available for both East Asian and European genetic ancestries as part of the refData function detailed at the top of this help page.

  3. trait.names: The names of the traits in the order that they are listed for the files.

  4. se.logit: (TRUE/FALSE) Whether the standard errors (SEs) of the GWAS estimates are on a logistic scale. This is only relevant for binary (case/control) traits. For many binary outcomes, the SEs will be reported on a logistic scale even when the regression effects are reported as odds ratios (ORs). This information can usually be found in the README file for the summary statistics. If this is listed as FALSE, and all other additional arguments are FALSE, then the function expects the standard error of the OR. The function will automatically determine whether the effect column contains ORs or logistic betas based on the median GWAS effect size across SNPs (a median near 1 implies ORs, whereas a median near 0 implies logistic betas).

  5. OLS: (TRUE/FALSE) Whether the phenotype was a continuous outcome analyzed using an ordinary least squares (OLS; i.e., linear) estimator.

  6. linprob: (TRUE/FALSE) Whether the phenotype was a dichotomous outcome for which only Z-statistics are available in the summary statistics file -or- a dichotomous outcome analyzed using an OLS estimator. In this case, a form of standardization is performed in which logistic betas are backed out from the signed effect column. The default is NULL.

  7. N: A user-provided N listed in the order the traits are listed for the files argument. This is needed when either OLS or linprob is TRUE for a trait and there is no sample size column in the GWAS data. For OLS transformations, this should be the total sample size. When backing out a logistic beta using the linprob argument, this should be the sum of effective sample sizes across the cohorts contributing to the GWAS.

require(GenomicSEM)

#the GWAS files provided in the same order that they were listed for ldsc
files <- c("T2D_EAS_Metal_LDSC-CORR_Neff.v2.txt.gz","FPG_EastAsian288K_Meta.txt.gz", "GCST90278654_BMI.tsv.gz", "GCST90278668_SBP.tsv.gz", "TG_EastAsian288K_Meta.txt.gz","HDL_EastAsian288K_Meta.txt.gz")

#the reference file that matches the genetic ancestry of the included traits
ref <- "reference.1000G.maf.0.005.eas.txt"

#the names of the traits
trait.names <- c("T2D","FPG","BMI","SBP","TG","HDL")

#whether the standard error of the GWAS estimate is on a logistic scale 
#(only relevant for binary outcomes)
se.logit <- c(T,F,F,F,F,F)

#whether the phenotype was a continuous outcome analyzed using a linear estimator
OLS <- c(F,T,T,T,T,T)

#whether the GWAS data only includes Z-statistics -or- is a binary outcome analyzed using a linear estimator
linprob <- c(F,F,F,F,F,F)

#list the sample size of each trait that has TRUE for the OLS or linprob argument
#should reflect the sum of effective N for case-control traits and total N for continuous traits
#argument only necessary when this information is not included as a column in the GWAS data
N <- c(NA,288127,NA,NA,288127,288127)

SNPDATA_MET <- sumstats(files=files, ref=ref, trait.names=trait.names, se.logit=se.logit, OLS=OLS, linprob=linprob, N=N)

#optional code to save SNP output for future use 
saveRDS(SNPDATA_MET, file="SNPDATA_MET.RDS")
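
To sanity-check the merged output before moving on, you can inspect its dimensions and first few rows (this assumes, as in GenomicSEM, that sumstats returns a single data frame of the aligned and rescaled results):

#quick checks on the combined, listwise deleted output
dim(SNPDATA_MET)
head(SNPDATA_MET)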

Preparing data for a Network TWAS

Step 1: Run FUSION (univariate TWAS)

Prior to running Step 2, it is necessary to install the FUSION package along with its package dependencies and required files. The steps directly below for getting FUSION up and running are taken directly from the FUSION website, with the exception that the last line of code for installing the plink2R package has been modified so that it is compatible with more recent versions of R.

• Download and unpack the FUSION software package:

wget https://github.com/gusevlab/fusion_twas/archive/master.zip
unzip master.zip
cd fusion_twas-master

• Download and unpack the 1000 Genomes LD reference data:

wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2
tar xjvf LDREF.tar.bz2

• Download and unpack the plink2R library:

wget https://github.com/gabraham/plink2R/archive/master.zip
unzip master.zip

• Launch R and install required libraries:

install.packages(c('optparse','RColorBrewer'))
library(devtools)
devtools::install_github("carbocation/plink2R/plink2R", ref="carbocation-permit-r361")

The psychiatric GWAS data used here are taken from the PGC data available here. Note that the files listed below do not reflect those taken directly from the PGC website; rather, they are the files that have been formatted using the munge function described above.

For the purposes of this exercise we use the functional weights for the cerebellum from GTEx v8, which are also directly available for download here. FUSION takes the munged summary statistics (created with the munge function described above) as input and has 6 necessary arguments:

  1. sumstats: The file names/paths to the munged sumstats files.

  2. weights: The file name of the .pos file included in the folder of pre-computed functional weights that the FUSION developers make available.

  3. weights_dir: The file path to where the functional weights are stored.

  4. ref_ld_chr: The file path and general naming format for the LD reference data. For example, the 1000 Genomes reference files downloaded directly above are in the LDREF folder and follow the naming scheme 1000G.EUR.1., 1000G.EUR.2., etc.

  5. chr: Which chromosome is being analyzed.

  6. out: What FUSION should name the output file.

The code below runs FUSION for all seven psychiatric summary statistics using a basic bash for loop. In practice, this loop can be nested to, for example, run FUSION across all chromosomes or across multiple sets of functional weights. For the purposes of this exercise we analyze only chromosome 17, but in practice all chromosomes would be analyzed. Note that this code is not run in R; rather, it launches FUSION's R script from the command line using Rscript. The code directly below can be run via the terminal on a Mac or using Cygwin on a PC.

Example code:

for g in AN.sumstats BIP.sumstats SCZ.sumstats ADHD.sumstats ASD.sumstats MDD.sumstats CUD.sumstats;
do
Rscript FUSION.assoc_test.R \
--sumstats $g \
--weights GTExv8.EUR.Brain_Cerebellum.pos \
--weights_dir GTExv8.EUR.Brain_Cerebellum/ \
--ref_ld_chr ./LDREF/1000G.EUR. \
--chr 17 \
--out Cerebellum_CHR17_$g.dat;
done
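
Back in R, you can take a quick look at one of the resulting .dat files; TWAS.Z and TWAS.P are standard FUSION output columns, though the significance threshold below is purely illustrative:

#read one FUSION output file (tab-delimited with a header row)
scz <- read.table("Cerebellum_CHR17_SCZ.sumstats.dat", header=TRUE)

#count genes passing an illustrative Bonferroni threshold
sum(scz$TWAS.P < 0.05/nrow(scz), na.rm=TRUE)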

Step 2: Prepare the TWAS summary statistics for GNA

The read_fusion function takes care of a few things prior to running twasNET. In order to estimate a genomic network that incorporates imputed gene expression, the TWAS coefficients and their SEs are transformed such that they are scaled relative to unit-variance-scaled phenotypes. The read_fusion function then combines these newly scaled betas and SEs into a single, listwise deleted file. Please also note that read_fusion requires that the traits be listed in the same order as they are listed for the ldsc function described in Step 2 above. The read_fusion function takes 4 arguments:

  1. files: The names of the FUSION summary statistics files. Note that if FUSION is used to produce summary statistics across all 22 chromosomes and/or for multiple tissue types, we recommend combining the FUSION outputs within a phenotype prior to running read_fusion (a short sketch for doing so appears after this list). That is, as all twasNET runs are independent of one another, there is no need to run read_fusion separately for each chromosome or tissue (in fact, this would be an unnecessarily cumbersome way of going through the remaining steps of the network TWAS).

  2. trait.names: The names of the traits in the order that they are listed for the files argument.

  3. N: A user-provided N listed in the order the traits are listed for the files argument. Note that for binary traits this should reflect the sum of effective sample sizes across contributing cohorts; if the trait comes from a single cohort (e.g., a UK Biobank analysis), it should reflect the effective sample size for that cohort.

  4. binary: Whether the phenotype was a binary outcome (TRUE) or continuous (FALSE).
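
If FUSION was run per chromosome (or per tissue), the outputs for a given phenotype can be stacked into a single file before running read_fusion, as recommended for the files argument above. A minimal sketch, assuming per-chromosome .dat files with identical columns:

#stack per-chromosome FUSION outputs for one phenotype into a single file
chr_files <- paste0("Cerebellum_CHR", 1:22, "_SCZ.sumstats.dat")
combined <- do.call(rbind, lapply(chr_files, read.table, header=TRUE))
write.table(combined, "Cerebellum_ALL_SCZ.sumstats.dat", row.names=FALSE, quote=FALSE, sep="\t")

Example code: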

# List of TWAS summary statistics files
#note these should be in the same order as was used when running LDSC in Step 2
files <- list("Cerebellum_CHR17_AN.sumstats.dat", "Cerebellum_CHR17_BIP.sumstats.dat", "Cerebellum_CHR17_SCZ.sumstats.dat","Cerebellum_CHR17_ADHD.sumstats.dat","Cerebellum_CHR17_ASD.sumstats.dat","Cerebellum_CHR17_MDD.sumstats.dat", "Cerebellum_CHR17_CUD.sumstats.dat")

#trait names
trait.names <- c("AN", "BIP", "SCZ", "ADHD", "ASD", "MDD", "CUD")

# Sample sizes for each trait (sum of effective N, as all traits here are binary)
N <- c(45438.85,101068.2,117181.6,100868.6,43778,427682.7,27666.94)

# Whether each trait is binary (TRUE) or continuous (FALSE)
binary <- rep(TRUE,7)

#run read_fusion
PSYC_Genes <- read_fusion(files=files, trait.names=trait.names, N=N, binary=binary)