2. Important resources and key information - GenomicSEM/GenomicSEM GitHub Wiki

You have likely run a GWAS yourself, but perhaps you have a hypothesis about traits that other groups analyzed in a GWAS, or a hypothesis about the genetic associations among a set of traits, and you don't know where to find the GWAS summary statistics for those traits. Even if you find summary statistics for all the traits you want to consider in a structural equation model, you are not done yet. You need to know how the GWAS was executed, and you need to know the nature of the trait: was it continuous, or was it a case/control analysis?

This page offers two things: 1) a discussion of what you need to know about a GWAS before you can include it in a GenomicSEM model and 2) a guide to the available GWAS summary statistics.

What you need to know about GWAS before you get started

  1. A genome-wide association study (GWAS) boils down to a linear regression of a phenotype (y) on a genetic variant, usually a single nucleotide polymorphism (x). For each SNP, this regression yields a parameter estimate (beta), a test statistic (Z or t), and information that identifies the allele with respect to which the effect size is computed. When available for a considerable portion of all SNPs, this information is sufficient to compute the heritability of the traits and the genetic correlation between traits. It is also sufficient to fit structural equation models to the genetic covariance between several traits.

  2. You need the full or very lightly cleaned summary statistics generated from a GWAS. If the authors provide summary statistics only for the top 5,000 SNPs, or even the top 100,000 "pruned" SNPs, this is not sufficient. Often, if you get in touch with the authors, they have a mechanism for you to obtain the full summary statistics. Sometimes this may involve you agreeing not to identify the participants in their study. Sometimes you may need to sign some documents.

  3. You need to know whether the GWAS was a logistic regression or a linear regression. Note that not all case/control studies use logistic regression, because logistic regression can be computationally prohibitive when sample sizes are huge. When a dichotomous outcome (e.g., a case/control trait) is analyzed using a linear regression (e.g., binary outcomes for the HAIL GWAS), this is called a "linear probability model", and it is strictly speaking misspecified. The function sumstats can handle this scenario through the linprob argument. The package can also deal with a GWAS of a continuous trait analyzed using linear regression (use the OLS flag in sumstats to indicate which GWAS are of continuous traits), or a case/control trait analyzed using logistic regression (the default in sumstats). The decision tree directly below can be used to decide what the correct arguments are for the sumstats function based on the scale of the GWAS outcome and how that outcome was analyzed. In the decision tree, we list the necessary arguments for each individual summary statistics file. In practice, you would combine the appropriate elements passed to each argument into a single vector that matches the order in which the summary statistics are listed.
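
The three cases named in point 3 can also be written out as a small lookup. The sketch below is in Python purely for illustration; Genomic SEM is an R package and the real call is sumstats in R. The helper names `flags_for_trait` and `combine_flags` and the example traits are hypothetical, and only the OLS and linprob arguments mentioned above are covered, so any further arguments listed in the decision tree still need to be set separately.

```python
# Illustrative only: this encodes the three cases described in the text,
# it does not call Genomic SEM.

def flags_for_trait(outcome, model):
    """Return the OLS and linprob settings for one summary statistics file.

    outcome: "continuous" or "binary" (case/control)
    model:   "linear" or "logistic"
    """
    if outcome == "continuous" and model == "linear":
        return {"OLS": True, "linprob": False}
    if outcome == "binary" and model == "linear":
        # Case/control trait analyzed with a linear probability model.
        return {"OLS": False, "linprob": True}
    if outcome == "binary" and model == "logistic":
        # Described in the text as the default handling in sumstats.
        return {"OLS": False, "linprob": False}
    raise ValueError(f"unsupported combination: {outcome!r} with {model!r}")


def combine_flags(traits):
    """Build one vector per argument, in the order the files are listed."""
    per_trait = [flags_for_trait(outcome, model) for outcome, model in traits]
    return {
        "OLS": [f["OLS"] for f in per_trait],
        "linprob": [f["linprob"] for f in per_trait],
    }


# Hypothetical example: three files, in the order they would be listed.
traits = [
    ("binary", "logistic"),    # case/control, logistic regression
    ("continuous", "linear"),  # continuous trait, linear regression
    ("binary", "linear"),      # case/control, linear probability model
]
vectors = combine_flags(traits)
print(vectors)
```

Each resulting list has one entry per file, which mirrors how the elements for each argument are combined into a single vector in the same order as the summary statistics.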

  1. Another issue is the use of "linear mixed models" (LMM) in GWAS. These models are used to guard against population stratification and cryptic relatedness between study participants. The pitfalls of, and opportunities afforded by, these models are discussed by Yang et al. (2014), who examine the relationship between the chi-square from traditional linear regression and two different types of mixed model approaches: MLMe and MLMi. Figure 1 of that paper shows that MLMi leads to a clear deflation in the chi-square statistic relative to linear regression. This in turn deflates the heritability estimate obtained from LDSC, so MLMi is not appropriate for LDSC or Genomic SEM. In practice this is rarely a concern, because most recently developed packages, including SAIGE and BOLT-LMM, use MLMe. The relationship between heritability estimates obtained from linear regression and MLMe is not exactly 1, but we have generally found for traits with h2 < 50% and participant sample sizes < 400k that BOLT-LMM and traditional linear regression LDSC results are very similar when comparing estimates obtained from the same participant samples and phenotypes. Therefore, when running a model without individual SNP effects using the usermodel or commonfactor functions, estimates from MLMe (and consequently BOLT-LMM) seem to be fine to use in Genomic SEM.

That said, while the heritability estimate should be very consistent between MLMe and traditional linear models, Loh et al. (2018) find in a paper titled “Mixed-model association for biobank-scale datasets” that the univariate LDSC intercept in BOLT-LMM is the most difficult quantity to interpret, as it will “rise above 1 with increasing sample size and heritability.” This LDSC univariate intercept is used in the sampling covariance matrix for the multivariate GWAS function in Genomic SEM to account for confounding, so BOLT-LMM estimates will produce an inflated LDSC intercept and consequently conservative multivariate GWAS estimates. Since BOLT-LMM controls for sample structure that would otherwise be captured by the LDSC univariate intercept, for BOLT-LMM traits it is most sensible to set the LDSC intercept to 1. For now, this can be achieved by manually setting the diagonals of the intercept matrix (the third part of the LDSC output, named I) to 1 for the traits analyzed using BOLT-LMM.
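
The manual adjustment described above amounts to overwriting selected diagonal entries of a square matrix. The following is a minimal sketch in Python, not Genomic SEM code: the function name `set_intercepts_to_one`, the nested-list matrix, and all numbers are made up for illustration. In R the same idea would be applied to the I element of the LDSC output.

```python
def set_intercepts_to_one(intercepts, bolt_lmm_traits):
    """Return a copy of the intercept matrix with selected diagonals set to 1.

    intercepts:      square matrix as a list of rows (stand-in for the I matrix)
    bolt_lmm_traits: zero-based positions of the traits analyzed with BOLT-LMM
    """
    adjusted = [list(row) for row in intercepts]  # copy; the input is untouched
    for k in bolt_lmm_traits:
        adjusted[k][k] = 1.0  # univariate LDSC intercept of trait k
    return adjusted


# Made-up intercept matrix for three traits. Traits 0 and 2 are assumed to
# come from BOLT-LMM; trait 1 is assumed to come from standard regression.
I = [
    [1.12, 0.03, 0.01],
    [0.03, 1.02, 0.00],
    [0.01, 0.00, 1.20],
]
I_adjusted = set_intercepts_to_one(I, bolt_lmm_traits=[0, 2])
print(I_adjusted)
```

Only the diagonal entries of the BOLT-LMM traits change. The off-diagonal (cross-trait) intercepts and the diagonals of the other traits are left as estimated.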

  1. An important issue is the exact sample size. Be wary of the sample size reported in abstracts. Sometimes it includes a replication cohort which did not contribute to the summary statistics. Sometimes it includes a dataset that cannot be freely shared by the authors themselves. The consumer genomics company 23andMe, for example, doesn't allow authors to share summary statistics data on all SNPs if its participants are in a study. You can request access to the summary data via 23andMe or you can often obtain summary statistics from a GWAS consortium that are based on all contributing datasets except for 23andMe. Before you use a reported N, make sure it is the correct one, as your choice of N will influence results in GenomicSEM. In many cases, the summary statistics will include a sample size column, which the package will use for calculations.

  2. For LD score regression to produce accurate results, it is critical that the user both include only summary statistics that were calculated within a single ethnic group AND match these summary statistics with LD scores for the same ethnicity. All examples on the wiki currently use European-only summary statistics coupled with European LD scores. However, this is strictly due to the current availability of European summary statistics relative to other ethnic groups, and both LD score regression and Genomic SEM can be applied to summary statistics from other populations as these become available.
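
To make the first point in the list above concrete, here is a minimal sketch of the per-SNP regression that produces beta, its standard error, and Z. It is plain Python on toy data and makes several assumptions: genotypes are coded as the number of copies of the effect allele (0, 1, or 2), there are no covariates, and the function name `snp_association` is hypothetical. Real GWAS software adjusts for covariates and repeats this for every SNP.

```python
import math


def snp_association(genotypes, phenotype):
    """Simple linear regression of a phenotype on one SNP.

    Returns (beta, se, z), where beta is the effect per copy of the
    effect allele, se its standard error, and z = beta / se.
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom
    # (one intercept and one slope are estimated).
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(genotypes, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se, beta / se


# Toy data for six individuals (made up for illustration).
genotypes = [0, 1, 2, 0, 1, 2]
phenotype = [0.0, 1.0, 2.0, 1.0, 2.0, 3.0]
beta, se, z = snp_association(genotypes, phenotype)
print(beta, se, z)
```

The sign of beta depends on which allele is counted in the genotype coding, which is why the summary statistics must state the allele with respect to which the effect was computed.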

Where to get GWAS summary statistics

Below is a brief and incomplete list of links to consortia data pages where summary statistics are available.

  1. The PGC (Psychiatric Genomics Consortium) has analyzed all common DSM-IV axis-I psychiatric disorders (MDD, Schizophrenia, ADHD, OCD, Bipolar Disorder and more).

  2. The SSGAC (Social Sciences Genetic Association Consortium) performs genome-wide association studies of a variety of social and psychological traits like education, personality, and reproductive behavior.

  3. The Nealelab quickly ran and published online GWAS of >4000 traits that were measured as part of the UK Biobank. These traits include many diseases (ICD-10 diagnostic codes, both self-reported and based on hospital data), social traits (e.g. social deprivation), personality traits (e.g. neuroticism), cognition (e.g. memory) and many more (from snoring to the propensity to drive too fast). The Nealelab ran these GWAS very quickly and as a service to the field. Their GWAS of case/control traits use linear regression (linear probability model). Please read their extensive README, which describes their GWAS analysis in detail.

  4. The CCACE (Centre for Cognitive Ageing and Cognitive Epidemiology) has published GWAS on assorted personality traits, cognitive traits, and tiredness.

  5. Members of the CTGlab (Complex Trait Genetics Lab) have published several high-quality GWAS on IQ, insomnia and other traits.

  6. The GPC (Genetics of Personality Consortium) published several, slightly dated, GWAS on the "Big 5" personality scales.

  7. The EGG (Early Growth Genetics) Consortium performs GWAS of traits related to early growth.

  8. The GIANT consortium publishes GWAS, mainly about anthropometric traits.

  9. The ENIGMA consortium has published GWAS of subcortical brain volumes and hippocampal volumes.