2.1 Calculating Sum of Effective Sample Size and Preparing GWAS Summary Statistics - GenomicSEM/GenomicSEM GitHub Wiki

Calculating Sum of Effective Sample Size and Preparing GWAS Summary Statistics

In a recent preprint (https://www.medrxiv.org/content/10.1101/2021.09.22.21263909v1), we describe how the sum of effective sample size across contributing cohorts for a GWAS meta-analysis of case/control cohorts is necessary in order to calculate an unbiased estimate of liability scale heritability. The key here is that effective sample size calculated using the total number of cases and controls fails to account for variability in levels of ascertainment across contributing cohorts. Here we show examples of different ways of calculating the sum of effective sample size along with examples of how to reformat summary statistics so that they can be appropriately interpreted by the 'munge' and 'sumstats' function in Genomic SEM. In addition, at the bottom of this page we list our recommendations for prioritizing different approaches to calculating sum of effective sample size for binary traits. Note that for summary statistics for case/control traits that were not meta-analyzed (e.g., an analysis within UK Biobank only) that the total sample size and sample prevalence may be provided for liability scale conversion per usual (or the effective sample size and sample prevalence of 0.5 may be entered as these should produce equivalent results for single cohort, case/control analyses). For continuous traits, the total sample size should be used for all calculations within Genomic SEM.

Anxiety Disorder

In this first example we go over how to reformat the Anxiety disorder summary statistics from Purves et al. (2019) that meta-analyzed case/control anxiety data across UK Biobank, iPSYCH and the ANGST consortium. These summary statistics can be downloaded here: https://drive.google.com/drive/folders/1fguHvz7l2G45sbMI9h_veQun4aXNTy1v.

In this case, information about the sample sizes of the cohorts within the consortiums are not available such that the sum of effective sample size cannot be directly calculated. In this instance, we back out effective sample size directly from the summary data using the equation: 4/(2pq x logistic_SE^2), where 2pq is the SNP variance calculated as 2MAF(1-MAF), where MAF is the minor allele frequency. As these summary statistics also do not include allele frequency information we use an approximation of the allele frequency from the 1000 Genomes Phase 3 European reference file that can be found here: https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v. Note also that we cap the estimated, sum of effective sample size at 1.1 and 0.5 of the total effective sample size following the recommendations from Prive et al. (2022) [https://www.biorxiv.org/content/10.1101/2021.03.29.437510v3]

#load in required packages
require(data.table)
require(dplyr)

#read in Anxiety summary stats from Purves et al. 2019
ANX<-fread("META_UKBB_iPSYCH_ANGST_wNcol.sumstats",data.table=FALSE)

#read in 1000 genomes reference file used to get approximation of SNP MAF
#as MAF not present in the anxiety summary statistics file
ref<-fread("reference.1000G.maf.0.005.txt",data.table=FALSE)

#subset reference file to just SNP and MAF
attach(ref)
ref<-data.frame(SNP,MAF)

#merge Anxiety and reference file
ANX<-inner_join(ANX,ref,by="SNP",all=F)

#calculate effective sample size implied by GWAS summary statistics 
ANX$Neff<-4/((2*ANX$MAF*(1-ANX$MAF))*ANX$StdErr^2)

#calculate total effective N to cap backed out Neff
Ncases<-31977 
Ncontrols<-82114
v<-Ncases/(Ncases+Ncontrols)
TotalNeff<-4*v*(1-v)*(Ncases+Ncontrols)

#cap at 1.1 of total effective N
ANX$Neff<-ifelse(ANX$Neff > 1.1*TotalNeff, 1.1*TotalNeff, ANX$Neff)

#lower limit of 0.5 of total effective N
ANX$Neff<-ifelse(ANX$Neff < 0.5*TotalNeff, 0.5*TotalNeff, ANX$Neff)

#remove reference panel MAF from file
ANX$MAF<-NULL

#remove sample size column so munge knows to use Neff column 
ANX$TotalSampleSize<-NULL

#output the summary stats with the SNP-specific effective sample size column
write.table(ANX, file = "ANX_withNeff.txt", sep = "\t", quote=FALSE,row.names=FALSE,col.names=TRUE)

Major Depressive Disorder

In our second example, we use the summary statistics from Howard et al. (2019) that meta-analyzed data for major depressive disorder across UK Biobank and the Psychiatric Genomics Consortium (PGC). This data can be downloaded here: https://datashare.ed.ac.uk/handle/10283/3203. As with the anxiety disorder summary stats, these summary stats do not include SNP-specific effective sample size, and there is insufficient information to directly calculate effective sample size based on cohort-level case/control numbers. However, these data do include in-sample allele frequency, which we use for backing out effective sample size.

#read in the data file from Howard et al. (2019)
MDD<-fread("PGC_UKB_depression_genome-wide.txt",data.table=FALSE)

#convert allele frequency column to minor allele frequency for effective sample size calculation below
MDD$MAF<-ifelse(MDD$Freq > .5, 1-MDD$Freq, MDD$Freq)

#remove Freq column now that we have created MAF column
MDD$Freq<-NULL

#calculate SNP-specific sum of effective sample size
MDD$Neff<-4/((2*MDD$MAF*(1-MDD$MAF))*MDD$StdErrLogOR^2)

#calculate total effective N to cap backed out Neff
#these are the case control numbers from table 1 of Howard et al. (2019)
#note that these numbers exclude 23andMe
Ncases<-127552+43204
Ncontrols<-233763+95680
v<-Ncases/(Ncases+Ncontrols)
TotalNeff<-4*v*(1-v)*(Ncases+Ncontrols)

#cap at 1.1 of total effective N
MDD$Neff<-ifelse(MDD$Neff > 1.1*TotalNeff, 1.1*TotalNeff, MDD$Neff)

#lower limit of 0.5 of total effective N
MDD$Neff<-ifelse(MDD$Neff < 0.5*TotalNeff, 0.5*TotalNeff, MDD$Neff)

#output the updated MDD file
write.table(MDD, file = "MDD_withNeff.txt", sep = "\t", quote=FALSE,row.names=FALSE,col.names=TRUE)

Alcohol Use Disorder

In the third example we use case/control summary statistics for Alcohol Use Disorder from the Walters et al. (2018) that are available for download here: https://figshare.com/articles/dataset/sud2018-alc/14672187?file=28169778. Note that we are specifically using the unrelated European summary statistics with the file name: pgc_alcdep.eur_unrelated.aug2018_release.txt. These summary statistics already contain a SNP-specific sum of effective sample size column (called 'Weight'), and no further calculations are needed. However, the SNP column is formatted as rsID:BP:A1:A2 and needs to be reformatted to contain only rsID; code for this is shown below.

We highlight that it is recommended to confirm whether effective sample size columns in fact reflect the sum of effective sample sizes across cohorts, or alternatively reflect the effective sample size calculated using the total number of cases and controls across cohorts, with only the former version being appropriate for Genomic SEM analyses. This can be determined by examining the ReadMe file for the summary statistics and/or by calculating the sum of effective sample size and determining whether any SNP-specific effective sample sizes are greater than this value. If this SNP-specific sample sizes exceed the sum of effective sample sizes this typically indicates it was calculated using the total number of cases and controls and the column should be deleted as it is not appropriate for Genomic SEM analyses. In this example for alcohol use disorder, we find that the maximum value in the Weight column is 23282.33, which is equal to the sum of effective sample sizes calculated using the cohort-level information that can be found in Table 1 of the Walters et al. manuscript.

#read in Alcohol data to make rsID
ALCH<-fread("pgc_alcdep.eur_unrelated.aug2018_release.txt",data.table=FALSE)

#load in stringr package for using strsplit function below
require(stringr)

#reformat SNP column to only include rsID
ALCH$SNP<-sapply(strsplit(ALCH$SNP, ":"), '[', 1)

#output datafile
write.table(ALCH, file = "ALCH_withrsID.txt", sep = "\t", quote=FALSE,row.names=FALSE,col.names=TRUE)

Post-traumatic Stress Disorder (PTSD)

In our fourth example, we use the PTSD Freeze 1 summary statistics from Duncan et al. (2018) available from the PGC website at: https://figshare.com/articles/dataset/ptsd2018/14672106. These summary statistics do not require any reformatting. In addition, while they do not include SNP-specific sum of effective sample sizes, information about the number of cases and controls in each cohort are available to directly calculate the sum of effective sample size. Oftentimes, this cohort-level information is available in Supplementary Tables. For PTSD, this is available in Supplementary Table 1 of the manuscript. We specifically pulled the European ancestry case/control numbers, put that in an .csv file, and read that .csv file in here to calculate the sum of effective sample size. This .csv file can be found at: https://utexas.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v. In practice, it is not required that this step be done in R, but is simply shown that way here for illustrative purposes.

##read in information about sample size per cohort
PTSD<-read.csv("PTSDcohorts.csv",header=TRUE)

#calculate sample prevalence for each cohort
PTSD$v<-PTSD$cases/(PTSD$cases+PTSD$controls)

#calculate cohort specific effective sample size
PTSD$EffN<-4*PTSD$v*(1-PTSD$v)*(PTSD$cases+PTSD$controls)

#calculate sum of effective sample size: 5831.346
sum(PTSD$EffN)

We use these reformatted summary statistics for many of the analyses throughout the remainder of the GenomicSEM wiki. We note that each set of summary statistics may have its own set of nuances, but have endeavored to select a set of four summary statistics as exemplars with different aspects to consider. In general, we recommend prioritizing calculating of effective sample size as follows, with option 1 being the best case scenario:

  1. If possible, use the SNP-specific sum of effective sample size information already in the summary statistics file, as was used with alcohol use disorder in the example above.

  2. If this information is not available, calculate sum of effective sample size using the cohort-level, case/control information that may be found in the manuscript supplementary tables. In our examples, this is what was done for PTSD.

  3. In some instances, the summary statistics may reflect a meta-analysis of meta-analyses, e.g., when large biobank data is meta-analyzed with consortium data, wherein the consortium is itself made up of a meta-analysis of individual cohorts. For analyses such as this, case/control numbers for the cohorts that make up the consortia may not be available, rendering option 2 not possible. If this happens we recommend backing out the SNP-specific effective sample and using the in-sample MAF to do so if this is available. This is what we did for the MDD example above.

  4. If options 1 and 2 are not possible, and in-sample MAF is not available, then reference panel MAF may be used as an approximation in order to back out SNP-specific effective sample size. This is what we did for the Anxiety example. We note that discrepancies between in-sample and reference panel MAF may cause biases, particularly for low MAF SNPs, but we have not observed these biases to substantially change qualitative conclusions drawn from analyses. Even still, this is why this is listed as a last resort option for calculating the sum of effective sample size and results should be interpreted in light of this limitation.