GWAS Harmonization And Imputation - hakyimlab/summary-gwas-imputation GitHub Wiki

Purpose

To harmonize GWAS summary statistics to a standard format and impute missing variants.

Tutorial

For those who prefer to jump in with both feet, there is a tutorial here that explains how to harmonize and impute summary statistics using 1000 Genomes reference data, for integration with GTEx v8 QTL data.

Prerequisites

We assume the code in this repository has been downloaded somewhere, together with a working Python 3.5 installation and its many dependencies (numpy, pandas, pyarrow, etc — see the specific wiki entry). We assume a UNIX+Bash environment with the following environment variables:

REPO=/path/to/your/repo
DATA=/path/to/your/data

In the following, feel free to replace $REPO and $DATA with the actual paths on your system. The following assumes GTEx data as the genotype reference panel, which can't be freely shared. We have an alternative data set, publicly available and based on 1000 Genomes, here. It is also available in Box here.

We assume that a metadata file was generated, as specified in the Variant selection subsection described here.

Depending on your particular use case, you might need a:

  • SNP definition file: a (gzipped) text file defining each variant's chromosomal coordinates:
chromosome      start   end     name
chr10   50325   50325   rs10221297
. . .

We used trimmed-down versions of the dbSNP files (like http://hgdownload.cse.ucsc.edu/goldenpath/hg18/database/snp130.txt.gz). The end column is currently ignored.

  • A Liftover chain: this file allows mapping coordinates from an older human genome release to a newer one. You can get those here: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/
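For reference, a SNP definition file in the format above can be loaded with a few lines of pandas. This is a minimal sketch, not the repository's code; as noted above, the end column carries no information and is skipped here:

```python
import io

import pandas as pd

# Load a SNP definition file into a (chromosome, position) -> rsid lookup.
# Column names follow the format shown above; the "end" column is ignored.
snp_def = io.StringIO(
    "chromosome\tstart\tend\tname\n"
    "chr10\t50325\t50325\trs10221297\n"
)
d = pd.read_csv(snp_def, sep="\t", usecols=["chromosome", "start", "name"])
lookup = {(r.chromosome, r.start): r.name for r in d.itertuples(index=False)}
print(lookup[("chr10", 50325)])  # rs10221297
```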

GWAS Summary Statistics harmonization

We refer to harmonization as the step that makes a GWAS summary statistics data set compatible with the reference panel of your choice. It might be the case that a public GWAS of interest is defined on an old version of the Human Genome (e.g. hg18) while the reference panel uses a newer one (e.g. GRCh38). The universe of variants differs between the two Human Genome references.

You might also notice that your GWAS lacks the chromosomal coordinates needed to lift over from an older Human Genome. There might also be format inconsistencies that break a reasonable parser. The zoology of public GWAS data sets is more messed up than a taxonomy of the Cthulhu Mythos.

Despair not! There is a script to tame such horrors as await the poor soul that ventures into the public GWAS nightmare of data sets. Note that we eagerly await any contribution a user may make in this effort: if you find a parsing feature lacking, please let us know.

Without further ado, let us show how you would batter a wild GWAS text file into submission:

python3 $REPO/gwas_parsing.py \
-gwas_file $DATA/GLGC/jointGwasMc_LDL.txt.gz \
-output_column_map rsid variant_id \
-output_column_map A2 non_effect_allele \
-output_column_map A1 effect_allele \
-output_column_map beta effect_size \
-output_column_map P-value pvalue \
-output_column_map se standard_error \
-output_column_map SNP_hg19 chr_pos \
-split_column chr_pos ':' chromosome position \
-output_column_map N sample_size \
-output_order variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue zscore effect_size standard_error sample_size n_cases \
-liftover $DATA/liftover/hg19ToHg38.over.chain.gz \
-snp_reference_metadata $DATA/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz METADATA \
-output results_gwas/GLGC_Mc_LDL.txt.gz

An explanation is in order.

  • Arguments of the form -output_column_map A1 effect_allele specify how to rename input GWAS columns to target column names, allowing conversion to a specific format. This argument can be specified multiple times, once per column. Although there is some flexibility in the naming, we recommend that at least the chromosome, position, allele and id columns use the output names stated here: some (optional but important) harmonization steps are only performed if the data has columns identified with these names.

  • -output_order is just that: the order you want for the output columns. It also works as a filter: columns not mentioned here will be dropped.
  • panel_variant_id should appear in the -output_order list unless you know what you are doing. panel_variant_id is generated internally and will match the GTEx v8 prediction model variant id.
  • -split_column: use this to split a column that contains a composite of values. In this example, the chromosomal coordinate is specified in each entry as chr1:10507, and it can be split into separate chromosome and position columns.
  • -liftover: if a Liftover chain file is given, conversion to the newer reference will be performed. This is optional.
  • -snp_reference_metadata: a file with reference metadata. If the flag METADATA is passed, the file is expected to be the output described here. This file is optional. If you provide such a file, the output will contain only entries successfully mapped to the reference data set. If your GWAS file has no frequency column (or has missing values in it), frequencies will be filled with those found in the reference panel. Input variants are matched to the reference panel if they coincide in chromosome, position and reference/effect alleles. Swapped alleles (e.g. the GWAS states C T and the reference states T C) cause a change of sign in zscore and effect_size, and the frequency is mapped to 1-f.
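The renaming, splitting and allele-swap logic above can be pictured with a toy pandas sketch. This is an illustration only, not the repository's implementation; the sample values and the reference alleles are made up:

```python
import pandas as pd

# A toy GWAS row with raw column names, as it might appear in the input file.
gwas = pd.DataFrame({
    "rsid": ["rs1"], "SNP_hg19": ["chr1:10507"],
    "A1": ["C"], "A2": ["T"], "beta": [0.12],
})

# -output_column_map: rename input columns to target names.
gwas = gwas.rename(columns={"rsid": "variant_id", "A1": "effect_allele",
                            "A2": "non_effect_allele", "beta": "effect_size"})

# -split_column chr_pos ':' chromosome position
gwas[["chromosome", "position"]] = gwas["SNP_hg19"].str.split(":", expand=True)

# Swapped alleles relative to a (made-up) reference panel entry: the sign of
# effect_size (and zscore, if present) flips, as described above.
ref_effect, ref_non_effect = "T", "C"
swapped = (gwas["effect_allele"] == ref_non_effect) & \
          (gwas["non_effect_allele"] == ref_effect)
gwas.loc[swapped, "effect_size"] *= -1
print(gwas.loc[0, "effect_size"])  # -0.12
```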

Other supported options are:

  • --snp_info_blacklist: use this to exclude a specific list of variants from the output.
  • --chromosome_format: use this if your input GWAS only contains the chromosome number and you want to convert it to, for example, chr1. Please note that this repository's tools mostly expect GWAS to have chromosomes in this format.
  • --insert_value: adds a new column to the output with a fixed value, e.g. --insert_value n_cases 34840 --insert_value sample_size 149461 for a case/control study.

We support further optional arguments covering various quirks found in some GWAS out there:

  • -separator: we expect GWAS columns to be separated by tabs. Sometimes the GWAS are CSVs. Use this parameter if the GWAS is not tab-separated.
  • -skip_until_header: some GWAS files state they are gzip-compressed text files when they are not; they are actually tar.gz files with a wrong extension. Or they might have been gzipped with a buggy version of the zlib library (like the one used in older macOS releases; Apple, you stink). With this option, you instruct the code to skip the initial contents of a gzipped file until a particular string is found (i.e. the expected data header).
  • -fill_from_snp_info: use a SNP definition file (as specified in the prerequisites above) to fill in the chromosomal coordinates of your file if they are missing.
  • -enforce_numeric_columns: we have run across public GWAS that contain random strings in the results file. This flag will force the zscore, pvalue, beta, se and other numeric columns to be strictly numeric or NA.
  • -input_pvalue_fix: some GWAS contain p-values that are too small for some platforms (e.g. 10^{-500}). This puts a lower threshold of resolution: smaller p-values will be converted to the specified value. The default is 10^-50. Use 0 to disable this behavior.
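The last two cleanups are easy to picture in pandas. This is a hedged sketch of the idea, not the script's code; the sample data and the PVALUE_FLOOR name are made up for illustration:

```python
import pandas as pd

# Toy data with the kinds of problems described above: an underflowing
# p-value and stray strings in numeric columns.
d = pd.DataFrame({"pvalue": ["1e-600", "0.05", "oops"],
                  "beta": ["0.1", "x", "-0.2"]})

# -enforce_numeric_columns: non-numeric entries become NA.
for c in ("pvalue", "beta"):
    d[c] = pd.to_numeric(d[c], errors="coerce")

# -input_pvalue_fix: p-values below the resolution threshold are raised to it
# (the default threshold is 1e-50). 1e-600 underflows to 0.0 as a float, so
# it gets floored here.
PVALUE_FLOOR = 1e-50
d.loc[d["pvalue"] < PVALUE_FLOOR, "pvalue"] = PVALUE_FLOOR
print(d["pvalue"].tolist())  # [1e-50, 0.05, nan]
```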

If odds ratios are reported instead of effect sizes (log(OR)), use the option -output_column_map Odds_ratio or (CHECK THIS)
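Whatever the exact flag ends up being, the underlying conversion is simply taking the natural log of the odds ratio:

```python
import math

# An odds ratio of 1.25 (made-up value) corresponds to effect_size = log(OR).
odds_ratio = 1.25
effect_size = math.log(odds_ratio)
print(round(effect_size, 4))  # 0.2231
```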

Options used for parsing the 114 GWAS used in the GTEx GWAS subgroups can be found here

GWAS summary stats imputation

The script gwas_summary_imputation.py imputes missing summary stats (actually, their zscores) using the covariance of the reference data set and the existing summary stats, within a specified region (such as Berisa & Pickrell's approximately independent LD blocks, appropriately lifted over). The underlying algorithm is that of DIST and ImpG.
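The estimator at the heart of DIST/ImpG can be sketched in a few lines of numpy: the z-score of an unobserved variant is a linear combination of observed z-scores, weighted through the (regularized) reference LD matrix. The toy LD matrix and z-scores below are made up for illustration:

```python
import numpy as np

# Toy LD (correlation) matrix among three variants; variant 0 is unobserved.
S = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
z_o = np.array([1.8, 2.1])  # observed z-scores of variants 1 and 2
lam = 0.1                   # the -regularization value, added to the diagonal

S_to = S[0, 1:]   # correlation of the target variant with the observed ones
S_oo = S[1:, 1:]  # correlation among the observed variants

# z_t = S_to @ inv(S_oo + lam*I) @ z_o
z_t = S_to @ np.linalg.solve(S_oo + lam * np.eye(2), z_o)
print(round(float(z_t), 3))  # 1.022
```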

Some other operations are performed.

Palindromic snp correction. Palindromic snps (e.g. C/G or A/T) are excluded from the input on account of possible strand ambiguity. They are then imputed, and the end result keeps the original palindromic zscore's absolute value with the sign obtained from the imputation.
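Identifying a palindromic variant is straightforward: its two alleles are complements of each other, so strand flips cannot be detected. A small sketch (the helper name is ours, not the script's):

```python
# A variant is palindromic (strand-ambiguous) when its effect and non-effect
# alleles are reverse complements of each other.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(effect_allele, non_effect_allele):
    return COMPLEMENT.get(effect_allele) == non_effect_allele

print(is_palindromic("C", "G"), is_palindromic("C", "T"))  # True False
```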

python3 $REPO/gwas_summary_imputation.py \
-by_region_file $DATA/eur_ld.bed.gz \
-gwas_file $DATA/formatted_gwas/GIANT_HEIGHT.txt.gz \
-parquet_genotype $DATA/reference/parquet_eur_maf0.01_biallelic/gtex_v8_eur_itm.chr1.variants.parquet \
-parquet_genotype_metadata $DATA/reference/parquet_eur_maf0.01_biallelic/gtex_v8_eur_itm.variants_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome 1 \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch 0 \
--standardise_dosages \
-output results_summary_imputation/GIANT_HEIGHT_chr1_sb0_reg0.1_ff0.01_by_region.txt.gz

This script takes as input a harmonized GWAS, as produced in the previous section, and a reference panel as discussed in the prerequisites.

  • -window: include additional variants within this many bases around each region when imputing. This is a safety measure; typically the variants outside each region have very little correlation with it and thus little impact on the imputation.
  • -chromosome: restrict the imputation to variants in a specific chromosome. This is useful to split the imputation job into smaller pieces.
  • -regularization: Use a Ridge-like regularization when inverting the covariance matrix (i.e. add a small number to the diagonal)
  • -frequency_filter: use only variants that exceed this MAF threshold.
  • -sub_batches: used to split an imputation job into smaller ones. This splits the list of available regions (or regions in this chromosome) into smaller subsets.
  • -sub_batch: which subset of regions to compute. It is zero-based, so must belong in the range [0, sub_batches-1]
  • --keep_palindromic_imputation: optional; report the imputed palindromic zscores.
  • --use_palindromic_snps: optional; keep palindromic snps in the input.
  • --standardise_dosages: Transform dosages to N(0,1) distribution. Effectively replaces covariance matrix with correlation matrix.
  • --cache_variants: this option does not alter the results: this is a runtime option that will cause each job to consume more memory, but run faster.
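The sub-batch mechanics can be pictured as partitioning the list of regions into subsets and running one zero-based subset per job. The round-robin split below is an assumption for illustration; the script's actual partitioning may differ, but the indexing convention is the same:

```python
# Split the available regions into `sub_batches` subsets and pick the one
# selected by the zero-based `sub_batch` index.
def select_sub_batch(regions, sub_batches, sub_batch):
    return [r for i, r in enumerate(regions) if i % sub_batches == sub_batch]

regions = [f"region_{i}" for i in range(25)]
# With -sub_batches 10 and -sub_batch 0, this job gets regions 0, 10 and 20.
print(len(select_sub_batch(regions, 10, 0)))  # 3
```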

GWAS summary stats post-processing

It is most likely that the GWAS imputation step was carried out in many sub-jobs. This outputs lots of result files, each containing the imputation results for a few regions. Afterwards, the sub-results can be integrated with the original GWAS to produce a friendly file containing all the imputation products.

python3 $REPO/gwas_summary_imputation_postprocess.py \
-gwas_file $DATA/formatted_gwas_hg38_1.2/GLGC_Mc_LDL.txt.gz \
-folder results_summary_imputation \
-pattern GLGC_Mc_LDL.* \
-parsimony 7 \
-output processed_summary_imputation/imputed_GLGC_Mc_LDL.txt.gz

The output file will contain the imputed variants and those original variants used as imputation input. The optional flag --keep_all_observed will also output the original entries for imputed variants.
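The gathering step can be pictured as collecting all sub-result files that match a pattern and stacking them into one table. This is a toy sketch, not the script's implementation: gather_results is our name, and it uses a glob pattern for simplicity, whereas the script's -pattern argument may be matched differently:

```python
import glob
import os
import tempfile

import pandas as pd

def gather_results(folder, pattern):
    # Collect every sub-result matching the pattern and stack them.
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    return pd.concat((pd.read_csv(p, sep="\t") for p in paths),
                     ignore_index=True)

# Tiny demonstration with two fake sub-result files:
tmp = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"variant_id": [f"rs{i}"], "zscore": [float(i)]}) \
      .to_csv(os.path.join(tmp, f"GWAS_sb{i}.txt"), sep="\t", index=False)
combined = gather_results(tmp, "GWAS_sb*.txt")
print(len(combined))  # 2
```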