Preparing Input Files - ajaynadig/bhr GitHub Wiki

Preparing Input Files for BHR

Variant-level summary statistics

Overview: A text file with one row per variant per trait and one column per required variable (the column names listed below are mandatory). For example, if you are analyzing 15,000 genes, each with 10 variants, across two traits, this file will have 300,000 lines, excluding the header. Note: We provide a Hail script in the example directory (file: genebass_variant_filter_january_2023.py) for downloading Genebass variant-level summary statistics and annotating the file with BHR parameters.

Required columns

a) Gene name (required column name: gene): Any gene naming convention is valid (e.g. ENSEMBL ID), as long as the convention is consistent with that used in the Baseline-BHR file (see below).

b) Chromosome (required column name: chromosome): The chromosome of the gene.

c) Gene position in base pairs (required column name: gene_position): The position of the gene in base pairs. Note that these values are only used to order genes to divide them into jackknife blocks, so minor variations due to genome build, TSS vs midpoint, etc., should not meaningfully change results.

d) Phenotype sample size (required column name: N): Phenotype sample size in the association study. In the case of a case/control association study, this number should be n_cases + n_controls.

e) Variant per-allele effect sizes (required column name: beta): The per-allele effect size of the variant, e.g. in units of sd(phenotype)/sd(genotype), from a linear model. Note that, in the case of a case/control association study, inputting betas from a logistic regression will give incorrect output. Linear model per-allele effect sizes can be computed from case/control allele counts; see Equation 34 from the BHR paper and the example using BipEx data.

f) Allele frequency (required column name: AF): The frequency of the allele. Note: users may instead provide the variance of the variant by supplying a column named variant_variance and setting custom_variant_variances = TRUE.

g) Phenotype name (required column name: phenotype_key): Phenotype name; any string is valid.
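
Because gene_position (item c) is only used to order genes before they are divided into jackknife blocks, small coordinate differences should rarely change block membership. The following is a minimal Python sketch of such an ordering step. The function name assign_blocks, the number of blocks, and the equal-size contiguous split are illustrative assumptions, not BHR's actual implementation.

```python
def assign_blocks(genes, n_blocks):
    """Order genes by (chromosome, position) and split them into contiguous blocks.

    genes: list of (gene_id, chromosome, gene_position) tuples; chromosome is
    assumed to be an integer here so that sorting is well defined.
    Returns a dict mapping gene_id -> block index in [0, n_blocks).
    """
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    n = len(ordered)
    return {gene_id: i * n_blocks // n
            for i, (gene_id, _, _) in enumerate(ordered)}


# Hypothetical coordinates. The second list shifts every position by 500 bp,
# mimicking a TSS-vs-midpoint or genome-build difference.
genes = [("geneA", 1, 1_000_000), ("geneB", 1, 2_000_000),
         ("geneC", 2, 500_000), ("geneD", 2, 3_000_000)]
shifted = [(g, c, p + 500) for g, c, p in genes]

blocks = assign_blocks(genes, n_blocks=2)
blocks_shifted = assign_blocks(shifted, n_blocks=2)
```

In this toy case both coordinate sets yield the same block assignment, since only the relative order of genes matters.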

Note that having other columns with additional variant-level information will not interfere with the BHR analysis.
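
If you prefer to supply variant_variance rather than AF (item f), one possible way to derive it is sketched below. It assumes Hardy-Weinberg equilibrium, under which the variance of a 0/1/2 genotype count is 2 * AF * (1 - AF). Whether this matches the scaling BHR expects for variant_variance is an assumption that should be checked against the package; the helper name genotype_variance is hypothetical.

```python
def genotype_variance(af):
    """Variance of a 0/1/2 genotype count under Hardy-Weinberg equilibrium."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"allele frequency must be in [0, 1], got {af}")
    return 2.0 * af * (1.0 - af)


# The first two frequencies come from the example table; 0.5 is a sanity check.
afs = [5.1037e-06, 1.2695e-06, 0.5]
variant_variance = [genotype_variance(af) for af in afs]
```

For very rare variants, such as those in the example table, this is close to 2 * AF.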

Example:

             gene chromosome gene_position      N      beta         AF phenotype_key
  ENSG00000000419         20      49563248 375630  0.014129 5.1037e-06          50NA
  ENSG00000000419         20      49563248 375630  0.445650 1.2695e-06          50NA
  ENSG00000000419         20      49563248 375630 -0.640410 1.2691e-06          50NA
  ENSG00000000419         20      49563248 375630  0.250800 2.5382e-06          50NA
  ENSG00000000419         20      49563248 375630 -0.382830 1.2746e-06          50NA
  ENSG00000000419         20      49563248 375630  0.288030 2.5496e-06          50NA
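
Before running BHR, it can help to confirm that a file has the required columns. The sketch below uses only the Python standard library. It assumes a tab-delimited file that uses the AF column (not variant_variance); the function name check_sumstats is hypothetical and is not part of BHR.

```python
import csv
import io

REQUIRED = {"gene", "chromosome", "gene_position", "N", "beta", "AF",
            "phenotype_key"}


def check_sumstats(handle, delimiter="\t"):
    """Confirm the required columns exist and return the number of data rows."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    n_rows = 0
    for row in reader:
        float(row["beta"])  # raises ValueError if the effect size is not numeric
        float(row["AF"])    # raises ValueError if the frequency is not numeric
        n_rows += 1
    return n_rows


# Two rows from the example above, rewritten as tab-delimited text.
example = io.StringIO(
    "gene\tchromosome\tgene_position\tN\tbeta\tAF\tphenotype_key\n"
    "ENSG00000000419\t20\t49563248\t375630\t0.014129\t5.1037e-06\t50NA\n"
    "ENSG00000000419\t20\t49563248\t375630\t0.445650\t1.2695e-06\t50NA\n"
)
n_rows = check_sumstats(example)
```

The returned row count can be compared against the expected number of variant-trait pairs. Extra columns are ignored by this check.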

Baseline-BHR

Overview: A text file with one row per gene and one column per gene set annotation, with elements equal to 1 to denote gene set membership and 0 otherwise. A Baseline-BHR file is required for BHR, as failure to control for frequency-dependent architecture can lead to bias in heritability estimates (analogous to the motivation for the baseline model in LD Score Regression).

We provide Baseline-BHR files with annotations corresponding to quintiles of the observed/expected loss-of-function distribution (see manuscript and reference_files in this repository). BHR will also estimate genetic architecture parameters for annotations in the baseline model.

Required variables

a) Gene name (required column name: gene): Same gene name convention as in the variant-level summary statistics file.

b) Gene membership annotations (required column names: no restrictions): 1 or 0 to denote presence/absence of the gene in the gene set. Note: if intercept = TRUE, the union of baseline annotations must not span all genes, to avoid collinearity with the intercept. As seen in the example below, we avoid collinearity by omitting a single baseline annotation from the regression.

Example:

             gene baseline_oe1 baseline_oe2 baseline_oe3 baseline_oe4
  ENSG00000000419            0            0            1            0
  ENSG00000000457            0            1            0            0
  ENSG00000000460            0            0            1            0
  ENSG00000000938            0            1            0            0
  ENSG00000000971            0            1            0            0
  ENSG00000001036            0            0            1            0
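
The collinearity condition above (with intercept = TRUE, the baseline annotations must not jointly cover every gene) can be checked before running BHR. The sketch below is a minimal illustration; the function name has_reference_group is hypothetical, and the last row is a made-up gene standing in for a member of the omitted annotation.

```python
def has_reference_group(rows, annotation_columns):
    """True if at least one gene belongs to none of the listed annotations."""
    return any(all(row[col] == 0 for col in annotation_columns)
               for row in rows)


columns = ["baseline_oe1", "baseline_oe2", "baseline_oe3", "baseline_oe4"]

# First three rows are taken from the example above.
rows = [
    {"gene": "ENSG00000000419", "baseline_oe1": 0, "baseline_oe2": 0,
     "baseline_oe3": 1, "baseline_oe4": 0},
    {"gene": "ENSG00000000457", "baseline_oe1": 0, "baseline_oe2": 1,
     "baseline_oe3": 0, "baseline_oe4": 0},
    {"gene": "ENSG00000000460", "baseline_oe1": 0, "baseline_oe2": 0,
     "baseline_oe3": 1, "baseline_oe4": 0},
    # Hypothetical gene that falls in the omitted annotation (all zeros).
    {"gene": "HYPOTHETICAL_GENE", "baseline_oe1": 0, "baseline_oe2": 0,
     "baseline_oe3": 0, "baseline_oe4": 0},
]

ok_with_intercept = has_reference_group(rows, columns)
ok_without_reference = has_reference_group(rows[:3], columns)
```

If every gene has a 1 in some annotation column, as in the first three rows alone, the annotation columns sum to the intercept column and the regression design is collinear.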

Gene set annotations

Overview: BHR can accept an arbitrary number of gene sets, in addition to the Baseline-BHR annotations. BHR will estimate genetic architecture parameters for these gene sets.

Required variables

a) Gene name (required column name: gene): Same gene name convention as in the variant-level summary statistics file.

b) Gene membership annotations (required column names: no restrictions): 1 or 0 to denote presence/absence of the gene in the gene set.

Example:

             gene gene_set_1
  ENSG00000187634          0
  ENSG00000188976          0
  ENSG00000187961          0
  ENSG00000187583          1
  ENSG00000187642          0
  ENSG00000188290          0
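
A gene set annotation file like the one above can be generated from a list of member genes. The sketch below uses only the Python standard library and assumes tab-delimited output; the function name build_gene_set_rows is hypothetical, and the gene IDs are those from the example.

```python
import csv
import io


def build_gene_set_rows(all_genes, members, set_name="gene_set_1"):
    """Return one row per gene with a 0/1 membership indicator."""
    member_set = set(members)
    unknown = member_set - set(all_genes)
    if unknown:
        # Usually a sign of inconsistent gene naming conventions.
        raise ValueError(f"members not in gene list: {sorted(unknown)}")
    return [{"gene": g, set_name: int(g in member_set)} for g in all_genes]


all_genes = ["ENSG00000187634", "ENSG00000188976", "ENSG00000187961",
             "ENSG00000187583", "ENSG00000187642", "ENSG00000188290"]
rows = build_gene_set_rows(all_genes, members=["ENSG00000187583"])

# Write the table as tab-delimited text (to an in-memory buffer here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["gene", "gene_set_1"],
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
annotation_text = buffer.getvalue()
```

Raising an error on unknown members guards against mixing gene naming conventions between the gene set and the other input files.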