Cohort level compound heterozygosity - hms-dbmi/RaMeDiES GitHub Wiki

Compound heterozygous variant recurrence

github_comphet

:cyclone: About

RamediesCH is used to assess whether specific genes harbor more or more deleterious compound heterozygous variant configurations across a cohort than expected by chance.

:cyclone: Quick Run

python ramediesCH.py --i=/full/path/to/github/repo/RaMeDiES/test/input --o=test

:warning: Processed input variant files must include the inheritance column with mom or dad specified per variant.

:warning: By default, all properly-formatted and processed files within the directory specified by --i will be considered. You MUST REMOVE processed input files from this directory that correspond to individuals from families with genetic evidence of consanguinity.

:exclamation: Expected output files for our provided input test files can be found in test/output/.

Running on our provided test data, on a single 2.60GHz core with 0.5GB of RAM, should take about 2 minutes, 45 seconds.

:cyclone: Expected Output Files

  1. {prefix}_comphet_cohort_recurrence.txt contains resulting per-gene unadjusted P-values. The Bonferroni correction factor (should be ~16354, the number of non-overlapping autosomal protein-coding genes) is printed to screen and can be used to conservatively adjust P-values.

  2. {prefix}_comphet_variant_distribution.txt contains the distribution of per-individual counts of inherited rare variants (by variant type and inheritance) to enable basic outlier-detection quality control.

  3. {prefix}_comphet_variant_product_distribution.txt contains the distribution of per-individual products of (# maternally-inherited variants) x (# paternally inherited variants) by variant type to enable further outlier-detection quality control.

The following two summary statistics output files are created by default and can be used as input for running in meta-analysis mode.

  1. {prefix}_comphet_variant_counts.txt contains per-cohort sums of (# maternally-inherited variants)x(# paternally inherited variants) by variant type.

  2. {prefix}_comphet_mutational_targets.txt contains per-gene, per-patient mutational targets for all compound heterozygous variants present in the input variant files.

Details about each of these output files are listed below in the Output File Details section.

:cyclone: Parameters

Print all parameters by running python ramediesCH.py --help.

Parameter Description
-h, --help Show help message and exit
--variant_annots <> Types of variants considered; C for coding, I for intronic. Default: CI
--i <> Input directory containing preprocessed variant files.
--o <> Prefix for the output files. Default: CH_result
--no_qual_track Do not use the Roulette-derived quality control column for filtering. This flag should be used only if the input variant files have already been QCed and contain highly-confident variants.
--coding_score Variant functionality score type to assign to coding SNP variants. Options: [CADD, REVEL, AlphaMissense, PAI3D]. Default: CADD
--coding_snv_thr <> :exclamation: Minimal variant functionality score allowed for coding SNP variants. Default: 0.5 for CADD (non-Phred-scaled). Suggested alternative values are 0.2 for REVEL, 0.1 for AlphaMissense and 0.3 for PrimateAI3D.
--coding_indel_thr <> :exclamation: Minimal variant functionality score allowed for coding indel variants. Default: 0.5 for CADD (non-Phred-scaled), which is the only currently supported deleteriousness score for coding indels.
--SAI_thr <> :exclamation: Minimal SpliceAI score allowed for intronic variants. Default: 0.05
--MAF <> :exclamation: Maximal MAF (minor allele frequency) allowed for all variants. The "MAF" parameter must be specified. Default: -1 (no filter)
--missense_run Flag to including only coding SNVs with a missense impact; this option is required when using REVEL, AlphaMissense, or PrimateAI3D which only score missense SNVs.
--suppress_indels <> Flag to exclude indel variants
--metadata_write_mode Flag to output only meta-analysis files (output by default) and not the final cohort-level recurrence results file. Will produce an error if paired with --metadata_run_mode.
--N_probands <> If --metadata_run_mode is enabled, this parameter indicates the number of total number of probands (across all input summary statistic files) and is required for downstream false diagnosis rate estimates. Default: -1 (no false diagnosis rates estimated)
--force_overwrite Overwrite existing output files if they already exist. This flag has no impact if --metadata_run_mode is enabled.
--metadata_run_mode Run a meta-analysis on the summary statistics files produced by default or when --metadata_write_mode was enabled. If this flag is set, --i is ignored, and parameter --M must be set.
--M <> Comma-separated list of full paths to summary statistics output files. Required if --metadata_run_mode is enabled.

:exclamation: We highly recommend imposing variant functionality and MAF constraints, because ramediesCH assumes that a single gene harbors at most one compound heterozygous variant pair in any given individual. The expected number of inherited, rare variants per gene (that could contribute to a compound heterozygous configuration) increases with more lenient variant functionality constraints. Note that these constraints are stricter in the compound heterozygous case than those applied in the de novo case.

:cyclone: Meta-Analysis Mode

:earth_americas: Producing Summary Statistics

By default, summary statistics files are produced when running ramediesCH on a collection of individual variant files. You can produce these same files with no other output by running --metadata_write_mode. Suppose you have two datasets that you would like to analyze together to uncover significant recessive disease genes: the first (dataset1) includes exomes with known poor indel calling quality, and the second (dataset2) includes high quality genomes.

python ramediesCH.py \
--metadata_write_mode \
--variant_annots=C \
--suppress_indels \
--i=/full/path/to/preprocessed/variant/files/from/dataset1/ \
--o=dataset1

python ramediesCH.py \
--metadata_write_mode \
--i=/full/path/to/preprocessed/variant/files/from/dataset2/ \
--o=dataset2

This produces shareable files dataset1_comphet_variant_counts.txt, dataset1_comphet_mutational_targets.txt, dataset2_comphet_variant_counts.txt and dataset2_comphet_mutational_targets.txt.

:warning: The same deleteriousness score (--coding_score) and variant thresholds (--coding_snv_thr, --coding_indel_thr, --SAI_thr, --MAF) must be applied when producing the summary statistics to be used in a meta-analysis. This information is available in the headers of these output files but must be manually checked for concordance.

:warning: A single dataset cannot contain both exome and genome data (as the intronic variant counts will pertain to only a subset of the data).

:earth_africa: Running a Meta-Analysis

You can jointly analyze summary statistics obtained from these two different datasets by running

python ramediesCH.py \
--metadata_run_mode \
--N_probands=2000 \
--M=/path/to/first/dataset1,/path/to/second/dataset2 \
--o=joint

This will produce an output file joint_comphet_cohort_recurrence.txt with a ranked list of genes containing compound heterozygous variants across both datasets together.

:warning: Parameter --M must contain paths to directories but end with the prefix used for the output files. The parameter above expects four files in the following locations:

  • /path/to/first/dataset1_comphet_variant_counts.txt
  • /path/to/first/dataset1_comphet_mutational_targets.txt
  • /path/to/second/dataset2_comphet_variant_counts.txt
  • /path/to/second/dataset2_comphet_mutational_targets.txt

:cyclone: Output File Details

  1. {prefix}_comphet_cohort_recurrence.txt contains a header with the variant types considered, the Bonferroni correction factor, and rows for every gene that contained 1+ compound heterozygous variants in the starting dataset(s) with the following nine tab-delimited columns:

    1. file_names: comma-separated list of input filenames where compound heterozygous variants in this gene have been observed
    2. ensembl_gene_id: Ensembl gene ID
    3. gene_name: HGNC gene name
    4. P_val: uncorrected P-value
    5. P_cond: uncorrected P-value conditional on the observation of 1+ compound heterozygous variant pairs in this gene across the cohort. These p-values are expected to be uniformly distributed under the null.
    6. P_comphet: probability of observing 1+ compound heterozygous variant pairs in this gene
    7. poisson_lambda: expected number of compound heterozygous variant pairs to land in this gene
    8. false_diagnosis_rate: rounded upper bound for the number of expected false diagnoses associated with this gene
    9. variant_info: comma-separated values for each compound heterozygous variant pair in this gene. Each value contains the following pipe-limited information for the paternally- then maternally-inherited variants (separated by an &):
      1. variant chromosome
      2. reference allele
      3. variant position
      4. alternate allele
      5. two-letter code specifying the variant type (CS = coding SNV, CI = coding indel, IS = intronic SNV, II = intronic indel)
      6. variant functionality score
      7. variant inheritance (first value in the &-delimited pair is P for paternal, and second value will be M for maternal)
      8. input variant file name
  2. {prefix}_comphet_variant_distribution.txt contains the distribution of number of samples with specific variant counts for each type of variant:

    1. inheritance: M for maternal and P for paternal
    2. variant_type: a two-letter code specifying the variant type: CS = coding SNV, CI = coding indel, IS = intronic SNV, II = intronic indel
    3. variant_count: the exact number of variants of this type and inheritance observed in a sample
    4. number_samples: the number of samples with this exact number of variants of this type and inheritance
  3. {prefix}_comphet_variant_product_distribution.txt contains the distribution of number of samples with specific variant count PRODUCTS for each type of compound heterozygous variant pair

    1. paternal_variant_type: two-letter code specifying the type of maternally-inherited variant (CS, CI, IS, or II as above)
    2. maternal_variant_type: two-letter code specifying the type of paternally-inherited variant (CS, CI, IS, or II as above)
    3. product_of_variant_counts: (count of maternally inherited variants of specified type) x (count of paternally inherited variants of specified type) observed in a sample
    4. number_samples: the number of samples with this exact product of (# maternal variants)x(# paternal variants) of the specified variant types
  4. {prefix}_comphet_variant_counts.txt contains sixteen rows with the following three tab-delimited columns each:

    1. paternal_variant_type: two-letter code specifying the type of maternally-inherited variant (CS, CI, IS, or II as above)
    2. maternal_variant_type: two-letter code specifying the type of paternally-inherited variant (CS, CI, IS, or II as above)
    3. total_product_of_variant_counts: total sum of PRODUCTS (count of maternally inherited variants of specified type) x (count of paternally inherited variants of specified type) per proband of the specified maternal and paternal variant types.
  5. {prefix}_comphet_mutational_targets.txt contains sixteen tab-delimited rows for every gene, each with three columns:

    1. variant_type: comma-separated two-letter codes specifying the type of variant pair (CS,CS, CS,CI, CS,IS, CS,II, CI,CS, CI,CI, CI,IS, CI,II, IS,CS, IS,CI, IS,IS, IS,II, II,CS, II,CI, II,IS, and II,II)
    2. ensembl_gene_id: Ensembl gene ID
    3. per_patient_mutational_targets: comma-separated list of compound heterozygous mutational targets computed for each variant pair of the specified type in the specified gene