Cohort level compound heterozygosity - hms-dbmi/RaMeDiES GitHub Wiki

Compound heterozygous variant recurrence

github_comphet

:cyclone: About

RamediesCH is used to assess whether specific genes harbor more or more deleterious compound heterozygous variant configurations across a cohort than expected by chance.

:cyclone: Quick Run

python ramediesCH.py --i=/full/path/to/github/repo/RaMeDiES/test/input --o=test

:warning: Processed input variant files must include the inheritance column with mom or dad specified per variant.

:warning: By default, all properly-formatted and processed files within the directory specified by --i will be considered. You MUST REMOVE processed input files from this directory that correspond to individuals from families with genetic evidence of consanguinity.

:exclamation: Expected output files for our provided input test files can be found in test/output/.

Running on our provided test data, on a single 2.60GHz core with 0.5GB of RAM, should take about 2 minutes, 45 seconds.

:cyclone: Expected Output Files

{prefix}_comphet_cohort_recurrence.txt contains resulting per-gene unadjusted P-values. The Bonferroni correction factor (should be ~16354, the number of non-overlapping autosomal protein-coding genes) is printed to screen and can be used to conservatively adjust P-values.
{prefix}_comphet_variant_distribution.txt contains the distribution of per-individual counts of inherited rare variants (by variant type and inheritance) to enable basic outlier-detection quality control.
{prefix}_comphet_variant_product_distribution.txt contains the distribution of per-individual products of (# maternally-inherited variants) x (# paternally inherited variants) by variant type to enable further outlier-detection quality control.

The following two summary statistics output files are created by default and can be used as input for running in meta-analysis mode.

{prefix}_comphet_variant_counts.txt contains per-cohort sums of (# maternally-inherited variants)x(# paternally inherited variants) by variant type.
{prefix}_comphet_mutational_targets.txt contains per-gene, per-patient mutational targets for all compound heterozygous variants present in the input variant files.

Details about each of these output files are listed below in the Output File Details section.

:cyclone: Parameters

Print all parameters by running python ramediesCH.py --help.

Parameter	Description
`-h`, `--help`	Show help message and exit
`--variant_annots` <>	Types of variants considered; `C` for coding, `I` for intronic. Default: `CI`
`--i` <>	Input directory containing preprocessed variant files.
`--o` <>	Prefix for the output files. Default: `CH_result`
`--no_qual_track`	Do not use the Roulette-derived quality control column for filtering. This flag should be used only if the input variant files have already been QCed and contain highly-confident variants.
`--coding_score`	Variant functionality score type to assign to coding SNP variants. Options: [CADD, REVEL, AlphaMissense, PAI3D]. Default: CADD
`--coding_snv_thr <>`	:exclamation: Minimal variant functionality score allowed for coding SNP variants. Default: 0.5 for CADD (non-Phred-scaled). Suggested alternative values are 0.2 for REVEL, 0.1 for AlphaMissense and 0.3 for PrimateAI3D.
`--coding_indel_thr <>`	:exclamation: Minimal variant functionality score allowed for coding indel variants. Default: 0.5 for CADD (non-Phred-scaled), which is the only currently supported deleteriousness score for coding indels.
`--SAI_thr <>`	:exclamation: Minimal SpliceAI score allowed for intronic variants. Default: 0.05
`--MAF <>`	:exclamation: Maximal MAF (minor allele frequency) allowed for all variants. The "MAF" parameter must be specified. Default: -1 (no filter)
`--missense_run`	Flag to including only coding SNVs with a missense impact; this option is required when using REVEL, AlphaMissense, or PrimateAI3D which only score missense SNVs.
`--suppress_indels <>`	Flag to exclude indel variants
`--metadata_write_mode`	Flag to output only meta-analysis files (output by default) and not the final cohort-level recurrence results file. Will produce an error if paired with `--metadata_run_mode`.
`--N_probands <>`	If `--metadata_run_mode` is enabled, this parameter indicates the number of total number of probands (across all input summary statistic files) and is required for downstream false diagnosis rate estimates. Default: -1 (no false diagnosis rates estimated)
`--force_overwrite`	Overwrite existing output files if they already exist. This flag has no impact if `--metadata_run_mode` is enabled.
`--metadata_run_mode`	Run a meta-analysis on the summary statistics files produced by default or when `--metadata_write_mode` was enabled. If this flag is set, `--i` is ignored, and parameter `--M` must be set.
`--M <>`	Comma-separated list of full paths to summary statistics output files. Required if `--metadata_run_mode` is enabled.

:exclamation: We highly recommend imposing variant functionality and MAF constraints, because ramediesCH assumes that a single gene harbors at most one compound heterozygous variant pair in any given individual. The expected number of inherited, rare variants per gene (that could contribute to a compound heterozygous configuration) increases with more lenient variant functionality constraints. Note that these constraints are stricter in the compound heterozygous case than those applied in the de novo case.

:cyclone: Meta-Analysis Mode

:earth_americas: Producing Summary Statistics

By default, summary statistics files are produced when running ramediesCH on a collection of individual variant files. You can produce these same files with no other output by running --metadata_write_mode. Suppose you have two datasets that you would like to analyze together to uncover significant recessive disease genes: the first (dataset1) includes exomes with known poor indel calling quality, and the second (dataset2) includes high quality genomes.

python ramediesCH.py \
--metadata_write_mode \
--variant_annots=C \
--suppress_indels \
--i=/full/path/to/preprocessed/variant/files/from/dataset1/ \
--o=dataset1

python ramediesCH.py \
--metadata_write_mode \
--i=/full/path/to/preprocessed/variant/files/from/dataset2/ \
--o=dataset2

This produces shareable files dataset1_comphet_variant_counts.txt, dataset1_comphet_mutational_targets.txt, dataset2_comphet_variant_counts.txt and dataset2_comphet_mutational_targets.txt.

:warning: The same deleteriousness score (--coding_score) and variant thresholds (--coding_snv_thr, --coding_indel_thr, --SAI_thr, --MAF) must be applied when producing the summary statistics to be used in a meta-analysis. This information is available in the headers of these output files but must be manually checked for concordance.

:warning: A single dataset cannot contain both exome and genome data (as the intronic variant counts will pertain to only a subset of the data).

:earth_africa: Running a Meta-Analysis

You can jointly analyze summary statistics obtained from these two different datasets by running

python ramediesCH.py \
--metadata_run_mode \
--N_probands=2000 \
--M=/path/to/first/dataset1,/path/to/second/dataset2 \
--o=joint

This will produce an output file joint_comphet_cohort_recurrence.txt with a ranked list of genes containing compound heterozygous variants across both datasets together.

:warning: Parameter --M must contain paths to directories but end with the prefix used for the output files. The parameter above expects four files in the following locations:

/path/to/first/dataset1_comphet_variant_counts.txt

/path/to/first/dataset1_comphet_mutational_targets.txt

/path/to/second/dataset2_comphet_variant_counts.txt

/path/to/second/dataset2_comphet_mutational_targets.txt

:cyclone: Output File Details

{prefix}_comphet_cohort_recurrence.txt contains a header with the variant types considered, the Bonferroni correction factor, and rows for every gene that contained 1+ compound heterozygous variants in the starting dataset(s) with the following nine tab-delimited columns:
1. file_names: comma-separated list of input filenames where compound heterozygous variants in this gene have been observed
2. ensembl_gene_id: Ensembl gene ID
3. gene_name: HGNC gene name
4. P_val: uncorrected P-value
5. P_cond: uncorrected P-value conditional on the observation of 1+ compound heterozygous variant pairs in this gene across the cohort. These p-values are expected to be uniformly distributed under the null.
6. P_comphet: probability of observing 1+ compound heterozygous variant pairs in this gene
7. poisson_lambda: expected number of compound heterozygous variant pairs to land in this gene
8. false_diagnosis_rate: rounded upper bound for the number of expected false diagnoses associated with this gene
9. variant_info: comma-separated values for each compound heterozygous variant pair in this gene. Each value contains the following pipe-limited information for the paternally- then maternally-inherited variants (separated by an &):
  1. variant chromosome
  2. reference allele
  3. variant position
  4. alternate allele
  5. two-letter code specifying the variant type (CS = coding SNV, CI = coding indel, IS = intronic SNV, II = intronic indel)
  6. variant functionality score
  7. variant inheritance (first value in the &-delimited pair is P for paternal, and second value will be M for maternal)
  8. input variant file name
{prefix}_comphet_variant_distribution.txt contains the distribution of number of samples with specific variant counts for each type of variant:
1. inheritance: M for maternal and P for paternal
2. variant_type: a two-letter code specifying the variant type: CS = coding SNV, CI = coding indel, IS = intronic SNV, II = intronic indel
3. variant_count: the exact number of variants of this type and inheritance observed in a sample
4. number_samples: the number of samples with this exact number of variants of this type and inheritance
{prefix}_comphet_variant_product_distribution.txt contains the distribution of number of samples with specific variant count PRODUCTS for each type of compound heterozygous variant pair
1. paternal_variant_type: two-letter code specifying the type of maternally-inherited variant (CS, CI, IS, or II as above)
2. maternal_variant_type: two-letter code specifying the type of paternally-inherited variant (CS, CI, IS, or II as above)
3. product_of_variant_counts: (count of maternally inherited variants of specified type) x (count of paternally inherited variants of specified type) observed in a sample
4. number_samples: the number of samples with this exact product of (# maternal variants)x(# paternal variants) of the specified variant types
{prefix}_comphet_variant_counts.txt contains sixteen rows with the following three tab-delimited columns each:
1. paternal_variant_type: two-letter code specifying the type of maternally-inherited variant (CS, CI, IS, or II as above)
2. maternal_variant_type: two-letter code specifying the type of paternally-inherited variant (CS, CI, IS, or II as above)
3. total_product_of_variant_counts: total sum of PRODUCTS (count of maternally inherited variants of specified type) x (count of paternally inherited variants of specified type) per proband of the specified maternal and paternal variant types.
{prefix}_comphet_mutational_targets.txt contains sixteen tab-delimited rows for every gene, each with three columns:
1. variant_type: comma-separated two-letter codes specifying the type of variant pair (CS,CS, CS,CI, CS,IS, CS,II, CI,CS, CI,CI, CI,IS, CI,II, IS,CS, IS,CI, IS,IS, IS,II, II,CS, II,CI, II,IS, and II,II)
2. ensembl_gene_id: Ensembl gene ID
3. per_patient_mutational_targets: comma-separated list of compound heterozygous mutational targets computed for each variant pair of the specified type in the specified gene