Cohort level compound heterozygosity - hms-dbmi/RaMeDiES GitHub Wiki
Compound heterozygous variant recurrence
:cyclone: About
RamediesCH is used to assess whether specific genes harbor more, or more deleterious, compound heterozygous variant configurations across a cohort than expected by chance.
:cyclone: Quick Run
python ramediesCH.py --i=/full/path/to/github/repo/RaMeDiES/test/input --o=test
:warning: Processed input variant files must include the inheritance column with `mom` or `dad` specified per variant.
:warning: By default, all properly formatted and processed files within the directory specified by `--i` will be considered. You MUST REMOVE processed input files from this directory that correspond to individuals from families with genetic evidence of consanguinity.
:exclamation: Expected output files for our provided input test files can be found in `test/output/`.
Running on our provided test data, on a single 2.60GHz core with 0.5GB of RAM, should take about 2 minutes, 45 seconds.
:cyclone: Expected Output Files
- `{prefix}_comphet_cohort_recurrence.txt` contains resulting per-gene unadjusted P-values. The Bonferroni correction factor (should be ~16354, the number of non-overlapping autosomal protein-coding genes) is printed to screen and can be used to conservatively adjust P-values.
- `{prefix}_comphet_variant_distribution.txt` contains the distribution of per-individual counts of inherited rare variants (by variant type and inheritance) to enable basic outlier-detection quality control.
- `{prefix}_comphet_variant_product_distribution.txt` contains the distribution of per-individual products of (# maternally-inherited variants) x (# paternally-inherited variants) by variant type to enable further outlier-detection quality control.
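The conservative adjustment mentioned for the recurrence file amounts to multiplying each unadjusted P-value by the correction factor and capping the result at 1. The following is a minimal sketch, assuming the P-values have already been read from the recurrence file; the function name and the example P-values are illustrative and not part of RaMeDiES.

```python
def bonferroni_adjust(p_values, correction_factor):
    """Multiply each unadjusted P-value by the correction factor, capped at 1.0."""
    if correction_factor <= 0:
        raise ValueError("correction_factor must be positive")
    return [min(1.0, p * correction_factor) for p in p_values]


# Hypothetical P-values. 16354 is the approximate factor quoted above;
# for a real run, use the value that ramediesCH prints to screen.
unadjusted = [1e-8, 2e-5, 0.03]
adjusted = bonferroni_adjust(unadjusted, 16354)
significant = [p < 0.05 for p in adjusted]
```

The 0.05 cutoff here is only an example significance level.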
The following two summary statistics output files are created by default and can be used as input for running in meta-analysis mode.
- `{prefix}_comphet_variant_counts.txt` contains per-cohort sums of (# maternally-inherited variants) x (# paternally-inherited variants) by variant type.
- `{prefix}_comphet_mutational_targets.txt` contains per-gene, per-patient mutational targets for all compound heterozygous variants present in the input variant files.
Details about each of these output files are listed below in the Output File Details section.
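To make the per-individual products and the per-cohort sums concrete, here is a small sketch. It assumes the per-proband counts of inherited rare variants have already been tallied by variant type and inheritance; the data structures, function names, and counts are hypothetical and do not reflect how ramediesCH stores them internally.

```python
from collections import Counter
from itertools import product

# Two-letter codes: coding SNV, coding indel, intronic SNV, intronic indel.
VARIANT_TYPES = ("CS", "CI", "IS", "II")


def per_proband_products(paternal_counts, maternal_counts):
    """Return {(paternal_type, maternal_type): n_paternal * n_maternal} for all 16 type pairs."""
    return {
        (p_type, m_type): paternal_counts.get(p_type, 0) * maternal_counts.get(m_type, 0)
        for p_type, m_type in product(VARIANT_TYPES, repeat=2)
    }


def cohort_totals(probands):
    """Sum the per-proband products over an iterable of (paternal_counts, maternal_counts)."""
    totals = Counter()
    for paternal_counts, maternal_counts in probands:
        totals.update(per_proband_products(paternal_counts, maternal_counts))
    return dict(totals)


# Two hypothetical probands with made-up variant counts.
probands = [
    ({"CS": 2, "CI": 1}, {"CS": 3, "IS": 1}),
    ({"CS": 4, "II": 2}, {"CS": 1}),
]
totals = cohort_totals(probands)
```

Each proband contributes one product per (paternal type, maternal type) pair, so the cohort totals always have sixteen entries, even when some are zero.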
:cyclone: Parameters
Print all parameters by running `python ramediesCH.py --help`.
Parameter | Description |
---|---|
`-h`, `--help` | Show help message and exit |
`--variant_annots <>` | Types of variants considered; `C` for coding, `I` for intronic. Default: `CI` |
`--i <>` | Input directory containing preprocessed variant files. |
`--o <>` | Prefix for the output files. Default: `CH_result` |
`--no_qual_track` | Do not use the Roulette-derived quality control column for filtering. This flag should be used only if the input variant files have already been QCed and contain high-confidence variants. |
`--coding_score` | Variant functionality score type to assign to coding SNVs. Options: [CADD, REVEL, AlphaMissense, PAI3D]. Default: CADD |
`--coding_snv_thr <>` | :exclamation: Minimal variant functionality score allowed for coding SNVs. Default: 0.5 for CADD (non-Phred-scaled). Suggested alternative values are 0.2 for REVEL, 0.1 for AlphaMissense and 0.3 for PrimateAI3D. |
`--coding_indel_thr <>` | :exclamation: Minimal variant functionality score allowed for coding indel variants. Default: 0.5 for CADD (non-Phred-scaled), which is the only currently supported deleteriousness score for coding indels. |
`--SAI_thr <>` | :exclamation: Minimal SpliceAI score allowed for intronic variants. Default: 0.05 |
`--MAF <>` | :exclamation: Maximal MAF (minor allele frequency) allowed for all variants. The "MAF" parameter must be specified. Default: -1 (no filter) |
`--missense_run` | Flag to include only coding SNVs with a missense impact; this option is required when using REVEL, AlphaMissense, or PrimateAI3D, which only score missense SNVs. |
`--suppress_indels <>` | Flag to exclude indel variants |
`--metadata_write_mode` | Flag to output only meta-analysis files (output by default) and not the final cohort-level recurrence results file. Will produce an error if paired with `--metadata_run_mode`. |
`--N_probands <>` | If `--metadata_run_mode` is enabled, this parameter indicates the total number of probands (across all input summary statistic files) and is required for downstream false diagnosis rate estimates. Default: -1 (no false diagnosis rates estimated) |
`--force_overwrite` | Overwrite existing output files if they already exist. This flag has no impact if `--metadata_run_mode` is enabled. |
`--metadata_run_mode` | Run a meta-analysis on the summary statistics files produced by default or when `--metadata_write_mode` was enabled. If this flag is set, `--i` is ignored, and parameter `--M` must be set. |
`--M <>` | Comma-separated list of full paths to summary statistics output files. Required if `--metadata_run_mode` is enabled. |
:exclamation: We highly recommend imposing variant functionality and MAF constraints, because ramediesCH assumes that a single gene harbors at most one compound heterozygous variant pair in any given individual. The expected number of inherited, rare variants per gene (that could contribute to a compound heterozygous configuration) increases with more lenient variant functionality constraints. Note that these constraints are stricter in the compound heterozygous case than those applied in the de novo case.
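To illustrate why leniency matters, suppose the number of qualifying compound heterozygous pairs that land in one gene for one individual is Poisson-distributed with mean lambda. This is an assumption made here for illustration only, suggested by the `poisson_lambda` output column, and is not a description of the tool's internals. Under that assumption, the chance of two or more pairs (which would violate the at-most-one assumption) grows roughly as lambda squared over two when lambda is small, so a tenfold increase in the expected number of pairs makes a violation about a hundred times more likely.

```python
import math


def prob_at_least_one(lam):
    """P(X >= 1) for X ~ Poisson(lam), computed stably for small lam."""
    return -math.expm1(-lam)


def prob_at_least_two(lam):
    """P(X >= 2) for X ~ Poisson(lam): 1 - P(X = 0) - P(X = 1)."""
    return -math.expm1(-lam) - lam * math.exp(-lam)


# Hypothetical expected numbers of pairs per gene per individual.
for lam in (0.001, 0.01, 0.1):
    print(f"lambda={lam}: P(>=1)={prob_at_least_one(lam):.3g}, "
          f"P(>=2)={prob_at_least_two(lam):.3g}")
```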
:cyclone: Meta-Analysis Mode
:earth_americas: Producing Summary Statistics
By default, summary statistics files are produced when running ramediesCH on a collection of individual variant files. You can produce these same files with no other output by running with `--metadata_write_mode`. Suppose you have two datasets that you would like to analyze together to uncover significant recessive disease genes: the first (`dataset1`) includes exomes with known poor indel calling quality, and the second (`dataset2`) includes high quality genomes.
python ramediesCH.py \
--metadata_write_mode \
--variant_annots=C \
--suppress_indels \
--i=/full/path/to/preprocessed/variant/files/from/dataset1/ \
--o=dataset1
python ramediesCH.py \
--metadata_write_mode \
--i=/full/path/to/preprocessed/variant/files/from/dataset2/ \
--o=dataset2
This produces shareable files `dataset1_comphet_variant_counts.txt`, `dataset1_comphet_mutational_targets.txt`, `dataset2_comphet_variant_counts.txt` and `dataset2_comphet_mutational_targets.txt`.
:warning: The same deleteriousness score (`--coding_score`) and variant thresholds (`--coding_snv_thr`, `--coding_indel_thr`, `--SAI_thr`, `--MAF`) must be applied when producing the summary statistics to be used in a meta-analysis. This information is available in the headers of these output files but must be manually checked for concordance.
:warning: A single dataset cannot contain both exome and genome data (as the intronic variant counts will pertain to only a subset of the data).
:earth_africa: Running a Meta-Analysis
You can jointly analyze summary statistics obtained from these two different datasets by running
python ramediesCH.py \
--metadata_run_mode \
--N_probands=2000 \
--M=/path/to/first/dataset1,/path/to/second/dataset2 \
--o=joint
This will produce an output file `joint_comphet_cohort_recurrence.txt` with a ranked list of genes containing compound heterozygous variants across both datasets together.
:warning: Each entry in `--M` must be the path to a directory followed by the prefix used for the output files, not the path to an individual file. The parameter above expects four files in the following locations:
/path/to/first/dataset1_comphet_variant_counts.txt
/path/to/first/dataset1_comphet_mutational_targets.txt
/path/to/second/dataset2_comphet_variant_counts.txt
/path/to/second/dataset2_comphet_mutational_targets.txt
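Before launching a meta-analysis, it can help to expand a `--M` value into the files it implies and check that they exist. The sketch below follows the naming rule described above (each entry plus the two summary-statistics suffixes); the helper names are hypothetical and not part of RaMeDiES.

```python
from pathlib import Path

# Suffixes of the two summary statistics files written for each prefix.
SUFFIXES = ("_comphet_variant_counts.txt", "_comphet_mutational_targets.txt")


def expected_summary_files(m_argument):
    """Expand a comma-separated --M value into the summary files it implies."""
    files = []
    for entry in m_argument.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas
        files.extend(entry + suffix for suffix in SUFFIXES)
    return files


def missing_summary_files(m_argument):
    """Return the implied files that are not present on disk."""
    return [path for path in expected_summary_files(m_argument)
            if not Path(path).is_file()]


paths = expected_summary_files("/path/to/first/dataset1,/path/to/second/dataset2")
```

Calling `missing_summary_files` with the same string returns whichever of those four paths are absent, which makes a misplaced prefix easy to spot.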
:cyclone: Output File Details
- `{prefix}_comphet_cohort_recurrence.txt` contains a header with the variant types considered, the Bonferroni correction factor, and rows for every gene that contained 1+ compound heterozygous variants in the starting dataset(s) with the following nine tab-delimited columns:
  - `file_names`: comma-separated list of input filenames where compound heterozygous variants in this gene have been observed
  - `ensembl_gene_id`: Ensembl gene ID
  - `gene_name`: HGNC gene name
  - `P_val`: uncorrected P-value
  - `P_cond`: uncorrected P-value conditional on the observation of 1+ compound heterozygous variant pairs in this gene across the cohort. These P-values are expected to be uniformly distributed under the null.
  - `P_comphet`: probability of observing 1+ compound heterozygous variant pairs in this gene
  - `poisson_lambda`: expected number of compound heterozygous variant pairs to land in this gene
  - `false_diagnosis_rate`: rounded upper bound for the number of expected false diagnoses associated with this gene
  - `variant_info`: comma-separated values for each compound heterozygous variant pair in this gene. Each value contains the following pipe-delimited information for the paternally- then maternally-inherited variants (separated by an `&`):
    - variant chromosome
    - reference allele
    - variant position
    - alternate allele
    - two-letter code specifying the variant type (`CS` = coding SNV, `CI` = coding indel, `IS` = intronic SNV, `II` = intronic indel)
    - variant functionality score
    - variant inheritance (first value in the `&`-delimited pair is `P` for paternal, and second value will be `M` for maternal)
    - input variant file name
- `{prefix}_comphet_variant_distribution.txt` contains the distribution of number of samples with specific variant counts for each type of variant:
  - `inheritance`: `M` for maternal and `P` for paternal
  - `variant_type`: a two-letter code specifying the variant type: `CS` = coding SNV, `CI` = coding indel, `IS` = intronic SNV, `II` = intronic indel
  - `variant_count`: the exact number of variants of this type and inheritance observed in a sample
  - `number_samples`: the number of samples with this exact number of variants of this type and inheritance
- `{prefix}_comphet_variant_product_distribution.txt` contains the distribution of number of samples with specific variant count PRODUCTS for each type of compound heterozygous variant pair:
  - `paternal_variant_type`: two-letter code specifying the type of paternally-inherited variant (`CS`, `CI`, `IS`, or `II` as above)
  - `maternal_variant_type`: two-letter code specifying the type of maternally-inherited variant (`CS`, `CI`, `IS`, or `II` as above)
  - `product_of_variant_counts`: (count of maternally-inherited variants of specified type) x (count of paternally-inherited variants of specified type) observed in a sample
  - `number_samples`: the number of samples with this exact product of (# maternal variants) x (# paternal variants) of the specified variant types
- `{prefix}_comphet_variant_counts.txt` contains sixteen rows with the following three tab-delimited columns each:
  - `paternal_variant_type`: two-letter code specifying the type of paternally-inherited variant (`CS`, `CI`, `IS`, or `II` as above)
  - `maternal_variant_type`: two-letter code specifying the type of maternally-inherited variant (`CS`, `CI`, `IS`, or `II` as above)
  - `total_product_of_variant_counts`: sum, over all probands, of the per-proband PRODUCTS (count of maternally-inherited variants of specified type) x (count of paternally-inherited variants of specified type) for the specified maternal and paternal variant types
- `{prefix}_comphet_mutational_targets.txt` contains sixteen tab-delimited rows for every gene, each with three columns:
  - `variant_type`: comma-separated two-letter codes specifying the type of variant pair (`CS,CS`, `CS,CI`, `CS,IS`, `CS,II`, `CI,CS`, `CI,CI`, `CI,IS`, `CI,II`, `IS,CS`, `IS,CI`, `IS,IS`, `IS,II`, `II,CS`, `II,CI`, `II,IS`, and `II,II`)
  - `ensembl_gene_id`: Ensembl gene ID
  - `per_patient_mutational_targets`: comma-separated list of compound heterozygous mutational targets computed for each variant pair of the specified type in the specified gene
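As one possible way to use the rows of `{prefix}_comphet_variant_distribution.txt` for outlier-detection quality control, the sketch below flags variant counts that lie far above the bulk of samples within each (inheritance, variant type) group. It assumes the rows have already been parsed into tuples of the four columns listed above. The rule used here (more than three times the 95th percentile) is an arbitrary illustration, not a threshold recommended by RaMeDiES, and the function names and example rows are hypothetical.

```python
from collections import defaultdict


def weighted_quantile(count_to_samples, q):
    """Smallest variant count whose cumulative share of samples reaches q."""
    total = sum(count_to_samples.values())
    running = 0
    for count in sorted(count_to_samples):
        running += count_to_samples[count]
        if running >= q * total:
            return count
    raise ValueError("empty distribution")


def flag_outlier_counts(rows, q=0.95, fold=3.0):
    """Flag variant counts above fold * q-quantile within each group.

    rows: iterable of (inheritance, variant_type, variant_count, number_samples).
    Returns {(inheritance, variant_type): sorted list of flagged variant counts}.
    """
    groups = defaultdict(dict)
    for inheritance, variant_type, variant_count, number_samples in rows:
        groups[(inheritance, variant_type)][variant_count] = number_samples
    flagged = {}
    for key, dist in groups.items():
        cutoff = fold * weighted_quantile(dist, q)
        flagged[key] = sorted(c for c in dist if c > cutoff)
    return flagged


# Made-up rows: one sample carries 90 maternally-inherited coding SNVs.
rows = [
    ("M", "CS", 10, 60),
    ("M", "CS", 12, 37),
    ("M", "CS", 15, 2),
    ("M", "CS", 90, 1),
    ("P", "CS", 11, 100),
]
flagged = flag_outlier_counts(rows)
```

The file reports only how many samples have each count, so a flagged count still has to be traced back to the individual input variant files.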