Cohort level de novo recurrence - hms-dbmi/RaMeDiES GitHub Wiki

RamediesDN is an ultra-fast, resource-efficient algorithm for assessing whether specific genes harbor more or more deleterious de novo mutations across a cohort than expected by chance.
python ramediesDN.py --i=/full/path/to/github/repo/RaMeDiES/test/input --o=test
⚠️ By default, all properly-formatted and processed files within the directory specified by--i
will be considered. Other file types are ignored. You must remove processed input files from this directory to exclude them.❗ Expected output files for our provided input test files can be found in
test/output/
.Running on our provided test data, on a single 2.60GHz core with 0.5GB of RAM, should take about 3 minutes.
-
{prefix}_denovo_cohort_recurrence.txt
contains resulting per-gene P-values with indications for genes that passed the weighted FDR thresholds specified. The Bonferroni correction factor (should be ~16,354) is printed to screen and can be used to conservatively adjust P-values and weighted Q-values present in this file. -
{prefix}_denovo_variant_distribution.txt
contains the distribution of per-individual total de novo mutation counts by variant type (to enable basic quality control).
The following two summary statistics output files are created by default and can be used as input for running in meta-analysis mode.
-
{prefix}_denovo_variant_counts.txt
contains total numbers (across the cohort) of de novo mutations by variant type. -
{prefix}_denovo_mutational_targets.txt
contains per-gene mutational targets for all de novo mutations present across processed input variant files.
Details about each of these output files are listed below in the Output File Details section.
Print all parameters by running python ramediesDN.py --help
.
Parameter | Description |
---|---|
-h , --help
|
Show help message and exit |
--variant_annots <> |
Types of variants considered; C for coding, I for intronic. Default: CI
|
--i <> |
Input directory containing preprocessed variant files. |
--o <> |
Prefix for the output files. Default: DN_result
|
--no_qual_track |
Do not use the Roulette-derived quality control column for filtering. This flag should be used only if the input variant files have already been QCed and contain highly-confident variants. |
--coding_score |
Variant functionality score type to assign to coding SNP variants. Options: [CADD, REVEL, AlphaMissense, PAI3D]. Default: CADD |
--coding_snv_thr <> |
❗ Minimal variant functionality score allowed for coding SNP variants. Default: 0.5 for CADD (non-Phred-scaled). Suggested alternative values are 0.2 for REVEL, 0.1 for AlphaMissense and 0.3 for PrimateAI3D. |
--coding_indel_thr <> |
❗ Minimal variant functionality score allowed for coding indel variants. Default: 0.5 for CADD (non-Phred-scaled), which is the only currently supported deleteriousness score for coding indels. |
--SAI_thr <> |
❗ Minimal SpliceAI score allowed for intronic variants. Default: 0.05 |
--MAF <> |
❗ Maximal MAF (minor allele frequency) allowed for all variants. The "MAF" parameter must be specified. Default: -1 (no filter) |
--missense_run |
Flag to including only coding SNVs with a missense impact; this option is required when using REVEL, AlphaMissense, or PrimateAI3D which only score missense SNVs. |
--FDR <> |
Comma-separated FDR thresholds for output reporting. Default: 0.05,0.1 |
--gene_score <> |
Gene-based constraint score used to weight gene p-values for the weighted FDR approach. See the Gene constraint values section for options and details. Default: GeneBayes. |
--suppress_indels |
Flag to exclude indel variants |
--metadata_write_mode |
Flag to output only meta-analysis files (see items 3 and 4 in Expected Output Files section). Will produce an error if paired with --metadata_run_mode . |
--N_probands <> |
If --metadata_run_mode is enabled, this parameter indicates the number of probands across all input files and is required for false diagnosis rate estimates. Default: -1 (no false diagnosis rate estimated). |
--metadata_run_mode |
Run a meta-analysis on previously-computed {prefix}_denovo_variant_counts.txt and {prefix}_denovo_mutational_targets.txt files. If this flag is set, parameter --i is ignored, and parameter --M must be set (see below). |
--force_overwrite |
Overwrite existing output files with the same name. This flag has no impact if --metadata_run_mode is enabled. |
--M <> |
Comma-separated list of meta-analysis output files. Required if --metadata_run_mode is enabled |
❗ We highly recommend imposing variant functionality and MAF constraints (using our suggested values), because ramediesDN assumes that a single gene harbors at most one de novo variant (of interest).
⚠️ HOWEVER, we caution against imposing too strict of a variant functionality constraint. If there is signal in your cohort based on your variant functionality score (i.e., you have an enrichment of high-scoring, diagnostic variants!), a very high score constraint will force ramediesDN to be sampling from your signal distribution rather than from the expected uniform distribution under the null. This could result in potentially inflated significance values!
Disease genes with a de novo mode of inheritance are expected to be under strong selection against heterozygous loss-of-function variants. We incorporate this intuition by prioritizing genes by specific gene constraint values, using a weighted false discovery rate (FDR) procedure (Genovese et al., 2006). We provide the following options for the --gene_score=<>
parameter:
-
Genebayes
: mean of estimates of selection against heterozygous protein truncating variants (Zeng et al., 2024) - ❕
Genebayes_low95
: estimates of 95% confidence interval lower bound for Genebayes (Zeng et al., 2024) -
s_het
: mean shet estimates (Cassa et al., 2017) - ❕
s_het_low95
: estimates of 95% confidence interval lower bound for shet (Cassa et al., 2017) -
s_het_R
: Roulette-corrected per-gene mean shet estimates (Seplyarskiy et al., 2023) - ❕
s_het_R_low95
: Roulette-corrected per-gene estimates of 95% confidence interval lower bound for shet (Seplyarskiy et al., 2023) -
LOEUF
: loss-of-function observed/expected upper-bound fraction (LOEUF) estimates (Karczewski et al., 2020) -
PhyloP_prim
: mean per-gene Primate PhyloP score. -
PhyloP_mamm
: mean per-gene 447-way PhyloP score. -
MisFit_sgene_mis
: per-gene MisFit missense estimates (Zhao et al., 2023) -
MisFit_sgene_ptv
: per-gene MisFit protein-truncating variant estimates (Zhao et al., 2023) -
Mouse_knockout
: gene stratification based on phenotypes of mouse knockouts as reported in IMPC. Per-gene phenotype information was categorized into lethal, subviable, viable, and experiment not performed, and weights were calculated for each category based on the overrepresentation of OMIM dominant genes. -
OMIM_dom
only OMIM dominant genes are considered in the analyses (i.e., all other genes are assigned a weight of 0). This eliminates the possibility of finding novel disease genes, but because the effective number of tests is reduced by order of magnitude, this may significantly boost power in purely diagnostic studies where you hope to find known disease genes that recurrently harbor de novos in a cohort.
❕ In our experience, using the lower-bound estimate of these constraint values produces more robust sorting of longer genes, but results may be biased for shorter genes. On average, both the mean and low95 options of these gene constraint scores produce very similar enrichments of OMIM dominant genes in higher bins.
Sharing individual-level patient variant data is often in violation of patients' consent to research studies and institutional and ethical policies. Sharing summary-level statistics, however, can enable critical meta-analyses with the potential to uncover new disease genes and diagnoses without compromising patient privacy and institutional data security. Our framework exports summary-level statistics including per-cohort variant counts and binned mutational targets, neither of which can be used to recover individual patient data. We encourage publishing these summary statistics whenever possible.
By default, summary statistics files are produced when running ramediesDN on a collection of preprocessed input variant files. There is an option for a quicker run to produce these same files and no other output (--metadata_write_mode
). Suppose you have two datasets that you would like jointly analyze; the first (dataset1
) was built from exomes with poor indel calling, and the second (dataset2
) from genomes with high quality SNV/indel calling:
python ramediesDN.py \
--metadata_write_mode \
--variant_annots=C \
--suppress_indels \
--i=/full/path/to/preprocessed/variant/files/from/dataset1/ \
--o=dataset1
python ramediesDN.py \
--metadata_write_mode \
--i=/full/path/to/preprocessed/variant/files/from/dataset2/ \
--o=dataset2
which will output shareable files dataset1_denovo_variant_counts.txt
, dataset1_denovo_mutational_targets.txt
, dataset2_denovo_variant_counts.txt
and dataset2_denovo_mutational_targets.txt
.
⚠️ The same deleteriousness score (--coding_score
) and variant thresholds (--coding_snv_thr
,--coding_indel_thr
,--SAI_thr
,--MAF
) must be applied when producing the summary statistics to be used in a meta-analysis. These values are listed in the output files but not automatically checked for concordance.
⚠️ A single dataset cannot contain both exome and genome data (as the counts for variants in intronic space will only pertain to a subset of the data).
You can jointly analyze summary statistics obtained from these two different datasets with
python ramediesDN.py \
--metadata_run_mode \
--N_probands=3000 \
--M=/path/to/first/dataset1,/path/to/second/dataset2 \
--o=joint \
--FDR=0.01,0.05,0.1
This will produce an output file joint_denovo_cohort_recurrence.txt
with a ranked list of genes containing de novo variants across both datasets together.
⚠️ Parameter--M
must contain paths to directories but end with the prefix used for the output files. The parameter above expects four files in the following locations:
/path/to/first/dataset1_denovo_variant_counts.txt
/path/to/first/dataset1_denovo_mutational_targets.txt
/path/to/second/dataset2_denovo_variant_counts.txt
/path/to/second/dataset2_denovo_mutational_targets.txt
-
{prefix}_denovo_cohort_recurrence.txt
contains per-gene tab-delimited rows with 12 columns:-
file_names
: comma-separated list of input files where a variant in this gene was observed -
ensembl_gene_id
: Ensembl gene ID -
gene_name
: HGNC gene name -
Q_val
: weighted FDR Q-value (i.e., result of step 3 in graphic above) -
P_val
: unweighted P-value (i.e., result of step 2 in graphic above) -
P_cond
: unweighted P-value conditional on the observation of 1+ variants; these values are expected to be uniformly distributed under the null -
P_dnv
: probability of observing 1+ de novo mutations in this gene -
poisson_lambda
: expected number of de novo mutations to land in this gene -
false_diagnosis_rate
: rounded upper bound on the number of expected false diagnoses associated with this gene -
gene_constraint_weight
: normalized probability of a gene exclusively associated with dominant disorders in OMIM coming from the same gene constraint decile -
variant_info
: &-separated values for each observed de novo mutation in this gene. Each value contains the following pipe-delimited information:- variant chromosome
- variant reference allele
- variant position on chromosome
- variant alternate allele
- two-letter code specifying the variant type (
CS
= coding SNV,CI
= coding indel,IS
= intronic SNV,II
= intronic indel) - variant functionality score
- variant inheritance (i.e.,
DN
) - input variant file name
-
FDR_xx
: columns indicating whether the gene passed specific FDR thresholds as specified in the command line (default: 0.05, 0.1)
-
-
{prefix}_denovo_variant_distribution.txt
contains a header indicating variant functionality and MAF thresholds used and tab-delimited rows with four columns:-
inheritance
: in this case, alwaysDN
for de novo -
variant_type
: a two-letter code specifying the variant type:CS
= coding SNV,CI
= coding indel,IS
= intronic SNV,II
=intronic indel -
variant_count
: an observed exact number of de novo mutations of this type observed in a sample -
number_samples
: the number of samples containing this exact number of de novo mutations of this type
-
-
{prefix}_denovo_variant_counts.txt
contains four tab-delimited rows with three columns:-
inheritance
: alwaysDN
-
variant_type
: two-letter code specifying variant type (CS
,CI
,IS
, orII
as above) -
denovo_mutation_count
: total number of variants of this type and inheritance across the entire cohort
-
-
{prefix}_denovo_mutational_targets.txt
contains per-gene tab-delimited rows with three columns:-
variant_type
: two-letter code specifying variant type (CS
,CI
,IS
, orII
as above) -
ensembl_gene_id
: Ensembl gene ID -
per_patient_mutational_targets
: comma-separated list of mutational targets computed for each variant of the specified type in the specified gene
-