Pathway analysis - hms-dbmi/RaMeDiES GitHub Wiki

:construction: This page is still under construction! :construction:

Why analyze pathways?

Genes involved in the :small_orange_diamond:same molecular pathway:small_orange_diamond: ("functional unit") are frequently involved in :small_blue_diamond:similar patient phenotypic presentations:small_blue_diamond:. This means that we might draw statistical power from multiple independent, deleterious variants in the same functional units, rather than just in the same genes !

We must still be cautious...

However, such an approach should be pursued with caution, as patient phenotypes stemming from changes to different genes in the same functional unit may vary to a great extent. Such differences in patient presentations may render the clinical evaluations and therapeutic potential of statistically significant findings virtually impossible.

Our approach

To mitigate this issue, we first initially consider groups of patients with similar phenotypes, and then within each of these groups, assess the overrepresentation of deleterious mutations across established biological pathways.

github_pathways

:cyclone: Step 1: Compute semantic similarity of patients' phenotypes using Phrank

We compute all-against-all semantic similarity of patients' HPO phenotype terms using Phrank, for which open-source code is available.

Download the relevant Phrank Python scripts from Bitbucket into this working directory as follows:

# Update this to reflect the full path to this repo on your machine:  
RAMEDIES_REPO=/full/path/to/github/repo/RaMeDiES

mkdir -p $RAMEDIES_REPO/phrank
wget -O $RAMEDIES_REPO/phrank/__init__.py https://bitbucket.org/bejerano/phrank/raw/7e0adbbe07cb12589ddba730565d283f6f06507e/phrank/__init__.py
wget -O $RAMEDIES_REPO/phrank/utils.py https://bitbucket.org/bejerano/phrank/raw/7e0adbbe07cb12589ddba730565d283f6f06507e/phrank/utils.py

Create a tab-delimited file (called test_patient_hpoterms.tsv in our implementation) with the following columns:
- Required: patient_id corresponding to a unique patient, e.g., UDN123456
- Required: hpo_id with no spaces and prefixed with "HP:", e.g., HP:0004209
- (optional) patient_status with a string indicating affected, unaffected, or unknown; only affected patients are considered
- (optional) hpo_name full name of HPO term, e.g., Clinodactyly of the 5th finger
- (optional) hpo_status indicating if the HPO term is present (True) or absent (False); only True terms are considered
- (optional) quality_flag indicating PASS if the HPO terms should be considered; some artifactual HPO terms or those obtained from questionable procedures can be excluded; only PASS terms are considered
Compute all-against-all pairwise similarities between patients' sets of HPO terms and write out to file test_pairwise_similarity.tsv:

python /full/path/to/github/repo/RaMeDiES/pathway_analysis.py \
    --pairwise_similarity \
    --hpo_file=test_patient_hpoterms.tsv \
    --similarity_file=test_pairwise_similarity.tsv

:cyclone: Step 2: Perform complete linkage hierarchical clustering to extract patient subgroups

We load pairwise similarity scores into a dissimilarity matrix in R to perform hierarchical clustering of patients and export the per-patient cluster assignments. You must have a local installation of R and you must have the cluster library installed (e.g., install.packages("cluster")).

We run the relevant R code from within the same Python script; the Python rpy library must be installed as well (e.g., pip3 install rpy2).

python /full/path/to/github/repo/RaMeDiES/pathway_analysis.py \
    --cluster_patients \
    --similarity_file=test_pairwise_similarity.tsv \
    --cluster_assignments_file=test_cluster_assignments.tsv

:cyclone: Step 3: Assign per-patient candidate genes to each cluster

We identify candidate genes per patient and assign these genes to each cluster. In our implementation, we consider candidate genes to be:

all known diagnostic gene(s) as annotated in the Undiagnosed Diseases Network (UDN) researcher portal
genes harboring a compelling de novo variant (see details in our bioRxiv preprint) if no certain and complete diagnostic gene(s) were listed
genes harboring a compound heterozygous variant ranked in the top 100 using RaMeDiES individual-level test if no certain and diagnostic gene(s) were listed and fewer than two de novo candidates were identified.

You must create a tab-delimited list of candidate genes (called test_candidate_genes.tsv in our implementation) for all samples in your cohort with the following columns:

sample_id corresponding to a unique patient and matching sample_id from the previous step, e.g., UDN123456
gene_name corresponding to an HGNC gene symbol
(Optional) gene_type in our case, corresponding to known_diagnosis, denovo, or comphet
(Optional) diagnosis in our case, corresponding to diagnosis_match_{tentative/likely/certain}_{partial/complete}
(Optional) pheno_match in our case, corresponding to the Phrank similarity score between patient and gene

Run the following to generate a tab-delimited list of cluster assignments and patient details and query genes per cluster:

python /full/path/to/github/repo/RaMeDiES/pathway_analysis.py \
    --query_genes \
    --cluster_assignments_file=test_cluster_assignments.tsv \
    --candidate_genes_file=test_candidate_genes.tsv \
    --query_genes_file=test_cluster_details.tsv \

:cyclone: Step 4: Run gene set enrichment analysis using g:Profiler API

Finally, we run gene set enrichment analysis on the sets of query genes created in the previous step, and we output tab-delimited results.

python /full/path/to/github/repo/RaMeDiES/pathway_analysis.py \
    --run_gsea \
    --query_genes_file=test_cluster_details.tsv \
    --enrichment_output_file=test_cluster_enrichments.tsv