Pathway analysis - hms-dbmi/RaMeDiES GitHub Wiki
:construction: This page is still under construction! :construction:
Why analyze pathways?
Genes involved in the :small_orange_diamond:same molecular pathway:small_orange_diamond: ("functional unit") are frequently involved in :small_blue_diamond:similar patient phenotypic presentations:small_blue_diamond:. This means that we might draw statistical power from multiple independent, deleterious variants in the same functional units, rather than just in the same genes !
We must still be cautious...
However, such an approach should be pursued with caution, as patient phenotypes stemming from changes to different genes in the same functional unit may vary to a great extent. Such differences in patient presentations may render the clinical evaluations and therapeutic potential of statistically significant findings virtually impossible.
Our approach
To mitigate this issue, we first initially consider groups of patients with similar phenotypes, and then within each of these groups, assess the overrepresentation of deleterious mutations across established biological pathways.
:cyclone: Step 1: Compute semantic similarity of patients' phenotypes using Phrank
We compute all-against-all semantic similarity of patients' HPO phenotype terms using Phrank, for which open-source code is available.
-
Download the relevant Phrank Python scripts from Bitbucket into this working directory as follows:
# Update this to reflect the full path to this repo on your machine: RAMEDIES_REPO=/full/path/to/github/repo/RaMeDiES mkdir -p $RAMEDIES_REPO/phrank wget -O $RAMEDIES_REPO/phrank/__init__.py https://bitbucket.org/bejerano/phrank/raw/7e0adbbe07cb12589ddba730565d283f6f06507e/phrank/__init__.py wget -O $RAMEDIES_REPO/phrank/utils.py https://bitbucket.org/bejerano/phrank/raw/7e0adbbe07cb12589ddba730565d283f6f06507e/phrank/utils.py
-
Create a tab-delimited file (called
test_patient_hpoterms.tsv
in our implementation) with the following columns:- Required:
patient_id
corresponding to a unique patient, e.g., UDN123456 - Required:
hpo_id
with no spaces and prefixed with "HP:", e.g., HP:0004209 - (optional)
patient_status
with a string indicatingaffected
,unaffected
, orunknown
; only affected patients are considered - (optional)
hpo_name
full name of HPO term, e.g.,Clinodactyly of the 5th finger
- (optional)
hpo_status
indicating if the HPO term is present (True) or absent (False); only True terms are considered - (optional)
quality_flag
indicating PASS if the HPO terms should be considered; some artifactual HPO terms or those obtained from questionable procedures can be excluded; only PASS terms are considered
- Required:
-
Compute all-against-all pairwise similarities between patients' sets of HPO terms and write out to file
test_pairwise_similarity.tsv
:
python /full/path/to/github/repo/RaMeDiES/pathway_analysis.py \
--pairwise_similarity \
--hpo_file=test_patient_hpoterms.tsv \
--similarity_file=test_pairwise_similarity.tsv
:cyclone: Step 2: Perform complete linkage hierarchical clustering to extract patient subgroups
We load pairwise similarity scores into a dissimilarity matrix in R to perform hierarchical clustering of patients and export the per-patient cluster assignments. You must have a local installation of R and you must have the cluster library installed (e.g., install.packages("cluster")
).
We run the relevant R code from within the same Python script; the Python rpy library must be installed as well (e.g., pip3 install rpy2
).
python /full/path/to/github/repo/RaMeDiES/pathway_analysis.py \
--cluster_patients \
--similarity_file=test_pairwise_similarity.tsv \
--cluster_assignments_file=test_cluster_assignments.tsv
:cyclone: Step 3: Assign per-patient candidate genes to each cluster
We identify candidate genes per patient and assign these genes to each cluster. In our implementation, we consider candidate genes to be:
- all known diagnostic gene(s) as annotated in the Undiagnosed Diseases Network (UDN) researcher portal
- genes harboring a compelling de novo variant (see details in our bioRxiv preprint) if no certain and complete diagnostic gene(s) were listed
- genes harboring a compound heterozygous variant ranked in the top 100 using RaMeDiES individual-level test if no certain and diagnostic gene(s) were listed and fewer than two de novo candidates were identified.
You must create a tab-delimited list of candidate genes (called test_candidate_genes.tsv
in our implementation) for all samples in your cohort with the following columns:
sample_id
corresponding to a unique patient and matching sample_id from the previous step, e.g., UDN123456gene_name
corresponding to an HGNC gene symbol- (Optional)
gene_type
in our case, corresponding toknown_diagnosis
,denovo
, orcomphet
- (Optional)
diagnosis
in our case, corresponding todiagnosis_match_{tentative/likely/certain}_{partial/complete}
- (Optional)
pheno_match
in our case, corresponding to the Phrank similarity score between patient and gene
Run the following to generate a tab-delimited list of cluster assignments and patient details and query genes per cluster:
python /full/path/to/github/repo/RaMeDiES/pathway_analysis.py \
--query_genes \
--cluster_assignments_file=test_cluster_assignments.tsv \
--candidate_genes_file=test_candidate_genes.tsv \
--query_genes_file=test_cluster_details.tsv \
:cyclone: Step 4: Run gene set enrichment analysis using g:Profiler API
Finally, we run gene set enrichment analysis on the sets of query genes created in the previous step, and we output tab-delimited results.
python /full/path/to/github/repo/RaMeDiES/pathway_analysis.py \
--run_gsea \
--query_genes_file=test_cluster_details.tsv \
--enrichment_output_file=test_cluster_enrichments.tsv