Study Case - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki
The dataset used in this study is:
Covid-19_trees.txt
This file contains multiple phylogenetic trees in Newick format, representing different possible evolutionary relationships between coronavirus strains.
Example (excerpt):
(Guangxi_Pangolin_P2V,...,Hu_Wuhan_2020,...)
These trees include:
- Human COVID-19 sequences (Wuhan, USA, Italy, Australia)
- Bat coronaviruses (RaTG13, Bat-CoV)
- Pangolin coronaviruses
- Other related viruses (SARS, MERS)
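To check which strains appear in each tree, the leaf labels can be pulled out of the Newick strings. The following is a minimal sketch, assuming each tree in the file ends with `;` and that labels are unquoted. The helper names `split_trees` and `leaf_labels` and the toy input are illustrative and not part of KMPTC; the topology shown is not taken from `Covid-19_trees.txt`.

```python
import re


def split_trees(text: str) -> list[str]:
    """Split a multi-tree file into Newick strings, one per ';' terminator."""
    return [chunk.strip() + ";" for chunk in text.split(";") if chunk.strip()]


def leaf_labels(newick: str) -> list[str]:
    """Return the leaf names of one Newick string, in order of appearance.

    A leaf label directly follows "(" or ","; internal-node labels and
    branch lengths follow ")" or ":" and are therefore skipped.
    Assumes unquoted labels without spaces, commas, colons or parentheses.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


# Toy input with a made-up topology (assumption, for illustration only).
sample = (
    "((Hu_Wuhan_2020:0.1,RaTG13:0.2):0.05,Guangxi_Pangolin_P2V:0.3);\n"
    "(A,(B,C));"
)
trees = split_trees(sample)
labels = [leaf_labels(t) for t in trees]
print(labels)
```

Comparing the label sets across trees is a quick way to confirm that all trees are defined on the same strains before computing distances between them.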
The application was executed using:
- Cluster validity index: CH (Calinski-Harabasz)
- α (penalty parameter): 0.2
- K range: [3, 8]
CH;12;...;4;4;1.000;...;part(1 <> 1 <> 1 <> 3 <> ... <> 4)
- Optimal number of clusters: K = 4
- Perfect clustering score: 1.000
- Partition distribution:
part(1 <> 1 <> 1 <> 3 <> 1 <> 1 <> 1 <> 1 <> 1 <> 1 <> 2 <> 4)
This means:
- Nine of the 12 trees belong to Cluster 1
- Clusters 2, 3, and 4 each contain a single tree (the 11th, 4th, and 12th entries of the partition, respectively)
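The cluster sizes can be read directly from the `part(...)` field. The sketch below parses the partition string shown above and tallies it. The function name `parse_partition` is hypothetical, and labeling the entries `T1`, `T2`, ... assumes the partition follows the order of the trees in the input file.

```python
from collections import Counter


def parse_partition(part: str) -> list[int]:
    """Turn 'part(1 <> 1 <> 3)' into [1, 1, 3]."""
    inner = part[part.index("(") + 1 : part.rindex(")")]
    return [int(token) for token in inner.split("<>")]


partition = parse_partition(
    "part(1 <> 1 <> 1 <> 3 <> 1 <> 1 <> 1 <> 1 <> 1 <> 1 <> 2 <> 4)"
)

# Number of trees per cluster.
sizes = Counter(partition)

# Tree labels per cluster (assumes entry i corresponds to tree Ti).
members = {
    cluster: [f"T{i}" for i, c in enumerate(partition, start=1) if c == cluster]
    for cluster in sorted(sizes)
}

for cluster, names in members.items():
    print(cluster, len(names), names)
```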
Cluster 1 contains the majority of trees:
- T1, T2, T3, T5, T6, T7, T8 ...
These trees share a common evolutionary structure:
- Human COVID-19 strains are grouped together
- Close relationship with RaTG13 (bat virus)
- Pangolin viruses occupy an intermediate position
This cluster represents:
The dominant evolutionary hypothesis of SARS-CoV-2
Clusters 2, 3, and 4 each contain a single tree.
Interpretation:
- Alternative evolutionary scenarios
- Possible variations due to:
  - different tree inference methods
  - uncertainty in data
  - genetic variability
This study highlights an important concept: the same set of strains can be described by several different phylogenetic trees.
This happens because of:
- Different datasets
- Different inference algorithms
- Biological uncertainty
KMPTC helps to:
- Group similar trees using RF distance
- Identify dominant evolutionary patterns
- Reduce complexity from many trees to a few clusters
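To make the RF (Robinson-Foulds) distance concrete, here is a minimal pure-Python sketch. It counts the non-trivial bipartitions (splits) that appear in one tree but not in the other. It assumes both trees have the same leaf set, treats them as unrooted, and ignores branch lengths. The function names and the toy trees are illustrative; KMPTC's own implementation (for example, whether it normalizes or squares the distance) is not described here and may differ.

```python
def parse_newick(s: str):
    """Parse a Newick string into nested tuples of leaf names.

    Branch lengths and internal labels are discarded.
    Assumes no whitespace and no quoted labels.
    """
    s = s.strip().rstrip(";")
    pos = 0

    def skip_label():
        nonlocal pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # consume ")"
            skip_label()  # internal label / branch length
            return tuple(children)
        start = pos
        skip_label()
        return s[start:pos].split(":")[0]

    return node()


def bipartitions(tree) -> set:
    """Return the non-trivial splits of a tree, each as a frozenset of leaves."""
    clades = []

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        found = frozenset().union(*(leaves(child) for child in n))
        clades.append(found)
        return found

    all_leaves = leaves(tree)
    ref = min(all_leaves)  # each split is stored as the side without `ref`
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def rf_distance(newick_a: str, newick_b: str) -> int:
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(parse_newick(newick_a)) ^ bipartitions(parse_newick(newick_b)))


# Toy trees (assumption, not from the dataset).
t1 = "((A,B),(C,D),E);"
t2 = "((A,C),(B,D),E);"
t3 = "((A,B),(C,E),D);"
print(rf_distance(t1, t2), rf_distance(t1, t3))
```

In this toy example `t1` and `t3` share the split separating A and B from the rest, so they are closer to each other than `t1` and `t2`, which share no split.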
From this clustering:
- A main evolutionary pattern emerges (Cluster 1)
- Several alternative hypotheses exist (Clusters 2–4)
- COVID-19 strains are closely related worldwide
- Strong evolutionary link with bat coronavirus (RaTG13)
- Pangolin viruses may play a role in intermediate evolution
This case study demonstrates that:
- K-means + RF distance effectively groups phylogenetic trees
- Clustering reveals dominant evolutionary scenarios
- Supertrees can summarize complex biological relationships
The method transforms:
Many uncertain trees → Few meaningful evolutionary patterns
- RF distance measures structural differences between trees
- K-means groups similar evolutionary hypotheses
- CH index selects the optimal number of clusters
- Each cluster represents a biological interpretation
- Supertrees summarize evolutionary patterns
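As an illustration of how a CH-style index can score a partition when only pairwise distances are available, the sketch below uses the standard Calinski-Harabasz ratio, with each sum of squares computed from pairwise distances instead of centroids. This is a generic formulation under stated assumptions, not KMPTC's exact formula: the α penalty and any normalization behind the reported 1.000 score are not specified in this page and are left out. The function name and the toy data are hypothetical.

```python
def calinski_harabasz(dist, labels) -> float:
    """Calinski-Harabasz index from a symmetric distance matrix.

    The sum of squares of a group is (1 / size) * sum of squared pairwise
    distances over its unordered pairs, which equals the usual
    centroid-based sum of squares when distances are Euclidean.
    """
    n = len(labels)
    clusters = sorted(set(labels))
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need 1 < number of clusters < number of items")

    def sum_of_squares(idx):
        total = sum(
            dist[i][j] ** 2 for a, i in enumerate(idx) for j in idx[a + 1:]
        )
        return total / len(idx)

    sst = sum_of_squares(list(range(n)))  # total dispersion
    ssw = sum(
        sum_of_squares([i for i, lab in enumerate(labels) if lab == c])
        for c in clusters
    )  # within-cluster dispersion
    ssb = sst - ssw  # between-cluster dispersion
    if ssw == 0:
        return float("inf")
    return (ssb / (k - 1)) / (ssw / (n - k))


# Toy data: four items on a line, standing in for trees (assumption).
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

ch_good = calinski_harabasz(dist, [0, 0, 1, 1])  # compact, well-separated groups
ch_bad = calinski_harabasz(dist, [0, 1, 0, 1])   # groups mixed together
print(ch_good, ch_bad)
```

With tree data, `dist` would hold pairwise RF distances, and the index would be evaluated for each K in the tested range, keeping the K with the highest score.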