Study Case - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

🧪 Study Case: Clustering of COVID-19 Phylogenetic Trees


📥 Input Data

The dataset used in this study is:

  • Covid-19_trees.txt

This file contains multiple phylogenetic trees in Newick format, representing different possible evolutionary relationships between coronavirus strains.

Example (excerpt):

(Guangxi_Pangolin_P2V,...,Hu_Wuhan_2020,...)

👉 These trees include:

  • Human COVID-19 sequences (Wuhan, USA, Italy, Australia)
  • Bat coronaviruses (RaTG13, Bat-CoV)
  • Pangolin coronaviruses
  • Other related viruses (SARS, MERS)
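To check which taxa a given tree contains, the leaf labels can be pulled out of a Newick string. The following is a minimal sketch, not part of KMPTC: the helper `leaf_names` and the toy tree are hypothetical, and the regular expression assumes unquoted labels without spaces.

```python
import re

def leaf_names(newick: str) -> list[str]:
    """Return the leaf labels of a Newick string, in order of appearance."""
    # A leaf label directly follows '(' or ','; internal-node labels and
    # branch lengths follow ')' or ':' and are therefore not matched.
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)

# Toy tree for illustration only (not an excerpt from Covid-19_trees.txt).
toy = "((Hu_Wuhan_2020:0.1,RaTG13:0.2):0.05,Guangxi_Pangolin_P2V:0.3);"
print(leaf_names(toy))
```

Applied to each line of the input file, the same helper would list the taxa of every tree, provided each line holds one Newick string.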



⚙️ Execution Parameters

The application was executed using:

  • Cluster validity index: CH (Calinski-Harabasz)
  • α (penalty parameter): 0.2
  • K range: [3, 8]
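For reference, the classical Calinski-Harabasz index for N objects split into K clusters is shown below. This is the textbook form, given here as an assumption about what the CH option builds on; how KMPTC adapts it to tree distances and where α enters is not described on this page.

$$
\mathrm{CH}(K) = \frac{SS_B / (K - 1)}{SS_W / (N - K)}
$$

Here $SS_B$ is the between-cluster sum of squares and $SS_W$ the within-cluster sum of squares. In this study N = 12 trees, and K is tested over the range [3, 8]; the K that maximizes the index is retained.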

📊 Results (stat.csv)

CH;12;...;4;4;1.000;...;part(1 <> 1 <> 1 <> 3 <> ... <> 4)

Interpretation

  • Optimal number of clusters: K = 4
  • Perfect clustering score: 1.000
  • Partition distribution:
part(1 <> 1 <> 1 <> 3 <> 1 <> 1 <> 1 <> 1 <> 1 <> 1 <> 2 <> 4)

👉 This means:

  • 9 of the 12 trees belong to Cluster 1
  • Clusters 2, 3, and 4 each contain a single tree (the 11th, 4th, and 12th entries of the partition, respectively)
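The cluster sizes can be read off the partition string programmatically. The sketch below is illustrative only: `parse_partition` is a hypothetical helper, and the labels T1 to T12 assume that the i-th entry of the partition corresponds to the i-th tree of the input file.

```python
from collections import Counter

def parse_partition(part: str) -> list[int]:
    """Turn 'part(1 <> 1 <> 3)' into [1, 1, 3]."""
    inner = part[part.index("(") + 1 : part.rindex(")")]
    return [int(token) for token in inner.split("<>")]

partition = parse_partition(
    "part(1 <> 1 <> 1 <> 3 <> 1 <> 1 <> 1 <> 1 <> 1 <> 1 <> 2 <> 4)"
)

# Number of trees per cluster.
sizes = Counter(partition)

# Tree labels per cluster, assuming entry i belongs to tree Ti.
members: dict[int, list[str]] = {}
for i, cluster in enumerate(partition, start=1):
    members.setdefault(cluster, []).append(f"T{i}")

for cluster in sorted(members):
    print(cluster, sizes[cluster], members[cluster])
```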

🌳 Cluster Analysis (output.txt)

Cluster #1 (Dominant Cluster)

This cluster contains the majority of trees:

  • T1, T2, T3, T5, T6, T7, T8 ...


Biological Interpretation

These trees share a common evolutionary structure:

  • Human COVID-19 strains are grouped together
  • Close relationship with RaTG13 (bat virus)
  • Pangolin viruses appear in an intermediate position

👉 This cluster represents:

The dominant evolutionary hypothesis of SARS-CoV-2


Other Clusters

Clusters 2, 3, and 4 each contain a single tree.

👉 Interpretation:

  • Alternative evolutionary scenarios
  • Possible variations due to:
    • different tree inference methods
    • uncertainty in data
    • genetic variability

🔬 Scientific Interpretation

This study highlights an important concept: multiple plausible evolutionary trees can exist for the same set of taxa.

This happens because of:

  • Different datasets
  • Different inference algorithms
  • Biological uncertainty

Role of KMPTC

KMPTC helps to:

  • Group similar trees using the Robinson-Foulds (RF) distance
  • Identify dominant evolutionary patterns
  • Reduce complexity from many trees to a few clusters
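To make the RF distance concrete, here is a minimal sketch of the classical unrooted Robinson-Foulds distance: the number of non-trivial bipartitions present in one tree but not in the other. It is not KMPTC's implementation. The function names and toy trees are hypothetical, both trees are assumed to share the same leaf set, and any normalization the tool may apply is left out.

```python
import re

def splits(newick: str) -> set[frozenset[str]]:
    """Non-trivial bipartitions of a tree given as a Newick string."""
    s = re.sub(r":[^,();]+", "", newick)   # drop branch lengths
    s = re.sub(r"\)[^,();]+", ")", s)      # drop internal-node labels
    tokens = re.findall(r"[(),]|[^(),;\s]+", s)

    stack: list[set[str]] = []             # leaves seen under each open '('
    clades: set[frozenset[str]] = set()
    leaves: frozenset[str] = frozenset()
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            clade = frozenset(stack.pop())
            clades.add(clade)
            if stack:
                stack[-1] |= clade         # pass leaves up to the parent
            else:
                leaves = clade             # outermost ')' closes the root
        elif tok != ",":
            stack[-1].add(tok)             # a leaf label

    ref = min(leaves)                      # fixed leaf to orient each split
    result = set()
    for clade in clades:
        side = clade if ref not in clade else leaves - clade
        if 1 < len(side) < len(leaves) - 1:  # skip trivial splits
            result.add(side)
    return result

def rf_distance(tree_a: str, tree_b: str) -> int:
    """Count the splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

# Toy trees for illustration only.
t1 = "((A,B),(C,D),E);"
t2 = "((A,C),(B,D),E);"
t3 = "(((A,B),C),(D,E));"
print(rf_distance(t1, t1), rf_distance(t1, t2), rf_distance(t1, t3))
```

Each split is stored as the side that does not contain a fixed reference leaf, so the same bipartition compares equal regardless of how the tree is rooted or written.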

🌍 Real-World Insight

From this clustering:

  • A main evolutionary pattern emerges (Cluster 1)
  • Several alternative hypotheses exist (Clusters 2–4)

Example interpretation

  • COVID-19 strains are closely related worldwide
  • Strong evolutionary link with bat coronavirus (RaTG13)
  • Pangolin viruses may occupy an intermediate position in this evolution

🎯 Conclusion

This case study demonstrates that:

  • K-means combined with the RF distance effectively groups phylogenetic trees
  • Clustering reveals dominant evolutionary scenarios
  • Supertrees can summarize complex biological relationships

👉 The method transforms:

Many uncertain trees → Few meaningful evolutionary patterns


💡 Key Takeaways

  • RF distance measures structural differences between trees
  • K-means groups similar evolutionary hypotheses
  • CH index selects the optimal number of clusters
  • Each cluster represents a biological interpretation
  • Supertrees summarize evolutionary patterns