K means Clustering - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki
🧠 K-means Clustering
General Definition
K-means is an unsupervised machine learning algorithm that partitions data into K clusters, where K is chosen in advance.
It aims to minimize the distance between each data point and the center of the cluster it is assigned to.
How it works
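- Choose the number of clusters K
- Initialize K cluster centers (for example, by picking K data points at random)
- Assign each data point to the nearest center
- Recompute each cluster center as the mean of the points assigned to it
- Repeat the assignment and update steps until the assignments stop changing (convergence)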
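The steps above can be sketched in a few lines of NumPy. This is a minimal illustration on numerical vectors, not the project's implementation; the function name `kmeans`, its parameters, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-means on an (n, d) array; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    # Final assignment against the final centers.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = kmeans(points, k=2)
```

The cluster numbering depends on the random initialization, so only the grouping itself is meaningful, not which group is labelled 0 or 1.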
Objective
Minimize the within-cluster variance: the total squared distance between each point and the center of its cluster.
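Written out, the standard K-means objective takes the form below. The symbols are notation introduced here for illustration: $C_j$ is the set of points in cluster $j$, $x_i$ is a data point, and $\mu_j$ is the center of cluster $j$.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each pass of assignment and update can only lower $J$ or leave it unchanged, which is why the loop converges.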
🌳 K-means in Bioinformatics (Phylogenetic Trees)
In this project, the data points are phylogenetic trees, not numerical vectors.
Key adaptation
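- Distance between trees = Robinson-Foulds (RF) distance, which counts the bipartitions (splits) that appear in one tree but not in the other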
- Trees are grouped based on structural similarity
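To make the RF distance concrete, here is a small self-contained sketch. It assumes trees are written as nested Python tuples of leaf names and compared as unrooted topologies; the helper names `leaf_set`, `splits`, and `rf_distance` are hypothetical and are not taken from this project, which would more likely read Newick files through a phylogenetics library.

```python
def leaf_set(tree):
    """All leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions of the tree, viewed as unrooted."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side;
        # storing the pair as a frozenset makes the split orientation-free.
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found

def rf_distance(t1, t2):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(t1) != leaf_set(t2):
        raise ValueError("trees must share the same leaf set")
    return len(splits(t1) ^ splits(t2))

# Hypothetical five-taxon trees.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
tree_c = ((("A", "D"), "B"), ("C", "E"))
d_ab = rf_distance(tree_a, tree_b)
d_ac = rf_distance(tree_a, tree_c)
```

Here `tree_a` and `tree_b` share the split separating D and E from the rest, so they are closer to each other than `tree_a` is to `tree_c`, which shares no split with it.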
Process in this context
- Compute the RF distance between every pair of trees
- Apply K-means using this distance in place of the usual Euclidean distance
- Group trees with similar topology
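Because the mean of a set of trees is not defined the way it is for vectors, a K-means-style loop over trees needs some stand-in for the cluster center. The sketch below uses a medoid (the member tree with the smallest total squared distance to the others in its cluster), which is a k-medoids-style substitution chosen for illustration and not necessarily how this project defines its centers. The function name `cluster_by_distance` and the distance values are hypothetical.

```python
import numpy as np

def cluster_by_distance(dist, k, n_iter=100, seed=0):
    """K-medoids-style loop on a symmetric (n, n) distance matrix."""
    dist = np.asarray(dist, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct trees chosen at random as cluster centers.
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every tree to its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # Squared distances mirror the K-means objective.
                cost = (dist[np.ix_(members, members)] ** 2).sum(axis=0)
                new_medoids[j] = members[cost.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # medoids unchanged: converged
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return labels, medoids

# Made-up RF distances between four trees: 0 and 1 are similar,
# 2 and 3 are similar, and the two pairs are far apart.
rf_matrix = [[0, 2, 6, 6],
             [2, 0, 6, 6],
             [6, 6, 0, 2],
             [6, 6, 2, 0]]
tree_labels, tree_medoids = cluster_by_distance(rf_matrix, k=2)
```

Working from a precomputed distance matrix keeps the clustering step independent of how the RF distances were obtained.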
Interpretation
Each cluster represents:
- a group of trees sharing similar evolutionary structures
- a consistent evolutionary hypothesis
🔗 RF + K-means Integration
The combination of RF distance and K-means allows:
- comparison of complex tree structures
- grouping of similar evolutionary hypotheses
- reduction of a large set of trees to a small number of meaningful clusters