metric k means - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki
🧠 K-means Clustering Metric
🔬 Definition
K-means minimizes the distance between data points and their cluster centers.
🧠 Objective Function
K-means minimizes:
J = Σ ||xi - μk||²
Where:
- xi = data point
- μk = centroid of cluster k
⚠️ Important Adaptation (This Project)
In classical K-means:
- data = vectors
- distance = Euclidean
In this project:
- data = phylogenetic trees
- distance = RF distance
📐 Modified Objective
Instead of Euclidean distance:
Minimize sum of RF distances within clusters
🔁 Iterative Process
- Assign trees to nearest cluster (RF distance)
- Update cluster representatives
- Repeat until convergence
📊 Interpretation
- Low intra-cluster RF → trees are similar
- High inter-cluster RF → clusters are distinct
📌 Example (Conceptual)
Given 3 trees:
- T1 close to T2 → same cluster
- T3 very different → separate cluster
🌳 Biological Meaning
Each cluster represents:
- a group of trees with similar topology
- a consistent evolutionary hypothesis
🔗 Relationship with RF
- RF defines similarity
- K-means uses this similarity to group trees