metric k means - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

🧠 K-means Clustering Metric

🔬 Definition

K-means minimizes the distance between data points and their cluster centers.


🧠 Objective Function

K-means minimizes:

J = Σ ||xi - μk||²

Where:

  • xi = data point
  • μk = centroid of cluster k

⚠️ Important Adaptation (This Project)

In classical K-means:

  • data = vectors
  • distance = Euclidean

In this project:

  • data = phylogenetic trees
  • distance = RF distance

📐 Modified Objective

Instead of Euclidean distance:

Minimize sum of RF distances within clusters

🔁 Iterative Process

  1. Assign trees to nearest cluster (RF distance)
  2. Update cluster representatives
  3. Repeat until convergence

📊 Interpretation

  • Low intra-cluster RF → trees are similar
  • High inter-cluster RF → clusters are distinct

📌 Example (Conceptual)

Given 3 trees:

  • T1 close to T2 → same cluster
  • T3 very different → separate cluster

🌳 Biological Meaning

Each cluster represents:

  • a group of trees with similar topology
  • a consistent evolutionary hypothesis

🔗 Relationship with RF

  • RF defines similarity
  • K-means uses this similarity to group trees