K means Clustering - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

🧠 K-means Clustering

General Definition

K-means is an unsupervised machine learning algorithm used to group data into K clusters.

It aims to minimize the distance between data points and their cluster centers.


How it works

  1. Choose the number of clusters K
  2. Initialize cluster centers
  3. Assign each data point to the nearest center
  4. Recompute cluster centers
  5. Repeat until convergence

Objective

Minimize:

Within-cluster variance

🌳 K-means in Bioinformatics (Phylogenetic Trees)

In this project, the data points are phylogenetic trees, not numerical vectors.


Key adaptation

  • Distance between trees = RF distance
  • Trees are grouped based on structural similarity

Process in this context

  1. Compute RF distance between trees
  2. Apply K-means using this distance
  3. Group trees with similar topology

Interpretation

Each cluster represents:

  • a group of trees sharing similar evolutionary structures
  • a consistent evolutionary hypothesis

🔗 RF + K-means Integration

The combination of RF distance and K-means allows:

  • comparison of complex tree structures
  • grouping of similar evolutionary hypotheses
  • reduction of multiple trees into meaningful clusters