K means Clustering - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

🧠 K-means Clustering

General Definition

K-means is an unsupervised machine learning algorithm used to group data into K clusters.

It aims to minimize the distance between data points and their cluster centers.

How it works

Choose the number of clusters K
Initialize cluster centers
Assign each data point to the nearest center
Recompute cluster centers
Repeat until convergence

Objective

Minimize:

Within-cluster variance

🌳 K-means in Bioinformatics (Phylogenetic Trees)

In this project, the data points are phylogenetic trees, not numerical vectors.

Key adaptation

Distance between trees = RF distance
Trees are grouped based on structural similarity

Process in this context

Compute RF distance between trees
Apply K-means using this distance
Group trees with similar topology

Interpretation

Each cluster represents:

a group of trees sharing similar evolutionary structures
a consistent evolutionary hypothesis

🔗 RF + K-means Integration

The combination of RF distance and K-means allows:

comparison of complex tree structures
grouping of similar evolutionary hypotheses
reduction of multiple trees into meaningful clusters