workflow - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

🔄 Application Workflow

[Figure: workflow diagram]

The steps of the application are as follows:

1) Input: Selecting Phylogenetic Trees

Phylogenetic trees are provided as input using a .txt file.

Requirements

Trees must strictly follow the Newick format
Each tree represents a possible evolutionary hypothesis

Interpretation

These trees constitute the dataset to be processed by the application.

2) Preprocessing

Before clustering, the program:

Reads and parses the Newick trees
Validates the structure of each tree
Extracts the topology of each tree as a set of splits (bipartitions of the leaf set)
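
The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's actual parser: the names `parse_newick`, `tree_splits`, and `read_trees` are hypothetical, branch lengths and internal node labels are discarded, quoted labels and comments are not handled, and trees are treated as unrooted. Validation here is limited to checking that parentheses are balanced.

```python
import re

def parse_newick(text):
    """Parse one Newick string into nested lists of leaf names.

    Branch lengths and internal node labels are discarded.
    Raises ValueError when parentheses are unbalanced.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    stack, current, prev = [], [], None
    for tok in tokens:
        if tok == "(":
            stack.append(current)
            current = []
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced ')' in Newick string")
            parent = stack.pop()
            parent.append(current)
            current = parent
        elif tok not in (",", ";") and prev != ")":
            # A token that follows ')' annotates an internal node, so skip it.
            name = tok.split(":")[0].strip()
            if name:
                current.append(name)
        prev = tok
    if stack:
        raise ValueError("unbalanced '(' in Newick string")
    return current[0] if len(current) == 1 else current

def tree_splits(tree):
    """Return the non-trivial splits of a tree, each stored as the side
    of the bipartition that does not contain the smallest leaf name."""
    found = set()

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(collect(child) for child in node))

    all_leaves = collect(tree)
    anchor = min(all_leaves)

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return below

    visit(tree)
    return found

def read_trees(text):
    """Split file content on ';' (the Newick terminator) and parse each tree."""
    return [parse_newick(chunk + ";") for chunk in text.split(";") if chunk.strip()]

text = "((A:0.1,B:0.2):0.3,C,D);\n((A,C),B,D);"
trees = read_trees(text)
split_sets = [tree_splits(t) for t in trees]
```

In this example the first tree yields the single split AB|CD and the second yields AC|BD, so the two trees share no split.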

3) Distance Computation (Core Step) 📐

The similarity between trees is computed using the Robinson-Foulds (RF) distance.

Key idea

Each tree is represented as a set of splits
RF distance counts the splits that are present in one tree but absent from the other

Formula

RF(T1, T2) = (number of splits in T1 not in T2) + (number of splits in T2 not in T1)

Properties

RF = 0 → trees have the same topology
Higher RF → trees are more different
Only topology is considered (not branch lengths)
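
As a small worked example of the formula, assuming each tree has already been reduced to a set of splits, the RF distance is the size of the symmetric difference of the two sets. The function name `rf_distance` and the way a split is stored (one side of the bipartition, as a frozenset of leaf names) are choices made for this sketch only.

```python
def rf_distance(splits_a, splits_b):
    """Count the splits found in exactly one of the two trees."""
    only_a = splits_a - splits_b
    only_b = splits_b - splits_a
    return len(only_a) + len(only_b)

# Each split is stored as the side of the bipartition without leaf "A".
t1 = {frozenset("CDE"), frozenset("DE")}  # splits of ((A,B),C,(D,E))
t2 = {frozenset("BDE"), frozenset("DE")}  # splits of ((A,C),B,(D,E))

print(rf_distance(t1, t2))  # 2: AB|CDE is only in t1, AC|BDE is only in t2
print(rf_distance(t1, t1))  # 0: same topology
```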

4) Clustering using K-means

The K-means algorithm is applied to group similar trees.

Process ⚙️

Select a number of clusters K (between Kmin and Kmax)
Assign each tree to the closest cluster using RF distance
Update cluster centers
Repeat until convergence
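
The loop above can be sketched for a single value of K as follows. The sketch rests on an assumption that is not taken from the application: each tree is encoded as a 0/1 vector with one entry per split, in which case the squared Euclidean distance between two trees equals their RF distance. Cluster centers are then plain coordinate means (which are generally not trees themselves), and the application's own update rule may differ. The names `kmeans` and `sq_dist` are hypothetical, and the first K vectors serve as initial centers only to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance; for 0/1 split vectors this is the RF distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iter=100):
    # Deterministic start for the sketch: the first k vectors act as centers.
    centers = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each tree goes to its closest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Four trees described by the presence (1) or absence (0) of four splits.
vectors = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (0, 0, 1, 0)]
labels, centers = kmeans(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Running the same function for every K between Kmin and Kmax gives the candidate partitions that the next step compares.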

5) Optimal Cluster Selection

Since K-means requires a predefined K, the application evaluates multiple values of K.

Two cluster validity indices are used:

Calinski-Harabasz (CH)

Compares between-cluster separation with within-cluster dispersion; higher values indicate better-separated clusters

Ball-Hall (BH)

Measures the mean within-cluster dispersion; lower values indicate more compact clusters

👉 The optimal K is selected based on these criteria.
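
To make the two indices concrete, here is a sketch of their standard Euclidean definitions applied to a labelled set of points. It assumes trees have been turned into numeric vectors; the application may use RF-based variants of these indices, and the function names are hypothetical. CH is undefined when there is a single cluster, when every point is its own cluster, or when the within-cluster dispersion is zero, and the sketch does not guard against those cases.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def group(points, labels):
    """Collect the points of each cluster into its own list."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return list(clusters.values())

def calinski_harabasz(points, labels):
    """(between-cluster dispersion / (K - 1)) / (within-cluster dispersion / (N - K))."""
    clusters = group(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))

def ball_hall(points, labels):
    """Mean over clusters of the average squared distance to the cluster centroid."""
    clusters = group(points, labels)
    per_cluster = [sum(sq_dist(p, centroid(c)) for p in c) / len(c) for c in clusters]
    return sum(per_cluster) / len(clusters)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = [0, 0, 1, 1]  # groups the two nearby pairs
poor = [0, 1, 0, 1]  # mixes distant points

print(calinski_harabasz(points, good), ball_hall(points, good))  # 50.0 1.0
print(calinski_harabasz(points, poor), ball_hall(points, poor))  # 0.08 25.0
```

The better partition scores higher on CH and lower on BH, which is the behaviour the selection step relies on.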

6) Supertree Inference

For each cluster:

A supertree is inferred
It represents the consensus or dominant evolutionary structure of the cluster

Interpretation

Each supertree corresponds to a main evolutionary pattern
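
The supertree method itself is not described on this page. As a rough illustration of what "dominant structure" can mean, the sketch below uses a swapped-in, simpler technique: a majority-rule summary that keeps the splits present in more than half of the trees of a cluster. It assumes all trees in the cluster share the same leaf set, which a true supertree method does not require, and the name `majority_splits` is hypothetical.

```python
from collections import Counter

def majority_splits(cluster):
    """Return the splits found in more than half of the trees in the cluster."""
    counts = Counter(split for tree in cluster for split in tree)
    return {split for split, count in counts.items() if count > len(cluster) / 2}

# Three trees on leaves A-E, each given as a set of splits
# (each split stored as the side of the bipartition without leaf "A").
cluster = [
    {frozenset("CDE"), frozenset("DE")},  # ((A,B),C,(D,E))
    {frozenset("CDE"), frozenset("DE")},  # ((A,B),C,(D,E))
    {frozenset("BDE"), frozenset("DE")},  # ((A,C),B,(D,E))
]

summary = majority_splits(cluster)
# summary holds AB|CDE and ABC|DE, i.e. the topology ((A,B),C,(D,E))
```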

7) Output Generation

The results are written to two files:

📄 output.txt

Contains cluster assignments
Shows which trees belong to each cluster

📊 stat.csv

Contains clustering statistics
Includes evaluation metrics for different K values

🌍 Global Interpretation

The workflow transforms:

A large set of possible phylogenetic trees
➡️ into
A small number of meaningful evolutionary patterns

Summary

Input = multiple evolutionary hypotheses
RF distance = measure of structural difference
K-means = grouping similar hypotheses
Supertrees = representative evolutionary scenarios

🎯 Purpose

This workflow allows researchers to:

Identify dominant evolutionary patterns
Reduce complexity in phylogenetic analysis
Compare alternative evolutionary hypotheses
Improve interpretation of biological data

To see a diagram of the workflow, consult the diagram.