workflow - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

🔄 Application Workflow

[Figure: boop drawio (workflow diagram)]

The steps of the application are as follows:


1) Input: Selecting Phylogenetic Trees

Phylogenetic trees are provided as input using a .txt file.

Requirements

  • Trees must strictly follow the Newick format
  • Each tree represents a possible evolutionary hypothesis

Interpretation

These trees constitute the dataset to be processed by the application.
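
As a rough illustration of this input step, the sketch below splits the text of such a file into individual Newick strings and runs a cheap sanity check on each. It assumes trees are separated by `;` and that labels contain no whitespace or quoted characters; the function names are hypothetical and are not taken from the application.

```python
from pathlib import Path


def split_newick_text(text):
    """Split raw text into Newick strings (assumes ';' ends each tree)."""
    trees = []
    for chunk in text.split(";"):
        newick = "".join(chunk.split())  # drop spaces and line breaks
        if newick:
            trees.append(newick + ";")
    return trees


def read_newick_trees(path):
    """Read a .txt file and return the Newick strings it contains."""
    return split_newick_text(Path(path).read_text(encoding="utf-8"))


def looks_like_newick(newick):
    """Cheap check only: ends with ';' and parentheses are balanced."""
    if not newick.endswith(";"):
        return False
    depth = 0
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0
```

A balanced-parentheses check does not prove that a string is valid Newick; it only catches the most common copy-and-paste errors before real parsing.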


2) Preprocessing

Before clustering, the program:

  • Reads and parses the Newick trees
  • Validates the structure of each tree
  • Extracts the topology (splits) of each tree
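
To make "extracting the splits" concrete, here is a minimal sketch of a Newick reader that returns the non-trivial splits (bipartitions) of a tree. It is an illustration only: it assumes unquoted labels and at least one pair of parentheses, ignores branch lengths and internal node labels, and the function name `newick_splits` is hypothetical rather than part of the application.

```python
import re


def newick_splits(newick):
    """Return the non-trivial splits of a Newick tree as a set of frozensets.

    Each split is stored as the side of the bipartition that does not
    contain the alphabetically smallest leaf, so equal splits compare equal.
    """
    tokens = re.findall(r"[(),]|[^(),;]+", newick.strip())
    stack = []          # one list of child leaf sets per open parenthesis
    clades = []         # leaf set of every closed internal node
    just_closed = False
    for tok in tokens:
        if tok == "(":
            stack.append([])
            just_closed = False
        elif tok == ",":
            just_closed = False
        elif tok == ")":
            clade = frozenset().union(*stack.pop())
            clades.append(clade)
            if stack:
                stack[-1].append(clade)
            just_closed = True
        elif just_closed:
            just_closed = False  # internal label or branch length: ignored
        else:
            label = tok.split(":")[0].strip()  # drop any branch length
            if label and stack:
                stack[-1].append(frozenset([label]))
    leaves = clades[-1]  # the last closed clade is the whole tree
    reference = min(leaves)
    splits = set()
    for clade in clades:
        if 2 <= len(clade) <= len(leaves) - 2:  # skip trivial splits
            splits.add(leaves - clade if reference in clade else clade)
    return splits
```

Because each split is normalised against a reference leaf, a rooted and an unrooted writing of the same topology yield the same set.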

3) Distance Computation (Core Step) 📐

The similarity between trees is computed using the Robinson-Foulds (RF) distance.

Key idea

  • Each tree is represented as a set of splits
  • The RF distance counts the splits that appear in one tree but not in the other

Formula

RF(T1, T2) = (splits in T1 not in T2) + (splits in T2 not in T1)

Properties

  • RF = 0 → trees have the same topology
  • Higher RF → trees are more different
  • Only topology is considered (not branch lengths)
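
The formula maps directly onto set operations. The sketch below assumes each tree has already been reduced to a set of splits, with every split written as a `frozenset` holding one side of the bipartition; the example trees and the function name are illustrative.

```python
def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance between two trees given as sets of splits."""
    return len(splits_a - splits_b) + len(splits_b - splits_a)


# Two trees on the leaves A..E, each split written as the side without A.
# T1 = ((A,B),C,(D,E)) has splits AB|CDE and ABC|DE.
t1 = {frozenset({"C", "D", "E"}), frozenset({"D", "E"})}
# T2 = ((A,C),B,(D,E)) has splits AC|BDE and ABC|DE.
t2 = {frozenset({"B", "D", "E"}), frozenset({"D", "E"})}

print(rf_distance(t1, t2))  # 2: one split is unique to each tree
print(rf_distance(t1, t1))  # 0: same topology
```

The shared split ABC|DE contributes nothing, which is why only the two conflicting splits are counted.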

4) Clustering using K-means

The K-means algorithm is applied to group similar trees.

Process ⚙️

  1. Select a number of clusters K (between Kmin and Kmax)
  2. Assign each tree to the closest cluster using RF distance
  3. Update cluster centers
  4. Repeat until convergence
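
The loop above can be sketched on a precomputed matrix of RF distances. This page does not say how the application updates its cluster centers, so the sketch swaps in a k-medoids-style update: each center is the member tree with the smallest total distance to the rest of its cluster. The function name and the choice of the first K trees as starting centers are assumptions made for reproducibility.

```python
def cluster_by_medoids(dist, k, max_iter=100):
    """Group items 0..n-1 into k clusters from a symmetric distance matrix.

    Assumption: centers are medoids (actual items), not averaged trees.
    """
    n = len(dist)
    medoids = list(range(k))  # assumed initialisation: first k items
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: pick the member with the smallest total distance.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:  # converged: centers stopped moving
            break
        medoids = new_medoids
    return labels, medoids


# Four trees: 0 and 1 are close, 2 and 3 are close, the pairs are far apart.
rf = [[0, 2, 8, 8],
      [2, 0, 8, 8],
      [8, 8, 0, 2],
      [8, 8, 2, 0]]
labels, medoids = cluster_by_medoids(rf, k=2)
```

In practice this would be run once for every K between Kmin and Kmax, usually from several starting points, keeping the best partition for each K.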

5) Optimal Cluster Selection

Since K-means requires a predefined K, the application evaluates multiple values of K.

Two cluster validity indices are used:

Calinski-Harabasz (CH)

  • Ratio of between-cluster to within-cluster dispersion; higher values indicate compact, well-separated clusters

Ball-Hall (BH)

  • Mean within-cluster dispersion; lower values indicate more compact clusters

👉 The optimal K is selected based on these criteria.
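
Both indices can be computed from a distance matrix and a list of cluster labels. The sketch below uses the textbook definitions with squared distances, where the within-cluster sum of squares equals the sum of squared pairwise distances divided by the cluster size (an identity that holds for Euclidean data). Whether the application uses squared or plain RF distances is not stated here, so treat that choice and the function names as assumptions.

```python
def dispersion(dist, members):
    """Sum of squared pairwise distances among members, divided by their count."""
    total = sum(dist[i][j] ** 2
                for a, i in enumerate(members)
                for j in members[a + 1:])
    return total / len(members)


def group_members(labels):
    """Return one list of item indices per cluster label."""
    return [[i for i, lab in enumerate(labels) if lab == c]
            for c in sorted(set(labels))]


def calinski_harabasz(dist, labels):
    """CH index; needs 1 < K < N and a non-zero within-cluster dispersion."""
    n = len(labels)
    groups = group_members(labels)
    k = len(groups)
    ssw = sum(dispersion(dist, g) for g in groups)  # within clusters
    ssb = dispersion(dist, list(range(n))) - ssw    # between clusters
    return (ssb / (k - 1)) / (ssw / (n - k))


def ball_hall(dist, labels):
    """BH index: mean over clusters of the average within-cluster dispersion."""
    groups = group_members(labels)
    return sum(dispersion(dist, g) / len(g) for g in groups) / len(groups)


rf = [[0, 2, 8, 8],
      [2, 0, 8, 8],
      [8, 8, 0, 2],
      [8, 8, 2, 0]]
good = [0, 0, 1, 1]   # groups the close pairs together
bad = [0, 1, 0, 1]    # splits each close pair
print(calinski_harabasz(rf, good), calinski_harabasz(rf, bad))  # 31.0 0.0625
print(ball_hall(rf, good), ball_hall(rf, bad))                  # 1.0 16.0
```

On this toy matrix the natural partition scores higher on CH and lower on BH, which is the direction each index is meant to reward.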


6) Supertree Inference

For each cluster:

  • A supertree is inferred
  • It represents the consensus or dominant evolutionary structure of the cluster

Interpretation

Each supertree corresponds to a main evolutionary pattern.
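
The supertree method used by the application is not described on this page. As a simpler stand-in that conveys the idea of a "dominant structure", the sketch below keeps the splits found in more than half of a cluster's trees, which is the majority-rule consensus. It assumes all trees in the cluster share the same leaves and are given as sets of splits; a real supertree method also handles trees with different leaf sets.

```python
from collections import Counter


def majority_rule_splits(split_sets):
    """Return the splits present in more than half of the given trees."""
    counts = Counter(split for splits in split_sets for split in splits)
    threshold = len(split_sets) / 2
    return {split for split, count in counts.items() if count > threshold}


# A cluster of three trees on the leaves A..E (splits written without A).
cde, bde, de = (frozenset({"C", "D", "E"}),
                frozenset({"B", "D", "E"}),
                frozenset({"D", "E"}))
cluster = [{cde, de}, {bde, de}, {cde, de}]
consensus = majority_rule_splits(cluster)  # keeps AB|CDE and ABC|DE
```

Splits that pass the majority threshold are always mutually compatible, so the result describes a single tree.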


7) Output Generation

The results are written to two files:

📄 output.txt

  • Contains cluster assignments
  • Shows which trees belong to each cluster

📊 stat.csv

  • Contains clustering statistics
  • Includes evaluation metrics for different K values

🌍 Global Interpretation

The workflow transforms:

  • A large set of possible phylogenetic trees
    ➡️ into
  • A small number of meaningful evolutionary patterns

Summary

  • Input = multiple evolutionary hypotheses
  • RF distance = measure of structural difference
  • K-means = grouping similar hypotheses
  • Supertrees = representative evolutionary scenarios

🎯 Purpose

This workflow allows researchers to:

  • Identify dominant evolutionary patterns
  • Reduce complexity in phylogenetic analysis
  • Compare alternative evolutionary hypotheses
  • Improve interpretation of biological data

To see a diagram of the workflow, consult diagram.