workflow - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki
🔄 Application Workflow
The steps of the application are as follows:
1) Input: Selecting Phylogenetic Trees
Phylogenetic trees are supplied in a plain-text (.txt) input file.
Requirements
- Trees must strictly follow the Newick format
- Each tree represents a possible evolutionary hypothesis
Interpretation
These trees constitute the dataset to be processed by the application.
2) Preprocessing
Before clustering, the program:
- Reads and parses the Newick trees
- Validates the structure of each tree
- Extracts the topology of each tree as a set of splits
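The parsing and split-extraction steps above can be sketched as follows. This is a minimal illustration, not the application's own parser: the function names `newick_splits` and `parse_trees` are hypothetical, trees are assumed to be separated by `;` in the input text, and each tree is treated as unrooted (an assumption; the page does not say how the application roots trees). Branch lengths and internal labels are ignored.

```python
import re

def newick_splits(newick):
    """Return the non-trivial splits of one Newick tree as a set of frozensets."""
    tokens = re.findall(r"[(),;]|[^(),;\s]+", newick)
    stack, clades, leaves, prev = [], [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())              # open a new internal node
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced ')' in Newick string")
            clade = stack.pop()              # leaves below the node just closed
            clades.append(frozenset(clade))
            if stack:
                stack[-1] |= clade
        elif tok not in (",", ";") and prev != ")":
            # A leaf token such as "A" or "A:0.1"; tokens right after ")"
            # are internal labels or branch lengths and are skipped.
            name = tok.split(":")[0]
            if name:
                leaves.add(name)
                if stack:
                    stack[-1].add(name)
        prev = tok
    if stack:
        raise ValueError("unbalanced '(' in Newick string")
    if not leaves:
        raise ValueError("no leaves found in Newick string")

    ref, n = min(leaves), len(leaves)
    splits = set()
    for clade in clades:
        if 2 <= len(clade) <= n - 2:         # drop trivial splits
            # Canonical form: always store the side without the reference leaf.
            side = clade if ref not in clade else frozenset(leaves - clade)
            splits.add(side)
    return splits

def parse_trees(text):
    """Parse every ';'-terminated tree found in the text of an input file."""
    return [newick_splits(t + ";") for t in text.split(";") if t.strip()]

example = "((A,B),(C,D));\n((A:0.1,C:0.2):0.3,(B:0.1,D:0.4):0.2);\n"
all_splits = parse_trees(example)
```

Storing each split as the side that excludes a fixed reference leaf means two trees with the same unrooted topology yield equal split sets, whatever their rooting.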
3) Distance Computation (Core Step) 📐
The similarity between trees is computed using the Robinson-Foulds (RF) distance.
Key idea
- Each tree is represented as a set of splits
- RF distance counts the splits that are present in one tree but absent from the other
Formula
RF(T1, T2) = (number of splits in T1 not in T2) + (number of splits in T2 not in T1)
Properties
- RF = 0 → trees have identical topologies
- Higher RF → trees are more different
- Only topology is considered (not branch lengths)
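As an illustration of the formula, here is a minimal sketch that computes the RF distance when each tree is already available as a Python set of splits. The function name `rf_distance` and the encoding of a split as a frozenset of leaf names are assumptions made for this example.

```python
def rf_distance(splits1, splits2):
    """Size of the symmetric difference between two sets of splits."""
    return len(splits1 - splits2) + len(splits2 - splits1)

# Two hypothetical five-leaf trees that share one split and differ on the other.
t1 = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
t2 = {frozenset({"A", "B"}), frozenset({"A", "B", "D"})}

d_same = rf_distance(t1, t1)   # identical topology
d_diff = rf_distance(t1, t2)   # one unmatched split on each side
```

Here `d_same` is 0 and `d_diff` is 2, since each tree contains exactly one split that the other lacks.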
4) Clustering using K-means
The K-means algorithm is applied to group similar trees.
Process ⚙️
- Select a number of clusters K (between Kmin and Kmax)
- Assign each tree to the closest cluster using RF distance
- Update cluster centers
- Repeat until convergence
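The assign/update loop can be sketched over a precomputed matrix of pairwise RF distances. Note that this sketch swaps in a medoid-style update (the center of a cluster is the member tree with the smallest sum of squared distances to the other members), because this page does not describe how the application represents a cluster center. The function name `cluster_trees` and the deterministic farthest-point initialisation are also assumptions.

```python
def cluster_trees(dist, k, max_iter=100):
    """Group items given a symmetric distance matrix; returns (labels, centers)."""
    n = len(dist)
    # Farthest-point initialisation: start from item 0, then repeatedly add
    # the item that is farthest from all centers chosen so far.
    centers = [0]
    while len(centers) < k:
        centers.append(max(range(n), key=lambda i: min(dist[i][c] for c in centers)))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each tree joins the cluster of its closest center.
        new_labels = [min(range(k), key=lambda c: dist[i][centers[c]]) for i in range(n)]
        if new_labels == labels:             # convergence: nothing moved
            break
        labels = new_labels
        # Update step: pick the most central member of each cluster.
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                centers[c] = min(
                    members, key=lambda m: sum(dist[m][j] ** 2 for j in members)
                )
    return labels, centers

# Hypothetical RF matrix: trees 0-2 are close to each other, as are trees 3-4.
rf = [
    [0, 2, 2, 8, 8],
    [2, 0, 2, 8, 8],
    [2, 2, 0, 8, 8],
    [8, 8, 8, 0, 2],
    [8, 8, 8, 2, 0],
]
labels, centers = cluster_trees(rf, k=2)
```

To cover the range from Kmin to Kmax, the same function would be called once per candidate K and the resulting partitions compared with a validity index.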
5) Optimal Cluster Selection
Since K-means requires a predefined K, the application evaluates multiple values of K.
Two cluster validity indices are used:
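Calinski-Harabasz (CH)
- Measures the ratio of between-cluster separation to within-cluster dispersion; higher values indicate better-separated clusters
Ball-Hall (BH)
- Measures the average within-cluster variance, so it reflects how compact the clusters are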
👉 The optimal K is selected based on these criteria.
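Both indices can be computed from pairwise dissimilarities alone, without explicit cluster centers, using the identity that the sum of squared deviations from a centroid equals the sum of squared pairwise distances divided by the cluster size. The sketch below assumes that identity applies and that `sq[i][j]` holds a squared distance between trees i and j; whether the application uses exactly these formulas is not stated on this page. The function name `validity_indices` is hypothetical.

```python
def validity_indices(sq, labels):
    """Return (CH, BH) for a partition, given squared pairwise distances."""
    n = len(labels)
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    k = len(clusters)

    def pair_sum(members):
        # Sum of sq[a][b] over all unordered pairs of members.
        return sum(sq[a][b] for x, a in enumerate(members) for b in members[x + 1:])

    # Within-cluster dispersion of each cluster, then the total (SSW).
    ssw_parts = [pair_sum(m) / len(m) for m in clusters.values()]
    ssw = sum(ssw_parts)
    sst = pair_sum(list(range(n))) / n       # total dispersion
    ssb = sst - ssw                          # between-cluster dispersion

    if k > 1 and n > k and ssw > 0:
        ch = (ssb / (k - 1)) / (ssw / (n - k))
    else:
        ch = float("nan")                    # CH is undefined in these cases
    # Ball-Hall: mean over clusters of the per-item dispersion.
    bh = sum(s / len(m) for s, m in zip(ssw_parts, clusters.values())) / k
    return ch, bh

# Sanity check on four points of a line (0, 1, 10, 11), using squared distances.
pts = [0, 1, 10, 11]
sq = [[(a - b) ** 2 for b in pts] for a in pts]
ch, bh = validity_indices(sq, [0, 0, 1, 1])
```

Because the RF distance between two split sets equals the squared Euclidean distance between their 0/1 split-indicator vectors, one option is to pass the RF values themselves as `sq`. Under CH, the K with the largest index value is normally preferred.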
6) Supertree Inference
For each cluster:
- A supertree is inferred
- It represents the consensus or dominant evolutionary structure of the cluster
Interpretation
Each supertree corresponds to one main evolutionary pattern in the dataset.
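To give a feel for what a "dominant structure" means in terms of splits, the sketch below computes a majority-rule consensus for one cluster. This is a simpler stand-in, not a supertree method: it only makes sense when all trees in the cluster share the same leaf set, whereas supertree inference also handles trees with different leaf sets. The function name `majority_splits` is hypothetical.

```python
from collections import Counter

def majority_splits(cluster_split_sets):
    """Keep the splits found in more than half of the trees of a cluster."""
    counts = Counter(s for splits in cluster_split_sets for s in splits)
    threshold = len(cluster_split_sets) / 2
    return {s for s, c in counts.items() if c > threshold}

# Three hypothetical trees of one cluster, each given as a set of splits.
ab, abc, abd = frozenset({"A", "B"}), frozenset({"A", "B", "C"}), frozenset({"A", "B", "D"})
cluster = [{ab, abc}, {ab, abc}, {ab, abd}]
consensus = majority_splits(cluster)
```

In this example the split `ab` occurs in all three trees and `abc` in two of them, so both are kept, while `abd` occurs only once and is dropped.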
7) Output Generation
The results are written to two files:
📄 output.txt
- Contains cluster assignments
- Shows which trees belong to each cluster
📊 stat.csv
- Contains clustering statistics
- Includes evaluation metrics for different K values
🌍 Global Interpretation
The workflow transforms:
- A large set of possible phylogenetic trees
➡️ into
- A small number of meaningful evolutionary patterns
Summary
- Input = multiple evolutionary hypotheses
- RF distance = measure of structural difference
- K-means = grouping similar hypotheses
- Supertrees = representative evolutionary scenarios
🎯 Purpose
This workflow allows researchers to:
- Identify dominant evolutionary patterns
- Reduce complexity in phylogenetic analysis
- Compare alternative evolutionary hypotheses
- Improve interpretation of biological data
To see a diagram of the workflow, consult the diagram page.