Quick Start - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki
Quick Start
git clone https://github.com/tahiri-lab/KMeansPhylogeneticTreesClustering.git
cd KMeansPhylogeneticTreesClustering/src
make install
./KMPTC -tree ../data/your_file.txt 1 0.2 3 8
Installation
Requirements
- git 2.35.1+
- macOS (tested on Monterey 12.5)
Steps
git clone https://github.com/tahiri-lab/KMeansPhylogeneticTreesClustering.git
cd KMeansPhylogeneticTreesClustering/src
make install
Help command
make help
Usage
Command
./KMPTC -tree input_file cluster_validity_index α Kmin Kmax
Description
This command clusters phylogenetic trees using K-means and returns:
- Clusters of trees
- Optimal number of clusters
- Supertrees
Parameters
input_file
Path to the input file (Newick format)
cluster_validity_index
- 1 → Calinski-Harabasz (CH)
- 2 → Ball-Hall (BH)
α (alpha)
Penalty parameter for species overlap
Range: 0 to 1
Kmin
Minimum number of clusters
- CH → Kmin ≥ 2
- BH → Kmin ≥ 1
Kmax
Maximum number of clusters
Constraint:
Kmax ≤ N - 1
(N = number of trees)
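The constraints above can be checked before launching a run. The following is a minimal sketch, not part of KMPTC: `validate_kmptc_args` is a hypothetical helper, and it assumes every Newick tree in the input file ends with `;` and that no label contains a semicolon, so that counting semicolons gives N. The Kmin ≤ Kmax check is implied by the parameter names rather than stated in this page.

```shell
#!/bin/sh
# Hypothetical pre-flight check for KMPTC arguments (not shipped with the tool).
# Usage: validate_kmptc_args input_file cluster_validity_index alpha Kmin Kmax
validate_kmptc_args() {
    file=$1; index=$2; alpha=$3; kmin=$4; kmax=$5

    [ -r "$file" ] || { echo "cannot read input file: $file" >&2; return 1; }

    # The index selects the lower bound allowed for Kmin.
    case $index in
        1) min_k=2 ;;   # Calinski-Harabasz: Kmin >= 2
        2) min_k=1 ;;   # Ball-Hall: Kmin >= 1
        *) echo "cluster_validity_index must be 1 (CH) or 2 (BH)" >&2; return 1 ;;
    esac

    # alpha must be a plain decimal number in the range 0 to 1.
    awk -v a="$alpha" 'BEGIN { exit !(a ~ /^[0-9]*\.?[0-9]+$/ && a >= 0 && a <= 1) }' \
        || { echo "alpha must be between 0 and 1" >&2; return 1; }

    # Kmin and Kmax must be non-negative integers.
    for k in "$kmin" "$kmax"; do
        case $k in
            ''|*[!0-9]*) echo "Kmin and Kmax must be integers" >&2; return 1 ;;
        esac
    done

    # Assumption: one ';' per tree, so the semicolon count equals N.
    n=$(tr -cd ';' < "$file" | wc -c | tr -d ' ')

    [ "$kmin" -ge "$min_k" ] || { echo "Kmin must be >= $min_k for this index" >&2; return 1; }
    [ "$kmin" -le "$kmax" ]  || { echo "Kmin must be <= Kmax" >&2; return 1; }
    [ "$kmax" -le $((n - 1)) ] || { echo "Kmax must be <= N - 1 = $((n - 1))" >&2; return 1; }
}
```

A run could then be guarded with `validate_kmptc_args ../data/your_file.txt 1 0.2 3 8 && ./KMPTC -tree ../data/your_file.txt 1 0.2 3 8`.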
Choosing the best K
Two indices are used:
Calinski-Harabasz (CH)
- Compares the separation between clusters with the dispersion inside them; higher values indicate a better partition
- Good when clusters are well separated
Ball-Hall (BH)
- Minimizes variance inside clusters
- Focuses on compact clusters
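For orientation, the classical forms of the two indices for N objects split into K clusters are sketched below. This is an assumption about their general shape only: SSB and SSW denote the between-cluster and within-cluster sums of squares, BH is given in one common form, and the tree-based versions used by KMPTC (including how α enters) may differ in detail.

```latex
\[
\mathrm{CH}(K) = \frac{SSB / (K - 1)}{SSW / (N - K)},
\qquad
\mathrm{BH}(K) = \frac{SSW}{K}
\]
```

Under these forms, the K - 1 denominator explains why CH needs Kmin ≥ 2 while BH is defined from K = 1, and the N - K denominator is consistent with the constraint Kmax ≤ N - 1.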
Supertrees
For each cluster:
- A supertree is inferred from the trees in that cluster
- The supertree represents the group of trees
Input / Output
Input
- Located in data/
- Must be in Newick format
Output
Generated in output/
Files
- stat.csv: Clustering statistics
- output.txt: Cluster content
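After a run, it can be useful to confirm that both files were actually written. The sketch below is a hypothetical helper, not part of KMPTC; the default directory `../output` assumes the command was run from `src/` and that `output/` sits at the repository root, which this page does not state explicitly.

```shell
#!/bin/sh
# Hypothetical post-run check: report the two documented output files.
# Usage: check_kmptc_outputs [output_dir]   (default ../output is an assumption)
check_kmptc_outputs() {
    dir=${1:-../output}
    status=0
    for f in stat.csv output.txt; do
        if [ -s "$dir/$f" ]; then
            # File exists and is non-empty: print its line count.
            lines=$(wc -l < "$dir/$f" | tr -d ' ')
            echo "$f: $lines lines"
        else
            echo "$f: missing or empty in $dir" >&2
            status=1
        fi
    done
    return $status
}
```

The function returns a non-zero status if either file is missing or empty, so it can be chained after a run in a script.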
Example
./KMPTC -tree ../data/Covid-19_trees.txt 1 0.2 3 8
Using the Makefile
make execute
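To compare both validity indices or several values of α on the same data, the command lines can be generated in a loop. This is a minimal sketch: `kmptc_sweep` is a hypothetical helper that only prints the commands (a dry run) and uses nothing beyond the documented argument order.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print one KMPTC command per (index, alpha) pair.
# Usage: kmptc_sweep input_file Kmin Kmax alpha [alpha ...]
kmptc_sweep() {
    file=$1; kmin=$2; kmax=$3
    shift 3
    for index in 1 2; do          # 1 = CH, 2 = BH
        for alpha in "$@"; do
            echo "./KMPTC -tree $file $index $alpha $kmin $kmax"
        done
    done
}
```

For example, `kmptc_sweep ../data/Covid-19_trees.txt 3 8 0 0.5 1` prints six commands, which can be reviewed and then piped to `sh` from the `src` directory. This page does not say whether successive runs overwrite the files in output/, so copying them aside between runs may be prudent.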
FAQ
Why choose CH or BH?
- CH → better for separated clusters
- BH → better for compact clusters
What is α?
α sets the penalty applied for species overlap between trees; it ranges from 0 to 1.
Why choose Kmin and Kmax?
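K-means requires the number of clusters to be fixed in advance.
Kmin and Kmax bound the candidate values of K, and the chosen index (CH or BH) is used to select the optimal K within that range.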