Quick Start - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

Quick Start

git clone https://github.com/tahiri-lab/KMeansPhylogeneticTreesClustering.git
cd KMeansPhylogeneticTreesClustering/src
make install
./KMPTC -tree ../data/your_file.txt 1 0.2 3 8

Installation

Requirements

  • git 2.35.1+
  • macOS (tested on Monterey 12.5)

Steps

git clone https://github.com/tahiri-lab/KMeansPhylogeneticTreesClustering.git
cd KMeansPhylogeneticTreesClustering/src
make install

Help command

make help

Usage

Command

./KMPTC -tree input_file cluster_validity_index α Kmin Kmax

Description

This command clusters phylogenetic trees using K-means and returns:

  • Clusters of trees
  • Optimal number of clusters
  • Supertrees

Parameters

input_file

Path to the input file (Newick format)

cluster_validity_index

  • 1 → Calinski-Harabasz (CH)
  • 2 → Ball-Hall (BH)

α (alpha)

Penalty parameter for species overlap between trees
Range: 0 to 1

Kmin

Minimum number of clusters

  • CH → Kmin ≥ 2
  • BH → Kmin ≥ 1

Kmax

Maximum number of clusters

Constraint:
Kmax ≤ N - 1
(N = number of trees)
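The constraints above can be checked before launching a run. The following is a minimal sketch, not part of KMPTC: the function name `validate_kmptc_args` is hypothetical, and the check that Kmax is at least Kmin is an assumption that this page does not state explicitly.

```python
def validate_kmptc_args(index, alpha, kmin, kmax, n_trees):
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []

    # cluster_validity_index: 1 = Calinski-Harabasz, 2 = Ball-Hall
    if index not in (1, 2):
        problems.append("cluster_validity_index must be 1 (CH) or 2 (BH)")
    else:
        # CH needs at least 2 clusters, BH accepts 1
        min_allowed = 2 if index == 1 else 1
        if kmin < min_allowed:
            problems.append(f"Kmin must be >= {min_allowed} for this index")

    # alpha is documented as ranging from 0 to 1 (assumed inclusive here)
    if not 0.0 <= alpha <= 1.0:
        problems.append("alpha must lie between 0 and 1")

    # Assumption: a range with Kmax < Kmin is treated as invalid
    if kmax < kmin:
        problems.append("Kmax must be >= Kmin")

    # Documented constraint: Kmax <= N - 1, where N is the number of trees
    if kmax > n_trees - 1:
        problems.append("Kmax must be <= N - 1")

    return problems


# The quick-start arguments, assuming an input file with 10 trees
print(validate_kmptc_args(1, 0.2, 3, 8, 10))
```

An empty list means the arguments satisfy every constraint listed in this section.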


Choosing the best K

The optimal K is selected with one of two cluster validity indices:

Calinski-Harabasz (CH)

  • Measures how well separated the clusters are relative to their internal spread (higher is better)
  • Works well when clusters are well separated

Ball-Hall (BH)

  • Measures the average variance within clusters
  • Favors compact clusters
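To make the two indices above concrete, here is a minimal sketch of their textbook definitions computed from a pairwise distance matrix. It assumes Euclidean-style distances, so that a cluster's sum of squares equals the sum of its squared pairwise distances divided by its size. The function names are hypothetical, and the exact formulas KMPTC applies to trees may differ.

```python
from itertools import combinations


def within_ss(dist, members):
    """Sum of squares of a group, from squared pairwise distances."""
    pairs = combinations(members, 2)
    return sum(dist[i][j] ** 2 for i, j in pairs) / len(members)


def ch_and_bh(dist, labels):
    """Return (Calinski-Harabasz, Ball-Hall) for one partition."""
    n = len(labels)
    clusters = {}
    for item, label in enumerate(labels):
        clusters.setdefault(label, []).append(item)
    k = len(clusters)

    ssw_each = [within_ss(dist, m) for m in clusters.values()]
    ssw = sum(ssw_each)                      # within-cluster dispersion
    sst = within_ss(dist, list(range(n)))    # total dispersion
    ssb = sst - ssw                          # between-cluster dispersion

    # CH: between/within ratio, undefined for K = 1 or zero within-dispersion
    if k > 1 and ssw > 0:
        ch = (ssb / (k - 1)) / (ssw / (n - k))
    else:
        ch = float("nan")

    # BH: mean over clusters of the per-item dispersion
    bh = sum(s / len(m) for s, m in zip(ssw_each, clusters.values())) / k
    return ch, bh


# Toy data: four points on a line, forming two tight groups
points = [0, 1, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]
ch, bh = ch_and_bh(dist, [0, 0, 1, 1])
print(ch, bh)  # 200.0 0.25
```

On this toy data the two groups are both far apart and compact, so CH is large and BH is small.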

Supertrees

For each cluster:

  • A supertree is inferred from the trees in that cluster
  • The supertree serves as the representative of the cluster

Input / Output

Input

  • Located in data/
  • Must be in Newick format
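Since Kmax is bounded by the number of trees, it can help to count the trees in an input file first. The sketch below assumes that each tree is a Newick string terminated by `;`, and it skips semicolons inside single-quoted labels and `[...]` comments. The function name `count_newick_trees` is hypothetical and not part of KMPTC.

```python
def count_newick_trees(text):
    """Count ';' terminators outside quoted labels and [comments]."""
    count = 0
    in_quote = False
    comment_depth = 0
    for ch in text:
        if in_quote:
            # A single quote closes the label ('' toggles twice, which is harmless)
            in_quote = ch != "'"
        elif comment_depth:
            if ch == "[":
                comment_depth += 1
            elif ch == "]":
                comment_depth -= 1
        elif ch == "'":
            in_quote = True
        elif ch == "[":
            comment_depth = 1
        elif ch == ";":
            count += 1
    return count


sample = "((A,B),C);\n(A,(B,C));\n('odd;name',B);\n"
n_trees = count_newick_trees(sample)
print(n_trees, "trees, so Kmax can be at most", n_trees - 1)
```

For a real file, read its contents with `open(path).read()` and pass the result to the function.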

Output

Generated in output/

Files

  • stat.csv

    • Clustering statistics
  • output.txt

    • Cluster content

Example

./KMPTC -tree ../data/Covid-19_trees.txt 1 0.2 3 8
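To compare several values of α on the same input, the command above can be generated programmatically. This is a sketch only: `build_kmptc_command` is a hypothetical helper and the α values are arbitrary. This page does not say whether successive runs overwrite the files in output/, so copy them between runs if you need to keep each result.

```python
import shlex


def build_kmptc_command(input_file, index, alpha, kmin, kmax, binary="./KMPTC"):
    """Build the argument list in the order documented under Usage."""
    return [binary, "-tree", input_file,
            str(index), str(alpha), str(kmin), str(kmax)]


# Arbitrary example values of alpha within the documented 0 to 1 range
alphas = (0.0, 0.2, 0.5, 1.0)
commands = [
    build_kmptc_command("../data/Covid-19_trees.txt", 1, a, 3, 8)
    for a in alphas
]

for cmd in commands:
    # Print each command; to execute it from the src directory, one could use
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```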

Using Makefile

make execute

FAQ

Why choose CH or BH?

  • CH → better for separated clusters
  • BH → better for compact clusters

What is α?

Controls penalty for species overlap between trees.


Why choose Kmin and Kmax?

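K-means requires the number of clusters to be fixed in advance.
Kmin and Kmax bound the range of K to evaluate, and the chosen index (CH or BH) determines the optimal K within that range.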