Cluster Validity Indices - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

📊 Cluster Validity Indices (CH and BH)

In clustering, selecting the optimal number of clusters (K) is a critical step.

K-means requires a predefined number of clusters, but the correct value of K is usually unknown.

Cluster validity indices address this by scoring the partition obtained for each candidate K, so that the candidates can be compared.


🔹 Calinski-Harabasz Index (CH)

The Calinski-Harabasz index evaluates clustering quality based on:

  • Separation between clusters
  • Compactness within clusters

Objective

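Maximize the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom (N = number of objects, K = number of clusters):

[Between-cluster dispersion / (K - 1)] / [Within-cluster dispersion / (N - K)]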

Interpretation

  • Higher CH value → better clustering
  • Indicates well-separated and compact clusters
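
As an illustration, the ratio can be computed directly from a labelled partition. The sketch below is a minimal pure-Python version, assuming the objects are numeric vectors compared with squared Euclidean distance; the function names are hypothetical and this is not necessarily how the project computes CH on phylogenetic trees.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points, labels):
    """CH = [B / (K - 1)] / [W / (N - K)] for the partition given by labels."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("CH is only defined for 2 <= K < N")

    overall = centroid(points)
    between = 0.0  # B: size-weighted squared distance of centroids to the overall mean
    within = 0.0   # W: squared distance of each point to its own centroid
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)

    if within == 0:
        return float("inf")  # every cluster is a set of identical points
    return (between / (k - 1)) / (within / (n - k))


# Two tight, well-separated pairs score far higher than a mixed-up partition.
pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(calinski_harabasz(pts, [0, 0, 1, 1]))  # 50.0
print(calinski_harabasz(pts, [0, 1, 0, 1]))  # 0.08
```

Because of the (K - 1) term, this form of CH cannot score the single-cluster case K = 1.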

🔹 Ball-Hall Index (BH)

The Ball-Hall index focuses on:

  • Compactness of clusters

Objective

Minimize:

  • The mean, over all clusters, of each cluster's average squared distance between its members and its centroid

Interpretation

  • Lower BH value → tighter, more homogeneous clusters
  • BH tends to decrease as K increases, so values are best compared across successive K rather than read in isolation
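
A minimal sketch of this quantity follows, again assuming numeric vectors and squared Euclidean distance. The helper names are hypothetical, and the project may use a tree distance in place of the Euclidean one.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ball_hall(points, labels):
    """Mean over clusters of the average squared distance to the cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sq_dist(p, c) for p in members) / len(members)
    return total / len(clusters)


pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(ball_hall(pts, [0, 0, 1, 1]))  # 1.0  (two tight clusters)
print(ball_hall(pts, [0, 1, 0, 1]))  # 25.0 (two spread-out clusters)
print(ball_hall(pts, [0, 0, 0, 0]))  # 26.0 (a single cluster, K = 1)
```

Unlike CH, this quantity is still defined when all objects fall in one cluster.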

🌳 CH and BH in This Project

In this project:

  1. K-means is executed for multiple values of K (from Kmin to Kmax)
  2. For each K, CH and BH indices are computed
  3. The optimal number of clusters is selected based on these indices
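
The three steps above can be sketched as a simple loop. This is an illustration under stated assumptions, not the project's code: objects are numeric vectors, K-means is a basic Lloyd iteration with a deterministic farthest-first initialisation (swapped in here for reproducibility), and K is chosen by the highest CH alone. How the project combines CH and BH is not specified on this page, and all function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return [sum(coords) / len(points) for coords in zip(*points)]


def kmeans(points, k, max_iter=100):
    """Basic Lloyd iteration; returns one cluster label per point."""
    # Farthest-first initialisation: start from the first point, then keep
    # adding the point farthest from all centres chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))

    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = centroid(members)
    return labels


def calinski_harabasz(points, labels):
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    overall = centroid(points)
    between = sum(len(m) * sq_dist(centroid(m), overall) for m in clusters.values())
    within = sum(sq_dist(p, centroid(m)) for m in clusters.values() for p in m)
    return (between / (k - 1)) / (within / (n - k))


def select_k(points, k_min, k_max):
    """Run K-means for each K in [k_min, k_max] and keep the K with the highest CH."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = kmeans(points, k)
        scores[k] = calinski_harabasz(points, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Three well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0),
        (10, 0), (10, 1), (11, 0),
        (0, 10), (0, 11), (1, 10)]
best_k, scores = select_k(data, 2, 4)
print(best_k)  # 3
```

The sketch assumes k_min is at least 2 and k_max is below the number of objects, since CH is undefined outside that range.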

🎯 Role in the Workflow

CH and BH allow:

  • Automatic selection of the best number of clusters
  • Avoiding arbitrary choice of K
  • Improving the reliability of clustering results