Cluster Optimization - hisl6802/ClusteringToolbox GitHub Wiki

Mono-clustering optimization

A key aspect of correctly applying a clustering algorithm is the selection of the number of clusters to analyze. This functionality allows the user to create a set mono-clustering solutions (2-N clusters) for a single agglomerative hierarchical clustering algorithm, and finds the optimal number. The available optimization algorithms can be found below.

Example Input

mz S1 S2 S3 S4 S5 S6 S7 rtmed
G1 G1 G1 G1 G2 G2 G2
60.04 $I_{11}$ $I_{12}$ $I_{13}$ $I_{14}$ $I_{15}$ $I_{16}$ $I_{17}$ 3.21
61.04 $I_{21}$ $I_{22}$ $I_{23}$ $I_{24}$ $I_{25}$ $I_{26}$ $I_{27}$ 3.62
. . .
. . .
. . .
62.04 $I_{31}$ $I_{32}$ $I_{33}$ $I_{34}$ $I_{35}$ $I_{36}$ $I_{37}$ 3.62
69.99 $I_{41}$ $I_{42}$ $I_{43}$ $I_{44}$ $I_{45}$ $I_{46}$ $I_{47}$ 9.33

Silhouette score (docs)

$$SI(k) = \frac{1}{k} \sum_{h=1}^k SI_h$$

where, $$SI_h = \frac{1}{\lvert C_H \rvert} \sum_{i=1}^{\lvert C_H \rvert} \frac{b_i^h - a_i^h}{max(a_i^h,b_i^h)}$$

$$ a_i^h = \frac{1}{\lvert C_H \rvert -1} \sum_{l=1,l \neq i}^{\lvert C_H \rvert} d(x_i^h,x_l^h)$$

$$ b_i^h = min_{j \in (1,..k);j\neq h} \frac{1}{\lvert C_j \rvert} \sum_{l=1}^{ \lvert C_j \rvert} d(x_i^h,x_l^h)$$

Davies-Bouldin Index (docs)

$$DBI(k) = \frac{1}{k} \sum_{h=1}^k F_{C_H}$$

$$F_{C_H} = max_{C_j \neq C_h} F_{C_HC_j}$$

$$F_{C_HC_j} = \frac{f_1(C_h) + f_1(C_j)}{f_2(C_h,C_j)}$$

where, $f_1$ is the average distance between points and centroid and $f_2$ is the distance between centroids.

Calinski-Harabasz (Pseudo F-statistic) (docs)

$$ CH(k) = \frac{Inter-cluster\ separation}{Intra-cluster\ separation}$$

$$ Inter-cluster\ separation = (\sum_{i=1}^{K} \lvert C_i \rvert d(v_i,v)^2)/ (K-1) $$

$$ Intra-cluster\ separation = (\sum_{i=1}^{K} \sum_{x \in C_i} d(\mathbf{x},v_i)^2)/ (n-K) $$