Cluster Optimization - hisl6802/ClusteringToolbox GitHub Wiki
Mono-clustering optimization
A key aspect of correctly applying a clustering algorithm is selecting the number of clusters to analyze. This functionality creates a set of mono-clustering solutions (2 to N clusters) for a single agglomerative hierarchical clustering algorithm and identifies the optimal number of clusters. The available optimization criteria are described below.
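As a rough illustration of the 2 to N scan described above, the sketch below uses scikit-learn's `AgglomerativeClustering` and `silhouette_score` on made-up data. The function name `scan_cluster_counts`, the Ward linkage, and the toy blobs are assumptions for illustration and are not taken from ClusteringToolbox itself.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def scan_cluster_counts(X, max_clusters, linkage="ward"):
    """Score agglomerative solutions with 2..max_clusters clusters.

    Hypothetical helper: returns the best cluster count and all scores.
    """
    scores = {}
    for k in range(2, max_clusters + 1):
        model = AgglomerativeClustering(n_clusters=k, linkage=linkage)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # The silhouette score is maximized, so pick the k with the largest value.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (made up): three well-separated blobs in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

best_k, scores = scan_cluster_counts(X, max_clusters=6)
print(best_k, scores)
```

The silhouette score and the Calinski-Harabasz index are maximized, whereas the Davies-Bouldin index is minimized, so the selection step would use `min` instead of `max` for the latter.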
Example Input
| mz | S1 | S2 | S3 | S4 | S5 | S6 | S7 | rtmed |
|---|---|---|---|---|---|---|---|---|
| | G1 | G1 | G1 | G1 | G2 | G2 | G2 | |
| 60.04 | $I_{11}$ | $I_{12}$ | $I_{13}$ | $I_{14}$ | $I_{15}$ | $I_{16}$ | $I_{17}$ | 3.21 |
| 61.04 | $I_{21}$ | $I_{22}$ | $I_{23}$ | $I_{24}$ | $I_{25}$ | $I_{26}$ | $I_{27}$ | 3.62 |
| . | . | . | ||||||
| . | . | . | ||||||
| . | . | . | ||||||
| 62.04 | $I_{31}$ | $I_{32}$ | $I_{33}$ | $I_{34}$ | $I_{35}$ | $I_{36}$ | $I_{37}$ | 3.62 |
| 69.99 | $I_{41}$ | $I_{42}$ | $I_{43}$ | $I_{44}$ | $I_{45}$ | $I_{46}$ | $I_{47}$ | 9.33 |
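To make the layout concrete, here is a minimal sketch that splits a table of this shape into its parts. It assumes the table has been exported as CSV text with the group labels in the second row and empty cells under `mz` and `rtmed`. The intensity values are made up, and the file type the toolbox actually expects may differ.

```python
import csv
import io

# Made-up intensities in the layout shown above (assumption: CSV export).
raw = """mz,S1,S2,S3,S4,S5,S6,S7,rtmed
,G1,G1,G1,G1,G2,G2,G2,
60.04,10.0,12.0,11.0,9.0,30.0,28.0,31.0,3.21
61.04,5.0,6.0,5.5,6.5,2.0,2.5,1.5,3.62
"""

rows = list(csv.reader(io.StringIO(raw)))
header, group_row, data_rows = rows[0], rows[1], rows[2:]

# Sample columns sit between the first (mz) and last (rtmed) columns.
sample_names = header[1:-1]
groups = dict(zip(sample_names, group_row[1:-1]))  # sample name -> group label

mz = [float(r[0]) for r in data_rows]
rtmed = [float(r[-1]) for r in data_rows]
intensities = [[float(v) for v in r[1:-1]] for r in data_rows]

print(groups)
print(mz, rtmed)
```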
Silhouette score (docs)
$$SI(k) = \frac{1}{k} \sum_{h=1}^k SI_h$$
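where $$SI_h = \frac{1}{\lvert C_h \rvert} \sum_{i=1}^{\lvert C_h \rvert} \frac{b_i^h - a_i^h}{\max(a_i^h, b_i^h)}$$
$$a_i^h = \frac{1}{\lvert C_h \rvert - 1} \sum_{l=1, l \neq i}^{\lvert C_h \rvert} d(x_i^h, x_l^h)$$
$$b_i^h = \min_{1 \le j \le k,\ j \neq h} \frac{1}{\lvert C_j \rvert} \sum_{l=1}^{\lvert C_j \rvert} d(x_i^h, x_l^j)$$
Here $a_i^h$ is the mean distance from point $x_i^h$ to the other points in its own cluster $C_h$, and $b_i^h$ is the smallest mean distance from $x_i^h$ to the points of any other cluster $C_j$. Higher values of $SI(k)$ indicate better-separated clusters.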
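The silhouette formulas can be checked on a small example. The following is a minimal pure-Python sketch of $SI(k)$ as a mean of per-cluster means, assuming Euclidean distance for $d$ and a silhouette width of 0 for singleton clusters. Both are assumptions, and the function name `silhouette_by_cluster` is hypothetical. Note that scikit-learn's `silhouette_score` averages over all samples rather than over clusters, so the two can differ when cluster sizes are unequal.

```python
import math


def silhouette_by_cluster(points, labels):
    """SI(k): mean over clusters of the mean per-point silhouette width.

    Requires at least two clusters and no coincident points.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    cluster_means = []
    for h, members in clusters.items():
        widths = []
        for i, x in enumerate(members):
            if len(members) == 1:
                widths.append(0.0)  # assumed convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(math.dist(x, y) for j, y in enumerate(members) if j != i)
            a /= len(members) - 1
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(math.dist(x, y) for y in other) / len(other)
                for j, other in clusters.items()
                if j != h
            )
            widths.append((b - a) / max(a, b))
        cluster_means.append(sum(widths) / len(widths))
    return sum(cluster_means) / len(cluster_means)


# Toy data (made up): two tight pairs of points, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_by_cluster(points, labels)
print(score)
```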
Davies-Bouldin Index (docs)
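$$DBI(k) = \frac{1}{k} \sum_{h=1}^k F_{C_h}$$
$$F_{C_h} = \max_{C_j \neq C_h} F_{C_h C_j}$$
$$F_{C_h C_j} = \frac{f_1(C_h) + f_1(C_j)}{f_2(C_h, C_j)}$$
where $f_1(C_h)$ is the average distance between the points in cluster $C_h$ and its centroid, and $f_2(C_h, C_j)$ is the distance between the centroids of $C_h$ and $C_j$. Lower values of $DBI(k)$ indicate more compact, better-separated clusters.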
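A small worked example may help here. This pure-Python sketch follows the definitions above, assuming Euclidean distance for both $f_1$ and $f_2$. The helper names `centroid` and `davies_bouldin` are hypothetical and not part of the toolbox.

```python
import math


def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def davies_bouldin(points, labels):
    """DBI(k): mean over clusters of the worst-case similarity ratio."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    cents = {h: centroid(m) for h, m in clusters.items()}
    # f1: average distance from the points of a cluster to its centroid
    f1 = {
        h: sum(math.dist(p, cents[h]) for p in m) / len(m)
        for h, m in clusters.items()
    }

    worst = []
    for h in clusters:
        # f2 is the centroid-to-centroid distance in the denominator
        worst.append(max(
            (f1[h] + f1[j]) / math.dist(cents[h], cents[j])
            for j in clusters
            if j != h
        ))
    return sum(worst) / len(worst)


# Toy data (made up): two tight pairs of points, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
dbi = davies_bouldin(points, labels)
print(dbi)
```

For this toy data each cluster has $f_1 = 0.5$ and the centroids are 5 apart, so the index is $(0.5 + 0.5)/5 = 0.2$.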
Calinski-Harabasz (Pseudo F-statistic) (docs)
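$$CH(k) = \frac{\text{Inter-cluster separation}}{\text{Intra-cluster dispersion}}$$
$$\text{Inter-cluster separation} = \frac{1}{k-1} \sum_{i=1}^{k} \lvert C_i \rvert \, d(v_i, v)^2$$
$$\text{Intra-cluster dispersion} = \frac{1}{n-k} \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} d(\mathbf{x}, v_i)^2$$
where $v_i$ is the centroid of cluster $C_i$, $v$ is the centroid of the whole data set, and $n$ is the total number of points. Higher values of $CH(k)$ indicate better-defined clusters.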
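The ratio can be verified by hand on a tiny data set. The sketch below is a pure-Python reading of the formulas above, assuming Euclidean distance for $d$. The helper names `centroid` and `calinski_harabasz` are hypothetical.

```python
import math


def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def calinski_harabasz(points, labels):
    """CH(k): inter-cluster separation divided by intra-cluster dispersion."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    n, k = len(points), len(clusters)
    v = centroid(points)  # centroid of the whole data set

    # Size-weighted squared distances between cluster centroids and v
    inter = sum(
        len(m) * math.dist(centroid(m), v) ** 2 for m in clusters.values()
    ) / (k - 1)
    # Squared distances between each point and its own cluster centroid
    intra = sum(
        math.dist(p, centroid(m)) ** 2 for m in clusters.values() for p in m
    ) / (n - k)
    return inter / intra


# Toy data (made up): two tight pairs of points, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
ch = calinski_harabasz(points, labels)
print(ch)
```

Here the inter-cluster term is $(2 \cdot 2.5^2 + 2 \cdot 2.5^2)/(2-1) = 25$ and the intra-cluster term is $(4 \cdot 0.5^2)/(4-2) = 0.5$, giving a ratio of 50.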