Cluster Optimization - hisl6802/ClusteringToolbox GitHub Wiki

Mono-clustering optimization

A key aspect of correctly applying a clustering algorithm is selecting the number of clusters to analyze. This functionality lets the user create a set of mono-clustering solutions (2 to N clusters) for a single agglomerative hierarchical clustering algorithm and finds the optimal number of clusters. The available optimization criteria are described below.
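The sweep described above can be sketched with SciPy and scikit-learn. This is a minimal illustration under stated assumptions, not necessarily how ClusteringToolbox implements it: the function name `best_cluster_count`, the Ward linkage default, the silhouette criterion, and the toy data are all choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score


def best_cluster_count(data, max_clusters, method="ward"):
    """Score the 2..max_clusters solutions of one hierarchical clustering.

    Hypothetical helper for illustration; returns the best k and all scores.
    """
    tree = linkage(data, method=method)  # build the hierarchy once
    scores = {}
    for k in range(2, max_clusters + 1):
        # Cut the same tree into (at most) k flat clusters.
        labels = fcluster(tree, t=k, criterion="maxclust")
        scores[k] = silhouette_score(data, labels)
    # A higher silhouette score indicates a better solution.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (made up): three well-separated groups of 2-D points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

best_k, scores = best_cluster_count(data, max_clusters=6)
print(best_k, scores)
```

Building the linkage tree once and cutting it at each k keeps the solutions nested, which is what makes them a set of mono-clustering solutions from a single algorithm. For an index where lower is better, such as Davies-Bouldin, `max` would be replaced by `min`.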

Example Input

mz	S1	S2	S3	S4	S5	S6	S7	rtmed
	G1	G1	G1	G1	G2	G2	G2
60.04	$I_{11}$	$I_{12}$	$I_{13}$	$I_{14}$	$I_{15}$	$I_{16}$	$I_{17}$	3.21
61.04	$I_{21}$	$I_{22}$	$I_{23}$	$I_{24}$	$I_{25}$	$I_{26}$	$I_{27}$	3.62
.				.				.
.				.				.
.				.				.
62.04	$I_{31}$	$I_{32}$	$I_{33}$	$I_{34}$	$I_{35}$	$I_{36}$	$I_{37}$	3.62
69.99	$I_{41}$	$I_{42}$	$I_{43}$	$I_{44}$	$I_{45}$	$I_{46}$	$I_{47}$	9.33

Silhouette score (docs)

$$SI(k) = \frac{1}{k} \sum_{h=1}^k SI_h$$

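where $$SI_h = \frac{1}{\lvert C_h \rvert} \sum_{i=1}^{\lvert C_h \rvert} \frac{b_i^h - a_i^h}{\max(a_i^h, b_i^h)}$$

$$ a_i^h = \frac{1}{\lvert C_h \rvert - 1} \sum_{l=1, l \neq i}^{\lvert C_h \rvert} d(x_i^h, x_l^h)$$

$$ b_i^h = \min_{j \in \{1, \dots, k\},\, j \neq h} \frac{1}{\lvert C_j \rvert} \sum_{l=1}^{\lvert C_j \rvert} d(x_i^h, x_l^j)$$

Here $x_i^h$ is the $i$-th point of cluster $C_h$ and $d$ is the distance between two points, so $a_i^h$ is the mean distance from $x_i^h$ to the other points of its own cluster and $b_i^h$ is the smallest mean distance from $x_i^h$ to the points of any other cluster.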
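The formulas above can be followed term by term in a short pure-Python sketch. The function name `silhouette_index`, the Euclidean distance, the zero score for singleton clusters, and the toy points are assumptions made for this example.

```python
from math import dist  # Euclidean distance, assumed here as d


def silhouette_index(clusters):
    """SI(k) for a list of clusters, each a list of coordinate tuples."""
    cluster_means = []
    for h, cluster in enumerate(clusters):
        widths = []
        for i, x in enumerate(cluster):
            if len(cluster) == 1:
                widths.append(0.0)  # assumed convention for singleton clusters
                continue
            # a_i^h: mean distance to the other points of the same cluster
            a = sum(dist(x, y) for m, y in enumerate(cluster) if m != i)
            a /= len(cluster) - 1
            # b_i^h: smallest mean distance to the points of another cluster
            b = min(
                sum(dist(x, y) for y in other) / len(other)
                for j, other in enumerate(clusters)
                if j != h
            )
            denom = max(a, b)
            widths.append((b - a) / denom if denom else 0.0)
        cluster_means.append(sum(widths) / len(widths))  # SI_h
    return sum(cluster_means) / len(cluster_means)  # average over clusters


# Toy points (made up): a sensible split and a poor split of the same data.
good = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
poor = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]
si_good = silhouette_index(good)
si_poor = silhouette_index(poor)
print(si_good, si_poor)
```

Values close to 1 indicate compact, well-separated clusters, so the solution with the highest score is preferred. This sketch averages within each cluster first, as in the formula, whereas scikit-learn's `silhouette_score` averages over all samples, so the two can differ when cluster sizes are unequal.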

Davies-Bouldin Index (docs)

$$DBI(k) = \frac{1}{k} \sum_{h=1}^k F_{C_h}$$

$$F_{C_h} = \max_{C_j \neq C_h} F_{C_h C_j}$$

$$F_{C_h C_j} = \frac{f_1(C_h) + f_1(C_j)}{f_2(C_h, C_j)}$$

where $f_1(C_h)$ is the average distance between the points of cluster $C_h$ and its centroid, and $f_2(C_h, C_j)$ is the distance between the centroids of $C_h$ and $C_j$.
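As a minimal sketch of these definitions, the index can be computed directly in pure Python. The helper names `centroid` and `davies_bouldin_index`, the Euclidean distance, and the toy points are assumptions made for this example.

```python
from math import dist  # Euclidean distance, assumed here


def centroid(cluster):
    """Coordinate-wise mean of a list of coordinate tuples."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def davies_bouldin_index(clusters):
    """DBI(k) for a list of clusters, each a list of coordinate tuples."""
    centroids = [centroid(c) for c in clusters]
    # f_1: average distance between a cluster's points and its centroid
    scatter = [
        sum(dist(x, v) for x in c) / len(c) for c, v in zip(clusters, centroids)
    ]
    worst = []
    for h in range(len(clusters)):
        # F_{C_h}: largest ratio of summed scatter to centroid distance (f_2)
        worst.append(
            max(
                (scatter[h] + scatter[j]) / dist(centroids[h], centroids[j])
                for j in range(len(clusters))
                if j != h
            )
        )
    return sum(worst) / len(worst)


# Toy points (made up): a sensible split and a poor split of the same data.
good = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
poor = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]
dbi_good = davies_bouldin_index(good)
dbi_poor = davies_bouldin_index(poor)
print(dbi_good, dbi_poor)
```

Unlike the silhouette score, lower values are better here: a small index means each cluster is tight relative to its distance from its most similar neighbor. The sketch assumes no two clusters share a centroid, since that would make $f_2$ zero.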

Calinski-Harabasz (Pseudo F-statistic) (docs)

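$$ CH(k) = \frac{\text{Inter-cluster separation}}{\text{Intra-cluster separation}}$$

$$ \text{Inter-cluster separation} = \frac{1}{k-1} \sum_{i=1}^{k} \lvert C_i \rvert \, d(v_i, v)^2 $$

$$ \text{Intra-cluster separation} = \frac{1}{n-k} \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} d(\mathbf{x}, v_i)^2 $$

where $v_i$ is the centroid of cluster $C_i$, $v$ is the centroid of the whole data set, and $n$ is the total number of points.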
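The ratio can be sketched in pure Python as follows, taking $v_i$ as the centroid of cluster $C_i$, $v$ as the centroid of all points, and $n$ as the total number of points. The helper names, the Euclidean distance, and the toy points are assumptions made for this example.

```python
from math import dist  # Euclidean distance, assumed here


def centroid(points):
    """Coordinate-wise mean of a list of coordinate tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def calinski_harabasz(clusters):
    """CH(k) for a list of clusters, each a list of coordinate tuples."""
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    overall = centroid([x for c in clusters for x in c])  # v
    centroids = [centroid(c) for c in clusters]  # v_i
    # Size-weighted squared distance of each cluster centroid from v
    inter = sum(
        len(c) * dist(v_i, overall) ** 2 for c, v_i in zip(clusters, centroids)
    ) / (k - 1)
    # Squared distance of each point from its own cluster centroid
    intra = sum(
        dist(x, v_i) ** 2 for c, v_i in zip(clusters, centroids) for x in c
    ) / (n - k)
    return inter / intra


# Toy points (made up): a sensible split and a poor split of the same data.
good = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
poor = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]
ch_good = calinski_harabasz(good)
ch_poor = calinski_harabasz(poor)
print(ch_good, ch_poor)
```

Higher values are better, since they mean the cluster centroids are far apart relative to the spread inside each cluster. The sketch assumes more points than clusters and at least one cluster with nonzero spread; otherwise the ratio is undefined.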