3. Clustering methods - ZYL-Harry/Machine_Learning_study GitHub Wiki

Foreword

K-means and K-medoids methods:
Clusters are groups of data points characterized by a small distance to the cluster center. An objective function, typically the sum of the distances from each point to its assigned putative cluster center, is optimized until the best candidate cluster centers are found.

Problem: these approaches cannot detect nonspherical clusters, because a data point is always assigned to the nearest center. Choosing the number of clusters in advance can be non-trivial as well.
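The assign-then-update loop behind K-means can be sketched in a few lines of plain Python. This is a minimal illustration of Lloyd's algorithm, not a reference implementation: the function name `kmeans`, the parameters `n_iter` and `seed`, and the toy `points` are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initialise from k random data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: every point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: every center moves to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # no center moved, so the assignment is stable
            break
        centers = new_centers
    return centers, labels

# Two compact, well-separated groups (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Because the assignment step only compares distances to centers, the boundary between two clusters is always a straight line (a hyperplane in higher dimensions), which is why elongated or curved clusters get split incorrectly.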
Distribution-based methods:
These methods attempt to reproduce the observed realization of data points as a mixture of predefined probability distribution functions.

Problem: the accuracy of such methods depends on how well the chosen trial distributions can represent the data.
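One common instance of this idea is a Gaussian mixture fitted by expectation-maximization (EM). The text above does not name a specific model, so the Gaussian choice is an assumption, as are the function names, the initial means, the variance floor and the toy data in this one-dimensional sketch.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, init_means, n_iter=50):
    """EM for a 1-D Gaussian mixture; init_means sets the number of components."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k      # assumption: unit initial variances
    weights = [1.0 / k] * k    # assumption: equal initial mixing weights
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a component from collapsing
    return weights, means, variances, resp

xs = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]  # toy data with two obvious groups
weights, means, variances, resp = fit_gmm_1d(xs, init_means=[0.0, 6.0])
# Hard cluster labels: the component with the largest responsibility.
gmm_labels = [max(range(len(means)), key=lambda j: r[j]) for r in resp]
print(means, gmm_labels)
```

If the data were not well described by Gaussians (for example, a ring-shaped group), the fitted components would still be Gaussian, which is the limitation the Problem line refers to.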
DBSCAN (Density-based Spatial Clustering of Applications with Noise):
DBSCAN can produce clusters with an arbitrary shape. One chooses a density threshold, defined by a neighborhood radius (ε) and a minimum number of points within that radius (MinPts), discards as noise the points in regions with densities lower than this threshold, and assigns disconnected regions of high density to different clusters.

Problem: choosing an appropriate threshold can be non-trivial.
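The procedure can be sketched in plain Python as follows. This is a simplified, brute-force illustration (every neighborhood query scans all points); the names `dbscan`, `eps`, `min_pts` and `NOISE`, the convention that a point counts as its own neighbor, and the toy data are assumptions made for the example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)   # j is also a core point, so keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (20, 20)]
db_labels = dbscan(points, eps=1.5, min_pts=3)
print(db_labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Changing `eps` or `min_pts` changes which points count as dense: with `eps=0.5` in this toy example every point would be labelled noise, which illustrates why the threshold is hard to pick.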
Mean shift:
Each cluster is defined as a set of points that converge to the same local maximum of the density distribution function. This method can find nonspherical clusters.

Problem: it works only for data defined by a set of coordinates and is computationally costly.
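A minimal sketch of mean shift, assuming a Gaussian kernel: each point is moved repeatedly to the kernel-weighted mean of the data until it stops moving, and points that end up at the same density maximum share a cluster. The function name, the `bandwidth`, `n_iter` and `tol` parameters, the rule for merging nearby end points, and the toy data are assumptions for illustration.

```python
import math

def mean_shift(points, bandwidth, n_iter=100, tol=1e-6):
    """Shift every point uphill on a Gaussian kernel density estimate."""
    def shift(p):
        # Gaussian-weighted mean of all data points, as seen from p.
        w = [math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2)) for q in points]
        total = sum(w)
        return tuple(sum(wi * q[d] for wi, q in zip(w, points)) / total
                     for d in range(len(p)))

    modes = []
    for p in points:
        for _ in range(n_iter):
            new_p = shift(p)
            moved = math.dist(p, new_p)
            p = new_p
            if moved < tol:  # converged to a local density maximum
                break
        modes.append(p)

    # Assumption: end points closer than one bandwidth belong to the same maximum.
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if math.dist(m, c) < bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
ms_centers, ms_labels = mean_shift(points, bandwidth=2.0)
print(ms_centers, ms_labels)
```

Both limitations are visible in the sketch: `shift` averages coordinates, so the data must live in a vector space, and every iteration for every point scans the whole data set, which gives a cost of roughly the number of points squared per iteration.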