3. Clustering methods - ZYL-Harry/Machine_Learning_study GitHub Wiki
Foreword
- K-means and K-medoids methods:
Clusters are groups of data points characterized by a small distance to the cluster center. An objective function, typically the sum of the distances from the points to a set of putative cluster centers, is optimized until the best candidate cluster centers are found. Problem: these approaches cannot detect nonspherical clusters because a data point is always assigned to the nearest center; choosing the number of clusters in advance can be non-trivial as well.
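The assign-then-update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration of Lloyd's algorithm (the usual K-means variant, which minimizes the sum of squared Euclidean distances), not a production implementation; the function name `kmeans`, the toy points, and the hand-picked initial centers are assumptions made for the example.

```python
import math

def kmeans(points, centers, n_iter=100):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point is assigned to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for k, c in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(c)  # an empty cluster keeps its old center
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return labels, centers

# Toy data (assumed for illustration): two compact, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

Because every point simply joins its nearest center, the boundary between two clusters is always a straight line (a hyperplane in higher dimensions), which is why elongated or curved clusters get split.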
- Distribution-based methods:
These methods attempt to reproduce the observed realization of data points as a mixture of predefined probability distribution functions. Problem: the accuracy of such methods depends on how well the trial probability distribution can represent the data.
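One common instance of this idea is a Gaussian mixture fitted with expectation-maximization (EM). The sketch below handles only one-dimensional data and is meant to show the mechanics; the choice of Gaussians as the trial distribution, the function names, the toy data, and the initial parameters are all assumptions for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture with a fixed number of EM iterations."""
    means, variances, weights = list(means), list(variances), list(weights)
    n_comp = len(means)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k])
                    for k in range(n_comp)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(n_comp):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsing component
    return means, variances, weights, resp

# Toy data (assumed for illustration): two groups around 0 and 10.
data = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
means, variances, weights, resp = fit_gmm_1d(
    data, means=[1.0, 9.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
# Hard cluster labels: the component with the highest responsibility.
hard_labels = [max(range(len(r)), key=lambda k: r[k]) for r in resp]
print(means, hard_labels)
```

If the real clusters are far from Gaussian, the fitted components will not match them, which is the limitation noted above.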
- DBSCAN (Density-based Spatial Clustering of Applications with Noise):
DBSCAN can produce clusters with an arbitrary shape. A density threshold is chosen, defined by a neighborhood radius (ε) and a minimum number of points within that radius (MinPts); points in regions with a density lower than this threshold are discarded as noise, and disconnected regions of high density are assigned to different clusters. Problem: choosing an appropriate threshold can be non-trivial.
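The roles of the two parameters are easier to see in code. The following is a minimal, quadratic-time sketch of DBSCAN in plain Python, assuming Euclidean distance and counting a point as its own neighbor; the function name `dbscan`, the label `-1` for noise, and the toy data are choices made for this example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE for low-density points."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # only core points extend the cluster
                queue.extend(nbrs)
    return labels

# Toy data (assumed for illustration): two elongated chains and one outlier.
points = ([(float(x), 0.0) for x in range(6)]
          + [(float(x), 10.0) for x in range(6)]
          + [(50.0, 50.0)])
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Each chain is recovered as a single cluster even though it is not spherical, while the isolated point is marked as noise. Changing `eps` or `min_pts` changes which points count as dense, so the result depends strongly on both values.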
- Mean shift:
Each cluster is defined as a set of points that converge to the same local maximum of the density distribution function. This method can find nonspherical clusters. Problem: it works only for data defined by a set of coordinates and is computationally costly.
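A minimal sketch of the procedure, assuming a Gaussian kernel with a single hypothetical `bandwidth` parameter: every point is moved repeatedly to the kernel-weighted mean of the data until it stops moving, and points that end up at (nearly) the same location are grouped together. The function name, the merge radius of half a bandwidth, and the toy data are assumptions for the example.

```python
import math

def mean_shift(points, bandwidth, n_iter=100, tol=1e-6):
    """Shift each point uphill on a Gaussian kernel density estimate."""
    modes = []
    for p in points:
        x = tuple(p)
        for _ in range(n_iter):
            # Weight every data point by its kernel value at the current position.
            w = [math.exp(-math.dist(x, q) ** 2 / (2 * bandwidth ** 2))
                 for q in points]
            total = sum(w)
            # Move to the weighted mean of the data (the mean-shift step).
            new_x = tuple(sum(wi * q[d] for wi, q in zip(w, points)) / total
                          for d in range(len(x)))
            shift = math.dist(x, new_x)
            x = new_x
            if shift < tol:
                break
        modes.append(x)
    # Group points whose trajectories ended at nearly the same density maximum.
    labels, centers = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers

# Toy data (assumed for illustration): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = mean_shift(points, bandwidth=1.0)
print(labels)
print(centers)
```

Both drawbacks are visible here: the weighted mean requires explicit coordinates (a pairwise distance matrix alone is not enough), and every step for every point loops over the whole data set, so the cost grows quadratically with the number of points.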