3. Clustering methods - ZYL-Harry/Machine_Learning_study GitHub Wiki

Foreword

  • K-means and K-medoids methods:
    Clusters are groups of data characterized by a small distance to the cluster center. An objective function, typically the sum of the distances to a set of putative cluster centers, is optimized until the best cluster center candidates are found.

    Problem: these approaches cannot detect nonspherical clusters, because a data point is always assigned to the nearest center. Choosing the number of clusters in advance can be non-trivial as well.

  • Distribution-based methods:
    These methods attempt to reproduce the observed realization of data points as a mixture of predefined probability distribution functions.

    Problem: the accuracy of such methods depends on how well the trial probability distribution can represent the data.

  • DBSCAN (Density-based Spatial Clustering of Applications with Noise):
    DBSCAN can produce clusters with an arbitrary shape. The user chooses a density threshold, expressed as a neighborhood radius (ε) and a minimum number of points required within that radius (MinPts). Points in regions with densities lower than this threshold are discarded as noise, and disconnected regions of high density are assigned to different clusters.

    Problem: choosing an appropriate threshold can be non-trivial.

  • Mean shift:
    Each cluster is defined as a set of points that converge to the same local maximum of the density distribution function. This method can find nonspherical clusters.

    Problem: it works only for data defined by a set of coordinates, and it is computationally costly.
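
The contrast described above between center-based and density-based methods can be sketched on a toy dataset. The following is a minimal pure-Python illustration, not a reference implementation: the two-ring dataset, the helper names (`ring`, `kmeans`, `dbscan`, `is_pure`), the initial centers, and the values `eps=0.7` and `min_pts=3` are all assumptions chosen for this example. `min_pts` here counts the point itself.

```python
import math

def ring(radius, n):
    """Return n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def kmeans(points, centers, iters=100):
    """Plain Lloyd iterations: assign to nearest center, then recompute means."""
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def neighbors(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: label -1 means noise, 0, 1, ... are cluster ids."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbors(points, i, eps)
        if len(nb) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nb)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(points, j, eps)
            if len(nb_j) >= min_pts:  # only core points expand the cluster
                queue.extend(nb_j)
    return labels

def is_pure(labels, truth):
    """True when no cluster contains points from more than one ring."""
    seen = {}
    for l, t in zip(labels, truth):
        if seen.setdefault(l, t) != t:
            return False
    return True

# Two concentric rings: a nonspherical structure by construction.
points = ring(1.0, 20) + ring(4.0, 60)
truth = [0] * 20 + [1] * 60

km_labels = kmeans(points, [points[0], points[20]])
db_labels = dbscan(points, eps=0.7, min_pts=3)

print("k-means clusters match rings:", is_pure(km_labels, truth))
print("DBSCAN clusters match rings: ", is_pure(db_labels, truth))
```

With two centers, nearest-center assignment splits the plane along a straight line, so the cluster that holds the inner ring also picks up part of the outer ring. The density-based pass follows each ring separately because the gap between the rings is wider than `eps`. The outcome depends on these parameter choices, which is the threshold-selection difficulty noted above.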