Clustering - Statistics-and-Machine-Learning-with-R/Statistical-Methods-and-Machine-Learning-in-R GitHub Wiki

Clustering

Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to segregate data points with similar traits and assign them to clusters.

Types of Clustering

  • Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all. For example, if a retail store segments its customers into 10 groups, each customer is put into exactly one of the 10 groups.

  • Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, each data point is assigned a probability or likelihood of belonging to each cluster. For example, each customer of a retail store is assigned a probability of belonging to each of the store's 10 customer clusters.
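
The difference between the two types can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the soft memberships are computed as a softmax over negative squared distances to some given cluster centers, which is one assumed choice among many ways to define such probabilities.

```python
import math

def hard_assignment(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

def soft_assignment(point, centers):
    """Soft clustering: return one membership weight per center, summing to 1.

    Assumption: weights are a softmax over negative squared distances.
    """
    squared = [math.dist(point, c) ** 2 for c in centers]
    smallest = min(squared)  # subtracting the minimum avoids underflow to all zeros
    scores = [math.exp(-(d - smallest)) for d in squared]
    total = sum(scores)
    return [s / total for s in scores]

centers = [(0.0, 0.0), (4.0, 0.0)]   # two illustrative cluster centers
customer = (1.0, 0.0)                # one illustrative data point
print(hard_assignment(customer, centers))   # index of exactly one cluster
print(soft_assignment(customer, centers))   # one probability per cluster
```

The hard version returns a single cluster index, while the soft version returns a full vector of memberships for the same point.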

Types of clustering algorithms

Since the task of clustering is subjective, there are many ways to achieve this goal. Every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact, more than 100 clustering algorithms are known, but only a few of them are in popular use. Let’s look at them in detail:

  • Connectivity models: As the name suggests, these models are based on the notion that data points closer together in data space are more similar to each other than data points lying farther apart. These models can follow two approaches. In the first approach, they start by placing every data point in its own cluster and then aggregate the closest clusters step by step. In the second approach, all data points start in a single cluster, which is then partitioned step by step. The choice of distance function is also subjective. These models are very easy to interpret but do not scale well to big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.

  • Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of a cluster. The K-Means clustering algorithm is a popular algorithm in this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.

  • Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, the normal, i.e. Gaussian, distribution). These models often suffer from overfitting. A popular example of these models is the Expectation-maximization algorithm, which uses multivariate normal distributions.

  • Density Models: These models search the data space for regions of differing density of data points. They isolate the different density regions and assign the data points within each region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.

Now we will discuss three of the most popular clustering algorithms in detail: K Means clustering, DBSCAN clustering and Hierarchical clustering.

K Means Clustering

K Means is an iterative clustering algorithm that converges to a local optimum: each iteration reduces, or leaves unchanged, the total distance of the data points to their cluster centroids. The algorithm works in these 6 steps:

  1. Specify the desired number of clusters K
  2. Randomly assign each data point to a cluster
  3. Compute cluster centroids
  4. Re-assign each point to the closest cluster centroid
  5. Re-compute cluster centroids
  6. Repeat steps 4 and 5 until no improvements are possible: we repeat the 4th and 5th steps until the algorithm reaches a local optimum (not necessarily the global one), that is, until no data point switches clusters between two successive iterations. This marks the termination of the algorithm unless another stopping rule, such as a maximum number of iterations, is explicitly specified.
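
The six steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function names, the `seed` and `max_iter` parameters, and the handling of a cluster that becomes empty (it simply keeps its previous centroid) are assumptions made for this sketch.

```python
import math
import random

def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for the given points; assumes len(points) >= k."""
    rng = random.Random(seed)
    # Steps 1-2: fix k and randomly assign each point to a cluster.
    # Cycling through 0..k-1 before shuffling keeps every cluster non-empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps 3 and 5: (re)compute centroids; an emptied cluster keeps its old one.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = centroid(members)
        # Step 4: re-assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop when no point switches cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the initial assignment is random, the cluster numbering (which group is called 0 and which 1) depends on the seed, even when the grouping itself is the same.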

Example of K-Means Clustering with k=2

[Figure: K Means clustering result]

DBSCAN Clustering

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm that clusters points based on two parameters, “eps” and “minPts”:

eps: specifies how close points should be to each other to be considered part of a cluster
minPts: the minimum number of points needed to form a dense region

  • A point is a core point if it has at least the specified number of points (minPts) within eps, counting the point itself. These are points in the interior of a cluster.
  • A border point has fewer than minPts points within eps but is in the neighborhood of a core point.
  • A noise point is any point that is neither a core point nor a border point (an outlier).
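
The three point types can be illustrated with a short Python sketch. It only classifies points and does not build the clusters themselves. The function name is hypothetical, and the sketch assumes a common convention: a point's eps-neighborhood includes the point itself, and a core point needs at least `min_pts` points in it.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    Assumption: the eps-neighborhood of a point includes the point itself,
    and a core point has at least min_pts points in that neighborhood.
    """
    # Indices of all points within eps of each point.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

# Illustrative data: a small dense patch, one point on its edge, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.0, 0.5), (5.0, 5.0)]
print(classify_points(points, eps=0.6, min_pts=3))
# -> ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here the point (1.0, 0.5) has only one neighbor besides itself, so it is not a core point, but that neighbor is a core point, which makes it a border point. The isolated point (5.0, 5.0) is noise.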

Example of DBSCAN Clustering with eps = 0.6 & minPts = 2

[Figure: DBSCAN clustering result]

In this example there are 2 noise points and 1 border point.

Hierarchical Clustering

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. Then, at each step, the two nearest clusters are merged into one. The algorithm terminates when only a single cluster is left.
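
A minimal Python sketch of this merging procedure is shown below. The function names are hypothetical, and the sketch assumes that the distance between two clusters is the smallest distance between any of their points (single linkage) and that points are compared with Euclidean distance. It favors clarity over speed.

```python
import math

def single_linkage(a, b):
    """Distance between two clusters: the smallest pairwise point distance."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points):
    """Merge the two nearest clusters until one remains; return the merge distances."""
    clusters = [[p] for p in points]   # every point starts in its own cluster
    heights = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        heights.append(single_linkage(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return heights

points = [(0.0,), (1.0,), (5.0,), (7.0,)]   # four illustrative 1-D points
print(agglomerative(points))   # -> [1.0, 2.0, 4.0]
```

With four points there are three merges: first 0 and 1 (distance 1), then 5 and 7 (distance 2), and finally the two remaining clusters (distance 4 between their closest points, 1 and 5).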

The results of hierarchical clustering can be shown using a dendrogram, which can be interpreted as follows:

[Figure: Dendrogram]

At the bottom, we start with 25 data points, each assigned to a separate cluster. The two closest clusters are then merged repeatedly until just one cluster is left at the top. The height in the dendrogram at which two clusters are merged represents the distance between those two clusters in the data space.

The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster, that is, without crossing a merge point.
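
This rule can be expressed numerically: the horizontal line with the most vertical room lies in the largest gap between two consecutive merge heights. A small sketch, assuming the merge heights are already available in increasing order (the helper name and the example heights are hypothetical):

```python
def clusters_from_largest_gap(heights):
    """Pick the number of clusters from merge heights sorted in increasing order.

    For n data points there are n - 1 merges. Cutting the dendrogram above the
    first m merges leaves n - m clusters.
    """
    if len(heights) < 2:
        raise ValueError("need at least two merge heights to compare gaps")
    n = len(heights) + 1
    # Vertical room between each pair of consecutive merges.
    gaps = [heights[m + 1] - heights[m] for m in range(len(heights) - 1)]
    best = gaps.index(max(gaps))    # the cut goes between merge best and best + 1
    merges_below_cut = best + 1
    return n - merges_below_cut

print(clusters_from_largest_gap([1.0, 2.0, 4.0]))   # -> 2
```

For the heights 1, 2 and 4, the largest gap is between 2 and 4, so the line is drawn after two merges and the four points fall into two clusters. This heuristic is only a starting point; domain knowledge may justify a different cut.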

Two important things that you should know about hierarchical clustering are:

  • The algorithm has been described above using a bottom-up approach. It is also possible to follow a top-down approach, starting with all data points assigned to the same cluster and recursively performing splits until each data point is assigned to a separate cluster.

  • The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple distance metrics on which this closeness can be based:

  1. Euclidean Distance
  2. Manhattan Distance
  3. Minkowski Distance
  4. Hamming Distance

For details regarding these, kindly visit: https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/
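
As a quick illustration, the four metrics listed above can be written as small Python functions over equal-length sequences. The function names are chosen for this sketch; Euclidean and Manhattan distance are expressed as the special cases p = 2 and p = 1 of Minkowski distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    """Straight-line distance: Minkowski distance with p = 2."""
    return minkowski(a, b, 2)

def manhattan(a, b):
    """Sum of absolute coordinate differences: Minkowski distance with p = 1."""
    return minkowski(a, b, 1)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))      # -> 5.0
print(manhattan((0, 0), (3, 4)))      # -> 7.0
print(hamming("karolin", "kathrin"))  # -> 3
```

Hamming distance is the natural choice for categorical or string-like data, while the other three apply to numeric coordinates.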

Example of Hierarchical Clustering

[Figures: Single Linkage and Complete Linkage clustering results, with the Single Linkage Dendrogram and the Complete Linkage Dendrogram]

Linkage Criteria for Hierarchical Clustering

[Figure: Linkage Criteria]

[Figure: Different linkage criteria handling variously shaped clusters]

Image source: https://www.pathtopioneer.com/blog/2020/03/4-1


Difference between K Means and Hierarchical clustering

  • Hierarchical clustering can’t handle big data well, but K Means clustering can. This is because the time complexity of K Means is linear in the number of data points, i.e. O(n) for a fixed number of clusters and iterations, while that of hierarchical clustering is at least quadratic, i.e. O(n^2).

  • In K Means clustering, since we start with a random choice of clusters, the results produced by running the algorithm multiple times might differ, whereas the results of Hierarchical clustering are reproducible.

  • K Means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D).

  • K Means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, however, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.

Silhouette Coefficient: Measuring the goodness of Clustering Results

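The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters. For a data point i, let a(i) be the mean distance from i to the other points in its own cluster and b(i) the mean distance from i to the points in the nearest other cluster. The Silhouette Coefficient (SC) of i is then s(i) = (b(i) - a(i)) / max(a(i), b(i)).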

This measure has a range of [-1, 1]. Silhouette coefficients (as these values are called) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that the sample might have been assigned to the wrong cluster.
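
A minimal Python sketch of this measure follows. It uses the standard definition: for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the coefficient is (b - a) / max(a, b). The function name, the use of Euclidean distance, and the choice of 0 for points in single-point clusters are assumptions of this sketch. It needs at least two clusters.

```python
import math

def silhouette_values(points, labels):
    """Return one silhouette coefficient per point (requires at least 2 clusters)."""
    values = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster.
        own = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            values.append(0.0)   # assumed convention for single-point clusters
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_values(points, [0, 0, 1, 1]))   # well separated: values near +1
print(silhouette_values(points, [0, 1, 0, 1]))   # poor grouping: negative values
```

Averaging these per-point values gives a single score for a clustering, which can be used to compare results obtained with different numbers of clusters.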

Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:

  • Recommendation engines
  • Market segmentation
  • Social network analysis
  • Search result grouping
  • Medical imaging
  • Image segmentation
  • Anomaly detection

For more details, visit the source of this page: https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/#:~:text=Clustering%20is%20the%20task%20of,and%20assign%20them%20into%20clusters.
