Find sub-groups/clusters in data by partitioning samples into groups such that samples within a group are more similar to each other than to samples in other groups
Unsupervised learning since no labelled data is available; training data consists only of features
Requires definition of similarity/dissimilarity
Input is a set of features for each sample; output is a cluster/class for each sample.
Used mainly for exploratory data analysis and visualization
Can also be used as part of a hierarchical supervised learning pipeline where regression/classification is done separately for each cluster
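A minimal sketch of the cluster-then-predict idea, assuming scikit-learn and synthetic data (the routing logic and names like `models` are illustrative, not a fixed recipe):

```python
# Sketch of a cluster-then-predict pipeline (assumes scikit-learn):
# cluster the training data, then fit a separate regressor per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # synthetic features
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # synthetic target

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
models = {k: LinearRegression().fit(X[km.labels_ == k], y[km.labels_ == k])
          for k in range(km.n_clusters)}

# At prediction time, route each new sample to its cluster's regressor.
X_new = rng.normal(size=(5, 2))
labels = km.predict(X_new)
y_pred = np.array([models[k].predict(x.reshape(1, -1))[0]
                   for k, x in zip(labels, X_new)])
```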
K-Means
Group samples into K distinct non-overlapping clusters that minimize within-cluster variation (sum of squares from cluster centroid)
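Formally, the objective is the within-cluster sum of squares (WCSS), minimized over all partitions of the samples into K clusters:

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```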
Lloyd's Algorithm for K-Means
Choose K initial cluster centers (randomly or with k-means++)
E-step: Assign each point to the cluster with the nearest centroid (Euclidean distance); this is the assignment that minimizes the within-cluster sum of squares
M-step: After all points have been assigned, re-compute each cluster's centroid as the mean of its assigned points
Repeat until the assignments stop changing or the maximum number of iterations is reached
Results in a Voronoi partition of the feature space, with linear decision boundaries between clusters; a minimal NumPy sketch of the algorithm follows
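A minimal NumPy sketch of Lloyd's algorithm, assuming random initialization from the data and no handling of empty clusters (library implementations add k-means++, restarts, and those edge cases):

```python
import numpy as np

def lloyd(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # E-step: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled here, for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments have converged
            break
        centroids = new_centroids
    return labels, centroids
```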
K-Means Assumptions
Data has only K clusters
Clusters are spherical with similar variance
Clusters have similar size
Sum of Squared Errors (variance) is an appropriate clustering metric for the data
All clusters are equally likely
Implementations
scikit-learn: KMeans, MiniBatchKMeans
R: kmeans
Spark: KMeans, BisectingKMeans, StreamingKMeans (generalization of mini-batch K-Means)
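Example usage of scikit-learn's KMeans (MiniBatchKMeans exposes the same API and trades some accuracy for speed on large datasets); the data here is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.labels_[:10])      # hard cluster assignment per sample
print(km.cluster_centers_)  # learned centroids
print(km.inertia_)          # within-cluster sum of squares (the objective)
```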
Gaussian Mixture Models
GMM is a composite distribution where each point is drawn from one of k distinct Gaussian components, each with its own mean and covariance
The EM algorithm is used to estimate the parameters of the k components and the probability of each point belonging to each of them
Implementations
scikit-learn: GaussianMixture
Spark: GaussianMixture, GaussianMixtureModel
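A usage sketch with scikit-learn's GaussianMixture on synthetic data; fit() runs EM internally and exposes the estimated component parameters as attributes:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(X)
print(gmm.weights_)      # mixing proportions of the 3 components
print(gmm.means_)        # component means
print(gmm.covariances_)  # component covariance matrices
```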
GMMs vs KMeans
In K-Means, a point must belong to exactly 1 cluster (it is either in or not in a cluster)
GMMs can assign probability of a point belonging to each of the k clusters
GMMs can be considered as a generalization of K-Means (i.e., soft K-Means)
The Expectation-Maximization algorithm generalizes both: fitting a GMM is the full soft version, and Lloyd's K-Means is the hard-assignment special case
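A small comparison on the same synthetic data, assuming scikit-learn: KMeans yields one label per point, while GaussianMixture yields a probability per component:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))
hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
soft = GaussianMixture(n_components=3, random_state=0).fit(X).predict_proba(X)
print(hard[0])  # e.g., 2 -- a single cluster index
print(soft[0])  # e.g., [0.05, 0.01, 0.94] -- each row sums to 1
```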
Hierarchical Clustering
Can be agglomerative (bottom-up) or divisive (top-down, e.g., bisecting K-Means)
Build nested clusters by merging or splitting them successively
Generates the full hierarchy of nested clusterings, from 1 cluster up to n, visualized as a dendrogram
Implementations
R: hclust, cutree
Spark: BisectingKMeans, BisectingKMeansModel
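A Python sketch using SciPy's hierarchical clustering routines (an implementation beyond those listed above); fcluster plays the role of R's cutree:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))
Z = linkage(X, method="ward")                     # bottom-up merge hierarchy
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
dendrogram(Z)                                     # plot (requires matplotlib)
```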
Affinity Propagation
Creates clusters by sending messages between pairs of samples until convergence
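A usage sketch with scikit-learn's AffinityPropagation, which picks the number of clusters itself (exemplar samples, tunable mainly via the preference parameter):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.default_rng(0).normal(size=(100, 2))
ap = AffinityPropagation(random_state=0).fit(X)
print(len(ap.cluster_centers_indices_))  # number of clusters found
```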
Mean Shift
Discovers blobs in a smooth density of samples
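A usage sketch with scikit-learn's MeanShift; the number of clusters follows from the kernel bandwidth, which is estimated automatically when not given:

```python
import numpy as np
from sklearn.cluster import MeanShift

X = np.random.default_rng(0).normal(size=(100, 2))
ms = MeanShift().fit(X)       # bandwidth estimated from the data
print(ms.cluster_centers_)    # one center per discovered blob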
Spectral Clustering
Performs a low-dimensional embedding of the affinity matrix between samples, followed by K-Means in the low-dimensional space
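A usage sketch with scikit-learn's SpectralClustering, which builds the affinity matrix (here an RBF kernel), embeds it, and runs K-Means on the embedding:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.default_rng(0).normal(size=(100, 2))
labels = SpectralClustering(n_clusters=3, affinity="rbf",
                            random_state=0).fit_predict(X)
```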
DBSCAN
Views clusters as areas of high density separated by areas of low density
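A usage sketch with scikit-learn's DBSCAN; no cluster count is needed, density is defined by eps and min_samples, and noise points receive the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(100, 2))
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids, possibly including -1 for noise
```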
Birch
Builds a Clustering Feature Tree (CFT) for the training data
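A usage sketch with scikit-learn's Birch; partial_fit lets the CF tree grow incrementally, making it suitable for data that arrives in chunks:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(100, 2))
birch = Birch(n_clusters=3, threshold=0.5)
for chunk in np.array_split(X, 4):  # stream the data in 4 batches
    birch.partial_fit(chunk)
labels = birch.predict(X)
```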
Power Iteration Clustering
Cluster vertices of a graph when pairwise similarities are given as edge properties
Pairwise similarities must be non-negative and symmetric; missing similarities are treated as 0
Computes pseudo-eigenvector of normalized affinity matrix and uses it to cluster vertices
Input = (src-vertex, dest-vertex, similarity); output = model with cluster assignments
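A PySpark sketch, assuming Spark 2.4+ and an existing SparkSession named spark; the edge list below is illustrative:

```python
from pyspark.ml.clustering import PowerIterationClustering

# Each row is an undirected edge (src, dst, weight) giving a pairwise similarity
edges = spark.createDataFrame(
    [(0, 1, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0), (0, 3, 0.1)],
    ["src", "dst", "weight"])
pic = PowerIterationClustering(k=2, maxIter=20, weightCol="weight")
assignments = pic.assignClusters(edges)  # DataFrame with columns id, cluster
assignments.show()
```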