# Clustering: K-means vs EM (Expectation Maximization)
## Clustering algorithm overview
Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it aims to find structure (an intrinsic grouping) in a collection of unlabelled data. A cluster is therefore a collection of objects that are similar to each other and dissimilar to the objects belonging to other clusters. Another kind of clustering is conceptual clustering, in which two or more objects are considered to belong to the same cluster if they define a concept common to all of them. That is, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
An important question is how to decide what constitutes good clustering, since it is commonly acknowledged that there is no absolute ‘best’ criterion which would be independent of the final aim of the clustering.
There are different types of clustering, which have been extensively reviewed. Briefly:

- Exclusive clustering groups data so that if a certain item belongs to a definite cluster, it cannot be included in another cluster.
- Overlapping clustering uses fuzzy sets, so that each item may belong to two or more clusters with different degrees of membership; each item is then associated with an appropriate membership value.
- Hierarchical clustering begins by treating each item as its own cluster and proceeds by uniting the two nearest clusters; after a number of iterations it reaches the desired final clusters.
- Probabilistic clustering takes a completely probabilistic approach.

Below we examine two of the most widely used clustering algorithms: K-means and EM.
## K-means algorithm
Cluster analysis first examines the properties of the data set and the target variable; these typically determine how similarity (distance) is measured. Basically, the algorithm functions as follows (a minimal code sketch follows the steps below):
Input: the number of clusters k and a database containing n objects.
Output: a set of k clusters that minimizes the squared-error criterion.
Method:
1. Arbitrarily choose k objects as the initial cluster centres.
2. (Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster.
3. Update the cluster means, i.e. calculate the mean value of the objects for each cluster.
4. Repeat steps 2–3 until no change.
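As a concrete illustration, here is a minimal NumPy sketch of these steps. The function and variable names are my own for illustration, not from any particular library, and it assumes the data `X` is a 2-D array of shape (n, d):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: hard-assigns points to the nearest centroid."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster centres
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: (re)assign each object to the nearest centroid (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each cluster mean (keep the old centre if a cluster empties)
        new_centroids = np.array([
            X[labels == h].mean(axis=0) if np.any(labels == h) else centroids[h]
            for h in range(k)
        ])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```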
## EM algorithm (EM clustering)
The concept of the EM clustering algorithm stems from the Gaussian mixture model (GMM). Whereas a single-density estimation method models the data with one probability density function, a GMM improves on this by modelling the distribution of the data as a weighted combination of multiple Gaussian probability density functions. Given a sample data set, the parameters of each Gaussian mixture component are estimated by maximizing the log-likelihood of the data, and the EM algorithm is used to find this optimal model. Principally, the EM clustering method uses the following algorithm:
Input: the number of clusters k, a database, and a stopping tolerance.
Output: a set of k clusters with weights that maximize the log-likelihood function.
1. Expectation step: for each database record x, compute the membership probability of x in each cluster h = 1, …, k.
2. Maximization step: update the mixture-model parameters (probability weights).
3. Stopping criteria: if the stopping criteria are satisfied, stop; else set j = j + 1 and go to step 1.
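For illustration, here is a minimal NumPy/SciPy sketch of EM for a GMM (the naming is mine, not a specific library's API). It maximizes the log-likelihood $\log L = \sum_i \log \sum_h w_h \, \mathcal{N}(x_i \mid \mu_h, \Sigma_h)$ and stops when the improvement falls below the tolerance:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal EM sketch for a Gaussian mixture: soft assignment via responsibilities."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(k, 1.0 / k)                    # initial mixture weights
    means = X[rng.choice(n, size=k, replace=False)]  # initialise means at random points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    log_lik = -np.inf
    for _ in range(max_iter):
        # E-step: membership probability of each record x in each cluster h = 1..k
        dens = np.column_stack([
            weights[h] * multivariate_normal.pdf(X, means[h], covs[h])
            for h in range(k)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, and covariances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for h in range(k):
            diff = X - means[h]
            covs[h] = (resp[:, h, None] * diff).T @ diff / nk[h] + 1e-6 * np.eye(d)
        # Stopping criterion: change in log-likelihood below the tolerance
        new_log_lik = np.log(dens.sum(axis=1)).sum()
        if new_log_lik - log_lik < tol:
            break
        log_lik = new_log_lik
    return weights, means, covs, resp
```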
Analytical methods for obtaining probability-distribution parameters require that the values of all variables be given. The iterative EM algorithm instead introduces hidden random variables and, consequently, is a general method for finding the optimal parameters of a hidden distribution function from the given data, even when the data are incomplete or have missing values.
## How are the K-means and EM algorithms different?
K-means is a variant of EM, under the assumption that clusters are spherical (and with hard rather than soft assignments).
### K-means
Hard-assigns each data point to one particular cluster at convergence.
It makes use of the L2 norm when optimizing: $\min_{\theta} \sum_i \lVert x_i - \mu_{c(i)} \rVert_2^2$, i.e. the squared distance between each point and the coordinates of its assigned centroid.
### EM
Soft-assigns each point to clusters (so it gives the probability of the point belonging to each centroid).
It doesn't depend on the L2 norm, but is based on the expectation, i.e. the probability of the point belonging to a particular cluster. The reliance on the L2 norm is what makes K-means biased towards spherical clusters, whereas EM with general covariances can also model elliptical ones.
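To see the hard-vs-soft distinction concretely, here is a hypothetical usage of the two sketches above (`kmeans` and `em_gmm` are the illustrative functions defined earlier, not library calls) on toy data:

```python
import numpy as np

# Two well-separated Gaussian blobs as toy data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([3, 3], 0.5, size=(100, 2))])

labels, centroids = kmeans(X, k=2)   # hard assignment: one cluster label per point
_, _, _, resp = em_gmm(X, k=2)       # soft assignment: a probability per cluster

print(labels[:3])          # e.g. [0 0 0] -- each point belongs to exactly one cluster
print(resp[:3].round(3))   # e.g. [[0.999 0.001] ...] -- membership probabilities
```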