Fuzzy c means clustering algorithm - rugbyprof/5443-Data-Mining GitHub Wiki
The Algorithm
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:
where m is any real number greater than 1, uij is the degree of membership of xi in the cluster j, xi is the ith of d-dimensional measured data, cj is the d-dimension center of the cluster, and ||*|| is any norm expressing the similarity between any measured data and the center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the update of membership uij and the cluster centers cj by:
Example
To better understand this principle, a classic example of mono-dimensional data is given below on an x axis.
This data set can be traditionally grouped into two clusters. By selecting a threshold on the x-axis, the data is separated into two clusters. The resulting clusters are labelled 'A' and 'B', as seen in the following image. Each point belonging to the data set would therefore have a membership coefficient of 1 or 0. This membership coefficient of each corresponding data point is represented by the inclusion of the y-axis.
In fuzzy clustering, each data point can have membership to multiple clusters. By relaxing the definition of membership coefficients from strictly 1 or 0, these values can range from any value from 1 to 0. The following image shows the data set from the previous clustering, but now fuzzy c-means clustering is applied. First, a new threshold value defining two clusters may be generated. Next, new membership coefficients for each data point are generated based on clusters centroids, as well as distance from each cluster centroid.
As one can see, the middle data point belongs to cluster A and cluster B. the value of 0.3 is this data point's membership coefficient for cluster A
Algorithmic steps for Fuzzy c-means clustering
Let X = {x1, x2, x3 ..., xn} be the set of data points and V = {v1, v2, v3 ..., vc} be the set of centers.
-
Randomly select ‘c’ cluster centers.
-
Calculate the fuzzy membership 'µij' using:
- Compute the fuzzy centers 'vj' using:
- Repeat step 2) and 3) until the minimum 'J' value is achieved or ||U(k+1) - U(k)|| < β.
where,
‘k’ is the iteration step.
‘β’ is the termination criterion between [0, 1]. ‘U = (µij)n*c’ is the fuzzy membership matrix. ‘J’ is the objective function.
Advantages
1) Gives best result for overlapped data set and comparatively better then k-means algorithm.
2) Unlike k-means where data point must exclusively belong to one cluster center here data point
is assigned membership to each cluster center as a result of which data point may belong to more
then one cluster center.
Disadvantages
1) Apriori specification of the number of clusters.
2) With lower value of β we get the better result but at the expense of more number of iteration.
3) Euclidean distance measures can unequally weight underlying factors.