K-means algorithm in the context of text mining - SoojungHong/TextMining GitHub Wiki

K-Means algorithm

At a high level, k-means implementations all do something like the following (illustrated with an example where each patient is a vector of several features):

1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.

2. Every patient will be closest to one of these k centroids. They hopefully won't all be closest to the same one, so they'll form clusters around their nearest centroids.

3. What we now have are k clusters, and each patient is a member of one cluster.

4. k-means then finds the center of each of the k clusters based on its cluster members (yes, using the patient vectors!).

5. This center becomes the new centroid for the cluster. Since the centroid is in a different place now, some patients might be closer to a different centroid. In other words, they may change cluster membership.

6. Steps 2-5 are repeated until the centroids no longer move and the cluster memberships stabilize. This is called convergence.
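The steps above can be sketched with plain NumPy. This is a minimal illustration under simplifying assumptions (random initialization from the data points, a fixed iteration cap), not a production implementation; the function and variable names are my own:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids at random from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2-3: assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4-5: recompute each centroid as the mean of its members
        # (keep the old centroid if a cluster ended up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop at convergence, i.e. when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Six "patient vectors" forming two obvious groups.
patients = np.array([[0., 0.], [0., 1.], [1., 0.],
                     [10., 10.], [10., 11.], [11., 10.]])
centroids, labels = kmeans(patients, k=2)
```

After convergence, the three points near the origin share one label and the three points near (10, 10) share the other.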

Why use the K-means algorithm

The key selling point of k-means is its simplicity. Its simplicity means it’s generally faster and more efficient than other algorithms, especially over large datasets.

k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster analysis on the sub-clusters. k-means can also be used to rapidly “play” with k and explore whether there are overlooked patterns or relationships in the dataset.
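One common way to "play" with k is the elbow heuristic: fit k-means for several values of k and watch the within-cluster sum of squares (inertia) drop; the drop flattens once k passes the number of genuine clusters. A sketch using scikit-learn on synthetic data (the data and value ranges here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic dataset with three genuine clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

# Fit k-means for a range of k and record the inertia for each.
inertias = {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
```

Plotting `inertias` against k would show a sharp drop up to k = 3 and only marginal improvement afterwards, suggesting three clusters.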

Weaknesses of the K-means algorithm

Two key weaknesses of k-means are its sensitivity to outliers and its sensitivity to the initial choice of centroids. A final limitation to keep in mind: k-means is designed to operate on continuous data, so categorical features need to be encoded numerically before clustering.
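The outlier sensitivity is easy to demonstrate. In this 1-D sketch (my own construction, using plain Lloyd iterations from a fixed initialization), a single extreme value captures a centroid all for itself, so the two genuine clusters collapse into one:

```python
import numpy as np

def lloyd(points, centroids, n_iters=50):
    """Plain 1-D Lloyd iterations from a given initialization."""
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        labels = np.abs(points[:, None] - centroids[None, :]).argmin(axis=1)
        # Recompute centroids as cluster means (keep old centroid if empty).
        new = np.array([points[labels == j].mean() if np.any(labels == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Two genuine clusters (around 0 and around 5) plus one extreme outlier.
points = np.array([0., 0.5, 1., 4.5, 5., 5.5, 100.])
centroids, labels = lloyd(points, centroids=np.array([0., 100.]))
```

The run converges with one centroid at 100.0 (the outlier alone) and the other at 2.75, the mean of all six remaining points: the two real clusters have been merged because the outlier dragged a centroid away.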