
K-means algorithms

k-means has lots of variations that optimize it for certain types of data.

At a high level, they all do something like this:

1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.
2. Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.
3. What we have are k clusters, and each patient is now a member of a cluster.
4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).
5. This center becomes the new centroid for the cluster.
6. Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.
7. Steps 2-6 are repeated until the centroids no longer change and the cluster memberships stabilize. This is called convergence (see the sketch after this list).
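
Below is a minimal NumPy sketch of that loop, assuming the data is already an (n_points, n_dims) array. The random-data-point initialization and the fixed iteration cap are illustrative choices, not part of any particular library.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain (Lloyd's) k-means on an (n_points, n_dims) array X.
    A minimal sketch of the steps above, not an optimized implementation."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2-3: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Steps 4-5: recompute each centroid as the mean of its members
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 7: stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```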

Drawbacks of the K-means algorithm

Considering how popular k-means is, yes, it's definitely run on clusters, and most decent machine learning libraries have an implementation. Its complexity goes up linearly in k, in the number of data points, and in the number of dimensions, so it becomes infeasible in high-dimensional spaces with many points. Because there aren't really any convolutions involved, I'm not sure of the advantage you'd get from running it on a GPU.
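
To make the "linear in k, points, and dimensions" claim concrete: each iteration compares every point to every centroid across every dimension, so the work per iteration is roughly n * k * d. A quick back-of-the-envelope calculation, with made-up numbers purely for illustration:

```python
# Rough cost of one k-means iteration: every point is compared to every
# centroid across every dimension, i.e. about n * k * d operations.
n = 5_000_000   # documents (hypothetical)
d = 100_000     # vocabulary size / dimensions (hypothetical)
k = 500         # clusters (hypothetical)
ops_per_iteration = n * k * d
print(f"{ops_per_iteration:.1e} distance-component operations per iteration")
# => 2.5e+14, and k-means typically needs many iterations to converge.
```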

As an example, if you try running k-means on the entire Wikipedia corpus, using word features (i.e. each document being a vector of words from a preset dictionary), it takes around a day on a small cluster running Spark.
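
A hypothetical sketch of that kind of job with Spark's built-in MLlib k-means is below. The file path, column names, vocabulary size, and k are placeholder assumptions, not the actual Wikipedia setup.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("wiki-kmeans").getOrCreate()
# Hypothetical input: one document per line.
docs = spark.read.text("wikipedia_dump.txt").withColumnRenamed("value", "text")

# Turn each document into a word-count vector over a preset dictionary.
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
vectors = CountVectorizer(inputCol="words", outputCol="features",
                          vocabSize=100_000).fit(words).transform(words)

# Even distributed over the cluster, this fit dominates the runtime.
model = KMeans(k=1000, featuresCol="features", seed=1).fit(vectors)
clustered = model.transform(vectors)   # adds a "prediction" column per document
```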

Something like spherical k-means is used extensively in compressed sensing. The only difference here is that the vectors are normalized to unit length and cosine similarity is used as the error measure. Such compressions might easily occur in real-time applications, like transmission of data over long distances, so speed would be highly valued.
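
A minimal sketch of the spherical variant, assuming dense NumPy data: the only real changes from plain k-means are the unit-normalization and the dot-product (cosine) assignment step.

```python
import numpy as np

def spherical_kmeans(X, k, n_iters=100, seed=0):
    """Spherical k-means sketch: points and centroids live on the unit sphere
    and similarity is the cosine (dot product), not Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Normalize every point to unit length.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to the centroid with the highest cosine similarity.
        similarities = X @ centroids.T
        labels = similarities.argmax(axis=1)
        # Recompute each centroid as the normalized mean direction of its
        # members (empty and degenerate zero-mean clusters are skipped for brevity).
        for j in range(k):
            members = X[labels == j]
            if len(members):
                mean = members.sum(axis=0)
                centroids[j] = mean / np.linalg.norm(mean)
    return centroids, labels
```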

As far as applications go, any sort of real-time k-means with a short memory will be a huge bottleneck. As an example, imagine using k-means to study topic trends on Twitter over the course of a day. Depending on the day's events, the clusters will move around significantly, and you're getting a huge stream of points at any given time.
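
One common way to handle that kind of stream is an incremental, mini-batch variant that updates the centroids as data arrives. A hedged sketch using scikit-learn's MiniBatchKMeans is below; the tweet-vector stream is a made-up stand-in for a real feed.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=50, random_state=0)

def stream_of_tweet_vectors(n_batches=100, batch_size=256, dim=300):
    # Hypothetical stand-in for a real-time feed of vectorized tweets.
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        yield rng.normal(size=(batch_size, dim))

for batch in stream_of_tweet_vectors():
    # Each mini-batch nudges the centroids, so the clusters can drift
    # with the day's topics instead of being refit from scratch.
    kmeans.partial_fit(batch)
    current_topics = kmeans.cluster_centers_
```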

As well, k-means is rarely run just once, since its output depends on the random initialization (exactly how this is handled depends on the implementation); it's usually restarted several times and the best result kept. This overhead can be a pain when dealing with a ton of points.
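
For reference, libraries usually expose the restart count directly; scikit-learn's KMeans, for example, runs n_init random initializations and keeps the one with the lowest within-cluster sum of squares (inertia). The data and parameter values below are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(10_000, 20))   # placeholder data

# Ten independent random restarts; only the best run is kept.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
print(kmeans.inertia_)   # best objective value over the 10 restarts
```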