7. K Means Algorithm - ZYL-Harry/Machine_Learning_study GitHub Wiki

Introduction

function: categorize the items into k groups or clusters of similarity

Process

randomly initialize two points(for two clusters) as the cluster centroids
cluster assignmnet step: go through each of the examples, depending on whether it is closer to which cluster centroid(using the euclidean distance to compute the distance), it is assigned to one of the cluster centroid---divide into two clusters
move centroids step: move the original cluster centroids to the average of the points in each cluster as the new cluster centroids
repeat the step 1,2,3 until the clusters seemed to be satisfying

Tip:
When computing the new clusters points, if there are some original cluster points which have no example points assigned to, then we can get rid of it or initialize the cluster points again

Optimization Objective

function:
1.use optimization objective to check whtether the K-means algorithm is working well or not
2.use optimization objective to help K-means to find better clusters and avoid local optimal points\
Parameters:
1.c^{i} = index of the cluster to which example x^{i}(i=1,2,...,m) is currently assigned
2.μ_{k} = cluster centroid k(k=1,2,...,K)
3.μ_{c^{i}} = cluster centroid of cluster to which example x^{i} has been assigned
Cost function(Distrotion cost function):

In the algorithm, because of the order of the algorithm, we should first minimize the cost function with c^{i}, and then minimize it with μ_{k}

Randomly Initialization

Method 1: randomly pick K training examples as the K initial cluster centroids
Problem: If the initial cluster centroids are not good enough, it can also lead to a bad result
Method improved: run many times K means with different initial cluster centroids, and then pick the one with the lowest cost function

Tip:
This kind of method is only very helpful when the number of clusters is small(2-10), if the number of clusters is large, then it doesn't come out with a much better one

Choosing the number of clusters

Common method: looking at the examples and choosing the number of clusters manually
Elbow Method:
Other situation:

Exercise by python

read the dataset

def read_data(path):
    data = loadmat(path)
    X = data['X']
    return X

visualize the dataset

def visualize_data(X):
    x1 = X[:, 0]
    x2 = X[:, 1]
    plt.figure()
    plt.scatter(x=x1, y=x2, color='k', marker='o')
    plt.show()

Output:
ex7data_kmeans

find the closest centroids for data

def find_closest_centroids(X, centroids, K):
    X_index = np.matrix(np.zeros((X.shape[0], 1)))
    distance = np.matrix(np.zeros((X.shape[0], 1)))
    distance_temp = np.matrix(np.zeros((K, 1)))
    for m in range(X.shape[0]):
        for k in range(K):
            distance_temp[k, 0] = np.power(np.sum(np.power((X[m, :] - centroids[k, :]), 2)), 0.5)   # compute the distance between data points and present cnetroids
        X_index[m, 0] = np.argmin(distance_temp)
        distance[m, 0] = distance_temp[int(X_index[m, 0]), 0]
    return X_index

compute the new centroids

def compute_centroids(X, X_index, K):
    new_centroids = np.matrix(np.zeros((K, X.shape[1])))
    for k in range(K):
        X_k = np.matrix(X[np.argwhere(X_index == k)[:, 0], :])
        new_centroids[k, :] = np.sum(X_k, axis=0) / X_k.shape[0]
    return new_centroids

run the K-means algorithm

def k_means(X, K, initial_centroids, max_iter):
    present_centroids = initial_centroids
    new_centroids = initial_centroids
    for i in range(max_iter):
        # find the closest centroids for data
        X_index = find_closest_centroids(X, present_centroids, K)
        # compute the new centroids
        new_centroids = compute_centroids(X, X_index, K)
        present_centroids = new_centroids
    return new_centroids, X_index

Output:

the new cluster centroids are
[[1.95399466 5.02557006]
[3.04367119 1.01541041]
[6.03366736 3.00052511]]

visualize the new clusters

def visualize_cluster(centroids, X_index, X):
    plt.figure()
    color = np.matrix(['r', 'g', 'b'])
    for k in range(centroids.shape[0]):
        X_k = np.matrix(X[np.argwhere(X_index == k)[:, 0], :])
        x1 = X_k[:, 0].flatten().A[0]
        x2 = X_k[:, 1].flatten().A[0]
        plt.scatter(x=x1, y=x2, color=color[0, k], marker='o')
    plt.scatter(x=centroids[:, 0].flatten().A[0], y=centroids[:, 1].flatten().A[0], color='k', marker='+')
    plt.show()

Output:
ex7classify_kmeans

main function

'''K-Means'''
# read the dataset
path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_7/ex7/ex7data2.mat'
X1 = read_data(path1)
# visualize the dataset
visualize_data(X1)
# initialize parameters
K = 3
initial_centroids = np.matrix([3, 3], [6, 2], [8, 5](/ZYL-Harry/Machine_Learning_study/wiki/3,-3],-[6,-2],-[8,-5))
# run the k-means algorithm
max_iter = 10
new_centroids, X_index = k_means(X1, K, initial_centroids, max_iter)
print('the new cluster centroids are \n', new_centroids)
# visualize the new clusters
visualize_cluster(new_centroids, X_index, X1)

The method above is based on the basic mathemetical model of k-means, the following one is based on the code in sklearn.cluster.KMeans(Here is the official website: sklearn.cluster.KMeans):
main function:

sklearn_cluster(X1, K)

sklearn_cluster function:

def sklearn_cluster(X, K):
    model_original = KMeans(n_clusters=K)
    print('the KMeas model is ', model_original)
    model = model_original.fit(X)
    centroids = model_original.cluster_centers_
    print('the centroids computed by sklearn.cluster.KMeans are \n', centroids)
    X_index = model_original.predict(X)
    visualize_cluster(centroids, X_index, X)

Output:

the centroids computed by sklearn.cluster.KMeans are
[[3.04367119 1.01541041]
[1.95399466 5.02557006]
[6.03366736 3.00052511]]