7. K Means Algorithm - ZYL-Harry/Machine_Learning_study GitHub Wiki
Introduction
- function: categorize the items into k groups or clusters of similarity
Process
- randomly initialize two points(for two clusters) as the cluster centroids
- cluster assignmnet step: go through each of the examples, depending on whether it is closer to which cluster centroid(using the euclidean distance to compute the distance), it is assigned to one of the cluster centroid---divide into two clusters
- move centroids step: move the original cluster centroids to the average of the points in each cluster as the new cluster centroids
- repeat the step 1,2,3 until the clusters seemed to be satisfying
Tip:
When computing the new clusters points, if there are some original cluster points which have no example points assigned to, then we can get rid of it or initialize the cluster points again
Optimization Objective
- function:
1.use optimization objective to check whtether the K-means algorithm is working well or not
2.use optimization objective to help K-means to find better clusters and avoid local optimal points\ - Parameters:
1.c^{i} = index of the cluster to which example x^{i}(i=1,2,...,m) is currently assigned
2.μ_{k} = cluster centroid k(k=1,2,...,K)
3.μ_{c^{i}} = cluster centroid of cluster to which example x^{i} has been assigned - Cost function(Distrotion cost function):
In the algorithm, because of the order of the algorithm, we should first minimize the cost function with c^{i}, and then minimize it with μ_{k}
Randomly Initialization
- Method 1: randomly pick K training examples as the K initial cluster centroids
- Problem: If the initial cluster centroids are not good enough, it can also lead to a bad result
- Method improved: run many times K means with different initial cluster centroids, and then pick the one with the lowest cost function
Tip:
This kind of method is only very helpful when the number of clusters is small(2-10), if the number of clusters is large, then it doesn't come out with a much better one
Choosing the number of clusters
- Common method: looking at the examples and choosing the number of clusters manually
- Elbow Method:
- Other situation:
Exercise by python
- read the dataset
def read_data(path):
data = loadmat(path)
X = data['X']
return X
- visualize the dataset
def visualize_data(X):
x1 = X[:, 0]
x2 = X[:, 1]
plt.figure()
plt.scatter(x=x1, y=x2, color='k', marker='o')
plt.show()
Output:
- find the closest centroids for data
def find_closest_centroids(X, centroids, K):
X_index = np.matrix(np.zeros((X.shape[0], 1)))
distance = np.matrix(np.zeros((X.shape[0], 1)))
distance_temp = np.matrix(np.zeros((K, 1)))
for m in range(X.shape[0]):
for k in range(K):
distance_temp[k, 0] = np.power(np.sum(np.power((X[m, :] - centroids[k, :]), 2)), 0.5) # compute the distance between data points and present cnetroids
X_index[m, 0] = np.argmin(distance_temp)
distance[m, 0] = distance_temp[int(X_index[m, 0]), 0]
return X_index
- compute the new centroids
def compute_centroids(X, X_index, K):
new_centroids = np.matrix(np.zeros((K, X.shape[1])))
for k in range(K):
X_k = np.matrix(X[np.argwhere(X_index == k)[:, 0], :])
new_centroids[k, :] = np.sum(X_k, axis=0) / X_k.shape[0]
return new_centroids
- run the K-means algorithm
def k_means(X, K, initial_centroids, max_iter):
present_centroids = initial_centroids
new_centroids = initial_centroids
for i in range(max_iter):
# find the closest centroids for data
X_index = find_closest_centroids(X, present_centroids, K)
# compute the new centroids
new_centroids = compute_centroids(X, X_index, K)
present_centroids = new_centroids
return new_centroids, X_index
Output:
the new cluster centroids are
[[1.95399466 5.02557006]
[3.04367119 1.01541041]
[6.03366736 3.00052511]]
- visualize the new clusters
def visualize_cluster(centroids, X_index, X):
plt.figure()
color = np.matrix(['r', 'g', 'b'])
for k in range(centroids.shape[0]):
X_k = np.matrix(X[np.argwhere(X_index == k)[:, 0], :])
x1 = X_k[:, 0].flatten().A[0]
x2 = X_k[:, 1].flatten().A[0]
plt.scatter(x=x1, y=x2, color=color[0, k], marker='o')
plt.scatter(x=centroids[:, 0].flatten().A[0], y=centroids[:, 1].flatten().A[0], color='k', marker='+')
plt.show()
Output:
- main function
'''K-Means'''
# read the dataset
path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_7/ex7/ex7data2.mat'
X1 = read_data(path1)
# visualize the dataset
visualize_data(X1)
# initialize parameters
K = 3
initial_centroids = np.matrix([3, 3], [6, 2], [8, 5](/ZYL-Harry/Machine_Learning_study/wiki/3,-3],-[6,-2],-[8,-5))
# run the k-means algorithm
max_iter = 10
new_centroids, X_index = k_means(X1, K, initial_centroids, max_iter)
print('the new cluster centroids are \n', new_centroids)
# visualize the new clusters
visualize_cluster(new_centroids, X_index, X1)
The method above is based on the basic mathemetical model of k-means, the following one is based on the code in sklearn.cluster.KMeans
(Here is the official website: sklearn.cluster.KMeans):
main function:
sklearn_cluster(X1, K)
sklearn_cluster function:
def sklearn_cluster(X, K):
model_original = KMeans(n_clusters=K)
print('the KMeas model is ', model_original)
model = model_original.fit(X)
centroids = model_original.cluster_centers_
print('the centroids computed by sklearn.cluster.KMeans are \n', centroids)
X_index = model_original.predict(X)
visualize_cluster(centroids, X_index, X)
Output:
the centroids computed by sklearn.cluster.KMeans are
[[3.04367119 1.01541041]
[1.95399466 5.02557006]
[6.03366736 3.00052511]]