K-Means Clustering

Introduction

Data is everywhere today, and a simple click of a button on a smartphone can be collected and used to provide predictions, patterns, connections, relationships, and much more. A wide range of data science methods is available for this, from straightforward methods like classification (by category) to more detail-oriented methods like anomaly detection (trends/unusual patterns) and linear regression (statistical analysis). Human input remains crucial for these methods because the data still needs manual labeling, or identification, making them supervised techniques. However, some scenarios contain ambiguous information - data that requires nuanced differentiation because it belongs to overlapping domains. To tackle this, a method known as K-means clustering offers a structured approach, taming the complexity by grouping similar data points, an 'n' number of objects, into distinct "clusters," referred to as 'k.' Above all, K-means clustering is an unsupervised technique requiring minimal human input, which makes it a powerful tool.

Brief Understanding

K-means clustering discovers structure in data based on similarity, grouping the data into a specified number of clusters based solely on its features and the distances between points. The algorithm works to minimize the variance within each cluster while maximizing the distinction between clusters. In supervised learning techniques, a human assumes the responsibility of labeling the data before it reaches the algorithm. This becomes harder to manage as more data gets involved, not to mention the risk of human error. For example, if there are two categories, "a" and "b," the user (human) would have to label each data point as "a" or "b." The flaw of this method lies in the challenge of building a well-defined data set: it isn't guaranteed that all the provided data can be classified as "a" or "b."

In comparison, the only human interaction necessary for K-means clustering is setting the value of 'k.' To elaborate, if 'k' were set to 2, K-means would partition the data into two clusters through a 'natural grouping' process, without any predefined notion of what those groups should be. The method therefore discovers groupings in the data by itself, and its flexibility can surface outcomes that were never an identified goal. As a result, the K-means algorithm would present two clusters of data based on commonality.

Diving Deeper


[3] K-Means Algorithm [Fig. 1]

The K-Means Algorithm [Fig. 1] organizes a dataset of 'n' data points into 'k' clusters. Each cluster is defined by a central point referred to as the centroid. The iteration begins by initially defining the centroids, followed by evaluating the distance function [Fig. 1] to determine the nearest centroid for each data point. The index 'j' corresponds to the cluster number (ranging from 1 to 'k'), and 'i' indexes the data points in cluster 'j.' The objective function 'J' [Fig. 1] is the WCSS (Within-Cluster Sum of Squares): the sum of the squared distances between each data point and the centroid of its cluster [5]. Through iterative recalculation and reassignment of data points, the algorithm refines the clusters to minimize 'J.' This ensures that the clusters have minimal internal variance and that each data point is as close as possible to the centroid of its assigned cluster.
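Written out explicitly (a standard formulation consistent with the definitions above, where $c_j$ denotes the centroid of cluster $j$ and $x_i^{(j)}$ the $i$-th data point assigned to that cluster):

$$
J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\lVert x_i^{(j)} - c_j \right\rVert^2
$$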


[5] Within-Cluster Sum of Square - "Elbow" Point [Fig. 2]

Now that the process of the K-means algorithm has been explained, we turn to its application. Earlier, it was mentioned that human interaction is still necessary to operate the K-means method. The user must experiment by inputting different 'k' values and gathering the corresponding WCSS (within-cluster sum of squares) for each 'k.' The WCSS measures the compactness of the clusters - lower values indicate 'tighter' clusters. Each WCSS is then plotted against its corresponding 'k' to identify the 'elbow point' [Fig. 2], a procedure often referred to as 'The Elbow Method.' The elbow point marks the end of the rapid decrease in WCSS (indicating tighter clusters) and pinpoints the optimal number of clusters. The flattening of the curve beyond it indicates that increasing the number of clusters yields little additional compactness. Having too many clusters can be counter-productive and redundant, defeating the purpose of K-means. Therefore, this experimentation offers a balanced and effective way to choose the number of clusters to declare for 'k' when utilizing the K-means algorithm. The method may seem tedious for the user but is far less laborious than the labeling required by supervised learning techniques. Modern data analysis and machine learning tools can also automate this task, further automating the K-means method.
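A minimal sketch of this experimentation using scikit-learn (the synthetic dataset and the range of candidate 'k' values are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a few natural groupings.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
k_values = range(1, 11)
for k in k_values:
    # `inertia_` is scikit-learn's name for the WCSS objective 'J'.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method: look for the bend in the curve")
plt.show()
```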


[2] K-means Algorithm Steps: Initialization, Classification, Centroid Calculation, and Convergence [Fig. 3]

In summary, the K-means algorithm iterates over a dataset through four steps: initialization, classification, centroid calculation, and convergence [Fig. 3]. To review, the experimentation with different 'k' values to identify the 'elbow point' is conducted in preparation for the algorithm. Once initiated, the algorithm enters initialization by randomly selecting 'k' points from the dataset to seed the clusters [Fig. 4]. In classification, the algorithm assigns each data point to the nearest centroid by computing the distance between the point and each centroid - establishing the clusters. In centroid calculation, the location of each centroid is recalculated as the mean of its cluster. Although classification and centroid calculation are defined as separate steps, their reiteration constitutes the convergence stage, whose objective is cluster optimization [Fig. 3 & 5]. During this final stage, the algorithm checks that the centroids no longer move significantly, ensuring cluster stability [Fig. 6]. A from-scratch sketch of these four steps appears after the figures below.


[2a] Initialization [Fig. 4]


[2b] The Reiteration of Classification and Centroid Calculation Processes (Convergence Stage) [Fig. 5]


[2c] K-Means Final Convergence [Fig. 6]
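The following NumPy-only sketch implements the four steps just described (the function and parameter names, such as `max_iters` and `tol`, are illustrative assumptions, not part of any particular library):

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: randomly pick k points from the dataset as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Classification: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Centroid calculation: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Convergence: stop once the centroids no longer move significantly.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids

    return centroids, labels
```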

Ideas behind the Least-Squares Problem

From a mathematical perspective, K-means clustering is essentially an instance of a least-squares problem. Least squares is a parameter estimation method, known from regression analysis, based on minimizing the sum of the squares of the residuals (in this case, the differences between the samples of each pre-classified cluster and our input samples) left by each individual equation. Based on the outcome of solving a least-squares problem, we can tell whether the classification of our clusters is at its optimal status or not. A general least-squares problem has the following form.


[1] Least Square Problem [Fig. 1]

In a general least-squares problem, given a matrix A and a vector b, we find a vector x that minimizes the norm of the error Ax - b. Mapping this onto a K-means clustering problem, the matrix A represents the collected input samples, x represents the coefficients of the optimal cluster solution, and b represents the group of clusters. Under this setup we can treat this as a linear problem, since each observation is classified by a linear regression model, which implies that the columns of A are linearly independent. Guided by this fact, the solution of a linear least-squares problem is given as follows.


[2] Least Square Problem solution [Fig. 2]

In this equation, the solution vector x is expressed as the product of the inverse of R, the transpose of Q, and b. The factorization A = QR is called the QR factorization, a standard matrix operation used to reduce the computational cost (and improve the numerical stability) of solving least-squares problems in software.
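In symbols (a standard statement of the problem and its QR-based solution, consistent with the figures above):

$$
\hat{x} = \arg\min_{x} \lVert Ax - b \rVert_2^2, \qquad A = QR \;\Rightarrow\; \hat{x} = R^{-1} Q^{\mathsf{T}} b
$$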

As the size of our classification grows and more clusters and samples are introduced to the data-fitting problem, we need a more robust and more tolerant model. Since the classification can no longer fit a linear regression model, the least-squares problem becomes a nonlinear problem, which requires an iterative procedure, such as the Gauss–Newton algorithm, to solve. However, iteration can cause irreversible offsets during the classification process, resulting in divergence of the classification sequence. Motivated by this issue of divergence, a damping factor (lambda, λ) is added to the original least-squares equation to guarantee local convergence of the solution. This modification is called the Levenberg–Marquardt algorithm, and its equation looks like the following.


[3] Levenberg–Marquardt algorithm [Fig. 3]

where D scales the damping term built from the gradient of the original least-squares objective, and the damping factor λ is adjusted with respect to the error at each iteration to avoid overfitting or underfitting.
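One standard way to write the damped normal equations (an assumed form of the Levenberg–Marquardt update, reusing the notation of the linear least-squares problem above):

$$
\left( A^{\mathsf{T}} A + \lambda\, D^{\mathsf{T}} D \right) x = A^{\mathsf{T}} b
$$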

In mathematics and computing, the LMA is used in many software applications for solving generic curve-fitting problems, not just classification problems. The Gauss–Newton algorithm (which plays the role of the K-means update here) often converges faster than first-order methods, but the LMA is more robust than the GNA: in many cases it finds a solution even when it starts very far from the final minimum. For well-behaved functions and reasonable starting parameters, however, the LMA tends to be slower than the GNA.

Real-World Applications

K-means clustering, an unsupervised learning algorithm, has become increasingly popular in the retail industry due to its robust capability to analyze copious amounts of consumer data. As shoppers express their preferences through every click and purchase, they generate a wealth of information on purchase history, spending habits, ad responses, shopping preferences, and more. This data, when organized effectively, helps companies detect overlapping customer preferences and historical sales patterns spanning prices, styles, and volumes. Grouping this data with K-means clustering enables retailers to tailor their offerings and marketing strategies with greater precision, aiming to enhance foot traffic, revenue, and customer engagement [3] through personalized product recommendations.

From a supervised standpoint, humans are responsible for labeling the datasets. They need to carefully define categories so that the data can be allocated precisely, which is a laborious task. For example, a retail dataset might have labels such as "regular-priced," "on sale," and "returning customers." While machine learning tools can assist in organizing the data into these predefined categories, they still rely on the initial human-defined labels. Moreover, some data points may not fit neatly into these categories, such as those related to 'new customers' or 'window shoppers.' These outliers would necessitate the creation of additional categories, complicating the process. Although supervised methods can be tailored to match the granularity of unsupervised techniques, the effort to label and define categories extensively makes them tedious and potentially impractical compared to the automatic clustering achieved by the K-means algorithm.

In contrast, an unsupervised technique like K-means clustering simplifies the process by requiring the human only to experiment with various 'k' values to locate the 'elbow point,' which determines the ideal number of clusters. For instance, if 'k' were set to three, the K-means algorithm autonomously performs its iterations [Fig. 4-6], refining and forming three clusters without needing any predefined categories. Once the algorithm completes its processing, it organizes the data into three distinct clusters based on the inherent similarities among the data points. These clusters, while initially unlabelled, can subsequently be analyzed and interpreted based on their characteristics. As a result, they could reveal different consumer behaviors such as frequent large purchasers, responsive ad-driven purchasers, consistent returning customers, and others. This post-clustering analysis allows for flexible interpretation of the clusters based on the comprehensive dataset.
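A small illustrative sketch of this kind of segmentation with k = 3 (the feature names and values are hypothetical, invented for this example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer features:
# [purchase frequency, average basket size ($), ad-click rate]
customers = np.array([
    [12, 85.0, 0.02],
    [52, 210.0, 0.01],
    [3, 40.0, 0.30],
    [48, 190.0, 0.02],
    [5, 35.0, 0.25],
    [11, 90.0, 0.03],
])

# Standardize so no single feature dominates the distance calculation.
X = StandardScaler().fit_transform(customers)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Cluster labels can then be interpreted post hoc, e.g. frequent large
# purchasers vs. ad-driven purchasers vs. occasional shoppers.
print(model.labels_)
```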

K-means supports retailers in deriving actionable insights without manually predefined categories by autonomously identifying natural groupings within vast datasets. This unsupervised technique unveils complex patterns and relationships that might not be discovered through traditional methods. K-means clustering therefore proves to be a valuable resource in the retail industry, offering an innovative way to sharpen marketing strategies, strategize product placements, and ultimately drive increased customer satisfaction and business growth.

Conclusion

In conclusion, while supervised learning requires extensive human input to accurately label and categorize data, K-means clustering offers a more forgiving and streamlined approach that showcases its efficiency in understanding data. Supervised learning techniques face rigid scalability challenges when confronted with dynamic and varied data, remaining reliant on human input for labeling. K-means clustering exemplifies the strengths of unsupervised learning by offering flexibility and scalability while autonomously discovering patterns within the data. This approach not only saves time and increases productivity but also opens avenues for uncovering nuanced insights within overwhelming datasets.

References

[1] https://www.intechopen.com/chapters/66538
[2] https://journals.sagepub.com/doi/10.1155/2015/615740
[3] https://medium.com/data-and-beyond/customer-segmentation-using-k-means-clustering-with-pyspark-unveiling-insights-for-business-8c729f110fab
[4] https://www.analyticsvidhya.com/blog/2019/04/introduction-image-segmentation-techniques-python
[5] https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/
