9.4.1 k-Means Clustering - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki
Intro to Clustering
Clustering for segmentation
Imagine that you have a customer dataset and you need to apply customer segmentation on this historical data. Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy, as it allows the business to target specific groups of customers, so as to more effectively allocate marketing resources.
Let's learn how to divide a set of customers into categories, based on characteristics they share.
One of the most widely adopted approaches to customer segmentation is clustering. Clustering groups data in an unsupervised way, based on the similarity of customers to each other, and partitions the customers into mutually exclusive groups, for example into three clusters. The customers in each cluster are demographically similar to each other, so we can create a profile for each group from the common characteristics of its cluster: for example, a first group of affluent, middle-aged customers; a second group of young, educated, middle-income customers; and a third group of young, low-income customers. Finally, we can assign each individual in our dataset to one of these groups, or segments, of customers.

Now imagine that you cross-join this segmented dataset with a dataset of the products or services that customers purchase from your company. This information would really help you understand and predict the differences in individual customers' preferences and their buying behavior across various products. Having this information would allow your company to develop highly personalized experiences for each segment. Customer segmentation is one of the most popular uses of clustering, but cluster analysis also has many other applications in different domains.
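As a sketch of this workflow, the snippet below clusters a tiny, invented customer table (hypothetical age and income values) into three segments with scikit-learn's `KMeans`. The data and the number of clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [age, income in $k] -- invented toy data
customers = np.array([
    [55, 120], [60, 110], [52, 130],   # affluent, middle-aged
    [28,  65], [32,  70], [30,  60],   # young, middle income
    [22,  25], [25,  30], [21,  28],   # young, low income
])

# Partition the customers into k=3 mutually exclusive segments
km = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = km.fit_predict(customers)

print(segments)  # one segment label (0-2) per customer
```

In practice you would standardize the features first (e.g. with `StandardScaler`), since k-means distances are scale-sensitive.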
What is clustering?
Clustering means finding clusters in a dataset, unsupervised.
- What is a cluster?
- A group of data points or objects in a dataset that are similar to other objects in the group, and dissimilar to data points in other clusters.
Clustering vs. Classification
Classification algorithms predict categorical class labels, meaning they assign instances to predefined classes such as defaulted or not defaulted.
For example, if an analyst wants to analyze customer data in order to know which customers might default on their payments, she uses a labeled dataset as training data and uses classification approaches such as a decision tree, Support Vector Machines or SVM, or logistic regression, to predict the default value for a new or unknown customer. Generally speaking, classification is a supervised learning where each training data instance belongs to a particular class.
In clustering however, the data is unlabeled and the process is unsupervised.
For example, we can use a clustering algorithm such as k-means to group similar customers as mentioned, and assign them to a cluster based on whether they share similar attributes, such as age, education, and so on. While I'll be giving you some examples from different industries, I'd like you to think about more examples of clustering.
Question
"Clustering algorithms predict categorical class labels", is it TRUE or FALSE?
TRUE / FALSE

Correct answer: FALSE. Clustering is unsupervised; it is classification that predicts categorical class labels.
Clustering applications
- RETAIL/MARKETING:
- Identifying buying patterns of customers
- Recommending new books or movies to new customers
- BANKING:
- Fraud detection in credit card use
- Identifying clusters of customers (e.g., loyal)
- INSURANCE:
- Fraud detection in claims analysis
- Insurance risk of customers
- PUBLICATION:
- Auto-categorizing news based on their content
- Recommending similar news articles
- MEDICINE:
- Characterizing patient behavior
- BIOLOGY:
- Clustering genetic markers to identify family ties
Why clustering?
- Exploratory data analysis
- Summary generation
- Outlier detection
- Finding duplicates
- Pre-processing step
Clustering algorithms
- Partition-based Clustering
  - Relatively efficient
  - E.g. k-Means, k-Median, Fuzzy c-Means
- Hierarchical Clustering
  - Produces trees of clusters
  - E.g. Agglomerative, Divisive
- Density-based Clustering
  - Produces arbitrary-shaped clusters
  - E.g. DBSCAN
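All three families can be tried through scikit-learn's common estimator API. A minimal sketch on toy data (the parameter values are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Two obvious blobs of points
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# Partition-based: needs k up front, relatively efficient
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): builds a tree of clusters, cut here at 2
agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Density-based: no k needed; can find arbitrary-shaped clusters
db_labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)

print(km_labels, agg_labels, db_labels)
```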
Intro to k-Means
What is k-Means clustering?
One of the algorithms that can be used for customer segmentation is k-Means clustering. k-Means groups data in an unsupervised way, based on the similarity of customers to each other.
k-Means algorithms
- Partitioning Clustering
- k-means divides the data into non-overlapping subsets (clusters) without any cluster-internal structure
- Examples within a cluster are very similar
- Examples across different clusters are very different
Determine the similarity or dissimilarity
Question
What is the objective of k-means?
- To form clusters in such a way that similar samples go into a cluster, and dissimilar samples fall into different clusters.
- To minimize the “intra cluster” distances and maximize the “inter-cluster” distances.
- To divide the data into non-overlapping clusters without any cluster-internal structure
Correct
The same distance measure works in 1-dimensional, 2-dimensional, or multi-dimensional feature space. Indeed, the similarity measure strongly influences how the clusters are formed, so it is recommended to use domain knowledge of your dataset and the data types of its features to choose a meaningful distance measure.
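A common choice is Euclidean distance, which generalizes unchanged from one feature to many. A minimal sketch with NumPy (the feature values are invented toy numbers):

```python
import numpy as np

# 1-dimensional: two customers described by age only
d1 = np.linalg.norm(np.array([54.0]) - np.array([50.0]))

# 2-dimensional: age and income
d2 = np.linalg.norm(np.array([54.0, 190.0]) - np.array([50.0, 200.0]))

# Multi-dimensional: age, income, education, ... (same formula)
a = np.array([54.0, 190.0, 3.0])
b = np.array([50.0, 200.0, 8.0])
d3 = np.linalg.norm(a - b)

print(d1, d2, d3)  # d1 = 4.0; d2 = sqrt(4**2 + 10**2); d3 adds the third term
```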
How does k-Means clustering work?
Let's see how k-Means clustering works. For the sake of simplicity, let's assume that our dataset has only two features: the age and income of customers. This means it is a two-dimensional space. We can show the distribution of customers using a scatter plot: the y-axis indicates age and the x-axis shows the income of customers. We try to cluster the customer dataset into distinct groups, or clusters, based on these two dimensions.
k-Means clustering - initialize k
1. Initialize k=3 centroids randomly
2. Distance calculation: measure the distance of each point to each centroid
3. Assign each point to the closest centroid
   - SSE = the sum of the squared differences between each point and its centroid
4. Compute the new centroids for each cluster
5. Repeat steps 2-4 until there are no more changes
Whenever a centroid moves, each point's distance to the centroid needs to be measured again. Yes, k-Means is an iterative algorithm: we repeat steps 2 to 4 until the algorithm converges. In each iteration it moves the centroids, calculates the distances from the new centroids, and assigns data points to the nearest centroid. The result is the clusters with minimum error, i.e., the densest clusters. However, as it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial centroids. In other words, the algorithm is guaranteed to converge to a result, but that result may be a local optimum, not necessarily the best possible outcome.
To address this, it is common to run the whole process multiple times with different starting conditions; randomized starting centroids may give a better outcome. As the algorithm is usually very fast, running it multiple times is not a problem.
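scikit-learn's `KMeans` automates these restarts through its `n_init` parameter: it runs the whole algorithm `n_init` times from different random centroids and keeps the run with the lowest SSE (exposed as `inertia_`). A sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose blobs of toy points
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 2))
               for c in ([0, 0], [8, 0], [4, 7])])

# One random start vs. the best of ten random starts; the latter is
# typically at least as good, because only the lowest-SSE run is kept
single = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(single.inertia_, multi.inertia_)  # SSE of each final clustering
```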
More on k-Means
k-Means clustering algorithm
1. Randomly place k centroids, one for each cluster
2. Calculate the distance of each point from each centroid
3. Assign each data point (object) to its closest centroid, creating a cluster
4. Recalculate the positions of the k centroids
5. Repeat steps 2-4 until the centroids no longer move
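The five steps above can be sketched directly in NumPy. This is a minimal, unoptimized implementation for illustration (no empty-cluster handling), not scikit-learn's:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly place k centroids (here: k distinct data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance of each point from each centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # Step 4: recalculate centroid positions as cluster means
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)  # one blob per cluster
```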
k-Means accuracy
- External approach
- Compare the clusters with the ground truth, if it is available.
- Internal approach
- Average the distance between data points within a cluster.
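The internal metric can be sketched as the mean distance of each point to the centroid of its own cluster (the helper name and toy values below are invented for illustration):

```python
import numpy as np

def mean_intra_cluster_distance(X, labels, centroids):
    """Average distance from each point to its own cluster's centroid."""
    return np.linalg.norm(X - centroids[labels], axis=1).mean()

# Toy clustering: two clusters on a line, each point 1 unit from its centroid
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [11.0, 0.0]])

score = mean_intra_cluster_distance(X, labels, centroids)
print(score)  # 1.0 -- lower means denser clusters
```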
Choosing k
Essentially, determining the number of clusters in a dataset, or k as in the k-Means algorithm, is a frequent problem in data clustering. The correct choice of k is often ambiguous because it depends heavily on the shape and scale of the distribution of points in the dataset.

There are several approaches to this problem, but one commonly used technique is to run the clustering across different values of k and look at a metric of clustering accuracy. This metric can be the mean distance between data points and their cluster centroid, which indicates how dense our clusters are, or, in other words, to what extent the clustering error is minimized. Then, looking at how this metric changes, we can find the best value for k.

The problem is that as the number of clusters increases, the distance from centroids to data points always decreases; this means increasing k will always decrease the error. So, the value of the metric is plotted as a function of k, and the elbow point, where the rate of decrease sharply shifts, is determined. That is the right k for clustering. This method is called the elbow method.
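A sketch of the elbow method with scikit-learn: compute the SSE (`KMeans` exposes it as `inertia_`) for a range of k values and look for where the decrease levels off. The data here is three invented, well-separated blobs, so the bend should appear at k=3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated toy blobs, so the "true" k is 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2))
               for c in ([0, 0], [6, 0], [3, 5])])

sse = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to nearest centroid

for k, s in zip(range(1, 7), sse):
    print(k, round(s, 1))
# SSE keeps dropping as k grows; the sharp bend (elbow) marks the chosen k
```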
Question
In clustering evaluation process, "elbow point" is where the rate of accuracy increase sharply, when we run clustering multiple times, increasing k in each run.
TRUE / FALSE
Correct
k-Means recap
- Relatively efficient on medium and large datasets
- Produces sphere-like clusters
- Needs number of clusters (k)