9.4.1 k-Means Clustering - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki
Intro to Clustering
Clustering for segmentation
Imagine that you have a customer dataset and you need to apply customer segmentation on this historical data. Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy, as it allows the business to target specific groups of customers, so as to more effectively allocate marketing resources.
Let's learn how to divide a set of customers into categories, based on characteristics they share.
One of the most widely adopted approaches to customer segmentation is clustering. Clustering groups data in an unsupervised way, based on the similarity of customers to each other, and partitions the customers into mutually exclusive groups, for example into three clusters. The customers in each cluster are demographically similar to each other, so we can create a profile for each group from the common characteristics of its cluster: for example, a first group of affluent, middle-aged customers; a second group of young, educated, middle-income customers; and a third group of young, low-income customers. Finally, we can assign each individual in our dataset to one of these groups, or segments, of customers.

Now imagine that you cross-join this segmented dataset with a dataset of the products or services that customers purchase from your company. This information would really help you understand and predict the differences in individual customers' preferences and their buying behavior across various products. Having this information would allow your company to develop highly personalized experiences for each segment. Customer segmentation is one of the most popular uses of clustering, but cluster analysis also has many other applications in different domains.
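As a sketch of this workflow, the snippet below clusters a tiny, invented customer table (hypothetical age and income values) into three segments with scikit-learn's `KMeans`. The data and the number of clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [age, income in $k] -- invented toy data
customers = np.array([
    [55, 120], [60, 110], [52, 130],   # affluent, middle-aged
    [28,  65], [32,  70], [30,  60],   # young, middle income
    [22,  25], [25,  30], [21,  28],   # young, low income
])

# Partition the customers into k=3 mutually exclusive segments
km = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = km.fit_predict(customers)

print(segments)  # one segment label (0-2) per customer
```

In practice you would standardize the features first (e.g. with `StandardScaler`), since k-means distances are scale-sensitive.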
What is clustering?
Clustering means finding clusters in a dataset, unsupervised.
- What is a cluster?
- A group of data points or objects in a dataset that are similar to other objects in the group, and dissimilar to data points in other clusters.
Clustering vs. Classification
Classification algorithms predict categorical class labels, meaning they assign instances to predefined classes such as defaulted or not defaulted.
For example, if an analyst wants to analyze customer data in order to know which customers might default on their payments, she uses a labeled dataset as training data and uses classification approaches such as a decision tree, Support Vector Machines or SVM, or logistic regression, to predict the default value for a new or unknown customer. Generally speaking, classification is a supervised learning where each training data instance belongs to a particular class.
In clustering however, the data is unlabeled and the process is unsupervised.
For example, we can use a clustering algorithm such as k-means to group similar customers as mentioned, and assign them to a cluster based on whether they share similar attributes, such as age, education, and so on. While I'll be giving you some examples from different industries, I'd like you to think about more examples of clustering.
Question
"Clustering algorithms predict categorical class labels", is it TRUE or FALSE?
TRUE / FALSE

Correct answer: FALSE. Clustering is unsupervised; it is classification that predicts categorical class labels.
Clustering applications
- RETAIL/MARKETING:
- Identifying buying patterns of customers
- Recommending new books or movies to new customers
- BANKING:
- Fraud detection in credit card use
- Identifying clusters of customers (e.g., loyal)
- INSURANCE:
- Fraud detection in claims analysis
- Insurance risk of customers
- PUBLICATION:
- Auto-categorizing news based on their content
- Recommending similar news articles
- MEDICINE:
- Characterizing patient behavior
- BIOLOGY:
- Clustering genetic markers to identify family ties
Why clustering?
- Exploratory data analysis
- Summary generation
- Outlier detection
- Finding duplicates
- Pre-processing step
Clustering algorithms
- Partition-based Clustering
  - Relatively efficient
  - E.g. k-Means, k-Median, Fuzzy c-Means
- Hierarchical Clustering
  - Produces trees of clusters
  - E.g. Agglomerative, Divisive
- Density-based Clustering
  - Produces arbitrary-shaped clusters
  - E.g. DBSCAN
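All three families can be tried through scikit-learn's common estimator API. A minimal sketch on toy data (the parameter values are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Two obvious blobs of points
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# Partition-based: needs k up front, relatively efficient
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): builds a tree of clusters, cut here at 2
agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Density-based: no k needed; can find arbitrary-shaped clusters
db_labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)

print(km_labels, agg_labels, db_labels)
```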
Intro to k-Means
What is k-Means clustering?
One of the algorithms that can be used for customer segmentation is k-Means clustering. k-Means groups data in an unsupervised way, based on the similarity of customers to each other.
k-Means algorithms
- Partitioning Clustering
- k-means divides the data into non-overlapping subsets (clusters) without any cluster-internal structure
- Examples within a cluster are very similar
- Examples across different clusters are very different
Determine the similarity or dissimilarity
Question
What is the objective of k-means?
- To form clusters in such a way that similar samples go into a cluster, and dissimilar samples fall into different clusters.
- To minimize the “intra cluster” distances and maximize the “inter-cluster” distances.
- To divide the data into non-overlapping clusters without any cluster-internal structure
Correct
The same distance measure works in 1-dimensional, 2-dimensional, or multi-dimensional feature space. Indeed, the similarity measure strongly influences how the clusters are formed, so it is recommended to use domain knowledge of your dataset and the data types of its features to choose a meaningful distance measure.
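A common choice is Euclidean distance, which generalizes unchanged from one feature to many. A minimal sketch with NumPy (the feature values are invented toy numbers):

```python
import numpy as np

# 1-dimensional: two customers described by age only
d1 = np.linalg.norm(np.array([54.0]) - np.array([50.0]))

# 2-dimensional: age and income
d2 = np.linalg.norm(np.array([54.0, 190.0]) - np.array([50.0, 200.0]))

# Multi-dimensional: age, income, education, ... (same formula)
a = np.array([54.0, 190.0, 3.0])
b = np.array([50.0, 200.0, 8.0])
d3 = np.linalg.norm(a - b)

print(d1, d2, d3)  # d1 = 4.0; d2 = sqrt(4**2 + 10**2); d3 adds the third term
```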
How does k-Means clustering work?
Let's see how k-Means clustering works. For the sake of simplicity, let's assume that our dataset has only two features: the age and income of customers. This means it is a two-dimensional space. We can show the distribution of customers using a scatter plot: the y-axis indicates age and the x-axis shows the income of customers. We try to cluster the customer dataset into distinct groups, or clusters, based on these two dimensions.
k-Means clustering - initialize k
1. Initialize k=3 centroids randomly
2. Distance calculation: measure the distance of each point to each centroid
3. Assign each point to the closest centroid
   - SSE = the sum of the squared differences between each point and its centroid
4. Compute the new centroids for each cluster
5. Repeat steps 2-4 until there are no more changes
Whenever a centroid moves, each point's distance to the centroid needs to be measured again. Yes, k-Means is an iterative algorithm: we repeat steps 2 to 4 until the algorithm converges. In each iteration it moves the centroids, calculates the distances from the new centroids, and assigns data points to the nearest centroid. The result is the clusters with minimum error, i.e., the densest clusters. However, as it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial centroids. In other words, the algorithm is guaranteed to converge to a result, but that result may be a local optimum, not necessarily the best possible outcome.
To address this, it is common to run the whole process multiple times with different starting conditions; randomized starting centroids may give a better outcome. As the algorithm is usually very fast, running it multiple times is not a problem.
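scikit-learn's `KMeans` automates these restarts through its `n_init` parameter: it runs the whole algorithm `n_init` times from different random centroids and keeps the run with the lowest SSE (exposed as `inertia_`). A sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose blobs of toy points
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 2))
               for c in ([0, 0], [8, 0], [4, 7])])

# One random start vs. the best of ten random starts; the latter is
# typically at least as good, because only the lowest-SSE run is kept
single = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(single.inertia_, multi.inertia_)  # SSE of each final clustering
```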
More on k-Means
k-Means clustering algorithm
1. Randomly place k centroids, one for each cluster
2. Calculate the distance of each point from each centroid
3. Assign each data point (object) to its closest centroid, creating a cluster
4. Recalculate the positions of the k centroids
5. Repeat steps 2-4 until the centroids no longer move
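The five steps above can be sketched directly in NumPy. This is a minimal, unoptimized implementation for illustration (no empty-cluster handling), not scikit-learn's:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly place k centroids (here: k distinct data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance of each point from each centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # Step 4: recalculate centroid positions as cluster means
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)  # one blob per cluster
```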
k-Means accuracy
- External approach
- Compare the clusters with the ground truth, if it is available.
- Internal approach
- Average the distance between data points within a cluster.
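The internal metric can be sketched as the mean distance of each point to the centroid of its own cluster (the helper name and toy values below are invented for illustration):

```python
import numpy as np

def mean_intra_cluster_distance(X, labels, centroids):
    """Average distance from each point to its own cluster's centroid."""
    return np.linalg.norm(X - centroids[labels], axis=1).mean()

# Toy clustering: two clusters on a line, each point 1 unit from its centroid
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [11.0, 0.0]])

score = mean_intra_cluster_distance(X, labels, centroids)
print(score)  # 1.0 -- lower means denser clusters
```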
Choosing k
Essentially, determining the number of clusters in a dataset, or k as in the k-Means algorithm, is a frequent problem in data clustering. The correct choice of k is often ambiguous because it depends heavily on the shape and scale of the distribution of points in the dataset.

There are several approaches to this problem, but one commonly used technique is to run the clustering across different values of k and look at a metric of clustering accuracy. This metric can be the mean distance between data points and their cluster centroid, which indicates how dense our clusters are, or, in other words, to what extent the clustering error is minimized. Then, looking at how this metric changes, we can find the best value for k.

The problem is that as the number of clusters increases, the distance from centroids to data points always decreases; this means increasing k will always decrease the error. So, the value of the metric is plotted as a function of k, and the elbow point, where the rate of decrease sharply shifts, is determined. That is the right k for clustering. This method is called the elbow method.
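A sketch of the elbow method with scikit-learn: compute the SSE (`KMeans` exposes it as `inertia_`) for a range of k values and look for where the decrease levels off. The data here is three invented, well-separated blobs, so the bend should appear at k=3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated toy blobs, so the "true" k is 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2))
               for c in ([0, 0], [6, 0], [3, 5])])

sse = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to nearest centroid

for k, s in zip(range(1, 7), sse):
    print(k, round(s, 1))
# SSE keeps dropping as k grows; the sharp bend (elbow) marks the chosen k
```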
Question
In clustering evaluation process, "elbow point" is where the rate of accuracy increase sharply, when we run clustering multiple times, increasing k in each run.
TRUE / FALSE
Correct
k-Means recap
- Relatively efficient on medium and large datasets
- Produces sphere-like clusters
- Needs number of clusters (k)