19 02 Hierarchical Clustering - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

(Agglomerative) Hierarchical Clustering

  • A “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • How does it work?
    1. Make each data point a single-point cluster → forms N clusters
    2. Take the two closest data points and make them one cluster → forms N-1 clusters
    3. Take the two closest clusters and make them one cluster → forms N-2 clusters
    4. Repeat step 3 until only one cluster remains
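The merge history above can be sketched with scipy's `linkage()` on a tiny toy dataset (the four points here are illustrative, not from the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four toy points forming two obvious pairs
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])

# Each row of Z records one merge: [cluster_a, cluster_b, distance, new_cluster_size]
Z = linkage(points, method='single', metric='euclidean')

# N points always need N-1 merges to end up in a single cluster
print(Z.shape)  # (3, 4)
print(Z[-1, 3])  # the final merge contains all 4 points
```

The shape `(N-1, 4)` reflects the steps above: each row is one of the N-1 merges.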

01 Basics of hierarchical clustering

  • from scipy.cluster.hierarchy import linkage, fcluster

Create a distance matrix using linkage

linkage(observations, method='single', metric='euclidean', optimal_ordering=False)
  • method: how to calculate the proximity between clusters
    • single: distance between the two closest objects of the clusters
    • complete: distance between the two farthest objects of the clusters
    • average: arithmetic mean of all pairwise distances between objects of the clusters
    • centroid: distance between the centroids (mean positions) of the clusters
    • median: like centroid, but the merged cluster's center is the midpoint of the two merged centroids
    • ward: minimizes the increase in the within-cluster sum of squares
    • No one method is right for all data; carefully understand the distribution of the data first
  • metric: distance metric (e.g. 'euclidean')
  • optimal_ordering: reorder the linkage output so that successive leaves are as close as possible (slower)
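Because there is no single right method, it can help to link the same data with several methods and compare the merge distances. A minimal sketch (the random data here is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 2))

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(data, method=method, metric='euclidean')
    # Column 2 holds the merge distances; the last row is the height
    # at which everything is forced into one cluster
    print(method, round(Z[-1, 2], 3))
```

Different methods produce different merge heights, and hence can produce different flat clusterings when cut.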

Create cluster labels with fcluster

fcluster(distance_matrix, num_clusters, criterion)
  • distance_matrix: the output of linkage()
  • num_clusters: the threshold value (for criterion='maxclust', the maximum number of clusters)
  • criterion: how to form the flat clusters, e.g. 'maxclust' (cap the number of clusters) or 'distance' (cut at a distance threshold)
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import fcluster, linkage
import seaborn as sns
import matplotlib.pyplot as plt

# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method='ward', metric='euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()

02 Visualize clusters

  • Try to make sense of the clusters formed
  • An additional step in validating the clusters
  • Spot trends in the data
sns.scatterplot(x='x',
                y='y',
                hue='labels',
                data=df)

03 Determine the number of clusters with a dendrogram

Introduction to Dendrograms

  • Strategy so far: decide the number of clusters by visual inspection
  • Dendrograms help show the progression as clusters are merged
  • A dendrogram is a branching diagram that shows how each cluster is composed by branching out into its child nodes

Create a dendrogram in Scipy

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

Z = linkage(df[['x_whiten', 'y_whiten']],
            method='ward',
            metric='euclidean')
dn = dendrogram(Z)
plt.show()
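The visual rule of thumb for reading a dendrogram (cut where the vertical branches are longest) can also be sketched programmatically: the largest jump between successive merge distances suggests the number of clusters. The two-blob data below is illustrative, not from the course:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two well-separated blobs, so the "right" answer is 2 clusters
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
data = np.vstack([blob_a, blob_b])

Z = linkage(data, method='ward')

# Merge distances are in column 2; the biggest gap between successive
# merges marks where natural clusters start being forced together.
gaps = np.diff(Z[:, 2])
# Cutting just below the merge after the largest gap leaves this many clusters:
n_clusters = len(Z) - int(np.argmax(gaps))
print(n_clusters)  # 2
```

This mirrors what the eye does on the plotted dendrogram: the cut goes through the tallest uninterrupted vertical stretch.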

04 Limitations

Measuring speed

  • Use the timeit module (or the %timeit magic in IPython/Jupyter)

  • Measure the speed of the linkage() function: %timeit linkage(...)

  • Increasing runtime with data points

  • Quadratic increase of runtime

  • Not feasible for large datasets
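Outside a notebook, the timeit module can show the growth directly. A rough sketch (sizes and data are illustrative; exact times depend on the machine):

```python
import timeit
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)

# Doubling n should roughly quadruple the runtime if scaling is quadratic
for n in [100, 200, 400]:
    data = rng.normal(size=(n, 2))
    t = timeit.timeit(lambda d=data: linkage(d, method='ward'), number=5)
    print(f"n={n}: {t:.4f}s")
```

This quadratic-or-worse growth is why hierarchical clustering becomes impractical for large datasets, motivating faster alternatives such as k-means.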