19 02 Hierarchical Clustering
(Agglomerative) Hierarchical Clustering
- A “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- How does it work?
- Make each data point a single-point cluster → forms N clusters
- Take the two closest data points and make them one cluster → forms N-1 clusters
- Take the two closest clusters and make them one cluster → forms N-2 clusters
- Repeat step 3 until you are left with only one cluster (see the sketch after this list)
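Not part of the course material, but a minimal sketch of these merge steps on a made-up four-point dataset: each of the N-1 rows of SciPy's linkage matrix records one merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four toy points: two tight pairs, far apart from each other
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Each row of the linkage matrix records one merge: the two cluster ids,
# the distance at which they merge, and the size of the new cluster
Z = linkage(points, method='single', metric='euclidean')
for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"Step {step}: merge clusters {int(a)} and {int(b)} "
          f"at distance {dist:.2f} -> new cluster of size {int(size)}")
```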
01 Basics of hierarchical clustering
from scipy.cluster.hierarchy import linkage, fcluster
Create a distance matrix using linkage
linkage(observations, method='single', metric='euclidean', optimal_ordering=False)
- `method`: how to calculate the proximity of clusters
  - single: based on two closest objects
  - complete: based on two farthest objects
  - average: based on the arithmetic mean of all objects
  - centroid: based on the geometric mean of all objects
  - median: based on the median of all objects
  - ward: based on the sum of squares
  - No single method is right for all data; carefully examine the distribution of the data (see the comparison sketch after this list)
- `metric`: distance metric (e.g. 'euclidean')
- `optimal_ordering`: order the data points in the linkage so the resulting dendrogram is easier to read (slower)
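The method choice can change the result noticeably. A small sketch (my own illustration, not from the course) that clusters the same toy data with different methods and compares the resulting cluster sizes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two elongated, slightly overlapping blobs of 50 points each
blob_a = rng.normal(loc=[0, 0], scale=[2.0, 0.3], size=(50, 2))
blob_b = rng.normal(loc=[0, 3], scale=[2.0, 0.3], size=(50, 2))
data = np.vstack([blob_a, blob_b])

for method in ('single', 'complete', 'ward'):
    Z = linkage(data, method=method, metric='euclidean')
    labels = fcluster(Z, 2, criterion='maxclust')
    sizes = np.bincount(labels)[1:]  # cluster labels start at 1
    print(f"{method:>8}: cluster sizes {sizes.tolist()}")
```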
Create cluster labels with fcluster
fcluster(distance_matrix, num_clusters, criterion)
- `distance_matrix`: output of the `linkage()` method
- `num_clusters`: the number of clusters to form
- `criterion`: how to decide thresholds to form clusters (e.g. 'maxclust', 'distance')
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import fcluster, linkage
import seaborn as sns
import matplotlib.pyplot as plt

# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method='ward', metric='euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data=comic_con)
plt.show()
02 Visualize clusters
- Try to make sense of the clusters formed
- An additional step in validating the clusters
- Spot trends in data (see the summary sketch after the plot snippet)
sns.scatterplot(x='x',
y='y',
hue='labels',
data=df)
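Beyond the scatter plot, a quick way to spot trends is to summarise each cluster numerically. A minimal sketch, assuming the comic_con DataFrame and cluster_labels column from the earlier example:

```python
# Assumes comic_con with 'x_scaled', 'y_scaled' and 'cluster_labels'
# columns, as created in the earlier fcluster example
print(comic_con.groupby('cluster_labels')[['x_scaled', 'y_scaled']].mean())
print(comic_con['cluster_labels'].value_counts())
```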
03 Determine the number of clusters: Dendrograms
Introduction to Dendrograms
- Strategy till now: decide the number of clusters by visual inspection
- Dendrograms help show the progression as clusters are merged
- A dendrogram is a branching diagram that shows how each cluster is composed by branching out into its child nodes
Create a dendrogram in SciPy
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

Z = linkage(df[['x_whiten', 'y_whiten']],
            method='ward',
            metric='euclidean')
dn = dendrogram(Z)
plt.show()
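After reading a sensible cut height off the dendrogram, one option (a sketch on made-up data, not from the course) is to form the flat clusters with a distance criterion instead of a fixed cluster count:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, size=(30, 2)),
                  rng.normal(6, 1, size=(30, 2))])
Z = linkage(data, method='ward', metric='euclidean')

# Cut the tree at a height read off the dendrogram;
# 10 is only a placeholder value for this toy data
labels = fcluster(Z, 10, criterion='distance')
print('number of clusters:', len(set(labels)))
```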
04 Limitations
Measuring speed
- Use the `timeit` module
- Measure the speed of the `.linkage()` method: `%timeit linkage(...)`

Increasing runtime with data points
- Runtime increases quadratically with the number of data points (see the timing sketch below)
- Not feasible for large datasets
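A rough timing sketch (my own, not from the course) that uses the timeit module directly instead of the %timeit magic, to show how the runtime of linkage() grows with the number of points:

```python
import timeit
import numpy as np
from scipy.cluster.hierarchy import linkage

# Time linkage() on increasingly large random datasets;
# runtime is expected to grow roughly quadratically
for n in (250, 500, 1000, 2000):
    data = np.random.rand(n, 2)
    seconds = timeit.timeit(lambda: linkage(data, method='ward'), number=3) / 3
    print(f"n={n:>5}: {seconds:.4f} s per run")
```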