19 02 Hierarchical Clustering - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

(Agglomerative) Hierarchical Clustering

  • A “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • How does it work?
    1. Make each data point a single-point cluster → forms N clusters
    2. Take the two closest data points and make them one cluster → forms N-1 clusters
    3. Take the two closest clusters and make them one cluster → forms N-2 clusters
    4. Repeat step 3 until only one cluster remains
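The merge history above can be sketched with scipy's `linkage()` on a tiny toy dataset (the four points here are illustrative, not from the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four toy points forming two obvious pairs
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])

# Each row of Z records one merge: [cluster_a, cluster_b, distance, new_cluster_size]
Z = linkage(points, method='single', metric='euclidean')

# N points always need N-1 merges to end up in a single cluster
print(Z.shape)  # (3, 4)
print(Z[-1, 3])  # the final merge contains all 4 points
```

The shape `(N-1, 4)` reflects the steps above: each row is one of the N-1 merges.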

01 Basics of hierarchical clustering

  • from scipy.cluster.hierarchy import linkage, fcluster

Create a distance matrix using linkage

linkage(observations, method='single', metric='euclidean', optimal_ordering=False)
  • method: how to calculate the proximity between clusters
    • single: distance between the two closest objects of the clusters
    • complete: distance between the two farthest objects of the clusters
    • average: arithmetic mean of all pairwise distances between objects of the clusters
    • centroid: distance between the centroids (mean positions) of the clusters
    • median: like centroid, but the merged cluster's center is the midpoint of the two merged centroids
    • ward: minimizes the increase in the within-cluster sum of squares
    • No one method is right for all data; carefully understand the distribution of the data first
  • metric: distance metric (e.g. 'euclidean')
  • optimal_ordering: reorder the linkage output so that successive leaves are as close as possible (slower)
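Because there is no single right method, it can help to link the same data with several methods and compare the merge distances. A minimal sketch (the random data here is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 2))

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(data, method=method, metric='euclidean')
    # Column 2 holds the merge distances; the last row is the height
    # at which everything is forced into one cluster
    print(method, round(Z[-1, 2], 3))
```

Different methods produce different merge heights, and hence can produce different flat clusterings when cut.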

Create cluster labels with fcluster

fcluster(distance_matrix, num_clusters, criterion)
  • distance_matrix: the output of linkage()
  • num_clusters: the threshold value (for criterion='maxclust', the maximum number of clusters)
  • criterion: how to form the flat clusters, e.g. 'maxclust' (cap the number of clusters) or 'distance' (cut at a distance threshold)
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import fcluster, linkage
import seaborn as sns
import matplotlib.pyplot as plt

# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method='ward', metric='euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()

02 Visualize clusters

  • Try to make sense of the clusters formed
  • An additional step in validating the clusters
  • Spot trends in the data
sns.scatterplot(x='x',
                y='y',
                hue='labels',
                data=df)

03 Determine the number of clusters with a dendrogram

Introduction to Dendrograms

  • Strategy so far: decide the number of clusters by visual inspection
  • Dendrograms help show the progression as clusters are merged
  • A dendrogram is a branching diagram that shows how each cluster is composed by branching out into its child nodes

Create a dendrogram in Scipy

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

Z = linkage(df[['x_whiten', 'y_whiten']],
            method='ward',
            metric='euclidean')
dn = dendrogram(Z)
plt.show()
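The visual rule of thumb for reading a dendrogram (cut where the vertical branches are longest) can also be sketched programmatically: the largest jump between successive merge distances suggests the number of clusters. The two-blob data below is illustrative, not from the course:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two well-separated blobs, so the "right" answer is 2 clusters
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
data = np.vstack([blob_a, blob_b])

Z = linkage(data, method='ward')

# Merge distances are in column 2; the biggest gap between successive
# merges marks where natural clusters start being forced together.
gaps = np.diff(Z[:, 2])
# Cutting just below the merge after the largest gap leaves this many clusters:
n_clusters = len(Z) - int(np.argmax(gaps))
print(n_clusters)  # 2
```

This mirrors what the eye does on the plotted dendrogram: the cut goes through the tallest uninterrupted vertical stretch.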

04 Limitations

Measuring speed

  • Use the timeit module (or the %timeit magic in IPython/Jupyter)

  • Measure the speed of the linkage() function: %timeit linkage(...)

  • Increasing runtime with data points

  • Quadratic increase of runtime

  • Not feasible for large datasets
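Outside a notebook, the timeit module can show the growth directly. A rough sketch (sizes and data are illustrative; exact times depend on the machine):

```python
import timeit
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)

# Doubling n should roughly quadruple the runtime if scaling is quadratic
for n in [100, 200, 400]:
    data = rng.normal(size=(n, 2))
    t = timeit.timeit(lambda d=data: linkage(d, method='ward'), number=5)
    print(f"n={n}: {t:.4f}s")
```

This quadratic-or-worse growth is why hierarchical clustering becomes impractical for large datasets, motivating faster alternatives such as k-means.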