9.4.2. Hierarchical Clustering

Intro to Hierarchical Clustering

Hierarchical clustering algorithms build a hierarchy of clusters where each node is a cluster consisting of the clusters of its daughter nodes.

Hierarchical Clustering

Strategies for hierarchical clustering generally fall into two types, divisive and agglomerative.


Divisive is top-down: you start with all observations in one large cluster and break it down into smaller pieces. Think of divisive as dividing the cluster.

Agglomerative is the opposite of divisive. It is bottom-up, where each observation starts in its own cluster and pairs of clusters are merged as they move up the hierarchy. Agglomeration means to amass or collect things, which is exactly what this approach does with the clusters.

Question

Which of the following sentences are TRUE about hierarchical clustering?

  • Hierarchical clustering algorithms build a hierarchy of clusters where each node is a cluster consisting of the clusters of its daughter nodes.
  • In Divisive Hierarchical Clustering, you start with all observations in a large cluster and break it down into smaller pieces.
  • In Agglomerative Hierarchical Clustering all observations fall on one cluster, and pairs of clusters are split as they move up the hierarchy.

Correct: the first two statements are true. In agglomerative clustering, clusters are merged, not split, as they move up the hierarchy.

Agglomerative clustering

This method builds the hierarchy from the individual elements by progressively merging clusters.


In our example, let's say we want to cluster six cities in Canada based on their distances from one another. They are Toronto, Ottawa, Vancouver, Montreal, Winnipeg, and Edmonton.


We construct a distance matrix at this stage, where the number in row i, column j is the distance between city i and city j. In other words, this table shows the distance between each pair of cities.

The algorithm starts by assigning each city to its own cluster. So if we have six cities, we have six clusters, each containing just one city.

The first step is to determine which cities, let's call them clusters from now on, to merge into a cluster. Usually, we want to take the two closest clusters according to the chosen distance. Looking at the distance matrix, Montreal and Ottawa are the closest clusters so we make a cluster out of them.

We also have to merge these two closest cities in the distance matrix. As you can see, the rows and columns related to Montreal and Ottawa are merged as the cluster is constructed. Then, the distances from all remaining cities to this new merged cluster get updated. But how?

For example, how do we calculate the distance from Winnipeg to the Ottawa/Montreal cluster? We just select the distance from the center of the Ottawa/Montreal cluster to Winnipeg.

Updating the distance matrix, we now have one less cluster.

Next, we look for the closest clusters once again. In this case, the Ottawa/Montreal cluster and Toronto are the closest, and merging them creates another cluster.

In the next step, the closest distance is between the Vancouver cluster and the Edmonton cluster. They form a new cluster, and the data in the matrix table gets updated.

Essentially, the rows and columns are merged as the clusters are merged and the distances are updated. This is a common way to implement this type of clustering and has the benefit of caching distances between clusters.

In the same way, the agglomerative algorithm proceeds by merging clusters, and we repeat this until all clusters are merged and the tree is complete, that is, until all cities are clustered into a single cluster of size six.
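
To make the walkthrough concrete, here is a minimal sketch using SciPy. The distances below are rough air distances in kilometres, used only for illustration (they are not the exact values from the lecture's table), and the choice of the 'average' update rule is likewise an assumption.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

cities = ["Toronto", "Ottawa", "Vancouver", "Montreal", "Winnipeg", "Edmonton"]

# Symmetric distance matrix: entry (i, j) is the distance between city i and city j.
# Rough, illustrative air distances in km.
D = np.array([
    [   0,  350, 3360,  505, 1510, 2700],  # Toronto
    [ 350,    0, 3550,  165, 1440, 2840],  # Ottawa
    [3360, 3550,    0, 3680, 1870,  820],  # Vancouver
    [ 505,  165, 3680,    0, 1815, 2960],  # Montreal
    [1510, 1440, 1870, 1815,    0, 1190],  # Winnipeg
    [2700, 2840,  820, 2960, 1190,    0],  # Edmonton
], dtype=float)

# linkage() expects a condensed (1-D) distance matrix; 'average' is one possible
# update rule (the linkage criteria are listed later on this page).
Z = linkage(squareform(D), method="average")

# Each row of Z records one merge: the two cluster indices, the merge distance,
# and the size of the new cluster. With these values, Ottawa and Montreal merge first.
print(Z)
```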


Hierarchical clustering is typically visualized as a dendrogram. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where cities are viewed as singleton clusters. By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. Essentially, hierarchical clustering does not require a prespecified number of clusters.

However, in some applications, we want a partition of disjoint clusters, just as in flat clustering. In those cases, the hierarchy needs to be cut at some point. For example, cutting at a specific level of similarity creates three clusters of similar cities.
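
As a sketch of how this is usually done in Python (assuming SciPy and matplotlib are available; the sample points are made up), the dendrogram can be drawn and the tree then cut into a flat partition:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic 2-D points in three loose groups, purely for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(10, 2)) for c in (0, 5, 10)])

Z = linkage(X, method="average")

# Each horizontal line in the dendrogram is one merge; its height is the
# distance at which the two clusters were joined.
dendrogram(Z)
plt.show()

# "Cutting" the hierarchy to get disjoint clusters, here by asking for 3 of them.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```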

More on Hierarchical Clustering

Agglomerative algorithm


  1. Create n clusters, one for each data point
  2. Compute the Proximity Matrix
  3. Repeat
    1. Merge the two closest clusters
    2. Update the proximity matrix
  4. Until only a single cluster remains
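
A minimal from-scratch sketch of this loop, assuming single linkage for the update step (other criteria are listed below), could look like the following. It is meant to mirror the four steps, not to be an optimized implementation.

```python
import numpy as np

def agglomerative(D):
    """D: symmetric n x n proximity matrix. Returns the sequence of merges."""
    D = D.astype(float).copy()                 # step 2: the proximity matrix
    np.fill_diagonal(D, np.inf)                # ignore self-distances
    clusters = [[i] for i in range(len(D))]    # step 1: one cluster per data point
    merges = []
    while len(clusters) > 1:                   # step 4: stop when one cluster remains
        i, j = np.unravel_index(np.argmin(D), D.shape)   # step 3a: closest pair
        i, j = min(i, j), max(i, j)
        merges.append((clusters[i], clusters[j], D[i, j]))
        # step 3b: update the proximity matrix (single linkage: keep the minimum)
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
        clusters[i] = clusters[i] + clusters[j]
        clusters.pop(j)
    return merges

# Tiny usage example with a made-up 4 x 4 distance matrix.
D = np.array([[ 0, 2, 6, 10],
              [ 2, 0, 5,  9],
              [ 6, 5, 0,  4],
              [10, 9, 4,  0]])
for a, b, d in agglomerative(D):
    print(f"merge {a} + {b} at distance {d}")
```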

Similarity/Distance

If we have a dataset of n patients, we can build an n by n dissimilarity (distance) matrix. It gives us the distance between clusters when each cluster holds a single data point.

However, as mentioned, we merge clusters in agglomerative clustering. Now the question is, how can we calculate the distance between clusters when there are multiple patients in each cluster?
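
For instance, a minimal sketch with SciPy builds such an n by n matrix; the patient feature values below are hypothetical, made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical patient records: each row is one patient, columns are features
# (e.g. age, blood pressure, a lab value).
patients = np.array([
    [54.0, 120.0, 6.1],
    [61.0, 135.0, 7.0],
    [47.0, 118.0, 5.8],
    [58.0, 142.0, 7.4],
])

# n x n dissimilarity matrix: entry (i, j) is the Euclidean distance between
# patient i and patient j (features would normally be normalized first).
dist_matrix = squareform(pdist(patients, metric="euclidean"))
print(dist_matrix)
```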


Distance between clusters

  • Single-Linkage Clustering

    • Minimum distance between clusters


  • Complete-Linkage Clustering

    • Maximum distance between clusters


  • Average Linkage Clustering

    • Average distance between clusters


  • Centroid Linkage Clustering

    • Distance between cluster centroids

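As a quick sketch, SciPy's linkage function exposes each of these criteria by name; the feature matrix X below is random data standing in for real patient features.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))   # 8 made-up "patients", 3 features each

# The 'method' argument selects the between-cluster distance definition.
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)   # note: 'centroid' assumes Euclidean distances
    print(f"{method:>8}: final merge at distance {Z[-1, 2]:.2f}")
```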

Advantages vs Disadvantages

Advantages:
  • Does not require the number of clusters to be specified.
  • Easy to implement.
  • Produces a dendrogram, which helps with understanding the data.

Disadvantages:
  • Can never undo any previous steps throughout the algorithm.
  • Generally has long runtimes.
  • Sometimes difficult to identify the number of clusters from the dendrogram.

Hierarchical clustering vs k-means

k-means:
  1. Much more efficient
  2. Requires the number of clusters to be specified
  3. Gives only one partitioning of the data, based on the predefined number of clusters
  4. Potentially returns different clusters each time it is run, due to random initialization of centroids

Hierarchical clustering:
  1. Can be slow for large datasets
  2. Does not require the number of clusters to be specified in order to run
  3. Gives more than one partitioning, depending on the resolution
  4. Always generates the same clusters
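
A small scikit-learn sketch contrasting the two; the data is synthetic, and parameter choices such as linkage="average" are just illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0, 3, 6)])

# k-means: the number of clusters must be given up front, and the result can
# depend on the random initialization of the centroids (hence random_state).
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Agglomerative clustering: building the tree needs no cluster count; a count
# (or a distance threshold) is only needed to cut it into flat labels, and the
# same data with the same settings always yields the same result.
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

print("k-means labels:     ", km_labels[:10])
print("hierarchical labels:", hc_labels[:10])
```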