Experiment 4B: Line-by-Line Code Explanation

Aim

Compute the Rand Index for different clustering methods on the Spiral.txt dataset and visualize the data to determine which algorithm best recovers the true clusters.

Code Explanation

Here's a step-by-step explanation of the provided code:

Import Libraries

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt
  • numpy: For handling numerical operations and data loading.
  • KMeans: For K-means clustering.
  • AgglomerativeClustering: For hierarchical clustering.
  • adjusted_rand_score: For computing the Adjusted Rand Index, a chance-corrected version of the Rand Index.
  • matplotlib.pyplot: For plotting and visualizing data.

Load and Prepare the Dataset

data = np.loadtxt("Spiral.txt", delimiter=",", skiprows=1)
X = data[:, :2] # Features
y_true = data[:, 2] # Actual cluster labels
  • np.loadtxt: Loads the dataset from a comma-separated text file, skipping the header row (skiprows=1).
  • X: Features (first two columns).
  • y_true: Ground-truth cluster labels (third column).
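
If Spiral.txt is not available locally, a stand-in file can be generated. The sketch below is an assumption, not part of the original experiment: it writes three noisy interleaved spiral arms in the same comma-separated, one-header-line layout the loading code expects.

import numpy as np

rng = np.random.default_rng(42)
points, labels = [], []
for k in range(3):                            # three interleaved spiral arms
    t = np.linspace(0.5, 4 * np.pi, 100)      # angle along the arm
    x = t * np.cos(t + 2 * np.pi * k / 3) + rng.normal(0, 0.2, t.size)
    y = t * np.sin(t + 2 * np.pi * k / 3) + rng.normal(0, 0.2, t.size)
    points.append(np.column_stack([x, y]))
    labels.append(np.full(t.size, k + 1))

synthetic = np.column_stack([np.vstack(points), np.concatenate(labels)])
np.savetxt("Spiral.txt", synthetic, delimiter=",",
           header="x,y,label", comments="")   # header row matches skiprows=1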

Visualize the Dataset

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
plt.title('True Clusters')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
  • plt.scatter: Plots the data points with colors representing true clusters.
  • cmap='viridis': Uses a color map to differentiate clusters.
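
Optionally (a small sketch, not in the original code), a colorbar makes the label-to-color mapping explicit:

sc = plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
plt.colorbar(sc, label='true cluster label')  # shows which color encodes which label
plt.show()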

Perform K-means Clustering

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_clusters = kmeans.fit_predict(X)
  • KMeans: Initializes K-means with 3 clusters; random_state=42 makes the run reproducible, and n_init=10 restarts the algorithm from 10 different centroid seeds and keeps the best result.
  • fit_predict: Fits the model and returns the cluster assignment for each point.
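
As an optional sanity check (a sketch reusing the fitted kmeans object above), the estimator exposes the learned centroids and the within-cluster sum of squares:

print("Centroids:\n", kmeans.cluster_centers_)           # one (x, y) row per cluster
print("Inertia (within-cluster SSE):", kmeans.inertia_)  # lower means tighter clusters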

Perform Single-link Hierarchical Clustering

single_link = AgglomerativeClustering(n_clusters=3, linkage='single')
single_link_clusters = single_link.fit_predict(X)
  • AgglomerativeClustering with linkage='single': Performs single-link hierarchical clustering, merging at each step the two clusters with the smallest minimum pairwise distance.

Perform Complete-link Hierarchical Clustering

complete_link = AgglomerativeClustering(n_clusters=3, linkage='complete')
complete_link_clusters = complete_link.fit_predict(X)
  • AgglomerativeClustering with linkage='complete': Performs complete-link hierarchical clustering, merging clusters based on the largest pairwise distance between their members.
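
The two hierarchical runs differ only in the linkage argument. A compact equivalent, sketched below reusing X and the imports from above, loops over both options and prints the resulting cluster sizes:

for linkage in ("single", "complete"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    sizes = np.bincount(model.fit_predict(X))   # points assigned to each cluster
    print(linkage, "cluster sizes:", sizes)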

Compute the Rand Index

rand_index_kmeans = adjusted_rand_score(y_true, kmeans_clusters)
rand_index_single_link = adjusted_rand_score(y_true, single_link_clusters)
rand_index_complete_link = adjusted_rand_score(y_true, complete_link_clusters)
print("Rand Index for K-means Clustering:", rand_index_kmeans)
print("Rand Index for Single-link Hierarchical Clustering:", rand_index_single_link)
print("Rand Index for Complete-link Hierarchical Clustering:", rand_index_complete_link)
  • adjusted_rand_score: Computes the Adjusted Rand Index between the true labels and each method's predicted clusters (despite the rand_index_* variable names, the score is the chance-corrected version).
  • print: Displays the score for each clustering method.
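
Two properties of the score are worth knowing: it is invariant to a renaming of the cluster labels, and the adjusted version can dip below zero for worse-than-chance groupings. A small self-contained sketch (the toy labels here are made up for illustration):

from sklearn.metrics import adjusted_rand_score, rand_score

true = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]                  # same grouping, different label names
print(adjusted_rand_score(true, renamed))     # 1.0: label names do not matter
print(rand_score(true, renamed))              # 1.0: plain Rand Index agrees
print(adjusted_rand_score(true, [0, 1, 2, 0, 1, 2]))  # -0.25: worse-than-chance grouping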

Output

The code will output the Rand Index for each clustering method:

Rand Index for K-means Clustering: [Value]
Rand Index for Single-link Hierarchical Clustering: [Value]
Rand Index for Complete-link Hierarchical Clustering: [Value]

Analysis

  • Adjusted Rand Index: Measures the agreement between the predicted and true clusters, corrected for chance. It is close to 0 for a random labeling, equals 1 for perfect agreement, and can be negative when agreement is worse than chance.
  • The algorithm with the highest score recovers the true clusters best.
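
For intuition, the plain (unadjusted) Rand Index is just the fraction of point pairs on which the two labelings agree; adjusted_rand_score additionally corrects this for chance. A minimal pair-counting sketch:

from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of point pairs treated consistently by both labelings:
    # grouped together in both, or separated in both.
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical grouping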

Visualization and Interpretation

  • Visualize the Dataset: The plot shows three intertwined spiral arms rather than compact blobs.
  • Which Algorithm Recovers the True Clusters Best?
    • K-means Clustering: Assumes roughly spherical, convex clusters, so it tends to cut the spiral arms into wedges and scores poorly on this dataset.
    • Single-link Hierarchical Clustering: Merges by minimum distance, so it can follow each arm point by point; on intertwined spirals it typically recovers the true clusters best, though the same chaining behavior can merge clusters too early on noisy data.
    • Complete-link Hierarchical Clustering: Favors compact clusters of similar diameter, so, like K-means, it tends to split the long thin arms.

Based on the Adjusted Rand Index values, you can determine which algorithm best recovers the true clusters: the method with the highest score matches the actual cluster structure most closely. On this spiral-shaped data, single-link clustering is usually the winner. The sketch below plots all three results side by side for a visual check.
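
To make the comparison visual, a plotting sketch (reusing X and the three predicted label arrays from above) shows each method's clusters next to one another:

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
results = [("K-means", kmeans_clusters),
           ("Single-link", single_link_clusters),
           ("Complete-link", complete_link_clusters)]
for ax, (name, labels) in zip(axes, results):
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=10)
    ax.set_title(name)
plt.tight_layout()
plt.show()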