Experiment 4B: Line-by-Line Code Explanation
Aim
Compute the Rand Index for different clustering methods on the Spiral.txt dataset and visualize the data to determine which algorithm best recovers the true clusters.
Code Explanation
Here's a step-by-step explanation of the provided code:
Import Libraries
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt
- numpy: For numerical operations and loading the dataset.
- KMeans: For K-means clustering.
- AgglomerativeClustering: For hierarchical clustering.
- adjusted_rand_score: For computing the (Adjusted) Rand Index.
- matplotlib.pyplot: For plotting and visualizing the data.
Load and Prepare the Dataset
data = np.loadtxt("Spiral.txt", delimiter=",", skiprows=1)
X = data[:, :2] # Features
y_true = data[:, 2] # Actual cluster labels
- np.loadtxt: Loads the dataset from a text file, skipping the header row.
- X: Features (first two columns).
- y_true: Actual cluster labels (third column).
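Before plotting, it can help to sanity-check what was loaded. The file layout sketched in the comments below is an assumption based on the loading code (a header row followed by three comma-separated columns), not a verified sample of the file:

# Hypothetical layout of Spiral.txt (header name and values below are assumptions):
# x,y,label
# 31.95,7.95,3
# 31.15,7.30,3
print("Data shape:", data.shape)                 # expect (n_samples, 3)
print("Feature matrix shape:", X.shape)          # (n_samples, 2)
print("Unique true labels:", np.unique(y_true))  # expect 3 distinct labels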
Visualize the Dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
plt.title('True Clusters')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
- plt.scatter: Plots the data points, colored by their true cluster labels.
- cmap='viridis': Uses the viridis color map to differentiate clusters.
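Optionally, a colorbar makes the color-to-label mapping explicit. This small addition (not part of the original code) reuses the same scatter call:

# Keep a handle to the scatter plot so a colorbar can be attached to it
sc = plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
plt.colorbar(sc, label='cluster label')  # maps each color back to a label value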
Perform K-means Clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_clusters = kmeans.fit_predict(X)
- KMeans: Initializes K-means with 3 clusters; random_state=42 makes the result reproducible, and n_init=10 runs the algorithm from 10 different centroid seeds and keeps the best run.
- fit_predict: Fits the model and returns a cluster assignment for each point.
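For clarity, fit_predict(X) is shorthand for fitting the model and then reading its labels_ attribute. The equivalent two-step form below is illustrative, not part of the original code:

# Equivalent two-step form of kmeans.fit_predict(X)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)                       # estimate the centroids
kmeans_clusters = kmeans.labels_    # cluster index assigned to each point
print(kmeans.cluster_centers_)      # coordinates of the 3 learned centroids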
Perform Single-link Hierarchical Clustering
single_link = AgglomerativeClustering(n_clusters=3, linkage='single')
single_link_clusters = single_link.fit_predict(X)
- AgglomerativeClustering(linkage='single'): Performs single-link (minimum distance) hierarchical clustering, which repeatedly merges the two clusters whose closest points are nearest.
Perform Complete-link Hierarchical Clustering
complete_link = AgglomerativeClustering(n_clusters=3, linkage='complete')
complete_link_clusters = complete_link.fit_predict(X)
- AgglomerativeClustering(linkage='complete'): Performs complete-link (maximum distance) hierarchical clustering, which merges the two clusters whose farthest points are nearest.
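To see how the two linkage strategies build their hierarchies, SciPy's dendrogram utilities can plot the merge order. This optional sketch assumes SciPy is available and is not part of the original experiment:

from scipy.cluster.hierarchy import linkage, dendrogram

# Plot the merge history for each linkage criterion
for method in ("single", "complete"):
    Z = linkage(X, method=method)    # pairwise merge history on the feature matrix
    plt.figure(figsize=(8, 4))
    dendrogram(Z, no_labels=True)    # omit per-point labels for readability
    plt.title(method + "-link dendrogram")
    plt.show()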
Compute the Rand Index
rand_index_kmeans = adjusted_rand_score(y_true, kmeans_clusters)
rand_index_single_link = adjusted_rand_score(y_true, single_link_clusters)
rand_index_complete_link = adjusted_rand_score(y_true, complete_link_clusters)
print("Rand Index for K-means Clustering:", rand_index_kmeans)
print("Rand Index for Single-link Hierarchical Clustering:", rand_index_single_link)
print("Rand Index for Complete-link Hierarchical Clustering:", rand_index_complete_link)
- adjusted_rand_score: Computes the Adjusted Rand Index (ARI) between the true labels and each method's predicted clusters. Note that, despite the name used in this experiment, adjusted_rand_score returns the chance-corrected variant of the Rand Index.
- print: Displays the score for each clustering method.
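Because the ARI compares partitions rather than raw label values, it is invariant to how clusters are numbered, and it can dip below zero. A small self-contained example with toy labels (not from the dataset):

from sklearn.metrics import adjusted_rand_score

# Identical partitions with permuted label names still score 1.0
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))              # 1.0
# A partially matching partition scores between 0 and 1
print(adjusted_rand_score([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # ~0.32
# Worse-than-chance agreement can yield a negative score
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))              # -0.5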
Output
The code will output the Rand Index for each clustering method:
Rand Index for K-means Clustering: [Value]
Rand Index for Single-link Hierarchical Clustering: [Value]
Rand Index for Complete-link Hierarchical Clustering: [Value]
Analysis
- Adjusted Rand Index (ARI): Measures the agreement between the predicted and true clusters, corrected for chance. A score of 1 means perfect agreement, a score near 0 means the clustering is no better than random, and negative values are possible for worse-than-random labelings.
- The algorithm with the highest score recovers the true clusters most faithfully.
Visualization and Interpretation
- Visualize the Dataset: The plot will show how the true clusters are distributed.
- Which Algorithm Recovers the True Clusters Best?:
- K-means Clustering: Assumes roughly spherical, compact clusters, so it tends to cut intertwined spirals into wedge-shaped chunks.
- Single-link Hierarchical Clustering: Merges on the closest pair of points, so it can follow chained, irregular shapes; on spiral-shaped data this "chaining" behavior is usually an advantage.
- Complete-link Hierarchical Clustering: Favors compact, well-separated clusters, so, like K-means, it generally struggles with elongated spirals.
Based on the Rand Index values you obtain, you can determine which algorithm best recovers the true clusters: the method with the highest score matches the actual cluster structure most closely. On spiral-shaped data, single-link clustering typically scores highest. A sketch for confirming this visually follows.
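Since the aim also asks for a visual check, the sketch below (not part of the original code) plots the true labels and the three predicted clusterings side by side, reusing the variables defined above:

# Compare the true labels with each method's predictions side by side
results = [("True", y_true),
           ("K-means", kmeans_clusters),
           ("Single-link", single_link_clusters),
           ("Complete-link", complete_link_clusters)]
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for ax, (name, labels) in zip(axes, results):
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=10)
    ax.set_title(name)
plt.tight_layout()
plt.show()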