ML2 ‐ Lec (8)

📊 Clustering

Definition: Grouping similar objects into clusters.
Key Idea: Objects within a cluster should be similar, while objects in different clusters should be dissimilar.

🏆 Why Clustering?

  • Unsupervised Learning (No labeled data)
  • Finds Natural Patterns
  • Used in:
    • 🔍 Search engines (grouping results)
    • 📂 Organizing documents
    • 🏥 Medical diagnosis

🔍 Classification vs. Clustering

| Feature | Classification 🎯 | Clustering 🔗 |
|---|---|---|
| Learning Type | Supervised (labeled data) | Unsupervised (no labels) |
| Goal | Predict class labels | Group similar instances |
| Example | Spam detection (spam or not) | Customer segmentation |

🎯 Goal of Clustering

Maximize similarity within a cluster
Minimize similarity between clusters

📌 Two Key Variations:

  • Within-cluster variation (WCV) → Minimize 🚫
  • Between-cluster variation (BCV) → Maximize ✅

🏗 Types of Clustering

1️⃣ Partitional Clustering

  • All clusters are formed at once (data divided into non-overlapping groups)
  • Examples:
    • K-Means Clustering
    • Fuzzy C-Means
    • QT Clustering

2️⃣ Hierarchical Clustering

  • Agglomerative (Bottom-Up): Merge clusters until one remains
  • Divisive (Top-Down): Start with one cluster and split recursively

🎭 Hard vs. Soft Clustering

Hard Clustering: Each object belongs to one cluster.
Soft Clustering: Objects can belong to multiple clusters.

Example:

  • 👟 Sneakers in (1) Sports Apparel & (2) Shoes categories → Soft Clustering
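As an illustration (not from the lecture), here is a minimal sketch of the hard/soft distinction, assuming scikit-learn is available: KMeans assigns one hard label per point, while a Gaussian mixture's `predict_proba` gives soft per-cluster membership probabilities.

```python
# Hard vs. soft assignments; a sketch assuming scikit-learn is installed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [3.0, 3.0]])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard)            # one label per point (hard clustering)
print(soft.round(2))   # one probability per (point, cluster) pair (soft)
```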

📏 How to Measure Similarity?

  • Euclidean Distance 📏 (straight-line distance; Minkowski with p = 2)
  • City-block Distance 🏙 (sum of absolute differences; Minkowski with p = 1)
  • Minkowski Distance 🔢 (generalization with order parameter p)
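A small sketch of these measures, assuming NumPy vectors; Minkowski with p = 2 reduces to Euclidean and p = 1 to city-block:

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance of order p between two vectors
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(a, b, 2))  # Euclidean: 5.0
print(minkowski(a, b, 1))  # City-block (Manhattan): 7.0
print(minkowski(a, b, 3))  # Minkowski with p = 3
```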


🎯 K-Means Clustering

Goal: Partition data into K clusters.
Steps:
1️⃣ Initialize K random cluster centers.
2️⃣ Assign each point to the nearest cluster.
3️⃣ Recalculate centroids.
4️⃣ Repeat until centroids stop changing.

📌 Termination Conditions:

  • Centroids stop moving
  • Max iterations reached
  • Sum of Squared Errors (SSE) stabilizes

📌 Formula for SSE:

$$SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$$

where $\mu_i$ is the centroid of cluster $C_i$.
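A minimal NumPy sketch of the loop and the SSE above, for illustration only; real implementations (e.g. scikit-learn's KMeans) add smarter initialization such as k-means++ and handle empty clusters:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1) init
    for _ in range(max_iters):
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)                          # 2) assign
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                        # 4) stop when
            break                                              #    unchanged
        centroids = new                                        # 3) update
    # SSE: squared distance of each point to its cluster centroid
    sse = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return centroids, labels, sse
```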


🏆 How to Choose K?

✔ Try different values of K
✔ For a fixed K, prefer the run with the smallest SSE (SSE always decreases as K grows, so it cannot choose K by itself)
✔ Use the Elbow Method 📉 (find where SSE stops decreasing significantly)
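A sketch of the elbow method, assuming scikit-learn and matplotlib are available; scikit-learn's KMeans exposes the final SSE as `inertia_`:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")       # look for the "elbow" in this curve
plt.xlabel("K")
plt.ylabel("SSE (inertia)")
plt.show()
```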


📊 Hierarchical Clustering

Builds a hierarchy of clusters.
Creates a dendrogram (tree structure) 🌳
Does not require pre-setting K.

📌 Types:

  • Agglomerative (Bottom-Up): Merge clusters
  • Divisive (Top-Down): Split clusters

📌 Linkage Methods:

  • Single-Link (Nearest Neighbor) → cluster distance = smallest distance between any two points, one from each cluster
  • Complete-Link (Farthest Neighbor) → cluster distance = largest distance between any two points, one from each cluster
  • Centroid-Link → cluster distance = distance between the cluster centroids

At each step, the pair of clusters with the smallest linkage distance is merged.
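A sketch of agglomerative clustering with SciPy, assuming it is installed; the `method` argument selects one of the linkages above:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 9]])

# method: "single", "complete", or "centroid"
# (centroid linkage assumes Euclidean distance)
Z = linkage(X, method="single")                  # merge history (tree data)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
# dendrogram(Z) would draw the tree with matplotlib.
```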

🔄 Comparison: K-Means vs Hierarchical Clustering

| Feature | K-Means 🎯 | Hierarchical 🌳 |
|---|---|---|
| Type | Partitional | Hierarchical |
| Speed | Faster | Slower |
| Number of Clusters (K) | Must be pre-set | Not needed up front (cut the dendrogram) |
| Best for | Large datasets | Small datasets |

🎯 Key Takeaways

K-Means → Best for large datasets, requires predefined K
Hierarchical Clustering → Best for small datasets, no need to pre-set K (cut the dendrogram to choose the clusters)
Elbow Method → Helps choose K



1. Clustering 🎯

  • What?: Grouping similar objects into clusters.
  • Goal: Maximize similarity within clusters, minimize similarity between clusters.
  • Types:
    • Partitional Clustering: Divide data into non-overlapping clusters (e.g., K-Means).
    • Hierarchical Clustering: Build a tree of clusters (e.g., Agglomerative, Divisive).

2. K-Means Clustering 🔢

  • Steps:
    1. Randomly initialize k centroids.
    2. Assign each point to the nearest centroid.
    3. Recalculate centroids as the mean of assigned points.
    4. Repeat until convergence (no change in centroids).
  • Termination:
    • Centroids stop changing.
    • Max iterations reached.
    • SSE (Sum of Squared Errors) stabilizes.

3. Hierarchical Clustering 🌳

  • Agglomerative (Bottom-Up):
    • Start with each point as a cluster.
    • Merge closest clusters iteratively.
  • Divisive (Top-Down):
    • Start with all points in one cluster.
    • Split clusters iteratively.
  • Dendrogram: Tree diagram showing cluster merges/splits.

4. Linkage (Cluster Distance) Measures 📏

  • Single Link: Distance between closest points in clusters.
  • Complete Link: Distance between farthest points in clusters.
  • Centroid: Distance between cluster centroids.

5. Key Concepts 🔑

  • SSE (Sum of Squared Errors): Measures clustering quality (lower = better).
  • Dendrogram: Visual representation of hierarchical clustering.
  • K-Means: Requires predefined k (number of clusters).
  • Hierarchical Clustering: No need for predefined k.

Mind Map 🧠

Clustering
├── Partitional Clustering
│   ├── K-Means
│   │   ├── Steps: Initialize, Assign, Recalculate, Repeat
│   │   └── Termination: Centroids, Max Iterations, SSE
│   └── Other: Fuzzy C-Means, QT Clustering
└── Hierarchical Clustering
    ├── Agglomerative (Bottom-Up)
    ├── Divisive (Top-Down)
    └── Dendrogram (Visualization)

Key Symbols 🔑

  • k: Number of clusters.
  • SSE: Sum of Squared Errors.
  • D: Distance matrix.
  • C: Cluster.

You’re ready! 🎉 Just remember K-Means = partitional, Hierarchical = tree of clusters, and Dendrogram = visualize merges/splits! 🚀