ML2 ‐ Lec (8)

📊 Clustering

Definition: Grouping similar objects into clusters.
Key Idea: Objects within a cluster should be similar, while objects in different clusters should be dissimilar.

🏆 Why Clustering?

  • Unsupervised Learning (No labeled data)
  • Finds Natural Patterns
  • Used in:
    • 🔍 Search engines (grouping results)
    • 📂 Organizing documents
    • 🏥 Medical diagnosis

🔍 Classification vs. Clustering

| Feature | Classification 🎯 | Clustering 🔗 |
|---|---|---|
| Learning Type | Supervised (labeled data) | Unsupervised (no labels) |
| Goal | Predict class labels | Group similar instances |
| Example | Spam detection (spam or not) | Customer segmentation |

🎯 Goal of Clustering

Maximize similarity within a cluster
Minimize similarity between clusters

📌 Two Key Variations:

  • Within-cluster variation (WCV) → Minimize 🚫
  • Between-cluster variation (BCV) → Maximize ✅

🏗 Types of Clustering

1️⃣ Partitional Clustering

  • All clusters are formed at once (data divided into non-overlapping groups)
  • Examples:
    • K-Means Clustering
    • Fuzzy C-Means
    • QT Clustering

2️⃣ Hierarchical Clustering

  • Agglomerative (Bottom-Up): Merge clusters until one remains
  • Divisive (Top-Down): Start with one cluster and split recursively

🎭 Hard vs. Soft Clustering

Hard Clustering: Each object belongs to one cluster.
Soft Clustering: Objects can belong to multiple clusters.

Example:

  • 👟 Sneakers in (1) Sports Apparel & (2) Shoes categories → Soft Clustering
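As an illustration (not from the lecture), here is a minimal sketch of the hard/soft distinction, assuming scikit-learn is available: KMeans assigns one hard label per point, while a Gaussian mixture's `predict_proba` gives soft per-cluster membership probabilities.

```python
# Hard vs. soft assignments; a sketch assuming scikit-learn is installed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [3.0, 3.0]])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard)            # one label per point (hard clustering)
print(soft.round(2))   # one probability per (point, cluster) pair (soft)
```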

📏 How to Measure Similarity?

  • Euclidean Distance 📏 (straight-line distance; Minkowski with p = 2)
  • City-block Distance 🏙 (sum of absolute differences; Minkowski with p = 1)
  • Minkowski Distance 🔢 (generalization with order parameter p)
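A small sketch of these measures, assuming NumPy vectors; Minkowski with p = 2 reduces to Euclidean and p = 1 to city-block:

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance of order p between two vectors
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(a, b, 2))  # Euclidean: 5.0
print(minkowski(a, b, 1))  # City-block (Manhattan): 7.0
print(minkowski(a, b, 3))  # Minkowski with p = 3
```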


🎯 K-Means Clustering

Goal: Partition data into K clusters.
Steps:
1️⃣ Initialize K random cluster centers.
2️⃣ Assign each point to the nearest cluster.
3️⃣ Recalculate centroids.
4️⃣ Repeat until centroids stop changing.

📌 Termination Conditions:

  • Centroids stop moving
  • Max iterations reached
  • Sum of Squared Errors (SSE) stabilizes

📌 Formula for SSE:

$$SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$$

where $\mu_i$ is the centroid of cluster $C_i$.
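A minimal NumPy sketch of the loop and the SSE above, for illustration only; real implementations (e.g. scikit-learn's KMeans) add smarter initialization such as k-means++ and handle empty clusters:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1) init
    for _ in range(max_iters):
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)                          # 2) assign
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                        # 4) stop when
            break                                              #    unchanged
        centroids = new                                        # 3) update
    # SSE: squared distance of each point to its cluster centroid
    sse = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return centroids, labels, sse
```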


🏆 How to Choose K?

✔ Try different values of K
✔ For a fixed K, prefer the run with the smallest SSE (SSE always decreases as K grows, so it cannot choose K by itself)
✔ Use the Elbow Method 📉 (find where SSE stops decreasing significantly)
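A sketch of the elbow method, assuming scikit-learn and matplotlib are available; scikit-learn's KMeans exposes the final SSE as `inertia_`:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")       # look for the "elbow" in this curve
plt.xlabel("K")
plt.ylabel("SSE (inertia)")
plt.show()
```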


📊 Hierarchical Clustering

Builds a hierarchy of clusters.
Creates a dendrogram (tree structure) 🌳
Does not require pre-setting K.

📌 Types:

  • Agglomerative (Bottom-Up): Merge clusters
  • Divisive (Top-Down): Split clusters

📌 Linkage Methods:

  • Single-Link (Nearest Neighbor) → cluster distance = smallest distance between any two points, one from each cluster
  • Complete-Link (Farthest Neighbor) → cluster distance = largest distance between any two points, one from each cluster
  • Centroid-Link → cluster distance = distance between the cluster centroids

At each step, the pair of clusters with the smallest linkage distance is merged.
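A sketch of agglomerative clustering with SciPy, assuming it is installed; the `method` argument selects one of the linkages above:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 9]])

# method: "single", "complete", or "centroid"
# (centroid linkage assumes Euclidean distance)
Z = linkage(X, method="single")                  # merge history (tree data)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
# dendrogram(Z) would draw the tree with matplotlib.
```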

🔄 Comparison: K-Means vs Hierarchical Clustering

| Feature | K-Means 🎯 | Hierarchical 🌳 |
|---|---|---|
| Type | Partitional | Hierarchical |
| Speed | Faster | Slower |
| Number of Clusters (K) | Must be pre-set | Not needed up front (cut the dendrogram) |
| Best for | Large datasets | Small datasets |

🎯 Key Takeaways

K-Means → Best for large datasets, requires predefined K
Hierarchical Clustering → Best for small datasets, no need to pre-set K (cut the dendrogram to choose the clusters)
Elbow Method → Helps choose K



1. Clustering 🎯

  • What?: Grouping similar objects into clusters.
  • Goal: Maximize similarity within clusters, minimize similarity between clusters.
  • Types:
    • Partitional Clustering: Divide data into non-overlapping clusters (e.g., K-Means).
    • Hierarchical Clustering: Build a tree of clusters (e.g., Agglomerative, Divisive).

2. K-Means Clustering 🔢

  • Steps:
    1. Randomly initialize k centroids.
    2. Assign each point to the nearest centroid.
    3. Recalculate centroids as the mean of assigned points.
    4. Repeat until convergence (no change in centroids).
  • Termination:
    • Centroids stop changing.
    • Max iterations reached.
    • SSE (Sum of Squared Errors) stabilizes.

3. Hierarchical Clustering 🌳

  • Agglomerative (Bottom-Up):
    • Start with each point as a cluster.
    • Merge closest clusters iteratively.
  • Divisive (Top-Down):
    • Start with all points in one cluster.
    • Split clusters iteratively.
  • Dendrogram: Tree diagram showing cluster merges/splits.

4. Linkage (Cluster Distance) Measures 📏

  • Single Link: Distance between closest points in clusters.
  • Complete Link: Distance between farthest points in clusters.
  • Centroid: Distance between cluster centroids.

5. Key Concepts 🔑

  • SSE (Sum of Squared Errors): Measures clustering quality (lower = better).
  • Dendrogram: Visual representation of hierarchical clustering.
  • K-Means: Requires predefined k (number of clusters).
  • Hierarchical Clustering: No need for predefined k.

Mind Map 🧠

Clustering
├── Partitional Clustering
│   ├── K-Means
│   │   ├── Steps: Initialize, Assign, Recalculate, Repeat
│   │   └── Termination: Centroids, Max Iterations, SSE
│   └── Other: Fuzzy C-Means, QT Clustering
└── Hierarchical Clustering
    ├── Agglomerative (Bottom-Up)
    ├── Divisive (Top-Down)
    └── Dendrogram (Visualization)

Key Symbols 🔑

  • k: Number of clusters.
  • SSE: Sum of Squared Errors.
  • D: Distance matrix.
  • C: Cluster.

You’re ready! 🎉 Just remember K-Means = partitional, Hierarchical = tree of clusters, and Dendrogram = visualize merges/splits! 🚀