ML2 ‐ Lec (8)
📊 Clustering
Definition: Grouping similar objects into clusters.
Key Idea: Objects within a cluster should be similar, while objects in different clusters should be dissimilar.
🏆 Why Clustering?
- ✅ Unsupervised Learning (No labeled data)
- ✅ Finds Natural Patterns
- ✅ Used in:
  - 🔍 Search engines (grouping results)
  - 📂 Organizing documents
  - 🏥 Medical diagnosis
🔍 Classification vs. Clustering
Feature | Classification 🎯 | Clustering 🔗 |
---|---|---|
Learning Type | Supervised (labeled data) | Unsupervised (no labels) |
Goal | Predict class labels | Group similar instances |
Example | Spam detection (spam or not) | Customer segmentation |
🎯 Goal of Clustering
✔ Maximize similarity within a cluster
✔ Minimize similarity between clusters
📌 Two Key Variations:
- Within-cluster variation (WCV) → Minimize 🚫
- Between-cluster variation (BCV) → Maximize ✅
🏗 Types of Clustering
1️⃣ Partitional Clustering
- All clusters are formed at once (data split into non-overlapping groups)
- Examples:
- ✅ K-Means Clustering
- ✅ Fuzzy C-Means
- ✅ QT Clustering
2️⃣ Hierarchical Clustering
- Agglomerative (Bottom-Up): Merge clusters until one remains
- Divisive (Top-Down): Start with one cluster and split recursively
🎭 Hard vs. Soft Clustering
✔ Hard Clustering: Each object belongs to one cluster.
✔ Soft Clustering: Objects can belong to multiple clusters.
Example:
- 👟 Sneakers in (1) Sports Apparel & (2) Shoes categories → Soft Clustering ✅
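To see soft memberships in code, here's a minimal sketch using scikit-learn's `GaussianMixture` (one way to get soft assignments; Fuzzy C-Means from the list above is another). The points are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2-D points: two tight groups plus one in-between point
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [3.0, 3.0]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict(X))        # hard labels: exactly one cluster per point
print(gmm.predict_proba(X))  # soft labels: membership probability per cluster
```

The in-between point is the interesting one: under hard clustering it is forced into a single group, while `predict_proba` lets it share membership across both.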
📏 How to Measure Similarity?
✔ Euclidean Distance 📏
✔ City-block (Manhattan) Distance 🏙
✔ Minkowski Distance 🔢
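A minimal sketch of all three on two toy points — Minkowski with p=1 is city-block, with p=2 it is Euclidean:

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

print(minkowski(a, b, p=1))  # city-block: |1-4| + |2-6| = 7
print(minkowski(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2) = 5
```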
🎯 K-Means Clustering
✔ Goal: Partition data into K clusters.
✔ Steps:
1️⃣ Initialize K random cluster centers.
2️⃣ Assign each point to the nearest cluster center.
3️⃣ Recalculate centroids.
4️⃣ Repeat until centroids stop changing.
📌 Termination Conditions:
- Centroids stop moving
- Max iterations reached
- Sum of Squared Errors (SSE) stabilizes
📌 Formula for SSE:

$$SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$$

where $\mu_i$ is the centroid of cluster $C_i$.
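A from-scratch sketch of steps 1️⃣–4️⃣ and the SSE above (NumPy only, toy data; the `kmeans` helper is my own name, and it assumes no cluster ever goes empty, which is fine for a toy run):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Initialize K cluster centers by picking K random points from X
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2) Assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Recalculate each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) Stop when centroids stop changing (or max_iters is hit)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment + SSE with the settled centroids
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Toy usage: two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.9]])
labels, centroids, sse = kmeans(X, k=2)
print(labels, sse)
```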
🏆 How to Choose K?
✔ Try different values of K
✔ Select K with the smallest SSE
✔ Use Elbow Method 📉 (Find where SSE stops decreasing significantly)
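A quick sketch of the Elbow Method with scikit-learn — `inertia_` is scikit-learn's name for the SSE above. The toy data has three blobs, so the bend should show up near K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three blobs centered at (0,0), (5,5), (10,10)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

for k in range(1, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))  # plot k vs. SSE and look for the bend 📉
```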
📊 Hierarchical Clustering
✔ Builds a hierarchy of clusters.
✔ Creates a dendrogram (tree structure) 🌳
✔ Does not require pre-setting K.
📌 Types:
- Agglomerative (Bottom-Up): Merge clusters
- Divisive (Top-Down): Split clusters
📌 Linkage Methods:
- Single-Link (Nearest Neighbor) → cluster distance = closest pair of points between the two clusters
- Complete-Link (Farthest Neighbor) → cluster distance = farthest pair of points between the two clusters
- Centroid-Link → cluster distance = distance between the cluster centroids
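A short SciPy sketch: `linkage()` builds the hierarchy (`method="single"`, `"complete"`, or `"centroid"` picks one of the linkages above) and `dendrogram()` draws the tree 🌳. The data is made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [3, 3.5]])

Z = linkage(X, method="single")  # try "complete" or "centroid" too
dendrogram(Z)                    # leaves = points, merges = tree joins
plt.show()
```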
🔄 Comparison: K-Means vs Hierarchical Clustering
Feature | K-Means 🎯 | Hierarchical 🌳 |
---|---|---|
Type | Partitional | Hierarchical |
Speed | Faster | Slower |
Number of Clusters (K) | Must be pre-set | Not needed upfront (cut the dendrogram) |
Best for | Large datasets | Small datasets |
🎯 Key Takeaways
✅ K-Means → Best for large datasets, requires predefined K
✅ Hierarchical Clustering → Best for small datasets, no predefined K (cut the dendrogram at the desired level)
✅ Elbow Method → Helps choose K
1. Clustering 🎯
- What?: Grouping similar objects into clusters.
- Goal: Maximize similarity within clusters, minimize similarity between clusters.
- Types:
- Partitional Clustering: Divide data into non-overlapping clusters (e.g., K-Means).
- Hierarchical Clustering: Build a tree of clusters (e.g., Agglomerative, Divisive).
2. K-Means Clustering 🔢
- Steps:
- Randomly initialize k centroids.
- Assign each point to the nearest centroid.
- Recalculate centroids as the mean of assigned points.
- Repeat until convergence (no change in centroids).
- Termination:
- Centroids stop changing.
- Max iterations reached.
- SSE (Sum of Squared Errors) stabilizes.
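In scikit-learn these termination conditions map directly onto `KMeans` parameters — `max_iter` caps the iterations and `tol` is the "centroids stop changing" threshold. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 2))  # toy data

km = KMeans(
    n_clusters=3,
    max_iter=300,    # stop after at most this many iterations
    tol=1e-4,        # stop when centroid movement drops below this
    n_init=10,       # rerun 10 initializations, keep the lowest-SSE result
    random_state=0,
).fit(X)

print(km.n_iter_, km.inertia_)  # iterations actually used, final SSE
```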
3. Hierarchical Clustering 🌳
- Agglomerative (Bottom-Up):
- Start with each point as a cluster.
- Merge closest clusters iteratively.
- Divisive (Top-Down):
- Start with all points in one cluster.
- Split clusters iteratively.
- Dendrogram: Tree diagram showing cluster merges/splits.
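Because no k is preset, you get flat clusters by cutting the dendrogram at some height; SciPy's `fcluster` does exactly that (the threshold here is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.2, 4.9]])
Z = linkage(X, method="complete")

# Cut the tree at distance 2.0: points merged below that height stay together
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # e.g. [1 1 2 2]
```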
4. Linkage Measures 📏
- Single Link: distance between the closest points of the two clusters.
- Complete Link: distance between the farthest points of the two clusters.
- Centroid: distance between the cluster centroids.
5. Key Concepts 🔑
- SSE (Sum of Squared Errors): Measures clustering quality (lower = better).
- Dendrogram: Visual representation of hierarchical clustering.
- K-Means: Requires predefined k (number of clusters).
- Hierarchical Clustering: No need for predefined k.
Mind Map 🧠
```
Clustering
├── Partitional Clustering
│   ├── K-Means
│   │   ├── Steps: Initialize, Assign, Recalculate, Repeat
│   │   └── Termination: Centroids, Max Iterations, SSE
│   └── Other: Fuzzy C-Means, QT Clustering
└── Hierarchical Clustering
    ├── Agglomerative (Bottom-Up)
    ├── Divisive (Top-Down)
    └── Dendrogram (Visualization)
```
Key Symbols 🔑
- `k`: number of clusters.
- `SSE`: Sum of Squared Errors.
- `D`: distance matrix.
- `C`: a cluster.
You’re ready! 🎉 Just remember K-Means = partitional, Hierarchical = tree of clusters, and Dendrogram = visualize merges/splits! 🚀