ML2 – Revision

Here’s a comprehensive and simplified explanation of all the concepts you asked about, broken down step by step! 🚀


1. Hyperplane 📏

  • What?: A decision boundary that separates data into classes in SVM.
  • Example: In 2D, it’s a line; in 3D, it’s a plane.

2. Overfitting vs. Underfitting 📊

  • Overfitting: Model learns noise in the training data (high variance).
  • Underfitting: Model is too simple to capture patterns (high bias).

3. Bias and Variance ⚖️

  • Bias: Error due to overly simplistic assumptions.
  • Variance: Error due to sensitivity to small fluctuations in the training set.

4. Margins 📏

  • What?: Distance between the hyperplane and the closest data points.
  • Hard Margin: No misclassifications allowed.
  • Soft Margin: Allows misclassifications using slack variables (ξ).

5. When to Allow Misclassification? 🛠️

  • When data is not linearly separable.
  • Use soft margin with slack variables (ξ).

6. Objective Function in SVM 🎯

  • What to Minimize?:
    \frac{1}{2} ||w||^2 + C \sum \xi_i
    
    • ||w||²: Minimizing this term maximizes the margin (margin width = 2/||w||).
    • C ∑ ξ_i: Penalizes misclassifications.

7. Constraints in SVM 🔗

  • Hard Margin:
    y_i(w · x_i + b) ≥ 1
    
  • Soft Margin:
    y_i(w · x_i + b) ≥ 1 - \xi_i, \xi_i ≥ 0
    

8. Lagrangian in SVM 🌀

  • What?: Combines the objective function with constraints.
  • Formula:
    L(w, b, \alpha) = \frac{1}{2} ||w||^2 - \sum \alpha_i [y_i(w · x_i + b) - 1]
    
    • α_i: Lagrange multipliers (α_i ≥ 0, one per training constraint).

9. Loss Function in SVM (Hinge Loss) 📉

  • Formula:
    L(y, f(x)) = \max(0, 1 - y \cdot f(x))
    
    • y: Actual class (-1 or 1).
    • f(x): Model’s prediction.
  • When is it 0?:
    • When the prediction is correct and confident (y · f(x) ≥ 1).
  • When is it affected?:
    • When the prediction is incorrect or not confident (y · f(x) < 1).
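
A minimal NumPy sketch of the hinge loss above, using made-up labels and scores just to show when the loss is zero versus positive:

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

y  = np.array([1, 1, -1, -1])            # true labels
fx = np.array([2.0, 0.3, -1.5, 0.4])     # model scores f(x)

print(hinge_loss(y, fx))  # [0.  0.7 0.  1.4] -> 0 only when y*f(x) >= 1
```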

10. Slack Variables (ξ) 🛠️

  • What?: Allow misclassifications in SVM.
  • Values:
    • 0 < ξ ≤ 1: Point is correctly classified but within the margin.
    • ξ > 1: Point is misclassified.

11. Regularization Parameter (C) ⚖️

  • What?: Controls the trade-off between maximizing the margin and minimizing classification errors.
  • If C is large:
    • Strict classification (small margin, fewer misclassifications).
  • If C is small:
    • Lenient classification (large margin, more misclassifications).
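
A short scikit-learn sketch (toy blob data, illustrative C values) showing how C moves the margin/error trade-off; the support-vector count is one visible symptom:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wider margin, more tolerated violations (more support vectors);
    # large C -> narrower margin, stricter fit (fewer support vectors).
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}")
```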

12. Loss Functions for Common Models 📊

| Model | Loss Function |
|---|---|
| Linear Regression | Mean Squared Error (MSE): L(y, f(x)) = (y - f(x))² |
| Logistic Regression | Log Loss: L(y, f(x)) = -[y log(f(x)) + (1 - y) log(1 - f(x))] |
| SVM | Hinge Loss: L(y, f(x)) = max(0, 1 - y · f(x)) |
| Decision Trees | Gini Index or Entropy (classification), MSE (regression) |
| Neural Networks | Cross-Entropy (classification), MSE (regression) |

13. Hard Margin vs. Soft Margin Equations 📏

  • Hard Margin:
    \text{Minimize } \frac{1}{2} ||w||^2
    \text{Subject to } y_i(w · x_i + b) ≥ 1
    
  • Soft Margin:
    \text{Minimize } \frac{1}{2} ||w||^2 + C \sum \xi_i
    \text{Subject to } y_i(w · x_i + b) ≥ 1 - \xi_i, \xi_i ≥ 0
    

14. Handling Multi-Class Data 🎯

  • What?: SVM is binary, but can handle multi-class data using:
    • One-vs-One: Train a classifier for every pair of classes.
    • One-vs-All: Train a classifier for each class vs the rest.
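
A hedged scikit-learn sketch of both strategies on the 3-class Iris dataset (any multi-class dataset would do):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one classifier per pair of classes
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one classifier per class vs. the rest

# OvO trains k(k-1)/2 classifiers, OvR trains k; for k=3 both happen to be 3
print(len(ovo.estimators_), len(ovr.estimators_))
```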

15. Kernel Trick 🌀

  • What?: Transforms data into higher dimensions without explicit computation.
  • Types of Kernels:
    • Linear Kernel: K(x_i, x_j) = x_i · x_j
    • Polynomial Kernel: K(x_i, x_j) = (x_i · x_j + r)^d
      • d: Degree of polynomial.
    • RBF (Gaussian) Kernel: K(x_i, x_j) = exp(-γ ||x_i - x_j||²)
      • γ: Controls the influence of each point.
    • Sigmoid Kernel: K(x_i, x_j) = tanh(η x_i · x_j + ν)
      • η, ν: Parameters.
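
A small NumPy sketch that evaluates these four kernel formulas for two arbitrary vectors; the hyperparameter values (r, d, γ, η, ν) are purely illustrative:

```python
import numpy as np

xi = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])
r, d, gamma, eta, nu = 1.0, 3, 0.5, 0.1, 0.0   # illustrative hyperparameters

linear  = xi @ xj                                   # K = x_i · x_j
poly    = (xi @ xj + r) ** d                        # K = (x_i · x_j + r)^d
rbf     = np.exp(-gamma * np.sum((xi - xj) ** 2))   # K = exp(-γ ||x_i - x_j||²)
sigmoid = np.tanh(eta * (xi @ xj) + nu)             # K = tanh(η x_i · x_j + ν)

print(linear, poly, rbf, sigmoid)
```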

16. Gamma (γ) in RBF Kernel 🎚️

  • High γ:
    • Small radius of influence for each point (tight, wiggly decision boundary; risk of overfitting).
  • Low γ:
    • Large radius of influence for each point (smooth decision boundary; risk of underfitting).

17. Kernel Trick Concept 🧠

  • What?: Computes dot products in higher dimensions without explicit transformation.
  • Why?: Makes it possible to handle complex, non-linear data.

18. Decision Trees 🌳

  • How to Build?:
    • Choose the root based on Entropy or Gini Index.
    • Split data recursively based on the best feature.
  • Entropy:
    Entropy(S) = -\sum_{i=1}^c p_i \log_2(p_i)
    
  • Gini Index:
    Gini(S) = 1 - \sum_{i=1}^c p_i^2
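
A minimal NumPy sketch of the two impurity measures, evaluated on an example class distribution (9 positive / 5 negative samples):

```python
import numpy as np

def entropy(p):
    """Entropy(S) = -sum p_i * log2(p_i), treating 0*log(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini(S) = 1 - sum p_i^2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

p = [9/14, 5/14]   # class proportions at a node
print(round(entropy(p), 3), round(gini(p), 3))  # ~0.940, ~0.459
```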
    

19. Pre-Pruning vs. Post-Pruning ✂️

  • Pre-Pruning: Stop growing the tree early (e.g., limit depth).
  • Post-Pruning: Grow the full tree, then remove unnecessary branches.

20. Ensemble Learning 🤝

  • Bagging: Train multiple models on different subsets of data (e.g., Random Forest).
  • Boosting: Sequentially train models, focusing on misclassified samples (e.g., AdaBoost).
  • Stacking: Combine predictions of multiple models using a meta-model.

21. PCA (Principal Component Analysis) 📊

  • Steps:
    1. Standardize data.
    2. Compute covariance matrix.
    3. Find eigenvalues and eigenvectors.
    4. Select top PCs (highest eigenvalues).
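
A minimal NumPy sketch of these four steps on random toy data (not from the lectures):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy data: 100 samples, 3 features

# 1. Standardize (center, then scale by the standard deviation)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (features x features)
cov = np.cov(Z, rowvar=False)

# 3. Eigenvalues / eigenvectors (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Keep the top-k components (largest eigenvalues) and project the data
k = 2
order = np.argsort(eigvals)[::-1][:k]
X_pca = Z @ eigvecs[:, order]          # projected data, shape (100, 2)
print(X_pca.shape)
```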

22. Feature Selection Methods 🎯

  • Filter Methods: Use statistical measures (e.g., correlation, mutual information).
  • Wrapper Methods: Use a model to evaluate feature subsets (e.g., forward selection).
  • Embedded Methods: Feature selection during model training (e.g., Lasso).
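
A brief scikit-learn sketch of a filter method and an embedded method on a built-in dataset; the thresholds (k=5, alpha=0.01) are arbitrary, and the 0/1 label is treated as a regression target in the Lasso part purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import Lasso

X, y = load_breast_cancer(return_X_y=True)

# Filter method: rank features by mutual information with the target, keep the top 5
filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("filter keeps:", filt.get_support().sum(), "features")

# Embedded method: Lasso's L1 penalty drives some coefficients to exactly zero
lasso = Lasso(alpha=0.01, max_iter=10000).fit(X, y)
print("lasso keeps:", (lasso.coef_ != 0).sum(), "features")
```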

23. Mutual Information (MI) 🔗

  • What?: Measures dependency between features and target.
  • MI = 0: Feature and target are independent (no dependency).
  • Higher MI: Stronger dependency (MI is non-negative and has no fixed upper bound).

24. Recursive Feature Elimination (RFE) 🔄

  • What?: Iteratively removes the least important features.
  • Stops when: Desired number of features is reached.
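
A minimal RFE sketch with scikit-learn, using a decision tree as the underlying estimator (any model exposing coefficients or feature importances would work):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Recursively drop the least important feature until 5 remain
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=5).fit(X, y)
print(rfe.support_.sum())   # 5 features kept
print(rfe.ranking_[:10])    # 1 = kept; larger numbers were eliminated earlier
```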

25. PCA for Categorical Data 🌀

  • What?: PCA assumes numeric features; for categorical data use MCA (Multiple Correspondence Analysis), or a non-linear method such as t-SNE for visualization.

26. Centralization in PCA 🎯

  • What?: Subtract the mean from each feature.
    Z = X - \mu
    

27. Eigenvalues and Eigenvectors 🌀

  • What?: Solve:
    \text{det}(\text{Covariance Matrix} - \lambda I) = 0
    
    • λ: Eigenvalues (the roots of the equation above).
    • v: Eigenvectors, found by solving (Covariance Matrix - λI)v = 0 for each λ.

Mind Map 🧠

Machine Learning Concepts
├── SVM
│   ├── Hyperplane
│   ├── Margins (Hard/Soft)
│   ├── Loss Function (Hinge Loss)
│   ├── Slack Variables (ξ)
│   ├── Regularization (C)
│   └── Kernel Trick
├── Decision Trees
│   ├── Entropy
│   ├── Gini Index
│   ├── Pre-Pruning vs. Post-Pruning
│   └── If-Then Rules
├── Ensemble Learning
│   ├── Bagging
│   ├── Boosting
│   └── Stacking
├── PCA
│   ├── Centralization
│   ├── Eigenvalues/Eigenvectors
│   └── Steps
└── Feature Selection
    ├── Filter Methods
    ├── Wrapper Methods
    └── Embedded Methods

You’re ready! 🎉 Just follow these steps, and you’ll be able to solve any equation from the lectures! 🚀




Here’s a simplified and clear explanation of the concepts you asked about in Chapter 6, focusing on types of learners, base learners, diversity, aggregation methods, ensemble types (homogeneous vs. heterogeneous), bagging, boosting, stacking, and their differences. Let’s break it down step by step! 🚀


1. Types of Learners 🎓

  • Base Learner: A simple model (e.g., decision tree) used in ensemble methods.
  • Strong Learner: A complex model (e.g., deep neural network) that performs well on its own.
  • Weak Learner: A simple model (e.g., decision stump) that performs slightly better than random guessing.

2. Base Learner 🛠️

  • What?: A simple model used in ensemble methods.
  • Example: Decision stump (a decision tree with only one split).

3. Diversity in Learners 🌈

  • What?: Ensures models make different errors.
  • How?:
    • Train on different subsets of data (e.g., bagging).
    • Use different algorithms (e.g., stacking).
    • Use different hyperparameters.

4. Aggregation Methods 🤝

  • What?: Combine predictions from multiple models.
  • Types:
    • Averaging: For regression (e.g., bagging).
    • Majority Voting: For classification (e.g., bagging).
    • Weighted Aggregation: Assign different weights to models (e.g., boosting).
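
A tiny NumPy sketch of the three aggregation styles, with made-up predictions from three hypothetical models:

```python
import numpy as np

# Predictions from three hypothetical models for the same 4 samples
reg_preds = np.array([[2.0, 3.1, 0.9, 5.2],
                      [1.8, 3.0, 1.1, 5.0],
                      [2.2, 2.9, 1.0, 4.8]])
clf_preds = np.array([[1, 0, 1, 1],
                      [1, 1, 0, 1],
                      [0, 0, 1, 1]])
weights = np.array([0.5, 0.3, 0.2])   # e.g., learner weights from boosting

averaged = reg_preds.mean(axis=0)                      # averaging (regression)
majority = (clf_preds.sum(axis=0) >= 2).astype(int)    # majority vote (classification)
weighted = weights @ reg_preds                         # weighted aggregation

print(averaged, majority, weighted)
```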

5. Types of Ensembles 🎯

  • Homogeneous Ensembles:
    • Use the same type of base learner (e.g., all decision trees).
    • Example: Random Forest (bagging with decision trees).
  • Heterogeneous Ensembles:
    • Use different types of base learners (e.g., decision trees, SVMs, neural networks).
    • Example: Stacking (combining predictions from different models).

6. Bagging (Bootstrap Aggregating) 🎒

  • What?: Train multiple models on different subsets of data (sampled with replacement).
  • Aggregation: Average (regression) or majority vote (classification).
  • Example: Random Forest.
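
A short scikit-learn sketch, assuming a toy synthetic dataset; it contrasts plain bagging of trees with a Random Forest (which adds random feature subsets at each split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Generic bagging: bootstrap samples + decision trees, majority vote at predict time
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0).fit(X, y)

# Random Forest = bagging of trees + a random subset of features at each split
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), rf.score(X, y))
```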

7. Boosting 🚀

  • What?: Sequentially train models, focusing on misclassified samples.
  • Aggregation: Weighted sum of predictions.
  • Example: AdaBoost, Gradient Boosting.
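
A brief scikit-learn sketch of boosting with AdaBoost on the same kind of toy data; decision stumps are the usual weak learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Decision stumps (max_depth=1) trained sequentially; later stumps focus on
# samples the earlier ones got wrong, and predictions are combined by weighted vote.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0).fit(X, y)
print(ada.score(X, y))
```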

8. Stacking 🥞

  • What?: Combine predictions of multiple models using a meta-model.
  • How?:
    1. Train base models (level-0).
    2. Use their predictions as input to train a meta-model (level-1).
  • Example: Combine predictions from decision trees, SVMs, and neural networks.
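
A minimal scikit-learn stacking sketch; the base models and the meta-model here are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Level-0: heterogeneous base models; level-1: logistic regression meta-model
# (scikit-learn builds the meta-model's inputs via cross-validated predictions)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),
).fit(X, y)
print(stack.score(X, y))
```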

9. Aggregation in Bagging vs. Stacking 🤔

  • Bagging:
    • Aggregation happens by averaging or majority voting.
  • Stacking:
    • Aggregation happens by training a meta-model on base model predictions.

10. Regression Equation 📈

  • What?: Predicts a continuous value.
  • Formula:
    y = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
    
    • w: Weights.
    • x: Features.
    • b: Bias.
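
A one-line sketch of the prediction formula with NumPy (weights, bias, and features are made up):

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])   # weights (illustrative)
b = 0.7                          # bias
x = np.array([2.0, 1.0, 0.5])    # one sample's features

y_hat = w @ x + b                # y = w1*x1 + w2*x2 + ... + wn*xn + b
print(y_hat)                     # 2.0
```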

11. Bagging vs. Boosting ⚖️

  • Bagging:
    • Reduces variance.
    • Parallel training of models.
    • Example: Random Forest.
  • Boosting:
    • Reduces bias.
    • Sequential training of models.
    • Example: AdaBoost.

12. Summary Table 📊

| Method | Base Learners | Training | Aggregation | Goal |
|---|---|---|---|---|
| Bagging | Homogeneous | Parallel | Averaging / Majority voting | Reduce variance |
| Boosting | Homogeneous | Sequential | Weighted sum | Reduce bias |
| Stacking | Heterogeneous | Parallel base models + meta-model | Meta-model | Improve accuracy |

13. AdaBoost 🚀

  • What?: Focuses on misclassified samples by increasing their weights.
  • How?:
    1. Initialize equal weights for all samples.
    2. Train a model, increase weights of misclassified samples.
    3. Repeat for a set number of rounds, then combine the weak learners with a weighted vote (lower-error learners get larger weights).
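
A minimal NumPy sketch of one boosting round's weight update, using the classic AdaBoost formulas and made-up predictions containing a single mistake:

```python
import numpy as np

y      = np.array([ 1, -1,  1,  1, -1])   # true labels
y_pred = np.array([ 1, -1, -1,  1, -1])   # weak learner's predictions (one mistake)
w      = np.full(len(y), 1 / len(y))      # 1. start with equal sample weights

err   = np.sum(w[y_pred != y])            # weighted error (0.2 here)
alpha = 0.5 * np.log((1 - err) / err)     # this learner's vote weight

# 2. increase weights of misclassified samples, decrease the rest, renormalize
w = w * np.exp(-alpha * y * y_pred)
w = w / w.sum()
print(alpha, w)   # misclassified sample now carries weight 0.5
```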

Mind Map 🧠

Ensemble Methods
├── Types of Learners
│   ├── Base Learner
│   ├── Strong Learner
│   └── Weak Learner
├── Diversity
├── Aggregation Methods
│   ├── Averaging
│   ├── Majority Voting
│   └── Weighted Aggregation
├── Types of Ensembles
│   ├── Homogeneous (e.g., Random Forest)
│   └── Heterogeneous (e.g., Stacking)
├── Bagging
├── Boosting
└── Stacking

How to Solve Equations 🛠️

  1. Bagging:
    • Train multiple models on different subsets of data.
    • Aggregate predictions by averaging or majority voting.
  2. Boosting:
    • Sequentially train models, focusing on misclassified samples.
    • Aggregate predictions using a weighted sum.
  3. Stacking:
    • Train base models, then use their predictions to train a meta-model.

You’re ready! 🎉 Just follow these steps, and you’ll be able to solve any equation from the lectures! 🚀