ML2 - Lec (3)

1. Decision Trees 🌳

  • What?: A tree-like model for classification/regression (see the sketch after this list).
  • Goal: Build the smallest possible tree that fits the data.
  • Nodes: Test attributes.
  • Branches: Attribute values.
  • Leaves: Class labels or predictions.
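
To make the pieces concrete, here is a minimal sketch using scikit-learn (a library choice assumed here, not prescribed by the lecture) that fits a tiny tree and prints it, so the nodes (attribute tests), branches (attribute values), and leaves (class labels) are visible:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: two binary attributes; class is 1 only when both are 1 (AND).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Internal nodes test attributes, branches cover value ranges,
# leaves carry the predicted class.
print(export_text(tree, feature_names=["attr_a", "attr_b"]))
```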

2. ID3 Algorithm 🛠️

  • Steps (sketched in code after this list):
    1. Start at the root.
    2. Choose the best attribute (max info gain).
    3. Split data based on attribute values.
    4. Repeat for each branch.
  • Stopping Criteria:
    • All examples in a branch are the same class.
    • No more attributes left to split on.
    • Assign majority class if no data.
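
A compact Python sketch of these steps (my own rendering, not the lecture's code; it relies on the info_gain helper defined in section 3's sketch below):

```python
from collections import Counter

def id3(examples, attributes, default):
    """ID3 sketch. examples: list of (features_dict, label) pairs;
    attributes: list of feature names; default: parent's majority class.
    Uses info_gain() from the section 3 sketch below."""
    if not examples:                      # no data left: parent's majority class
        return default
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:             # pure branch: all one class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                    # no attributes left to split on
        return majority

    best = max(attributes, key=lambda a: info_gain(examples, a))  # max info gain
    remaining = [a for a in attributes if a != best]
    node = {"test": best, "branches": {}}
    for value in {f[best] for f, _ in examples}:   # one branch per value
        subset = [(f, l) for f, l in examples if f[best] == value]
        node["branches"][value] = id3(subset, remaining, majority)
    return node
```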

3. Entropy & Information Gain 📊

  • Entropy: Measures impurity/uncertainty.
    Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
    
    • p_+: Proportion of positive examples.
    • p_-: Proportion of negative examples.
  • Information Gain (computed in the sketch after this list):
    Gain(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|} Entropy(S_v)
    
    • A: Attribute.
    • S_v: Subset of data for value v.
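
Both formulas translate directly to Python; a minimal sketch, with examples as (features_dict, label) pairs as in the ID3 sketch above:

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions.
    Handles two or more classes (see the multiclass section below)."""
    n = len(examples)
    counts = Counter(label for _, label in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(examples)
    expected = 0.0
    for value in {f[attribute] for f, _ in examples}:
        subset = [(f, l) for f, l in examples if f[attribute] == value]
        expected += (len(subset) / n) * entropy(subset)
    return entropy(examples) - expected
```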

4. Overfitting & Pruning ✂️

  • Overfitting: Tree too complex → fits noise.
  • Pruning (see the sketch after this list):
    • Pre-pruning: Stop early (e.g., min samples per leaf).
    • Post-pruning: Grow full tree, then remove nodes.
  • Goal: Simplify tree to improve generalization.
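
In scikit-learn terms (again an assumed library, for illustration), pre-pruning maps to growth caps such as min_samples_leaf, and post-pruning to cost-complexity pruning via ccp_alpha:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop early by requiring a minimum number of samples per leaf.
pre_pruned = DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)

# Post-pruning: grow the full tree, then cut back subtrees whose accuracy
# gain does not justify their complexity (pick ccp_alpha by cross-validation).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```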

5. Extensions 🔄

  • Continuous Attributes: Discretize using thresholds.
  • Missing Values: Use most frequent value or probability estimates.
  • Cost-Sensitive Attributes: Modify gain to account for feature costs.
  • Regression Trees: Predict numeric values (average in leaves).

6. Key Concepts 🔑

  • Gini Index: Alternative to entropy for impurity (see the sketch after this list).
    Gini(S) = 1 - \sum p_i^2
    
  • Gain Ratio: Adjusts info gain to penalize many-valued attributes.
    GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}

    • SplitInformation(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} (the entropy of the split itself).
  • Multivariate Trees: Use linear combinations of attributes.
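
A sketch of both measures, reusing info_gain() from the section 3 sketch (SplitInformation is just the entropy of the partition itself):

```python
import math
from collections import Counter

def gini(examples):
    """Gini(S) = 1 - sum_i p_i^2 over the class proportions."""
    n = len(examples)
    counts = Counter(label for _, label in examples)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_information(examples, attribute):
    """SplitInformation(S, A) = -sum_v |S_v|/|S| * log2(|S_v|/|S|)."""
    n = len(examples)
    counts = Counter(f[attribute] for f, _ in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    si = split_information(examples, attribute)
    return info_gain(examples, attribute) / si if si > 0 else 0.0
```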

Mind Map 🧠

Decision Trees
├── ID3 Algorithm
│   ├── Entropy (impurity measure)
│   ├── Information Gain (choose best attribute)
│   └── Stopping Criteria (pure branch, no attributes)
├── Overfitting
│   ├── Pre-pruning (stop early)
│   └── Post-pruning (grow full, then cut)
└── Extensions
    ├── Continuous Attributes (discretize)
    ├── Missing Values (use most frequent)
    ├── Regression Trees (predict numeric values)
    └── Multivariate Trees (linear combinations)

Key Symbols 🔑

  • S: Dataset.
  • A: Attribute.
  • p_+: Proportion of positive examples.
  • p_-: Proportion of negative examples.
  • Gain(S, A): Information gain for attribute A.
  • Gini(S): Gini impurity for dataset S.

You’re ready! 🎉 Just remember Decision Trees = split data based on attributes, Entropy = measure of impurity, and Pruning = avoid overfitting! 🚀


1. Decision Trees Extensions 🌳

  • Gain Ratio: Adjusts info gain to penalize attributes with many values.
    GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}
    
  • Continuous Attributes: Discretize using thresholds (e.g., Temperature > 54; see the threshold-search sketch below).
  • Missing Values: Use most frequent value or probability estimates.
  • Cost-Sensitive Attributes: Modify gain to account for feature costs.
    Gain2(S, A) = \frac{Gain(S, A)^2}{Cost(A)}
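
For the continuous-attribute case above, a common C4.5-style approach (sketched here with the entropy() helper from section 3 of the first part) is to try the midpoints between adjacent distinct sorted values and keep the threshold with the highest gain:

```python
def best_threshold(values, labels):
    """Return (threshold, gain) maximizing info gain for a numeric attribute.
    Candidate thresholds are midpoints between adjacent distinct values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([(None, l) for _, l in pairs])
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # same value: no boundary
        t = (pairs[i - 1][0] + pairs[i][0]) / 2       # e.g. Temperature > 54
        left = [(None, l) for v, l in pairs if v <= t]
        right = [(None, l) for v, l in pairs if v > t]
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```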
    

2. Multiclass Classification 🎯

  • Entropy for Multiple Classes (worked example after this list):
    Entropy(S) = -\sum_{i=1}^c p_i \log_2 p_i
    
    • c: Number of classes.
    • p_i: Proportion of class i.
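
The entropy() sketch from section 3 of the first part already covers c classes; a quick worked check with three classes in proportions 1/2, 1/4, 1/4:

```python
# -(0.5*log2 0.5 + 0.25*log2 0.25 + 0.25*log2 0.25) = 0.5 + 0.5 + 0.5 = 1.5 bits
data = [(None, "a"), (None, "a"), (None, "b"), (None, "c")]
print(entropy(data))  # 1.5
```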

3. Regression Trees 📈

  • Goal: Predict continuous values.
  • Splitting Criterion: Minimize variance via standard deviation reduction (see the sketch after this list).
    SDR(S, A) = SD(S) - \sum_{v} \frac{|S_v|}{|S|} SD(S_v)
    
  • Prediction: Mean value in leaf nodes.
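
A minimal SDR sketch, with examples as (features_dict, numeric_target) pairs and population standard deviation from the standard library:

```python
import statistics

def sdr(examples, attribute):
    """SDR(S, A) = SD(S) - sum_v |S_v|/|S| * SD(S_v) for numeric targets."""
    n = len(examples)
    targets = [y for _, y in examples]
    weighted = 0.0
    for value in {f[attribute] for f, _ in examples}:
        subset = [y for f, y in examples if f[attribute] == value]
        weighted += (len(subset) / n) * statistics.pstdev(subset)
    return statistics.pstdev(targets) - weighted

# Prediction at a leaf is just the mean of its targets:
# statistics.mean(leaf_targets)
```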

4. CART (Classification and Regression Trees) 🛠️

  • Gini Index: Measures impurity (see the sketch after this list).
    Gini(S) = 1 - \sum_{i=1}^c p_i^2
    
  • Weighted Gini:
    Gini_{split} = \frac{N_1}{N} Gini(S_1) + \frac{N_2}{N} Gini(S_2)
    
  • Regression: Use Mean Squared Error (MSE) for splitting.
    MSE = \frac{1}{N} \sum (y_i - \hat{y})^2
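
A sketch of both CART split criteria (gini() comes from the section 6 sketch in the first part):

```python
def weighted_gini(left_labels, right_labels):
    """Gini_split = N1/N * Gini(S1) + N2/N * Gini(S2) for a binary split."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    return (n1 / n) * gini([(None, l) for l in left_labels]) \
         + (n2 / n) * gini([(None, l) for l in right_labels])

def mse(targets):
    """MSE = (1/N) * sum (y_i - y_hat)^2, with y_hat the leaf mean."""
    y_hat = sum(targets) / len(targets)
    return sum((y - y_hat) ** 2 for y in targets) / len(targets)
```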
    

5. Key Concepts 🔑

  • Gain Ratio: Penalizes attributes with many values.
  • Continuous Attributes: Discretize using thresholds.
  • Missing Values: Use most frequent value or probability estimates.
  • Regression Trees: Predict numeric values (mean in leaves).
  • CART: Uses Gini Index for classification, MSE for regression.

Mind Map 🧠

Decision Trees Extensions
├── Gain Ratio (penalize many-valued attributes)
├── Continuous Attributes (discretize using thresholds)
├── Missing Values (use most frequent or probability)
├── Cost-Sensitive Attributes (modify gain with cost)
├── Multiclass Classification (entropy for multiple classes)
└── Regression Trees
    ├── Splitting Criterion (minimize variance)
    ├── Prediction (mean in leaves)
    └── CART (Gini Index for classification, MSE for regression)

Key Symbols 🔑

  • S: Dataset.
  • A: Attribute.
  • Gain(S, A): Information gain for attribute A.
  • Gini(S): Gini impurity for dataset S.
  • MSE: Mean Squared Error (for regression).

You’re ready! 🎉 Just remember Decision Trees = split data based on attributes, Gain Ratio = penalize many-valued attributes, and Regression Trees = predict numeric values! 🚀