ML2 ‐ Lec (3)
1. Decision Trees 🌳
- What?: A tree-like model for classification/regression.
- Goal: Build the smallest possible tree that fits the data.
- Nodes: Test attributes.
- Branches: Attribute values.
- Leaves: Class labels or predictions.
2. ID3 Algorithm 🛠️
- Steps:
  - Start at the root.
  - Choose the best attribute (max info gain).
  - Split the data based on the attribute's values.
  - Repeat recursively for each branch.
- Stopping Criteria:
  - All examples in a branch are the same class.
  - No more attributes to split on.
  - If a branch has no examples, assign the majority class of the parent.
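A tiny runnable Python sketch of ID3 for categorical attributes (the dict-based dataset layout, the `label` key, and the toy data are illustrative assumptions, not from the lecture):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, attr):
    """Entropy(S) minus the weighted entropy of the subsets S_v."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r["label"])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r["label"] for r in rows]) - remainder

def id3(rows, attributes):
    """Returns a nested dict tree, or a class label at the leaves."""
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1 or not attributes:  # stopping criteria
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a))  # max info gain
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree

data = [
    {"Outlook": "Sunny", "Windy": "no",  "label": "No"},
    {"Outlook": "Sunny", "Windy": "yes", "label": "No"},
    {"Outlook": "Rainy", "Windy": "no",  "label": "Yes"},
    {"Outlook": "Rainy", "Windy": "yes", "label": "No"},
]
print(id3(data, ["Outlook", "Windy"]))
# {'Outlook': {'Sunny': 'No', 'Rainy': {'Windy': {'no': 'Yes', 'yes': 'No'}}}}
```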
3. Entropy & Information Gain 📊
- Entropy: Measures impurity/uncertainty.
Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
  - p_+: Proportion of positive examples.
  - p_-: Proportion of negative examples.
- Information Gain:
Gain(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|} Entropy(S_v)
  - A: The attribute being tested.
  - S_v: Subset of the data for value v of A.
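Worked example with the classic PlayTennis counts (9 positive, 5 negative; Wind splits them 8/6 with sub-entropies 0.811 and 1.0):
Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940
Gain(S, Wind) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.0) \approx 0.048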
4. Overfitting & Pruning ✂️
- Overfitting: Tree too complex → fits noise.
- Pruning:
  - Pre-pruning: Stop early (e.g., min samples per leaf).
  - Post-pruning: Grow full tree, then remove nodes.
- Goal: Simplify tree to improve generalization.
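A quick sketch of both pruning styles, assuming scikit-learn is available (the iris dataset and the parameter values are just stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain growth up front with hyperparameters.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: compute the cost-complexity path of the full tree,
# then refit with a chosen alpha to cut weak subtrees back.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post = DecisionTreeClassifier(random_state=0,
                              ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre.get_depth(), post.get_depth())  # the pruned trees are shallower
```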
5. Extensions 🔄
- Continuous Attributes: Discretize using thresholds.
- Missing Values: Use most frequent value or probability estimates.
- Cost-Sensitive Attributes: Modify gain to account for feature costs.
- Regression Trees: Predict numeric values (average in leaves).
6. Key Concepts 🔑
- Gini Index: Alternative to entropy for impurity.
Gini(S) = 1 - \sum p_i^2
- Gain Ratio: Adjusts info gain to penalize many-valued attributes.
GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}
- Multivariate Trees: Use linear combinations of attributes.
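A minimal sketch of the two impurity measures above (function names are mine; the 9+/5- counts and the Wind split reuse the worked example from section 3):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def gain_ratio(gain, subset_sizes):
    """Info gain normalized by SplitInformation (entropy of the split itself)."""
    total = sum(subset_sizes)
    split_info = -sum((s / total) * math.log2(s / total)
                      for s in subset_sizes if s)
    return gain / split_info

print(gini(["+"] * 9 + ["-"] * 5))  # ~0.459
print(gain_ratio(0.048, [8, 6]))    # ~0.049 for the Wind split
```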
Mind Map 🧠
Decision Trees
├── ID3 Algorithm
│ ├── Entropy (impurity measure)
│ ├── Information Gain (choose best attribute)
│ └── Stopping Criteria (pure branch, no attributes)
├── Overfitting
│ ├── Pre-pruning (stop early)
│ └── Post-pruning (grow full, then cut)
└── Extensions
├── Continuous Attributes (discretize)
├── Missing Values (use most frequent)
├── Regression Trees (predict numeric values)
└── Multivariate Trees (linear combinations)
Key Symbols 🔑
- S: Dataset.
- A: Attribute.
- p_+: Proportion of positive examples.
- p_-: Proportion of negative examples.
- Gain(S, A): Information gain for attribute A.
- Gini(S): Gini impurity for dataset S.
You’re ready! 🎉 Just remember Decision Trees = split data based on attributes, Entropy = measure of impurity, and Pruning = avoid overfitting! 🚀
1. Decision Trees Extensions 🌳
- Gain Ratio: Adjusts info gain to penalize attributes with many values.
GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}
- Continuous Attributes: Discretize using thresholds (e.g., Temperature > 54).
- Missing Values: Use most frequent value or probability estimates.
- Cost-Sensitive Attributes: Modify gain to account for feature costs.
Gain2(S, A) = \frac{Gain(S, A)^2}{Cost(S, A)}
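A toy comparison under this criterion (the gain and cost numbers are made up):

```python
def gain2(gain, cost):
    """Cost-sensitive criterion from above: squared gain per unit cost."""
    return gain ** 2 / cost

# A cheap, slightly weaker attribute can beat an expensive, stronger one:
print(gain2(0.40, cost=2))   # 0.08
print(gain2(0.60, cost=10))  # 0.036
```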
2. Multiclass Classification 🎯
- Entropy for Multiple Classes:
Entropy(S) = -\sum_{i=1}^c p_i \log_2 p_i
  - c: Number of classes.
  - p_i: Proportion of class i.
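The same entropy helper as in the ID3 sketch covers this case; a quick sanity check (toy labels of my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Works for any number of classes, not just binary."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

# A uniform distribution over c classes maximizes entropy at log2(c):
print(entropy(["a", "b", "c"] * 5))  # ~1.585 = log2(3)
```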
3. Regression Trees 📈
- Goal: Predict continuous values.
- Splitting Criterion: Minimize variance (standard deviation reduction).
SDR(S, A) = SD(S) - \sum_{v} \frac{|S_v|}{|S|} SD(S_v)
- Prediction: Mean value in leaf nodes.
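A short sketch of the SDR criterion above (the target values are hypothetical):

```python
import statistics

def sdr(parent, subsets):
    """Standard deviation reduction: SD(S) minus the weighted SD of subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(parent) - weighted

# This split cleanly separates low from high target values:
parent = [10, 12, 30, 32]
print(sdr(parent, [[10, 12], [30, 32]]))  # ~9.05, a strong split
```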
4. CART (Classification and Regression Trees) 🛠️
- Gini Index: Measures impurity.
Gini(S) = 1 - \sum_{i=1}^c p_i^2
- Weighted Gini:
Gini_{split} = \frac{N_1}{N} Gini(S_1) + \frac{N_2}{N} Gini(S_2)
- Regression: Use Mean Squared Error (MSE) for splitting.
MSE = \frac{1}{N} \sum (y_i - \hat{y})^2
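A small sketch tying the CART formulas above together (labels and targets are toy values):

```python
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(s1, s2):
    """Weighted Gini of a binary split, as in the formula above."""
    n = len(s1) + len(s2)
    return len(s1) / n * gini(s1) + len(s2) / n * gini(s2)

def mse(ys):
    """Mean squared error around the leaf mean (the regression criterion)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

print(gini_split(["+", "+"], ["-", "-"]))       # 0.0 for a perfect split
print(gini_split(["+", "+", "-"], ["-", "-"]))  # ~0.267, mixed left branch
print(mse([10, 12, 14]))                        # ~2.67, variance around 12
```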
5. Key Concepts 🔑
- Gain Ratio: Penalizes attributes with many values.
- Continuous Attributes: Discretize using thresholds.
- Missing Values: Use most frequent value or probability estimates.
- Regression Trees: Predict numeric values (mean in leaves).
- CART: Uses Gini Index for classification, MSE for regression.
Mind Map 🧠
Decision Trees Extensions
├── Gain Ratio (penalize many-valued attributes)
├── Continuous Attributes (discretize using thresholds)
├── Missing Values (use most frequent or probability)
├── Cost-Sensitive Attributes (modify gain with cost)
├── Multiclass Classification (entropy for multiple classes)
└── Regression Trees
├── Splitting Criterion (minimize variance)
├── Prediction (mean in leaves)
└── CART (Gini Index for classification, MSE for regression)
Key Symbols 🔑
- S: Dataset.
- A: Attribute.
- Gain(S, A): Information gain for attribute A.
- Gini(S): Gini impurity for dataset S.
- MSE: Mean Squared Error (for regression).
You’re ready! 🎉 Just remember Decision Trees = split data based on attributes, Gain Ratio = penalize many-valued attributes, and Regression Trees = predict numeric values! 🚀