Decision Trees Boosting - aragorn/home GitHub Wiki
- https://en.wikipedia.org/wiki/Decision_tree_learning
- http://www.cs.cmu.edu/~guestrin/Class/10701-S06/Slides/decisiontrees-boosting.pdf
Decision Trees Boosting
Machine Learning, Carlos Guestrin, Carnegie Mellon University
Introduction
- Linear Separability, Not linearly separable data
- Addressing non-linearly separable data
- Option 1, non-linear features
- Option 2, non-linear classifier
- decision trees, neural networks, nearest neighbor, ...
- BTW. Later this semester, we’ll see that these options are not that different.
- A small dataset: Miles Per Gallon
- A Decision Stump
- A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves).
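As a concrete illustration, here is a minimal decision stump on a made-up categorical attribute (the data and names below are mine, not the lecture's MPG table): the root tests one attribute and each leaf predicts the majority output for that value.

```python
# A minimal categorical decision stump (illustrative data, not the MPG table).
from collections import Counter

def fit_stump(rows, labels, attr):
    """One root test on `attr`; each leaf predicts the majority label."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return {v: Counter(ys).most_common(1)[0][0] for v, ys in by_value.items()}

def predict_stump(stump, row, attr, default="bad"):
    return stump.get(row[attr], default)

rows = [{"cylinders": 4}, {"cylinders": 4}, {"cylinders": 6}, {"cylinders": 8}]
labels = ["good", "good", "bad", "bad"]
stump = fit_stump(rows, labels, "cylinders")
print(stump)                                          # {4: 'good', 6: 'bad', 8: 'bad'}
print(predict_stump(stump, {"cylinders": 8}, "cylinders"))   # 'bad'
```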
- Recursion Step
- Second level of tree
- The final tree
- Classification of a new example
- Characteristics of decision trees
- Are all decision trees equal?
- Many trees can represent the same concept, but they will not all have the same size.
- Learning decision trees is hard!!!
- Finding the smallest (optimal) decision tree is an NP-complete problem [Hyafil & Rivest '76]
- Resort to a greedy heuristic:
- Start from empty decision tree
- Split on next best attribute (feature)
- Recurse
- Choosing a good attribute
- Entropy and information gain
- Measuring uncertainty
- Good split if we are more certain about classification after split
- Deterministic good (all true or all false)
- Uniform distribution bad
- Entropy
- More uncertainty, more entropy!
- Andrew Moore’s Entropy in a nutshell
- Information Gain
- Advantage of attribute – decrease in uncertainty
- Entropy of Y before you split
- Entropy after split
- Weight by probability of following each branch, i.e., normalized number of records
- Information gain is difference
- IG(X) = H(Y) - H(Y|X)
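To make IG(X) = H(Y) - H(Y|X) concrete, here is a small worked example with made-up counts (not from the slides):

```python
# Worked example of IG(X) = H(Y) - H(Y|X) on made-up counts.
import math

def H(counts):
    """Entropy in bits of a distribution given by raw counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# Label Y before the split: 8 positive, 8 negative -> maximum uncertainty.
h_before = H([8, 8])                       # 1.0 bit

# Split on a binary attribute X: the left branch gets 7 pos / 1 neg,
# the right branch gets 1 pos / 7 neg.
h_left, h_right = H([7, 1]), H([1, 7])     # ~0.544 bits each
# Weight each branch by the probability of following it (8/16 each).
h_after = 0.5 * h_left + 0.5 * h_right     # H(Y|X) ~ 0.544 bits

print("IG =", h_before - h_after)          # ~0.456 bits: a useful split
```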
- Decision trees with IG
- Learning decision trees
- Start from empty decision tree
- Split on next best attribute (feature)
- Information Gain Example
- Look at all the information gains...
- A Decision Stump
- Base Case One - Don't split a node if all matching records have the same output value
- Base Case Two - Don't split a node if none of the attributes can create multiple non-empty children
- No attributes can distinguish
- Base Cases: An idea (proposed Base Case 3)
- If all attributes have zero information gain, then don't recurse.
- The problem with Base Case 3: there are datasets (the classic example is y = a XOR b) where every attribute has zero information gain at the root, yet splitting anyway yields a tree with zero training error.
- So proposed Base Case 3 is a bad idea: you have to keep recursing even when all information gains are zero.
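A tiny sketch of that counterexample (my own code): on y = a XOR b data, every single attribute has zero information gain at the root, so proposed Base Case 3 would stop immediately, yet recursing gives a perfect two-level tree.

```python
# XOR counterexample for "stop if all information gains are zero".
import math

def H(labels):
    n = len(labels)
    return -sum(labels.count(v) / n * math.log2(labels.count(v) / n)
                for v in set(labels))

def info_gain(xs, ys, attr):
    n = len(ys)
    cond = sum(
        sum(1 for x in xs if x[attr] == v) / n
        * H([y for x, y in zip(xs, ys) if x[attr] == v])
        for v in {x[attr] for x in xs})
    return H(ys) - cond

# y = a XOR b
data = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
ys = [0, 1, 1, 0]

print(info_gain(data, ys, "a"))  # 0.0 -> proposed Base Case 3 would stop here
print(info_gain(data, ys, "b"))  # 0.0
# But splitting on a anyway, then on b inside each branch, gives pure leaves:
# a=0: b=0 -> 0, b=1 -> 1;  a=1: b=0 -> 1, b=1 -> 0.  Zero training error.
```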
- Basic Decision Tree Building Summarized
- BuildTree(DataSet,Output)
- If all output values are the same in DataSet, return a leaf node that says "predict this unique output"
- Else if all input values are the same, return a leaf node that says "predict the majority output"
- Else find the attribute X with the highest Information Gain
- Suppose X has nX distinct values (i.e., X has arity nX)
- Create and return a non-leaf node with nX children
- The i-th child is built by calling BuildTree(DSi, Output), where DSi consists of all those records in DataSet for which X = the i-th distinct value of X
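A minimal sketch of this recursion for categorical attributes (function and variable names are mine, not from the lecture; real-valued attributes are handled in the next section):

```python
# Sketch of BuildTree for categorical attributes only.
from collections import Counter
import math

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y), estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v)."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

def build_tree(rows, labels, attrs):
    # Base case one: all outputs identical -> leaf predicting that output.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Base case two: no attribute can create multiple non-empty children.
    usable = [a for a in attrs if len({r[a] for r in rows}) > 1]
    if not usable:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Greedy step: split on the attribute with the highest information gain.
    best = max(usable, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return node
```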
Real-Valued Inputs
- "One branch for each numeric value" idea: hopeless
- Threshold splits
- Binary tree, split on attribute X
- Choosing threshold split
- Search through possible values of t : seems hard!!!
- But only finite number of t's are important
- Sort data according to X into {x1, ..., xm}
- Consider split points of the form xi + (xi+1 - xi) / 2
- A better idea: thresholded splits
- Suppose X is real valued
- Define IG(Y|X:t) as H(Y) - H(Y|X:t)
- Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
- Then define IG'(Y|X) = max_for_t IG(Y|X:t)
- For each real-valued attribute, use IG'(Y|X) for assessing its suitability as a split
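A sketch of that threshold search (names are mine): sort the values, try the midpoint between each pair of consecutive distinct values as t, and keep the t with the largest IG(Y|X:t).

```python
# Sketch of IG'(Y|X): search candidate thresholds t at midpoints between
# consecutive sorted values of a real-valued attribute.
import math
from collections import Counter

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (t, IG) maximizing IG(Y|X:t) = H(Y) - H(Y|X:t)."""
    pairs = sorted(zip(xs, ys))
    base = H(ys)
    best_t, best_gain = None, -1.0
    for (x1, _), (x2, _) in zip(pairs, pairs[1:]):
        if x1 == x2:
            continue                      # no boundary between equal values
        t = x1 + (x2 - x1) / 2            # midpoint candidate, as above
        below = [y for x, y in pairs if x < t]
        above = [y for x, y in pairs if x >= t]
        h = (len(below) / len(ys)) * H(below) + (len(above) / len(ys)) * H(above)
        if base - h > best_gain:
            best_t, best_gain = t, base - h
    return best_t, best_gain

# Example: horsepower-like values with a clean cut around 100.
xs = [70, 80, 90, 95, 105, 120, 150, 200]
ys = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
print(best_threshold(xs, ys))             # (100.0, 1.0)
```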
- Example tree using reals
- MPG Test set error
- Training Set: Percent Wrong 2.50
- Test Set: Percent Wrong 21.02
- The test set error is much worse than the training set error. Why?
- Decision trees & Learning Bias
- Decision trees will overfit
- Standard decision trees have no learning bias
- Training set error is always zero (if there is no label noise)!
- Lots of variance
- Will definitely overfit!!!
- Must bias towards simpler trees
- Many strategies for picking simpler trees:
- Fixed depth
- Fixed number of leaves
- Or something smarter...
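In practice these controls show up as hyperparameters of off-the-shelf implementations; for instance, with scikit-learn (an illustration, not part of the lecture; the parameter values here are arbitrary):

```python
# Limiting tree complexity in scikit-learn: fixed depth / minimum leaf size.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No bias towards simple trees: grows until the training data is memorized.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Biased towards simple trees: fixed depth and a minimum number of records per leaf.
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                               random_state=0).fit(X_tr, y_tr)

print("full  train/test accuracy:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("small train/test accuracy:", small.score(X_tr, y_tr), small.score(X_te, y_te))
```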
- Consider this split
- Example: splitting on maker gives America 0:10, Asia 2:5, Europe 2:2 (counts are bad:good mpg)
- A chi-square test
- Suppose that mpg was completely uncorrelated with maker.
- What is the chance we'd have seen data with at least this apparent level of association anyway?
- By using a particular kind of chi-square test, the answer is 13.5%.
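For reference, the same kind of question can be asked with scipy's Pearson chi-square test of independence; note that the lecture's "particular kind of chi-square test" may differ, so the p-value need not come out to exactly 13.5%.

```python
# Independence test on the maker-vs-mpg table above (Pearson chi-square).
from scipy.stats import chi2_contingency

#                bad  good
table = [[0, 10],   # america
         [2,  5],   # asia
         [2,  2]]   # europe

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
# Large p -> the apparent association could plausibly be chance: prune the split.
# Small p -> the split reflects real structure: keep it.
```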
- Using Chi-squared to avoid overfitting
- Build the full decision tree as before
- But when you can grow it no more, start to prune
- Beginning at the bottom of the tree, delete splits in which p_of_chance > MaxP_of_chance
- Continue working your way up until there are no more prunable nodes
- MaxP_of_chance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise
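A rough sketch of that bottom-up pruning loop, assuming the dict-shaped nodes from the BuildTree sketch earlier and using scipy's Pearson chi-square test in place of whatever exact test the lecture used:

```python
# Chi-square pruning sketch: collapse a split whose children are all leaves
# when p_of_chance > MaxP_of_chance. Nodes are shaped like the BuildTree
# sketch above: {"leaf": value} or {"attr": a, "children": {...}}.
from collections import Counter
from scipy.stats import chi2_contingency

def prune(node, rows, labels, max_p_chance=0.1):
    if "leaf" in node:
        return node
    attr = node["attr"]
    # First prune the subtrees, working from the bottom of the tree up.
    for value, child in node["children"].items():
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        node["children"][value] = prune(
            child, [rows[i] for i in idx], [labels[i] for i in idx], max_p_chance)
    # If every child is now a leaf, test whether this split beats chance.
    if all("leaf" in c for c in node["children"].values()):
        classes = sorted(set(labels))
        table = []
        for value in node["children"]:
            counts = Counter(l for r, l in zip(rows, labels) if r[attr] == value)
            table.append([counts.get(c, 0) for c in classes])
        _, p, _, _ = chi2_contingency(table)
        if p > max_p_chance:              # association is plausibly just noise
            return {"leaf": Counter(labels).most_common(1)[0][0]}
    return node
```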
- Pruning example
- With MaxP_of_chance = 0.1
- Training Set: Percent Wrong 12.50 (was 2.50 before pruning)
- Test Set: Percent Wrong 15.91 (was 21.02 before pruning)
- MaxP_of_chance
- Technical note: MaxP_of_chance is a regularization parameter that helps us bias towards simpler models
- We'll learn to choose the value of these magic parameters soon!
Summary of so far
What you need to know about decision trees
- Decision trees are one of the most popular data mining tools
- Easy to understand
- Easy to implement
- Easy to use
- Computationally cheap (to solve heuristically)
- Information gain to select attributes (ID3, C4.5, …)
- Presented for classification, can be used for regression and density estimation too
- Decision trees will overfit!!!
- Zero bias classifier → lots of variance
- Must use tricks to find “simple trees”, e.g.,
- Fixed depth / early stopping
- Pruning
- Hypothesis testing
Bias-Variance Tradeoff
- Fighting the bias-variance tradeoff
- Simple (a.k.a. weak) learners are good
- e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
- Low variance, don't usually overfit
- Simple (a.k.a. weak) learners are bad
- High bias, can't solve hard learning problems
- Can we make weak learners always good???
- No!!!
- But often yes...
- Boosting [Schapire, 1989]
- Idea: given a weak learner, run it multiple times on (reweighted) training data, then let learned classifiers vote.
- On each iteration t:
- weight each training example by how incorrectly it was classified
- Learn a hypothesis – h_of_t
- A strength for this hypothesis – alpha_of_t
- Final classifier: a weighted majority vote, H(x) = sign(sum_t alpha_of_t * h_of_t(x))
- Practically useful
- Theoretically interesting
- Learning from weighted data
- Sometimes not all data points are equal
- Some data points are more equal than others
- Consider a weighted dataset
- D(i) – weight of the i-th training example (xi, yi)
- Now, in all calculations, wherever it is used, the i-th training example counts as D(i) "examples"
- e.g., MLE for Naïve Bayes, redefine Count(Y=y) to be weighted count
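A tiny illustration of weighted counting (my own numbers): the weighted count for a class is simply the sum of the weights D(i) of the examples in that class.

```python
# Weighted counts: example i contributes D(i) "examples" to every statistic.
import numpy as np

y = np.array(["spam", "ham", "spam", "ham", "ham"])
D = np.array([0.4, 0.1, 0.3, 0.1, 0.1])      # weights, summing to 1

for c in ("spam", "ham"):
    print(c, "weighted count:", D[y == c].sum())
# spam: 0.7, ham: 0.3 -- e.g., a weighted MLE for P(Y=spam) would be 0.7,
# even though only 2 of the 5 raw examples are spam.
```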
- Figure 1: The boosting algorithm AdaBoost
- What alpha_of_t to choose for hypothesis h_of_t
JK: I snipped full story of equations.
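Since the equations were snipped, here is a compact sketch of the standard AdaBoost loop with decision stumps as the weak learner; it uses the usual choices alpha_of_t = 0.5 * ln((1 - epsilon_of_t) / epsilon_of_t) and multiplicative weight updates, and should be read as an illustration rather than the lecture's exact formulation.

```python
# Minimal AdaBoost with decision stumps; labels y must be +1/-1.
import numpy as np

def fit_weighted_stump(X, y, w):
    """Weak learner: best single-feature threshold split under weights w."""
    best = (None, None, None, np.inf)            # (feature, threshold, sign, weighted error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] < t, s, -s)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(X, y, rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(rounds):
        j, t, s, eps = fit_weighted_stump(X, y, w)
        eps = max(eps, 1e-12)                    # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)    # strength of hypothesis h_t
        pred = np.where(X[:, j] < t, s, -s)
        w *= np.exp(-alpha * y * pred)           # upweight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict_ensemble(ensemble, X):
    score = sum(a * np.where(X[:, j] < t, s, -s) for a, j, t, s in ensemble)
    return np.sign(score)                        # final classifier: weighted vote
```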
- Strong, weak classifiers
- If each classifier is (at least slightly) better than random
- epsilon_of_t < 0.5
- AdaBoost will achieve zero training error (exponentially fast): the training error after T rounds is at most exp(-2 * sum_t (1/2 - epsilon_of_t)^2)
- Is it hard to achieve better than random training error?
- Boosting: Experimental Results
- Comparison of C4.5, boosting C4.5, and boosting decision stumps (depth-1 trees) on 27 benchmark datasets. JK: I am lost...
- Boosting and Logistic Regression
- Logistic regression assumes: P(Y=1|X) = 1 / (1 + exp(-(w_0 + sum_i w_i X_i)))
- And tries to maximize the conditional likelihood of the training labels
- Equivalent to minimizing log loss
- Boosting and Logistic Regression
- Logistic regression equivalent to minimizing log loss
- Boosting minimizes similar loss function
- Both smooth approximations of 0/1 loss! JK: I am lost...
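Since the formulas were cut here, a small sketch of the comparison using the textbook forms: logistic regression minimizes the log loss ln(1 + exp(-y f(x))), while AdaBoost minimizes the exponential loss exp(-y f(x)); both are smooth surrogates that penalize negative margins.

```python
# Compare 0/1 loss, log loss, and exponential loss as a function of the margin
# m = y * f(x). Formulas are the textbook ones, not copied from the slides.
import numpy as np

m = np.linspace(-2, 2, 9)                 # margin values
zero_one = (m <= 0).astype(float)         # 0/1 loss
log_loss = np.log(1 + np.exp(-m))         # logistic regression (natural log)
exp_loss = np.exp(-m)                     # AdaBoost

for row in zip(m, zero_one, log_loss, exp_loss):
    print("m=%+.1f  0/1=%.0f  log=%.3f  exp=%.3f" % row)
# Both smooth losses penalize negative margins heavily and vanish for large
# positive margins, which is the sense in which they approximate the 0/1 loss.
```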
Summary of Boosting
- What you need to know about Boosting
- Combine weak classifiers to obtain very strong classifier
- Weak classifier – slightly better than random on training data
- Resulting very strong classifier – can eventually provide zero training error
- AdaBoost algorithm
- Boosting v. Logistic Regression
- Similar loss functions
- Single optimization (LR) v. Incrementally improving classification (B)
- Most popular application of Boosting:
- Boosted decision stumps!
- Very simple to implement, very effective classifier
Acknowledgements
- Much of the decision trees material in the presentation is courtesy of Andrew Moore, from his excellent collection of ML tutorials:
- Much of the boosting material in the presentation is courtesy of Tom Mitchell