Decision Trees Boosting - aragorn/home GitHub Wiki
- https://en.wikipedia.org/wiki/Decision_tree_learning
- http://www.cs.cmu.edu/~guestrin/Class/10701-S06/Slides/decisiontrees-boosting.pdf
Decision Trees Boosting
Machine Learning, Carlos Guestrin, Carnegie Mellon University
Introduction
- Linear Separability, Not linearly separable data
- Addressing non-linearly separable data
- Option 1, non-linear features
- Option 2, non-linear classifier
- decision trees, neural networks, nearest neighbor, ...
- BTW. Later this semester, we’ll see that these options are not that different.
- A small dataset: Miles Per Gallon
- A Decision Stump
- A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves).
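As a concrete illustration, here is a minimal decision stump on a made-up categorical attribute (the data and names below are mine, not the lecture's MPG table): the root tests one attribute and each leaf predicts the majority output for that value.

```python
# A minimal categorical decision stump (illustrative data, not the MPG table).
from collections import Counter

def fit_stump(rows, labels, attr):
    """One root test on `attr`; each leaf predicts the majority label."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return {v: Counter(ys).most_common(1)[0][0] for v, ys in by_value.items()}

def predict_stump(stump, row, attr, default="bad"):
    return stump.get(row[attr], default)

rows = [{"cylinders": 4}, {"cylinders": 4}, {"cylinders": 6}, {"cylinders": 8}]
labels = ["good", "good", "bad", "bad"]
stump = fit_stump(rows, labels, "cylinders")
print(stump)                                          # {4: 'good', 6: 'bad', 8: 'bad'}
print(predict_stump(stump, {"cylinders": 8}, "cylinders"))   # 'bad'
```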
- Recursion Step
- Second level of tree
- The final tree
- Classification of a new example
- Characteristics of decision trees
- Are all decision trees equal?
- Many trees can represent the same concept, but they will not all have the same size.
- Learning decision trees is hard!!!
- Finding the smallest (optimal) decision tree is an NP-complete problem [Hyafil & Rivest '76]
- Resort to a greedy heuristic:
- Start from empty decision tree
- Split on next best attribute (feature)
- Recurse
- Choosing a good attribute
- Entropy and information gain
- Measuring uncertainty
- Good split if we are more certain about classification after split
- Deterministic good (all true or all false)
- Uniform distribution bad
- Entropy
- More uncertainty, more entropy!
- Andrew Moore’s Entropy in a nutshell
- Information Gain
- Advantage of attribute – decrease in uncertainty
- Entropy of Y before you split
- Entropy after split
- Weight by probability of following each branch, i.e., normalized number of records
- Information gain is difference
- IG(X) = H(Y) - H(Y|X)
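To make IG(X) = H(Y) - H(Y|X) concrete, here is a small worked example with made-up counts (not from the slides):

```python
# Worked example of IG(X) = H(Y) - H(Y|X) on made-up counts.
import math

def H(counts):
    """Entropy in bits of a distribution given by raw counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# Label Y before the split: 8 positive, 8 negative -> maximum uncertainty.
h_before = H([8, 8])                       # 1.0 bit

# Split on a binary attribute X: the left branch gets 7 pos / 1 neg,
# the right branch gets 1 pos / 7 neg.
h_left, h_right = H([7, 1]), H([1, 7])     # ~0.544 bits each
# Weight each branch by the probability of following it (8/16 each).
h_after = 0.5 * h_left + 0.5 * h_right     # H(Y|X) ~ 0.544 bits

print("IG =", h_before - h_after)          # ~0.456 bits: a useful split
```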
- Decision trees with IG
- Learning decision trees
- Start from empty decision tree
- Split on next best attribute (feature)
- Information Gain Example
- Look at all the information gains...
- A Decision Stump
- Base Case One - Don't split a node if all matching records have the same output value
- Base Case Two - Don't split a node if none of the attributes can create multiple non-empty children
- No attributes can distinguish
- Base Cases: An idea (proposed Base Case 3)
- If all attributes have zero information gain, then don't recurse.
- The problem with Base Case 3: there are datasets (the classic example is y = a XOR b) where every attribute has zero information gain at the root, yet splitting anyway yields a tree with zero training error.
- So proposed Base Case 3 is a bad idea: you have to keep recursing even when all information gains are zero.
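A tiny sketch of that counterexample (my own code): on y = a XOR b data, every single attribute has zero information gain at the root, so proposed Base Case 3 would stop immediately, yet recursing gives a perfect two-level tree.

```python
# XOR counterexample for "stop if all information gains are zero".
import math

def H(labels):
    n = len(labels)
    return -sum(labels.count(v) / n * math.log2(labels.count(v) / n)
                for v in set(labels))

def info_gain(xs, ys, attr):
    n = len(ys)
    cond = sum(
        sum(1 for x in xs if x[attr] == v) / n
        * H([y for x, y in zip(xs, ys) if x[attr] == v])
        for v in {x[attr] for x in xs})
    return H(ys) - cond

# y = a XOR b
data = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
ys = [0, 1, 1, 0]

print(info_gain(data, ys, "a"))  # 0.0 -> proposed Base Case 3 would stop here
print(info_gain(data, ys, "b"))  # 0.0
# But splitting on a anyway, then on b inside each branch, gives pure leaves:
# a=0: b=0 -> 0, b=1 -> 1;  a=1: b=0 -> 1, b=1 -> 0.  Zero training error.
```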
- Basic Decision Tree Building Summarized
- BuildTree(DataSet,Output)
- If all output values are the same in DataSet, return a leaf node that says "predict this unique output"
- Else if all input values are the same, return a leaf node that says "predict the majority output"
- Else find the attribute X with the highest Information Gain
- Suppose X has nX distinct values (i.e., X has arity nX)
- Create and return a non-leaf node with nX children
- The i-th child is built by calling BuildTree(DSi, Output), where DSi consists of all those records in DataSet for which X = the i-th distinct value of X
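A minimal sketch of this recursion for categorical attributes (function and variable names are mine, not from the lecture; real-valued attributes are handled in the next section):

```python
# Sketch of BuildTree for categorical attributes only.
from collections import Counter
import math

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y), estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v)."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

def build_tree(rows, labels, attrs):
    # Base case one: all outputs identical -> leaf predicting that output.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Base case two: no attribute can create multiple non-empty children.
    usable = [a for a in attrs if len({r[a] for r in rows}) > 1]
    if not usable:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Greedy step: split on the attribute with the highest information gain.
    best = max(usable, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return node
```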
Real-Valued Inputs
- "One branch for each numeric value" idea: hopeless
- Threshold splits
- Binary tree, split on attribute X
- Choosing threshold split
- Search through possible values of t : seems hard!!!
- But only finite number of t's are important
- Sort data according to X into {x1, ..., xm}
- Consider split points of the form xi + (xi+1 - xi) / 2
- A better idea: thresholded splits
- Suppose X is real valued
- Define IG(Y|X:t) as H(Y) - H(Y|X:t)
- Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
- Then define IG'(Y|X) = max_for_t IG(Y|X:t)
- For each real-valued attribute, use IG'(Y|X) for assessing its suitability as a split
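A sketch of that threshold search (names are mine): sort the values, try the midpoint between each pair of consecutive distinct values as t, and keep the t with the largest IG(Y|X:t).

```python
# Sketch of IG'(Y|X): search candidate thresholds t at midpoints between
# consecutive sorted values of a real-valued attribute.
import math
from collections import Counter

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (t, IG) maximizing IG(Y|X:t) = H(Y) - H(Y|X:t)."""
    pairs = sorted(zip(xs, ys))
    base = H(ys)
    best_t, best_gain = None, -1.0
    for (x1, _), (x2, _) in zip(pairs, pairs[1:]):
        if x1 == x2:
            continue                      # no boundary between equal values
        t = x1 + (x2 - x1) / 2            # midpoint candidate, as above
        below = [y for x, y in pairs if x < t]
        above = [y for x, y in pairs if x >= t]
        h = (len(below) / len(ys)) * H(below) + (len(above) / len(ys)) * H(above)
        if base - h > best_gain:
            best_t, best_gain = t, base - h
    return best_t, best_gain

# Example: horsepower-like values with a clean cut around 100.
xs = [70, 80, 90, 95, 105, 120, 150, 200]
ys = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
print(best_threshold(xs, ys))             # (100.0, 1.0)
```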
- Example tree using reals
- MPG Test set error
- Training Set: Percent Wrong 2.50
- Test Set: Percent Wrong 21.02
- The test set error is much worse than the training set error. Why?
- Decision trees & Learning Bias
- Decision trees will overfit
- Standard decision trees have no learning bias
- Training set error is always zero (if there is no label noise)!
- Lots of variance
- Will definitely overfit!!!
- Must bias towards simpler trees
- Many strategies for picking simpler trees:
- Fixed depth
- Fixed number of leaves
- Or something smarter...
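In practice these controls show up as hyperparameters of off-the-shelf implementations; for instance, with scikit-learn (an illustration, not part of the lecture; the parameter values here are arbitrary):

```python
# Limiting tree complexity in scikit-learn: fixed depth / minimum leaf size.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No bias towards simple trees: grows until the training data is memorized.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Biased towards simple trees: fixed depth and a minimum number of records per leaf.
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                               random_state=0).fit(X_tr, y_tr)

print("full  train/test accuracy:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("small train/test accuracy:", small.score(X_tr, y_tr), small.score(X_te, y_te))
```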
- Consider this split
- Example: splitting on maker gives America 0:10, Asia 2:5, Europe 2:2 (counts are bad:good mpg)
- A chi-square test
- Suppose that mpg was completely uncorrelated with maker.
- What is the chance we'd have seen data with at least this apparent level of association anyway?
- By using a particular kind of chi-square test, the answer is 13.5%.
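For reference, the same kind of question can be asked with scipy's Pearson chi-square test of independence; note that the lecture's "particular kind of chi-square test" may differ, so the p-value need not come out to exactly 13.5%.

```python
# Independence test on the maker-vs-mpg table above (Pearson chi-square).
from scipy.stats import chi2_contingency

#                bad  good
table = [[0, 10],   # america
         [2,  5],   # asia
         [2,  2]]   # europe

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
# Large p -> the apparent association could plausibly be chance: prune the split.
# Small p -> the split reflects real structure: keep it.
```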
- Using Chi-squared to avoid overfitting
- Build the full decision tree as before
- But when you can grow it no more, start to prune
- Beginning at the bottom of the tree, delete splits in which p_of_chance > MaxP_of_chance
- Continue working your way up until there are no more prunable nodes
- MaxP_of_chance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise
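A rough sketch of that bottom-up pruning loop, assuming the dict-shaped nodes from the BuildTree sketch earlier and using scipy's Pearson chi-square test in place of whatever exact test the lecture used:

```python
# Chi-square pruning sketch: collapse a split whose children are all leaves
# when p_of_chance > MaxP_of_chance. Nodes are shaped like the BuildTree
# sketch above: {"leaf": value} or {"attr": a, "children": {...}}.
from collections import Counter
from scipy.stats import chi2_contingency

def prune(node, rows, labels, max_p_chance=0.1):
    if "leaf" in node:
        return node
    attr = node["attr"]
    # First prune the subtrees, working from the bottom of the tree up.
    for value, child in node["children"].items():
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        node["children"][value] = prune(
            child, [rows[i] for i in idx], [labels[i] for i in idx], max_p_chance)
    # If every child is now a leaf, test whether this split beats chance.
    if all("leaf" in c for c in node["children"].values()):
        classes = sorted(set(labels))
        table = []
        for value in node["children"]:
            counts = Counter(l for r, l in zip(rows, labels) if r[attr] == value)
            table.append([counts.get(c, 0) for c in classes])
        _, p, _, _ = chi2_contingency(table)
        if p > max_p_chance:              # association is plausibly just noise
            return {"leaf": Counter(labels).most_common(1)[0][0]}
    return node
```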
- Pruning example
- With MaxP_of_chance = 0.1
- Training Set: Percent Wrong 12.50 (was 2.50 before pruning)
- Test Set: Percent Wrong 15.91 (was 21.02 before pruning)
- MaxP_of_chance
- Technical note: MaxP_of_chance is a regularization parameter that helps us bias towards simpler models
- We'll learn to choose the value of these magic parameters soon!
Summary of so far
What you need to know about decision trees
- Decision trees are one of the most popular data mining tools
- Easy to understand
- Easy to implement
- Easy to use
- Computationally cheap (to solve heuristically)
- Information gain to select attributes (ID3, C4.5, …)
- Presented for classification, can be used for regression and density estimation too
- Decision trees will overfit!!!
- Zero bias classifier → lots of variance
- Must use tricks to find “simple trees”, e.g.,
- Fixed depth / early stopping
- Pruning
- Hypothesis testing
Bias-Variance Tradeoff
- Fighting the bias-variance tradeoff
- Simple (a.k.a. weak) learners are good
- e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
- Low variance, don't usually overfit
- Simple (a.k.a. weak) learners are bad
- High bias, can't solve hard learning problems
- Can we make weak learners always good???
- No!!!
- But often yes...
- Boosting [Schapire, 1989]
- Idea: given a weak learner, run it multiple times on (reweighted) training data, then let learned classifiers vote.
- On each iteration t:
- weight each training example by how incorrectly it was classified
- Learn a hypothesis – h_of_t
- A strength for this hypothesis – alpha_of_t
- Final classifier: a weighted majority vote, H(x) = sign(sum_t alpha_of_t * h_of_t(x))
- Practically useful
- Theoretically interesting
- Learning from weighted data
- Sometimes not all data points are equal
- Some data points are more equal than others
- Consider a weighted dataset
- D(i) – weight of the i-th training example (xi, yi)
- Now, in all calculations, wherever it is used, the i-th training example counts as D(i) "examples"
- e.g., MLE for Naïve Bayes, redefine Count(Y=y) to be weighted count
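A tiny illustration of weighted counting (my own numbers): the weighted count for a class is simply the sum of the weights D(i) of the examples in that class.

```python
# Weighted counts: example i contributes D(i) "examples" to every statistic.
import numpy as np

y = np.array(["spam", "ham", "spam", "ham", "ham"])
D = np.array([0.4, 0.1, 0.3, 0.1, 0.1])      # weights, summing to 1

for c in ("spam", "ham"):
    print(c, "weighted count:", D[y == c].sum())
# spam: 0.7, ham: 0.3 -- e.g., a weighted MLE for P(Y=spam) would be 0.7,
# even though only 2 of the 5 raw examples are spam.
```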
- Figure 1: The boosting algorithm AdaBoost
- What alpha_of_t to choose for hypothesis h_of_t
JK: I snipped full story of equations.
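Since the equations were snipped, here is a compact sketch of the standard AdaBoost loop with decision stumps as the weak learner; it uses the usual choices alpha_of_t = 0.5 * ln((1 - epsilon_of_t) / epsilon_of_t) and multiplicative weight updates, and should be read as an illustration rather than the lecture's exact formulation.

```python
# Minimal AdaBoost with decision stumps; labels y must be +1/-1.
import numpy as np

def fit_weighted_stump(X, y, w):
    """Weak learner: best single-feature threshold split under weights w."""
    best = (None, None, None, np.inf)            # (feature, threshold, sign, weighted error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] < t, s, -s)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(X, y, rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(rounds):
        j, t, s, eps = fit_weighted_stump(X, y, w)
        eps = max(eps, 1e-12)                    # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)    # strength of hypothesis h_t
        pred = np.where(X[:, j] < t, s, -s)
        w *= np.exp(-alpha * y * pred)           # upweight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict_ensemble(ensemble, X):
    score = sum(a * np.where(X[:, j] < t, s, -s) for a, j, t, s in ensemble)
    return np.sign(score)                        # final classifier: weighted vote
```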
- Strong, weak classifiers
- If each classifier is (at least slightly) better than random
- epsilon_of_t < 0.5
- AdaBoost will achieve zero training error (exponentially fast): the training error after T rounds is at most exp(-2 * sum_t (1/2 - epsilon_of_t)^2)
- Is it hard to achieve better than random training error?
- Boosting: Experimental Results
- Comparison of C4.5, boosting C4.5, and boosting decision stumps (depth-1 trees) on 27 benchmark datasets. JK: I am lost...
- Boosting and Logistic Regression
- Logistic regression assumes: P(Y=1|X) = 1 / (1 + exp(-(w_0 + sum_i w_i X_i)))
- And tries to maximize the conditional likelihood of the training labels
- Equivalent to minimizing log loss
- Boosting and Logistic Regression
- Logistic regression equivalent to minimizing log loss
- Boosting minimizes similar loss function
- Both smooth approximations of 0/1 loss! JK: I am lost...
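Since the formulas were cut here, a small sketch of the comparison using the textbook forms: logistic regression minimizes the log loss ln(1 + exp(-y f(x))), while AdaBoost minimizes the exponential loss exp(-y f(x)); both are smooth surrogates that penalize negative margins.

```python
# Compare 0/1 loss, log loss, and exponential loss as a function of the margin
# m = y * f(x). Formulas are the textbook ones, not copied from the slides.
import numpy as np

m = np.linspace(-2, 2, 9)                 # margin values
zero_one = (m <= 0).astype(float)         # 0/1 loss
log_loss = np.log(1 + np.exp(-m))         # logistic regression (natural log)
exp_loss = np.exp(-m)                     # AdaBoost

for row in zip(m, zero_one, log_loss, exp_loss):
    print("m=%+.1f  0/1=%.0f  log=%.3f  exp=%.3f" % row)
# Both smooth losses penalize negative margins heavily and vanish for large
# positive margins, which is the sense in which they approximate the 0/1 loss.
```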
Summary of Boosting
- What you need to know about Boosting
- Combine weak classifiers to obtain very strong classifier
- Weak classifier – slightly better than random on training data
- Resulting very strong classifier – can eventually provide zero training error
- AdaBoost algorithm
- Boosting v. Logistic Regression
- Similar loss functions
- Single optimization (LR) v. Incrementally improving classification (B)
- Most popular application of Boosting:
- Boosted decision stumps!
- Very simple to implement, very effective classifier
Acknowledgements
- Much of the decision trees material in the presentation is courtesy of Andrew Moore, from his excellent collection of ML tutorials:
- Much of the boosting material in the presentation is courtesy of Tom Mitchell