Decision Trees Boosting

Machine Learning, Carlos Guestrin, Carnegie Mellon University

Introduction

  • Linear Separability, Not linearly separable data
  • Addressing non-linearly separable data
    • Option 1, non-linear features
    • Option 2, non-linear classifier
      • decision trees, neural networks, nearest neighbor, ...
        • BTW. Later this semester, we’ll see that these options are not that different.
  • A small dataset: Miles Per Gallon
    • A Decision Stump
      • A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves).
    • Recursion Step
    • Second level of tree
    • The final tree
    • Classification of a new example
  • Characteristics of decision trees
    • Are all decision trees equal?
      • Many represent the same concept, but not all trees will have the same size.
    • Learning decision trees is hard!!!
      • NP-complete problem [Hyafil & Rivest ’76]
      • Resort to a greedy heuristic:
        • Start from empty decision tree
        • Split on next best attribute (feature)
        • Recurse
    • Choosing a good attribute
  • Entropy and information gain
    • Measuring uncertainty
      • Good split if we are more certain about classification after split
        • Deterministic good (all true or all false)
        • Uniform distribution bad
    • Entropy
      • More uncertainty, more entropy!
    • Andrew Moore’s Entropy in a nutshell
    • Information Gain
      • Advantage of attribute – decrease in uncertainty
        • Entropy of Y before you split
        • Entropy after split
          • Weight by probability of following each branch, i.e., normalized number of records
      • Information gain is difference
        • IG(X) = H(Y) - H(Y|X)
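To make the entropy and information-gain bookkeeping concrete, here is a minimal Python sketch; the tiny cylinders/mpg table is made up for illustration and is not the lecture's dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y), estimated from the label counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X); H(Y|X) weights each branch's entropy by the
    fraction of records that follow that branch."""
    n = len(ys)
    h_before = entropy(ys)
    h_after = 0.0
    for value in set(xs):
        branch = [y for x, y in zip(xs, ys) if x == value]
        h_after += (len(branch) / n) * entropy(branch)
    return h_before - h_after

# Made-up toy data: does "cylinders" reduce uncertainty about the mpg class?
cylinders = [4, 4, 4, 6, 6, 8, 8, 8]
mpg       = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
print(information_gain(cylinders, mpg))  # 0.75 bits out of H(Y) = 1 bit
```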
  • Decision trees with IG
    • Learning decision trees
      • Start from empty decision tree
      • Split on next best attribute (feature)
    • Information Gain Example
      • Look at all the information gains...
    • A Decision Stump
    • Base Case One - Don't split a node if all matching records have the same output value
    • Base Case Two - Don't split a node if none of the attributes can create multiple non-empty children
      • No attributes can distinguish
    • Base Cases: An idea
      • If all attributes have zero information gain then don't recurse.
      • The problem with Base Case 3: on XOR-like data every attribute has zero information gain at the root, yet a two-level tree classifies the data perfectly
      • So you should not adopt Base Case 3: recurse even when all information gains are zero.
  • Basic Decision Tree Building Summarized
    • BuildTree(DataSet, Output)
      • If all output values are the same in DataSet, return a leaf node that says "predict this unique output"
      • If all input values are the same, return a leaf node that says "predict the majority output"
      • Else find attribute X with highest Info Gain
        • Suppose X has nX distinct values (i.e. X has arity nX). Create and return a non-leaf node with nX children.
        • The i'th child should be built by calling BuildTree(DSi, Output), where DSi consists of all those records in DataSet for which X = the i'th distinct value of X.
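A hedged Python sketch of the BuildTree recursion summarized above, reusing the information_gain function from the earlier sketch; the dict-based tree representation is just an illustrative choice, not the lecture's:

```python
from collections import Counter

def build_tree(records, attributes, output):
    """Greedy recursion sketched above; records is a list of dicts mapping
    attribute/output names to values, e.g. {"maker": "asia", "mpg": "good"}."""
    labels = [r[output] for r in records]
    # Base Case One: all output values are the same -> predict that value
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Base Case Two: no attribute can create multiple non-empty children
    splittable = [a for a in attributes if len({r[a] for r in records}) > 1]
    if not splittable:
        return {"leaf": Counter(labels).most_common(1)[0][0]}  # majority output
    # Otherwise split on the attribute X with the highest information gain
    best = max(splittable,
               key=lambda a: information_gain([r[a] for r in records], labels))
    node = {"split_on": best, "children": {}}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        node["children"][value] = build_tree(subset, attributes, output)
    return node
```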

Real-Valued Inputs

  • Real-Valued Inputs
    • "One branch for each numeric value" idea: hopeless
  • Threshold splits
    • Binary tree, split on attribute X
  • Choosing threshold split
    • Binary tree, split on attribute X
    • Search through possible values of t: seems hard!!!
    • But only finite number of t's are important
      • Sort data according to X into {x1, ..., xm}
      • Consider split points of the form xi + (xi+1 - xi) / 2
  • A better idea: thresholded splits
    • Suppose X is real valued
    • Define IG(Y|X:t) as H(Y) - H(Y|X:t)
    • Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
    • Then define IG'(Y|X) = max_for_t IG(Y|X:t)
    • For each real-valued attribute, use IG'(Y|X) for assessing its suitability as a split
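A small Python sketch of the thresholded-split idea, reusing the entropy function from the earlier sketch; it tries only the midpoints between consecutive sorted values, as described above:

```python
def best_threshold_split(xs, ys):
    """IG'(Y|X) = max over t of IG(Y|X:t), where
    H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t).
    Only midpoints between consecutive sorted x values need to be tried."""
    n = len(ys)
    h_before = entropy(ys)
    xs_sorted = sorted(set(xs))
    best_gain, best_t = 0.0, None
    for lo, hi in zip(xs_sorted, xs_sorted[1:]):
        t = lo + (hi - lo) / 2            # candidate threshold (midpoint)
        below = [y for x, y in zip(xs, ys) if x < t]
        above = [y for x, y in zip(xs, ys) if x >= t]
        h_after = (len(below) / n) * entropy(below) \
                + (len(above) / n) * entropy(above)
        gain = h_before - h_after
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# e.g. with the toy data from the earlier sketch:
# best_threshold_split(cylinders, mpg)
```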
  • Example tree using reals
    • MPG Test set error
      • Training Set: Percent Wrong 2.50
      • Test Set: Percent Wrong 21.02
    • The test set error is much worse than the training set error. Why?
  • Decision trees & Learning Bias
    • Decision trees will overfit
      • Standard decision trees have no learning bias
        • Training set error is always zero!
        • Lots of variance
        • Will definitely overfit!!!
        • Must bias towards simpler trees
      • Many strategies for picking simpler trees:
        • Fixed depth
        • Fixed number of leaves
        • Or something smarter...
  • Consider this split
    • Example: splitting on maker gives america 0:10, asia 2:5, europe 2:2 (counts listed as bad:good)
    • A chi-square test
      • Suppose that mpg was completely uncorrelated with maker.
      • What is the chance we’d have seen data of at least this apparent level of association anyway? By using a particular kind of chi-square test, the answer is 13.5%.
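For reference, a standard Pearson chi-square test on this contingency table can be run with SciPy as below; note the slide says "a particular kind of chi-square test", so the p-value printed here is not guaranteed to reproduce the quoted 13.5%:

```python
from scipy.stats import chi2_contingency

# Contingency table from the split above, rows = maker, columns = (bad, good)
observed = [
    [0, 10],   # america
    [2,  5],   # asia
    [2,  2],   # europe
]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```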
  • Using Chi-squared to avoid overfitting
    • Build the full decision tree as before
    • But when you can grow it no more, start to prune
      • Beginning at the bottom of the tree, delete splits in which p_of_chance > MaxP_of_chance
      • Continue working your way up until there are no more prunable nodes
    • MaxP_of_chance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise
  • Pruning example
    • With MaxP_of_chance = 0.1
      • Training Set: Percent Wrong 12.50 (was 2.50 before pruning)
      • Test Set: Percent Wrong 15.91 (was 21.02 before pruning)
  • MaxP_of_chance
    • Technical note: MaxP_of_chance is a regularization parameter that helps us bias towards simpler models
    • We'll learn to choose the value of these magic parameters soon!

Summary of so far

What you need to know about decision trees

  • Decision trees are one of the most popular data mining tools
    • Easy to understand
    • Easy to implement
    • Easy to use
    • Computationally cheap (to solve heuristically)
  • Information gain to select attributes (ID3, C4.5, ...)
  • Presented for classification, can be used for regression and density estimation too
  • Decision trees will overfit!!!
    • Zero bias classifier → Lots of variance
    • Must use tricks to find “simple trees”, e.g.,
      • Fixed depth/Early stopping
      • Pruning (hypothesis testing)

Bias-Variance Tradeoff

  • Fighting the bias-variance tradeoff
    • Simple (a.k.a. weak) learners are good
      • e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
      • Low variance, don't usually overfit
    • Simple (a.k.a. weak) learners are bad
      • High bias, can't solve hard learning problems
    • Can we make weak learners always good???
      • No!!!
      • But often yes...
  • Boosting [Schapire, 1989]
    • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let learned classifiers vote.
    • On each iteration t:
      • Weight each training example by how incorrectly it was classified
      • Learn a hypothesis – h_of_t
      • Compute a strength for this hypothesis – alpha_of_t
    • Final classifier: a weighted vote of the learned hypotheses, sign(sum_over_t alpha_of_t h_of_t(x))
      • Practically useful
      • Theoretically interesting
  • Learning from weighted data
    • Sometimes not all data points are equal
      • Some data points are more equal than others
    • Consider a weighted dataset
      • D(i) – weight of the i'th training example (xi, yi)
    • Now, in all calculations, whenever used, the i'th training example counts as D(i) "examples"
      • e.g., MLE for Naïve Bayes, redefine Count(Y=y) to be weighted count
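A tiny sketch of the weighted-count idea (the labels and weights below are made up for illustration):

```python
def weighted_count(ys, weights, y_value):
    """Count(Y = y) under example weights D(i): sum the weights of the
    matching examples instead of counting each one as 1."""
    return sum(w for y, w in zip(ys, weights) if y == y_value)

# Made-up example: three positive examples carry total weight 0.7
labels = [+1, +1, -1, +1]
D      = [0.1, 0.4, 0.3, 0.2]
print(weighted_count(labels, D, +1))   # 0.1 + 0.4 + 0.2 = 0.7
```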
  • Figure 1: The boosting algorithm AdaBoost
  • What alpha_of_t to choose for hypothesis h_of_t? (the standard choice appears in the sketch below)

JK: I snipped full story of equations.
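Since the equations were snipped, here is a sketch of the standard AdaBoost loop (Freund & Schapire) with the usual choice alpha_of_t = 0.5 * ln((1 - epsilon_of_t) / epsilon_of_t); the weak_learner argument is a hypothetical placeholder for any learner that accepts example weights D:

```python
import math

def adaboost(xs, ys, weak_learner, rounds):
    """Standard AdaBoost with labels y in {-1, +1}.
    weak_learner(xs, ys, D) must return a hypothesis h with h(x) in {-1, +1}."""
    n = len(xs)
    D = [1.0 / n] * n                     # start with uniform example weights
    ensemble = []                         # list of (alpha_t, h_t)
    for _ in range(rounds):
        h = weak_learner(xs, ys, D)
        # Weighted training error of this hypothesis
        eps = sum(D[i] for i in range(n) if h(xs[i]) != ys[i])
        if eps == 0.0:                    # perfect hypothesis: just use it alone
            return h
        if eps >= 0.5:                    # not better than random: stop early
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Reweight: increase weight of misclassified examples, then normalize
        D = [D[i] * math.exp(-alpha * ys[i] * h(xs[i])) for i in range(n)]
        z = sum(D)
        D = [d / z for d in D]
    def H(x):                             # final classifier: weighted vote
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H
```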

  • Strong, weak classifiers
    • If each classifier is (at least slightly) better than random
      • epsilon_of_t < 0.5
    • AdaBoost will achieve zero training error (exponentially fast): see the bound below
    • Is it hard to achieve better than random training error?
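The exponential-rate claim above is the standard AdaBoost training-error bound; stated in LaTeX, with epsilon_t the weighted training error of the t'th weak hypothesis:

```latex
% Standard AdaBoost training-error bound (Freund & Schapire):
\[
  \operatorname{err}_{\text{train}}(H)
  \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1-\epsilon_t)}
  \;\le\; \exp\!\Bigl(-2 \sum_{t=1}^{T} \bigl(\tfrac{1}{2} - \epsilon_t\bigr)^{2}\Bigr),
\]
% so if every epsilon_t < 0.5, the training error drops to zero exponentially fast in T.
```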
  • Boosting: Experimental Results
    • Comparison of C4.5, Boosting C4.5, Boosting decision stumps (depth 1 trees), 27 benchmark datasets JK: I am lost...
  • Boosting and Logistic Regression
    • Logistic regression assumes: P(Y = 1 | x) = 1 / (1 + exp(-f(x))), where f(x) = w_0 + sum_over_i w_i x_i
    • And tries to maximize data likelihood: product_over_j P(y_j | x_j, w)
    • Equivalent to minimizing log loss: sum_over_j ln(1 + exp(-y_j f(x_j)))
  • Boosting and Logistic Regression
    • Logistic regression is equivalent to minimizing log loss: sum_over_j ln(1 + exp(-y_j f(x_j)))
    • Boosting minimizes a similar loss function: the exponential loss sum_over_j exp(-y_j f(x_j))
    • Both are smooth approximations of 0/1 loss! JK: I am lost...
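A small sketch (not from the lecture) that tabulates the three losses as a function of the margin y·f(x), to make the "smooth approximation" point concrete:

```python
import math

def zero_one_loss(margin):      # 1 if misclassified (margin <= 0), else 0
    return 1.0 if margin <= 0 else 0.0

def log_loss(margin):           # logistic regression: ln(1 + exp(-y f(x)))
    return math.log(1.0 + math.exp(-margin))

def exp_loss(margin):           # boosting: exp(-y f(x))
    return math.exp(-margin)

for m in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"margin {m:+.1f}: 0/1 {zero_one_loss(m):.0f}  "
          f"log {log_loss(m):.3f}  exp {exp_loss(m):.3f}")
```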

Summary of Boosting

  • What you need to know about Boosting
    • Combine weak classifiers to obtain very strong classifier
      • Weak classifier – slightly better than random on training data
      • Resulting very strong classifier – can eventually provide zero training error
    • AdaBoost algorithm
    • Boosting v. Logistic Regression
      • Similar loss functions
      • Single optimization (LR) v. Incrementally improving classification (B)
    • Most popular application of Boosting:
      • Boosted decision stumps!
      • Very simple to implement, very effective classifier

Acknowledgements

  • Much of the decision trees material in the presentation is courtesy of Andrew Moore, from his excellent collection of ML tutorials:…
  • Much of the boosting material in the presentation is courtesy of Tom Mitchell