Tree-based Methods

Overview

Advantages

  • Easy interpretability & visualization
  • Can handle qualitative predictors without having to create dummy variables

Disadvantages

  • Predictive performance of a single tree is not as good as other regression or classification methods
  • Tree structure is not robust and different training data can result in very different trees (i.e., high variance)

Decision Trees

  • Used for both regression & classification

  • Segment the covariate space into a number of simple regions; predict the response for a new data point as the average (regression) or majority class (classification) of the training points in the region to which the new point belongs.

  • Set of splitting rules can be summarized as a tree, ergo decision tree.

  • Basic tree-based methods are simple & easily visualized but don't perform as well as other supervised learning methods, so need to do bagging, boosting, random forests, etc., which use an ensemble of trees to create a consensus prediction. This usually results in much higher accuracy but less interpretability.

  • Regions can be any shape, but are usually boxes for simplicity & ease of interpretation

  • Boxes are determined s.t. RSS is minimized, but not possible to consider every combination of boxes

  • So follow greedy top-down approach called recursive binary splitting

  • Start by choosing the single covariate and cut point whose split minimizes RSS

  • Then split each resulting sub-region in the same way until a stopping criterion is reached (e.g., no terminal node contains more than a certain number of points); a sketch of the greedy split search appears at the end of this section

  • Trees that are too complex have low bias but result in overfitting

  • So grow a large tree and then prune it to get the optimal sub-tree. It is not practical to check test error for every possible sub-tree since the number of sub-trees is too large.

  • Use cost complexity pruning / weakest link pruning, which uses a tuning parameter $\alpha$ to penalize tree complexity (analogous to the lasso in linear regression).

  • $\alpha$ is selected using cross-validation

  • CV error is a good approximation to test error

  • For classification trees, instead of RSS, use classification error rate. But this is not sensitive enough to evaluate quality of split for growing trees.

  • So use Gini index or cross-entropy (measures of node purity since they will be low when a node contains mostly data from 1 class)

  • Any of the 3 can be used when pruning, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
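
The greedy split search and node-purity measure above can be made concrete with a short sketch. This is a minimal NumPy illustration under my own assumptions, not code from any particular library; `best_split`, `gini`, and the toy data are made up for this example.

```python
import numpy as np

def gini(y):
    """Gini index of a node: low (pure) when the node contains mostly one class."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(X, y):
    """One step of recursive binary splitting for regression: find the covariate j
    and cut point s that minimize the total RSS of the two half-planes
    {X_j < s} and {X_j >= s}. Growing a tree repeats this on each sub-region."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Toy data: the response depends mainly on the first covariate
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] > 0.5, 3.0, -3.0) + rng.normal(scale=0.5, size=200)
print(best_split(X, y))                      # expect feature 0, cut point near 0.5
print(gini(np.array([0, 0, 1, 1, 1, 1])))    # node purity for a toy class vector
```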

Bagging

  • Decision trees have low bias but high variance, i.e., tree structure is very sensitive to training data
  • Bootstrap aggregation or bagging is a method to reduce variance of a statistical learning method
  • To reduce variance / increase predictive accuracy of a method, take many training sets, build a predictive model for each set and average their predictions
  • But many training sets are usually not available, so use bootstrap samples from single training set instead & average predictions from each sample
  • Bagging can be used for many regression methods and works well for decision trees. For classification, use majority vote among bootstrapped trees to determine class of new observation.
  • With bagging, individual trees based on bootstrap samples are not pruned so they have low bias and high variance
  • The number of trees / bootstrap samples (B) is not a critical tuning parameter; using a very large B does not lead to overfitting (see the sketch below)
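
As a rough sketch of the procedure described above, assuming NumPy and scikit-learn are available; `bagged_predict` is an illustrative name, not a library function:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, B=100, seed=0):
    """Fit B unpruned trees on bootstrap samples and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample, drawn with replacement
        tree = DecisionTreeRegressor()     # grown deep and left unpruned: low bias, high variance
        tree.fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)              # averaging reduces the variance

# Illustrative synthetic data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
print(bagged_predict(X[:250], y[:250], X[250:])[:5])
```

For classification, replace the final average with a majority vote across the B trees; in practice `sklearn.ensemble.BaggingRegressor` / `BaggingClassifier` implement the same idea.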

Out-of-Bag Error

  • Alternate method to cross-validation / using validation set to estimate test error
  • On average, each bagged tree uses about two-thirds of the training data; the remaining one-third are the out-of-bag (OOB) observations for that tree
  • For the $i^{th}$ data point, predict the response using only the trees for which this point was OOB (i.e., not in their bootstrap sample). This is the OOB prediction for that point
  • Get OOB prediction for all training data points and calculate OOB MSE / classification error
  • OOB error is a valid estimate of test error since prediction was based only on trees which did not contain the point being predicted
  • With a sufficiently large number of trees, the OOB error is virtually equivalent to the leave-one-out cross-validation error
  • The OOB estimate of test error is very useful for large datasets where cross-validation is not practical (see the sketch below)
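
A minimal sketch of the OOB idea, assuming scikit-learn (its `BaggingClassifier` computes the OOB estimate when `oob_score=True`); the dataset here is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The default base estimator is a decision tree; oob_score=True asks sklearn to
# predict each training point using only the trees whose bootstrap sample did
# not include it, then report the resulting accuracy.
bag = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)
print("OOB accuracy:", bag.oob_score_)   # 1 - OOB classification error
```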

Variable Importance

  • Disadvantages of bagging - interpretability suffers.
  • The result of bagging cannot be represented as a single tree, and it is difficult to determine which covariates are important.
  • Can estimate variable importance using RSS & Gini index.
  • Track the decrease in RSS (regression) or Gini index (classification) due to splits on a variable, averaged over all bagged trees. A large decrease => important variable (see the sketch below)
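
A quick illustrative sketch, assuming scikit-learn, whose impurity-based `feature_importances_` attribute corresponds to this averaged decrease in RSS/Gini; the synthetic data is made up for the example:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)

# feature_importances_ is the impurity-based measure described above: the average
# decrease in RSS (regression) or Gini index (classification) from splits on each
# variable, averaged over all trees in the ensemble.
for j, imp in enumerate(rf.feature_importances_):
    print(f"x{j}: {imp:.3f}")
```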

Random Forests

  • Covariates that are strongly associated with the response will dominate tree growth (i.e., will always be picked first for splits), resulting in bagged trees with similar structure

  • These trees will have correlated predictions and averaging over these will not reduce variance as much as averaging over uncorrelated trees

  • So variance of prediction from these bagged trees will not be much better than that from a single tree

  • Random forests improve on bagging by decorrelating the trees

  • Build a number of trees on bootstrapped training samples as in bagging, but at each split in a tree consider only a random subset of m of the p covariates as candidates for the split

  • Pick a new random subset of m covariates for each split of each tree. Typically m ≈ sqrt(p) for classification & m ≈ p/3 for regression. m = p => bagging.

  • This method gives all covariates a chance to be used for a split since, on average, a fraction (p - m)/p of the splits will not even consider the dominant covariate

  • When dominant covariates are present, the average of these decorrelated trees is less variable than the average of the bagged trees

  • Similar to bagging, using a large number of trees does not result in overfitting (see the sketch below)
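
A minimal sketch of the effect of m, assuming scikit-learn, where m corresponds to `max_features` (setting it to all p covariates recovers bagging); the dataset and parameter values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)

# max_features = m, the size of the random covariate subset considered at each split.
# m = p (max_features=None) reduces the random forest to plain bagging.
for m in [None, "sqrt", 3]:
    rf = RandomForestRegressor(n_estimators=200, max_features=m, random_state=0)
    score = cross_val_score(rf, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"max_features={m}: CV MSE = {-score:.2f}")
```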

Parameters

  • Number of trees in forest (B): not a critical tuning parameter; choose B large enough for the error to level off, since a large B does not cause overfitting
  • Depth of trees (d), controlled e.g. via the minimum number of samples per leaf: tune using CV
  • Number of features considered at each split (m): tune using CV. Small m reduces the variance of the ensemble (by decorrelating trees) but increases bias. The best m also depends on the number of noisy vs. informative features; with many noisy features, a small m lowers the likelihood of picking an informative feature at a split (see the sketch below)
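
A hedged sketch of tuning m and tree depth by cross-validation with scikit-learn's `GridSearchCV`; the grid values and synthetic data are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Tune m (max_features) and tree depth (via min_samples_leaf) by cross-validation;
# B (n_estimators) just needs to be large enough for the error to stabilize.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": [2, 4, "sqrt", 8], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```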

Boosting

  • Another approach to improving predictions for statistical learning methods for regression & classification

  • Unlike bagging, boosting grows trees sequentially, using information from previously grown trees

  • Boosting does not use bootstrap samples; each tree is fit on a modified version of the original data

  • Boosting works by fitting each new tree to the residuals of the current model, rather than to the original response

  • A shrunken version of the new tree is added to the fitted function, the residuals are updated, and the next tree is fit to the updated residuals (a from-scratch sketch follows this list)

  • Each tree in this sequence can be quite small with only a few terminal nodes. This allows us to improve the fitted function in areas where it is not doing well

  • A shrinkage parameter slows down the learning process, allowing more and differently shaped trees to attack the residuals

  • Boosting parameters are:

    • B - number of trees to fit. Large B can result in overfitting. B is selected with cross-validation
    • $\lambda$ - shrinkage parameter that controls the learning rate. Usually set to 0.01 or 0.001
    • d - number of splits in the tree; controls interaction depth & complexity of final model. d = 1 => tree is a stump => final model is an additive model since each term/tree involves only a single covariate. A tree with d splits can involve at most d covariates.
  • Because each tree depends on previous trees, smaller trees are usually sufficient and this leads to better interpretability (e.g., using stumps => additive model)
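
A from-scratch sketch of this sequential procedure for regression (least-squares boosting), assuming NumPy and scikit-learn for the small trees; `boost` and `boosted_predict` are illustrative names, and `max_depth` is used here as a stand-in for the interaction depth d:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, B=1000, lam=0.01, d=1):
    """Least-squares boosting: repeatedly fit a small tree to the current
    residuals, add a shrunken copy of it to the model, update the residuals."""
    trees, r = [], y.copy()                     # start with f(x) = 0, residuals r = y
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=d)   # small tree (d = 1 => stump)
        tree.fit(X, r)                              # fit to residuals, not to y
        trees.append(tree)
        r -= lam * tree.predict(X)                  # r <- r - lambda * f_b(x)
    return trees

def boosted_predict(trees, X, lam=0.01):
    return lam * sum(t.predict(X) for t in trees)

# Toy example
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=300)
trees = boost(X, y)
print(np.mean((y - boosted_predict(trees, X)) ** 2))   # training MSE
```

With d = 1 each tree is a stump, so the final model is additive; in practice `sklearn.ensemble.GradientBoostingRegressor` exposes the same three knobs as `n_estimators`, `learning_rate`, and `max_depth`.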
