
Bagging

The instability of an algorithm makes possible a different strategy for creating an ensemble of models: instead of varying the algorithm, apply the same algorithm to different samples of the original data. If the algorithm is unstable, the different samples yield different models. The average of the predictions of these models might be better than the predictions from any single model. If the algorithm is stable, then the different samples yield similar models, and their average is no better than a single model.

The usual procedure for creating one of the sampled training sets is as follows. Pick a data case at random from the original training data set and copy it to the sampled data set. Repeat this a second time. The same case might be copied twice, but more likely two different cases will be chosen. Keep repeating this process until the sampled data set contains N cases, where N is arbitrary but is usually equal to the number of cases in the original training data set. This process is called sampling with replacement.
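For concreteness, here is a minimal sketch of this sampling procedure in Python; the function name `bootstrap_sample` and the toy data are illustrative choices, not part of the original description:

```python
import random

def bootstrap_sample(data, n=None):
    """Draw n cases from `data` with replacement.

    By default n equals the size of the original data set, which is
    the usual choice for bagging.
    """
    if n is None:
        n = len(data)
    # Each draw picks a case uniformly at random from the original data,
    # so the same case may appear more than once in the sample.
    return [random.choice(data) for _ in range(n)]

# Example: a bootstrap sample with as many cases as the original data set.
original = [("x1", 0), ("x2", 1), ("x3", 0), ("x4", 1), ("x5", 1)]
sampled = bootstrap_sample(original)
```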

"Bagging" refers to the creation of an ensemble of models using the same algorithm on data sets sampled with replacement from a single training data set. Bagging stands for "bootstrap aggregation," and it is the invention of Leo Breiman (1996). He uses the voting method for classification.

Arcing

Bagging uses independent samples of the original data. Arcing instead takes samples that favor cases that the current ensemble predicts relatively poorly. The first model trains on all of the original data. Cases that the first model classifies incorrectly are preferentially sampled for the second training data set. A second model is trained, and the two models are combined. Cases for which the combined model performs poorly are preferentially sampled for the third training data set, and so on.

Leo Breiman (1998) introduced the term "arcing" to stand for "adaptive resampling and combining," which seemed to him a better description of the "boosting" process proposed by Freund and Schapire (1995, 1996). Unlike bagging, arcing may give different weights to the individual models. Specifying the weights and the probabilities for sampling a case completely specifies an arcing algorithm. No particular specification seems generally superior to others in practice. Arcing applies only to categorical targets.
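A rough sketch of the adaptive-resampling idea, reusing the `vote` helper from the bagging sketch; the weight-doubling rule and the use of an unweighted vote to combine the models are illustrative simplifications, since, as noted above, each arcing algorithm specifies its own sampling probabilities and model weights:

```python
import random

def arc(train, fit, n_models=10):
    """Adaptive resampling: later models see more of the hard cases."""
    models = [fit(train)]                  # the first model uses all original data
    weights = [1.0] * len(train)           # one sampling weight per case
    for _ in range(1, n_models):
        # Raise the weight of every case the current ensemble misclassifies.
        # (Doubling is an arbitrary illustrative choice; published arcing
        # algorithms each use their own reweighting rule.)
        for i, (x, y) in enumerate(train):
            if vote(models, x) != y:
                weights[i] *= 2.0
        # Draw the next training set with replacement, favoring heavy cases.
        sample = random.choices(train, weights=weights, k=len(train))
        models.append(fit(sample))
    return models
```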

Boosting

The term "boosting" is much more commonly used than "arcing" and generally means the same thing. However, Friedman, Hastie, and Tibshirani (1998, 1999) have reformulated the concept without reformulating a new name. Boosting, as reformulated, creates an ensemble consisting of a weighted average of models fit to data in which the target values are changed in a special way. In all other respects, the data are the same for all models.

As an example, consider minimizing the sum of squared errors between an interval target and a prediction from an ensemble. The first individual model is fit to the original data. For the training data set for the second model, subtract the first model's predicted value from the original target value, and use the difference as the target value to train the second model. For the third training set, subtract a weighted average of the first two models' predictions from the original target value, and use the difference as the target value to train the third model, and so on.
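A minimal sketch of this residual-fitting recipe; the base learner `fit`, the fixed weight given to every model, and the assumption that `fit` returns a callable are illustrative choices:

```python
def boost_squared_error(train, fit, n_models=10, weight=0.5):
    """Each new model is fit to the residual left by the weighted sum
    of the models built so far (the squared-error prescription)."""
    xs = [x for x, _ in train]
    residuals = [y for _, y in train]      # the first model sees the original target
    models = []
    for _ in range(n_models):
        model = fit(list(zip(xs, residuals)))
        models.append(model)
        # New target: original target minus the ensemble's prediction, i.e.
        # the previous residual minus this model's weighted contribution.
        residuals = [r - weight * model(x) for x, r in zip(xs, residuals)]
    return models

def predict(models, x, weight=0.5):
    """The ensemble prediction is the weighted sum of the individual models."""
    return sum(weight * model(x) for model in models)
```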

The weights used in averaging the individual models and the prescription for creating a target value depend on the measure of loss associated with the problem. The prescription just described is used when minimizing the sum of squared errors. A different prescription is used when minimizing the sum of absolute errors. Thus, the measure of loss determines the particulars of a boosting algorithm.
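One way to make the dependence on the loss concrete is the gradient view of boosting, in which the new target for each model is the negative gradient of the loss evaluated at the current ensemble prediction F(x). In that view (the notation below is ours, not the original text's), squared error recovers the residual prescription above, while absolute error keeps only the sign of the residual:

```latex
% Squared error: the new target is the ordinary residual
% (the negative gradient of the loss at the current prediction F).
L(y, F) = \tfrac{1}{2}(y - F)^2
  \quad\Longrightarrow\quad
  -\frac{\partial L}{\partial F} = y - F

% Absolute error: only the sign of the residual is kept, so the next
% model is fit to +1 / -1 targets rather than to the raw residuals.
L(y, F) = \lvert y - F \rvert
  \quad\Longrightarrow\quad
  -\frac{\partial L}{\partial F} = \operatorname{sign}(y - F)
```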