Lecture 04. BDT - clairedavid/ml_in_hep GitHub Wiki

XGBoost Algorithm Explained in Less Than 5 Minutes

Log Book — XGBoost, the math behind the algorithm https://towardsdatascience.com/log-book-xgboost-the-math-behind-the-algorithm-54ddc5008850

XGBoost explained https://dimleve.medium.com/xgboost-mathematics-explained-58262530904a

Gradient Boost

YouTube: https://www.youtube.com/watch?v=3CC4N4z3GJc

Contrary to AdaBoost, which builds another stump based on the errors made by the previous stump, Gradient Boosting starts by making a single leaf instead of a tree or stump. This leaf represents an initial guess for the target value of all the samples (the Weight variable in the video's regression example), namely the average of the observed values.

Then Gradient Boost makes a tree, larger than a decision stump (typically between 8 and 32 leaves).

Gradient Boost scales every tree by the same amount (the learning rate). It then builds another tree based on the errors made by the previous tree, scales that tree, and continues building trees in this fashion.

Pseudo-residual (term borrowed from linear regression): the difference between the observed and the predicted value. It is called "pseudo" in Gradient Boost because we are not doing linear regression.
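More formally (not spelled out above), the pseudo-residual is the negative gradient of the loss with respect to the current prediction; for the squared-error loss of the regression case it reduces to the plain difference:

$$ r_i = -\frac{\partial}{\partial \hat{y}_i}\,\frac{1}{2}\left(y_i - \hat{y}_i\right)^2 = y_i - \hat{y}_i $$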

Build a tree that predicts the pseudo-residuals.

Because the number of leaves is restricted, there are usually fewer leaves than residuals, so several residuals fall into the same leaf and are replaced by their average. The tree's output is then combined with the initial leaf to update the predictions, and more trees are added in the same way.

The resulting model has low bias but very high variance. Gradient Boost deals with this by scaling each new tree's contribution with a learning rate (a value between 0 and 1).

Empirical evidence shows that taking lots of small steps in the right direction results in better predictions on a testing dataset, i.e. lower variance.
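To make the procedure concrete, here is a minimal sketch of plain gradient boosting for regression, assuming squared-error loss and using scikit-learn's `DecisionTreeRegressor` as the weak learner. The function and parameter names (`gradient_boost_regression`, `n_trees`, `learning_rate`, `max_leaf_nodes`) are illustrative choices, not from the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, n_trees=100, learning_rate=0.1, max_leaf_nodes=8):
    """Sketch of plain gradient boosting for regression (squared-error loss)."""
    # Initial leaf: a single constant prediction, the average of the targets.
    initial_prediction = np.mean(y)
    predictions = np.full_like(y, initial_prediction, dtype=float)
    trees = []

    for _ in range(n_trees):
        # Pseudo-residuals: observed minus currently predicted values.
        residuals = y - predictions
        # Fit a small tree (8-32 leaves in practice) to the residuals.
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residuals)
        # Every tree is scaled by the same learning rate before being added.
        predictions += learning_rate * tree.predict(X)
        trees.append(tree)

    return initial_prediction, trees

def predict(X, initial_prediction, trees, learning_rate=0.1):
    """Combine the initial leaf with all the scaled trees.
    The learning rate must match the one used during training."""
    pred = np.full(X.shape[0], initial_prediction, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```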

GB Classification

Start with an initial guess. When using Gradient Boost for classification, the initial prediction for every sample is the log of the odds (the equivalent of the average in logistic regression).

$$ \log(\text{odds}) = \log\left(\frac{N_{y=1}}{N_{y=0}}\right) $$

Put it in the initial leaf.
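As a purely hypothetical illustration of the formula, with 4 samples of class y = 1 and 2 samples of class y = 0:

$$ \log(\text{odds}) = \log\left(\frac{4}{2}\right) \approx 0.69 $$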

How to proceed now with classification?

The easiest way is to convert the log(odds) into a probability. This is done with the logistic function:

$$ p(y=1) = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} = \frac{1}{1 + e^{-\log(\text{odds})}} $$
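Continuing the hypothetical counts above:

$$ p(y=1) = \frac{1}{1 + e^{-0.69}} \approx 0.67 > 0.5 $$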

Since this initial p is greater than 0.5, we classify everyone in the training dataset as y = 1.

But this will not be accurate for all samples (in particular those with observed y = 0). For every sample there is an error between the predicted probability and the observed value; this error is the pseudo-residual. As before, the name is borrowed from linear regression, which is not what we are doing here; the literature sometimes just writes "residuals" for short.
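In symbols, for each sample i the pseudo-residual is the observed class (0 or 1) minus the currently predicted probability:

$$ r_i = y_i - p_i $$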

Pseudo-Residuals on a graph:

Now calculate the residuals and store them in an extra column of the table.

Then build a tree using the same input features, but fitting the residuals instead of the original target.

The number of leaves in the tree is limited: here 3, in practice between 8 and 32.

Now let's calculate the output value for each leaf.

What is the output value of a leaf containing a single residual? It is not simply the residual: the predictions are in terms of log(odds), whereas the leaf is derived from a probability, so a transformation is needed. The output value of a leaf is:

$$ \text{output value} = \frac{\sum_{i \in \text{leaf}} \text{Residual}_i}{\sum_{i \in \text{leaf}} \text{PreviousProbability}_i \times \left(1 - \text{PreviousProbability}_i\right)} $$

(The derivation of this formula is given in Part 4 of the video series.)
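As a small sketch in code (the function name and argument names are illustrative), the leaf output in log(odds) units would be computed as:

```python
import numpy as np

def leaf_output_value(residuals, previous_probabilities):
    """Output value (in log(odds) units) of one leaf, from the residuals it
    contains and the previously predicted probabilities of the same samples."""
    residuals = np.asarray(residuals, dtype=float)
    p = np.asarray(previous_probabilities, dtype=float)
    return residuals.sum() / (p * (1.0 - p)).sum()
```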

Now we update our predictions by combining the initial leaf with the new tree, scaled by the learning rate.
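In the same spirit as the regression case (the symbols below are standard gradient-boosting notation, not copied from the lecture), the update for a sample is:

$$ \log(\text{odds})_{\text{new}} = \log(\text{odds})_{\text{old}} + \eta \times \text{output value of the leaf the sample falls into} $$

where $\eta$ is the learning rate. The updated log(odds) is converted back to a probability with the logistic function, new pseudo-residuals are computed, and the next tree is built.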