Concepts when Building a Random Forest model
To build a random forest model, you need to decide on the number of trees to build (the n_estimators parameter of RandomForestRegressor or RandomForestClassifier). A common rule of thumb is to build “as many as you have time/memory for.”
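As a minimal sketch of what this looks like in scikit-learn (the dataset here is just an assumed toy example; any classification data would do):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset for illustration.
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators controls how many trees the forest builds.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(forest.score(X_test, y_test)))
```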
To build a tree, we first take what is called a bootstrap sample of our data. That is, from our n_samples data points, we repeatedly draw a point at random with replacement, n_samples times. The resulting dataset is as big as the original one, but some data points are missing from it (roughly one third of them) and some appear multiple times.
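A small NumPy sketch of the idea (this is only an illustration of bootstrap sampling, not how scikit-learn implements it internally):

```python
import numpy as np

rng = np.random.default_rng(0)
indices = np.arange(5)   # pretend these are the indices of 5 training points

# Draw n_samples indices *with replacement*: some points repeat, others are left out.
bootstrap = rng.choice(indices, size=len(indices), replace=True)
print(bootstrap)         # e.g. something like [4 3 1 0 0] -- point 2 is missing, point 0 repeats
```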
Next, a decision tree is built based on this newly created dataset. But the algorithm we described for the decision tree is slightly modified. Instead of looking for the best test for each node, in each node the algorithm randomly selects a subset of the features, and it looks for the best possible test involving one of these features. The number of features that are selected is controlled by the max_features parameter. This selection of a subset of features is repeated separately in each node, so that each node in a tree can make a decision using a different subset of the features.
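The per-node feature selection can be illustrated with a short sketch (again just a conceptual illustration, not scikit-learn's internal code):

```python
import numpy as np

rng = np.random.default_rng(42)
n_features, max_features = 10, 3

# At each node, only a random subset of the features is considered for the split;
# a fresh subset is drawn for every node, so different nodes can use different features.
for node in range(3):
    candidate_features = rng.choice(n_features, size=max_features, replace=False)
    print("node", node, "-> candidate features:", candidate_features)
```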
Together, these two mechanisms (bootstrap sample and the algorithm modification) ensure that all the trees in the random forest are different.
A critical parameter in this process is max_features. If we set max_features to n_features, that means that each split can look at all features in the dataset, and no randomness will be injected in the feature selection (the randomness due to the bootstrapping remains, though). On the other hand, if we set max_features to 1, that means that the splits have no choice at all on which feature to test, and can only search over different thresholds for the feature that was selected randomly.
- High max_features -> the trees in the forest will be quite similar, and they will be able to fit the data easily using the most distinctive features.
- Low max_features -> the trees will be very different, and each tree might need to be very deep in order to fit the data well; the added randomness tends to reduce overfitting of the forest as a whole.
In general, it’s a good rule of thumb to use the default values: max_features=sqrt(n_features) for classification and max_features=n_features for regression. Adjusting max_features, or setting max_leaf_nodes, might sometimes improve performance; it can also drastically reduce the space and time requirements for training and prediction.
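A quick sketch of how these settings are expressed with scikit-learn's RandomForestClassifier (the parameter values are just the two extremes discussed above plus the default):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features=None means "use all features": almost no randomness in the
# feature selection, so the individual trees tend to be similar.
similar_trees = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)

# max_features=1: each split tests a single randomly chosen feature,
# so the trees differ a lot and usually need to grow deeper.
diverse_trees = RandomForestClassifier(n_estimators=100, max_features=1, random_state=0)

# The default (sqrt(n_features) for classification) is usually a good starting point.
default_forest = RandomForestClassifier(n_estimators=100, random_state=0)
```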
To make a prediction using the random forest, every tree in the forest first makes its own prediction. For regression, these predictions are averaged to obtain the final result. For classification, a “soft voting” strategy is used: each tree makes a “soft” prediction, providing a probability for each possible output label. The probabilities predicted by all the trees are averaged, and the class with the highest averaged probability is returned.
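This soft voting can be reproduced by hand, assuming the forest and X_test from the earlier sketch; the fitted trees are exposed through the estimators_ attribute:

```python
import numpy as np

# Average the per-tree class probabilities and pick the most probable class.
tree_probs = np.stack([tree.predict_proba(X_test) for tree in forest.estimators_])
averaged = tree_probs.mean(axis=0)                   # average over all trees
manual_pred = forest.classes_[averaged.argmax(axis=1)]

print(np.array_equal(manual_pred, forest.predict(X_test)))  # expected: True
```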