18 03 Bagging and Random Forests

01 Bagging

  • One algorithm, but different subsets of the training set.
  • Bootstrap Aggregation.
    • Uses a technique known as the bootstrap: sampling with replacement (see the sketch after this list).
    • Reduces the variance of the individual models in the ensemble.
  • Classification
    • Aggregates predictions by majority voting
    • BaggingClassifier
  • Regression
    • Aggregates predictions through averaging
    • BaggingRegressor
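
A minimal sketch of what one bootstrap sample looks like, using NumPy (the toy array here is made up for illustration):

# Draw N indices with replacement from N rows
import numpy as np

rng = np.random.default_rng(1)
X = np.arange(10)                                # a toy "training set" of 10 instances

boot_idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
print(X[boot_idx])                               # some instances repeat, others never appear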

An example of a BaggingClassifier (Indian Liver Patient dataset)

  • Define the bagging classifier
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier and accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)

# Instantiate bc (note: base_estimator was renamed to estimator in scikit-learn 1.2+)
bc = BaggingClassifier(base_estimator=dt, n_estimators=50, random_state=1)
  • Evaluate bagging performance
# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate test set accuracy (accuracy_score expects y_true first)
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test))  # Test set accuracy of bc: 0.71 # higher than that of a single tree (0.63)
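
To check the single-tree baseline quoted in the comment, one can fit dt on the same split (a sketch, assuming the same X_train/X_test variables as above):

# Fit the lone decision tree for comparison
dt.fit(X_train, y_train)

# Accuracy of the single tree (quoted as 0.63 above)
acc_dt = accuracy_score(y_test, dt.predict(X_test))
print('Test set accuracy of dt: {:.2f}'.format(acc_dt))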

02 Out of Bag Evaluation

Bagging

  • Some instances may be sampled several times for one model
  • Other instances may not be sampled at all

Out-of-Bag (OOB) instances

  • Since OOB instances (about 37% of the training set on average; see the sketch after this list) are not seen by a model during training, they can be used to estimate the performance of the ensemble without the need for cross-validation.
  • Evaluate each model on its own OOB instances, then average the OOB scores over all models.
  • Set oob_score=True when instantiating the ensemble.
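
The 37% figure is the probability that a given instance is never drawn in N samples with replacement: (1 - 1/N)^N, which tends to e^(-1) ≈ 0.368 as N grows. A quick numerical check:

# Probability an instance is left out of a bootstrap sample of size N
import math

N = 1000
p_oob = (1 - 1/N) ** N
print(p_oob, math.exp(-1))  # both ~0.368, i.e. ~37% of instances end up OOB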

Example

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier and accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1)

# Instantiate bc
bc = BaggingClassifier(base_estimator=dt, 
            n_estimators=50,
            oob_score=True,
            random_state=1)

# Fit bc to the training set 
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)

# Evaluate OOB accuracy
acc_oob = bc.oob_score_

# Print acc_test and acc_oob
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob)) # Test set accuracy: 0.698, OOB accuracy: 0.704

03 Random Forests

  • Base estimator: Decision Tree
  • Each estimator is trained on a different bootstrap sample having the same size as the training set
  • RF introduces further randomization in the training of individual trees
  • Only 'd' features are sampled at each node without replacement (controlled by max_features; see the sketch after this list)
  • RandomForestClassifier & RandomForestRegressor
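
A minimal sketch of the per-node feature-sampling knob (max_features='sqrt' is an assumption here, matching the classifier's common default):

# max_features sets how many features are considered at each split
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100,
                                max_features='sqrt',  # d = sqrt(n_features) per node
                                random_state=1)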

Feature Importance

  • Accessed via the attribute feature_importances_

Example

  • Train an RF regressor
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Instantiate rf
rf = RandomForestRegressor(n_estimators=25,
            random_state=2)
            
# Fit rf to the training set    
rf.fit(X_train, y_train) 
  • Evaluate the regressor
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Predict the test set labels
y_pred = rf.predict(X_test)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))  # Test set RMSE of rf: 51.97
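
As a side note, scikit-learn (0.22+) can return the RMSE directly via the squared=False flag, avoiding the manual square root:

# Equivalent RMSE computation
rmse_test = MSE(y_test, y_pred, squared=False)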
  • Visualizing feature importances
# Import pandas and matplotlib for plotting
import pandas as pd
import matplotlib.pyplot as plt

# Create a pd.Series of feature importances
importances = pd.Series(data=rf.feature_importances_,
                        index=X_train.columns)

# Sort importances
importances_sorted = importances.sort_values()

# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightgreen')
plt.title('Feature Importances')
plt.show()