18 05 Model Tuning

01 Tuning a CART's hyperparameters

  • parameter examples of CART (learned from the data): split-point of a node, split-feature of a node, ...
  • hyperparameter examples of CART (set before fitting): max_depth, min_samples_leaf, splitting criterion
  • Grid search cross validation
    • Score: in sklearn, defaults to accuracy (classification) and R^2 (regression); see the sketch below.
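A minimal sketch of that default scoring behaviour (the dataset here is just an illustrative toy dataset, not the one from the course): when no scoring argument is passed, cross_val_score falls back to the estimator's default scorer, which is accuracy for a classifier.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# No 'scoring' argument: the classifier's default scorer (accuracy) is used
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5)
print('Mean CV accuracy: {:.3f}'.format(scores.mean()))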

Example

  • Set the tree's hyperparameter grid
    • Inspect the tunable hyperparameter names with .get_params() (see the sketch after the grid below)
# Define params_dt
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf':[0.12, 0.14, 0.16, 0.18]
}
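The base tree dt used by the grid search below is not shown in the snippet; a minimal sketch of how it might be instantiated, together with .get_params() to list the names that are valid keys in params_dt (the random_state value is arbitrary):

# Hypothetical instantiation of the base tree used by GridSearchCV below
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=1)

# The keys of this dict are the hyperparameter names accepted in param_grid
print(dt.get_params())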
  • Search for the optimal tree
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)
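The snippet above only instantiates the grid search; before the evaluation step below it has to be fit on the training set (X_train and y_train are assumed to be available from the course exercise):

# Fit grid_dt to the training data (assumed to exist as X_train, y_train)
grid_dt.fit(X_train, y_train)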
  • Evaluate the optimal tree
    • Use the ROC AUC score, since the dataset is imbalanced.
    • Extract the best hyperparameters: .best_params_ (see the sketch after the evaluation code)
    • Extract the best estimator: .best_estimator_
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))
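The .best_params_ attribute mentioned above never appears in the course code; a short sketch of inspecting it, together with the best cross-validated score:

# Best hyperparameter combination found by the grid search
print('Best hyperparameters:', grid_dt.best_params_)

# Corresponding cross-validated ROC AUC (the 'scoring' used during the search)
print('Best CV ROC AUC: {:.3f}'.format(grid_dt.best_score_))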

02 Tuning an RF's Hyperparameters

  • the CART hyperparameters (per tree)
  • number of estimators
  • bootstrap

Tuning is expensive

  • Computationally expensive
  • sometimes leads to only a very slight improvement
  • Weigh the impact of tuning on the whole project (see the sketch below).
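A quick back-of-the-envelope sketch of why tuning is expensive, using the RF grid defined below (3 x 3 x 3 combinations, 3 CV folds):

# Number of model fits = (grid combinations) x (CV folds), plus one final refit
n_combinations = 3 * 3 * 3   # n_estimators x max_features x min_samples_leaf
n_folds = 3
print(n_combinations * n_folds)  # 81 fits before the refit on the full training set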

Example

  • Set the hyperparameter grid of RF
# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators':[100, 350, 500],
    'max_features':['log2', 'auto', 'sqrt'],
    'min_samples_leaf':[2,10,30]
}
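As before, the base forest rf passed to GridSearchCV below is assumed rather than shown; a minimal sketch of how it might be instantiated (the random_state value is arbitrary):

# Hypothetical base regressor passed to GridSearchCV below
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=2)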
  • Search for the optimal forest
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)
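As with the tree, the grid search has to be fit before the evaluation below (X_train and y_train are assumed to exist):

# Fit grid_rf to the training data
grid_rf.fit(X_train, y_train)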
  • Evaluate the optimal forest
# Import mean_squared_error from sklearn.metrics as MSE 
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred) ** (1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))
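As a side note, scikit-learn 0.22+ can return the RMSE directly from mean_squared_error, which is equivalent to the manual ** (1/2) above (the newest releases move this to a separate root_mean_squared_error function):

# Equivalent RMSE computation in scikit-learn 0.22+
rmse_test = MSE(y_test, y_pred, squared=False)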