01 Generalization Error
- Generalization Error = bias^2 + variance + irreducible error
- Bias: how much, on average, f^ differs from the true function f
- high bias -> underfitting
- Variance: how much f^ varies when fit on different training sets
- Bias-Variance tradeoff: lowering bias typically raises variance and vice versa (see the sketch below)
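The decomposition can be checked numerically. Below is a minimal sketch, with everything (synthetic data, model settings) assumed for illustration rather than taken from these notes: since we generate the data ourselves, the true function f is known, so we can refit a tree on bootstrap resamples and measure how far the average prediction strays from f (bias^2) and how much the predictions spread across resamples (variance).
# Minimal sketch: estimate bias^2 and variance by bootstrap resampling
# (synthetic data; f is known only because we construct it here)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
f = np.sin(X).ravel()                      # true function f
y = f + rng.normal(0, 0.3, 200)            # noisy observations

preds = []
for _ in range(100):
    # Resample with replacement, refit, and predict on the full grid
    idx = rng.randint(0, len(X), len(X))
    tree = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
    preds.append(tree.predict(X))
preds = np.array(preds)

bias_sq = np.mean((preds.mean(axis=0) - f)**2)   # squared bias
variance = np.mean(preds.var(axis=0))            # variance
print('bias^2: {:.3f}, variance: {:.3f}'.format(bias_sq, variance))
Raising max_depth in this sketch typically shrinks bias^2 and inflates variance, which is the tradeoff in action.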
02 Diagnose bias and variance problems
Estimating the Generalization Error
- Cannot be done directly
- f is unknown
- usually you only have one dataset
- noise is unpredictable
- Solution:
- split the data into training and test sets
- fit f^ to the training set
- evaluate the error of f^ on the unseen test set (roughly equal to the generalization error)
Better Model Evaluation with Cross-validation
- The test set should not be touched until we are confident about f^'s performance
- Evaluating f^ on the training set gives a biased, overly optimistic estimate; use cross-validation (CV) instead
- If f^ suffers from high variance: CV error > training error
- Overfitting.
- Decrease model complexity or gather more data.
- If f^ suffers from high bias: CV error ~ training error >> desired error
- Underfitting
- Increase model complexity or gather more relevant features
Example
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split
# Set SEED for reproducibility
SEED = 1
# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor
# Instantiate a DecisionTreeRegressor dt (min_samples_leaf as a fraction of samples)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)
- Evaluate the 10-fold CV error
# Import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score
# Compute the array containing the 10-fold CV MSEs
# (cross_val_score maximizes a score, so sklearn returns negated MSEs;
# the leading minus sign flips them back to positive values)
MSE_CV_scores = -cross_val_score(dt, X_train, y_train, cv=10,
                                 scoring='neg_mean_squared_error',
                                 n_jobs=-1)
# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV)) #CV RMSE: 5.14
- Evaluate the training error
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE
# Fit dt to the training set
dt.fit(X_train, y_train)
# Predict the labels of the training set
y_pred_train = dt.predict(X_train)
# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train)) # Train RMSE: 5.15
dt suffers from high bias because RMSE_CV ~ RMSE_train and both scores are greater than baseline_RMSE.
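baseline_RMSE is referenced above but never computed in the snippet. One common choice, assumed here for illustration and not necessarily what these notes originally used, is the RMSE of a naive model that always predicts the mean of the training labels:
# Hypothetical baseline: always predict the training-set mean
# (one common definition of baseline_RMSE; the original may differ)
import numpy as np
from sklearn.metrics import mean_squared_error as MSE

y_pred_baseline = np.full(len(y_test), y_train.mean())
baseline_RMSE = (MSE(y_test, y_pred_baseline))**(1/2)
print('Baseline RMSE: {:.2f}'.format(baseline_RMSE))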
03 Ensemble Learning
- Train different models on the same dataset
- Let each model make its prediction
- Meta-model: aggregates predictions of individual models
- Final prediction: more robust and less prone to errors
Voting Classifier
- Binary classification task
- N classifiers make predictions: P1, ..., PN
- Meta-model prediction: hard voting, i.e. the majority class among P1, ..., PN (see the sketch after this list)
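Hard voting is just a per-sample majority vote over the individual predictions. A minimal sketch of the aggregation step, using made-up toy prediction arrays for a binary task:
# Hard voting by hand: final prediction = majority class among P1..PN
# (toy 0/1 arrays for illustration only)
import numpy as np

P1 = np.array([1, 0, 1, 1])   # predictions of classifier 1
P2 = np.array([0, 0, 1, 1])   # predictions of classifier 2
P3 = np.array([1, 1, 1, 0])   # predictions of classifier 3

votes = np.vstack([P1, P2, P3])
# A sample is labeled 1 when more than half of the classifiers vote 1
y_final = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(y_final)   # [1 0 1 1]
This mirrors what VotingClassifier does with its default voting='hard' setting.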
Example
# Set seed for reproducibility
SEED = 1
# Import the individual classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier
# Instantiate lr
lr = LogisticRegression(random_state=SEED)
# Instantiate knn
knn = KNN(n_neighbors=27)
# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)
# Define the list classifiers as (name, estimator) tuples
classifiers = [('Logistic Regression', lr),
               ('K Nearest Neighbours', knn),
               ('Classification Tree', dt)]
- Evaluate individual classifiers
# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:
    # Fit clf to the training set
    clf.fit(X_train, y_train)
    # Predict the test set labels
    y_pred = clf.predict(X_test)
    # Calculate the test set accuracy
    accuracy = accuracy_score(y_test, y_pred)
    # Report clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))
Logistic Regression : 0.747
K Nearest Neighbours : 0.724
Classification Tree : 0.730
- Better performance with a Voting Classifier
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier
# Instantiate a VotingClassifier vc (hard voting is the default)
vc = VotingClassifier(estimators=classifiers)
# Fit vc to the training set
vc.fit(X_train, y_train)
# Predict the test set labels
y_pred = vc.predict(X_test)
# Calculate the test set accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy)) # Voting Classifier: 0.753