18 02 The Bias Variance Tradeoff - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

01 Generalization Error

  • Generalization Error = bias^2 + variance + irreducible error
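  • Bias: how much, on average, f^ differs from the true function f
    • high bias -> underfitting
  • Variance: how much f^ varies across different training sets
    • high variance -> overfitting
  • Bias-Variance tradeoff: as model complexity increases, bias decreases and variance increases; the goal is the complexity that minimizes the generalization error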

02 Diagnose bias and variance problems

Estimating the Generalization Error

  • Cannot be done directly because:
    • f is unknown
    • usually you only have one dataset
    • noise is unpredictable
  • Solution:
    • split the data into training and test sets
    • fit f^ to the training set
    • evaluate the error of f^ on the unseen test set (this approximates the generalization error)
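
The three steps above can be sketched as follows. The dataset is synthetic (generated with make_regression) and the model settings are arbitrary; both are assumptions made only so that the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

SEED = 1

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=SEED)

# 1. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# 2. Fit f^ to the training set
model = DecisionTreeRegressor(max_depth=4, random_state=SEED)
model.fit(X_train, y_train)

# 3. Evaluate f^ on the unseen test set
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print('Test RMSE: {:.2f}'.format(rmse_test))
```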

Better Model Evaluation with Cross-validation

  • Test set should not be touched until we are confident about f^'s performance
  • Evaluating f^ on the training set gives a biased (optimistic) estimate, because f^ has already seen those points
  • If f^ suffers from high variance: CV error > training error
    • Overfitting.
    • Decrease model complexity or gather more data.
  • If f^ suffers from high bias: CV error ~ training error >> desired error
    • Underfitting
    • Increase model complexity or gather more relevant features
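
The two rules above can be written as a small helper. This is a hypothetical sketch: the function name diagnose and the rel_gap threshold are not from the course, and the 10% gap is an arbitrary choice for deciding when the CV error counts as "greater than" the training error.

```python
def diagnose(cv_error, train_error, desired_error, rel_gap=0.10):
    """Rule-of-thumb diagnosis; rel_gap is an arbitrary illustrative threshold."""
    # CV error clearly above training error -> high variance
    if cv_error > train_error * (1 + rel_gap):
        return 'high variance (overfitting)'
    # CV error ~ training error, both above the desired error -> high bias
    if cv_error > desired_error and train_error > desired_error:
        return 'high bias (underfitting)'
    return 'no obvious problem'

print(diagnose(cv_error=8.0, train_error=2.0, desired_error=5.0))  # high variance (overfitting)
print(diagnose(cv_error=6.0, train_error=5.9, desired_error=4.0))  # high bias (underfitting)
print(diagnose(cv_error=4.1, train_error=4.0, desired_error=5.0))  # no obvious problem
```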

Example

  • Instantiate the model
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)
  • Evaluate the 10-fold CV error
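# Import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Compute the array containing the 10-fold CV MSEs
# (scikit-learn returns negated MSEs, hence the leading minus sign)
MSE_CV_scores = -cross_val_score(dt, X_train, y_train, cv=10,
                                 scoring='neg_mean_squared_error',
                                 n_jobs=-1)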

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV)) #CV RMSE: 5.14
  • Evaluate the training error
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train)) # Train RMSE: 5.15
  • dt suffers from high bias: RMSE_CV ~ RMSE_train, and both scores are greater than baseline_RMSE (a reference error that is defined outside these snippets).
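
For contrast, the same two measurements can expose a high-variance model. The sketch below is an assumption-laden illustration, not part of the course exercise: it uses synthetic data and a tree with no depth or leaf-size limits, which can memorize the training set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

SEED = 1

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Unconstrained tree: no max_depth, default min_samples_leaf
deep_dt = DecisionTreeRegressor(random_state=SEED)

# 10-fold CV RMSE on the training set
mse_cv = -cross_val_score(deep_dt, X_train, y_train, cv=10,
                          scoring='neg_mean_squared_error')
rmse_cv = np.sqrt(mse_cv.mean())

# Training RMSE
deep_dt.fit(X_train, y_train)
rmse_train = np.sqrt(mean_squared_error(y_train, deep_dt.predict(X_train)))

print('CV RMSE: {:.2f}'.format(rmse_cv))
print('Train RMSE: {:.2f}'.format(rmse_train))
```

Here the CV error should sit well above the training error, the pattern described earlier as overfitting.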

03 Ensemble Learning

  • Train different models on the same dataset
  • Let each model make its prediction
  • Meta-model: aggregates predictions of individual models
  • Final prediction: more robust and less prone to errors

Voting Classifier

  • Binary classification task
  • N classifiers make predictions: P1, ..., PN
  • Meta-model prediction: hard voting (the class predicted by the majority of the classifiers)
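
Hard voting can be sketched by hand with NumPy. The prediction values below are hypothetical, and the snippet assumes binary labels 0/1 and an odd number of classifiers so that ties cannot occur.

```python
import numpy as np

# Rows: classifiers P1..P3, columns: samples (hypothetical 0/1 predictions)
predictions = np.array([
    [1, 0, 1, 1],   # P1
    [0, 0, 1, 1],   # P2
    [1, 1, 0, 1],   # P3
])

# Count the votes for class 1 per sample, then take the majority
votes_for_1 = predictions.sum(axis=0)
hard_vote = (votes_for_1 > predictions.shape[0] / 2).astype(int)

print(hard_vote)  # [1 0 1 1]
```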

Example

  • Define the ensemble
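# Import the classifiers used in the ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier

# Set seed for reproducibility
SEED = 1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]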
  • Evaluate individual classifiers
# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict y_pred
    y_pred = clf.predict(X_test)

    # Calculate accuracy (true labels first, then predictions)
    accuracy = accuracy_score(y_test, y_pred)

    # Print clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))
 Logistic Regression : 0.747
 K Nearest Neighbours : 0.724
 Classification Tree : 0.730
  • Better performance with a Voting Classifier
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score (true labels first, then predictions)
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy)) # Voting Classifier: 0.753
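
To check that VotingClassifier's default behaviour is plain majority voting, the vote can be recomputed from its fitted sub-estimators. This is a self-contained sketch on synthetic data (an assumption, since X and y are not defined in these notes); it relies on the labels being 0/1 and on there being three classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic binary classification data (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=8, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED)

# voting='hard' is the default
vc = VotingClassifier(estimators=[
    ('lr', LogisticRegression(random_state=SEED)),
    ('knn', KNeighborsClassifier(n_neighbors=27)),
    ('dt', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)),
])
vc.fit(X_train, y_train)

# Recompute the majority vote from the fitted sub-estimators
individual = np.array([est.predict(X_test) for est in vc.estimators_])
manual_vote = (individual.sum(axis=0) >= 2).astype(int)

print(np.array_equal(manual_vote, vc.predict(X_test)))
```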