18 02 The Bias Variance Tradeoff - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

01 Generalization Error

Generalization Error = bias^2 + variance + irreducible error
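Bias: how much f^ differs from the true function f on average
- high bias -> underfitting
Variance: how much f^ varies across different training sets
- high variance -> overfitting
Bias-variance tradeoff: as model complexity increases, bias decreases and variance increases; the best complexity is the one that minimizes the generalization error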
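The decomposition above can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes a known true function f = sin, Gaussian noise, and polynomial fits of varying degree as stand-ins for models of varying complexity. The function name bias_variance and all of its parameters are hypothetical.

```python
import numpy as np

def bias_variance(degree, n_sets=200, n_train=30, noise_sd=0.3, seed=0):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    f = np.sin                              # assumed true function f
    x_test = np.linspace(0.0, np.pi, 50)    # fixed evaluation points
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        # Draw a fresh training set: same f, new inputs and new noise
        x = rng.uniform(0.0, np.pi, n_train)
        y = f(x) + rng.normal(0.0, noise_sd, n_train)
        coefs = np.polyfit(x, y, degree)    # fit f^ on this training set
        preds[i] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    # Squared gap between the average f^ and f, averaged over x_test
    bias_sq = np.mean((mean_pred - f(x_test)) ** 2)
    # Spread of f^ across training sets, averaged over x_test
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (0, 2, 6):
    b, v = bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.4f}, variance={v:.4f}")
```

In this setup the constant fit (degree 0) should show the largest bias and the degree-6 fit the largest variance, which is the tradeoff described above.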

02 Diagnose bias and variance problems

Estimating the Generalization Error

Cannot be done directly because:
- f is unknown
- usually you only have one dataset
- noise is unpredictable
Solution:
- split the data into training and test sets
- fit f^ to the training set
- evaluate the error of f^ on the unseen test set (roughly equal to generalization error)
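The three steps above can be sketched as follows. This is an illustrative example only: the synthetic data from make_regression and the choice of LinearRegression are assumptions, standing in for whatever dataset and model are actually being evaluated.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; any feature matrix X and target y would do)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

# Step 1: split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 2: fit f^ to the training set only
model = LinearRegression().fit(X_train, y_train)

# Step 3: the error on the unseen test set approximates the generalization error
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE: {rmse_test:.2f}")
```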

Better Model Evaluation with Cross-validation

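The test set should not be touched until we are confident about f^'s performance
Evaluating f^ on the training set gives a biased (optimistic) estimate, because f^ has already seen all training points
Solution: cross-validation (CV) on the training set, e.g. K-fold CV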
If f^ suffers from high variance: CV error > training error
- Overfitting.
- Decrease model complexity or gather more data.
If f^ suffers from high bias: CV error ~ training error >> desired error
- Underfitting
- Increase model complexity or gather more relevant features
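The two rules above can be written as a small helper. This is only a sketch: the function name diagnose and the 10% tolerance used for "roughly equal" are hypothetical choices, not part of the course material, and the right tolerance depends on the problem.

```python
def diagnose(train_error, cv_error, desired_error, rel_tol=0.1):
    """Classify a fit from its training, CV, and desired errors.

    rel_tol is a hypothetical threshold: errors within 10% of each
    other are treated as "roughly equal".
    """
    gap = cv_error - train_error
    if gap > rel_tol * cv_error:
        return "high variance"   # CV error clearly above training error
    if cv_error > desired_error * (1 + rel_tol):
        return "high bias"       # errors roughly equal, but both too high
    return "ok"

# Hypothetical error values, for illustration only
print(diagnose(train_error=1.0, cv_error=5.0, desired_error=4.0))  # high variance
print(diagnose(train_error=5.0, cv_error=5.1, desired_error=4.0))  # high bias
```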

Example

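Instantiate the model

# Import train_test_split and DecisionTreeRegressor
# (X and y are assumed to be already loaded: feature matrix and target)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor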

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)

Evaluate the 10-fold CV error

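# Import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Compute the array containing the 10-fold CV MSEs
# (scikit-learn returns negated MSEs, hence the leading minus sign)
MSE_CV_scores = -cross_val_score(dt, X_train, y_train, cv=10,
                                 scoring='neg_mean_squared_error',
                                 n_jobs=-1)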

# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV)) #CV RMSE: 5.14

Evaluate the training error

# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train)) # Train RMSE: 5.15

dt suffers from high bias (it underfits the training set): RMSE_CV ~ RMSE_train, and both scores are greater than baseline_RMSE.

03 Ensemble Learning

Train different models on the same dataset
Let each model make its prediction
Meta-model: aggregates predictions of individual models
Final prediction: more robust and less prone to errors

Voting Classifier

Binary classification task
N classifiers make predictions: P1, ..., PN
Meta-model prediction: hard voting (the class predicted by the majority of classifiers)
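Hard voting itself is simple enough to write by hand. The sketch below is an illustration with made-up predictions, not how scikit-learn implements it; the function name hard_vote is hypothetical.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the majority label for each sample.

    predictions: list of per-classifier label lists, all of the same length.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        # Count the votes of all classifiers for sample i
        votes = Counter(p[i] for p in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Made-up binary predictions of three classifiers on four samples
p1 = [1, 0, 1, 1]
p2 = [1, 1, 0, 1]
p3 = [0, 0, 0, 1]
print(hard_vote([p1, p2, p3]))  # [1, 0, 0, 1]
```

With an odd number of classifiers on a binary task there are no ties, so the majority label is always well defined.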

Example

Define the ensemble

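# Import the individual classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier

# Set seed for reproducibility
SEED = 1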

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

Evaluate individual classifiers

# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict y_pred
    y_pred = clf.predict(X_test)

    # Calculate accuracy (true labels first, predictions second)
    accuracy = accuracy_score(y_test, y_pred)

    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

 Logistic Regression : 0.747
 K Nearest Neighbours : 0.724
 Classification Tree : 0.730

Better performance with a Voting Classifier

# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc (uses hard voting by default)
vc = VotingClassifier(estimators=classifiers)

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score (true labels first, predictions second)
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy)) # Voting Classifier: 0.753