7.5.1. Model Evaluation and Refinement

Model Evaluation

  • In-sample evaluation tells us how well our model will fit the data used to train it
  • Problem?
    • It does not tell us how well the trained model can be used to predict new data
  • Solution?
    • Use in-sample data (the training data) to build and train the model
    • Use out-of-sample evaluation on a separate test set to estimate how the model performs on new data

Training/Testing Sets

  • Split dataset into:
    • Training set (70%)
    • Testing set (30%)
  • Build and train the model with the training set
  • Use testing set to assess the performance of a predictive model
  • When we have completed testing our model, we should use all of the data to train the model to get the best performance

Function train_test_split()

  • Split data into random train and test subsets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3,
                                                    random_state=0)
  • x_data: features or independent variables
  • y_data: dataset target **df['price']**
  • x_train, y_train: parts of available data as training set
  • x_test, y_test: parts of available data as testing set
  • test_size: percentage of the data for testing (here 30%)
  • random_state: seed for the random-number generator used for random sampling
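As a quick check of the call above, here is a minimal sketch (using synthetic NumPy arrays rather than the course's car dataset) that verifies the 70/30 split proportions:

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the target (e.g., selected df columns and df['price'])
x_data = np.random.rand(100, 3)
y_data = np.random.rand(100)

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3,
                                                    random_state=0)

print(x_train.shape, x_test.shape)  # (70, 3) (30, 3)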

Generalization Performance

  • Generalization error is a measure of how well our model does at predicting previously unseen data
  • The error we obtain using our testing data is an approximation of this error, as illustrated in the sketch below
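A minimal sketch of this idea (using synthetic data and a plain linear regression, not the course's exact example): the mean squared error on the held-out test set is the quantity used to approximate the generalization error.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the used-car dataset
x_data = np.random.rand(200, 1)
y_data = 3 * x_data[:, 0] + np.random.normal(scale=0.1, size=200)

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3,
                                                    random_state=0)

lr = LinearRegression()
lr.fit(x_train, y_train)

# The test-set error approximates how the model will do on unseen data
print(mean_squared_error(y_test, lr.predict(x_test)))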


Cross Validation

  • One of the most common out-of-sample evaluation techniques
  • More effective use of data (each observation is used for both training and testing)

Function cross_val_score()

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lr = LinearRegression()
scores = cross_val_score(lr, x_data, y_data, cv=3)  # R^2 score for each of the 3 folds
np.mean(scores)  # average out-of-sample R^2

Function cross_val_predict()

  • It returns the prediction that was obtained for each element when it was in the test set
  • Has a similar interface to **cross_val_score()**
from sklearn.model_selection import cross_val_predict

yhat = cross_val_predict(lr, x_data, y_data, cv=3)  # out-of-fold predictions for every observation
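For example (assuming lr, x_data and y_data are defined as in the cross_val_score snippet above), the out-of-fold predictions can be compared directly with the actual target values:

from sklearn.metrics import r2_score

# Each element of yhat was predicted by a model that did not see that observation during training
print(r2_score(y_data, yhat))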

Question

Consider the following lines of code. How many partitions or folds are used in the function cross_val_score?

from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, x_data, y_data, cv=10)
  • 4
  • **10**
  • 5

Correct

Practice Quiz: Model Evaluation

Question 1

What is the correct use of the "train_test_split" function such that 90% of the data samples will be utilized for training, the parameter "random_state" is set to zero, and the input variables for the features and targets are x_data and y_data, respectively?

  • ~~train_test_split(x_data, y_data, test_size=0.9, random_state=0)~~
  • **train_test_split(x_data, y_data, test_size=0.1, random_state=0)**

Correct

Overfitting, Underfitting and Model Selection

Model Selection

Assume the data is generated by a polynomial function plus some noise:

y(x) + \text{noise}

We then fit candidate models of increasing polynomial order, for example:

\hat{y} = b_0 + b_1x

\hat{y} = b_0 + b_1x + b_2x^2

\hat{y} = b_0 + b_1x + b_2x^2 + b_3x^3 + b_4x^4 + b_5x^5 + b_6x^6 + b_7x^7 + b_8x^8

\hat{y} = b_0 + b_1x + b_2x^2 + b_3x^3 + b_4x^4 + b_5x^5 + b_6x^6 + b_7x^7 + b_8x^8 + b_9x^9 + b_{10}x^{10} + b_{11}x^{11} + b_{12}x^{12} + b_{13}x^{13} + b_{14}x^{14} + b_{15}x^{15} + b_{16}x^{16}


  • We select the order that minimizes the test error; in this case, it was eight.
  • Any order to the left of that minimum (lower order) would be considered underfitting; anything to the right (higher order) is overfitting. A minimal code sketch of this selection procedure is shown below.
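A sketch of the selection loop (assuming x_train, x_test, y_train, y_test come from an earlier train_test_split of a single feature, stored as a 2-D array, and the price target):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

test_errors = {}
for order in range(1, 11):
    pr = PolynomialFeatures(degree=order)
    x_train_pr = pr.fit_transform(x_train)
    x_test_pr = pr.transform(x_test)

    lr = LinearRegression()
    lr.fit(x_train_pr, y_train)

    # Mean squared error on the test set for this polynomial order
    test_errors[order] = mean_squared_error(y_test, lr.predict(x_test_pr))

# Select the order with the smallest test error
best_order = min(test_errors, key=test_errors.get)
print(best_order, test_errors[best_order])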

Question

True or False: the following plot shows that, as the order of the polynomial increases, the mean square error of our model decreases on the test data.

  • **False**
  • True

Correct. This plot shows the training error, not the test error.

Practice Quiz: Overfitting, Underfitting and Model Selection

TOTAL POINTS 1

Question 1

In the following plot, the vertical axis shows the mean square error and the horizontal axis represents the order of the polynomial. The red line represents the training error and the blue line the test error. Should you select the 16th-order polynomial?

  • **no**
  • yes

Correct. We use the test error to determine the model error. For this order of the polynomial, the training error is smaller but the test error is larger.

Ridge Regression

Ridge regression is a regularized form of regression that is employed in a multiple regression model when multicollinearity occurs. Multicollinearity is when there is a strong relationship among the independent variables. Ridge regression is also very commonly used with polynomial regression.

y = 1 + 2x - 3x^2 - 4x^3 + x^4

  • Overfitting is a big problem when you have multiple independent variables, or features.
  • The estimated function in blue does a good job at approximating the true function.

  • In many cases real data has outliers. For example, the blue point shown above does not appear to come from the function in orange. If we use a tenth order polynomial function to fit the data, the estimated function in blue is incorrect, and is not a good estimate of the actual function in orange.

\hat{y} = 1 + 2x - 3x^2 - 2x^3 - 12x^4 - 40x^5 + 80x^6 + 71x^7 - 141x^8 - 38x^9 + 75x^{10}

  • If we examine the expression for the estimated function, we see the estimated polynomial coefficients have a very large magnitude. This is especially evident for the higher order polynomials.

  • Ridge regression controls the magnitude of these polynomial coefficients by introducing the parameter alpha. Alpha is a parameter we select before fitting or training the model. Each row in the following table represents an increasing value of alpha, so we can see how different values of alpha change the model: the table lists the polynomial coefficients for different values of alpha.
  • The columns correspond to the different polynomial coefficients, and the rows correspond to the different values of alpha. As alpha increases, the parameters get smaller; this is most evident for the higher-order polynomial features. But alpha must be selected carefully: if alpha is too large, the coefficients approach zero and the model underfits the data; if alpha is zero, the overfitting is evident.
  • For alpha equal to 0.001, the overfitting begins to subside. For alpha equal to 0.01, the estimated function tracks the actual function. When alpha equals one, we see the first signs of underfitting; the estimated function does not have enough flexibility. At alpha equal to 10, we see extreme underfitting; it does not even track the two points. In order to select alpha, we use cross-validation. A code sketch of this experiment is shown after this list.
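A rough sketch of that experiment (the data below is synthetic, generated from the fourth-order polynomial above plus noise, rather than the lecture's exact example):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples from y = 1 + 2x - 3x^2 - 4x^3 + x^4
x = np.linspace(-1, 1, 30)
y = 1 + 2*x - 3*x**2 - 4*x**3 + x**4 + np.random.normal(scale=0.2, size=30)

# Tenth-order polynomial features, like the overfitting example above
x_poly = PolynomialFeatures(degree=10).fit_transform(x.reshape(-1, 1))

for alpha in [0.0001, 0.001, 0.01, 1, 10]:
    RidgeModel = Ridge(alpha=alpha)
    RidgeModel.fit(x_poly, y)
    # Larger alpha shrinks the coefficients, especially the higher-order ones
    print(alpha, np.round(RidgeModel.coef_, 2))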

Question

Consider the following fourth order polynomial, fitted with Ridge Regression; should we increase or decrease the parameter alpha?

  • **Decrease**
  • Increase

Correct. The model seems to be underfitting the data, so we should decrease the value of the parameter alpha.

from sklearn.linear_model import Ridge
RidgeModel = Ridge(alpha=0.1)  # alpha controls the amount of regularization
RidgeModel.fit(x, y)           # train the model
yhat = RidgeModel.predict(x)   # make predictions
  • To make a prediction using ridge regression, import Ridge from sklearn.linear_model. Create a Ridge object using the constructor; the parameter alpha is one of the arguments of the constructor. We train the model using the fit method. To make a prediction, we use the predict method.

  • The overfitting problem is even worse if we have lots of features. The following plot shows the different values of R-squared on the vertical axis.

  • The horizontal axis represents different values for alpha. We use several features from our used-car dataset and a second-order polynomial function. The training data is in red and the validation data is in blue. We see that as the value of alpha increases, the R-squared on the validation data increases and converges at approximately 0.75.
  • In this case, we select the maximum value of alpha, because running the experiment for higher values of alpha has little impact. Conversely, as alpha increases, the R-squared on the training data decreases. This is because the term alpha prevents overfitting; this may improve the results on unseen data, but the model fits the training data less well. A sketch of this alpha sweep is shown below.
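A sketch of that alpha sweep (assuming x_train, y_train, x_test, y_test hold polynomial features of the car data and the prices, with the held-out split playing the role of validation data):

from sklearn.linear_model import Ridge

for alpha in [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]:
    RidgeModel = Ridge(alpha=alpha)
    RidgeModel.fit(x_train, y_train)
    # score() returns R^2; training R^2 falls as alpha grows, validation R^2 typically rises then levels off
    print(alpha, RidgeModel.score(x_train, y_train), RidgeModel.score(x_test, y_test))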

Practice Quiz: Ridge Regression

TOTAL POINTS 1

Question 1

The following models were all trained on the same data. Select the model with the highest value for alpha:

  • a
  • b
  • c

Correct. a, b: the model that exhibits the most underfitting is usually the model with the highest value for alpha. c: the model that exhibits overfitting is usually the model with the lowest value for alpha.

Grid Search

Grid Search allows us to scan through multiple free parameters with only a few lines of code.

Hyperparameters

  • The term alpha in Ridge regression is called a hyperparameter
  • Scikit-learn has a means of automatically iterating over these hyperparameters using cross-validation, called Grid Search

Question

What data do we use to pick the best hyperparameter?

  • Training data
  • **Validation data**
  • Test data

Correct. The training data is used to get the model parameters, not the hyperparameters.

parameters = [{'alpha': [1, 10, 100, 1000]}]

The parameter values for Grid Search are passed as a Python list that contains a Python dictionary.

**'alpha'**: The key is the name of the free parameter.

**[1, 10, 100, 1000]**: The value of the dictionary is the different values of the free parameter.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]

RR = Ridge()

Grid1 = GridSearchCV(RR, parameters1, cv=4)

Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)

Grid1.best_estimator_

scores = Grid1.cv_results_
scores['mean_test_score']
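Once the grid search has been fit, the best hyperparameter value and its score can be read from the fitted object (best_estimator_, best_params_ and best_score_ are standard GridSearchCV attributes):

BestRR = Grid1.best_estimator_   # Ridge model refit with the best alpha found by the search
print(Grid1.best_params_)        # the best value of alpha, as a dictionary
print(Grid1.best_score_)         # mean cross-validated R^2 for that alpha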

One of the advantages of Grid Search is how quickly we can test multiple parameters.

For example, ridge regression has the option to normalize the data (the normalize argument has been removed in newer versions of scikit-learn, but it still illustrates a grid over two parameters).

parameters = [{'alpha': [1, 10, 100, 1000], 'normalize': [True, False]}]

**'alpha': [1, 10, 100, 1000]**: The term alpha is the first element in the dictionary.

**'normalize': [True, False]**: The second element is the normalize option.

**'normalize'**: The key is the name of the parameter.

**[True, False]**: The value lists the different options; in this case we can either normalize the data or not, so the values are True and False respectively.

The dictionary acts as a table or grid that contains the two different free parameters.

As before, we need the ridge regression object or model. The procedure is similar except that we have a table or grid of different parameter values. The output is the score for all the different combinations of parameter values. The code is also similar.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

parameters2 = [{'alpha': [1, 10, 100, 1000], 'normalize': [True, False]}]

RR = Ridge()

# return_train_score=True is needed so cv_results_ also contains 'mean_train_score' (used below)
Grid1 = GridSearchCV(RR, parameters2, cv=4, return_train_score=True)

Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)

Grid1.best_estimator_

scores = Grid1.cv_results_

We can print out the score for the different free parameter values.

for param, mean_test, mean_train in zip(scores['params'], scores['mean_test_score'],
                                        scores['mean_train_score']):
    print(param, 'R^2 on test data:', mean_test, 'R^2 on train data:', mean_train)

Question

How many types of parameters does the following dictionary contain?

parameters = [{'alpha': [0.001, 0.1, 1, 10, 100], 'normalize': [True, False]}]
  • **2**
  • 9
  • 4

Correct

Lesson Summary

In this lesson, you have learned how to:

Identify over-fitting and under-fitting in a predictive model: Overfitting occurs when a function is too closely fit to the training data points and captures the noise of the data. Underfitting refers to a model that can't model the training data or capture the trend of the data.

Apply Ridge Regression to linear regression models: Ridge regression is a regularized regression that is employed in a multiple regression model when multicollinearity occurs.

Tune hyper-parameters of an estimator using Grid search: Grid search is a time-efficient tuning technique that exhaustively evaluates an estimator over a specified grid of hyperparameter values and selects the combination with the best cross-validated performance.