7.4.1. Model Development

  • A model can be thought of as a mathematical equation used to predict a value from one or more other values
  • It relates one or more independent variables to a dependent variable

Example:

independent variables or features ('highway-mpg': 55 mpg) 
→ MODEL
→ dependent variables ('predicted price': $5000)
  • Usually, the more relevant data you have, the more accurate your model is
('highway-mpg', 'curb-weight', 'engine-size')
→ MODEL
→ ('price': $5400)
  • In addition to getting more data you can try different types of models.
  • In this course you will learn about:
    1. Simple Linear Regression
    2. Multiple Linear Regression
    3. Polynomial Regression

Linear Regression and Multiple Linear Regression

  • Simple linear regression uses one independent variable to make a prediction
  • Multiple linear regression uses two or more independent variables to make a prediction

Simple Linear Regression (SLR)

  1. The predictor (independent) variable - x

  2. The target (dependent) variable - y

    y = b_0 + b_1x

    • b_0: the intercept
    • b_1: the slope

Fitting a Simple Linear Model Estimator

  • X: Predictor variable
  • Y: Target variable
  1. Import linear_model from scikit-learn
from sklearn.linear_model import LinearRegression
  2. Create a Linear Regression object using the constructor:
lm = LinearRegression()

Fitting a Simple Linear Model

  • We define the predictor variable and target variable
X = df[['highway-mpg']]
Y = df['price']
  • Then use lm.fit(X, Y) to fit the model, i.e., find the parameters $b_0$ and $b_1$
lm.fit(X, Y)
  • We can obtain a prediction
Yhat = lm.predict(X)

SLR - Estimated Linear Model

  • We can view the intercept (b_0):
lm.intercept_
38423.305858
  • We can also view the slope (b_1):
lm.coef_
-821.73337832
  • The Relationship between Price and Highway MPG is given by:
  • Price = 38423.31 - 821.73 * highway-mpg, which corresponds to $\widehat{Y} = b_0 + b_1x$
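  • A minimal sketch (assuming df is a pandas DataFrame with 'highway-mpg' and 'price' columns, as above) that fits the simple linear model and checks the estimated equation at 30 mpg:

# Sketch only: fit an SLR model and evaluate the estimated equation at one point.
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(df[['highway-mpg']], df['price'])      # estimates b_0 and b_1

b0, b1 = lm.intercept_, lm.coef_[0]
print(b0 + b1 * 30)                           # manual prediction for 30 mpg
print(lm.predict([[30]]))                     # same prediction via the estimator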

Multiple Linear Regression (MLR)

This method is used to explain the relationship between:

  • One continuous target (Y) variable

  • Two or more predictor (X) variables

  • $\widehat{Y} = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4$

    • b_0: intercept (X=0)
    • b_1: the coefficient or parameter of x_1
    • b_2: the coefficient or parameter of x_2 and so on...
  • $\widehat{Y} = 1 + 2x_1 + 3x_2$

    • The variables x_1 and x_2 can be visualized on a 2D plane

Fitting a Multiple Linear Model Estimator

  1. We can extract the 4 predictor variables and store them in the variable Z
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
  2. Then train the model as before:
lm.fit(Z, df['price'])
  3. We can also obtain a prediction
Yhat = lm.predict(Z)

MLR - Estimated Linear Model

  1. Find the intercept (b_0)
lm.intercept_
-15678.742628061467
  2. Find the coefficients ($b_1, b_2, b_3, b_4$)
lm.coef_
array([52.65851272, 4.69878948, 81.95906216, 33.58258185])

The Estimated Linear Model:

  • Price = -15678.74 + (52.66) * horsepower + (4.70) * curb-weight + (81.96) * engine-size + (33.58) * highway-mpg
  • $\widehat{Y} = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4$
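  • As an illustrative sketch (assuming lm has been fit on Z as above), the estimated equation can be assembled directly from lm.intercept_ and lm.coef_:

# Sketch only: print the estimated MLR equation from the fitted estimator.
features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']
terms = ' + '.join(f'({coef:.2f}) * {name}' for coef, name in zip(lm.coef_, features))
print(f'Price = {lm.intercept_:.2f} + {terms}')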

Practice Quiz: Linear Regression and Multiple Linear Regression

TOTAL POINTS 2

Question 1

Consider the following lines of code. Which variable contains the predicted values?

from sklearn.linear_model import LinearRegression 
lm=LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
Yhat=lm.predict(X)
  • Y
  • X
  • Yhat

Correct

Question 2

Consider the following equation:

$y = b_0 + b_1x$

What is the parameter b_0 (b subscript 0)?

  • the predictor or independent variable
  • the target or dependent variable
  • the intercept
  • the slope

Correct

Model Evaluation using Visualization

Regression Plot

Why use regression plot?

It gives us a good estimate of:

  1. The relationship between two variables
  2. The strength of the correlation
  3. The direction of the relationship (positive or negative)

Regression Plot shows us a combination of:

  • The scatterplot: where each point represents a different y
  • The fitted linear regression line ($\widehat{y}$)
import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(x='highway-mpg', y='price', data=df)
plt.ylim(0, )

Residual Plot

  • Look at the spread of the residuals:
    • Residuals randomly spread out around the x-axis: a linear model is appropriate for the data
    • Residuals not randomly spread out around the x-axis: a nonlinear model may be more appropriate
    • Residuals not randomly spread out around the x-axis, and the variance appears to change with x: the linear model is likely not appropriate
import seaborn as sns

sns.residplot(df['highway-mpg'], df['price'])
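  • A minimal sketch (assuming lm has been fit on df[['highway-mpg']] and df['price'] as above) that computes the residuals by hand and summarizes their spread:

# Sketch only: residuals are the differences between actual and predicted values.
residuals = df['price'] - lm.predict(df[['highway-mpg']])

print(residuals.mean())   # close to zero for a least-squares fit
print(residuals.std())    # rough measure of the spread around the x-axis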

Distribution Plots

Compare the distribution plots:

  • The fitted values that result from the model
  • The actual values

MLR - Distribution Plots

import seaborn as sns

ax1 = sns.distplot(df['price'], hist=False, color='r', label='Actual Value')

sns.distplot(Yhat, hist=False, color='b', label='Fitted Values', ax=ax1)
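  • Note that distplot is deprecated in recent seaborn releases; a rough equivalent of the plot above using kdeplot would be:

# Sketch only: kernel density curves of actual vs. fitted values (newer seaborn versions).
import seaborn as sns

ax1 = sns.kdeplot(df['price'], color='r', label='Actual Value')
sns.kdeplot(Yhat, color='b', label='Fitted Values', ax=ax1)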

Polynomial Regression and Pipelines

Polynomial Regression

  • A special case of the general linear regression model
  • Useful for describing curvilinear relationships

Curvilinear relationship:

By squaring or setting higher-order terms of the predictor variables

  • Quadratic - 2nd order
    • \widehat{Y} = b_0 + b_1x_1 + b_2(x_1)^2

  • Cubic - 3rd order
    • \widehat{Y} = b_0 + b_1x_1 + b_2(x_1)^2 + b_3(x_1)^3

  • Higher order
    • \widehat{Y} = b_0 + b_1x_1 + b_2(x_1)^2 + b_3(x_1)^3 + ...

Example:

  1. Calculate Polynomial of 3rd order
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
  2. Print out the model
print(p)

-1.557(x_1)^3 + 204.8(x_1)^2 + 8965x_1 + 1.37\times10^5
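  • A self-contained sketch of the same idea on made-up data (the x and y values below are illustrative only, not from the course dataset):

# Sketch only: fit and evaluate a 3rd-order polynomial with numpy.
import numpy as np

x = np.array([10, 20, 30, 40, 50], dtype=float)                   # e.g. highway-mpg
y = np.array([30000, 18000, 13000, 10000, 9000], dtype=float)     # e.g. price

f = np.polyfit(x, y, 3)   # coefficients of the 3rd-order polynomial
p = np.poly1d(f)          # callable polynomial object
print(p)                  # pretty-printed polynomial
print(p(30))              # evaluate the fit at x = 30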

Polynomial Regression with More than One Dimension

  • Polynomial regression can also be extended to more than one dimension (multiple predictor variables)

    \widehat{Y} = b_0 + b_1X_1 + b_2X_2 + b_3X_1X_2 + b_4(X_1)^2 + b_5(X_2)^2 + ...

  • The "preprocessing" library in scikit-learn

from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2, include_bias=False)

x_polly = pr.fit_transform(x[['horsepower', 'curb-weight']])
  • As a simple check, we can apply the transform to a single sample [1, 2]:
pr = PolynomialFeatures(degree=2, include_bias=False)
pr.fit_transform([[1, 2]])
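  • To see what the transform actually generates, a small sketch (assuming scikit-learn 1.0 or later for get_feature_names_out):

# Sketch only: the polynomial features generated for the single sample [1, 2].
from sklearn.preprocessing import PolynomialFeatures

pr = PolynomialFeatures(degree=2, include_bias=False)
print(pr.fit_transform([[1, 2]]))                 # [[1. 2. 1. 2. 4.]]
print(pr.get_feature_names_out(['x1', 'x2']))     # ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']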

Pre-processing

  • For example, we can normalize each feature simultaneously
from sklearn.preprocessing import StandardScaler
SCALE = StandardScaler()
SCALE.fit(x_data[['horsepower', 'highway-mpg']])
x_scale = SCALE.transform(x_data[['horsepower', 'highway-mpg']])
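  • A self-contained sketch of StandardScaler on made-up numbers (illustrative values only):

# Sketch only: each column is rescaled to zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[100.0, 25.0], [150.0, 30.0], [200.0, 35.0]])   # e.g. horsepower, highway-mpg
SCALE = StandardScaler()
SCALE.fit(data)                  # learns the mean and standard deviation of each column
print(SCALE.transform(data))     # rescaled data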

Pipelines

  • There are many steps to getting a prediction; a pipeline performs them sequentially

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(degree=2)), ('model', LinearRegression())]
  • Pipeline constructor

pipe = Pipeline(Input)

  • We can train the pipeline object
pipe.fit(df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y)

yhat = pipe.predict(df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
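  • An end-to-end sketch of the same pipeline on a tiny made-up DataFrame (the values are illustrative only):

# Sketch only: scale -> polynomial features -> linear regression, trained and used in one object.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

toy = pd.DataFrame({'horsepower':  [110, 150, 200, 95],
                    'curb-weight': [2500, 2800, 3200, 2300],
                    'engine-size': [130, 160, 210, 120],
                    'price':       [13000, 17000, 25000, 11000]})

Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2)),
         ('model', LinearRegression())]
pipe = Pipeline(Input)

pipe.fit(toy[['horsepower', 'curb-weight', 'engine-size']], toy['price'])
print(pipe.predict(toy[['horsepower', 'curb-weight', 'engine-size']]))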

Measures for In-Sample Evaluation

  • A way to numerically determine how well the model fits the dataset.
    • Two important measures to determine the fit of a model:
      • Mean Squared Error (MSE)
      • R-squared (R^2)

Mean Squared Error (MSE)

  • For each sample, take the difference between the actual value $y$ and the predicted value $\widehat{y}$, square it, and average these squared differences over all samples.

  • In python
from sklearn.metrics import mean_squared_error

mean_squared_error(df['price'], Y_predict_simple_fit)
3163502.944639888
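  • The same number can be reproduced by hand, which makes the definition concrete (a sketch, assuming lm is the simple linear model fit above):

# Sketch only: MSE by hand vs. sklearn's mean_squared_error.
import numpy as np
from sklearn.metrics import mean_squared_error

Y_predict_simple_fit = lm.predict(df[['highway-mpg']])
manual_mse = np.mean((df['price'] - Y_predict_simple_fit) ** 2)

print(manual_mse)                                               # average of squared errors
print(mean_squared_error(df['price'], Y_predict_simple_fit))    # should match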

R-squared / R^2

  • The Coefficient of Determination or R squared (R^2)
  • Is a measure to determine how close the data is to the fitted regression line.
  • R^2: the percentage of variation of the target variable (Y) that is explained by the linear model.
  • Think of it as comparing a regression model to a simple model, i.e., the mean of the data points

Coefficient of Determination (R^2)

  • In this example the average of the data points \overline{y} is 6

  • R^2 = (1 - \dfrac{\textsf{MSE of regression line}} {\textsf{MSE of the average of the data}})

  • The blue line represents the regression line

  • The blue squares represent the MSE of the regression line

  • The red line represents the average value of the data points

  • The red squares represent the MSE of the red line

  • We see the area of the blue squares is much smaller than the area of the red squares

  • In this case, the ratio of the areas of the MSEs is close to zero

  • R^2 = (1 - \dfrac{\textsf{MSE of regression line}}{\textsf{MSE of }\,\overline{y}}) \approx (1 - 0) = 1

    • We get a value near one, this means the line is a good fit for the data.
  • An Example of a line that does not fit the data well

  • R^2 = (1 - \dfrac{\textsf{MSE of regression line}}{\textsf{MSE of }\,\overline{y}}) \approx (1 - 1) = 0

    • The ratio of the areas is close to one, so the R^2 is near zero. This line performs about the same as just using the average of the data points; therefore, this line did not perform well.
  • Generally, the value of R^2 is between 0 and 1.

  • We can calculate the R^2 as follows

X = df[['highway-mpg']]
Y = df['price']

lm.fit(X, Y)

lm.score(X, Y)
0.496591188
  • From the value that we get from this example, we can say that approximately 49.66% of the variation of price is explained by this simple linear model.
  • Your R^2 value is usually between 0 and 1. If your R^2 is negative, it can be due to overfitting.
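  • A sketch (assuming lm, X, and Y from the code above) showing that lm.score reproduces the 1 - MSE-ratio definition of R^2:

# Sketch only: R^2 computed by hand as 1 - MSE(model) / MSE(mean of the data).
import numpy as np

Yhat = lm.predict(X)
mse_model = np.mean((Y - Yhat) ** 2)      # MSE of the regression line
mse_mean = np.mean((Y - Y.mean()) ** 2)   # MSE of the average of the data

print(1 - mse_model / mse_mean)           # should match lm.score(X, Y)
print(lm.score(X, Y))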

Prediction and Decision Making

Decision Making: Determining a Good Model Fit

To determine final best fit, we look at a combination of:

  • Do the predicted values make sense
  • Visualization
  • Numerical measures for evaluation
  • Comparing Models

Do the predicted values make sense

  • First we train the model
lm.fit(df[['highway-mpg']], df['price'])
  • Let's predict the price of a car with 30 highway-mpg
lm.predict(np.array(30.0).reshape(-1, 1))
  • Result: $13771.30
lm.coef_
-821.73337832
  • Price = 38423.31 - 821.73 * highway-mpg

  • Using the numpy function arange to generate a sequence from 1 to 100

import numpy as np

new_input = np.arange(1, 101, 1).reshape(-1, 1)
  • Can predict new values
yhat = lm.predict(new_input)
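  • Putting these steps together, a sketch (assuming lm is fit on df[['highway-mpg']] and df['price']) that plots the predicted prices over that range:

# Sketch only: predicted price as a function of highway-mpg over the generated inputs.
import numpy as np
import matplotlib.pyplot as plt

new_input = np.arange(1, 101, 1).reshape(-1, 1)
yhat = lm.predict(new_input)

plt.plot(new_input, yhat)
plt.xlabel('highway-mpg')
plt.ylabel('predicted price')
plt.show()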

Visualization

  • Simply visualizing your data with a regression plot gives a quick sense of how well the model fits

Residual Plot

Visualization - Multiple linear regression

Numerical measures for Evaluation

The figures show examples of models with mean squared errors of 3495, 3652, and 12870.

  • As the mean squared error increases, the targets get further from the predicted points.

R squared is another popular method to evaluate your model. It tells you how well your regression line fits the data. R squared values range from zero to one. R squared tells us what percent of the variability in the dependent variable is accounted for by the regression on the independent variables. An R squared of 1 means that all movements of the dependent variable are completely explained by movements in the independent variables.

In this plot we see the target points in red and the predicted line in blue. With an R squared of 0.9986, the model appears to be a good fit: more than 99% of the variability of the predicted variable is explained by the independent variables.

This model has an R squared of 0.9226; there is still a strong linear relationship, and the model is still a good fit.

With an R squared of 0.806, we can visually see that the values are scattered around the line, but they are still close to it, and we can say that about 80 percent of the variability of the predicted variable is explained by the independent variables.

An R squared of 0.61 means that approximately 61 percent of the observed variation can be explained by the independent variables.

  • An acceptable value for R squared depends on what field you are studying and what your use case is. Falk and Miller (1992) suggest that an acceptable R squared value should be at least 0.10.

Comparing MLR and SLR

Does a lower Mean Square Error imply better fit?

  • Not necessarily
  1. The Mean Square Error for a Multiple Linear Regression model will be smaller than the Mean Square Error for a Simple Linear Regression model, since the errors of the data will decrease when more variables are included in the model (see the sketch below).
  2. Polynomial regression will also have a smaller Mean Square Error than linear regression.
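  • A sketch (assuming df contains the columns used earlier) that makes this comparison concrete; the MLR in-sample MSE is typically smaller, but that alone does not prove it is the better model:

# Sketch only: compare the in-sample MSE of the simple and multiple linear models.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

slr_features = ['highway-mpg']
mlr_features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']

slr = LinearRegression().fit(df[slr_features], df['price'])
mlr = LinearRegression().fit(df[mlr_features], df['price'])

print(mean_squared_error(df['price'], slr.predict(df[slr_features])))
print(mean_squared_error(df['price'], mlr.predict(df[mlr_features])))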

Lesson Summary

In this lesson, you have learned how to:

Define the explanatory variable and the response variable: Define the response variable (y) as the focus of the experiment and the explanatory variable (x) as a variable used to explain the change of the response variable. Understand the difference between Simple Linear Regression, which concerns the study of only one explanatory variable, and Multiple Linear Regression, which concerns the study of two or more explanatory variables.

Evaluate the model using Visualization: By visually representing the errors of a variable using scatterplots and interpreting the results of the model.

Identify alternative regression approaches: Use a Polynomial Regression when the Linear regression does not capture the curvilinear relationship between variables and how to pick the optimal order to use in a model.

Interpret the R-square and the Mean Square Error: Interpret R-square (× 100) as the percentage of the variation in the response variable y that is explained by the variation in the explanatory variable(s) x. The Mean Squared Error tells you how close a regression line is to a set of points. It does this by squaring the distances from the actual points to the predicted points and averaging them.