7.4.1.Model Development - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki
- A model can be thought of as a mathematical equation used to predict a value given one or more other values
- Relating one or more independent variables to dependent variables
Example:
independent variables or features ('highway-mpg': 55mpg)
→ MODEL
→ dependent variables ('predicted price': $5000)
- Usually the more relevant data you have the more accurate your model is
('highway-mpg', 'curb-weight', 'engine-size')
→ MODEL
→ ('price': $5400)
- In addition to getting more data you can try different types of models.
- In this course you will learn about:
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
Linear Regression and Multiple Linear Regression
- Simple linear regression uses one independent variable to make a prediction
- Multiple linear regression uses multiple independent variables to make a prediction
Simple Linear Regression (SLR)
- The predictor (independent) variable - x
- The target (dependent) variable - y
- $y = b_0 + b_1 x$
- $b_0$: the intercept
- $b_1$: the slope
Fitting a Simple Linear Model Estimator
- X: Predictor variable
- Y: Target variable
- Import LinearRegression from the linear_model module of scikit-learn
from sklearn.linear_model import LinearRegression
- Create a Linear Regression Object using the constructor:
lm = LinearRegression()
Fitting a Simple Linear Model
- We define the predictor variable and target variable
X = df[['highway-mpg']]
Y = df['price']
- Then use lm.fit(X, Y) to fit the model, i.e. find the parameters $b_0$ and $b_1$
lm.fit(X, Y)
- We can obtain a prediction
Yhat = lm.predict(X)
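The fitting steps above can be sketched end to end as follows. This is a minimal illustration, not the course's notebook: the small DataFrame is made-up stand-in data (the real `df` is the course's car dataset), and it mainly shows why the predictor is selected with double brackets so that it stays two-dimensional.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the car data: price falls linearly with highway-mpg.
df = pd.DataFrame({"highway-mpg": [20, 25, 30, 35, 40]})
df["price"] = 40000 - 800 * df["highway-mpg"]

X = df[["highway-mpg"]]   # double brackets keep X two-dimensional, shape (5, 1)
Y = df["price"]           # a one-dimensional Series is fine for the target

lm = LinearRegression()
lm.fit(X, Y)
Yhat = lm.predict(X)

print(X.shape, Y.shape, Yhat.shape)
```

Selecting with single brackets (`df['highway-mpg']`) would give a one-dimensional Series, which scikit-learn rejects as a feature matrix.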
SLR - Estimated Linear Model
- We can view the intercept ($b_0$):
lm.intercept_
38423.305858
- We can also view the slope ($b_1$):
lm.coef_
-821.73337832
- The Relationship between Price and Highway MPG is given by:
- Price = 38423.31 - 821.73 * highway-mpg
Multiple Linear Regression (MLR)
This method is used to explain the relationship between:
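- One continuous target (Y) variable
- Two or more predictor (X) variables
- $\hat{Y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$
- $b_0$: intercept (X=0)
- $b_1$: the coefficient or parameter of $x_1$
- $b_2$: the coefficient or parameter of $x_2$, and so on...
- With only two predictors, the variables $x_1$ and $x_2$ can be visualized on a 2D plane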
Fitting a Multiple Linear Model Estimator
- We can extract the 4 predictor variables and store them in the variable Z
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
- Then train the model as before:
lm.fit(Z, df['price'])
- We can also obtain a prediction
Yhat = lm.predict(Z)
MLR - Estimated Linear Model
- Find the intercept ($b_0$)
lm.intercept_
-15678.742628061467
- Find the coefficients ($b_1, b_2, b_3, b_4$)
lm.coef_
array([52.65851272, 4.69878948, 81.95906216, 33.58258185])
The Estimated Linear Model:
- Price = -15678.74 + (52.66) * horsepower + (4.70) * curb-weight + (81.96) * engine-size + (33.58) * highway-mpg
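To see how the intercept and coefficients combine into the estimated linear model, the sketch below fits a multiple linear regression on made-up data and rebuilds the predictions by hand. The data, the random seed, and `true_coef` are assumptions for illustration only; they are not the course's car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 4 predictors, known coefficients plus small noise.
Z = rng.normal(size=(50, 4))
true_coef = np.array([50.0, 5.0, 80.0, 30.0])
y = -15000.0 + Z @ true_coef + rng.normal(scale=0.1, size=50)

lm = LinearRegression()
lm.fit(Z, y)

# The estimated linear model written out by hand: b0 + b1*x1 + ... + b4*x4
manual = lm.intercept_ + Z @ lm.coef_
Yhat = lm.predict(Z)
matches = np.allclose(manual, Yhat)
print(matches)
```

`lm.coef_` holds one coefficient per column of `Z`, in column order, which is why the order of the predictor names matters when writing out the price equation.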
Practice Quiz: Linear Regression and Multiple Linear Regression
Question 1
Consider the following lines of code. Which variable contains the predicted values?
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
Yhat = lm.predict(X)
- Y
- X
- Yhat (correct)
Question 2
Consider the following equation:
$y = b_0 + b_1 x$
What is the parameter $b_0$?
- the predictor or independent variable
- the target or dependent variable
- the intercept (correct)
- the slope
Model Evaluation using Visualization
Regression Plot
Why use regression plot?
It gives us a good estimate of:
- The relationship between two variables
- The strength of the correlation
- The direction of the relationship (positive or negative)
Regression Plot shows us a combination of:
- The scatterplot: where each point represents a different y
- The fitted linear regression line ($\hat{y}$)
import matplotlib.pyplot as plt
import seaborn as sns
sns.regplot(x='highway-mpg', y='price', data=df)
plt.ylim(bottom=0)
Residual Plot
- Look at the spread of the residuals:
- Randomly spread out around the x-axis: a linear model is appropriate
- Not randomly spread out around the x-axis: a nonlinear model may be more appropriate
- Not randomly spread out around the x-axis, with variance that appears to change along the x-axis: a linear model is also not appropriate
import seaborn as sns
sns.residplot(x=df['highway-mpg'], y=df['price'])
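A residual is the actual value minus the predicted value, and a residual plot shows exactly these numbers. The sketch below computes them directly with NumPy on made-up curved data (an assumption for illustration) to show what a "not randomly spread out" pattern looks like numerically.

```python
import numpy as np

# Hypothetical curved data: y depends on x squared.
x = np.linspace(-3, 3, 61)
y = x ** 2

# Fit a straight line and compute residuals = actual - predicted.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# For a least-squares line with an intercept the residuals average to zero,
# so look at their pattern instead: here they are positive at both ends and
# negative in the middle, the curved shape a residual plot would reveal.
mean_residual = residuals.mean()
ends_positive = residuals[0] > 0 and residuals[-1] > 0
middle_negative = residuals[len(x) // 2] < 0
print(ends_positive, middle_negative)
```

A systematic pattern like this suggests a nonlinear model, for example a polynomial fit, may describe the data better than a straight line.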
Distribution Plots
Compare the distribution plots:
- The fitted values that result from the model
- The actual values
MLR - Distribution Plots
import seaborn as sns
ax1 = sns.distplot(df['price'], hist=False, color='r', label='Actual Value')
sns.distplot(Yhat, hist=False, color='b', label='Fitted Values', ax=ax1)
Polynomial Regression and Pipelines
Polynomial Regression
- A special case of the general linear regression model
- Useful for describing curvilinear relationships
Curvilinear relationship:
By squaring or setting higher-order terms of the predictor variables
- Quadratic - 2nd order
- Cubic - 3rd order
- Higher order
Example:
- Calculate Polynomial of 3rd order
import numpy as np
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
- Print out the model
print(p)
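As a quick check of what `np.polyfit` and `np.poly1d` return, the sketch below fits a 3rd order polynomial to points generated from a known cubic. The data is made up for illustration; with real, noisy data the recovered coefficients would only be approximate.

```python
import numpy as np

# Hypothetical data generated from a known cubic: y = 2x^3 - x^2 + 3x + 5
x = np.linspace(-2, 2, 20)
y = 2 * x**3 - x**2 + 3 * x + 5

f = np.polyfit(x, y, 3)   # coefficients, highest power first
p = np.poly1d(f)          # callable polynomial object

print(np.round(f, 6))
print(p(1.0))
```

The coefficient array is ordered from the highest power down to the constant term, and `p` can be called like a function to evaluate the fitted curve at new points.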
Polynomial Regression with More than One Dimension
- There is also multi-dimensional polynomial linear regression
- The "preprocessing" library in scikit-learn
from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2, include_bias=False)
x_polly = pr.fit_transform(x[['horsepower', 'curb-weight']])
pr = PolynomialFeatures(degree=2)
pr = PolynomialFeatures(degree=2, include_bias=False)
pr.fit_transform([[1, 2]])
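To make the transform concrete, the sketch below applies `PolynomialFeatures` to a single sample with two features, with and without the bias column. The variable names `no_bias` and `with_bias` are chosen here for illustration.

```python
from sklearn.preprocessing import PolynomialFeatures

sample = [[1, 2]]  # one sample with features x1 = 1, x2 = 2

# Without the bias column the outputs are: x1, x2, x1^2, x1*x2, x2^2
pr = PolynomialFeatures(degree=2, include_bias=False)
no_bias = pr.fit_transform(sample)      # values 1, 2, 1, 2, 4

# With the default include_bias=True a leading column of ones is added.
pr_bias = PolynomialFeatures(degree=2)
with_bias = pr_bias.fit_transform(sample)  # values 1, 1, 2, 1, 2, 4

print(no_bias.shape, with_bias.shape)
```

Two input features become five polynomial features at degree 2 (six with the bias column), which is why the feature count grows quickly as the degree increases.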
Pre-processing
- For example, we can normalize each feature simultaneously
from sklearn.preprocessing import StandardScaler
SCALE = StandardScaler()
SCALE.fit(x_data[['horsepower', 'highway-mpg']])
x_scale = SCALE.transform(x_data[['horsepower', 'highway-mpg']])
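A small sketch of what the scaler does, using made-up numbers in place of the course's `x_data`. `StandardScaler` standardizes each column to zero mean and unit variance, which is the sense of "normalize" used here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two features on very different scales.
x_data = np.array([[100.0, 20.0],
                   [150.0, 30.0],
                   [200.0, 40.0]])

SCALE = StandardScaler()
SCALE.fit(x_data)                  # learns each column's mean and standard deviation
x_scale = SCALE.transform(x_data)  # (value - mean) / std, column by column

col_means = x_scale.mean(axis=0)
col_stds = x_scale.std(axis=0)
print(col_means, col_stds)
```

After the transform every column has mean 0 and standard deviation 1, so no single feature dominates simply because of its units.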
Pipelines
- There are many steps to getting a prediction
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(degree=2)),
('mode', LinearRegression())]
- Pipeline constructor
pipe = Pipeline(Input)
- We can train the pipeline object
pipe.fit(df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y)
yhat = pipe.predict(df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
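The pipeline steps above can be tried on a small self-contained example. The data below is made up (two features and a target with a quadratic term), and the step names are arbitrary labels; this is a sketch under those assumptions rather than the course's exact notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)

# Hypothetical data: two features and a target with a quadratic term.
X = rng.uniform(0, 10, size=(40, 2))
y = 3.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2

steps = [("scale", StandardScaler()),
         ("polynomial", PolynomialFeatures(degree=2)),
         ("model", LinearRegression())]
pipe = Pipeline(steps)

pipe.fit(X, y)          # scales, expands to polynomial features, then fits
yhat = pipe.predict(X)  # applies the same transforms before predicting

max_error = np.max(np.abs(yhat - y))
print(max_error < 1e-6)
```

Because the target here is an exact degree-2 polynomial of the inputs, the pipeline reproduces it almost perfectly; the point is that one `fit` and one `predict` call run every step in order.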
Measures for In-Sample Evaluation
- A way to numerically determine how well the model fits the dataset.
- Two important measures to determine the fit of a model:
- Mean Squared Error (MSE)
- R-squared (R^2)
Mean Squared Error (MSE)
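- The MSE is the average of the squared differences between the actual values $y_i$ and the predicted values $\hat{y}_i$: $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- In Python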
from sklearn.metrics import mean_squared_error
mean_squared_error(df['price'], Y_predict_simple_fit)
3163502.944639888
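The MSE can also be computed by hand, which makes the order of operations explicit: square each error, then average. The four values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Errors are 1, 0, -1, -2; their squares are 1, 0, 1, 4; the average is 6 / 4.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual)  # 1.5
```

Because errors are squared, the MSE is expressed in squared units of the target, which is why the price example above yields a value in the millions.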
R-squared / R^2
- The Coefficient of Determination or R squared (R^2)
- Is a measure to determine how close the data is to the fitted regression line.
- R^2: the percentage of variation of the target variable (Y) that is explained by the linear model.
- Think of it as comparing a regression model to a simple model, i.e. the mean of the data points
Coefficient of Determination (R^2)
- In this example the average of the data points is 6
- The blue line represents the regression line
- The blue squares represent the MSE of the regression line
- The red line represents the average value of the data points
- The red squares represent the MSE of the red line
- We see the area of the blue squares is much smaller than the area of the red squares
- In this case the ratio of the areas of MSE is close to zero
- $R^2 = 1 - \frac{\text{MSE of the regression line}}{\text{MSE of the average of the data}}$
- We get a value near one, which means the line is a good fit for the data.
- An example of a line that does not fit the data well:
- The ratio of the areas is close to one. In this case the R^2 is near zero. This line performs about the same as just using the average of the data points; therefore, this line did not perform well.
- Generally the values of the R^2 are between 0 and 1.
- We can calculate the R^2 as follows
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
lm.score(X, Y)
0.496591188
- From the value that we get from this example, we can say that approximately 49.659% of the variation of price is explained by this simple linear model.
- Your R^2 value is usually between 0 and 1. A negative R^2 means the model fits worse than simply predicting the mean, which can happen on data the model was not trained on, for example due to overfitting.
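The comparison between the regression line and the mean can be written out directly. The sketch below uses made-up noisy linear data and checks that one minus the ratio of squared errors matches what `lm.score` reports; the names `ss_res` and `ss_tot` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Hypothetical noisy linear data.
X = rng.uniform(0, 10, size=(100, 1))
y = 4.0 + 1.5 * X[:, 0] + rng.normal(scale=2.0, size=100)

lm = LinearRegression().fit(X, y)
yhat = lm.predict(X)

ss_res = np.sum((y - yhat) ** 2)      # squared error of the regression line
ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of the mean (baseline)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, lm.score(X, y)))
```

Dividing both sums by the number of samples turns them into the two MSE values, so this is the same ratio of areas described for the blue and red squares.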
Prediction and Decision Making
Decision Making: Determining a Good Model Fit
To determine final best fit, we look at a combination of:
- Do the predicted values make sense
- Visualization
- Numerical measures for evaluation
- Comparing Models
Do the predicted values make sense
- First we train the model
lm.fit(df[['highway-mpg']], df['price'])
- Let's predict the price of a car with 30 highway-mpg
lm.predict(np.array(30.0).reshape(-1, 1))
- Result: $13771.30
lm.coef_
-821.73337832
- Price = 38423.31 - 821.73 * highway-mpg
- Use the numpy function arange to generate a sequence from 1 to 100
import numpy as np
new_input = np.arange(1, 101, 1).reshape(-1, 1)
- Can predict new values
yhat = lm.predict(new_input)
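One way to check whether the predicted values make sense is to evaluate the fitted line over the whole input range and look for impossible results. The sketch below applies the intercept and slope reported in these notes directly with NumPy instead of calling a fitted `lm`, so it runs without the dataset.

```python
import numpy as np

# Intercept and slope as reported earlier in these notes.
b0 = 38423.305858
b1 = -821.73337832

new_input = np.arange(1, 101, 1)   # highway-mpg values 1..100
yhat = b0 + b1 * new_input         # the same formula lm.predict applies

# A negative price makes no sense, so find where the line crosses zero.
first_negative_mpg = int(new_input[yhat < 0][0])
print(first_negative_mpg)  # 47
```

Negative prices for high highway-mpg values do not necessarily mean the model is wrong for typical cars; they may simply show that the linear assumption breaks down outside the range of the training data.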
Visualization
- Simply visualizing your data with a regression
Residual Plot
Visualization - Multiple linear regression
Numerical measures for Evaluation
The figure shows an example of a mean square error of 3495.
This example has a mean square error of 3652.
This one has a mean square error of 12870.
- As the square error increases the targets get further from the predicted points.
R-squared is another popular method to evaluate your model. It tells you how well your line fits the data. R-squared values range from zero to one. R-squared tells us what percent of the variability in the dependent variable is accounted for by the regression on the independent variable. An R-squared of 1 means that all movements of the dependent variable are completely explained by movements in the independent variables.
In this plot we see the target points in red and the predicted line in blue. With an R-squared of 0.9986, the model appears to be a good fit: more than 99 percent of the variability of the predicted variable is explained by the independent variables.
This model has an R-squared of 0.9226. There is still a strong linear relationship, and the model is still a good fit.
With an R-squared of 0.806, we can visually see that the values are scattered around the line. They are still close to the line, and we can say that about 80 percent of the variability of the predicted variable is explained by the independent variables.
An R-squared of 0.61 means that approximately 61 percent of the observed variation can be explained by the independent variables.
- An acceptable value for R-squared depends on what field you are studying and what your use case is. Falk and Miller (1992) suggest that an acceptable R-squared value should be at least 0.1.
Comparing MLR and SLR
Does a lower Mean Square Error imply better fit?
- Not necessarily
- On the training data, the Mean Square Error for a Multiple Linear Regression Model will be no larger than the Mean Square Error for a Simple Linear Regression Model built on one of the same predictors, since the errors on that data cannot increase when more variables are included in the model.
- Polynomial regression will likewise have a Mean Square Error on the training data no larger than that of the Linear Regression
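The reason a lower MSE does not by itself imply a better model can be shown with a deliberately useless extra predictor. In the sketch below (made-up data; the second column is pure noise by construction) the larger model still reports an in-sample MSE that is no higher.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)

# Hypothetical data: the target depends only on the first column;
# the second column is pure noise.
X = rng.normal(size=(60, 2))
y = 2.0 * X[:, 0] + rng.normal(size=60)

slr = LinearRegression().fit(X[:, [0]], y)  # one predictor
mlr = LinearRegression().fit(X, y)          # same predictor plus the noise column

mse_slr = mean_squared_error(y, slr.predict(X[:, [0]]))
mse_mlr = mean_squared_error(y, mlr.predict(X))

# On the training data the larger model can never do worse,
# even though the extra column carries no real information.
print(mse_mlr <= mse_slr)
```

This is why the in-sample MSE should be read alongside the other checks listed above, such as visualization and whether the predictions make sense.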
Lesson Summary
In this lesson, you have learned how to:
Define the explanatory variable and the response variable: Define the response variable (y) as the focus of the experiment and the explanatory variable (x) as a variable used to explain the change of the response variable. Understand the difference between Simple Linear Regression, which concerns the study of only one explanatory variable, and Multiple Linear Regression, which concerns the study of two or more explanatory variables.
Evaluate the model using Visualization: By visually representing the errors of a variable using scatterplots and interpreting the results of the model.
Identify alternative regression approaches: Use a Polynomial Regression when the Linear regression does not capture the curvilinear relationship between variables, and learn how to pick the optimal order to use in a model.
Interpret the R-square and the Mean Square Error: Interpret R-square (x 100) as the percentage of the variation in the response variable y that is explained by the variation in explanatory variable(s) x. The Mean Squared Error tells you how close a regression line is to a set of points. It does this by squaring the distances from the actual points to the predicted points and taking their average.