9.2.1 Linear Regression - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

Introduction to Regression

This data set is related to CO2 emissions from different cars. It includes engine size, number of cylinders, fuel consumption, and CO2 emission for various automobile models.


The question is: given this data set, can we predict the CO2 emission of a car using other fields such as engine size or number of cylinders?

Let's assume we have some historical data from different cars, and assume that a car such as the one in row 9 has not been manufactured yet, but we're interested in estimating its approximate CO2 emission after production.

Is it possible? We can use regression methods to predict a continuous value, such as CO2 emission, using some other variables. Indeed, regression is the process of predicting a continuous value.

In regression, there are two types of variables:

  • dependent variable
    • can be seen as the state, target, or final goal we study and try to predict
    • is conventionally denoted by Y
  • independent variables
    • a.k.a. explanatory variables; can be seen as the causes of those states
    • are denoted by X

A regression model relates Y, the dependent variable, to a function of X, the independent variable(s). The key point in regression is that the dependent variable must be continuous; it cannot be discrete. The independent variable or variables, however, can be measured on either a categorical or continuous measurement scale.

What is a regression model?

What we want to do here is use the historical data of some cars, with one or more of their features, to build a model. We use regression to build such an estimation model; the model is then used to predict the expected CO2 emission for a new or unknown car.

Types of regression models

  • Simple regression: one independent variable is used to estimate the dependent variable. Whether the regression is linear or non-linear depends on the nature of the relationship between the independent and dependent variables.
    • Example: Predict co2 emission vs Engine Size of all cars
    • Simple Linear Regression
    • Simple Non-linear Regression
  • Multiple regression: more than one independent variable is used to estimate the dependent variable.
    • Example: Predict co2 emission vs Engine Size and Cylinders of all cars
    • Multiple Linear Regression
    • Multiple Non-linear Regression

Applications of regression

  • Sales forecasting
    • Can try to predict a salesperson's total yearly sales from independent variables such as age, education, and years of experience.
  • Satisfaction analysis
    • Can determine individual satisfaction based on demographic and psychological factors
  • Price estimation
    • Can use regression analysis to predict the price of a house in an area, based on its size, number of bedrooms, and so on
  • Employment income
    • Can predict employment income from independent variables such as hours of work, education, occupation, sex, age, years of experience, and so on

Question

Which one is a sample application of regression?

  • Predicting whether a patient has cancer or not.
  • Grouping of similar houses in an area.
  • Forecasting rainfall amount for next day.
  • Predicting if a team will win or not.

Correct answer: Forecasting rainfall amount for next day. Rainfall amount is a continuous value, so predicting it is a regression task; the other options are classification or clustering problems.

Regression algorithms

  • Ordinal regression
  • Poisson regression
  • Fast forest quantile regression
  • Linear, Polynomial, Lasso, Stepwise, Ridge regression
  • Bayesian linear regression
  • Neural network regression
  • Decision forest regression
  • Boosted decision tree regression
  • KNN (K-nearest neighbors)

Simple Linear Regression

Using linear regression to predict continuous values


Let's take a look at this data set. It's related to the CO2 emissions of different cars. It includes engine size, cylinders, fuel consumption, and CO2 emissions for various car models.

The question is, given this data set, can we predict the CO2 emission of a car using another field, such as engine size?

Quite simply, yes. We can use linear regression to predict a continuous value such as CO2 emission by using other variables. Linear regression is the approximation of a linear model used to describe the relationship between two or more variables.

Linear regression topology

  • Simple Linear Regression:
    • Predict co2 emission vs Engine Size of all cars
      • Independent variable (x): Engine Size
      • Dependent variable (y): co2 emission
  • Multiple Linear Regression:
    • Predict co2 emission vs Engine Size and Cylinders of all cars
      • Independent variable (x): Engine Size, Cylinders, etc
      • Dependent variable (y): co2 emission

Linear regression model representation


\hat{y} = \theta_0 + \theta_1X_1

\hat{y} : response variable

\theta_0 : the intercept / coefficient of the linear equation

\theta_1 : the slope or gradient of the fitting line / coefficient of the linear equation

X_1: a single predictor
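The model representation above can be sketched as a small function. The intercept and slope values used here are made up purely for illustration; they are not fitted from the data set.

```python
# Simple linear regression model: y_hat = theta_0 + theta_1 * x1.
# The coefficient values used below are hypothetical, for illustration only.
def predict(x1, theta0, theta1):
    """Return the predicted response for a single predictor value."""
    return theta0 + theta1 * x1

# With a hypothetical intercept of 125.0 and slope of 39.0,
# an engine size of 2.4 gives a prediction of about 218.6.
emission = predict(2.4, theta0=125.0, theta1=39.0)
print(emission)
```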

How to find the best fit?


X_1 = 5.4 : independent variable

y = 250 : actual CO2 emission for X_1

\hat{y} = \theta_0 + \theta_1X_1

\hat{y} = 340 : the predicted emission for X_1

Error = y - \hat{y} = 250 - 340 = -90

MSE = \frac 1 n \displaystyle\sum_{i=1}^n(y_i - \hat{y}_i)^2
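The error and MSE definitions above can be computed directly. The single data point used here is the example from the text (actual emission 250, predicted 340).

```python
# Residual error and Mean Squared Error, as defined above.
def mse(y_actual, y_pred):
    """Mean of the squared residuals (y_i - y_hat_i)^2."""
    n = len(y_actual)
    return sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_pred)) / n

error = 250 - 340        # y - y_hat for the single example: -90
print(error)             # -90
print(mse([250], [340])) # 8100.0
```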

Estimating the parameters


\hat{y} = \theta_0 + \theta_1X_1

\theta_1 = \dfrac{\textstyle\sum_{i=1}^s(x_i-\bar{x})(y_i-\bar{y})}{\textstyle\sum_{i=1}^s(x_i-\bar{x})^2}

\theta_0 = \bar{y} - \theta_1\bar{x}
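These closed-form least-squares estimates can be sketched in plain Python: \theta_1 comes from the covariance/variance ratio above, and \theta_0 = \bar{y} - \theta_1\bar{x}. The data points are made up so that the fit is exact.

```python
# Least-squares estimates for simple linear regression.
def fit_simple(xs, ys):
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # theta_1: sum of (x - x_bar)(y - y_bar) over sum of (x - x_bar)^2
    theta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
              / sum((x - x_bar) ** 2 for x in xs))
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# Made-up, perfectly linear data (y = 2x) recovers theta0 = 0, theta1 = 2.
theta0, theta1 = fit_simple([1, 2, 3, 4], [2, 4, 6, 8])
print(theta0, theta1)  # 0.0 2.0
```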

Predictions with linear regression


Pros of linear regression

  • Very fast
  • No parameter tuning
  • Easy to understand, and highly interpretable

Model Evaluation in Regression Models

The goal of regression is to build a model to accurately predict an unknown case.

To this end, we have to perform regression evaluation after building the model.

Model evaluation approaches

  • Train and Test on the Same Dataset
  • Train/Test Split

Best approach for most accurate results?

When considering evaluation models, we clearly want to choose the one that will give us the most accurate results.

So, the question is, how can we calculate the accuracy of our model? In other words, how much can we trust this model for prediction of an unknown sample using a given dataset and having built a model such as linear regression?

One of the solutions is to select a portion of our dataset for testing. For instance, assume that we have 10 records in our dataset.

We use the entire dataset for training, and we build a model using this training set.


Now, we select a small portion of the dataset, such as rows six to nine, as a test set. The test set does have labels, but the labels are used only as ground truth, not for prediction; they are called the actual values of the test set. We pass the feature set of the test portion to the built model and predict the target values. Finally, we compare the values predicted by the model with the actual values in the test set. This indicates how accurate the model actually is. There are different metrics for reporting the accuracy of a model, but most of them work generally based on the similarity of the predicted and actual values.
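The hold-out idea described here can be sketched with plain list slicing. The 10-row dataset below is a made-up stand-in for the real records.

```python
# Hold out rows 6-9 of a 10-record dataset as a test set; train on the rest.
# Each record is a made-up (feature, label) pair for illustration.
dataset = [(row, f"label_{row}") for row in range(10)]

train_set = dataset[:6]   # rows 0-5: used to build the model
test_set = dataset[6:]    # rows 6-9: labels kept only as ground truth

print(len(train_set), len(test_set))  # 6 4
```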

Calculating the accuracy of a model

We simply compare the actual values y with the predicted values, denoted \hat{y}, for the test set. The error of the model is calculated as the average difference between the predicted and actual values across all rows, and can be written as an equation.

Train and test on the same dataset


The first evaluation approach we just talked about is the simplest one, train and test on the same dataset. Essentially, the name of this approach says it all. You train the model on the entire dataset, then you test it using a portion of the same dataset.

In a general sense, when you test with a dataset in which you know the target value for each data point, you're able to obtain a percentage of accurate predictions for the model. This evaluation approach will most likely have a high training accuracy and a low out-of-sample accuracy, since the model has already seen all of the testing data points during training.

What is training & out-of-sample accuracy?

  • Training Accuracy
    • Training accuracy is the percentage of correct predictions the model makes when tested on the same dataset it was trained on.
    • High training accuracy isn't necessarily a good thing
    • Result of over-fitting
      • Over-fit: the model is overly trained to the dataset, which may capture noise and produce a non-generalized model
  • Out-of-sample Accuracy
    • Out-of-sample accuracy is the percentage of correct predictions that the model makes on data that the model has not been trained on.
    • It's important that our models have a high out-of-sample accuracy
    • How can we improve out-of-sample accuracy?

Train/Test split evaluation approach

  • Test on a portion of train set


    • Test-set is a portion of the train-set
    • High 'training accuracy'
    • Low 'out-of-sample accuracy'
  • Train/Test Split


    • Mutually exclusive
    • More accurate evaluation on out-of-sample accuracy
    • Highly dependent on which data ends up in the train and test sets

How to use K-fold cross-validation?


The entire dataset is represented by the points in the image at the top left. If we have K equals four folds, then we split up this dataset as shown here.

In the first fold for example, we use the first 25 percent of the dataset for testing and the rest for training. The model is built using the training set and is evaluated using the test set. Then, in the next round or in the second fold, the second 25 percent of the dataset is used for testing and the rest for training the model. Again, the accuracy of the model is calculated. We continue for all folds. Finally, the result of all four evaluations are averaged.

That is, the accuracy of each fold is then averaged, keeping in mind that each fold is distinct, where no training data in one fold is used in another.

K-fold cross-validation in its simplest form performs multiple train/test splits using the same dataset, where each split is different. The results are then averaged to produce a more consistent measure of out-of-sample accuracy.
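The fold bookkeeping described above can be sketched as follows; each fold's test rows are distinct, and in practice the per-fold accuracies would then be averaged.

```python
# K-fold index generation: for k = 4, each fold uses a distinct 25% of the
# rows as the test set and the remaining rows for training.
def k_fold_indices(n, k):
    """Yield (test_rows, train_rows) for each of the k folds."""
    fold_size = n // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(n) if j not in test]
        yield test, train

folds = list(k_fold_indices(8, 4))
print(folds[0])  # ([0, 1], [2, 3, 4, 5, 6, 7])
```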

Evaluation Metrics in Regression Models

What is an error of the model?


  • Error: a measure of how far the data is from the fitted regression line.
  • MAE (Mean Absolute Error): the mean of the absolute value of the errors
  • MSE (Mean Squared Error): the mean of the squared errors
  • RMSE (Root Mean Squared Error): the square root of the mean squared error
    • One of the most popular evaluation metrics, because RMSE is interpretable in the same units as the response vector (Y units), making it easy to relate its information.
  • RAE (Relative Absolute Error): takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor, where \bar{y} is the mean value of y.
  • RSE (Relative Squared Error): very similar to relative absolute error, but based on squared errors; widely adopted by the data science community, as it is used for calculating R-squared.
  • R-squared: not an error per se, but a popular metric for the accuracy of your model
    • Represents how close the data values are to the fitted regression line
    • The higher the R-squared, the better the model fits your data
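The metrics above can be sketched on a tiny made-up test set; the y values here are invented for illustration.

```python
# MAE, MSE, RMSE, and R-squared for a small test set.
def mae(y, yhat):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def mse(y, yhat):
    """Mean of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    """1 minus the relative squared error (residual SS over total SS)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y_actual = [3, 5, 7]
y_pred = [2, 5, 8]
print(mae(y_actual, y_pred))         # 0.666...
print(mse(y_actual, y_pred))         # 0.666...
print(mse(y_actual, y_pred) ** 0.5)  # RMSE, in the same units as y
print(r_squared(y_actual, y_pred))   # 0.75
```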

Multiple Linear Regression

Types of regression models

  • Simple Linear Regression
    • Predict Co2emission vs EngineSize of all cars
      • Independent variable (x): EngineSize
      • Dependent variable (y): Co2emission
  • Multiple Linear Regression
    • Predict Co2emission vs EngineSize and Cylinders of all cars
      • Independent variable (x): EngineSize, Cylinders, etc.
      • Dependent variable (y): Co2emission

Examples of multiple linear regression

  • Independent variables effectiveness on prediction
    • Do revision time, test anxiety, lecture attendance, and gender have any effect on the exam performance of students?
  • Predicting impacts of changes
    • How much does blood pressure go up (or down) for every unit increase (or decrease) in the BMI of a patient?

Predicting continuous values with multiple linear regression

Multiple linear regression is very useful because you can examine which variables are significant predictors of the outcome variable. Also, you can find out how each feature impacts the outcome variable.


Using MSE to expose the errors in the model

Mathematically, MSE can be shown by the same equation used for simple linear regression. While this is not the only way to expose the error of a multiple linear regression model, it is one of the most popular ways to do so.

MSE = \frac 1 n \displaystyle\sum_{i=1}^n(y_i - \hat{y}_i)^2

Estimating multiple linear regression parameters

  • How to estimate the parameters θ?
    • Ordinary Least Squares
      • Linear algebra operations
      • Takes a long time for large datasets (10K+ rows)
    • An optimization algorithm
      • Gradient Descent
      • A proper approach if you have a very large dataset
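The gradient-descent alternative can be sketched for the simple one-variable case; the learning rate and step count below are illustrative choices, not tuned values, and the data is made up so the true coefficients are known.

```python
# Gradient descent on the MSE of a one-variable linear model:
# iteratively update theta0 and theta1 instead of solving normal equations.
def gradient_descent(xs, ys, lr=0.01, steps=5000):
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = [theta0 + theta1 * x for x in xs]
        # Partial derivatives of MSE with respect to theta0 and theta1.
        grad0 = sum(p - y for p, y in zip(preds, ys)) * 2 / n
        grad1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) * 2 / n
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1

# Made-up linear data y = 2x: the estimates approach theta0 = 0, theta1 = 2.
t0, t1 = gradient_descent([1, 2, 3, 4], [2, 4, 6, 8])
print(round(t0, 3), round(t1, 3))
```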

Question

What is the best approach to find the parameters or coefficients for multiple linear regression when we have a very large dataset?

  • Using linear algebra operations
  • Using an optimization approach

Correct answer: Using an optimization approach.

Making predictions with multiple linear regression