Module 1_ICP 5: Regression Techniques - Simple and Multiple Regression

Regression techniques

Objectives:

The following topics are covered:

  1. Linear Regression
  2. Multiple Regression

Overview

a. Linear Regression algorithm

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression.
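In its simplest form, the model fits a straight line to the data:

y = β₀ + β₁x + ε

where β₀ is the intercept, β₁ is the slope, and ε is the error term.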

b. Multiple Regression algorithm

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable).
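With p explanatory variables x₁, ..., xₚ, the model generalizes to

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε

with the coefficients typically estimated by ordinary least squares.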

In Class Programming

1. Delete all the outlier data for the GarageArea field (using the same data set as in the House Prices use case).

Click here to get the source code

Here are the histograms of the feature and target values. They give us a sense of the distributions and of possible outliers.

  • For this task you need to plot the GarageArea field against SalePrice in a scatter plot, then check which values are anomalies; a minimal plotting sketch follows.
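As a sketch of that plot (the path to the House Prices training data, train.csv, is an assumption):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')  # House Prices training data; path is an assumption

plt.scatter(df['GarageArea'], df['SalePrice'])  # visualize the relationship and spot anomalies
plt.xlabel('GarageArea')
plt.ylabel('SalePrice')
plt.show()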

The points above 1200 and at 0 on the "GarageArea" axis appear to be outliers, so it is good to remove them with the following commands:

# Keep only rows where GarageArea lies in (0, 1200]; everything else is treated as an outlier
mask = (df['GarageArea'] <= 1200) & (df['GarageArea'] > 0)
df = df[mask]

Here are the scatter plots before and after the removal of the outliers.

2. Create a multiple regression model for the “wine quality” dataset. In this data set, “quality” is the target label.

Click here to get the source code

  • You need to delete the null values in the data set.

When we check the data, it turns out that there is no missing data: the per-column null counts are all zero.
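A minimal sketch of that check and the cleanup step (the file name is an assumption; note the UCI copy of this dataset is semicolon-separated and would need sep=';'):

import pandas as pd

df = pd.read_csv('winequality-red.csv')  # file name is an assumption

print(df.isnull().sum())  # per-column null counts; all zeros for this dataset
df = df.dropna()          # drop rows with nulls (a no-op here, but satisfies the requirement)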

  • You need to find the top 3 features most correlated with the target label (quality).
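One way to find them, sketched with pandas (df as loaded above; the exact ranking depends on the data):

corr = df.corr()['quality'].drop('quality')      # correlation of every feature with the target
top3 = corr.abs().sort_values(ascending=False).head(3)
print(top3)                                      # the three most correlated features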

Evaluate the model using RMSE and R2 score.
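A minimal sketch of fitting and scoring the model with scikit-learn (the file name, feature selection, split ratio, and random seed are all assumptions):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('winequality-red.csv').dropna()  # file name assumed, as above

X = df.drop('quality', axis=1)   # alternatively, keep only the top 3 correlated features
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print('RMSE:', np.sqrt(mean_squared_error(y_test, pred)))
print('R2:', r2_score(y_test, pred))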

The R-squared (R²) value is a measure of how close the data are to the fitted regression line. In general, a higher R-squared value means a better fit. In our case it is 0.2864, which is much closer to zero than to one, so we can conclude that the model does not fit the data well.
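For reference, R² compares the residual sum of squares to the total sum of squares:

R² = 1 − SS_res / SS_tot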

The RMSE measures the distance between our predicted values and the actual values. It is calculated from the residuals (observed − predicted) and gives an idea of how much error the system typically makes in its predictions, with a higher weight on large errors. The smaller it is, the better the model fits the data. In our case it is 0.1299, which is bigger than expected, so we can conclude that the model does not fit the data as well as we hoped.
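Concretely, for n predictions:

RMSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )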

References

https://towardsdatascience.com/linear-regression-with-example-8daf6205bd49

https://www.dataquest.io/blog/kaggle-getting-started/

https://en.wikipedia.org/wiki/Linear_regression

https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86