Logistic Regression Theory

What is Logistic Regression?

Logistic Regression is a Machine Learning classification algorithm used to predict the probability of a categorical dependent variable. The dependent variable is binary, containing data coded as 1 (success/yes) or 0 (failure/no).

For example, Logistic Regression Models can be used for:

  1. Spam filtering (to predict whether an email is spam)
  2. Online transactions (to determine whether a transaction is fraudulent)
  3. Tumour malignancy (to predict whether a tumour is malignant)

An example of a discrete-valued graph (tumour malignancy) is given below:

*[Figure: tumour size vs. malignancy, with labels taking only the values 0 (benign) and 1 (malignant)]*
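To make this concrete, here is a minimal sketch of fitting such a model with scikit-learn. The tumour sizes and labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumour size in cm, label 1 = malignant, 0 = benign
sizes = np.array([[1.0], [1.5], [2.0], [3.0], [3.5], [4.0]])
malignant = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(sizes, malignant)

print(model.predict_proba([[2.8]]))  # [P(benign), P(malignant)] for a 2.8 cm tumour
print(model.predict([[2.8]]))        # hard 0/1 label using the default 0.5 threshold
```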

Note that classification problems need not be binary (yes or no). A problem can also be multi-class, meaning there are more than two possible categories. For example, a cancer might be classified into several types, such as Type 2 or Type 3.

The Linear Regression Approach: A Naive Approach

Now, suppose we fit a linear hypothesis to the graph given above. The best-fit line would look somewhat like this (assuming a threshold of 0.5):

*[Figure: a linear fit to the tumour data, thresholded at 0.5]*

The model essentially divides the graph into two halves: values to the left of the green point can be considered the negative class, and values to the right the positive class.

*[Figure: the 0.5 threshold splitting the axis into negative and positive regions]*

An issue arises, however, if the data contains an outlier. This distorts the linear model's predictions:

*[Figure: the linear fit distorted by a single outlier]*

The green dotted line (the decision boundary) divides the malignant tumours from the benign ones, but the boundary should instead lie at the yellow line, which clearly separates the positive and negative examples. A single outlier is enough to disturb the whole linear regression fit, and that is where logistic regression comes into the picture.
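Here is a small NumPy sketch of the problem, with toy data assumed to mirror the figures above: a least-squares line is fitted to the 0/1 labels, and adding one extreme positive example visibly shifts the point where the line crosses the 0.5 threshold:

```python
import numpy as np

def boundary(x, y, threshold=0.5):
    # Fit y = w*x + b by least squares and return where the line crosses the threshold
    w, b = np.polyfit(x, y, deg=1)
    return (threshold - b) / w

x = np.array([1.0, 1.5, 2.0, 3.0, 3.5, 4.0])  # tumour sizes (toy data)
y = np.array([0, 0, 0, 1, 1, 1])              # 1 = malignant

print(boundary(x, y))                                 # ~2.5, neatly between the classes
print(boundary(np.append(x, 12.0), np.append(y, 1)))  # ~3.0, dragged right by one outlier
```

The added point at size 12 agrees with the existing labels, yet the shifted boundary now sits just past the malignant example at size 3.0, which ends up misclassified.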

The Logistic Regression Model

Here our goal is to find a function that bounds the curve's output, so that an outlier point still contributes to the gradient but cannot drag the fit away from the previously well-classified points. Such functions are called activation functions.

Activation Functions

Let's take the example of the sigmoid activation function. Consider t to be a linear function of x in a univariate regression model:

$$t = \theta_0 + \theta_1 x$$

So, the logistic equation becomes:

$$h_\theta(x) = \sigma(t) = \frac{1}{1 + e^{-t}} = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}$$
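A minimal NumPy sketch of the sigmoid and the resulting hypothesis (the parameter names here are illustrative):

```python
import numpy as np

def sigmoid(t):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def hypothesis(x, theta0, theta1):
    # Predicted probability that y = 1, with t = theta0 + theta1 * x
    return sigmoid(theta0 + theta1 * x)

print(sigmoid(0.0))    # 0.5 -- the natural decision threshold
print(sigmoid(10.0))   # ~1.0 -- saturates, so extreme inputs change little
print(sigmoid(-10.0))  # ~0.0
```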

Now, when we encounter an outlier, the model takes care of it: the sigmoid saturates near 0 and 1, so a point far from the boundary has little influence on the fit.

*[Figure: the sigmoid fit, unaffected by the outlier]*

Decision Boundary and Hypothesis Prediction

In logistic regression we deal with probabilities: the hypothesis function outputs the probability that an example belongs to the positive class. To turn that probability into a class prediction, we must choose a threshold value that separates the classes in the dataset (such as yes and no).

Predicting against this threshold yields a curve that forms a boundary between the classes, termed the Decision Boundary.

Note: The decision boundary is a property of the hypothesis function and its parameters, not of the training set.
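As a sketch, thresholding the hypothesis looks like this (the theta values are illustrative, not fitted):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict(x, theta, threshold=0.5):
    # theta[0] is the intercept; the remaining entries multiply the features of x
    prob = sigmoid(theta[0] + np.dot(theta[1:], x))
    return int(prob >= threshold)

theta = np.array([-7.0, 2.5])            # illustrative parameters, not learned
print(predict(np.array([2.0]), theta))   # small tumour -> 0 (benign)
print(predict(np.array([4.0]), theta))   # large tumour -> 1 (malignant)
```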

Logistic regression with only linear (first-order) features has a linear decision boundary

*[Figure: a straight-line decision boundary separating the two classes]*

$$h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2), \qquad \text{boundary: } \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$$

Logistic regression with higher-order (polynomial) features can have a non-linear decision boundary

*[Figure: a circular (non-linear) decision boundary]*

$$h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2), \qquad \text{boundary: } \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 = 0$$
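A sketch of this idea using scikit-learn's PolynomialFeatures, as one possible way to add the squared terms (the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)  # class 1 inside the unit circle

# Degree-2 mapping gives features (x1, x2, x1^2, x1*x2, x2^2),
# so a linear boundary in feature space becomes a circle in input space
poly = PolynomialFeatures(degree=2, include_bias=False)
model = LogisticRegression().fit(poly.fit_transform(X), y)

print(model.score(poly.transform(X), y))  # near 1.0 on this separable toy data
```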

External Material

  1. https://www.youtube.com/watch?v=-la3q9d7AKQ
  2. https://www.youtube.com/watch?v=t1IT5hZfS48
  3. https://www.youtube.com/watch?v=F_VG4LNjZZw
  4. https://www.youtube.com/watch?v=HIQlmHxI6-0