# Logistic Regression Theory
Logistic Regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. The dependent variable is a binary variable that contains data coded as 1 (success/yes) or 0 (failure/no).
For example, Logistic Regression models can be used for:
- Spam filtering (to predict whether an email is spam)
- Online transactions (to determine whether a transaction is fraudulent)
- Tumour malignancy (to predict whether a tumour is malignant; see the sketch after this list)
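As a quick illustration in code, here is a minimal sketch using scikit-learn's `LogisticRegression` on a hypothetical tumour-size dataset (the numbers and names are illustrative, not taken from this wiki):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumour size in cm, label 1 = malignant, 0 = benign.
sizes = np.array([[1.2], [2.1], [2.8], [3.9], [5.6], [6.4], [7.7], [8.5]])
malignant = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(sizes, malignant)

# predict_proba returns [P(benign), P(malignant)] for each input row.
print(model.predict_proba([[4.5]]))  # class probabilities for a 4.5 cm tumour
print(model.predict([[4.5]]))        # hard 0/1 prediction at a 0.5 threshold
```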
An example of a discrete-value graph (tumour malignancy) is given below:
Note that classification problems need not be binary (yes or no). A problem can also be multi-class, meaning there are more than two possible classes; for example, a tumour could be classified into types of cancer such as Type 2 or Type 3.
Now, suppose we fit a linear hypothesis to the graph given above; the best-fit line would look somewhat like this (assuming a threshold of 0.5):
The model essentially divides the graph into two halves: values to the left of the green point can be considered the negative class, and values to the right the positive class.
An issue arises, however, if the data contains an outlier, which throws off the linear model's predictions.
The green dotted line (the decision boundary) divides malignant tumours from benign tumours, but the boundary should have been at the yellow line, which clearly separates the positive and negative examples. A single outlier is enough to disturb the whole linear regression fit, and that is where logistic regression comes into the picture.
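We can see this numerically with a small sketch (hypothetical one-dimensional data; the numbers are illustrative):

```python
import numpy as np

# Hypothetical tumour sizes with labels: 1 = malignant, 0 = benign.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def linear_boundary(x, y, threshold=0.5):
    """Fit y = m*x + c by least squares; return the x where the line crosses the threshold."""
    m, c = np.polyfit(x, y, deg=1)
    return (threshold - c) / m

print(linear_boundary(x, y))  # 5.0 -- neatly between the two classes

# Add a single extreme malignant outlier far to the right.
x_out = np.append(x, 40.0)
y_out = np.append(y, 1)
print(linear_boundary(x_out, y_out))  # ~6.4 -- the malignant tumour at x = 6 is now misclassified
```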
Our goal here is to find a function that reshapes the curve so that an outlier changes the gradient only slightly, while the fit still satisfies the previous points. Such functions are called activation functions.
Let's take the example of the sigmoid activation function. Consider $t$ to be a linear function of $x$ in a univariate regression model.
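The sigmoid squashes any real-valued input into the range $(0, 1)$:

$$\sigma(t) = \frac{1}{1 + e^{-t}}, \qquad t = \theta_0 + \theta_1 x$$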
So, the logistic equation becomes:
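$$h_\theta(x) = \sigma(\theta_0 + \theta_1 x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}$$

Because the output is squashed into $(0, 1)$, an extreme input can only push the prediction towards 0 or 1, never beyond it.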
Now, when we encounter an outlier, the model takes care of it:
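Continuing the sketch above (same hypothetical data), the logistic fit's decision boundary barely moves when the outlier is added:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def logistic_boundary(x, y):
    """Fit logistic regression; return the x where p = 0.5, i.e. where theta0 + theta1*x = 0."""
    model = LogisticRegression().fit(x, y)
    return -model.intercept_[0] / model.coef_[0, 0]

print(logistic_boundary(x, y))  # close to 5, between the two classes

# The same extreme outlier as before barely shifts the sigmoid's boundary.
x_out = np.vstack([x, [[40.0]]])
y_out = np.append(y, 1)
print(logistic_boundary(x_out, y_out))  # still close to 5
```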
In logistic regression we deal with probabilities: the hypothesis function gives a probabilistic answer, so the model's prediction depends on that probability. We therefore have to choose a threshold value that segregates the classes in the dataset (such as yes and no).
A prediction based on this threshold value results in a curve that provides a boundary between the classes, termed the decision boundary.
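With the usual threshold of 0.5, the predicted class follows directly from the sign of the linear term, since $\sigma(t) \ge 0.5$ exactly when $t \ge 0$:

$$h_\theta(x) \ge 0.5 \iff \theta_0 + \theta_1 x \ge 0$$

so the decision boundary is the set of points where $\theta_0 + \theta_1 x = 0$.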
Note: the decision boundary is a property of the hypothesis function and its parameters, not of the training set.
Logistic regression with linear feature terms has a linear decision boundary.
Logistic regression with higher-order (e.g. polynomial) feature terms can have a non-linear decision boundary.
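As a concrete sketch with illustrative parameter values (not taken from this wiki): using the features $x_1, x_2, x_1^2, x_2^2$ with parameters $\theta = (-1,\, 0,\, 0,\, 1,\, 1)$, the hypothesis is

$$h_\theta(x) = \sigma(-1 + x_1^2 + x_2^2)$$

and the decision boundary $x_1^2 + x_2^2 = 1$ is a circle of radius 1: points outside the circle are predicted positive, points inside negative.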