Performance Metrics for Classification Problems
Confusion Matrix
- The confusion matrix, as the name suggests, gives us a matrix as output and describes the complete performance of the model.
- A confusion matrix is an N X N matrix, where N is the number of classes being predicted.
- The confusion matrix is one of the most intuitive and easiest metrics (unless, of course, you are confused) for assessing the correctness and accuracy of a model.
- It is used for classification problems where the output can belong to two or more classes.
- The confusion matrix in itself is not a performance measure as such, but almost all of the performance metrics are based on the confusion matrix and the numbers inside it.
There are 4 important terms:
- True Positives (TP): The cases in which we predicted YES and the actual output was also YES.
- True Negatives (TN): The cases in which we predicted NO and the actual output was also NO.
- False Positives (FP): The cases in which we predicted YES and the actual output was NO. (Also known as a "Type I error".)
- False Negatives (FN): The cases in which we predicted NO and the actual output was YES. (Also known as a "Type II error".)
When to minimise what?
- There’s no hard rule that says what should be minimised in all situations. It purely depends on the business needs and the context of the problem you are trying to solve. Based on that, we might want to minimise either False Positives or False Negatives.
- Minimising False Negatives:
- Let’s say in the cancer detection example, out of 100 people, only 5 people have cancer. In this case, we want to correctly classify all the cancerous patients, because even a very BAD model (predicting everyone as non-cancerous) will give us 95% accuracy. But in order to capture all cancer cases, we might end up classifying some people who do NOT actually have cancer as cancerous.
- This might be okay, as it is less dangerous than NOT identifying a cancerous patient: we will send the predicted cancer cases for further examination and reports anyway. But missing a cancer patient would be a huge mistake, as no further examination would be done on them.
- Minimising False Positives:
- For a better understanding of False Positives, let’s use a different example, where the model classifies whether an email is spam or not. Say you are expecting an important email, like hearing back from a recruiter or awaiting an admission letter from a university.
- Let’s assign labels to the target variable: 1: “Email is spam” and 0: “Email is not spam”.
- Suppose the model classifies that important email you are desperately waiting for as spam (a case of a False Positive). This is worse than classifying a spam email as not spam, since in that case we can still go ahead and delete it manually, and it’s not much of a pain if it happens once in a while. So in the case of spam email classification, minimising False Positives is more important than minimising False Negatives.
Accuracy:
- Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.
- Overall, how often is the classifier correct?
Accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)
When to use Accuracy:
- Accuracy is a good measure when the target variable classes in the data are nearly balanced.
Ex: 60% of the images in our fruit dataset are apples and 40% are oranges. A model that correctly predicts whether a new image is an apple or an orange 97% of the time is well measured by accuracy in this example.
When NOT to use Accuracy:
- Accuracy should NEVER be used as a measure when the target variable classes in the data are heavily imbalanced, i.e. dominated by one class.
Ex: In our cancer detection example with 100 people, only 5 people have cancer. Let’s say our model is very bad and predicts every case as No Cancer. In doing so, it classifies the 95 non-cancer patients correctly and the 5 cancerous patients as non-cancerous. Even though the model is terrible at predicting cancer, its accuracy is still 95%.
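The cancer example above can be reproduced in a few lines; this is a sketch assuming scikit-learn, with the 95/5 split taken as toy data from the example:

```python
from sklearn.metrics import accuracy_score

# Toy data from the example: 95 healthy (0) and 5 cancerous (1) patients.
y_true = [0] * 95 + [1] * 5
# A "very bad" model that predicts No Cancer for everyone.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- 95% accuracy despite missing every cancer case
```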
Precision :
- It is the number of correct positive results divided by the number of positive results predicted by the classifier, i.e. when it predicts yes, how often is it correct?
precision = true positives / (true positives + false positives)
Ex: In our cancer example with 100 people, only 5 people have cancer. Let’s say our model is very bad and predicts every case as cancer. Since we are predicting everyone as having cancer, our denominator (true positives + false positives) is 100, and the numerator, the people who have cancer and whose cases the model predicted as cancer, is 5. So in this example, the precision of such a model is 5%.
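As a quick check of the 5% figure, here is a minimal pure-Python sketch; the label lists are made-up toy data matching the example above:

```python
# Toy cancer example: 95 healthy (0), 5 cancerous (1); the model predicts cancer (1) for everyone.
y_true = [0] * 95 + [1] * 5
y_pred = [1] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 5 correct positive predictions
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 95 wrong positive predictions
print(tp / (tp + fp))  # 0.05 -> 5% precision
```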
Recall or Sensitivity (True Positive Rate):
- It is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
recall = true positives / (true positives + false negatives)
Ex: In our cancer example with 100 people, 5 people actually have cancer. Let’s say the model predicts every case as cancer. So our denominator (true positives + false negatives) is 5, and the numerator, the people who have cancer and whose cases the model predicted as cancer, is also 5 (since we predicted all 5 cancer cases correctly). So in this example, the recall of such a model is 100%, while its precision (as we saw above) is only 5%.
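Both numbers quoted in the example can be verified with a short sketch (again assuming scikit-learn and the same toy 95/5 split):

```python
from sklearn.metrics import precision_score, recall_score

# Same toy data: 95 healthy (0), 5 cancerous (1); the model predicts cancer for everyone.
y_true = [0] * 95 + [1] * 5
y_pred = [1] * 100

print(recall_score(y_true, y_pred))     # 1.0  -- all 5 actual cancer cases are captured
print(precision_score(y_true, y_pred))  # 0.05 -- but only 5% of the positive predictions are correct
```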
When to use Precision and when to use Recall?
- Recall gives us information about a classifier’s performance with respect to false negatives (how many actual positives we missed), while precision gives us information about its performance with respect to false positives (how many of the cases we flagged were wrong).
- Precision is about being precise. So even if we capture only one cancer case, but capture it correctly, we are 100% precise.
- Recall is not so much about labelling cases correctly as about capturing all the cases that actually have cancer. So if we simply label every case as “cancer”, we have 100% recall.
- So basically, if we want to focus on minimising False Negatives, we want Recall to be as close to 100% as possible without Precision becoming too bad; and if we want to focus on minimising False Positives, our focus should be on making Precision as close to 100% as possible.
Specificity (True Negative Rate):
- Specificity is defined as the number of correct negative results divided by the number of all actual negatives in the set.
- When it's actually no, how often does it predict no?
Specificity = true negatives / (true negatives + false positives)
- It is equivalent to 1 minus the False Positive Rate.
- Specificity is the mirror image of Recall: it measures for the negative class what Recall measures for the positive class.
Ex: In our cancer example with 100 people, 5 people actually have cancer. Let’s say that the model predicts every case as cancer. So our denominator (true negatives + false positives) is 95, and the numerator, the people without cancer whose cases the model predicted as no cancer, is 0 (since we predicted every case as cancer). So in this example, the specificity of such a model is 0%.
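A quick sketch of the 0% figure, assuming scikit-learn and the same toy data; specificity is computed here from the confusion matrix counts:

```python
from sklearn.metrics import confusion_matrix

# Toy data: 95 healthy (0), 5 cancerous (1); the model predicts cancer for everyone.
y_true = [0] * 95 + [1] * 5
y_pred = [1] * 100

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print(specificity)  # 0.0 -- no actual negative is ever predicted as negative
```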
F1 Score:
- F1 Score is used to measure a test’s accuracy
- F1 Score is the Harmonic Mean between precision and recall.
- The range for F1 Score is [0, 1]. It tells you how precise your classifier is (how many of its positive predictions are correct), as well as how robust it is (it does not miss a significant number of positive instances).
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
- If either precision or recall is really small, the F1 Score raises a flag by being pulled closer to the smaller of the two numbers than to the bigger one.
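For the all-positive cancer model from the earlier examples (precision 5%, recall 100%), the harmonic mean shows how the low precision drags the F1 Score down. A sketch assuming scikit-learn:

```python
from sklearn.metrics import f1_score

# Same toy data: 95 healthy (0), 5 cancerous (1); the model predicts cancer for everyone.
y_true = [0] * 95 + [1] * 5
y_pred = [1] * 100

# Harmonic mean of precision (0.05) and recall (1.0): 2 * 0.05 * 1.0 / 1.05 ≈ 0.095
print(f1_score(y_true, y_pred))
```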
Area Under Curve
- This is a commonly used graph that summarizes the performance of a classifier over all possible thresholds.
- AUC is the area under the ROC curve, which plots the True Positive Rate against the False Positive Rate as the classification threshold varies; both rates lie in [0, 1].
- AUC has a range of [0, 1]. The greater the value, the better is the performance of our model.
- The false positive rate is on the X-axis and the true positive rate is on the Y-axis.
- The diagonal line of x = y represents the expected performance of a random model, so a usable model's curve should be above that line.
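Here is a minimal sketch of computing ROC AUC, assuming scikit-learn; the true labels and predicted probabilities are made-up toy values:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve at different thresholds
print(roc_auc_score(y_true, y_score))              # area under that curve (0.875 for this toy data)
```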