Confusion matrices

What is a confusion matrix?

A confusion matrix is a table layout that allows the performance of a classifier to be visualised. Such a matrix can be applied to a binary classifier (two classes) or to a classifier with any number of classes for which the true values are known. A confusion matrix is not used when numeric values are being predicted (e.g. by a regression model).

Within a confusion matrix we find the following:

  • True Positives (TPs): the number of positive instances that the classifier correctly identified as positive.
  • True Negatives (TNs): the number of negative instances that the classifier correctly identified as negative.
  • False Positives (FPs): the number of instances the classifier identified as positive but are actually negative.
  • False Negatives (FNs): the number of instances the classifier identified as negative but are actually positive.

What is a good classifier? Ideally a classifier has large TP and TN counts and small FP and FN counts.

Example of a binary classifier

I am starting with an example set out by DataSchool and will try to build on it so that I grasp the principles for more complex classifiers. I find it is a leap from saying "the principles apply to a classifier with more than two classes" to actually understanding that. [Specifically, I am going to work out what this means for a larger classifier.]

Example:

We have n = 165 patients. They have either tested positive for something or not. We can see that:

  • 50 were predicted to not have the illness and actually don't. TN
  • 100 were predicted to have the illness and actually do. TP
  • 10 were predicted to have the illness and don't. We falsely predicted the positive class. FP (also known as a Type 1 error)
  • 5 were predicted to not have the illness and actually do. We falsely predicted the negative class. FN (also known as a Type 2 error)
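
A minimal sketch that reproduces these counts with scikit-learn, assuming the labels are 0 = no illness and 1 = illness (the label vectors below are made up purely to match the counts above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up label vectors that reproduce the counts in the example:
# 60 patients actually without the illness (50 TN + 10 FP),
# 105 patients actually with it (5 FN + 100 TP).
y_true = np.array([0] * 60 + [1] * 105)
y_pred = np.array([0] * 50 + [1] * 10 +   # actual no: 50 predicted no, 10 predicted yes
                  [0] * 5  + [1] * 100)   # actual yes: 5 predicted no, 100 predicted yes

cm = confusion_matrix(y_true, y_pred)     # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()
print(cm)                                 # [[ 50  10]
                                          #  [  5 100]]
print(tn, fp, fn, tp)                     # 50 10 5 100
```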

The next step is to work out model evaluation metrics. These metrics will help us choose between models:

  • True Positive Rate / Recall (ML lingo) / Sensitivity (sciency lingo): of those who actually have the illness, how often are they predicted to have it? TP/actual yes = 100/105 = 0.95
  • True Negative Rate (Specificity may be the more common term): when it is actually no, how often does it predict no? TN/actual no = 50/60 = 0.83
  • False Positive Rate: when it is actually no, how often does it predict yes? FP/actual no = 10/60 = 0.17 (these three rates are computed in the sketch after this list)
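
Continuing the sketch above, the three rates for the 165-patient example can be computed directly from the unpacked counts:

```python
recall      = tp / (tp + fn)   # true positive rate / sensitivity: 100/105 ≈ 0.95
specificity = tn / (tn + fp)   # true negative rate: 50/60 ≈ 0.83
fpr         = fp / (fp + tn)   # false positive rate: 10/60 ≈ 0.17
print(round(recall, 2), round(specificity, 2), round(fpr, 2))   # 0.95 0.83 0.17
```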

There are a few other important concepts & terms:

  • Accuracy: how often is the classifier correct? (TN + TP)/n = (50 + 100)/165 = 0.91
  • Misclassification or Error Rate: how often is it wrong? This is 1 - Accuracy, or (FP + FN)/n = 15/165 = 0.09
  • Precision: when it predicts yes (or in this case illness), how often is it correct? TP/predicted yes = 100/110 = 0.91
  • Prevalence: how often does yes (in this case illness) occur in the sample? Actual yes/total = 105/165 = 0.64 (see the sketch after this list)
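
Still continuing the same sketch, the remaining quantities follow from the counts unpacked earlier:

```python
n          = tn + fp + fn + tp   # 165 patients in total
accuracy   = (tp + tn) / n       # 150/165 ≈ 0.91
error_rate = 1 - accuracy        # equivalently (fp + fn)/n ≈ 0.09
precision  = tp / (tp + fp)      # 100/110 ≈ 0.91
prevalence = (tp + fn) / n       # 105/165 ≈ 0.64
```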

If you have a binary model and you are asked for the precision ratio and the recall ratio, the convention is to calculate and report these for the positive class. (It is possible to calculate both of these ratios for the "no" or "negative" class, but that is not the convention.)
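
scikit-learn follows the same convention: precision_score and recall_score report the positive class (label 1) by default, and the pos_label argument can be used to ask for the negative class instead. A quick sketch using the same y_true and y_pred as above:

```python
from sklearn.metrics import precision_score, recall_score

print(precision_score(y_true, y_pred))               # positive class (1): 100/110 ≈ 0.91
print(recall_score(y_true, y_pred))                  # positive class (1): 100/105 ≈ 0.95
print(precision_score(y_true, y_pred, pos_label=0))  # negative class (0): 50/55 ≈ 0.91
```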

Model outputs

  • Software output won't have labels like the blog post's example does.
  • We basically need to check the documentation for the layout of the confusion matrix.
  • The scikit-learn layout has actual values on the left (rows) and predicted values along the top (columns). This is the layout the blog post uses.

What about when there are more than two classes?

  • If you have more than two classes it is confusing to use these terms (TP, TN, FP, FN).
  • Some metrics are adapted for multi-class models; accuracy and misclassification (error) rate extend naturally.
  • Accuracy is computed by adding the numbers across the diagonal and dividing by the total number of predictions (see the sketch below).
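
A minimal sketch of that calculation, using a made-up 3 by 3 confusion matrix (rows = actual, columns = predicted):

```python
import numpy as np

# Hypothetical 3-class confusion matrix.
cm = np.array([[30,  2,  1],
               [ 4, 25,  3],
               [ 0,  5, 40]])

accuracy = np.trace(cm) / cm.sum()   # diagonal sum (correct predictions) / total predictions
print(round(accuracy, 2))            # 95/110 ≈ 0.86
```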

Let's take a three-class example

This is covered in the YouTube clip (22 minutes in).

The scikit-learn inputs are y_true = [0, 1, 2, 2, 2] and y_pred = [0, 0, 2, 2, 1] (reproduced in the sketch after the list below).

  1. Accuracy: we can see that the prediction was right 3 times out of 5, so accuracy is 3/5 (60%).

  2. Recall (class zero): when the true value is zero, how often did it predict zero? This happened 100% of the time (1 out of 1).

  3. Precision (class zero): when it predicted zero, how often was it correct? 1 out of 2, so 50%.
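
A minimal sketch reproducing those numbers with scikit-learn; classification_report prints precision and recall for each class:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
# [[1 0 0]
#  [1 0 0]
#  [0 1 2]]

print(classification_report(y_true, y_pred))
# class 0: precision 0.50, recall 1.00; overall accuracy 0.60
```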

Precision and recall are very popular metrics because they apply per class and so extend to multi-class problems.

The scikit-learn documentation example shows a 3 by 3 matrix. Both raw counts and normalised data are shown.

  • The diagonal elements represent the number of points for which the predicted label is equal to the true label. Off-diagonal elements are misclassified by the classifier. Higher diagonal values are better as they indicate many correct predictions.
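
A minimal sketch of the normalisation, reusing the five-sample example above: passing normalize="true" to confusion_matrix divides each row by the number of actual samples in that class, so each row shows per-class recall.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

print(confusion_matrix(y_true, y_pred, normalize="true"))
# Each row now sums to 1; e.g. the class-2 row is [0, 0.33, 0.67],
# matching a recall of 2/3 for class 2.
```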

Ten-class example from the scikit-learn documentation

  • Not labelled. The scikit-learn convention is that the true values are on the left (rows) and the predicted values are along the top (columns).
  • The convention is that classes appear in alphabetical or numeric order, starting at the top left.
  • The diagonals are the correct predictions. Let's pick 5, the largest number not on the diagonal, and work out what it means. It sits at an actual value of 3 and a predicted value of 8, so it represents 5 cases where the true value was 3 and the predicted value was 8. (This makes sense because it was an image classification problem and 3s and 8s look similar.) In this model, 8s did not get predicted as 3s. A sketch for finding the largest off-diagonal cell follows this list.
  • Looking at the classification report, we see that 3s have the worst recall. This makes sense from the confusion matrix.
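
A minimal sketch of reading the biggest confusion off a matrix like that, using a small made-up cm array (rows = actual, columns = predicted):

```python
import numpy as np

# Hypothetical 3-class confusion matrix.
cm = np.array([[52,  1,  2],
               [ 0, 48,  5],    # 5 cases: actual class 1, predicted class 2
               [ 1,  0, 56]])

off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)                      # ignore the correct predictions
actual, predicted = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(actual, predicted, cm[actual, predicted])    # 1 2 5 -> the most common confusion
```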

Choice of metric depends on business problem

  • Ultimately this is a value judgement about which errors to minimise.
  • Example of an email spam filter (where the positive class is spam). False negatives are more acceptable than false positives, as we'd be happier with the occasional bit of spam than with missing an important email. Here we would optimise for precision or specificity.
  • Example of fraud transaction detection (where the positive class is fraud). Here false positives (normal transactions flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected). In this case we would optimise for recall (sensitivity).

Sources