
Module 4.5: Evaluation Metrics – Precision, Recall & F₁-Score

After training a classifier, it is crucial to evaluate its performance. For binary or multiclass classification, the standard metrics are built from four outcome counts, defined with respect to a chosen positive class:

  • True Positives (TP): correctly predicted positive examples
  • False Positives (FP): negative examples incorrectly predicted positive
  • False Negatives (FN): positive examples incorrectly predicted negative
  • True Negatives (TN): correctly predicted negative examples

From these we derive:

  • Precision = TP / (TP + FP): of all examples predicted positive, the fraction that really are positive
  • Recall = TP / (TP + FN): of all truly positive examples, the fraction the classifier finds
  • F₁ = 2 · Precision · Recall / (Precision + Recall): the harmonic mean of precision and recall
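As a quick sanity check on these formulas, here is a toy computation with made-up counts (TP = 8, FP = 2, FN = 4 are chosen purely for illustration, not taken from any model in this handbook):

# Hypothetical counts, for illustration only
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)   # 8 / 10 = 0.80
recall    = TP / (TP + FN)   # 8 / 12 ≈ 0.67
f1        = 2 * precision * recall / (precision + recall)

print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")
# Precision=0.80, Recall=0.67, F1=0.73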

1. Evaluation with scikit-learn

Using the spam/ham example from Module 4.2 or 4.3, suppose we have the following test texts, gold labels and (placeholder) predictions:

texts  = ["limited time offer", "lunch with project team",
          "win cheap money", "report meeting tomorrow"]
y_true = ['spam', 'ham', 'spam', 'ham']
y_pred = ['spam', 'ham', 'spam', 'ham']  # replace with your model's actual predictions

Compute metrics and show a confusion matrix and classification report:

from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    precision_score,
    recall_score,
    f1_score
)

# 1. Confusion matrix
cm = confusion_matrix(y_true, y_pred, labels=['spam', 'ham'])
print("Confusion Matrix:\n", cm)
# Rows = true classes, columns = predicted classes (in the order given by labels=):
#               Pred: spam   Pred: ham
# True: spam   [[ TP,         FN ],
# True: ham     [ FP,         TN ]]

# 2. Classification report
# Pass labels= explicitly: without it scikit-learn sorts the classes
# alphabetically ('ham' before 'spam'), so target_names would be
# attached to the wrong rows.
print("\nClassification Report:\n",
      classification_report(y_true, y_pred,
                            labels=['spam', 'ham'],
                            target_names=['spam', 'ham']))

# 3. Precision, Recall, F1 (binary)
prec = precision_score(y_true, y_pred, pos_label='spam')
rec  = recall_score(y_true, y_pred, pos_label='spam')
f1   = f1_score(y_true, y_pred, pos_label='spam')
print(f"Precision (spam) = {prec:.2f}")
print(f"Recall    (spam) = {rec:.2f}")
print(f"F1 Score  (spam) = {f1:.2f}")

# 4. Macro- and micro-averaged scores (multiclass)
print("Macro-avg F1:", f1_score(y_true, y_pred, average='macro'))
print("Micro-avg F1:", f1_score(y_true, y_pred, average='micro'))

Output (the placeholder predictions above match y_true exactly, so every metric comes out as 1.00):

Confusion Matrix:
 [[2 0]
 [0 2]]

Classification Report:
               precision    recall  f1-score   support

        spam       1.00      1.00      1.00         2
         ham       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4

Precision (spam) = 1.00
Recall    (spam) = 1.00
F1 Score  (spam) = 1.00
Macro-avg F1: 1.0
Micro-avg F1: 1.0
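If you want the per-class scores as plain numbers rather than a formatted report, scikit-learn also provides precision_recall_fscore_support. A minimal sketch, reusing the placeholder y_true/y_pred from above:

from sklearn.metrics import precision_recall_fscore_support

y_true = ['spam', 'ham', 'spam', 'ham']
y_pred = ['spam', 'ham', 'spam', 'ham']

# average=None returns one score per class, in the order given by labels=
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=['spam', 'ham'], average=None, zero_division=0
)
for label, p, r, f, n in zip(['spam', 'ham'], prec, rec, f1, support):
    print(f"{label}: precision={p:.2f}  recall={r:.2f}  f1={f:.2f}  (support={n})")

This is the same information the classification report prints, just in a form that is easier to feed into further computation.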

2. Manual Computation Example

For a deeper understanding, the same metrics can be computed by hand:

# Example predictions
y_true = ['spam', 'spam', 'ham', 'ham']
y_pred = ['spam', 'ham',  'ham', 'spam']

# 1. Count TP, FP, FN, TN for 'spam'
TP = sum(1 for t, p in zip(y_true, y_pred) if t=='spam' and p=='spam')
FP = sum(1 for t, p in zip(y_true, y_pred) if t=='ham' and p=='spam')
FN = sum(1 for t, p in zip(y_true, y_pred) if t=='spam' and p=='ham')
TN = sum(1 for t, p in zip(y_true, y_pred) if t=='ham' and p=='ham')

precision = TP / (TP + FP) if TP + FP > 0 else 0
recall    = TP / (TP + FN) if TP + FN > 0 else 0
f1        = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0

print(f"TP={TP}, FP={FP}, FN={FN}, TN={TN}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")

Output:

TP=1, FP=1, FN=1, TN=1
Precision=0.50, Recall=0.50, F1=0.50
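
To see what the macro average from step 4 actually does, repeat the same counting for the 'ham' class and average the two per-class F₁ scores. A short sketch continuing from the variables above:

# Per-class counts for 'ham' (mirror of the 'spam' counting above)
TP_h = sum(1 for t, p in zip(y_true, y_pred) if t == 'ham'  and p == 'ham')
FP_h = sum(1 for t, p in zip(y_true, y_pred) if t == 'spam' and p == 'ham')
FN_h = sum(1 for t, p in zip(y_true, y_pred) if t == 'ham'  and p == 'spam')

prec_h = TP_h / (TP_h + FP_h) if TP_h + FP_h > 0 else 0
rec_h  = TP_h / (TP_h + FN_h) if TP_h + FN_h > 0 else 0
f1_h   = 2 * prec_h * rec_h / (prec_h + rec_h) if prec_h + rec_h > 0 else 0

# Macro-F1 = unweighted mean of the per-class F1 scores
macro_f1 = (f1 + f1_h) / 2
print(f"F1(ham)={f1_h:.2f}, Macro-F1={macro_f1:.2f}")
# F1(ham)=0.50, Macro-F1=0.50

This matches f1_score(y_true, y_pred, average='macro') from section 1.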

Continue to [Module 5: Neural Network Fundamentals](https://github.com/iffatAGheyas/NLP-handbook/wiki/Module-5-Neural-Network-Fundamentals)