Multiclassification evaluation
There are two cases:
- Averaging binary statistics from multiple datasets, in which case the micro-averaged precision and recall can differ from each other: http://rushdishams.blogspot.com/2011/08/micro-and-macro-average-of-precision.html
- Averaging binary statistics from multiple classes.
Here we consider the second case. In multiclass classification it is difficult to define FP, FN, TP, and TN directly, since there are more than two classes; instead, the binary statistics are computed per class (one-vs-rest) and averaged over the classes in two ways:
- micro: Calculate metrics globally by counting the total true positives, false negatives and false positives. In single-label multiclass data the total FP and the total FN are equal (every false positive for one class is a false negative for another), leading to recall = precision = F1 = F2 (see the sketch after this list).
- macro: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
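As a minimal sketch of the two schemes (the per-class counts below are made-up numbers, not from any dataset on this page):

```python
import numpy as np

# Hypothetical one-vs-rest counts for two classes: (TP_i, FP_i)
tp = np.array([8, 1])
fp = np.array([2, 4])

# macro: compute precision per class, then take the unweighted mean
macro_precision = np.mean(tp / (tp + fp))           # (0.8 + 0.2) / 2 = 0.5

# micro: pool all counts first, then compute a single precision
micro_precision = tp.sum() / (tp.sum() + fp.sum())  # 9 / 15 = 0.6

print("macro:", macro_precision, "micro:", micro_precision)
```

Here the frequent, better-performing class dominates the micro number, while macro gives both classes equal weight.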
In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance (i.e. you may have many more examples of one class than of the others).
Example:

Take precision, Pr = TP / (TP + FP), with

realLabels = [1,1,1,1,1, 2,2,2, 3,3,3,3,3,3]
estimatedLabels = [1,2,2,2,2, 3,2,2, 1,3,2,2,2,2]

- Class 1: TP = 1, FP = 1, so Pr1 = 0.5
- Class 2: TP = 2, FP = 8, so Pr2 = 0.2
- Class 3: TP = 1, FP = 1, so Pr3 = 0.5

macro-average Pr = (Pr1 + Pr2 + Pr3) / 3 = (0.5 + 0.2 + 0.5) / 3 = 0.4
micro-average Pr = (1 + 2 + 1) / ((1 + 2 + 1) + (1 + 8 + 1)) = 4/14 ≈ 0.286

Class 2 has a larger effect on the micro-average than on the macro-average, since it accounts for 10 of the 14 predictions.
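The hand computation can be cross-checked with scikit-learn's precision_score, which takes the same average argument (a minimal sketch using the labels above):

```python
from sklearn.metrics import precision_score

realLabels = [1,1,1,1,1, 2,2,2, 3,3,3,3,3,3]
estimatedLabels = [1,2,2,2,2, 3,2,2, 1,3,2,2,2,2]

# micro: pool all TP/FP counts, then divide -> 4/14
print(precision_score(realLabels, estimatedLabels, average='micro'))  # ≈ 0.286
# macro: unweighted mean of the per-class precisions -> (0.5 + 0.2 + 0.5) / 3
print(precision_score(realLabels, estimatedLabels, average='macro'))  # 0.4
```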
- Python Example

```python
from sklearn.metrics import f1_score

realLabels = [1,1,1,1,1, 2,2,2, 3,3,3,3,3,3]
estimatedLabels = [1,2,2,2,2, 3,2,2, 1,3,2,2,2,2]

# pos_label only matters for average='binary', so it is omitted here
a = f1_score(realLabels, estimatedLabels, average='micro', sample_weight=None)
b = f1_score(realLabels, estimatedLabels, average='macro', sample_weight=None)
print("micro-avg", a, "macro-avg", b)
```

For these labels this prints micro-avg ≈ 0.2857 (= 4/14, the accuracy) and macro-avg ≈ 0.2811.
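To see where the macro number comes from, f1_score with average=None returns the per-class F1 scores; the macro average is simply their unweighted mean:

```python
import numpy as np
from sklearn.metrics import f1_score

realLabels = [1,1,1,1,1, 2,2,2, 3,3,3,3,3,3]
estimatedLabels = [1,2,2,2,2, 3,2,2, 1,3,2,2,2,2]

per_class = f1_score(realLabels, estimatedLabels, average=None)
print(per_class)           # ≈ [0.2857, 0.3077, 0.25] for classes 1, 2, 3
print(np.mean(per_class))  # ≈ 0.2811, the macro-averaged F1 above
```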