Classification training v1

Dataset v1

Here are the changes in v1 compared to v0:

  • We manually reviewed all photos in dataset v0 to check that each photo has the correct label.
  • For the bolt-action class/typology epaule_a_verrou, we deleted the images where the bolt was not visible (often the left profile).
  • We split the former autre_epaule into 3 classes: epaule_semi_auto_style_chasse, epaule_semi_auto_style_militaire_milieu_20e and semi_auto_style_militaire_autre ➡️ dataset v1 has 12 classes/typologies, as opposed to 10 for v0.
  • We moved the pistolets mitrailleurs (submachine guns) from autre_pistolet to semi_auto_style_militaire_autre.
  • We added new images to all classes that had fewer than 5000 images.
  • Some classes were renamed: "..._a_percussion_silex" ➡️ "..._a_mecanisme_ancien".

[Image: class distribution of dataset v1]

Training strategy

  1. First, we tried using exactly the same config as the algo in prod: EffnetB4, 30 epochs, batch size 256, OneCycleLR max_lr=0.005, etc. (a minimal sketch of this configuration is shown after the matrices below). The result was significantly less performant than the same algorithm trained on dataset v0.
[Left: confusion matrix for the algo trained on dataset v0. Right: confusion matrix for the algo trained on dataset v1.]
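A minimal sketch of this configuration, assuming an SGD optimizer (the optimizer choice and the steps_per_epoch placeholder are assumptions; the model, number of epochs, batch size and max_lr come from the text above):

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR
from torchvision.models import efficientnet_b4

EPOCHS, BATCH_SIZE, MAX_LR = 30, 256, 0.005  # values from the prod config above

model = efficientnet_b4(num_classes=12)  # dataset v1 has 12 classes
# SGD is an assumption for illustration; the wiki only names the scheduler.
optimizer = torch.optim.SGD(model.parameters(), lr=MAX_LR, momentum=0.9)
steps_per_epoch = 100  # placeholder: use len(train_loader) in practice
scheduler = OneCycleLR(optimizer, max_lr=MAX_LR,
                       epochs=EPOCHS, steps_per_epoch=steps_per_epoch)
```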

We noticed several things:

  • the overall performance of the algo decreased due to the introduction of new classes which are harder to identify. epaule_semi_auto_style_chasse (semi-auto long guns hunting style) and especially epaule_semi_auto_style_militaire_milieu_20e (semi-auto long guns military WW2 style) have bad scores which drag the accuracy score down.
  • the introduction of new classes also negatively impacts other classes: epaule_a_pompe (pump-action) had an accuracy of 0.58 with dataset v0 and decreased to 0.55 with v1, because in v1 it gets more confused with epaule_semi_auto_style_chasse, whose images were part of class autre_epaule in v0 and thus did not weigh on the accuracy computed for v0.
  • for former classes which are not confused with the newly added classes, the addition of new images was a large benefit: ancient long guns epaule_a_mecanisme_ancien rose from 0.57 to 0.81, and various pistols autre_pistolet from 0.55 to 0.61.
  • for the bolt-action long guns class epaule_a_verrou, our meticulous verification of dataset v0 seems to have had no impact, the performance of the algorithm being pretty much the same between v0 and v1 ➡️ this raises the following hypothesis: the algorithm probably does not look at the bolt to classify images in this class, and must look at other elements like the barrel shape or the material of the stock. If this hypothesis is true, it is very problematic because it means the algorithm is not reliable for this class.
  2. Next, we wanted to verify whether the bad statistics were due only to the introduction of new classes, or also to the introduction of new images for the former classes. We completely removed the classes epaule_semi_auto_style_chasse, epaule_semi_auto_style_militaire_milieu_20e and semi_auto_style_militaire_autre (formerly autre_epaule in v0) to check whether the scores for the other classes would stay the same, increase or decrease compared to v0 (a sketch of this filtering is shown after the matrices below).
[Left: algo trained on dataset v0. Right: algo trained on dataset v1 without the former autre_epaule classes.]

/!\ This comparison is not perfect, since the best would have been to compare with an algorithm trained on v0 without the class autre_epaule. For instance, the score of epaule_a_pompe in the left matrix is impacted by the fact that autre_epaule can be a source of confusion, while this is not the case for the right matrix. Still, we see that all former classes have a far better score in v1, which seems to confirm the hypothesis that the overall decrease in performance comes from the split of autre_epaule into 3 classes which are hard to distinguish.
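A minimal sketch of how such a filtered training set could be built with torchvision's ImageFolder (the dataset path is hypothetical, and a real run would also remap the remaining label indices to be contiguous):

```python
import torch
from torchvision.datasets import ImageFolder

# Classes removed for this experiment (formerly autre_epaule in v0).
REMOVED = {
    "epaule_semi_auto_style_chasse",
    "epaule_semi_auto_style_militaire_milieu_20e",
    "semi_auto_style_militaire_autre",
}

dataset = ImageFolder("data/train")  # hypothetical path
keep = [i for i, (_, label) in enumerate(dataset.samples)
        if dataset.classes[label] not in REMOVED]
train_subset = torch.utils.data.Subset(dataset, keep)
# Label indices are unchanged by Subset, so remap them before training
# if the loss expects contiguous class ids.
```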

  3. Since we suspected that for long guns the algorithm does not properly look at the mechanism but rather at other factors (overall style, stock material, etc.), we tried to change the dataset so that it understands that for most guns, the center part is the most important for identifying the mechanism. Instead of feeding the algorithm the full picture at training, we changed the data augmentation step so that it randomly crops the center part at a random size.


Examples of images after data augmentation. The stock and barrel might be cut, but the main mechanism is almost always visible.

This gave much better results.

[Left: algo trained on dataset v1 with full pictures. Right: algo trained on dataset v1 with the cropped center.]

Besides, when the algorithm was trained on full images, the scores on the validation set were usually better than on the training set, which is quite unusual when training a deep neural network. Now with the cropped images (dark blue curves), the training accuracy is closer to the validation accuracy than before, and inversely for the loss curves. This can be interpreted as a better way to train the model.


In orange: algo trained on full images, dataset v0; in yellow: algo trained on full images, dataset v1; in dark blue: algo trained on cropped images, dataset v1.

We tried several variations of the data augmentation step; in the end, the best result is obtained with a random resize to between 1.2×N and 1.5×N followed by a random crop of size N, where N is the input size of the network. We also tested changing the order of the transforms in the data preparation step for the training set:

  • resize
  • crop
  • rotation
  • perspective
  • color jitter

Putting rotation & perspective right after resize decreases the performance. Putting color jitter right after crop does not change the performance. In the end, the list above is the order giving the best results (a sketch of the resulting pipeline is shown below).
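A minimal sketch of the resulting pipeline with torchvision transforms, assuming N=380 as the network input size and illustrative rotation/jitter parameters (the random resize is written as a small custom transform, since the classic torchvision Resize takes a fixed size):

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

N = 380  # network input size; hypothetical value, depends on the EfficientNet used

class RandomScaleResize:
    """Resize the shorter side to a random size between 1.2*N and 1.5*N."""
    def __init__(self, n, low=1.2, high=1.5):
        self.n, self.low, self.high = n, low, high

    def __call__(self, img):
        size = int(self.n * random.uniform(self.low, self.high))
        return TF.resize(img, size)

train_transforms = T.Compose([
    RandomScaleResize(N),          # resize (random scale in [1.2N, 1.5N])
    T.RandomCrop(N),               # crop: the central mechanism stays visible most of the time
    T.RandomRotation(degrees=10),  # rotation (angle is an illustrative value)
    T.RandomPerspective(),         # perspective
    T.ColorJitter(0.2, 0.2, 0.2),  # color jitter (illustrative values)
    T.ToTensor(),
])
```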
  4. We increased the complexity of the network. In the tests of the previous version, we noticed that the models with the best classification performance (while still being light enough to predict on CPU) were the EfficientNets, in increasing order from B0 to B7 (except B5, which gives poorer performance than B4). Therefore this time we only tested EfficientNetB4, EfficientNetB6 and EfficientNetB7.

  5. For EfficientNetB4 the model seems to have reached a ceiling before 30 epochs, but for EfficientNetB6 and B7 it looks like the score could still increase. We made some tests changing the max_lr of the OneCycleLR scheduler and the number of epochs (sketched below). We see that in the end, as found in the tests of the previous version, OneCycleLR with max_lr=0.005 and 30 epochs still gives the best performance.

[Image: comparison of max_lr and epoch settings]
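A minimal sketch of these tests as a simple grid search (the alternative max_lr and epoch values are hypothetical; only the retained setting, max_lr=0.005 with 30 epochs, comes from the text above, and train_and_evaluate is a hypothetical helper):

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR
from torchvision.models import efficientnet_b6

# Hypothetical grid; only max_lr=0.005 with 30 epochs (the retained setting) is from the wiki.
for max_lr in (0.001, 0.005, 0.01):
    for epochs in (30, 50):
        model = efficientnet_b6(num_classes=12)
        optimizer = torch.optim.SGD(model.parameters(), lr=max_lr, momentum=0.9)
        scheduler = OneCycleLR(optimizer, max_lr=max_lr,
                               epochs=epochs, steps_per_epoch=100)  # placeholder steps
        # train_and_evaluate(model, optimizer, scheduler, epochs)  # hypothetical helper
```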

  6. We considered merging epaule_semi_auto_style_militaire_milieu_20e (semi-auto long guns, military WW2 style) with another class. Indeed, (i) the classification algorithm remained pretty unsure about this class (accuracy < 0.5), (ii) we had much fewer images for this class than for the others (see the graph on dataset v1 distribution above), (iii) it is a class that is very hard to characterize precisely, even for human experts. Under French law, these firearms are most of the time classified in category A or B because they can hold a lot of ammunition which can be fired semi-automatically, sometimes automatically. For these reasons we tried to merge it with semi_auto_style_militaire_autre, giving a new class named "semi_auto_style_militaire" (semi-auto military style), as sketched below.

[Image: results after merging the two classes]
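A minimal sketch of the merge as a label remapping (the class names are from the text; where exactly the remapping is applied in the data pipeline is an assumption):

```python
# Merge the WW2-style class into the other military-style class.
MERGE = {
    "epaule_semi_auto_style_militaire_milieu_20e": "semi_auto_style_militaire",
    "semi_auto_style_militaire_autre": "semi_auto_style_militaire",
}

def merged_label(label: str) -> str:
    """Return the class name after the merge; other classes are unchanged."""
    return MERGE.get(label, label)
```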

  7. As in the previous version of the algorithm, we noticed that EfficientNetB4, B6 and B7 (not B5!) give the best results. Since torchvision==0.14 (PyTorch==1.13), the EfficientNetV2 architecture is included in the library. In our state-of-the-art research we intuited that this type of algorithm would not bring much improvement compared to classic EfficientNets, but we preferred to test it to make sure.

[Image: EfficientNetV2 test results]

Our tests show that EffNetV2_s and EffNetV2_m give poorer results than regular EfficientNets. We did not test EffNetV2_l since its number of parameters exceeds that of EffNetB7 (see the parameter-count sketch below).
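A minimal sketch of the parameter-count check used to rule out EffNetV2_l (the exact counts printed depend on the torchvision version):

```python
from torchvision.models import (efficientnet_b7, efficientnet_v2_l,
                                efficientnet_v2_m, efficientnet_v2_s)

def n_params(model):
    return sum(p.numel() for p in model.parameters())

# EffNetV2_l has more parameters than EffNetB7, hence it was excluded.
for build in (efficientnet_v2_s, efficientnet_v2_m, efficientnet_v2_l, efficientnet_b7):
    print(f"{build.__name__}: {n_params(build()) / 1e6:.1f}M parameters")
```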
