Classification dataset - dnum-mi/basegun-ml GitHub Wiki

Origin of images

Some images were obtained through partnerships with gun sellers or governmental databases.

Gun sellers:

Governmental databases:

For privacy reasons, we cannot disclose further details about the origin of the images or the collection method. For more information, please contact [email protected] or request access to this repo.

Quality verification

Labeling

Most images have gone through a manual labeling process. The labelers were members of the Ministry of the Interior with varying levels of knowledge in firearms identification, so the labeling task could not require much expert knowledge. Most images were already pre-classified into one folder per class; we then asked labelers to sort the images into one of two categories: "valid image" or "invalid image". A valid image must contain the entire firearm in the image frame, the firearm must be complete (no missing parts), and there must be only one firearm per image. Any other image is invalid.
For images where we had doubts about their assignment to a class folder, we asked firearm experts from the SCAE (the French firearms regulation service) to help us categorize them.

[FUTURE] Automatic reassignment

Although we did our best to avoid classification errors in the dataset, some mistakes inevitably remain. Therefore, once we have a reliable version of the classifier model, we will run it on the dataset to detect firearms that were put in the wrong category.
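The reassignment step described above could look like the following sketch: compare the model's prediction for each image against the label implied by its class folder, and flag confident disagreements for human review. The `predict` callable and all class names here are hypothetical placeholders, not the actual Basegun model API.

```python
from pathlib import Path

def find_suspect_images(image_paths, predict, min_confidence=0.9):
    """Flag images whose folder label disagrees with a confident prediction.

    `predict` is a hypothetical callable returning (label, confidence) for an
    image path; the parent folder name is taken as the ground-truth label,
    mirroring the dataset layout (one subfolder per class).
    """
    suspects = []
    for path in image_paths:
        folder_label = Path(path).parent.name
        predicted_label, confidence = predict(path)
        # Only report confident disagreements, to limit false alarms.
        if predicted_label != folder_label and confidence >= min_confidence:
            suspects.append((path, folder_label, predicted_label, confidence))
    return suspects

# Toy usage with a stubbed-out model (labels and paths are made up):
fake_predictions = {
    "train/revolver/img1.jpg": ("revolver", 0.98),  # agrees with folder
    "train/revolver/img2.jpg": ("pistol", 0.95),    # confident disagreement
    "train/pistol/img3.jpg": ("revolver", 0.55),    # low confidence: ignored
}
suspects = find_suspect_images(fake_predictions, fake_predictions.get)
```

Flagged images would still be reviewed manually (or by the SCAE experts) before being moved, since the model itself may be wrong.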

Dataset organisation

The main dataset folder has two subfolders:

  • train: images reserved for training the ML model, i.e. teaching it how to improve its predictions
  • val: images reserved for evaluating the quality of the ML model during and after training.

We chose from the beginning to separate the train/validation subsets for the following reasons:

  • we can manually check that firearm images are well distributed in the validation set, for example that the "other gun" category does not contain only Derringers
  • PyTorch (the Python ML framework we use for training Basegun models) expects the dataset to be organized this way
  • if the train/val datasets are fixed (rather than created randomly), two trainings with the same parameters are guaranteed to run under exactly the same conditions.

Besides, in March 2023, for v1 of the algorithm, we checked whether a random train/val split would be better than our manually created one. The manual train/val split (in yellow below) gave worse results than the random split (in orange) on the training set, but better results on the validation set. This was to be expected, since we made sure that all validation images could be correctly identified by the algorithm (e.g. the mechanism was visible), while we did not make an equivalent effort on the 40,000 pictures of the training set.

[Figures: accuracy on the training set and accuracy on the validation set]

Then, within each of the train and val folders, there is one subfolder per class used for the classification task. For instance, in dataset v0 each of the train and val folders has 10 subfolders.
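This layout (one subfolder per class under each split) is the one torchvision's `ImageFolder` infers class labels from: class indices are assigned by sorting the subfolder names. A minimal sketch of that inference, using hypothetical class names (the real dataset v0 has 10 class subfolders):

```python
import tempfile
from pathlib import Path

def discover_classes(split_dir):
    """Return the class-name -> index mapping that torchvision's ImageFolder
    would infer from this layout: one subfolder per class, sorted by name."""
    classes = sorted(d.name for d in Path(split_dir).iterdir() if d.is_dir())
    return {name: idx for idx, name in enumerate(classes)}

# Recreate the expected directory structure with placeholder class names:
root = Path(tempfile.mkdtemp())
for split in ("train", "val"):
    for cls in ("revolver", "pistol", "shotgun"):
        (root / split / cls).mkdir(parents=True)

class_to_idx = discover_classes(root / "train")
# Indices follow alphabetical order of the folder names.
```

Because the mapping depends only on sorted folder names, adding or renaming a class folder between dataset versions changes the indices, so model checkpoints should always be paired with the dataset version they were trained on.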

Dataset versions

A copy of each of these versions has been stored in two places:

  • a private OVH data storage container "basegun"
  • the NAS of the DNUM direction at the Ministry of the Interior.