Dupes: Training Dataset Corrections via Iterative Trainings - WHOIGit/ifcb_classifier GitHub Wiki

Improvements to a training dataset can lead to improvements in training results, but making such corrections by hand can be tedious and error-prone even for smaller datasets. This section describes a method of leveraging the stochasticity of model training to identify potentially mis-labeled training data.

Premise

If we train a model with the same input parameters n times, and a particular datum is chronically and consistently mis-classified as some other class in each independent run, that datum may have been incorrectly labeled in the training set and may actually BE that other class. Correcting such errors in the dataset will improve subsequent trainings.

Methodology

In order to assess the whole dataset, we need to be able to run the test/validation/assessment process on every datum. To do this we run two trainings in which the dataset has been split 50:50 on a per-class basis (as opposed to the typical 80:20 training:testing ratio); between the two training runs, the training and testing data are swapped so that the combined validation outputs of the pair encompass the whole dataset.
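The swapped 50:50 split can be sketched as follows. This is a minimal illustration, assuming the dataset is available as a list of (path, label) pairs; the function name and data format are not from the ifcb_classifier codebase.

```python
import random
from collections import defaultdict

def fifty_fifty_split(labeled_paths, seed=0):
    """Split (path, label) pairs 50:50 per class into two complementary halves.

    Training on half A while validating on half B, then swapping the roles,
    yields validation predictions that together cover the entire dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in labeled_paths:
        by_class[label].append(path)

    half_a, half_b = [], []
    for label, paths in by_class.items():
        rng.shuffle(paths)          # randomize within each class
        mid = len(paths) // 2       # split each class down the middle
        half_a += [(p, label) for p in paths[:mid]]
        half_b += [(p, label) for p in paths[mid:]]
    return half_a, half_b
```

Run 1 then trains on `half_a` and validates on `half_b`; run 2 does the reverse.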
We run the above training pair 10 times, aggregate the results, and flag any image that has been chronically mis-classified with the same "erroneous" label. The threshold for this can be arbitrarily set at 10, 9, or 8 occurrences. We call these chronically mis-classified images "dupes". The correction workflow then proceeds as follows:

1. Dupes are exported/copied to a directory mirroring the structure of the original dataset (each dupe is placed under the same class-label subdirectory it occupies in the original dataset). Each filename, however, is prefixed with the other class (and the number of times) it was duplicitously classified as.
2. The files in this new dupes-directory are recorded to a "before" text file for later comparison.
3. A technician goes through the dupes-directory and deletes any datum/image that truly DOES belong in that directory. By removing these from the dupes-directory structure, they will NOT be corrected/reclassified. (More reclassification options are available to the technician here, to be detailed later.)
4. Once assessed, the dupes-directory is re-scanned to an "after" text file.
5. By comparing the "before" and "after" text files, a list of files-to-be-updated in the original dataset is generated. Files that were chronically mis-classified but were found by the technician to truly HAVE been correctly labeled are also noted, so that these files can be ignored if this methodology is repeated later to further improve the dataset.
6. A new dataset with re-classified or expunged dupes is exported.
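The aggregation, flagging, and export steps above can be sketched as below. This is a hypothetical implementation, not the repository's actual code: the `runs` format (one dict per run mapping image path to a (true_label, predicted_label) pair) and the filename-prefix convention are assumptions for illustration.

```python
import shutil
from collections import Counter, defaultdict
from pathlib import Path

def flag_dupes(runs, threshold=8):
    """Aggregate per-image misclassifications across repeated training runs.

    An image is flagged as a "dupe" when it was assigned the SAME wrong
    label at least `threshold` times. Returns {path: (wrong_label, count)}.
    """
    wrong = defaultdict(Counter)  # path -> counts of each erroneous label
    for run in runs:
        for path, (true_label, pred) in run.items():
            if pred != true_label:
                wrong[path][pred] += 1

    dupes = {}
    for path, counts in wrong.items():
        label, n = counts.most_common(1)[0]  # most frequent wrong label
        if n >= threshold:
            dupes[path] = (label, n)
    return dupes

def export_dupes(dupes, dataset_root, dupes_root):
    """Copy flagged images into a mirror of the dataset layout, prefixing
    each filename with the suspected class and its misclassification count."""
    for path, (other_class, n) in dupes.items():
        src = Path(dataset_root) / path
        dst = Path(dupes_root) / Path(path).parent / f"{other_class}_{n}_{src.name}"
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
```

After the technician's pass, diffing a "before" and "after" listing of `dupes_root` (e.g. via sorted `Path.rglob("*")` output) yields the files to re-label in the original dataset.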

Results

Using the above method we were able to identify images that had been incorrectly labeled. By correcting these images in our dataset we improved our classification results by a few percentage points. Dupes testing is therefore a worthwhile step for identifying data that may have been incorrectly labeled in the original dataset.