4. Pseudo Labeling Data Augmentation - SummerBigData/Iceberg GitHub Wiki

We are provided with over 8000 images without labels, which are used for calculating the Kaggle score. This is easily over four times as much data as the training set. So, after reading this, I learned how pseudo-labeling can help boost performance when there is a lot of unlabeled data. The idea is that you take your best model and weights from the CNN and feed it the unlabeled images to get predicted labels. We call these pseudo-labels, since they aren't 100% accurate. Now, we take this pseudo-labelled data and augment the training set with it. This method is supposed to expose the CNN to features of each class that it wouldn't have seen otherwise. The protocol calls for 1 part pseudo-labelled data to 3-4 parts labelled data. I tested this with different amounts of pseudo-labelled data:

As we can see, there seems to be an initial dip in the scores, followed by a rise and another fall. Given that the uncertainty in the accuracy is likely around a percent, I think repeated trials would help solidify what is going on here. I do believe this method is at least slightly successful. The other interesting thing here is that the training set's accuracy steadily dipped as more pseudo-labelled data was added. I think this is because some of the pseudo-labels are wrong, so using them contaminates the dataset with label noise in some irrecoverable way.
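
The augmentation step described above can be sketched roughly as follows. This is a minimal illustration and not the code used for these results: the function names, the 0.5 threshold, and the `labeled_per_pseudo` parameter are all assumptions, and `probs` stands in for whatever iceberg probabilities the best CNN predicts for the unlabeled images.

```python
import numpy as np

def make_pseudo_labels(probs, threshold=0.5):
    """Turn predicted class-1 probabilities into hard 0/1 pseudo-labels.

    The 0.5 threshold is an assumption made for this sketch.
    """
    return (np.asarray(probs) >= threshold).astype(int)

def augment_with_pseudo_labels(X_train, y_train, X_unlabeled, probs,
                               labeled_per_pseudo=3, seed=0):
    """Append a random subset of pseudo-labelled samples to the training set.

    Roughly 1 pseudo-labelled sample is added per `labeled_per_pseudo`
    labelled samples (3 here, following the 1 : 3-4 guideline).
    """
    rng = np.random.default_rng(seed)
    # Never request more pseudo-labelled samples than are available.
    n_pseudo = min(len(X_train) // labeled_per_pseudo, len(X_unlabeled))
    idx = rng.choice(len(X_unlabeled), size=n_pseudo, replace=False)
    y_pseudo = make_pseudo_labels(np.asarray(probs)[idx])
    X_aug = np.concatenate([X_train, X_unlabeled[idx]], axis=0)
    y_aug = np.concatenate([y_train, y_pseudo], axis=0)
    return X_aug, y_aug
```

Varying `labeled_per_pseudo` is one way to test different amounts of pseudo-labelled data; the augmented arrays would then be used to retrain the CNN from scratch or from the saved weights.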

The future goal here is to solidify the testing set line with more trials, pick the run that did best, and repeat the process by augmenting again with the remaining unlabeled data. By repeating this process, we build on our pseudo-label accuracy, which in turn improves the accuracy of our training data.
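
A rough sketch of what this repeated process might look like is given below. It is only an outline under stated assumptions: `fit_predict` is a hypothetical callable that trains a model on the current training set and returns class-1 probabilities for new images, `batch_size` and `n_rounds` are made-up parameters, and the step of picking the best run between rounds is left out for brevity.

```python
import numpy as np

def iterative_pseudo_labeling(X_train, y_train, X_unlabeled, fit_predict,
                              batch_size, n_rounds):
    """Repeatedly retrain and absorb a batch of the remaining unlabeled data.

    fit_predict(X, y, X_new) is a hypothetical callable: it trains a model
    on (X, y) and returns predicted class-1 probabilities for X_new.
    """
    X_cur, y_cur = X_train, y_train
    remaining = X_unlabeled
    for _ in range(n_rounds):
        if len(remaining) == 0:
            break  # nothing left to pseudo-label
        # Take the next batch out of the remaining unlabeled pool.
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # Retrain on everything gathered so far, then label the new batch.
        probs = fit_predict(X_cur, y_cur, batch)
        y_batch = (np.asarray(probs) >= 0.5).astype(int)
        X_cur = np.concatenate([X_cur, batch], axis=0)
        y_cur = np.concatenate([y_cur, y_batch], axis=0)
    return X_cur, y_cur, remaining
```

Because each round trains on pseudo-labels from earlier rounds, any labelling mistakes carry forward, so it would make sense to check the score on held-out labelled data after every round.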