Week 10 (Jan25) World Food Facts - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
0 - Summary
We worked on improving our white chocolate prediction by removing non-white-chocolate items from our database or adding multiple copies of white chocolate items before training our random forest.
1 - Improving White Chocolate Prediction
Last week we tried to predict white chocolate and failed miserably with an f1-score of about 0.23. We wanted to improve our prediction this week.
1.1 - Removing non-white-chocolate items
First we tried randomly removing items that aren't white chocolate from the database before training our forest. All of the following values were calculated several times, so that one lucky or unlucky set of random removals would not mislead us.
We started by randomly removing 1/50 of the non-white-chocolate items, which made very little difference to our prediction quality: we achieved an f1-score of about 0.22 (~0.01 worse than before). Next we tried removing more items, up to 1/10 of the non-white-chocolate items. This made our f1-score even worse, ending up at 0.09, with a very low recall of 0.05 and a precision of 0.5.
So we tried moving in the other direction and removing fewer items. Removing 1/300 of the non-white-chocolate items achieved an f1-score of 0.25 (~0.02 better than training with the full database). Moving too far in this direction worsens the prediction again: removing only 1/500 of the non-white-chocolate items yields f1-scores between 0.09 and 0.19.
So removing non-white-chocolate items didn't help us very much.
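The removal step above can be sketched roughly as follows. This is a minimal sketch: the item representation, the `undersample` helper, and the toy data are ours for illustration, not the actual pipeline code.

```python
import random

def undersample(items, is_positive, drop_fraction, seed=0):
    """Randomly drop `drop_fraction` of the negative (non-white-chocolate)
    items before training; positive (white-chocolate) items are always kept."""
    rng = random.Random(seed)
    kept = []
    for item in items:
        if is_positive(item):
            kept.append(item)           # never drop a white-chocolate item
        elif rng.random() >= drop_fraction:
            kept.append(item)           # keep this non-white-chocolate item
    return kept

# Toy example: 1000 negatives, 10 positives, drop 1/10 of the negatives.
items = [("neg", i) for i in range(1000)] + [("pos", i) for i in range(10)]
reduced = undersample(items, lambda it: it[0] == "pos", drop_fraction=1/10)
```

The reduced list then replaces the full database when fitting the random forest; varying `drop_fraction` corresponds to the 1/50, 1/10, 1/300, and 1/500 settings reported above.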
1.2 - Multiplying the white chocolate items in the database
Since we have very few white chocolate items in our database, multiplying them might help our predictor to form a model of what white chocolate is. In our first attempt we took every white-chocolate item and added it back into the database 20 times. With this we achieved amazing results:
- recall: 1.0
- precision: 0.99
- f1_score: 0.99
This result looks too good to be true, and it is. We checked our test set and our training set and found that all items in the test set were also present in the training set - so the algorithm didn't actually find any new items; it just recognized the ones it was trained with.
To remedy this issue we took half of the white-chocolate items out of the database before applying the multiplication and manually added them (also multiplied) to the test set. The other (multiplied) half of almost 800 items was split randomly between the training and test set. After this there should be plenty of white-chocolate items in the test set that our algorithm never saw during training. We tested our prediction again and achieved these satisfying results:
- recall: 0.78
- precision: 0.99
- f1_score: 0.88
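The core of the fix is to decide which positives are held out *before* duplicating, so that no copy of a test item can leak into the training set. The sketch below illustrates that idea in simplified form (our actual procedure additionally multiplied and randomly split the remaining half); the helper name and toy data are made up:

```python
import random

def split_then_oversample(positives, negatives, n_copies=20, seed=0):
    """Hold out half of the positive items for testing *before* duplicating,
    so no copy of a test item ends up in the training set."""
    rng = random.Random(seed)
    shuffled = positives[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    test_pos, train_pos = shuffled[:half], shuffled[half:]

    train = negatives + train_pos * n_copies   # duplicate only training positives
    test = test_pos * n_copies                 # duplicated held-out positives
    return train, test

positives = [("white", i) for i in range(40)]
negatives = [("other", i) for i in range(1000)]
train, test = split_then_oversample(positives, negatives)

# No white-chocolate item appears in both sets.
assert {p for p in train if p[0] == "white"}.isdisjoint(set(test))
```

Duplicating after the split preserves the benefit of oversampling (the forest sees the rare class often enough to model it) without the memorization effect that produced the misleading 0.99 f1-score.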
1.3 - Simple GUI
About the OpenFoodFact team
We built a simple GUI for our work to present our task more attractively.
The main page describes our team (WorldFoodFact) and the purpose of this project, and introduces the dataset (size, attributes, and a link to the dataset).
The next page shows a histogram of sugar content in the dataset; the histogram can be tuned by scrolling, which changes the number of bins.
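The re-binning triggered by the scroll control boils down to recounting the same values at a different bin width. A minimal sketch of that logic (the sugar values, value range, and helper name are made up for illustration; the GUI itself uses a plotting library):

```python
def sugar_histogram(sugar_values, n_bins, lo=0.0, hi=100.0):
    """Recompute histogram counts for the slider-selected number of bins,
    over a fixed range of lo..hi grams of sugar per 100 g."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in sugar_values:
        idx = min(int((v - lo) // width), n_bins - 1)  # clamp hi into last bin
        counts[idx] += 1
    return counts

# Scrolling from coarse to fine bins just re-bins the same data.
sugar = [2.0, 5.5, 10.0, 47.0, 52.5, 98.0]
coarse = sugar_histogram(sugar, n_bins=2)   # two 50 g-wide bins
fine = sugar_histogram(sugar, n_bins=10)    # ten 10 g-wide bins
```

Each scroll event only changes `n_bins` and redraws; the underlying data never needs to be reloaded.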
Another page shows a word cloud of the allergens attribute, which can also be tuned by scrolling: "minimum frequency" is the smallest number of occurrences a word must have in the dataset to appear, and "maximum words" is the maximum number of words displayed in the word cloud.
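The effect of the two word-cloud parameters can be sketched as a simple frequency filter (a sketch only: the helper name and toy allergen strings are ours, and the GUI presumably delegates the actual rendering to a word-cloud library):

```python
from collections import Counter

def wordcloud_words(allergen_texts, min_frequency, max_words):
    """Select the words shown in the word cloud: keep words occurring at
    least `min_frequency` times, then cap the list at `max_words` entries."""
    counts = Counter(w for text in allergen_texts for w in text.lower().split())
    frequent = [(w, c) for w, c in counts.most_common() if c >= min_frequency]
    return frequent[:max_words]

# Toy allergen lists: "milk" occurs 3 times, "soy" and "gluten" twice, "nuts" once.
allergens = ["milk soy", "milk nuts", "milk gluten", "soy gluten"]
words = wordcloud_words(allergens, min_frequency=2, max_words=3)
```

Scrolling either control just re-runs this selection with new thresholds before redrawing the cloud.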