Week 08 (Jan11) World Food Facts - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
0 - Summary
We changed how we evaluate our predictions from accuracy to precision, recall and f1-score. Accuracy is nearly meaningless here: when we predict chocolate, the vast majority of items in our database are true negatives, which inflates accuracy even if the predictor performs poorly on actual chocolate items.
Next we visualized chocolates to see whether we could find clusters that would indicate whether chocolates can be predicted.
Then we trained a random forest to predict dark chocolate instead of all chocolate types, applied the same evaluation and visualization to it, and tried out different combinations of attributes for prediction.
1 - Weekly Work
1.1 Evaluating Last Week's Prediction of Chocolate
Last week we trained a random forest to predict chocolate and tested it with a test subset of the entire food database and with our known-sweets subset of the database. Due to the overwhelming number of non-chocolate items, the predictor had a very high accuracy score. This week we calculated the precision, the recall and the f1-score (the harmonic mean of precision and recall, which combines both into one metric). The predictor managed to find ~64% of the chocolate items in the food database (recall), and ~69.2% of the items that were predicted to be chocolate actually are chocolate (precision). This combines to an f1-score of ~0.665, which is better than we feared it would be. Still, the current predictor leaves much to be desired.
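The relationship between these metrics can be sketched in a few lines of Python. The confusion-matrix counts below are hypothetical: they were chosen only so that the ratios roughly match the recall and precision reported above, and they are not the counts from our actual test run.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and f1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted chocolate that really is chocolate
    recall = tp / (tp + fn)      # share of real chocolate that was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Hypothetical counts (NOT our real confusion matrix): 1000 chocolate items
# among 100000 foods, of which 640 are found, plus 285 false alarms.
accuracy, precision, recall, f1 = classification_metrics(
    tp=640, fp=285, fn=360, tn=98715
)

print(f"accuracy={accuracy:.4f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these made-up counts the accuracy exceeds 99% although more than a third of the chocolate items are missed, which is why accuracy alone says little when one class dominates.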
To get a sense of how useful the attributes we used to predict the chocolate are, we visualized the distribution of chocolate based on these attributes. Our hope was to find tight clusters that would indicate that chocolate would be easy to predict using these attributes.
There's a bit of a cluster around ~1700 to 2600 kJ of energy, but surprisingly little clustering around sugar.
Fat doesn't look promising, but there's a tight cluster around salt between 0g and 2g.
Sadly, all foods show a low-salt cluster as well. Still, the very low salt content of just 0-2g seems to be a characteristic that can be used to identify chocolates, since foods overall cluster around 0-10g.
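One way to put a number on such a visual impression is to compare the share of chocolate items and the share of all items that fall inside the candidate range. The following is only a minimal sketch: the helper `share_in_range` is a hypothetical name, and the salt values are made-up toy numbers, not values from our database.

```python
def share_in_range(values, low, high):
    """Return the fraction of values inside the closed interval [low, high]."""
    if not values:
        raise ValueError("values must not be empty")
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


# Made-up toy salt values in g per 100g (NOT taken from the dataset).
chocolate_salt = [0.1, 0.2, 0.05, 0.3, 0.15, 1.8, 0.25, 0.4]
all_food_salt = [0.1, 0.5, 1.2, 2.5, 3.0, 0.8, 6.0, 9.5, 0.3, 4.2]

choc_share = share_in_range(chocolate_salt, 0.0, 2.0)
food_share = share_in_range(all_food_salt, 0.0, 2.0)

print(f"chocolate within 0-2g: {choc_share:.0%}")
print(f"all foods within 0-2g: {food_share:.0%}")
```

The larger the gap between the two shares, the more the range helps to tell chocolate apart from other foods. If both shares are high, the attribute separates poorly on its own.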
1.2 Predicting Dark Chocolate
Our next step to improve the prediction quality was to try to be more precise in what we want to predict, so we started with dark chocolate. We also visualized dark chocolate with the attributes we used to predict all chocolates to find clusters.
We compared the distribution of dark chocolates to the distribution of all chocolates on sugar and energy. Luckily, the cluster for energy is much tighter and clearer with dark chocolate. It also clusters around 20g - 55g of sugar. This makes dark chocolate prediction based on these attributes look more promising than regular chocolate prediction.
On the other hand, there are only 724 dark chocolate items in our dataset, so the low sample size might be an issue.
Next we trained several random forests with a training subset of dark chocolate items using different attributes. Using the same attributes we used for predicting regular chocolate [fat, sugar, salt, energy], we achieved a recall of ~68% and a precision of ~74%, which combines to an f1-score of 0.70. This is an improvement of ~0.035 over our chocolate predictor, which is not bad, but not great either. So we tried different attributes for prediction. Adding "sodium per 100g" mildly improved precision, but hurt recall and lowered the overall f1-score. Surprisingly, the same thing happened when we removed "fat per 100g", even though the weak clustering around fat had suggested that it was not a valuable attribute for predicting dark chocolate. Adding "proteins per 100g" and "carbohydrates per 100g" each improved our prediction; together they raised the f1-score to 0.77 (precision: ~80%, recall: ~75%).
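Comparing attribute combinations like this can be sketched with scikit-learn. This is not our actual training script: the column names are assumptions modelled on the attribute names above, the data is randomly generated stand-in data, and the helpers `make_synthetic_data` and `evaluate` are hypothetical. The scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Assumed column names; the real database columns may be named differently.
COLUMNS = ["fat_100g", "sugars_100g", "salt_100g", "energy_100g",
           "proteins_100g", "carbohydrates_100g"]


def make_synthetic_data(n_items=2000, dark_share=0.1, seed=0):
    """Generate random stand-in data; it does NOT resemble the real database."""
    rng = np.random.default_rng(seed)
    is_dark = rng.random(n_items) < dark_share
    # Other foods: broad uniform values. Dark chocolate: a tighter normal cluster.
    X = rng.uniform(0.0, 1.0, size=(n_items, len(COLUMNS)))
    X[is_dark] = rng.normal(0.7, 0.15, size=(int(is_dark.sum()), len(COLUMNS)))
    return X, is_dark.astype(int)


def evaluate(X, y, attributes, seed=0):
    """Train a random forest on the given attributes; return (precision, recall, f1)."""
    idx = [COLUMNS.index(name) for name in attributes]
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, idx], y, test_size=0.3, stratify=y, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, model.predict(X_test), average="binary", zero_division=0)
    return precision, recall, f1


X, y = make_synthetic_data()
attribute_sets = {
    "base": ["fat_100g", "sugars_100g", "salt_100g", "energy_100g"],
    "base without fat": ["sugars_100g", "salt_100g", "energy_100g"],
    "base + proteins + carbohydrates": COLUMNS,
}
results = {name: evaluate(X, y, attrs) for name, attrs in attribute_sets.items()}
for name, (p, r, f1) in results.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Fixing `random_state` for both the split and the forest keeps the comparison between attribute sets repeatable, so that differences in f1-score come from the attributes and not from a different random split.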
We could also look into using attributes that are not numerical, like allergens.
2 - Future Work
- Predict other chocolate subtypes
- Assemble subtype predictions into one big chocolate predictor
- Play around with more attributes to improve prediction quality