Week 06 (W51 Dec21) World Food Facts - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
0 - Summary
Last week we looked into the distribution of the attributes "Sugar" and "Energy" in beverages. This week we used further attributes to calculate the Nutrition Grade of beverages (the formula is given by Nutrition Grade Formula). According to the website, the formula for calculating the nutritional score was sent to the openfoodfact contributor by the team of Professor Hercberg. This formula has been the subject of studies and adaptations for the French market.
To find the Nutrition Grade of a beverage, we need its values for "Sugars per 100g", "Energy per 100g" and "Fruits, Nuts, Vegetables per 100g". We look these values up in the thresholds given on the website to obtain the Nutrition Grade. The exception is beverages that contain "Milk": according to the Nutrition Grade Formula website, their Nutrition Grade has to be calculated with the formula for other (non-beverage) products. For those we therefore also looked into "Saturated Fat per 100g", "Sodium per 100g", "Fibre per 100g" and "Protein per 100g". Except for Milk and Juices, we only used beverages with a quantity of 330ml - 500ml.
We also moved toward being able to make predictions using a random forest with scikit-learn.
1. Weekly Work
1.1 We divided beverages into 5 categories:
- Soda Drinks:
  - Coca Cola
  - Sprite
  - Pepsi
  - Diet Coke
- Morning Drinks:
  - Milk
  - Chocolate Milk
  - Coffee Drink
  - Chocolate Drink
- Healthy Drink:
  - Tea
  - Juice
  - Smoothie
- Protein and Energy Drink:
  - Protein Shake, Body building
  - Power Drink
- Other Drinks:
  - Alcoholic Drink
  - Sirop
  - Coconut Drink
1.2 Formula and Threshold for calculating Nutrition Grade
Point A
Point A is the sum of the points for energy, saturated fat, sugars and sodium.
Point C
Point C is the sum of the points for fruit, vegetables and nuts, for fiber and for protein.
Nutrition Score Calculation (for non beverages)
- If point A < 11, then score = point A - point C
- If point A ≥ 11:
  - If the points for fruit, vegetables and nuts = 5, then score = point A - point C
  - If the points for fruit, vegetables and nuts < 5, then score = point A - (points for fiber + points for fruit, vegetables and nuts)
Note: I do not fully understand the calculation for the case where the points for fruit, vegetables and nuts are < 5. I asked the contributor of the openfoodfact website, but he did not know it well either; he only sent me the link to the API used for the calculation, which does not contain much information (API Link).
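The rule above can be written as a small function. This is only a sketch: the function and parameter names are our own, and the third case is read as "subtract only the fiber points and the fruit, vegetables and nuts points, not the protein points", which is an assumption based on the formula as written on the website.

```python
def nutrition_score(points_a, fvn_points, fiber_points, protein_points):
    """Combine point A and the point C components into a nutrition score.

    fvn_points are the points for fruit, vegetables and nuts.
    This follows the non-beverage rule described above.
    """
    points_c = fvn_points + fiber_points + protein_points
    if points_a < 11 or fvn_points == 5:
        # Full point C is subtracted.
        return points_a - points_c
    # points_a >= 11 and fvn_points < 5:
    # assumed reading: protein points are not subtracted in this case.
    return points_a - (fiber_points + fvn_points)


# Milk example from this page: point A = 2, point C = 0.
milk_score = nutrition_score(points_a=2, fvn_points=0, fiber_points=0, protein_points=0)

# Made-up inputs, only to show the two branches for point A >= 11.
with_protein = nutrition_score(points_a=12, fvn_points=5, fiber_points=3, protein_points=4)
without_protein = nutrition_score(points_a=12, fvn_points=2, fiber_points=3, protein_points=4)

print(milk_score, with_protein, without_protein)
```

With the made-up inputs, the only difference between the two branches is whether the 4 protein points are subtracted.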
Threshold for Beverages
Nutrition Grade
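Looking a value up in a threshold table can be sketched as follows. The cut-off values in `EXAMPLE_CUTOFFS` are placeholders and NOT the values from the website; the helper name `points_for` is our own, and we assume that each threshold has the form "value ≤ cut-off".

```python
from bisect import bisect_left

# Placeholder cut-offs for illustration only, NOT the real thresholds.
# Reading: value <= 10 gives 0 points, <= 20 gives 1 point,
# <= 30 gives 2 points, anything above gives 3 points.
EXAMPLE_CUTOFFS = [10.0, 20.0, 30.0]


def points_for(value, cutoffs):
    """Return the number of cut-offs that `value` strictly exceeds.

    `cutoffs` must be sorted in ascending order.
    """
    return bisect_left(cutoffs, value)


for value in (0.0, 10.0, 10.5, 35.0):
    print(value, "->", points_for(value, EXAMPLE_CUTOFFS), "points")
```

To use it for a real attribute, the placeholder list would be replaced by the cut-offs for that attribute from the threshold table.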
1.3 Visualization of attributes for every categorized beverages
We created bar plots of the Nutrition Facts of the beverages in each category. These values can then be looked up in the thresholds given on the website to obtain the Nutrition Grade of each beverage.
- Soda Drink
The graph shows the values of "Energy per 100g (kJ)", "Sugars per 100g", "Fruit, Veggies and Nuts per 100g", "Fibre per 100g", "Protein per 100g", "Saturated Fat per 100g" and "Sodium per 100g". Because these beverages do not contain Milk, we can calculate the Nutrition Score using the beverages thresholds. For example, Coca-Cola has Energy = 172 kJ, Sugars = 8.76g and Fruit, Veggies, Nuts = 0g. Looking these values up in the thresholds gives Coca-Cola 6 points, which corresponds to Nutrition Grade "D" for beverages. As another example, Diet Coke has Energy = 1 kJ, Sugars = 0g and Fruit, Veggies, Nuts = 0g, which gives Nutrition Grade "B".
- Morning Drink
These products are categorized as "Morning Drink" because their "categories_tags" attribute is "en:beverages,en:breakfasts,en:sugary-snacks,en:hot-b...".
The graph shows that Chocolate drinks (Nesquik, Quick Cao, Grand Arôme (32% cacao), Choco Quick, etc.) have average values of Sugar = 60g, Energy = 600 kJ and Fruit, Veggies, Nuts = 0g. With these values, the beverages thresholds give Nutrition Grade "E".
For Milk, as the website states, we cannot use the beverages formula, so we use the other formula. Milk (Haltbare Vollmilch, Frische Vollmilch, Whole British Milk, H-Vollmilch, etc.) has average values of Energy = 290 kJ, Saturated Fat = 2.5g, Sugar = 6.15g and Sodium = 0.05g, from which we assume that point A is 2, so point A < 11. Next we need point C. The average values are Fruit, Veggies, Nuts = 0g, Fibre = 0.5g and Protein = 3.5g, from which we assume that point C is 0. The Nutrition Score is therefore point A - point C = 2 - 0 = 2. In the beverages thresholds, a score of 2 corresponds to Nutrition Grade "C". In the dataset, the Nutrition Grade for Milk is either "B" or "C".
- Healthy Drink
The graph shows that Juices (Pur Jus d'Ananas, Jus d'orange, Le Jus de nos Régions - Multifruits, etc.) have average values of Sugar = 12.75g, Energy = 200 kJ and Fruit, Veggies, Nuts = 100g. From these values we assume that the Nutrition Grade of these juices is "D". In the dataset, the Nutrition Grade for juices is either "C" or "D". This is odd, because according to this formula Diet Coke is healthier than juice.
- Protein and Energy Drink
Protein drinks are categorized as beverages whose "categories_tags" attribute is "en:dietary-supplements,en:bodybuilding-supplements". Power Drinks include Energy Dark Dog, Red Bull, Rockstar Energy Drink, etc.
- Other Drink
1.4 Predicting whether an item is chocolate
We used the sugar, fat, salt and energy content per 100g to predict whether an item is chocolate. For this we excluded all rows that contained a NaN value in one of the relevant fields and created a random forest using scikit-learn. The forest was fit on a training set created with the train_test_split function and tested with the cross_val_score function provided by scikit-learn. This led to really high accuracy scores of about 97% - 98%.
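The steps described above could look roughly like the sketch below. It is not our actual code: the data is synthetic, the column names (`sugars_100g`, `fat_100g`, `salt_100g`, `energy_100g`, `is_chocolate`) and the labelling rule are assumptions made up for the example, and the parameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Assumed feature column names.
FEATURES = ["sugars_100g", "fat_100g", "salt_100g", "energy_100g"]

# Synthetic stand-in for the food data, with a made-up "chocolate" rule.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 100, size=(300, 4)), columns=FEATURES)
df["is_chocolate"] = (df["sugars_100g"] > 45) & (df["fat_100g"] > 25)
df.loc[::25, "fat_100g"] = np.nan  # simulate missing values

# Drop every row with a NaN in one of the relevant fields.
clean = df.dropna(subset=FEATURES)
X, y = clean[FEATURES], clean["is_chocolate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Accuracy of the fitted forest on the held-out test set.
test_accuracy = forest.score(X_test, y_test)

# Cross-validated accuracy on the training set.
cv_scores = cross_val_score(forest, X_train, y_train, cv=5)

print("test accuracy:", test_accuracy)
print("cross-validation accuracy:", cv_scores.mean())
```

Note that `cross_val_score` fits fresh copies of the estimator on each fold, so it does not evaluate the forest that was fit beforehand; `forest.score` on the test set does that.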
These scores seemed unrealistic so we used the same forest to predict chocolates in the subset of foods that are tagged as candy and then counted the items with the word 'Chocolate' in their categories field. 60 items were predicted to be chocolate whereas 294 items had 'Chocolate' in their categories. All of the items that were predicted to be chocolate also had 'Chocolate' in their categories, so they are most likely true positives. That still means that the forest does poorly at identifying chocolate, which implies that the high accuracy scores come from correctly classifying non-chocolates as non-chocolates.
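The counts above can be turned into precision and recall. The sketch assumes that having 'Chocolate' in the categories field is the ground truth, which is only an approximation.

```python
# Counts from the candy subset described above.
predicted_chocolate = 60   # items the forest predicted as chocolate
true_positives = 60        # all of them had 'Chocolate' in their categories
tagged_chocolate = 294     # items with 'Chocolate' in their categories

precision = true_positives / predicted_chocolate
recall = true_positives / tagged_chocolate
missed = tagged_chocolate - true_positives

print(f"precision = {precision:.3f}")  # 1.000
print(f"recall    = {recall:.3f}")     # 0.204
print(f"missed chocolates = {missed}") # 234
```

So the forest is right whenever it says "chocolate", but it finds only about one in five of the tagged chocolates.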
To improve the ability of our forest to identify chocolate we might need to specialize more and predict sub-types of chocolate (like dark/white/milk chocolate) and/or use more fields for the predictions than just fat, sugar, energy and salt contents. If we use our test set to optimize our forest this way we run the risk of overfitting, which is why we need a validation set in addition to the training and test sets.
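One possible way to get a training, validation and test set is to call train_test_split twice. This is a sketch with dummy data; the 60/20/20 proportions and the variable names are arbitrary choices for the example.

```python
from sklearn.model_selection import train_test_split

# Dummy items and labels standing in for the real feature rows.
items = list(range(100))
labels = [i % 2 for i in items]

# First split off the test set (20% of all items).
rest_X, test_X, rest_y, test_y = train_test_split(
    items, labels, test_size=0.2, random_state=0, stratify=labels
)

# Then split the remainder into training and validation sets.
# 25% of the remaining 80% is 20% of all items.
train_X, val_X, train_y, val_y = train_test_split(
    rest_X, rest_y, test_size=0.25, random_state=0, stratify=rest_y
)

print(len(train_X), len(val_X), len(test_X))
```

The validation set is then used for tuning the forest, and the test set is only touched once at the end.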
2. Future Work
- Play around with our machine learning code: Try to specialize more (e.g. predict dark chocolate instead of chocolate), try to predict different attributes, etc.
- Use a training, validation and test set instead of only training and test