Week 09 (Jan18) World Food Facts

0 - Summary

We visualized clustering around attributes we can use to predict white chocolate, milk chocolate and all foods, and updated the dark chocolate visualization to be more concise and include more attributes. We looked into using more attributes for prediction, including non-numeric ones, but the high number of N/A values made this futile. We also tried to visualize clustering with the attributes we use for predicting juice and non-juice beverages.

Next, we built random forests for white chocolate and milk chocolate, which don't perform as well as our predictor for dark chocolate. We might remove the white chocolate predictor as we have very few samples of white chocolate.

Then we used two methods (Naive Bayes and Random Forests) to build models for predicting juice vs. non-juice beverages and compared which method is better suited to this dataset.

1 - Weekly Work

1.1 Visualization of attributes to find clusters for prediction

[Figure: all-foods]

First we created this visualization of all foods as a reference, since clusters are only useful for prediction if they are unique to the item we want to predict.

[Figure: Dark-Chocolate]

We already visualized dark chocolate last week, but used several separate graphs and fewer attributes. Now we have all of them in one graph. We can clearly see that salt and proteins in particular differ greatly in dark chocolate from the distribution across all foods and should therefore be useful for predicting dark chocolate.

[Figure: White-Chocolate]

All attributes differ from the distribution of all foods. In salt and fat, white chocolate is very similar to dark chocolate, whereas it seems to differ in sugar and carbohydrates. This could also be due to the low number of white chocolate samples.

[Figure: Milk-Chocolate]

Milk chocolate seems very similar to dark chocolate, except that there are fewer items with low sugar or carbohydrates. The cluster around fat is also tighter.
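These distribution plots can be reproduced in a few lines of pandas/matplotlib. Below is a minimal sketch, assuming the tab-separated Open Food Facts dump; the file name, the per-100g column names and the tag substring are assumptions, not necessarily the exact ones behind the figures.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed per-100g nutrient columns from the Open Food Facts dump
COLS = ["energy_100g", "fat_100g", "sugars_100g",
        "carbohydrates_100g", "proteins_100g", "salt_100g"]

df = pd.read_csv("world-food-facts.csv", sep="\t", low_memory=False)
# "dark-chocolate" is an assumed tag substring, not a verified one
choc = df[df["categories_tags"].str.contains("dark-chocolate", na=False)]

fig, axes = plt.subplots(1, len(COLS), figsize=(18, 4))
for ax, col in zip(axes, COLS):
    # One panel per attribute: all foods next to dark chocolate only
    ax.boxplot([df[col].dropna(), choc[col].dropna()], labels=["all", "dark"])
    ax.set_title(col)
plt.tight_layout()
plt.show()
```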

1.1.1 Visualization of attributes to find clusters for Juices

[Figure: Juice and Non Juice]

There are a lot of empty rows in the dataset. After removing them and collecting the juice and non-juice entries, we found 1953 juice items in the dataset.
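A hypothetical sketch of this filtering step with pandas; the `juice` tag substring and the column names are assumptions about the dump:

```python
import pandas as pd

# Attributes we visualize for the juice analysis (assumed column names)
ATTRS = ["energy_100g", "sugars_100g", "fruits-vegetables-nuts_100g"]

df = pd.read_csv("world-food-facts.csv", sep="\t", low_memory=False)
df = df.dropna(subset=ATTRS)  # remove the empty rows
is_juice = df["categories_tags"].str.contains("juice", na=False)
print(is_juice.sum(), "juices,", (~is_juice).sum(), "non-juices")
```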

[Figure: Energy to fruit.nut.vegetable]

After extracting the beverage data from the whole dataset, it can be seen that for juice and non-juice beverages the energy vs. fruits.nuts.vegetables attributes do not really show clustering, because many beverages provide little energy.

[Figure: Sugar to fruit.nut.vegetable]

We tried visualizing fruits.nuts.vegetables against another attribute, this time sugar. We can see some clustering around 12.5 g of sugar and between 25 g and 100 g of fruits.nuts.vegetables; most juices lie in that range.
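The scatter plot could be produced along these lines (a sketch with assumed column names, not the exact plotting code):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("world-food-facts.csv", sep="\t", low_memory=False)
df = df.dropna(subset=["sugars_100g", "fruits-vegetables-nuts_100g"])
is_juice = df["categories_tags"].str.contains("juice", na=False)

# Plot non-juices first so the (far fewer) juices stay visible on top
plt.scatter(df.loc[~is_juice, "sugars_100g"],
            df.loc[~is_juice, "fruits-vegetables-nuts_100g"],
            s=5, alpha=0.3, label="non-juice")
plt.scatter(df.loc[is_juice, "sugars_100g"],
            df.loc[is_juice, "fruits-vegetables-nuts_100g"],
            s=5, alpha=0.3, label="juice")
plt.xlabel("sugar per 100g")
plt.ylabel("fruits/vegetables/nuts per 100g")
plt.legend()
plt.show()
```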

1.2 Random Forests for White Chocolate and Milk Chocolate

First we created a random forest to predict white chocolate using energy, fat, sugar, salt, carbohydrates and proteins. The prediction quality was very poor:

  • recall: 0.133
  • precision: 1.0
  • f1_score: 0.235

We tried different combinations of attributes and added new attributes like the nutrition score. We also tried a larger forest (500 trees instead of 100). Sadly, all of these made the prediction quality even worse. We also looked into using non-numeric attributes like the packaging, but the high number of N/A values reduced the sample size too much. This is especially a problem for white chocolate, as we only have 76 samples of it in our database. Due to this low number we are considering dropping white chocolate prediction altogether.
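For reference, here is a minimal scikit-learn sketch of this training and evaluation setup; the label construction via categories_tags, the 60/40 split and the column names are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

ATTRS = ["energy_100g", "fat_100g", "sugars_100g",
         "salt_100g", "carbohydrates_100g", "proteins_100g"]

df = pd.read_csv("world-food-facts.csv", sep="\t", low_memory=False).dropna(subset=ATTRS)
y = df["categories_tags"].str.contains("white-chocolate", na=False)  # assumed tag
X_tr, X_te, y_tr, y_te = train_test_split(df[ATTRS], y, test_size=0.4, random_state=0)

clf = RandomForestClassifier(n_estimators=100)  # we also tried 500 trees
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("recall:   ", recall_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("f1_score: ", f1_score(y_te, pred))
```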

Next we created a random forest to predict milk chocolate. After some optimization we found that we got the best f1-score using sugar, salt, fat, energy and proteins (no carbohydrates) and 500 instead of 100 trees for prediction. Using these parameters we achieved the following results:

  • recall: 0.487
  • precision: 0.661
  • f1_score: 0.561

A recall below 50% is unsatisfying. If we remove energy as a prediction attribute, we can boost recall in exchange for precision at a marginally lower f1-score (see the check after the results):

  • recall: 0.504
  • precision: 0.628
  • f1_score: 0.560
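For reference, the f1-score is the harmonic mean of precision and recall; plugging in the rounded values from both runs shows why they are nearly tied:

```latex
F_1 = \frac{2PR}{P + R}, \qquad
\frac{2 \cdot 0.661 \cdot 0.487}{0.661 + 0.487} \approx 0.56, \qquad
\frac{2 \cdot 0.628 \cdot 0.504}{0.628 + 0.504} \approx 0.56
```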

1.3.1 Random Forests for Juice and Non-Juice using the whole dataset

First we separated the dataset into a train set (60%) and a test set (40%), using the attributes energy, saturated fat, sugar, fibre, protein, fruits_vegetables_nuts and sodium; the data contained 1963 juices and 69234 non-juices. The accuracy looked very good, but precision was low (a sketch of the setup follows the results):

  • Number of Trees = 500
  • Accuracy = 0.985
  • Precision = 0.588
  • Recall = 0.850
  • F1 score = 0.695
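A sketch of this setup, assuming scikit-learn; the hyphenated per-100g column names are guesses at the dump's spelling:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

ATTRS = ["energy_100g", "saturated-fat_100g", "sugars_100g", "fiber_100g",
         "proteins_100g", "fruits-vegetables-nuts_100g", "sodium_100g"]

df = pd.read_csv("world-food-facts.csv", sep="\t", low_memory=False).dropna(subset=ATTRS)
y = df["categories_tags"].str.contains("juice", na=False)  # assumed tag

X_tr, X_te, y_tr, y_te = train_test_split(df[ATTRS], y, train_size=0.6, random_state=0)
pred = RandomForestClassifier(n_estimators=500).fit(X_tr, y_tr).predict(X_te)
for name, score in [("Accuracy", accuracy_score), ("Precision", precision_score),
                    ("Recall", recall_score), ("F1 score", f1_score)]:
    print(name, "=", round(score(y_te, pred), 3))
```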

1.3.2 Random Forests for Juice and Non-Juice using data that contains "Beverage" in categories_tags

Since the accuracy was suspiciously high, we next extracted the data whose categories_tags contain "beverage" and split it into train, test and validation sets with a 70%/15%/15% ratio, using the same attributes. The data contained 1925 juices and 25852 non-juices. We got these values:

  • Number of Trees = 500
  • Accuracy = 0.966
  • Precision = 0.633
  • Recall = 0.863
  • F1 score = 0.730

By restricting the data to products with "beverage" in categories_tags and splitting it into train, test and validation sets, the accuracy decreased a bit, but the precision and F1 score increased.
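The beverage filter and the 70/15/15 split might look like this; two chained train_test_split calls produce the three sets (a sketch, with an assumed tag substring):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("world-food-facts.csv", sep="\t", low_memory=False)
bev = df[df["categories_tags"].str.contains("beverage", na=False)]

train, rest = train_test_split(bev, train_size=0.70, random_state=0)
test, valid = train_test_split(rest, test_size=0.50, random_state=0)  # 15% + 15%
print(len(train), len(test), len(valid))
```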

1.3.3 Naive Bayes for Juice and Non-Juice using data that contains "Beverage" in categories_tags

We used the same attributes as with Random Forests. The data contained 1925 juices and 25852 non-juices, and we got these values:

  • Accuracy = 0.5232
  • Mcnemar's Test P-Value < 2e-16
  • Sensitivity = 0.975
  • Specificity = 0.489
  • Pos Pred Value = 0.124
  • Neg Pred Value = 0.996
  • Balanced Accuracy = 0.732

The accuracy of only slightly more than 50% shows that a Random Forest model is better suited to this dataset. Running Naive Bayes also produced many warnings (50 of them) because of the many NA values in the dataset.
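The metric names above match the output of R's caret confusionMatrix; a rough scikit-learn equivalent with Gaussian Naive Bayes would look like this (a sketch, with the same assumed columns as before):

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

ATTRS = ["energy_100g", "saturated-fat_100g", "sugars_100g", "fiber_100g",
         "proteins_100g", "fruits-vegetables-nuts_100g", "sodium_100g"]

df = pd.read_csv("world-food-facts.csv", sep="\t", low_memory=False)
bev = df[df["categories_tags"].str.contains("beverage", na=False)].dropna(subset=ATTRS)
y = bev["categories_tags"].str.contains("juice", na=False)

X_tr, X_te, y_tr, y_te = train_test_split(bev[ATTRS], y, train_size=0.7, random_state=0)
pred = GaussianNB().fit(X_tr, y_tr).predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("Sensitivity:", tp / (tp + fn))     # recall on the juice class
print("Specificity:", tn / (tn + fp))
print("Pos Pred Value:", tp / (tp + fp))  # precision on the juice class
```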

1.3.4 Random Forests for Plant-based products

We derived a dataset from the full data, consisting of the plant-based products (23443 marked as such in the database) that have energy, fat, sugar, salt, carbohydrates and proteins values. We created training, validation and test sets from it using a 60-20-20 division, trained our forest model on it and got:

  • recall: 0.794
  • precision: 0.862
  • f1_score: 0.826

We tried different combinations of attributes and added new attributes like the nutrition score, fibres and a few others. With the additional attributes, our prediction got even worse. We also looked into using non-numeric attributes like the packaging, but the high number of N/A values reduced the sample size too much.

Then we tested our trained model on the test dataset and got these results:

  • recall: 0.790
  • precision: 0.856
  • f1_score: 0.822

We then took the rows of the original dataset where the category was missing and tried to label them with the help of our model (sketched after the results):

  • Sample: 2315 rows (with non-NaN values for energy, fat, sugar, salt, carbohydrates and proteins)
  • Predicted: 494 rows as plant-based foods.
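A self-contained sketch of this labelling step; the plant-based tag substring and the column names are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

ATTRS = ["energy_100g", "fat_100g", "sugars_100g",
         "salt_100g", "carbohydrates_100g", "proteins_100g"]

df = pd.read_csv("world-food-facts.csv", sep="\t", low_memory=False)
labelled = df.dropna(subset=ATTRS + ["categories_tags"])
y = labelled["categories_tags"].str.contains("plant-based", na=False)  # assumed tag
clf = RandomForestClassifier(n_estimators=500).fit(labelled[ATTRS], y)

# Rows whose category is missing but whose nutrient values are complete
unlabelled = df[df["categories_tags"].isna()].dropna(subset=ATTRS)
pred = clf.predict(unlabelled[ATTRS])
print(len(unlabelled), "candidates,", pred.sum(), "predicted plant-based")
```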

We repeated this with other categories, such as Sugary (10190 products marked as such in the database); with our test dataset we got:

  • recall: 0.875
  • precision: 0.875
  • f1_score: 0.875

Here we observe that the scores increased even though the number of already-labelled items decreased.

We repeated this with Fruits (7128 products marked as such in the database); with our test dataset we got:

  • recall: 0.656
  • precision: 0.752
  • f1_score: 0.701

Here, in contrast, the scores decreased along with the decreasing number of already-labelled items.

2 - Future Work

  • Combining the chocolate sub-type predictors into one chocolate predictor
  • Making a simple GUI

Presentation