Week 03 (W48 Nov30) World Food Facts - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
0 - Summary
- Gathered foods classified as "sweets" or "candy", inspected those sweets & compared them to the rest of our dataset
- Important attributes:
- countries
- fat_100g
- allergens
- sugar_100g
- carbohydrates_100g
- energy_100g
- serving_size
- additives
- packaging
- sodium_100g
We chose these attributes as important because they let us derive information such as which countries sell foods with a high fat content per 100g and which allergens foods contain. They also let us visualize relationships between attributes, for example sugar_100g and energy_100g, so that we can check statements such as whether foods with more sugar also contain more energy.
- Not important attributes:
- code
- url
- creator
- nutrition_grade_uk
- ingredients_from_palm_oil
- X.butyric.acid_100g
- X.sucrose_100g
We consider these attributes not important for two reasons. For code, url and creator, we could not find any useful information in them (for now). For ingredients_from_palm_oil, nutrition_grade_uk, X.butyric.acid_100g and X.sucrose_100g, we found 106340 missing cells, which means every cell in those attributes is missing.
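The missing-value check described above can be sketched as follows. This is only a minimal illustration in plain Python, not the code we actually ran: it assumes each row is a dict (for example from `csv.DictReader`) and that a missing cell is `None` or an empty string. The helper names `count_missing` and `fully_missing_columns` and the sample rows are made up for the example.

```python
def count_missing(rows, columns):
    """Return {column: number of rows in which the value is missing}."""
    missing = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            # Assumption: a cell counts as missing if it is None or blank.
            if value is None or str(value).strip() == "":
                missing[column] += 1
    return missing


def fully_missing_columns(rows, columns):
    """Return the columns in which every single cell is missing."""
    counts = count_missing(rows, columns)
    return [column for column in columns if counts[column] == len(rows)]


# Made-up sample rows, only to show the shape of the data.
sample_rows = [
    {"code": "1", "sugar_100g": "12.5", "nutrition_grade_uk": ""},
    {"code": "2", "sugar_100g": "", "nutrition_grade_uk": ""},
    {"code": "3", "sugar_100g": "40", "nutrition_grade_uk": None},
]
sample_columns = ["code", "sugar_100g", "nutrition_grade_uk"]

missing_counts = count_missing(sample_rows, sample_columns)
empty_columns = fully_missing_columns(sample_rows, sample_columns)
```

A column that ends up in `empty_columns` carries no information at all and is a natural candidate for the "not important" list.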
1 - Weekly Work
1.1 Gathering and inspecting food classified as "sweets"
The dataset contains a column called "categories_tags". Some of the tags can be used to classify food as "sweets", such as "en:desserts", "en:chocolates" or "en:pies". Using these tags we managed to find 661 sweet foods. Considering there are 106458 foods in the dataset, it stands to reason that many sweets are not tagged as such and need to be found by other means.
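A rough sketch of the tag-based selection, under the assumption that "categories_tags" holds a comma-separated string of tags. The tag set below contains only the three examples named above, not the full list we used, and the function names and sample rows are hypothetical.

```python
# Only the three example tags from the text; the real list was longer.
SWEET_TAGS = {"en:desserts", "en:chocolates", "en:pies"}


def is_sweet(categories_tags, sweet_tags=SWEET_TAGS):
    """True if at least one tag of the item is in the sweet tag set."""
    if not categories_tags:
        return False
    # Assumption: tags are separated by commas.
    tags = {tag.strip().lower() for tag in categories_tags.split(",")}
    return not tags.isdisjoint(sweet_tags)


def split_sweets(rows):
    """Split rows into (tagged sweets, all other items)."""
    sweets, others = [], []
    for row in rows:
        if is_sweet(row.get("categories_tags")):
            sweets.append(row)
        else:
            others.append(row)
    return sweets, others


# Made-up rows; the last one shows an untagged sweet that this misses.
example_rows = [
    {"product_name": "Milk chocolate bar", "categories_tags": "en:snacks,en:chocolates"},
    {"product_name": "Apple pie", "categories_tags": "en:pies"},
    {"product_name": "Canned tuna", "categories_tags": "en:seafood"},
    {"product_name": "Gummy bears", "categories_tags": ""},
]
tagged_sweets, other_items = split_sweets(example_rows)
```

Items without a matching tag end up in `other_items` even if they are sweets, which is exactly the gap described above.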
To achieve that, we inspected the sweets we found through tags and compared them to all other items. First we plotted a general comparison: all nutritional values are per 100g, and each bar represents the average value of the respective set. Unsurprisingly, sweets contain more sugar and less salt than other foods. What we did not expect is the higher average nutrition score.
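The per-set averages behind such a comparison could be computed roughly like this. It is a sketch under assumptions: values arrive as strings, unparsable or empty cells are skipped instead of counted as zero, and the helper names and numbers are invented for illustration.

```python
from statistics import mean


def to_float(value):
    """Parse a cell to float; return None for empty or unparsable cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def column_mean(rows, column):
    """Average of a column, ignoring missing values; None if no values."""
    parsed = (to_float(row.get(column)) for row in rows)
    values = [value for value in parsed if value is not None]
    return mean(values) if values else None


def compare_means(sweets, others, columns):
    """Return {column: (mean over sweets, mean over others)}."""
    return {
        column: (column_mean(sweets, column), column_mean(others, column))
        for column in columns
    }


# Invented numbers, only to show the call.
demo_sweets = [
    {"sugar_100g": "50", "sodium_100g": "0.1"},
    {"sugar_100g": "30", "sodium_100g": ""},
]
demo_others = [
    {"sugar_100g": "5", "sodium_100g": "0.5"},
    {"sugar_100g": "NA", "sodium_100g": "0.7"},
]
demo_means = compare_means(demo_sweets, demo_others, ["sugar_100g", "sodium_100g"])
```

Skipping missing cells matters here: treating them as 0 would pull every average down for sparsely filled columns.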
Of course, these averages can be misleading. For example, if cakes and pies contain a large amount of carbohydrates and other sweets contain a low amount, the average will show a medium amount of carbohydrates. If we then look for items with medium amounts of carbohydrates, we will miss both the cakes/pies and the other sweets.
To avoid that mistake, we created histograms to examine the distribution of sweets with regard to their nutritional values. Since there is only a fairly small number of items, they can be examined manually to an extent. We used that to determine what types of candy sit at interesting positions in the graphs.
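To illustrate why a histogram helps with the averaging pitfall described above, here is a small sketch with made-up values that form two groups. The bin width of 10 and the function name `histogram` are assumptions for the example only.

```python
from collections import Counter
from statistics import mean


def histogram(values, bin_width=10):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter()
    for value in values:
        lower_edge = int(value // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Made-up carbohydrate values: one low group and one high group.
carbs = [10, 12, 78, 80, 82]

carbs_mean = mean(carbs)        # lies between the two groups
carbs_hist = histogram(carbs)   # shows the two groups separately
mean_bin = int(carbs_mean // 10) * 10
```

In this toy data the mean falls into a bin that contains no items at all, while the histogram still shows both groups.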
The graphs for sugar and carbohydrates are very similar. We divided the data into three separate sets: the spike in the middle (~45 - 55g), the top and the bottom. The many items in the middle seem to be mostly milk chocolate and other chocolate-based sweets, the bottom ones tend to be cream, fruit-based sweets and dark chocolate, while the top items seem to contain more jelly-based candy. This is the case for both carbohydrates and sugar.
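The three-way split can be sketched as below. The boundaries 45 and 55 come from the approximate range given above; treating them as inclusive for the middle set is an assumption, as are the function name and the sample items.

```python
def split_by_range(rows, column, low=45.0, high=55.0):
    """Split rows into (bottom, middle, top) by the value in `column`.

    Rows with a missing or unparsable value are skipped.
    """
    bottom, middle, top = [], [], []
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue
        if value < low:
            bottom.append(row)
        elif value <= high:
            middle.append(row)
        else:
            top.append(row)
    return bottom, middle, top


# Made-up items for illustration only.
candy = [
    {"product_name": "Dark chocolate", "sugar_100g": "28"},
    {"product_name": "Milk chocolate", "sugar_100g": "51"},
    {"product_name": "Jelly candy", "sugar_100g": "70"},
    {"product_name": "Unknown", "sugar_100g": ""},
]
bottom_set, middle_set, top_set = split_by_range(candy, "sugar_100g")
```

Each of the three lists is small enough to read through by hand, which is how the product types in each set were identified.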
We inspected the differences between the sugar and the carbohydrate histograms. Around 80g there were more items with that level of carbohydrates than of sugars, and around 25g there were more foods with that level of sugars than of carbohydrates. The items with ~80g carbohydrates but different sugar values are jelly-based candy with a sugar level of around 40-55g, while the foods with ~25g sugars but different carbohydrate values are sweets that contain dough, like chocolate cakes, with a carbohydrate level of around 35-40g.
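One way to find such differences is to subtract the two histograms bin by bin. The sketch below assumes a bin width of 5g and uses invented values; `histogram_difference` is a hypothetical helper, not part of our original analysis code.

```python
from collections import Counter


def bin_counts(values, bin_width=5):
    """Count values per bin, keyed by the lower edge of the bin."""
    return Counter(int(value // bin_width) * bin_width for value in values)


def histogram_difference(carbs, sugars, bin_width=5):
    """Per bin: number of carbohydrate values minus number of sugar values."""
    carb_bins = bin_counts(carbs, bin_width)
    sugar_bins = bin_counts(sugars, bin_width)
    all_bins = sorted(set(carb_bins) | set(sugar_bins))
    # A Counter returns 0 for bins it has never seen.
    return {b: carb_bins[b] - sugar_bins[b] for b in all_bins}


# Invented values: two jelly-like items, one cake-like item, one chocolate.
carb_values = [80, 82, 38, 50]
sugar_values = [45, 50, 25, 50]
difference = histogram_difference(carb_values, sugar_values)
```

Positive entries mark bins with more items at that carbohydrate level than at that sugar level, negative entries the opposite; the items in those bins are the ones worth inspecting by hand.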
The spike between 30g and 40g of fat contains many items with nuts (hazelnuts, almonds, coconut) and cookies like Oreo and cookie-chocolate. The spike at less than 5g contains jelly-based sweets and mints like TicTacs.
1.2 Making a wordcloud of the 'packaging' attribute
The wordcloud shows that most foods use plastic, carton or glass packaging. However, the values still need some cleaning, in particular translating them to English.
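The word frequencies that feed a wordcloud, including the translation step, could look roughly like this. It assumes the packaging values are comma-separated; the small translation table is a hypothetical starting point and by no means complete.

```python
from collections import Counter

# Hypothetical, incomplete mapping from non-English terms to English.
TRANSLATIONS = {
    "plastique": "plastic",
    "kunststoff": "plastic",
    "verre": "glass",
    "glas": "glass",
}


def packaging_frequencies(values, translations=TRANSLATIONS):
    """Count normalized packaging terms over all items."""
    counts = Counter()
    for value in values:
        if not value:
            continue  # skip missing packaging information
        for term in value.split(","):
            term = term.strip().lower()
            if term:
                counts[translations.get(term, term)] += 1
    return counts


# Made-up packaging values.
packaging_values = ["Plastique, Carton", "plastic", "Verre", None, "glas,plastic"]
frequencies = packaging_frequencies(packaging_values)
most_common_term = frequencies.most_common(1)[0][0]
```

The resulting counts can be handed to any wordcloud tool; merging translated terms first prevents the same material from appearing several times under different names.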
1.3 sugar_100g distribution by country (only rows 1 - 5000)
The graph shows sugar_100g of foods grouped by the country in which they were bought. From that, we tried to find the maximum sugar content of any food for each country and created bar plots, split into developed and developing countries.
1.3.1 Highest sugar per 100g in food in developing countries (only rows 1 - 5000)
The graph shows that, among developing countries, Turkey has the food with the highest sugar content per 100g.
1.3.2 Highest sugar per 100g in food in developed countries (only rows 1 - 5000)
The graph shows that, among developed countries, Australia has the food with the highest sugar content per 100g.
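The per-country maximum and the split into two groups can be sketched as follows. Several things here are assumptions: that "countries" may list several comma-separated countries per item, that only the first 5000 rows are used as in the plots above, and that the set of developed countries is supplied by hand. The country names and values are placeholders, not results.

```python
def max_sugar_by_country(rows, limit=5000):
    """Return {country: highest sugar_100g seen} over the first `limit` rows."""
    best = {}
    for row in rows[:limit]:
        try:
            sugar = float(row["sugar_100g"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable sugar value
        for country in (row.get("countries") or "").split(","):
            country = country.strip()
            if country and sugar > best.get(country, float("-inf")):
                best[country] = sugar
    return best


def group_by_development(maxima, developed):
    """Split the maxima into (developed, developing) dicts."""
    in_developed = {c: v for c, v in maxima.items() if c in developed}
    in_developing = {c: v for c, v in maxima.items() if c not in developed}
    return in_developed, in_developing


# Placeholder data; "A", "B", "C" stand in for real country names.
country_rows = [
    {"countries": "A", "sugar_100g": "95"},
    {"countries": "B", "sugar_100g": "99"},
    {"countries": "A,B", "sugar_100g": "60"},
    {"countries": "C", "sugar_100g": ""},
]
HYPOTHETICAL_DEVELOPED = {"B", "C"}

maxima = max_sugar_by_country(country_rows)
developed_max, developing_max = group_by_development(maxima, HYPOTHETICAL_DEVELOPED)
```

Note that a single extreme product determines the maximum for a country, so these bars say little about the typical sugar content of food sold there.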
2 - Future Work
- Find more candy using the information we gained by inspecting the tagged items, possibly using machine learning
- Do what we did with candy for other types of food like bakery items or meat
Presentation
This week's presentation can be found here