Week 04 05 (W49 W50 Dec7 Dec14) World Food Facts - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

0 - Summary

We looked into ratios between column entries to find values that are unique to candy, in the hope of finding more candy, and looked into specific types of candy to make them easier to find. We also examined the distribution of sugar and energy in beverages, categorized the beverages, and visualized them by sugar content.

1. Weekly Work

1.1 - Fat to Sugar ratio in Sweets and other foods

We created scatter plots showing the distribution of the foods we identified as sweets last week on a sugar-fat graph, hoping to use this information to find more sweets. Sadly, the distribution shows no obvious patterns we could use and is very similar to the distribution of other foods.

Candy-sugar-fat-ratio Others-sugar-fat-ratio

We tried to find candy-unique patterns by adding more information and looked at the distribution of fat/sugar by salt. (Fun facts: the candy items with the highest fat/sugar ratio are dark chocolates, and the candy item with the highest salt content is peanut butter.)

sugar/fat salts candy sugar/fat salts non-candy

This provides a little more information: if a food item contains a lot of salt, it is most likely not candy. So while this lets us exclude some items, it doesn't allow us to actually find candy items. Looking at the small area of the other-foods graph where most candies would fall, we see about 1000 items with "chocolate" in their name. This is relevant, as our tagged candies contain only 661 entries. On the other hand, there are about 5000 items tagged as plant-based beverages, so clearly not all (and not even the majority of) items in this area are candy.
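
The ratio and the salt-based exclusion described above can be sketched as follows. This is a minimal illustration, not our actual script: the column names `fat_100g`, `sugars_100g` and `salt_100g` are assumed, the cut-off `SALT_CUTOFF_G` is a hypothetical value, and the two rows are made-up numbers.

```python
from typing import Optional

# Hypothetical cut-off in g of salt per 100 g; not a value from our analysis.
SALT_CUTOFF_G = 1.5


def fat_sugar_ratio(fat_g: Optional[float], sugar_g: Optional[float]) -> Optional[float]:
    """Return fat divided by sugar, or None when a value is missing or sugar is zero."""
    if fat_g is None or sugar_g is None or sugar_g <= 0:
        return None
    return fat_g / sugar_g


def can_rule_out_as_candy(salt_g: Optional[float], cutoff: float = SALT_CUTOFF_G) -> bool:
    """High salt lets us exclude an item; low or missing salt tells us nothing."""
    return salt_g is not None and salt_g > cutoff


# Toy rows with made-up numbers, for illustration only.
items = [
    {"name": "dark chocolate", "fat_100g": 40.0, "sugars_100g": 30.0, "salt_100g": 0.02},
    {"name": "salted crackers", "fat_100g": 20.0, "sugars_100g": 5.0, "salt_100g": 2.5},
]
for item in items:
    ratio = fat_sugar_ratio(item["fat_100g"], item["sugars_100g"])
    excluded = can_rule_out_as_candy(item["salt_100g"])
    print(item["name"], ratio, excluded)
```

Note that the check only works in one direction: an item that passes it is not necessarily candy, which matches what we saw in the plots.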

Now we tried the same thing with fat/sugar to energy in J:

fat/sugar-energy non.candy (The outlier value is real)

The same data on a more usable scale:

non_candy_fat/sugar to energy limited candy fat/sugar to energy

Sadly the energy levels are very similar, so there's little information to be gained from this.

So next we tried to isolate types of sweets like pudding and jelly to find patterns we could use to detect more pudding or jelly specifically.

Candy-sugar-fat-images

We then looked at the same locations in the non-candy graph, using a Python script that counts occurrences of words in the foods' names and tags separately. In the set of all foods, chocolate is prevalent in the 30-45 fat and 40-60 sugar range. We also found a fair amount of jelly-based food and bonbons in the region where jelly appears in the candy graph. Sadly, in the region where pudding appears in the candy graph, the other-foods graph contains mostly drinks, yogurts, and fruits.
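
A word-counting script of the kind mentioned above could look roughly like this. It is a sketch under assumptions: the field names `product_name`, `categories_tags`, `fat_100g` and `sugars_100g` are assumed, tags are assumed to carry a language prefix such as `en:`, and the sample products are made up.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, including accented ones


def count_words(products, fat_range=(30, 45), sugar_range=(40, 60)):
    """Count words in names and in tags, separately, for products inside the window."""
    name_counts, tag_counts = Counter(), Counter()
    for p in products:
        fat, sugar = p.get("fat_100g"), p.get("sugars_100g")
        if fat is None or sugar is None:
            continue  # skip rows without nutrition values
        in_window = (fat_range[0] <= fat <= fat_range[1]
                     and sugar_range[0] <= sugar <= sugar_range[1])
        if not in_window:
            continue
        name_counts.update(WORD.findall((p.get("product_name") or "").lower()))
        for tag in p.get("categories_tags") or []:
            # Drop an assumed language prefix such as "en:" before counting.
            tag_counts.update(WORD.findall(tag.split(":")[-1].lower()))
    return name_counts, tag_counts


# Made-up sample rows, for illustration only.
sample = [
    {"product_name": "Milk Chocolate Bar", "fat_100g": 35, "sugars_100g": 50,
     "categories_tags": ["en:chocolates"]},
    {"product_name": "Dark chocolate", "fat_100g": 40, "sugars_100g": 45,
     "categories_tags": ["en:chocolates", "en:dark-chocolates"]},
    {"product_name": "Orange juice", "fat_100g": 0, "sugars_100g": 9,
     "categories_tags": ["en:juices"]},
]
names, tags = count_words(sample)
print(names.most_common(3))
print(tags.most_common(3))
```

Keeping the two counters separate makes it possible to see when a word is common in product names but rare in tags, which hints at untagged items.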

We wanted to use scikit-learn to find new candy entries with machine learning, but haven't quite managed to make it work just yet so this is something we'll look into more next week.
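
As a starting point for next week, a scikit-learn classifier could be set up roughly like this. This is only a sketch of the API usage, not a working predictor: the choice of `DecisionTreeClassifier`, the three features (fat, sugar, salt per 100 g) and all training rows are assumptions, and the rows are made-up values rather than dataset entries.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training rows: [fat, sugar, salt] per 100 g. Made-up values.
X_train = [
    [35.0, 50.0, 0.10],  # chocolate-like
    [0.5, 60.0, 0.05],   # jelly-like
    [30.0, 45.0, 0.20],  # chocolate-like
    [20.0, 2.0, 2.00],   # salty snack
    [0.0, 9.0, 0.00],    # juice
    [3.5, 5.0, 0.10],    # milk
]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = tagged as candy, 0 = not candy

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)


def predict_candy(fat, sugar, salt):
    """Return True if the toy model labels the item as candy."""
    return bool(clf.predict([[fat, sugar, salt]])[0])


print(predict_candy(32.0, 48.0, 0.1))
print(predict_candy(0.0, 10.0, 0.0))
```

With real data, the tagged candies would supply the positive class, and a held-out split would be needed before trusting any prediction on untagged items.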

We also created scatter plots of energy against sugar per 100 g for beverages.

sugar vs energy in beverages

The scatter plot shows that most beverages in the dataset contain around 3 g to 7 g of sugar per 100 g and provide 100 to 200 kJ of energy. Oddly, some beverages contain little sugar (around 10 g) but have very high energy (around 3500 kJ), and one drink has a high amount of sugar (around 95 g) but very low energy (around 50 kJ).
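
One way to judge such odd entries is a plausibility check based on the standard energy conversion factors used in nutrition labelling: about 17 kJ per gram of carbohydrate and 37 kJ per gram of fat. The sketch below is an assumption on our part, not something we ran on the dataset. The function name and the 10% tolerance are hypothetical.

```python
KJ_PER_G_SUGAR = 17.0  # standard conversion factor for carbohydrates
KJ_PER_G_FAT = 37.0    # standard conversion factor for fat, the densest nutrient


def energy_flags(sugar_g, energy_kj, tolerance=0.1):
    """Flag per-100 g values where energy and sugar cannot both be right."""
    flags = []
    # The sugar alone already contributes this much energy.
    min_energy = sugar_g * KJ_PER_G_SUGAR
    # Upper bound: everything that is not sugar is pure fat.
    max_energy = min_energy + (100.0 - sugar_g) * KJ_PER_G_FAT
    if energy_kj < min_energy * (1 - tolerance):
        flags.append("energy below sugar contribution")
    if energy_kj > max_energy * (1 + tolerance):
        flags.append("energy above physical maximum")
    return flags


print(energy_flags(95, 50))    # the high-sugar, low-energy drink
print(energy_flags(10, 3500))  # the low-sugar, high-energy drinks
print(energy_flags(5, 150))    # a typical beverage
```

Under these factors, 95 g of sugar alone accounts for about 1615 kJ, so an energy of 50 kJ looks like a data entry error. A value of 3500 kJ at 10 g of sugar sits right at the upper bound, which would mean the rest of the product is almost pure fat, so it is not flagged but still deserves a manual look.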

We also computed the average energy of beverages, grouped by sugar content.

average of energy in beverages based on sugar it contained

The bar plot shows the average energy of beverages grouped by sugar content. We split the beverages into those containing at most 50 g of sugar and those containing more than 50 g. Beverages with more than 50 g of sugar have more energy on average, around 1352.576 kJ, while beverages with at most 50 g of sugar average around 343.9722 kJ.
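
The split and the averages can be reproduced with a few lines. This is a minimal sketch: the field names `sugars_100g` and `energy_100g` are assumed, and the sample rows are made up, so the output does not match the numbers above.

```python
from statistics import mean


def mean_energy_by_sugar_group(beverages, threshold_g=50.0):
    """Return (mean energy, count) for beverages at or below and above the threshold."""
    low = [b["energy_100g"] for b in beverages if b["sugars_100g"] <= threshold_g]
    high = [b["energy_100g"] for b in beverages if b["sugars_100g"] > threshold_g]
    return {
        "at most threshold": (mean(low) if low else None, len(low)),
        "above threshold": (mean(high) if high else None, len(high)),
    }


# Made-up sample rows, for illustration only.
sample_beverages = [
    {"sugars_100g": 5.0, "energy_100g": 100.0},
    {"sugars_100g": 10.0, "energy_100g": 200.0},
    {"sugars_100g": 60.0, "energy_100g": 1000.0},
]
print(mean_energy_by_sugar_group(sample_beverages))
```

Returning the group sizes alongside the means helps when one group is much smaller than the other, since a handful of entries can dominate its average.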

Next, we will split the beverages into categories (soda, juices, protein drinks, energy drinks, etc.) to see whether this holds within each category.

The graph shows the average sugar content per beverage category.

sugar in beverages

It shows that syrup and chocolate drinks contain the most sugar. Diet Coke has the lowest sugar content, at 0, followed by alcoholic drinks.
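
Assigning beverages to categories and averaging sugar per category can be sketched as below. The keyword lists are hypothetical examples, not the rules we actually used, and the field names `product_name` and `sugars_100g` are assumed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keyword lists; the first matching category wins.
CATEGORY_KEYWORDS = {
    "soda": ["cola", "soda", "lemonade"],
    "juice": ["juice", "nectar"],
    "syrup": ["syrup", "sirop", "sirup"],
    "energy drink": ["energy"],
}


def categorize(name):
    """Return the first category whose keyword occurs in the name, else 'other'."""
    lowered = name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


def mean_sugar_per_category(beverages):
    """Average sugar per 100 g for each keyword-based category."""
    groups = defaultdict(list)
    for b in beverages:
        groups[categorize(b["product_name"])].append(b["sugars_100g"])
    return {category: mean(values) for category, values in groups.items()}


# Made-up sample rows, for illustration only.
drinks = [
    {"product_name": "Apple juice", "sugars_100g": 10.0},
    {"product_name": "Orange juice", "sugars_100g": 8.0},
    {"product_name": "Cola", "sugars_100g": 11.0},
]
print(mean_sugar_per_category(drinks))
```

Plain substring matching is crude: a diet cola would land in the same group as a regular one, so variants like that need their own keywords placed earlier in the dictionary.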

1.2 - Choropleth Map based on categories of food (Plant & Meat based products)

We planned to group the data by category and keep only the categories with more than 4000 entries. Of these, two prominent categories were also interesting to explore:

  • Plant based products (36%)
  • Meat based products (10%)

We plotted these on a world map, using green for plant-based products and red for meat-based products. To reduce bias and make the results easier to read, we removed France from the plot, so the percentages on the map have been rescaled relative to the actual values, which included France.

newplot.png newplot (1).png
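
The rescaling after removing France can be illustrated as follows. This is a sketch of one possible reading, namely that each country's percentage is its share of all products in a category once France is excluded. The field names `country` and `category` and the sample rows are assumptions.

```python
from collections import Counter


def category_share_by_country(products, category, exclude=("France",)):
    """Percentage of a category's products per country, ignoring excluded countries."""
    counts = Counter(
        p["country"]
        for p in products
        if p["category"] == category and p["country"] not in exclude
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: 100.0 * n / total for country, n in counts.items()}


# Made-up sample rows, for illustration only.
sample_products = [
    {"country": "France", "category": "plant"},
    {"country": "France", "category": "plant"},
    {"country": "Germany", "category": "plant"},
    {"country": "Spain", "category": "plant"},
    {"country": "Spain", "category": "plant"},
    {"country": "Spain", "category": "plant"},
    {"country": "Germany", "category": "meat"},
]
print(category_share_by_country(sample_products, "plant"))
```

Because the total shrinks when a dominant country is dropped, every remaining share grows, which is why the map values differ from the percentages computed on the full dataset.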

After plotting, we focused on Germany for further analysis, since it showed significant data in both cases (20% for plant-based products and 33% for meat-based products). I chose to explore the meat-based products further, since the plant-based data is missing important countries such as India, which should ideally contribute more to it.

1.2.1 German meat products and their analysis

german-meat-product-carbohydrate-analysis.png

german-meat-product-protein-analysis.png

german-meat-product-fat-analysis.png

german-meat-product-sugar-analysis.png

2. Plans for next week

  • Get machine learning with scikit-learn to work and use it to train predictors on candy in general and/or specific types of candy like chocolate to find more entries of that kind and verify the accuracy of tags.
  • Look at nutrition in beverages and possibly decide which beverages are the healthiest.
  • Find correlations between cancer data and our nutrient and packaging data.
  • Use these correlations in machine learning techniques to predict cancer rates in different countries based on their food intake.