Week 02 (W47 Nov23) World Food Facts - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

0 - Summary

During this week we focused on cleaning up the dataset and gaining a more in-depth understanding of it by looking at relationships between attributes and taking a closer look at outliers. Even though we made good progress on the cleanup, there is still more work to do. We also created a Python script that lets us create wordclouds and bar charts more easily, including the ability to select multiple intervals (e.g. creating a separate wordcloud from the brand names of all food items that contain 0-1000, 1001-2000, and 2001-max kJ of energy per 100g respectively).
We also identified which attributes in the dataset are important for our work and which are not.
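
The interval selection described above could look roughly like the following. This is a minimal sketch and not the team's actual script: the sample rows, the function names and the `brands` field name are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset has many more columns.
ITEMS = [
    {"brands": "Acme", "energy_100g": 250.0},
    {"brands": "Acme", "energy_100g": 1500.0},
    {"brands": "Bolt", "energy_100g": 2600.0},
    {"brands": "Bolt", "energy_100g": 900.0},
    {"brands": "Acme", "energy_100g": None},  # missing energy value
]

def split_by_intervals(items, key, edges):
    """Group items into intervals given ascending upper bounds.

    edges=[1000, 2000] yields three groups: <=1000, 1001-2000 and >2000.
    Items with a missing value for `key` are skipped.
    """
    groups = [[] for _ in range(len(edges) + 1)]
    for item in items:
        value = item.get(key)
        if value is None:
            continue
        # The number of bounds the value exceeds is the index of its interval.
        index = sum(1 for edge in edges if value > edge)
        groups[index].append(item)
    return groups

def brand_counts(items):
    """Count brand names; these counts could feed a wordcloud or bar chart."""
    return Counter(item["brands"] for item in items if item.get("brands"))

groups = split_by_intervals(ITEMS, "energy_100g", [1000, 2000])
counts_per_interval = [brand_counts(group) for group in groups]
```

Each entry of `counts_per_interval` then holds the word frequencies for one energy interval, so one wordcloud per interval can be drawn from it.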

  • Important attributes:

  • countries

  • fat_100g

  • allergens

  • sugar_100g

  • carbohydrates_100g

  • energy_100g

  • serving_size

  • additives

  • packaging

  • sodium_100g

  • Not important attributes:

  • code

  • url

  • creator

  • nutrition_grade_uk

  • ingredients_from_palm_oil

  • X.butyric.acid_100g

  • X.sucrose_100g
    We consider these attributes unimportant because 90 to 100% of their values are missing.
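
One way to check the share of missing values per attribute is sketched below. The rows are made up and the 0.9 threshold simply mirrors the 90% figure mentioned above; the function names are hypothetical.

```python
def missing_share(rows, columns):
    """Return the share of missing values (None or empty string) per column."""
    total = len(rows)
    shares = {}
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        shares[column] = missing / total if total else 0.0
    return shares

def mostly_missing(shares, threshold=0.9):
    """Names of columns whose missing share is at least `threshold`."""
    return sorted(name for name, share in shares.items() if share >= threshold)

# Hypothetical rows: 10 items, only one of which has a sucrose value.
rows = [{"fat_100g": 1.0 * i, "X.sucrose_100g": None} for i in range(10)]
rows[0]["X.sucrose_100g"] = 0.5

shares = missing_share(rows, ["fat_100g", "X.sucrose_100g"])
candidates = mostly_missing(shares)
```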

1. Weekly Work

1.1 - Dataset cleanup

We removed trailing commas from entries where they did not belong. We also normalized the "countries" column. The expected format is "en:" followed by a two-letter country code (e.g. "en:FR"), but most cells contained a plain country name (e.g. Germany, Australia, United Kingdom, UK, US), often written in a language other than English, such as Chinese, German, or Arabic. We replaced these values with the expected format as follows.

  • First, we wrote the "countries" column to a file, leaving out the values that already start with "en:", code:
    sink("countries.txt")
    print(WorldFood$countries[!grepl("^en:", WorldFood$countries)])
    sink()
  1. Then, we looked up the two-letter country code for each remaining value (e.g. DE for Germany)
  2. Finally, we replaced each inappropriate value with the appropriate one, code:
    dataframe$attribute[dataframe$attribute == "France"] <- "en:FR"
  3. Output of the countries column before replacement: Before Replacement
  4. Output of the countries column after replacement:
    After replacement
    We also removed English stopwords and punctuation from the "allergens" attribute.
  5. First, load the libraries NLP (natural language processing), tm (text mining), SnowballC and wordcloud, code:
    library(NLP)
    library(tm)
    library(SnowballC)
    library(wordcloud)
  6. Output wordcloud of "allergens" after cleaning:
    Wordcloud allergens
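
The two cleanup steps above (country normalization and allergen text cleaning) can also be sketched in Python, in the spirit of the plotting script. This is an illustration only, not the R code that was actually run: the lookup table and the stopword list are tiny hypothetical stand-ins, and the function names are made up.

```python
import string

# Hypothetical, deliberately tiny lookup table; a real one would need to cover
# every spelling and language found in the column.
COUNTRY_CODES = {
    "france": "FR",
    "germany": "DE",
    "deutschland": "DE",
}

# Hypothetical stopword list; the tm package ships a much longer English one.
STOPWORDS = {"and", "of", "the", "may", "contain", "contains"}

def normalize_country(value):
    """Map a raw country entry to the "en:XX" form, or None if unknown."""
    text = value.strip().rstrip(",").strip()
    if text.lower().startswith("en:"):
        return text  # already in the expected form
    code = COUNTRY_CODES.get(text.lower())
    return "en:" + code if code else None

def clean_allergens(value):
    """Lowercase, strip punctuation and drop stopwords from an allergens entry."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = value.lower().translate(table).split()
    return [word for word in words if word not in STOPWORDS]
```

Returning `None` for unknown names keeps unmapped entries visible, so they can be collected and added to the lookup table instead of being silently changed.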

The dataset still requires more cleanup. Some attributes contain entries with different units (e.g. "20g", "23 mL", "1 L") that should be converted into one single unit. Some of these entries contain more complex values like "250 ml (Ce produit permet de préparer 3 portions) 50 g soit 750 mL de potage". In addition, some information is written in different languages (e.g. Plastik and plastic or Milk and Lait) or spelled differently; some entries use accents while others don't, like laît vs lait.
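
A possible starting point for the unit and accent problems is sketched below. It is an assumption-laden illustration: only a handful of units are covered, only the first quantity in an entry is read (so complex entries like the soup example lose information), and all names are hypothetical.

```python
import re
import unicodedata

# Conversion factors to a base unit per dimension: grams for mass,
# millilitres for volume. Only a few units are covered in this sketch.
UNIT_FACTORS = {
    "g": ("g", 1.0),
    "kg": ("g", 1000.0),
    "ml": ("ml", 1.0),
    "cl": ("ml", 10.0),
    "l": ("ml", 1000.0),
}

# A number (comma or dot as decimal separator), optional space, then a unit.
QUANTITY = re.compile(r"(\d+(?:[.,]\d+)?)\s*(kg|g|ml|cl|l)\b", re.IGNORECASE)

def parse_serving_size(text):
    """Return (amount, base_unit) for the first quantity in `text`, else None."""
    match = QUANTITY.search(text)
    if match is None:
        return None
    amount = float(match.group(1).replace(",", "."))
    base_unit, factor = UNIT_FACTORS[match.group(2).lower()]
    return amount * factor, base_unit

def strip_accents(text):
    """Remove diacritics so that e.g. "laît" and "lait" compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Converting between millilitres and grams would additionally need a density per product, which this sketch does not attempt.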

1.2 - Relationships between Attributes

Each bar in the following graphs represents the average value of 1/20th of the dataset, ordered from low to high on the x-axis. (So the left-most bar represents the average y-axis value of the 5% of the dataset with the lowest x-axis value)
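
The binning behind these bars can be sketched as follows. This is a minimal illustration with made-up data, not the script used for the graphs; `binned_means` is a hypothetical name.

```python
def binned_means(pairs, bins=20):
    """Sort (x, y) pairs by x, split them into `bins` groups of near-equal
    size, and return the mean y of each group (lowest x first)."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    n = len(ordered)
    means = []
    for i in range(bins):
        start = i * n // bins
        stop = (i + 1) * n // bins
        chunk = ordered[start:stop]
        if chunk:  # groups can be empty when there are fewer rows than bins
            means.append(sum(y for _, y in chunk) / len(chunk))
    return means

# Hypothetical data: x = carbohydrates_100g, y = sugars_100g, with y = x / 2.
pairs = [(float(x), x / 2) for x in range(100)]
bars = binned_means(pairs)
```

With 100 rows and 20 bins, each bar averages 5 rows, so the first bar is the mean y of the 5 rows with the lowest x.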

Unsurprisingly, the amount of sugars in a food correlates with the amount of carbohydrates. sugars by carbohydrates

More of a surprise: the amount of sugars does not correlate with the amount of energy in 100g of a food item. energy by sugars

On the flip side, fat and energy seem to correlate in general. fat by energy

The correlation is especially strong for items with an energy of 0 - 1000 kJ per 100g. fat_by_energy 0-1000

Foods with high sugar content seem to have a small serving size (in grams). This could be because sugary items tend to be small snacks and candy. It could be worth looking into this for classifying food as candy based on the available data. sugar by serving size

On the other hand, energy doesn't seem to correlate much with serving size. energy by serving size
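
The observations in this section are based on reading the bar charts. To put a number on such a relationship, one option is the Pearson correlation coefficient, sketched here with made-up values (the lists below are hypothetical and not taken from the dataset).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, not taken from the dataset.
fat = [0.0, 5.0, 10.0, 20.0, 40.0]
energy = [100.0, 300.0, 480.0, 900.0, 1600.0]
r = pearson(fat, energy)
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one. Rows with missing values would have to be dropped before calling the function.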

1.3 - Relationship between Allergens and Additives

We explored the relationship between food additives and allergens. Some countries, such as Indonesia, Luxembourg, and Saudi Arabia, have a higher mean for additives than for allergens. The reason is that these countries have few entries, one of which has far more additives than allergens. The remaining countries (more than 70%) have a higher mean for allergens than for additives, which might suggest that more additives go along with more allergens in a food item. In total, 66 countries had both a mean additive and a mean allergen value, of which 50 were plotted for better readability.
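
The per-country means could be computed along these lines. This is a sketch under assumptions: it treats "mean additive" and "mean allergen" as the mean number of additives and allergens per product, and the field names `country`, `additives_n` and `allergens_n` are hypothetical.

```python
from collections import defaultdict

def mean_counts_by_country(rows):
    """Return {country: (mean additive count, mean allergen count, n rows)}."""
    totals = defaultdict(lambda: [0, 0, 0])
    for row in rows:
        entry = totals[row["country"]]
        entry[0] += row["additives_n"]
        entry[1] += row["allergens_n"]
        entry[2] += 1
    return {
        country: (additives / n, allergens / n, n)
        for country, (additives, allergens, n) in totals.items()
    }

# Hypothetical rows; the field names are assumptions for this sketch.
rows = [
    {"country": "en:FR", "additives_n": 1, "allergens_n": 3},
    {"country": "en:FR", "additives_n": 1, "allergens_n": 1},
    {"country": "en:ID", "additives_n": 9, "allergens_n": 1},
]
means = mean_counts_by_country(rows)
```

Keeping the row count next to the means makes it easy to spot countries where a single entry dominates the result.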

The main reason for exploring these attributes was to find the relationship between them and to further explore other attributes that contribute to them, such as packaging. Packaging might require food to be preserved using additives, which in turn might introduce allergens.

![Allergens vs Additives](http://server.myspace-shack.com/d22/screenshot from 20161123 07095797840.png)

1.4 - Histogram

frequency fat_100g value appear
The graph shows how frequently each "fat_100g" value appears in the dataset. The most frequent values lie between 0 and 5.
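
The counting behind such a histogram can be sketched as follows. The bin width of 5 is an assumption chosen to match the 0 to 5 range mentioned above, and the values are made up.

```python
from collections import Counter

def histogram(values, width=5.0):
    """Count values per bin [k*width, (k+1)*width), keyed by the bin's lower edge."""
    counts = Counter()
    for value in values:
        if value is None:
            continue  # skip missing entries
        counts[int(value // width) * width] += 1
    return dict(sorted(counts.items()))

# Hypothetical fat_100g values.
fat = [0.0, 0.5, 3.2, 4.9, 5.0, 12.0, None, 31.5]
bins = histogram(fat)
most_common_bin = max(bins, key=bins.get)
```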

1.5 - Outliers

We found outliers in "fat_100g", grouped by the "countries" attribute, as shown in the graph below:
fat_100g outliers based on countries
The graph shows the distribution of "fat_100g" for each country. We used only rows 1 - 5000, because with more rows the graph becomes unreadable: both the x-axis and the fill legend, which identifies the countries, grow too large. The graph is a box plot, which shows the minimum, maximum, median, quartiles, and suspected outliers:

  • Maximum value (excluding suspected outliers): upper horizontal line
  • Minimum value (excluding suspected outliers): lower horizontal line
  • Median: thick horizontal line in the middle of the box
  • Suspected outlier: white circle beyond the upper or lower horizontal line
  • Lower quartile (Q1): median of the lower half of the data (bottom edge of the box)
  • Upper quartile (Q3): median of the upper half of the data (top edge of the box)
  • Interquartile range (IQR): the range from Q1 to Q3 (height of the box)
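
The quantities listed above can be computed as sketched below. This assumes the common 1.5 * IQR rule for flagging suspected outliers and one particular quartile convention (`method="inclusive"`); plotting libraries may compute quartiles slightly differently, so the numbers need not match the graph exactly. The values are made up.

```python
from statistics import quantiles

def box_stats(values, k=1.5):
    """Quartiles, whisker ends and suspected outliers using the k * IQR rule."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    # Whiskers end at the most extreme values that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical fat_100g values for one country.
stats = box_stats([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 50.0])
```

Here the value 50.0 lies above Q3 + 1.5 * IQR and is therefore reported as a suspected outlier, while the upper whisker ends at 7.0.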

2 Plans for next week

2.1 Further clean the dataset

  • Convert serving sizes into one unit (probably grams, since that's what most entries use)
  • Translate multi-language entries, like packaging, to English

2.2 Look into more connections between variables

  • After finishing the remaining cleanup, look for correlations in the cleaned data, e.g. how packaging correlates with serving size

2.3 Others

  • Use GPS data (from a separate dataset) to visualize information by country on a map
  • Try to classify food items by commonalities that aren't an attribute yet (e.g. 'candy', 'meat product' and others)

Presentation Link

The presentation can be found here