Week 01 (W46 Nov16) World Food Facts - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
0 - Summary
During this week, our main focus was to understand the dataset in depth. The most obvious finding was the large number of missing values. We replaced them with a placeholder so that we could continue our study of the data. We also looked for other datasets that could fill in the missing values and found some, though a detailed investigation is pending.
Important fields, such as nutritional information like lactose per 100g, are empty for many records. We used R to clean up the dataset, e.g. by replacing missing values with an "N/A" value, and created some visualizations to get an impression of the data. Our team used R, Python and Metabase to produce the visualizations.
1 - Weekly Work
1.1 - Dataset organization & description
1.1.1 - Dataset summary
Open Food Facts is a database of food products that everyone can use and contribute to. It contains information about the origin of food items, their allergens, ingredients, nutritional content etc. It is developed by a non-profit association of volunteers. This database can help us plan our diets according to what our bodies need.
1.1.2 - Dataset description
- Size: 267 MB
- Attributes: 159
- Format: csv
- Entry-count: 106458
- Missing Values: 12818231 cells
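To put the missing-value count in perspective, it can be related to the total number of cells (entries times attributes). A quick check in R, using only the figures listed above (the variable names are ours):

```r
# Figures taken from the dataset description above
entries       <- 106458
attributes    <- 159
missing_cells <- 12818231

# Total number of cells in the table
total_cells <- entries * attributes

# Share of cells that are missing, in percent
missing_share <- 100 * missing_cells / total_cells
print(round(missing_share, 1))
```

Roughly three quarters of all cells are therefore empty.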
1.1.3 - Finding NA (Missing Values) in Dataset
- R Code:
  # find NAs in set
  which(is.na(food))
- Output:
  As can be seen, the numbers printed in the console are the indices of the cells in the dataset that contain missing values.
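Linear cell indices are hard to read for a table with 159 attributes. A minimal sketch of some alternative summaries follows, assuming the data frame is called `food` as in the code above. The toy data and its column names are made up for illustration and are not taken from the real file.

```r
# Toy stand-in for the real data frame (hypothetical columns and values)
food <- data.frame(
  product_name = c("Muesli", NA, "Cola"),
  sugars_100g  = c(12.5, NA, 10.6),
  lactose_100g = c(NA_real_, NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

# Total number of missing cells
total_na <- sum(is.na(food))

# Missing cells per column, most affected columns first
na_per_column <- sort(colSums(is.na(food)), decreasing = TRUE)

# Row/column positions of the missing cells instead of linear indices
na_positions <- which(is.na(food), arr.ind = TRUE)

print(total_na)
print(na_per_column)
```

The per-column counts make it easy to see which attributes are mostly empty.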
1.1.4 - Replace NA (Missing Values) with "N/A" value
- R Code:
  # replace NA values with N/A
  food[is.na(food)] <- "N/A"
- Output:
- The dataset after replacement:
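One caveat with a blanket replacement: writing the string "N/A" into a numeric column turns the whole column into text, which gets in the way of later calculations. The sketch below illustrates this on made-up data and shows one possible alternative, replacing only in text columns. It is an assumption on our side, not necessarily what was done for the plots.

```r
# Toy stand-in for the real data frame (hypothetical columns and values)
food <- data.frame(
  brands      = c("A", NA, "B"),
  sugars_100g = c(12.5, NA, 10.6),
  stringsAsFactors = FALSE
)

# Blanket replacement: the numeric column is coerced to character
food_all <- food
food_all[is.na(food_all)] <- "N/A"
print(class(food_all$sugars_100g))

# Alternative: replace only in character columns, keep NA in numeric ones
food_text <- food
is_text <- vapply(food_text, is.character, logical(1))
food_text[is_text] <- lapply(food_text[is_text], function(col) {
  col[is.na(col)] <- "N/A"
  col
})
print(class(food_text$sugars_100g))
```

Functions such as `mean()` can then skip the remaining numeric gaps with `na.rm = TRUE`.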
1.2 - Visualizations
Nutritional score by brand
The average nutritional scores of food products are fairly similar across brands (score as published by the UK Food Standards Agency).
Countries that the food items are from
Allergens in the food
Allergens are listed in different languages, so there is some cleanup to do.
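One way the allergen cleanup could look is a lookup table that maps spellings in different languages to a single English term. This is only a sketch: the table `allergen_map` and the helper `normalize_allergens` are hypothetical, and we assume the allergens of a product arrive as one comma-separated string.

```r
# Hypothetical lookup table; the real data would need a much longer one
allergen_map <- c(
  lait = "milk", milch = "milk", milk = "milk",
  gluten = "gluten",
  oeufs = "eggs", eier = "eggs", eggs = "eggs"
)

normalize_allergens <- function(x) {
  # Split the comma-separated field, lower-case and trim each entry
  parts <- trimws(tolower(strsplit(x, ",", fixed = TRUE)[[1]]))
  parts <- parts[nzchar(parts)]
  # Translate known spellings
  mapped <- unname(allergen_map[parts])
  # Keep the original spelling when no mapping is known
  unknown <- is.na(mapped)
  mapped[unknown] <- parts[unknown]
  sort(unique(mapped))
}

print(normalize_allergens("Lait, Gluten, milch"))
```

Entries that are not in the table pass through unchanged, so they can be reviewed and added later.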
Nutritional score by country
The most nutritious and the least nutritious items are both from France. More items in the dataset are from France than from any other country. Country information is sparser than brand information, probably because it is often hard to tell which country a food item is from.
Energy in kJ per 100g in food
Most foods contain around 250 kJ of energy per 100 g, but some outliers reach around 4000 kJ.
2 - Difficulties
2.1 - Technical Difficulties
Visualizations were created with plot.ly and Metabase. Since the original dataset is large, we worked on a smaller dataset to speed up the visualization process.
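A smaller working set can be obtained in several ways. The sketch below assumes a simple random sample of rows; the sample size and the toy data are hypothetical.

```r
# Fixed seed so that the same subset is drawn on every run
set.seed(42)

# Toy stand-in for the full dataset (hypothetical columns and values)
food <- data.frame(
  id          = 1:1000,
  energy_100g = runif(1000, min = 0, max = 4000)
)

# Draw a random subset of rows without replacement
sample_size <- 100  # hypothetical size, to be tuned to the machine
food_small  <- food[sample(nrow(food), sample_size), , drop = FALSE]

print(dim(food_small))
```

Random sampling keeps the overall distributions roughly intact, which matters when the plots are meant to give an impression of the full data.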
We used R for visualizing the original dataset. After initial hiccups such as frozen screens and slow system response, we were able to produce visualizations such as word clouds, bar charts and scatter plots.
2.2 - Difficulties with the Data
Key nutritional information is missing for many entries. Since nutritional information is among the most interesting attributes of the food items, we may need to find additional sources of data for optimal results.
That being said, we're confident that we can find interesting results regardless of whether we can locate additional information as we have a decent amount of complete data.
3 - Ideas
Related Variables:
- match sugar / salt / energy content with country or brand
- match sugar / salt / energy / allergens content with creation date (assuming that recent data - on average - describes more recent food)
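As a first step towards relating a nutrient to country or brand, one could compare group averages. A minimal sketch on made-up data follows; the column names `countries` and `sugars_100g` are assumptions about the file.

```r
# Toy stand-in for the real data frame (hypothetical columns and values)
food <- data.frame(
  countries   = c("France", "France", "Germany", "Germany", NA),
  sugars_100g = c(10, 20, 5, NA, 7),
  stringsAsFactors = FALSE
)

# Mean sugar content per country; the formula interface drops rows
# with a missing country or a missing sugar value by default
mean_sugar <- aggregate(sugars_100g ~ countries, data = food, FUN = mean)

print(mean_sugar)
```

The same call works for brands or for other nutrients by swapping the column names.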
Clustering:
- Cluster foods around nutritional data
- Cluster brands around nutritional data of their foods
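The clustering idea could be tried with k-means on standardized nutritional columns. This is only a sketch under assumptions: the choice of k-means, the number of clusters and the toy data are ours, and only complete rows are used.

```r
set.seed(1)  # k-means starts from random centers

# Toy stand-in: three sugary and three fatty/salty products (made-up values)
food <- data.frame(
  sugars_100g = c(60, 55, 65, 2, 1, 3),
  fat_100g    = c(1, 2, 0, 30, 35, 28),
  salt_100g   = c(0.1, 0.2, 0.1, 1.5, 1.8, 1.2)
)

# k-means cannot handle missing values, so keep complete rows only
complete <- food[complete.cases(food), ]

# Standardize so that no nutrient dominates because of its scale
scaled <- scale(complete)

# Two clusters, several random starts for a more stable result
fit <- kmeans(scaled, centers = 2, nstart = 10)
complete$cluster <- fit$cluster

print(complete)
```

To cluster brands instead of foods, the nutritional columns would first be averaged per brand and the same steps applied to those averages.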
Possible Prediction Tasks:
- Predict the nutritional values of the next food product released by a brand
Other:
- Classify food as candy and other types based on nutritional (and possibly brand) information
4 - Resources
- This week's presentation can be found here
- Other datasets that can be investigated: