Week 01 (W46 Nov16) World Food Facts - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

0 - Summary

This week our main goal was to understand the dataset in depth. The most obvious finding was the large number of missing values. We replaced them with a placeholder so that we could continue exploring the data. We also looked for other datasets that could fill the gaps and found some candidates, though a detailed investigation is still pending.

Important fields, such as nutritional information like lactose per 100g, are empty for many records. We used R to clean up the dataset, e.g. by replacing missing values with an "N/A" value, and created some visualizations to get an impression of the data. The team used R, Python and Metabase to produce these visualizations.

1 - Weekly Work

1.1 - Dataset organization & description

1.1.1 - Dataset summary

Open Food Facts is a database of food products that everyone can use and contribute to. It contains information about the origin of food items, their allergens, ingredients, nutritional content, etc. It is developed by a non-profit association of volunteers. The database can help people plan their diets according to their bodies' needs.

1.1.2 - Dataset description

Size: 267 MB
Attributes: 159
Format: csv
Entry-count: 106458
Missing Values: 12818231 cells

1.1.3 - Finding NA (Missing Values) in Dataset

R Code:
# find NAs in set
which(is.na(food))
Output:

As can be seen, the numbers printed in the console are the indices of the cells in the dataset that contain missing values.
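For readers working in Python rather than R, the same check can be sketched with the standard library. This is only an illustration: the inline `SAMPLE` table and its column names are made up to keep the example self-contained, and treating an empty cell as "missing" is an assumption about how the CSV encodes absent values.

```python
import csv
import io

# Hypothetical miniature of the food table; the real file has 159 columns.
SAMPLE = """product_name,brands,sugars_100g,lactose_100g
Choco Bar,BrandA,45.0,
Plain Yogurt,,4.7,3.2
Mystery Item,BrandB,,
"""

def count_missing(rows):
    """Return (total missing cells, missing count per column)."""
    per_column = {}
    for row in rows:
        for column, value in row.items():
            per_column.setdefault(column, 0)
            # Assumption: a missing value shows up as an empty (or absent) cell.
            if value is None or value.strip() == "":
                per_column[column] += 1
    return sum(per_column.values()), per_column

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
total_missing, missing_by_column = count_missing(rows)
print(total_missing)
print(missing_by_column)
```

Counting per column, rather than listing cell indices, makes it easier to see which attributes are too sparse to be useful.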

1.1.4 - Replace NA (Missing Values) with "N/A" value

R Code:
# replace NA values with N/A
food[is.na(food)] <- "N/A"
Output:
The Dataset after replacement:
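One consequence of this replacement is that numeric columns now contain a string, so later numeric analysis has to map the placeholder back to "no value". A minimal Python sketch of both directions follows; the helper names `fill_missing` and `to_number` and the sample record are hypothetical, not part of our R workflow.

```python
MISSING = "N/A"  # same placeholder string as in the R snippet above

def fill_missing(row, placeholder=MISSING):
    """Return a copy of the record with empty or absent cells replaced."""
    return {
        key: placeholder if value is None or str(value).strip() == "" else value
        for key, value in row.items()
    }

def to_number(cell, placeholder=MISSING):
    """Parse a cell as float, mapping the placeholder back to None."""
    if cell is None or cell == placeholder:
        return None
    return float(cell)

# Hypothetical record with one empty nutritional field.
record = {"product_name": "Plain Yogurt", "sugars_100g": "4.7", "lactose_100g": ""}
filled = fill_missing(record)
print(filled)
print(to_number(filled["sugars_100g"]), to_number(filled["lactose_100g"]))
```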

1.2 - Visualizations

Nutritional score by brand

The average nutritional scores of food products from different brands are fairly similar (the score is published by the UK Food Standards Agency).

Figure: Nutritional score by brand
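The aggregation behind such a chart is a group-wise mean. The sketch below shows the idea in plain Python; the records are made up, and the field names `brands` and `nutrition_score` are assumptions that may not match the dataset's actual column names.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration only; field names are assumed.
products = [
    {"brands": "BrandA", "nutrition_score": 12},
    {"brands": "BrandA", "nutrition_score": 8},
    {"brands": "BrandB", "nutrition_score": 11},
    {"brands": "BrandB", "nutrition_score": None},  # missing score
]

def mean_score_by(records, key, value="nutrition_score"):
    """Average `value` per distinct `key`, skipping records missing either field."""
    groups = defaultdict(list)
    for record in records:
        if record.get(key) and record.get(value) is not None:
            groups[record[key]].append(record[value])
    return {group: mean(scores) for group, scores in groups.items()}

brand_means = mean_score_by(products, "brands")
print(brand_means)
```

Passing a different `key` (for example a country field) would give the per-country averages in the same way.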

Countries that the food items are from

Figure: Country word cloud

Allergens in the food

Allergens are listed in different languages, so some cleanup is needed.

Figure: Allergens word cloud
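One possible form of that cleanup is to map each allergen term to a single English name before counting. The sketch below assumes the allergen field is a comma-separated string; the `CANONICAL` table is hypothetical and deliberately tiny, and a real one would need many more entries.

```python
from collections import Counter

# Hypothetical, deliberately small translation table (French/German/English).
CANONICAL = {
    "lait": "milk", "milch": "milk", "milk": "milk",
    "soja": "soy", "soy": "soy",
    "gluten": "gluten",
}

def normalize_allergens(field):
    """Split a comma-separated allergen string and map terms to English."""
    terms = []
    for raw in field.split(","):
        term = raw.strip().lower()
        if term:
            # Unknown terms pass through unchanged so nothing is silently lost.
            terms.append(CANONICAL.get(term, term))
    return terms

# Made-up allergen fields in mixed languages.
fields = ["Lait, Soja", "milk", "Milch,Gluten", ""]
counts = Counter(term for field in fields for term in normalize_allergens(field))
print(counts.most_common())
```

The resulting counts are what a word cloud would be drawn from, with "Lait", "Milch" and "milk" merged into one term.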

Nutritional score by country

Both the most and the least nutritious items are from France. More items in the dataset come from France than from any other country. Country information is sparser than brand information, probably because it is often hard to tell which country a food item is from.

Figure: Nutritional score by country

Energy in kJ per 100g in food

Most foods contain around 250 kJ of energy per 100 g, but some outliers reach around 4000 kJ.
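To make "outlier" concrete, one common convention is the interquartile-range rule (Tukey's fences). This is a choice made for this sketch, not the method we used to read the plot, and the energy values below are made up.

```python
from statistics import median, quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Made-up energy values in kJ per 100 g.
energies = [100, 200, 250, 250, 300, 400, 4000]
typical = median(energies)
outliers = iqr_outliers(energies)
print(typical, outliers)
```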

2 - Difficulties

2.1 - Technical Difficulties

Visualizations were created with plot.ly and Metabase. Because the original dataset is large, we worked on a smaller dataset to speed up visualization.
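How the smaller dataset is drawn matters if it should stay representative. One option, shown here only as a sketch and not necessarily how our subset was produced, is reservoir sampling, which picks a uniform random sample of `k` rows in a single pass without loading the whole file. The function name and the seed are illustrative.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Uniformly sample k items from an iterable in one pass (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = row      # replace with probability k / (i + 1)
    return sample

# Stand-in for the rows of a large CSV file.
subset = reservoir_sample(range(1000), k=10)
print(subset)
```

In practice `rows` could be a `csv.reader` over the original file, so memory use stays proportional to `k`.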

We used R to visualize the original dataset. After initial hiccups such as a frozen screen and slow system response, we managed to produce visualizations such as word clouds, bar charts and scatter plots.

2.2 - Difficulties with the Data

Key nutritional information is missing for many entries. Since nutritional information is among the most interesting attributes of the food, we may need to find additional sources of data for optimal results.

That being said, we're confident that we can find interesting results regardless of whether we can locate additional information as we have a decent amount of complete data.

3 - Ideas

Related Variables:

match sugar / salt / energy content with country or brand
match sugar / salt / energy / allergens content with creation date (assuming that recent data - on average - describes more recent food)
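A first check for such a relationship could be a simple correlation coefficient between two numeric variables. The sketch below computes Pearson's r by hand; the year and sugar values are synthetic and chosen to be perfectly linear, so they say nothing about the real data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Synthetic example: creation year vs. sugar per 100 g (not real data).
years = [2012, 2013, 2014, 2015, 2016]
sugar = [30, 28, 26, 24, 22]
r = pearson(years, sugar)
print(round(r, 3))
```

A value near 0 would suggest no linear relationship, while values near -1 or 1 would justify a closer look.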

Clustering:

Cluster foods around nutritional data
Cluster brands around nutritional data of their foods
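As a rough idea of what such clustering could look like, here is a minimal k-means (Lloyd's algorithm) in plain Python. The feature pairs (sugar and fat per 100 g) are made up and the initial centroids are passed in by hand. Real use would need feature scaling and a better initialization, and k-means is only one candidate method.

```python
from math import dist

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        for p in points
    ]

def kmeans(points, centroids, max_iter=20):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids

# Made-up (sugar, fat) values per 100 g: three sweets, three low-sugar items.
foods = [(50, 30), (55, 28), (48, 33), (3, 1), (5, 0.5), (4, 2)]
labels, centers = kmeans(foods, centroids=[foods[0], foods[3]])
print(labels)
print(centers)
```

Clustering brands instead of foods would use the same routine, with each brand represented by the mean nutritional values of its products.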

Possible Prediction Tasks:

Predict nutritional values of next food product by a brand

Other:

Classify food as candy and other types based on nutritional (and possibly brand) information
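A very simple baseline for this idea is a nearest-centroid classifier: average the features of each labelled class, then assign a new item to the class whose average is closest. The labelled examples, the feature choice (sugar and salt per 100 g) and the function names below are all hypothetical; the dataset itself does not come with a "candy" label.

```python
from math import dist

def fit_centroids(examples):
    """examples: list of (features, label). Returns {label: mean feature vector}."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Label of the class centroid closest to the given features."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))

# Hypothetical hand-labelled training data: (sugar, salt) per 100 g.
training = [
    ((60, 0.1), "candy"),
    ((70, 0.05), "candy"),
    ((5, 1.2), "other"),
    ((2, 0.8), "other"),
]
model = fit_centroids(training)
print(predict(model, (55, 0.2)), predict(model, (8, 1.5)))
```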

4 - Resources

This week's presentation can be found here
Other datasets that can be investigated:
1. United States Department of Agriculture