Exposure ‐ data - openefsa/foodex2-sca-backend GitHub Wiki
This page describes the data used for training the FoodEx2 Exposure models.
Once the source dataset is constructed, various Natural Language Processing (NLP) functions are applied to each record. The list below shows the most important methods applied:
- Normalization and lower case,
- Verify the correctness of the FoodEx2 codes (simple and compound),
- Clean special characters and punctuation unnecessary for the analysis,
- Remove duplicate words,
- Remove custom sentences (e.g. sentences introduced by data collection tools),
- Remove custom stop words,
- Word lemmatization (remove tense and plural forms).
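The steps listed above, together with the removal of duplicate records after cleaning, can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline. The stop words, the tool-generated sentence, the function names, and the assumed code layout (a five-character base term, optionally followed by `#` and `$`-separated facets of the form `Fnn.XXXXX`) are all assumptions made for the example. Lemmatization is left out because it needs an external NLP library.

```python
import re
import unicodedata

# Hypothetical lists: the real stop words and tool-generated sentences
# used by the project are not given on this page.
STOP_WORDS = {"the", "of", "with", "and"}
CUSTOM_SENTENCES = ["not specified by the data provider"]

# Assumed FoodEx2 code layout: base term, then optional facets,
# e.g. "A0B9Z" (simple) or "A0B9Z#F28.A07JS$F01.A0F6E" (compound).
CODE_RE = re.compile(
    r"[A-Z0-9]{5}(#F\d{2}\.[A-Z0-9]{5}(\$F\d{2}\.[A-Z0-9]{5})*)?"
)


def is_valid_code(code: str) -> bool:
    """Check only the shape of the code, not whether the terms exist."""
    return CODE_RE.fullmatch(code) is not None


def clean_description(text: str) -> str:
    # Normalise unicode forms and lower-case the text.
    text = unicodedata.normalize("NFKC", text).lower()
    # Drop sentences injected by data collection tools.
    for sentence in CUSTOM_SENTENCES:
        text = text.replace(sentence, " ")
    # Replace punctuation and special characters with spaces.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Remove stop words and repeated words, keeping first occurrences in order.
    seen, kept = set(), []
    for word in text.split():
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    return " ".join(kept)


def build_training_set(records):
    """records: iterable of (free-text description, FoodEx2 code) pairs."""
    seen, cleaned = set(), []
    for description, code in records:
        if not is_valid_code(code):
            continue
        row = (clean_description(description), code)
        # Skip empty descriptions and records that became duplicates after cleaning.
        if row[0] and row not in seen:
            seen.add(row)
            cleaned.append(row)
    return cleaned
```

Two raw records such as `("Milk, whole", "A0B9Z")` and `("milk whole!", "A0B9Z")` collapse into a single training record here, which is why the cleaned record counts are lower than the original ones.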
Once the original dataset has been cleaned, all duplicate records are removed. The result is the final dataset used to train the various models. The following table summarises the training dataset:
- version : training dataset and models version,
- foodEx2 domain : FoodEx2 hierarchies to which the collected data belong,
- original features : number of FoodEx2 records extracted from DWH,
- cleaned features : number of FoodEx2 records after applying the NLP functions listed above.
version | foodex2 domain | original features | cleaned features |
---|---|---|---|
v.11.1.0 | exposure | n.a. | n.a. |
v.11.2.0 | exposure | 374410 | 302265 |
v.11.3.0 | exposure | 416290 | 388371 |
v.11.3.1 | exposure | 510736 | 464175 |
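To read the table above as a retention rate, the cleaned count can be divided by the original count for each version. The short sketch below does that with the figures copied from the table. The names `DATASETS` and `retention` are illustrative only, and v.11.1.0 is skipped because its counts are not available.

```python
# (version, original features, cleaned features) copied from the table above.
DATASETS = [
    ("v.11.2.0", 374410, 302265),
    ("v.11.3.0", 416290, 388371),
    ("v.11.3.1", 510736, 464175),
]


def retention(original: int, cleaned: int) -> float:
    """Share of original records that remain after cleaning."""
    return cleaned / original


for version, original, cleaned in DATASETS:
    removed = original - cleaned
    print(f"{version}: kept {retention(original, cleaned):.1%}, removed {removed} records")
```

With these figures, roughly 80.7%, 93.3% and 90.9% of the records remain for v.11.2.0, v.11.3.0 and v.11.3.1 respectively.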
The following table summarises the training dataset for each model:
- name : model's name,
- version : version of the training data and model,
- training features : number of records used for training the model,
- covered classes : number of FoodEx2 terms (also called classes) covered by the domain model,
- total classes : total number of terms, extracted from the latest catalogue version where the reportable flag is set to 1.
Click on the short name link to see the plot of the data distribution for that specific model.
short name | name | version | training features | covered classes | total classes |
---|---|---|---|---|---|
BT | Base term | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 222329<br>223481<br>284882<br>219355 | 2325<br>2468<br>2664<br>2584 | 4367 |
CAT | Facet category | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 348286<br>352738<br>451196<br>352054 | 26<br>28<br>28<br>28 | 28 |
F01 | Source | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 1796<br>2417<br>2832<br>2058 | 172<br>209<br>280<br>234 | 19744 |
F02 | Part-nature | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 3056<br>3600<br>3323<br>1711 | 57<br>69<br>90<br>73 | 513 |
F03 | Physical-state | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 23915<br>24126<br>27088<br>22024 | 22<br>22<br>22<br>23 | 30 |
F04 | Ingredient | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 116082<br>116795<br>131954<br>111826 | 1432<br>1520<br>1623<br>1682 | 4496 |
F06 | Surrounding-medium | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 4144<br>4151<br>5361<br>4000 | 25<br>25<br>26<br>27 | 27 |
F07 | Fat-content | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 6436<br>6454<br>6721<br>6203 | 119<br>123<br>127<br>138 | 191 |
F08 | Sweetening-agent | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 2947<br>2988<br>3841<br>4833 | 26<br>27<br>30<br>42 | 56 |
F09 | Fortification-agent | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 9486<br>9508<br>10176<br>9187 | 45<br>46<br>46<br>47 | 51 |
F10 | Qualitative-info | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 52089<br>52281<br>59313<br>44618 | 70<br>70<br>72<br>74 | 81 |
F11 | Alcohol-content | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 467<br>473<br>520<br>313 | 36<br>38<br>42<br>34 | 191 |
F12 | Dough-Mass | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 282<br>282<br>330<br>231 | 12<br>12<br>14<br>14 | 33 |
F17 | Extent-of-cooking | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 12859<br>12861<br>12736<br>3097 | 14<br>14<br>14<br>15 | 15 |
F18 | Packaging-format | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 5914<br>5930<br>9422<br>13355 | 13<br>13<br>14<br>17 | 17 |
F19 | Packaging-material | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 28549<br>28678<br>46536<br>49383 | 25<br>25<br>28<br>29 | 34 |
F20 | Part-consumed-analysed | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 10963<br>11005<br>12149<br>7367 | 32<br>34<br>38<br>38 | 55 |
F21 | Production-method | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 6418<br>6457<br>22305<br>20426 | 11<br>12<br>17<br>20 | 23 |
F22 | Preparation-production-place | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 18662<br>18669<br>22204<br>21333 | 9<br>10<br>11<br>17 | 19 |
F23 | Target-consumer | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 16687<br>16756<br>19595<br>17776 | 13<br>13<br>15<br>15 | 63 |
F24 | Intended-use | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 2073<br>2075<br>5392<br>3260 | 8<br>8<br>8<br>8 | 11 |
F25 | Risky-Ingredient | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 1317<br>1317<br>1398<br>1074 | 2<br>2<br>2<br>3 | 6 |
F26 | Generic-term | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 17547<br>18030<br>25868<br>19900 | 2<br>2<br>2<br>2 | 2 |
F27 | Source-commodities | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 13707<br>13846<br>17182<br>12605 | 361<br>400<br>479<br>462 | 2482 |
F28 | Process | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 118745<br>120053<br>143254<br>137889 | 130<br>133<br>149<br>162 | 196 |
F29 | Purpose-of-raising | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 0<br>0<br>34<br>23 | 0<br>0<br>4<br>4 | 22 |
F30 | Reproductive-level | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 0<br>0<br>11<br>11 | 0<br>0<br>2<br>2 | 4 |
F31 | Animal-age-class | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 957<br>1338<br>1339<br>1219 | 7<br>7<br>10<br>10 | 46 |
F32 | Gender | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 1091<br>1484<br>1789<br>1653 | 3<br>3<br>5<br>5 | 7 |
F33 | Legislative-classes | 11.1.0<br>11.2.0<br>11.3.0 (online)<br>11.3.1 | 2710<br>2786<br>8869<br>7405 | 44<br>48<br>128<br>126 | 260 |
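One way to read the covered and total class columns together is as a coverage ratio, that is, the fraction of reportable terms for which the model has seen at least one training record. The sketch below computes it for a handful of models using the 11.3.1 figures from the table above. The `ModelStats` type and the function names are illustrative and not part of the project.

```python
from typing import NamedTuple


class ModelStats(NamedTuple):
    short_name: str
    covered: int  # covered classes (version 11.3.1)
    total: int    # total classes in the latest catalogue


# A few rows copied from the table above (version 11.3.1).
STATS = [
    ModelStats("BT", 2584, 4367),
    ModelStats("CAT", 28, 28),
    ModelStats("F01", 234, 19744),
    ModelStats("F26", 2, 2),
]


def coverage(stats: ModelStats) -> float:
    """Fraction of the catalogue's classes present in the training data."""
    return stats.covered / stats.total


def rank_by_coverage(stats):
    """Models ordered from least to most covered."""
    return sorted(stats, key=coverage)


for s in rank_by_coverage(STATS):
    print(f"{s.short_name}: {coverage(s):.1%} of classes covered")
```

Sorting by coverage makes sparsely covered models stand out. F01 (Source), for example, covers only a little over 1% of its 19744 classes, whereas CAT and F26 cover all of theirs.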