Exposure ‐ data - openefsa/foodex2-sca-backend GitHub Wiki

European Food Safety Authority

This page describes the data used for training the FoodEx2 Exposure models.

Training dataset

Once the source dataset has been constructed, several Natural Language Processing (NLP) functions are applied to each record. The most important steps are:

  • Normalization and lower case,
  • Verify the correctness of the FoodEx2 codes (simple and compound),
  • Clean special characters and punctuation unnecessary for the analysis,
  • Remove duplicate words,
  • Remove custom sentences (e.g. sentences introduced by data collection tools),
  • Remove custom stop words,
  • Word lemmatization (remove tense and plural forms).
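
The steps listed above, plus the final removal of duplicate records, can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline. The names `CUSTOM_SENTENCES`, `CUSTOM_STOP_WORDS`, `is_valid_code`, `clean_description` and `build_training_set` are hypothetical, the example sentence and stop words are placeholders, and the order of the steps is an assumption. The code check is purely syntactic and assumes the usual FoodEx2 layout of a 5-character base term optionally followed by `#` and `$`-separated facets. Lemmatization is left out because it depends on an NLP library that this page does not name.

```python
import re
import unicodedata

# Hypothetical placeholders: the real lists are not given on this page.
CUSTOM_SENTENCES = ["no further information available"]
CUSTOM_STOP_WORDS = {"with", "and", "of", "the"}

# Assumed syntax: 5-character base term (simple code), optionally followed by
# '#' and '$'-separated facets such as F28.A07JS (compound code).
FOODEX2_CODE = re.compile(
    r"[A-Z0-9]{5}(#F\d{2}\.[A-Z0-9]{5}(\$F\d{2}\.[A-Z0-9]{5})*)?"
)


def is_valid_code(code: str) -> bool:
    """Syntactic check only; it does not look the terms up in the catalogue."""
    return FOODEX2_CODE.fullmatch(code) is not None


def clean_description(text: str) -> str:
    # Normalise Unicode, strip accents and convert to lower case.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    # Remove custom sentences (e.g. text added by data collection tools).
    for sentence in CUSTOM_SENTENCES:
        text = text.replace(sentence, " ")
    # Replace special characters and punctuation with spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop custom stop words and repeated words, keeping first occurrences.
    seen = set()
    words = []
    for word in text.split():
        if word in CUSTOM_STOP_WORDS or word in seen:
            continue
        seen.add(word)
        words.append(word)
    return " ".join(words)


def build_training_set(records):
    """records: iterable of (description, code) pairs."""
    cleaned = []
    seen = set()
    for description, code in records:
        # Assumption: records with a malformed code are discarded.
        if not is_valid_code(code):
            continue
        row = (clean_description(description), code)
        # Skip empty descriptions and duplicate records.
        if row[0] and row not in seen:
            seen.add(row)
            cleaned.append(row)
    return cleaned
```

Calling `build_training_set` on a list of `(description, code)` pairs returns the cleaned pairs with malformed codes and duplicate records dropped, so two descriptions that differ only in case, accents or punctuation collapse into one record.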

Once the original dataset has been cleaned, all duplicate records are removed. The result is the final training dataset used to train the various models. The following table summarises the training dataset for each version:

  • version : training dataset and models version,
  • foodex2 domain : FoodEx2 hierarchies to which the collected data belong,
  • original features : number of FoodEx2 records extracted from DWH,
  • cleaned features : number of FoodEx2 records after applying the NLP functions listed above.

| version | foodex2 domain | original features | cleaned features |
|---|---|---|---|
| v.11.1.0 | exposure | n.a. | n.a. |
| v.11.2.0 | exposure | 374410 | 302265 |
| v.11.3.0 | exposure | 416290 | 388371 |
| v.11.3.1 | exposure | 510736 | 464175 |
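
To make the effect of cleaning easier to compare across versions, the share of extracted records that remains after cleaning can be computed from the two count columns. The snippet below is only an illustrative calculation on the figures in the table above; `counts` and `retained_share` are names chosen for this example.

```python
# (original features, cleaned features) copied from the table above.
# v.11.1.0 is omitted because its counts are reported as n.a.
counts = {
    "v.11.2.0": (374410, 302265),
    "v.11.3.0": (416290, 388371),
    "v.11.3.1": (510736, 464175),
}


def retained_share(original: int, cleaned: int) -> float:
    """Fraction of extracted records still present after cleaning."""
    return cleaned / original


for version, (original, cleaned) in counts.items():
    print(f"{version}: {retained_share(original, cleaned):.1%} of records kept")
```

With these figures the retained share is about 80.7% for v.11.2.0, 93.3% for v.11.3.0 and 90.9% for v.11.3.1.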

Training dataset per model

The following table summarises the training dataset for each model:

  • name : model's name,
  • version : version of the training data and model,
  • training features : number of records used for training the model,
  • classes : number of FoodEx2 terms (also called classes) covered in the domain model; the total number of terms is extracted from the latest catalogue version, counting only terms whose reportable flag is set to 1.

Click on the short name link to see the plot of the data distribution for that specific model.

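| short name | name | version | training features | covered classes | total classes |
|---|---|---|---|---|---|
| BT | Base term | 11.1.0 | 222329 | 2325 | 4367 |
| | | 11.2.0 | 223481 | 2468 | |
| | | 11.3.0 (online) | 284882 | 2664 | |
| | | 11.3.1 | 219355 | 2584 | |
| CAT | Facet category | 11.1.0 | 348286 | 26 | 28 |
| | | 11.2.0 | 352738 | 28 | |
| | | 11.3.0 (online) | 451196 | 28 | |
| | | 11.3.1 | 352054 | 28 | |
| F01 | Source | 11.1.0 | 1796 | 172 | 19744 |
| | | 11.2.0 | 2417 | 209 | |
| | | 11.3.0 (online) | 2832 | 280 | |
| | | 11.3.1 | 2058 | 234 | |
| F02 | Part-nature | 11.1.0 | 3056 | 57 | 513 |
| | | 11.2.0 | 3600 | 69 | |
| | | 11.3.0 (online) | 3323 | 90 | |
| | | 11.3.1 | 1711 | 73 | |
| F03 | Physical-state | 11.1.0 | 23915 | 22 | 30 |
| | | 11.2.0 | 24126 | 22 | |
| | | 11.3.0 (online) | 27088 | 22 | |
| | | 11.3.1 | 22024 | 23 | |
| F04 | Ingredient | 11.1.0 | 116082 | 1432 | 4496 |
| | | 11.2.0 | 116795 | 1520 | |
| | | 11.3.0 (online) | 131954 | 1623 | |
| | | 11.3.1 | 111826 | 1682 | |
| F06 | Surrounding-medium | 11.1.0 | 4144 | 25 | 27 |
| | | 11.2.0 | 4151 | 25 | |
| | | 11.3.0 (online) | 5361 | 26 | |
| | | 11.3.1 | 4000 | 27 | |
| F07 | Fat-content | 11.1.0 | 6436 | 119 | 191 |
| | | 11.2.0 | 6454 | 123 | |
| | | 11.3.0 (online) | 6721 | 127 | |
| | | 11.3.1 | 6203 | 138 | |
| F08 | Sweetening-agent | 11.1.0 | 2947 | 26 | 56 |
| | | 11.2.0 | 2988 | 27 | |
| | | 11.3.0 (online) | 3841 | 30 | |
| | | 11.3.1 | 4833 | 42 | |
| F09 | Fortification-agent | 11.1.0 | 9486 | 45 | 51 |
| | | 11.2.0 | 9508 | 46 | |
| | | 11.3.0 (online) | 10176 | 46 | |
| | | 11.3.1 | 9187 | 47 | |
| F10 | Qualitative-info | 11.1.0 | 52089 | 70 | 81 |
| | | 11.2.0 | 52281 | 70 | |
| | | 11.3.0 (online) | 59313 | 72 | |
| | | 11.3.1 | 44618 | 74 | |
| F11 | Alcohol-content | 11.1.0 | 467 | 36 | 191 |
| | | 11.2.0 | 473 | 38 | |
| | | 11.3.0 (online) | 520 | 42 | |
| | | 11.3.1 | 313 | 34 | |
| F12 | Dough-Mass | 11.1.0 | 282 | 12 | 33 |
| | | 11.2.0 | 282 | 12 | |
| | | 11.3.0 (online) | 330 | 14 | |
| | | 11.3.1 | 231 | 14 | |
| F17 | Extent-of-cooking | 11.1.0 | 12859 | 14 | 15 |
| | | 11.2.0 | 12861 | 14 | |
| | | 11.3.0 (online) | 12736 | 14 | |
| | | 11.3.1 | 3097 | 15 | |
| F18 | Packaging-format | 11.1.0 | 5914 | 13 | 17 |
| | | 11.2.0 | 5930 | 13 | |
| | | 11.3.0 (online) | 9422 | 14 | |
| | | 11.3.1 | 13355 | 17 | |
| F19 | Packaging-material | 11.1.0 | 28549 | 25 | 34 |
| | | 11.2.0 | 28678 | 25 | |
| | | 11.3.0 (online) | 46536 | 28 | |
| | | 11.3.1 | 49383 | 29 | |
| F20 | Part-consumed-analysed | 11.1.0 | 10963 | 32 | 55 |
| | | 11.2.0 | 11005 | 34 | |
| | | 11.3.0 (online) | 12149 | 38 | |
| | | 11.3.1 | 7367 | 38 | |
| F21 | Production-method | 11.1.0 | 6418 | 11 | 23 |
| | | 11.2.0 | 6457 | 12 | |
| | | 11.3.0 (online) | 22305 | 17 | |
| | | 11.3.1 | 20426 | 20 | |
| F22 | Preparation-production-place | 11.1.0 | 18662 | 9 | 19 |
| | | 11.2.0 | 18669 | 10 | |
| | | 11.3.0 (online) | 22204 | 11 | |
| | | 11.3.1 | 21333 | 17 | |
| F23 | Target-consumer | 11.1.0 | 16687 | 13 | 63 |
| | | 11.2.0 | 16756 | 13 | |
| | | 11.3.0 (online) | 19595 | 15 | |
| | | 11.3.1 | 17776 | 15 | |
| F24 | Intended-use | 11.1.0 | 2073 | 8 | 11 |
| | | 11.2.0 | 2075 | 8 | |
| | | 11.3.0 (online) | 5392 | 8 | |
| | | 11.3.1 | 3260 | 8 | |
| F25 | Risky-Ingredient | 11.1.0 | 1317 | 2 | 6 |
| | | 11.2.0 | 1317 | 2 | |
| | | 11.3.0 (online) | 1398 | 2 | |
| | | 11.3.1 | 1074 | 3 | |
| F26 | Generic-term | 11.1.0 | 17547 | 2 | 2 |
| | | 11.2.0 | 18030 | 2 | |
| | | 11.3.0 (online) | 25868 | 2 | |
| | | 11.3.1 | 19900 | 2 | |
| F27 | Source-commodities | 11.1.0 | 13707 | 361 | 2482 |
| | | 11.2.0 | 13846 | 400 | |
| | | 11.3.0 (online) | 17182 | 479 | |
| | | 11.3.1 | 12605 | 462 | |
| F28 | Process | 11.1.0 | 118745 | 130 | 196 |
| | | 11.2.0 | 120053 | 133 | |
| | | 11.3.0 (online) | 143254 | 149 | |
| | | 11.3.1 | 137889 | 162 | |
| F29 | Purpose-of-raising | 11.1.0 | 0 | 0 | 22 |
| | | 11.2.0 | 0 | 0 | |
| | | 11.3.0 (online) | 34 | 4 | |
| | | 11.3.1 | 23 | 4 | |
| F30 | Reproductive-level | 11.1.0 | 0 | 0 | 4 |
| | | 11.2.0 | 0 | 0 | |
| | | 11.3.0 (online) | 11 | 2 | |
| | | 11.3.1 | 11 | 2 | |
| F31 | Animal-age-class | 11.1.0 | 957 | 7 | 46 |
| | | 11.2.0 | 1338 | 7 | |
| | | 11.3.0 (online) | 1339 | 10 | |
| | | 11.3.1 | 1219 | 10 | |
| F32 | Gender | 11.1.0 | 1091 | 3 | 7 |
| | | 11.2.0 | 1484 | 3 | |
| | | 11.3.0 (online) | 1789 | 5 | |
| | | 11.3.1 | 1653 | 5 | |
| F33 | Legislative-classes | 11.1.0 | 2710 | 44 | 260 |
| | | 11.2.0 | 2786 | 48 | |
| | | 11.3.0 (online) | 8869 | 128 | |
| | | 11.3.1 | 7405 | 126 | |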
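
The covered and total class counts can be combined into a per-model coverage ratio, which shows how much of each hierarchy the training data actually reaches. The sketch below uses a few rows of the 11.3.0 (online) version from the table above; `online_counts` and `coverage` are illustrative names. Because the totals come from the latest catalogue version, ratios for older model versions should be read as approximate.

```python
# (covered classes, total classes) for the 11.3.0 (online) version,
# copied from the table above for a few models.
online_counts = {
    "BT": (2664, 4367),
    "CAT": (28, 28),
    "F01": (280, 19744),
    "F29": (4, 22),
}


def coverage(covered: int, total: int) -> float:
    """Share of reportable FoodEx2 terms that appear in the training data."""
    return covered / total


# List the models from least to most covered.
for short_name, (covered, total) in sorted(
    online_counts.items(), key=lambda item: coverage(*item[1])
):
    print(f"{short_name}: {coverage(covered, total):.1%} ({covered}/{total})")
```

In this selection the facet category model covers every class, the base term model covers about 61% of its classes, and the Source facet (F01) covers roughly 1.4%.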
