Data formats - mpsilfve/FinnPos GitHub Wiki
The utilities `finnpos-train` and `finnpos-label` read and write input sentences in a five-column tab-separated format, where each row corresponds to one input word and the columns denote:
- Word form (e.g. `Dogs`).
- Features separated by spaces (e.g. `WORD=dogs PREV_WORD=the`).
- Lemma (e.g. `dog`).
- Label (e.g. `NNS`, `Noun|Plural`).
- Annotations (arbitrary text not containing tabs).
The file should be UTF-8 encoded, and none of the fields may include tabs. Consecutive sentences are separated by one or more empty lines.
Example:
    The	WORD=The LC_WORD=the	the	DT	_
    dog	WORD=dog LC_WORD=dog	dog	NN	_
    barks	WORD=barks LC_WORD=barks	bark	VBZ	_
    .	WORD=. LC_WORD=.	.	.	_

    The	WORD=The LC_WORD=the	the	DT	_
    cat	WORD=cat LC_WORD=cat	cat	NN	_
    meows	WORD=meows LC_WORD=meows	meow	VBZ	_
    .	WORD=. LC_WORD=.	.	.	_
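As a rough illustration of the format described above, the following Python sketch reads such data into sentences. The names `Row` and `read_sentences` are hypothetical helpers invented for this example and are not part of FinnPos. The sketch keeps the label field as raw text, so several space-separated candidates stay in one string.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Row(NamedTuple):
    """One input word. Field names are illustrative, not FinnPos API."""
    word: str
    features: List[str]
    lemma: str
    label: str
    annotations: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Row]]:
    """Yield sentences (lists of rows) from five-column tab-separated lines."""
    sentence: List[Row] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # One or more empty lines end the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(fields)}")
        word, features, lemma, label, annotations = fields
        # Features are separated by spaces inside the second column.
        sentence.append(Row(word, features.split(), lemma, label, annotations))
    if sentence:
        yield sentence


sample = (
    "The\tWORD=The LC_WORD=the\tthe\tDT\t_\n"
    "dog\tWORD=dog LC_WORD=dog\tdog\tNN\t_\n"
    "\n"
    "The\tWORD=The LC_WORD=the\tthe\tDT\t_\n"
)
sentences = list(read_sentences(sample.splitlines()))
```

To read a real file, the same function can be given a file object opened with `open(path, encoding="utf-8")`, where `path` is a placeholder for your own data file.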
Formally, all fields have to be non-empty. The underscore (`_`) can be used to denote a field with an empty value.
When a file is used as a training or development file for `finnpos-train`, the lemma and label fields have to contain exactly one value each (i.e. they cannot contain spaces and they cannot be `_`).
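A check along these lines can catch violations before training. This is only a sketch: `training_errors` is a hypothetical helper written for this page, and it does not reproduce whatever checks `finnpos-train` performs internally.

```python
from typing import Iterable, List


def training_errors(lines: Iterable[str]) -> List[str]:
    """Return human-readable problems found in training or development data."""
    errors: List[str] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # empty line: sentence boundary
        fields = line.split("\t")
        if len(fields) != 5:
            errors.append(f"line {line_no}: expected 5 columns, got {len(fields)}")
            continue
        if any(field == "" for field in fields):
            errors.append(f"line {line_no}: empty field (use _ instead)")
        # Lemma (column 3) and label (column 4) must hold exactly one value.
        for name, value in (("lemma", fields[2]), ("label", fields[3])):
            if value == "_" or " " in value:
                errors.append(
                    f"line {line_no}: {name} must be exactly one value, got {value!r}"
                )
    return errors


good_row = "dog\tWORD=dog LC_WORD=dog\tdog\tNN\t_"
bad_row = "dog\tWORD=dog LC_WORD=dog\t_\tNN VBN\t_"
```

Here `good_row` passes, while `bad_row` is reported twice: once for the `_` lemma and once for the two space-separated labels.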
If the label field is non-empty in the input for `finnpos-label`, the tagger will disambiguate between the candidates provided.
Both input files for `finnpos-eval` need to have unique lemmas and labels for each word.
Sub-labels of complex labels should be separated using `|`. Example:
    The	WORD=The LC_WORD=the	the	Det	_
    dog	WORD=dog LC_WORD=dog	dog	Noun|Singular	_
    barks	WORD=barks LC_WORD=barks	bark	Verb|Pres|3|Singular	_
    .	WORD=. LC_WORD=.	.	.	_

    The	WORD=The LC_WORD=the	the	Det	_
    cat	WORD=cat LC_WORD=cat	cat	Noun|Singular	_
    meows	WORD=meows LC_WORD=meows	meow	Verb|Pres|3|Singular	_
    .	WORD=. LC_WORD=.	.	.	_
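When post-processing such data, a complex label can be taken apart and put back together with plain string operations. The helper names below are hypothetical and only illustrate the `|` convention.

```python
from typing import List, Sequence


def split_label(label: str) -> List[str]:
    """Split a complex label such as 'Noun|Singular' into its sub-labels."""
    return label.split("|")


def join_label(sub_labels: Sequence[str]) -> str:
    """Join sub-labels back into a single complex label."""
    return "|".join(sub_labels)


parts = split_label("Verb|Pres|3|Singular")
```

A simple label such as `Det` yields a single-element list, so the same code handles both kinds of label.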
An example of a training data file with unique lemma and label for each word:
    The	WORD=The LC_WORD=the	the	DT	_
    dog	WORD=dog LC_WORD=dog	dog	NN	_
    barks	WORD=barks LC_WORD=barks	bark	VBZ	_
    .	WORD=. LC_WORD=.	.	.	_

    The	WORD=The LC_WORD=the	the	DT	_
    cat	WORD=cat LC_WORD=cat	cat	NN	_
    meows	WORD=meows LC_WORD=meows	meow	VBZ	_
    .	WORD=. LC_WORD=.	.	.	_
An example of a test data file:
    The	WORD=The LC_WORD=the	_	_	_
    dog	WORD=dog LC_WORD=dog	_	_	_
    barks	WORD=barks LC_WORD=barks	_	_	_
    .	WORD=. LC_WORD=.	_	_	_

    The	WORD=The LC_WORD=the	_	_	_
    cat	WORD=cat LC_WORD=cat	_	_	_
    meows	WORD=meows LC_WORD=meows	_	_	_
    .	WORD=. LC_WORD=.	_	_	_
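A test file of this shape can be derived from annotated data by blanking the lemma and label columns. The function below is a minimal sketch with a hypothetical name; it leaves the word form, features, and annotations untouched.

```python
def blank_gold_fields(line: str) -> str:
    """Replace the lemma and label columns of one row with underscores."""
    stripped = line.rstrip("\r\n")
    if not stripped.strip():
        return ""  # keep sentence boundaries as empty lines
    fields = stripped.split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    fields[2] = "_"  # lemma
    fields[3] = "_"  # label
    return "\t".join(fields)


test_row = blank_gold_fields("dog\tWORD=dog LC_WORD=dog\tdog\tNN\t_\n")
```

Applying the function to every line of a training file and writing the results out, one per line, gives a file in the test format shown above.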
An example of a test data file with label candidates:
    The	WORD=The LC_WORD=the LABEL:DT	_	DT	_
    dog	WORD=dog LC_WORD=dog LABEL:NN LABEL:VBN	_	NN VBN	_
    barks	WORD=barks LC_WORD=barks LABEL:VBZ LABEL:NNS	_	VBZ NNS	_
    .	WORD=. LC_WORD=. LABEL:.	_	.	_

    The	WORD=The LC_WORD=the LABEL:DT	_	DT	_
    cat	WORD=cat LC_WORD=cat LABEL:NN	_	NN	_
    meows	WORD=meows LC_WORD=meows LABEL:VBZ LABEL:NNS	_	VBZ NNS	_
    .	WORD=. LC_WORD=. LABEL:.	_	.	_
Each word will receive one of its label candidates. This file additionally uses morphological labels as features.
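Rows like the ones in the last example could be generated as follows. This is a sketch under assumptions: `candidate_row` is a hypothetical helper, the candidate labels are assumed to come from some external source such as a lexicon, and the `LABEL:` feature prefix simply mirrors the example above.

```python
from typing import Sequence


def candidate_row(word: str, features: Sequence[str], candidates: Sequence[str]) -> str:
    """Build one input row whose label field lists the label candidates."""
    # Mirror each candidate as a LABEL:<candidate> feature, as in the example.
    all_features = list(features) + [f"LABEL:{c}" for c in candidates]
    feature_field = " ".join(all_features) or "_"
    label_field = " ".join(candidates) or "_"
    # Lemma and annotations are left empty (underscore).
    return "\t".join([word, feature_field, "_", label_field, "_"])


row = candidate_row("dog", ["WORD=dog", "LC_WORD=dog"], ["NN", "VBN"])
```

With an empty candidate list the label field falls back to `_`, which gives a row in the plain test format.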