Data formats - mpsilfve/FinnPos GitHub Wiki

Description

The utilities finnpos-train and finnpos-label read and write sentences in a five-column, tab-separated format. Each row corresponds to one input word, and the columns denote:

  1. Word form (e.g. Dogs).
  2. Features separated by spaces (e.g. WORD=dogs PREV_WORD=the).
  3. Lemma (e.g. dog).
  4. Label (e.g. NNS, Noun|Plural).
  5. Annotations (arbitrary text not containing tabs).

The file should be UTF-8 encoded, and no field may contain a tab character. Consecutive sentences are separated by one or more empty lines.

Example:

The    WORD=The LC_WORD=the       the  DT    _
dog    WORD=dog LC_WORD=dog       dog  NN    _
barks  WORD=barks LC_WORD=barks   bark VBZ   _
.      WORD=. LC_WORD=.           .    .     _

The    WORD=The LC_WORD=the       the  DT    _
cat    WORD=cat LC_WORD=cat       cat  NN    _
meows  WORD=meows LC_WORD=meows   meow VBZ   _
.      WORD=. LC_WORD=.           .    .     _
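
The layout shown above can be read with a few lines of Python. The following is a minimal sketch, not part of FinnPos: the names `Token` and `read_sentences` are hypothetical, and the code assumes only what this page states (five tab-separated fields per row, space-separated features, empty lines between sentences). Field values are kept as they appear in the file.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One row of the five-column format (hypothetical helper type)."""
    word: str
    features: List[str]
    lemma: str
    label: str
    annotations: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of Token objects."""
    sentence: List[Token] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line == "":
            # One or more empty lines end the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_no}: expected 5 tab-separated fields, "
                f"got {len(fields)}"
            )
        word, feats, lemma, label, annotations = fields
        # Features are separated by single spaces inside the second column.
        sentence.append(Token(word, feats.split(" "), lemma, label, annotations))
    # Emit the last sentence if the input does not end with an empty line.
    if sentence:
        yield sentence
```

To read a file, pass an open handle, for example `read_sentences(open(path, encoding="utf-8"))`, where `path` is whatever data file you are working with.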

Formally, every field must be non-empty. Use an underscore (_) to denote a field with an empty value.

When a file is used as a training or development file for finnpos-train, the lemma and label fields must each contain exactly one value (i.e. they cannot contain spaces and they cannot be _).
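
The training-file constraint can be checked before running finnpos-train. The sketch below is an illustration only: `check_training_line` is a hypothetical helper, not a FinnPos utility, and it tests just the rule stated above (lemma and label hold exactly one value each).

```python
from typing import List


def check_training_line(line: str) -> List[str]:
    """Return a list of problems found in one non-empty data line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        return [f"expected 5 tab-separated fields, got {len(fields)}"]
    problems = []
    # Column 3 is the lemma and column 4 is the label.
    for name, value in (("lemma", fields[2]), ("label", fields[3])):
        if value in ("", "_"):
            problems.append(f"{name} is empty")
        elif " " in value:
            problems.append(f"{name} contains more than one value")
    return problems
```

An empty result means the line satisfies the rule; any other result lists what to fix.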

If the label field in the input for finnpos-label holds a value other than _, the tagger will disambiguate between the space-separated label candidates given there.

Both input files for finnpos-eval must have exactly one lemma and one label for each word.

Sub-labels of complex labels should be separated using |. Example:

The    WORD=The LC_WORD=the       the  Det    _
dog    WORD=dog LC_WORD=dog       dog  Noun|Singular    _
barks  WORD=barks LC_WORD=barks   bark Verb|Pres|3|Singular   _
.      WORD=. LC_WORD=.           .    .     _

The    WORD=The LC_WORD=the       the  Det    _
cat    WORD=cat LC_WORD=cat       cat  Noun|Singular    _
meows  WORD=meows LC_WORD=meows   meow Verb|Pres|3|Singular   _
.      WORD=. LC_WORD=.           .    .     _
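
For scripts that build or inspect such labels, the | convention maps directly onto string splitting and joining. A minimal sketch with hypothetical helper names:

```python
from typing import List


def split_label(label: str) -> List[str]:
    """Split a complex label into its sub-labels."""
    return label.split("|")


def join_label(sub_labels: List[str]) -> str:
    """Compose a complex label from its sub-labels."""
    return "|".join(sub_labels)
```

A simple label such as Det comes back as a one-element list, so the same code handles both simple and complex labels.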

More examples

An example of a training data file with unique lemma and label for each word:

The    WORD=The LC_WORD=the       the  DT    _
dog    WORD=dog LC_WORD=dog       dog  NN    _
barks  WORD=barks LC_WORD=barks   bark VBZ   _
.      WORD=. LC_WORD=.           .    .     _

The    WORD=The LC_WORD=the       the  DT    _
cat    WORD=cat LC_WORD=cat       cat  NN    _
meows  WORD=meows LC_WORD=meows   meow VBZ   _
.      WORD=. LC_WORD=.           .    .     _

An example of a test data file:

The    WORD=The LC_WORD=the       _    _    _
dog    WORD=dog LC_WORD=dog       _    _    _
barks  WORD=barks LC_WORD=barks   _    _    _
.      WORD=. LC_WORD=.           _    _    _

The    WORD=The LC_WORD=the       _    _    _
cat    WORD=cat LC_WORD=cat       _    _    _
meows  WORD=meows LC_WORD=meows   _    _    _
.      WORD=. LC_WORD=.           _    _    _

An example of a test data file with label candidates:

The    WORD=The LC_WORD=the LABEL:DT                 _  DT       _
dog    WORD=dog LC_WORD=dog LABEL:NN LABEL:VBN       _  NN VBN   _
barks  WORD=barks LC_WORD=barks LABEL:VBZ LABEL:NNS  _  VBZ NNS  _
.      WORD=. LC_WORD=. LABEL:.                      _  .        _

The    WORD=The LC_WORD=the LABEL:DT                 _  DT        _
cat    WORD=cat LC_WORD=cat LABEL:NN                 _  NN        _
meows  WORD=meows LC_WORD=meows LABEL:VBZ LABEL:NNS  _  VBZ NNS   _
.      WORD=. LC_WORD=. LABEL:.                      _  .         _

Each word will receive one of its label candidates. This file additionally uses morphological labels as features.
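
A row of this kind can be assembled as in the sketch below. The function `make_test_line` is hypothetical, and the feature names (WORD, LC_WORD, LABEL:) simply mirror the example on this page. In practice the feature column comes from whatever feature extraction you use; the code only illustrates where the candidates go.

```python
from typing import Sequence


def make_test_line(word: str, candidates: Sequence[str] = ()) -> str:
    """Build one tab-separated test row, optionally with label candidates."""
    features = [f"WORD={word}", f"LC_WORD={word.lower()}"]
    # Each candidate label is also exposed to the tagger as a feature.
    features += [f"LABEL:{c}" for c in candidates]
    # Candidates are space-separated in the label column; _ means none given.
    label_field = " ".join(candidates) if candidates else "_"
    # The lemma and annotation columns are left empty (_).
    return "\t".join([word, " ".join(features), "_", label_field, "_"])
```

Calling `make_test_line("dog", ["NN", "VBN"])` yields a row equivalent to the second line of the example above, with real tab characters between the columns.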
