Feature extraction - mpsilfve/FinnPos GitHub Wiki

General

FinnPos extracts so-called unstructured features, such as word forms, word suffixes and other orthographic features, from the input sentence and uses them to label the input. Additionally, it uses structured features, which model chains of morphological labels.

Users are free to define their own unstructured features or use a predefined feature set. The predefined features are extracted using the Python 3 script finnpos-ratna-feats.py. You can pipe text through the script, and it writes its output in the FinnPos five-column data format.

If you want to define a custom feature extraction script, it may be a good idea to make your own copy of finnpos-ratna-feats.py and modify it. The script is located in FinnPos/bin/.

The output of all feature extractors should conform to the FinnPos data format.
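As a quick sanity check for a custom extractor, the sketch below verifies that each line has five non-empty fields. It assumes that the columns are separated by tabs and that an empty field is written as '_'; the function name check_line is hypothetical and not part of FinnPos.

```python
def check_line(line):
    """Return True if the line looks like a valid five-column line.

    Assumptions of this sketch: columns are tab-separated, and a field
    that carries no value is written as '_' rather than left blank.
    """
    line = line.rstrip("\n")
    if line == "":
        # An empty line separates two sentences.
        return True
    fields = line.split("\t")
    # Columns: word form, features, lemma, label, annotations.
    return len(fields) == 5 and all(field != "" for field in fields)
```

Running such a check over the extractor output before training or tagging can help catch a missing column early.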

Feature extraction for tagger input

The output of a feature extraction script for test data may have empty label and lemma fields. An empty field is written as '_'. If the label field is not empty but instead contains a number of label candidates separated by spaces, finnpos-label will use the tagger model to disambiguate between the candidates.

$ cat input
The
dog
barks
.

The
cat
meows
.

$ cat input | ./mini-feat-extractor.py
The    WORD=The LC_WORD=the       _    _    _
dog    WORD=dog LC_WORD=dog       _    _    _
barks  WORD=barks LC_WORD=barks   _    _    _
.      WORD=. LC_WORD=.           _    _    _

The    WORD=The LC_WORD=the       _    _    _
cat    WORD=cat LC_WORD=cat       _    _    _
meows  WORD=meows LC_WORD=meows   _    _    _
.      WORD=. LC_WORD=.           _    _    _
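A minimal sketch of what a script like mini-feat-extractor.py could look like for tagger input is given below. It is not the actual script, only an illustration that reproduces the output above; the function names are hypothetical, and tab-separated columns are an assumption.

```python
import sys

EMPTY = "_"  # placeholder for an empty field, as in the example output


def extract_features(word):
    """Return the space-separated feature string for one word form."""
    return " ".join(["WORD=" + word, "LC_WORD=" + word.lower()])


def format_line(line):
    """Turn one input line (a bare word form) into a five-column line."""
    word = line.strip()
    if not word:
        # Empty lines separate sentences and are passed through.
        return ""
    # Columns: word form, features, lemma, label, annotations.
    # Tab separation is an assumption of this sketch.
    return "\t".join([word, extract_features(word), EMPTY, EMPTY, EMPTY])


if __name__ == "__main__":
    for input_line in sys.stdin:
        print(format_line(input_line))
```

The lemma, label and annotation fields are all left empty here, which is allowed for tagger input.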

Feature extraction for training and development data

Training and development data for finnpos-train must contain a unique (unambiguous) lemma and label for each word form.

$ cat input
The    the    DT
dog    dog    NN
barks  bark   VBZ
.      .      .

The    the    DT
cat    cat    NN
meows  meow   VBZ
.      .      .

$ cat input | ./mini-feat-extractor.py
The    WORD=The LC_WORD=the      the   DT    _
dog    WORD=dog LC_WORD=dog      dog   NN    _
barks  WORD=barks LC_WORD=barks  bark  VBZ   _
.      WORD=. LC_WORD=.          .     .     _

The    WORD=The LC_WORD=the      the   DT    _
cat    WORD=cat LC_WORD=cat      cat   NN    _
meows  WORD=meows LC_WORD=meows  meow  VBZ   _
.      WORD=. LC_WORD=.          .     .     _
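For training and development data, the same idea can be sketched as follows. The code assumes whitespace-separated input with exactly three fields per line (word form, lemma, label), as in the example above, and tab-separated output; the function name is hypothetical.

```python
def format_training_line(line):
    """Turn a 'word lemma label' line into a five-column line."""
    fields = line.split()
    if not fields:
        # Empty lines separate sentences and are passed through.
        return ""
    if len(fields) != 3:
        raise ValueError("expected word form, lemma and label: %r" % line)
    word, lemma, label = fields
    features = "WORD=%s LC_WORD=%s" % (word, word.lower())
    # Columns: word form, features, lemma, label, annotations.
    # Tab separation is an assumption of this sketch.
    return "\t".join([word, features, lemma, label, "_"])
```

Raising an error on malformed lines is a design choice of the sketch: training data with a missing lemma or label is better rejected early than passed on silently.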

Adding custom features

Sometimes it may be a good idea to supplement the feature set of finnpos-ratna-feats.py with additional features. For example, when using a morphological analyzer, you might want to add morphological analyses as features.

You can add custom features to your data before passing it to finnpos-ratna-feats.py. Your training file needs to conform to the FinnPos five-column data format. For example:

The    _              the    DT      _
dog    IS_ANIMAL      dog    NN      _
barks  ANIMAL_SOUND   bark   VBZ     _
.      _              .      .       _

The    _              the    DT      _
cat    IS_ANIMAL      cat    NN      _
meows  ANIMAL_SOUND   meow   VBZ     _
.      _              .      .       _
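A file like the one above could be produced with a small script along the following lines. This is only a sketch: the word lists stand in for a real resource such as a morphological analyzer, the function names are hypothetical, and tab-separated columns are an assumption.

```python
# Hypothetical toy lexicons; a real setup might query a morphological analyzer.
ANIMALS = {"dog", "cat"}
ANIMAL_SOUNDS = {"barks", "meows"}


def custom_features(word):
    """Return the list of custom features for one word form."""
    features = []
    if word.lower() in ANIMALS:
        features.append("IS_ANIMAL")
    if word.lower() in ANIMAL_SOUNDS:
        features.append("ANIMAL_SOUND")
    return features


def add_custom_features(line):
    """Add custom features to the feature column of a five-column line."""
    line = line.rstrip("\n")
    if not line.strip():
        # Empty lines separate sentences and are passed through.
        return ""
    word, features, lemma, label, annotations = line.split("\t")
    # '_' marks an empty feature field; otherwise features are space-separated.
    existing = [] if features == "_" else features.split(" ")
    merged = existing + custom_features(word)
    feature_field = " ".join(merged) if merged else "_"
    return "\t".join([word, feature_field, lemma, label, annotations])
```

Features already present in the second column are kept, so the function can be applied to a file that already carries some features.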