Feature extraction - mpsilfve/FinnPos GitHub Wiki
FinnPos extracts so-called unstructured features, such as word forms, word suffixes and other orthographic features, from the input sentence and uses them to label the input. Additionally, it uses structured features, which model chains of morphological labels.
Users are free to define their own unstructured features or to use a predefined feature set. The predefined features are extracted by the Python 3 script finnpos-ratna-feats.py. You can pipe text through the script, and it will write its output in the FinnPos five-column data format.
If you want to define a custom feature extraction script, it may be a good idea to make your own copy of finnpos-ratna-feats.py and modify it. The script is located in FinnPos/bin/.
The output of all feature extractors should conform to the FinnPos data format.
The output of a feature extraction script for test data may have empty label and lemma fields, marked with '_'. If the label field is not empty but instead contains several label candidates separated by spaces, finnpos-label will use the tagger model to disambiguate between the candidates.
    $ cat input
    The
    dog
    barks
    .

    The
    cat
    meows
    .
    $ cat input | ./mini-feat-extractor.py
    The	WORD=The LC_WORD=the	_	_	_
    dog	WORD=dog LC_WORD=dog	_	_	_
    barks	WORD=barks LC_WORD=barks	_	_	_
    .	WORD=. LC_WORD=.	_	_	_

    The	WORD=The LC_WORD=the	_	_	_
    cat	WORD=cat LC_WORD=cat	_	_	_
    meows	WORD=meows LC_WORD=meows	_	_	_
    .	WORD=. LC_WORD=.	_	_	_
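The field conventions described above can be illustrated with a small parser. This is a minimal sketch, not part of FinnPos: the function name parse_line and the dictionary keys are hypothetical, and it assumes that the five columns are separated by tabs and that entries inside a column are separated by single spaces.

```python
EMPTY = "_"  # marker for an empty field, as in the examples above


def parse_line(line):
    """Split one five-column line into its parts.

    Assumption: columns are tab-separated. The function name and the
    returned keys are illustrative only, not a FinnPos API.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    word, features, lemma, label, annotations = fields
    return {
        "word": word,
        "features": [] if features == EMPTY else features.split(" "),
        "lemma": None if lemma == EMPTY else lemma,
        # Several space-separated labels mean "candidates to disambiguate".
        "label_candidates": [] if label == EMPTY else label.split(" "),
        "annotations": annotations,
    }
```

An empty label field yields an empty candidate list, while a field such as "NN VB" yields two candidates for the tagger to choose between.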
Training and development data for finnpos-train must contain exactly one lemma and one label for each word form.
    $ cat input
    The	the	DT
    dog	dog	NN
    barks	bark	VBZ
    .	.	.

    The	the	DT
    cat	cat	NN
    meows	meow	VBZ
    .	.	.
    $ cat input | ./mini-feat-extractor.py
    The	WORD=The LC_WORD=the	the	DT	_
    dog	WORD=dog LC_WORD=dog	dog	NN	_
    barks	WORD=barks LC_WORD=barks	bark	VBZ	_
    .	WORD=. LC_WORD=.	.	.	_

    The	WORD=The LC_WORD=the	the	DT	_
    cat	WORD=cat LC_WORD=cat	cat	NN	_
    meows	meow	VBZ	_
    .	WORD=. LC_WORD=.	.	.	_
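The source of mini-feat-extractor.py is not shown on this page, so the following is only a sketch of how such an extractor might look. It assumes that each input line holds a word form, optionally followed by a lemma and a label, separated by whitespace, and that the output columns are tab-separated. The function names extract and extract_all are hypothetical.

```python
EMPTY = "_"  # marker for an empty field


def extract(line):
    """Turn one input line into a five-column output line.

    Assumption: the input is "word" or "word lemma label", separated
    by whitespace. An empty line is passed through unchanged.
    """
    fields = line.split()
    if not fields:
        return ""
    word = fields[0]
    lemma = fields[1] if len(fields) > 1 else EMPTY
    label = fields[2] if len(fields) > 2 else EMPTY
    # The two features used in the examples above.
    features = " ".join(["WORD=" + word, "LC_WORD=" + word.lower()])
    return "\t".join([word, features, lemma, label, EMPTY])


def extract_all(lines):
    """Apply extract() to every line of an iterable such as sys.stdin."""
    for line in lines:
        yield extract(line.rstrip("\n"))


if __name__ == "__main__":
    # Small demonstration on in-memory data: one training-style line,
    # one test-style line and an empty line.
    for out in extract_all(["The the DT\n", "dog\n", "\n"]):
        print(out)
```

To use the sketch as a filter in a pipeline, the demonstration list would be replaced by sys.stdin (after importing sys).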
Sometimes it may be a good idea to supplement the feature set of finnpos-ratna-feats.py with additional features. For example, when using a morphological analyzer, you might want to add morphological analyses as features.
You can add custom features to the data before running finnpos-ratna-feats.py. Your training file needs to conform to the FinnPos five-column data format. For example:
    The	_	the	DT	_
    dog	IS_ANIMAL	dog	NN	_
    barks	ANIMAL_SOUND	bark	VBZ	_
    .	_	.	.	_

    The	_	the	DT	_
    cat	IS_ANIMAL	cat	NN	_
    meows	ANIMAL_SOUND	meow	VBZ	_
    .	_	.	.	_
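One way to produce a file like the one above is to fill the feature column from a lookup table before the data is passed on. The sketch below is illustrative only: the CUSTOM_FEATURES lexicon and the function add_custom_features are hypothetical, and the code assumes tab-separated columns with space-separated features inside the feature column.

```python
EMPTY = "_"  # marker for an empty field

# Hypothetical lexicon mapping lowercased word forms to extra features.
CUSTOM_FEATURES = {
    "dog": ["IS_ANIMAL"],
    "cat": ["IS_ANIMAL"],
    "barks": ["ANIMAL_SOUND"],
    "meows": ["ANIMAL_SOUND"],
}


def add_custom_features(line, lexicon=CUSTOM_FEATURES):
    """Add lexicon features to the feature column of one line.

    Assumption: the line has five tab-separated columns. Empty lines
    are returned unchanged.
    """
    if not line.strip():
        return line
    fields = line.split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    existing = [] if fields[1] == EMPTY else fields[1].split(" ")
    # Skip features that are already present to avoid duplicates.
    extra = [f for f in lexicon.get(fields[0].lower(), []) if f not in existing]
    merged = existing + extra
    fields[1] = " ".join(merged) if merged else EMPTY
    return "\t".join(fields)
```

Words that are missing from the lexicon keep their feature column as it was, so a line such as the one for "The" stays unchanged.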