finnpos-ratna-feats.py - mpsilfve/FinnPos GitHub Wiki

Usage

cat input_data | finnpos-ratna-feats.py frequent_word_forms

Purpose

Extract features from tagger input data and from training and development data.

Notes

frequent_word_forms - A list of frequent word forms in the training file (for example, all word forms occurring more than 10 times). One word form per line.
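One way to produce such a list is sketched below. This is only an illustration, not part of FinnPos: the helper name `frequent_word_forms` and its `min_count` parameter are hypothetical, and the sketch assumes the word form is the first tab-separated field of each non-empty line of the training file.

```python
from collections import Counter


def frequent_word_forms(lines, min_count=10):
    """Return the word forms that occur more than min_count times.

    Hypothetical helper (not part of FinnPos). Assumes the word form is
    the first tab-separated field of every non-empty line.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # an empty line only marks a sentence boundary
        counts[line.split("\t")[0]] += 1
    # most_common() orders the word forms from most to least frequent
    return [word for word, count in counts.most_common() if count > min_count]
```

Writing the returned word forms to a file, one per line, gives a file in the layout described above.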

The file input_data may consist of lines with one, three or five tab-separated fields. Sentences should be separated by empty lines.
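The layout described above can be checked before running the script. The following is a minimal sketch, not the script's own reader: `read_sentences` is a hypothetical name, and the sketch assumes that fields are split on single tab characters and that an empty line is the only sentence separator.

```python
def read_sentences(lines):
    """Yield sentences as lists of rows; each row is a list of fields.

    Hypothetical helper (not part of FinnPos). Raises ValueError for a
    line that does not have one, three or five tab-separated fields.
    """
    sentence = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            # An empty line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) not in (1, 3, 5):
            raise ValueError(
                f"line {number}: expected 1, 3 or 5 fields, got {len(fields)}"
            )
        sentence.append(fields)
    # The last sentence may not be followed by an empty line.
    if sentence:
        yield sentence
```

Calling `list(read_sentences(open("input_data", encoding="utf-8")))` would then either return the sentences or report the first malformed line.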

If input_data has one field per line, it is assumed to consist of tokenized plain text. For example:

The
dog
barks
.

The
cat
meows
.

If input_data contains lines with three fields, the fields denote word form, lemma and label, respectively:

The    the    DT
dog    dog    NN
barks  bark   VBZ
.      .      .

The    the    DT
cat    cat    NN
meows  meow   VBZ
.      .      .

If input_data contains lines with five fields, the fields denote word form, features, lemma, label(s) and annotations, and the file should conform to the FinnPos data format. finnpos-ratna-feats.py preserves the features already present in the input file and adds its own features. All other fields remain unaltered.

The    _             the    DT    _
dog    IS_ANIMAL     dog    NN    _
barks  ANIMAL_SOUND  bark   VBZ   _
.      _             .      .     _

The    _             the    DT    _
cat    IS_ANIMAL     cat    NN    _
meows  ANIMAL_SOUND  meow   VBZ   _
.      _             .      .     _
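The three layouts differ only in which fields are present, so a reader can map each line onto one record type. The sketch below illustrates that mapping under stated assumptions: `Token` and `parse_line` are hypothetical names, not part of FinnPos, and field values such as `_` are kept as raw strings rather than interpreted.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """One input line; fields absent from the layout are None."""
    word_form: str
    features: Optional[str] = None
    lemma: Optional[str] = None
    labels: Optional[str] = None
    annotations: Optional[str] = None


def parse_line(line):
    """Map a one-, three- or five-field line onto a Token (hypothetical helper)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) == 1:
        # Tokenized plain text: word form only.
        return Token(word_form=fields[0])
    if len(fields) == 3:
        # Word form, lemma, label.
        word_form, lemma, label = fields
        return Token(word_form=word_form, lemma=lemma, labels=label)
    if len(fields) == 5:
        # Word form, features, lemma, label(s), annotations.
        return Token(*fields)
    raise ValueError(f"expected 1, 3 or 5 fields, got {len(fields)}")
```

For instance, `parse_line("dog\tdog\tNN")` yields a `Token` whose `lemma` is `"dog"` and whose `features` is `None`.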