finnpos-ratna-feats.py - mpsilfve/FinnPos GitHub Wiki
cat input_data | finnpos-ratna-feats.py frequent_word_forms
Extracts features from tagger input data and from training and development data.
frequent_word_forms
- A list of frequent word forms in the training file (for example, all words occurring more than 10 times). One word per line.
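One way to produce such a list is sketched below. This is a minimal illustration, not part of FinnPos: the function name `frequent_word_forms` and the `threshold` parameter are hypothetical, and the sketch assumes the word form is the first tab-separated field on every non-empty line of the training file.

```python
from collections import Counter


def frequent_word_forms(lines, threshold=10):
    """Return word forms occurring more than `threshold` times.

    Assumption: the word form is the first tab-separated field of each
    non-empty line. Empty lines (sentence boundaries) are skipped.
    The result is ordered from most to least frequent.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # sentence boundary, nothing to count
        counts[line.split("\t")[0]] += 1
    return [word for word, n in counts.most_common() if n > threshold]
```

Writing the returned words to a file, one per line, gives a file in the shape described above. The threshold of 10 only mirrors the example in the text; a suitable value depends on the size of the training data.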
The file input_data
may consist of lines with one, three or five tab-separated fields. Sentences should be separated by empty lines.
If input_data
has one field per line, it is assumed to consist of tokenized plain text. For example:
The
dog
barks
.

The
cat
meows
.
If input_data
contains lines with three fields, the fields denote word form, lemma and label, respectively:
The	the	DT
dog	dog	NN
barks	bark	VBZ
.	.	.

The	the	DT
cat	cat	NN
meows	meow	VBZ
.	.	.
If input_data
contains lines with five fields, the fields denote word form, features, lemma, label(s) and annotations, and the file should conform to the FinnPos data format. finnpos-ratna-feats.py
preserves the features in the input file and adds its own features. All other fields remain unaltered. For example:
The	_	the	DT	_
dog	IS_ANIMAL	dog	NN	_
barks	ANIMAL_SOUND	bark	VBZ	_
.	_	.	.	_

The	_	the	DT	_
cat	IS_ANIMAL	cat	NN	_
meows	ANIMAL_SOUND	meow	VBZ	_
.	_	.	.	_
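The three input layouts can be told apart by counting tab-separated fields. The sketch below reads any of them into sentences. It is only an illustration of the format as described on this page, not the parser used by finnpos-ratna-feats.py; the function name `read_sentences` and the dictionary keys are hypothetical.

```python
# Field names per layout, keyed by the number of tab-separated fields.
# The key names are illustrative and are not defined by FinnPos.
FIELD_NAMES = {
    1: ("word_form",),
    3: ("word_form", "lemma", "label"),
    5: ("word_form", "features", "lemma", "label", "annotations"),
}


def read_sentences(lines):
    """Yield each sentence as a list of dicts, one dict per token.

    Sentences are separated by empty lines. A line with a field count
    other than one, three or five raises ValueError.
    """
    sentence = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line == "":
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        names = FIELD_NAMES.get(len(fields))
        if names is None:
            raise ValueError(
                f"line {line_number}: expected 1, 3 or 5 fields, "
                f"got {len(fields)}"
            )
        sentence.append(dict(zip(names, fields)))
    if sentence:
        yield sentence  # last sentence when the input lacks a final empty line
```

Passing an open file object (or `sys.stdin`) as `lines` works, since both iterate line by line. The sketch does not check that all lines in a file share the same layout.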