Training your own models
In order to train a tagger, you need annotated training and development data. You can start with development and training data in a format where each line contains a word form, lemma and label separated by tabs, and consecutive sentences are separated by empty lines. E.g.
The	the	DT
dog	dog	NN
barks	bark	VBZ
.	.	.

The	the	DT
cat	cat	NN
meows	meow	VBZ
.	.	.
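Before training, it can be useful to check that a data file really follows this layout. The following is a minimal sketch, not part of FinnPos: the function name `read_sentences` is hypothetical, and it assumes exactly three tab-separated fields per line with empty lines between sentences, as described above.

```python
import io


def read_sentences(lines):
    """Yield sentences as lists of (word_form, lemma, label) tuples.

    Assumes one token per line with three tab-separated fields and an
    empty line between consecutive sentences (hypothetical helper).
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError("expected 3 tab-separated fields, got %r" % line)
        sentence.append(tuple(fields))
    # The last sentence may not be followed by an empty line.
    if sentence:
        yield sentence


example = (
    "The\tthe\tDT\ndog\tdog\tNN\nbarks\tbark\tVBZ\n.\t.\t.\n"
    "\n"
    "The\tthe\tDT\ncat\tcat\tNN\nmeows\tmeow\tVBZ\n.\t.\t.\n"
)
sentences = list(read_sentences(io.StringIO(example)))
print(len(sentences), "sentences")
```

A line with the wrong number of fields raises an error, which usually points to spaces having been used instead of tabs.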
In addition to the training and development data files, you need a list of frequent word forms in the training data. The definition of "frequent" could be e.g. all word forms occurring at least 10 times. The default feature extraction script finnpos-ratna-feats.py extracts orthographic features, such as word suffixes and capitalization information, only for rare words.
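One way to build such a list is to count the word forms in the first column of the training data and keep those that reach a threshold. The sketch below is an illustration under assumptions: the function name `frequent_word_forms` is hypothetical, and the layout finnpos-ratna-feats.py expects for the list file (for example one word form per line) should be checked against the script itself.

```python
from collections import Counter


def frequent_word_forms(lines, min_count=10):
    """Return word forms that occur at least min_count times.

    Reads the first tab-separated field of every non-empty line.
    Counting is case-sensitive, so "The" and "the" are counted separately.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # empty lines only mark sentence boundaries
        word_form = line.split("\t")[0]
        counts[word_form] += 1
    return sorted(w for w, c in counts.items() if c >= min_count)


# Toy data: "the" occurs 12 times, "cat" only 3 times.
toy_lines = ["the\tthe\tDT\n"] * 12 + ["\n"] + ["cat\tcat\tNN\n"] * 3
frequent = frequent_word_forms(toy_lines, min_count=10)
print("\n".join(frequent))
```

The threshold of 10 follows the example given above; a smaller training set may call for a lower value.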
Once you have your list of frequent word forms, the training and development data can be piped through finnpos-ratna-feats.py to extract features. Of course, you may also use a custom-made feature extraction script (see finnpos-ratna-feats.py and Feature extraction for details).
$ cat training_data | finnpos-ratna-feats.py frequent_word_list > training_feats
$ cat development_data | finnpos-ratna-feats.py frequent_word_list > development_feats
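For readers considering a custom feature extraction script, the kind of orthographic features mentioned above (word suffixes and capitalization) can be sketched as follows. This is only an illustration: the function name, the feature strings and the maximum suffix length are hypothetical and do not reproduce the actual output of finnpos-ratna-feats.py.

```python
def orthographic_features(word_form, max_suffix_len=4):
    """Return simple suffix and capitalization features for one word form.

    The feature names used here are made up for illustration.
    """
    feats = []
    # Suffixes of length 1 up to max_suffix_len (or the word length).
    for n in range(1, min(max_suffix_len, len(word_form)) + 1):
        feats.append("SUFFIX=" + word_form[-n:])
    # Flag word forms that start with an upper-case letter.
    if word_form[:1].isupper():
        feats.append("CAPITALIZED")
    return feats


print(orthographic_features("Barks"))
# ['SUFFIX=s', 'SUFFIX=ks', 'SUFFIX=rks', 'SUFFIX=arks', 'CAPITALIZED']
```

Suffix features let a tagger generalize to word forms it has not seen often, which is why they matter mostly for rare words.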
Once you have written a configuration file, you can use finnpos-train to train a model.
$ finnpos-train config_file training_feats development_feats your_tagger_file