Training your own models - mpsilfve/FinnPos GitHub Wiki

Training a tagger

In order to train a tagger, you need annotated training and development data. Both files can use a format where each line contains a word form, a lemma and a label separated by tabs, and consecutive sentences are separated by empty lines. E.g.

The    the    DT
dog    dog    NN
barks  bark   VBZ
.      .      .

The    the    DT
cat    cat    NN
meows  meow   VBZ
.      .      .
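
Before extracting features, it can be useful to check that a data file really follows this layout. The following is a minimal sketch, not part of FinnPos: the function name read_sentences is hypothetical, and it assumes exactly three tab-separated fields per token line and empty lines between sentences, as described above.

```python
from typing import Iterable, List, Tuple

# One token: (word form, lemma, label)
Token = Tuple[str, str, str]


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group tab-separated token lines into sentences.

    Hypothetical helper, not shipped with FinnPos. Empty lines end the
    current sentence; any other line must have exactly three fields.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # Empty line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, "
                f"got {len(fields)}"
            )
        current.append((fields[0], fields[1], fields[2]))
    # The last sentence may not be followed by an empty line.
    if current:
        sentences.append(current)
    return sentences
```

Calling read_sentences(open("training_data", encoding="utf-8")) raises a ValueError with the offending line number if a line does not have three fields, which tends to catch lines that were separated with spaces instead of tabs.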

In addition to the training and development data files, you need a list of frequent word forms in the training data. The definition of "frequent" could be, e.g., all word forms occurring at least 10 times. The default feature extraction script finnpos-ratna-feats.py extracts orthographic features, such as word suffixes and capitalization information, only for rare words, i.e. word forms that are not on this list.
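
One way to build such a list is to count the first column of the training data. The sketch below is an illustration under assumptions, not a FinnPos tool: the function name frequent_word_forms is hypothetical, the counting is case-sensitive, and the threshold of 10 is only the example value mentioned above. The file format that finnpos-ratna-feats.py expects for the list is not specified here, so writing one word form per line is an assumption to verify against the script.

```python
from collections import Counter
from typing import Iterable, List


def frequent_word_forms(lines: Iterable[str], min_count: int = 10) -> List[str]:
    """Return word forms that occur at least min_count times.

    Hypothetical helper, not shipped with FinnPos. Expects lines in the
    word form / lemma / label format with tab separators; empty lines
    (sentence boundaries) are skipped. Counting is case-sensitive, which
    is an assumption.
    """
    counts: Counter = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # The word form is the first tab-separated field.
        counts[line.split("\t")[0]] += 1
    return sorted(form for form, n in counts.items() if n >= min_count)
```

The returned list can then be written to a file such as frequent_word_list, for example with "\n".join(frequent_word_forms(open("training_data", encoding="utf-8"))).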

Once you have your list of frequent word forms, the training and development data can each be piped through finnpos-ratna-feats.py to extract features. The command below shows the training data; the development data is processed the same way to produce the development feature file. Of course, you may also use a custom-made feature extraction script (see finnpos-ratna-feats.py and Feature extraction for details).

$ cat training_data | finnpos-ratna-feats.py frequent_word_list > training_feats

Once you have written a configuration file, you can use finnpos-train to train a model from the training and development feature files.

$ finnpos-train config_file training_feats development_feats your_tagger_file