finnpos-train - mpsilfve/FinnPos GitHub Wiki

Usage

finnpos-train config_file train_file dev_file model_output_file

Purpose

Train your own morphological taggers and lemmatizers.

Notes

config_file - Used to control the properties of the tagger. Its format is specified below.

train_file - Training file. Used to estimate model parameters.

dev_file - Development file. Used for early stopping and estimation of other hyper-parameters.

The training file and development file need to conform to the FinnPos data format, and you need to run feature extraction before training. If you do not want to write your own feature extraction script, you can use finnpos-ratna-feats.py.

Config file format

Every line in the config file has to contain a valid attribute name and its value separated by "=", e.g.

max_train_passes=50

Additionally, config files may contain empty lines and comment lines prefixed with "#".
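The format rules above can be sketched as a small Python reader. This is only an illustration of the described format, not the parser FinnPos itself uses; the function name `parse_config` is hypothetical, and stripping whitespace around names and values is an assumption that the real tool may not share.

```python
def parse_config(text):
    """Parse 'name=value' lines, skipping empty lines and '#' comment lines.

    Hypothetical helper: values are returned as strings and are not
    checked against the list of valid attribute names.
    """
    settings = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Empty lines and comment lines are allowed and ignored.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        # Every remaining line must contain a name and a value separated by "=".
        if not sep or not name.strip() or not value.strip():
            raise ValueError(f"line {line_no}: expected name=value, got {raw!r}")
        settings[name.strip()] = value.strip()
    return settings


example = """\
# Config file for FinnTreeBank tagger.
guess_mass=0.999

max_train_passes=3
"""
print(parse_config(example))
```

A reader like this can be used to sanity-check a config file before starting a long training run.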

Valid attributes and their possible values:

estimator. The estimator used to train the CRF model. Currently, only the value AVG_PERC (averaged perceptron) is supported. Default AVG_PERC.
inference. The inference criterion. Currently, only the value MAP (maximum a posteriori assignment) is supported. Default MAP.
suffix_length. The maximal suffix length used in feature extraction during lemmatization. The value should be a positive integer. Default 10.
degree. The degree of structured features used by the tagger. Currently, the only value supported is 2. Default 2.
max_train_passes. The maximal number of training passes during estimation of tagger parameters. The value should be a non-negative integer. Default 50.
max_lemmatizer_passes. The maximal number of training passes during estimation of lemmatizer parameters. The value should be a non-negative integer. Default 50.
max_useless_passes. Determines the maximal number of passes over the training data that do not improve the tagging accuracy for the development data. Used in early stopping of tagger and lemmatizer estimation. Default 3.
guess_mass. A generative label guesser is used to prune the label candidates considered for each word during training. guess_mass is a float in the range (0, 1.0) which determines the mass preserved by the guesser for each word. Default 0.99.
beam_mass. FinnPos uses an adaptive beam to prune search histories during beam search. This parameter determines the probability mass of the beam. The value should be a float in the range (0, 1.0). Default 0.99.
beam. Use a fixed beam width instead of an adaptive beam. The value should be a positive integer.
use_label_dictionary. Whether to limit the labels of a word to the labels seen in the training data. Not relevant during training.
guess_count_limit. A hard limit on the number of label guesses a word can receive. This speeds up tagging but accuracy may degrade slightly.
regularization. Unimplemented.
delta. Unimplemented.
sigma. Unimplemented.
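To give an intuition for guess_mass and beam_mass, the following Python sketch shows one common way to prune candidates by probability mass: keep the most probable candidates until their cumulative probability reaches the threshold. This is an assumption about what "mass preserved" means, not a description of FinnPos internals; the function `prune_by_mass`, the labels and the probabilities are all hypothetical.

```python
def prune_by_mass(probs, mass):
    """Keep the most probable candidates until their total probability >= mass.

    Hypothetical illustration of mass-based pruning; FinnPos may implement
    guess_mass and beam_mass differently.
    """
    # Rank candidates from most to least probable (ties keep input order).
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, total = [], 0.0
    for label, p in ranked:
        kept.append(label)
        total += p
        if total >= mass:
            break
    return kept


# Made-up label distribution for a single word.
guesses = {"NOUN": 0.5, "VERB": 0.25, "ADJ": 0.125, "ADV": 0.125}
print(prune_by_mass(guesses, 0.7))  # ['NOUN', 'VERB']
print(prune_by_mass(guesses, 0.8))  # ['NOUN', 'VERB', 'ADJ']
```

Under this reading, a higher mass keeps more candidates per word, which trades speed for a lower risk of pruning away the correct label.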

Example

A configuration file

# Config file for FinnTreeBank tagger.
guess_mass=0.999
beam_mass=0.999
max_train_passes=3
max_lemmatizer_passes=7

Training a model

$ finnpos-train config train_data dev_data model
Reading training data.
Training label guesser.
Reading development data.
Setting label guesses.
Estimating lemmatizer parameters.
  Train pass 1:
    Dev acc: 96.7268%
  Train pass 2:
    Dev acc: 97.2469%
  Train pass 3:
    Dev acc: 97.3336%
  Train pass 4:
    Dev acc: 97.4409%
  Train pass 5:
    Dev acc: 97.4574%
  Train pass 6:
    Dev acc: 97.4656%
  Train pass 7:
    Dev acc: 97.4739%
  Train pass 8:
    Dev acc: 97.4533%
  Train pass 9:
    Dev acc: 97.4615%
  Train pass 10:
    Dev acc: 97.4698%
  Final dev acc: 97.4739%
Estimating tagger parameters.
  Train pass 1
15296 of 15296
    Dev acc: 92.1768%
  Train pass 2
15296 of 15296
    Dev acc: 92.5689%
  Train pass 3
15296 of 15296
    Dev acc: 92.5991%
  Train pass 4
15296 of 15296
    Dev acc: 92.4724%
  Train pass 5
15296 of 15296
    Dev acc: 92.4121%
  Train pass 6
15296 of 15296
    Dev acc: 92.4121%
  Final dev acc: 92.5991%
Storing model.
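
Reading the log

In the log above, both estimation phases stop three passes after the pass with the best development accuracy (lemmatizer: best at pass 7, stops after pass 10; tagger: best at pass 3, stops after pass 6). This is what an early stopping rule with max_useless_passes=3 would produce. The Python sketch below replays the lemmatizer accuracies from the log through such a rule. It is a minimal illustration under assumptions: the function name is hypothetical, the limits are the documented defaults, and the exact rule FinnPos applies (for example, how ties are counted) may differ.

```python
def replay_early_stopping(dev_accuracies, max_passes=50, max_useless_passes=3):
    """Replay a sequence of dev accuracies through a simple early stopping rule.

    Returns (passes_run, best_pass, best_accuracy). Hypothetical sketch:
    a pass counts as useless if it does not strictly improve on the best
    dev accuracy seen so far.
    """
    best_acc = float("-inf")
    best_pass = 0
    useless = 0
    passes_run = 0
    for n, acc in enumerate(dev_accuracies[:max_passes], start=1):
        passes_run = n
        if acc > best_acc:
            # Improvement: remember it and reset the useless-pass counter.
            best_acc, best_pass, useless = acc, n, 0
        else:
            useless += 1
            if useless >= max_useless_passes:
                break
    return passes_run, best_pass, best_acc


# Lemmatizer dev accuracies copied from the log above.
lemmatizer_dev_acc = [96.7268, 97.2469, 97.3336, 97.4409, 97.4574,
                      97.4656, 97.4739, 97.4533, 97.4615, 97.4698]
print(replay_early_stopping(lemmatizer_dev_acc))  # (10, 7, 97.4739)
```

The reported "Final dev acc" in the log matches the best accuracy found this way, not the accuracy of the last pass.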