Tagging and Lemmatization - mpsilfve/FinnPos GitHub Wiki

Tagging without a morphological analyzer

You can start out with a tokenized file where each line contains one word form and sentences are separated by empty lines. For example:

A
dog
barks
.

A
cat
meows
.
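If your text is already split into sentences and tokens, the input layout shown above can be produced and read back with a few lines of Python. This is a minimal sketch, not part of FinnPos: the function names `write_tokenized` and `read_sentences` are hypothetical helpers introduced here for illustration.

```python
def write_tokenized(sentences):
    """Render sentences (lists of tokens) as one token per line,
    with an empty line between sentences."""
    return "\n\n".join("\n".join(tokens) for tokens in sentences) + "\n"


def read_sentences(lines):
    """Group one-token-per-line input back into sentences.
    An empty line ends the current sentence."""
    sentences, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # last sentence may not be followed by an empty line
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    example = [["A", "dog", "barks", "."], ["A", "cat", "meows", "."]]
    print(write_tokenized(example), end="")
```

The output of `write_tokenized` can be saved to a file and used as the tokenized input for the pipeline described next.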

Pipe the input through the same feature extraction script you used to process the training and development data (for example finnpos-ratna-feats.py). You can then pipe the output to finnpos-label.

$ cat tokenized_input | finnpos-ratna-feats.py freqent_word | finnpos-label model_file > tagged_input

The tagged text will be in FinnPos data format.

Tagging with a morphological analyzer

You can use a morphological tagger for disambiguation, that is, the tagger will choose the correct analysis from the set of analyses returned by a morphological analyzer for each word. Here the analyzer acts as a hard constraint. You can also use a morphological analyzer as a soft constraint. In this case, the morphological labels given by the analyzer are used as features. If your analyzer is good, you should probably use both soft and hard constraints.

Apply your morphological analyzer before feature extraction. Start with input data:

A
dog
barks
.

Add labels from a morphological analyzer. You can add them both as label alternatives and features (morphological labels will only be useful as features if the training data also contained label features):

A      FEAT:DT            _   DT       _
dog    FEAT:NN FEAT:VBN   _   NN VBN   _
barks  FEAT:NNS FEAT:VBZ  _   NNS VBZ  _
.      FEAT:.             _   .        _
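The annotation step above can be sketched in Python as follows. This is only an illustration under stated assumptions: `TOY_ANALYSES` is a hypothetical stand-in for a real morphological analyzer, `annotate` is a hypothetical helper, the five columns are assumed to be tab-separated, and writing `_` for words the analyzer does not know is also an assumption.

```python
# Hypothetical analyzer output: word form -> candidate labels.
# A real setup would query a morphological analyzer instead.
TOY_ANALYSES = {
    "A": ["DT"],
    "dog": ["NN", "VBN"],
    "barks": ["NNS", "VBZ"],
    ".": ["."],
}


def annotate(token, analyses):
    """Build one line with the analyzer labels added twice:
    as FEAT: features (soft constraint) and as label
    alternatives (hard constraint)."""
    labels = analyses.get(token, [])
    # Assumption: "_" marks an empty field when there are no analyses.
    features = " ".join("FEAT:" + label for label in labels) or "_"
    alternatives = " ".join(labels) or "_"
    # Columns: word form, features, "_", label alternatives, "_".
    return "\t".join([token, features, "_", alternatives, "_"])


if __name__ == "__main__":
    for token in ["A", "dog", "barks", "."]:
        print(annotate(token, TOY_ANALYSES))
```

To use the analyzer only as a soft constraint, keep the `FEAT:` column and write `_` in the label alternatives column; to use it only as a hard constraint, do the opposite.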

Pipe the result through your feature extraction script and then through finnpos-label.

The script ftb-label uses a morphological analyzer. You can modify it to utilize a custom analyzer instead. The script is located in FinnPos/bin/.
