PORTAGE_sharedTranslatingPostprocessing - SamuelLarkin/LizzyConversion GitHub Wiki

Up PortageII / Translating Previous: RescoringNbestLists Next: DecoderAlgorithms

Translating: Postprocessing

Truecasing

Text output from canoe is lowercase tokenized_text#TokenizedText. The first step in making this nicer to read is truecasing, which tries to restore capitals where appropriate. Assuming language and mapping models tc-lm.en and tc-map.en trained as described in TrueCaseModels#TrueCaseModels, truecasing can be carried out as follows:

truecase.pl -bos -encoding UTF-8 -text=text_en.out \
   -lm=tc-lm.en -map=tc-map.en > text_en.tc

The -bos option forces capitalization of the first letter of each sentence.

The new truecasing method, using source sentence information, requires several other options and models. See truecase.pl -h, the framework and the tutorial for details.

Detokenizing

The second and last step in the processing chain is detokenization, to convert truecased tokenized_text#TokenizedText back to plain_text#PlainText (but preserving the one-sentence-per-line convention). Example for English encoded in utf-8:

udetokenize.pl text_en.out > text_en.txt
udetokenize.pl text_en.tc > text_en.tc.txt

For French, add the option -lang=fr.

Up PortageII / Translating Previous: RescoringNbestLists Next: DecoderAlgorithms