PORTAGE shared: Training Language Models
Up: PortageII / Models Next: OtherModels
'''Note:''' this section of the user manual presents all the different language models you can use, but not how best to use them. The steps required to train PORTAGE shared following our current recommendations are automated in our experimental framework. See tutorial.pdf in the framework for details.
- Training an LM using SRILM
- Training an LM using MITLM
- Training an LM using IRSTLM
- The BinLM format
- The TPLM format
- Dynamic Mapping LM
- Open Vocabulary LM
- Mixture Model LM
- New! Coarse LM
- New! Coarse BiLM
Language models can be trained using any language modeling toolkit that generates Doug Paul's ARPA LM file format. If your licensing requirements permit it, we recommend SRILM, since that's the toolkit that works best for us. If you can't use SRILM, MITLM works very well too. IRSTLM also works reasonably well, but has yielded lower BLEU scores in our experiments. We provide instructions for these three toolkits below, and more complete examples in the experimental framework. If you use a different toolkit, follow its own instructions.
SRILM's default is for the corpus to be contained in a single text file, so the easiest thing to do is to concatenate everything. The format is standard tokenized text (see TokenizedText) - ''but without angle-bracket markup''! Aligned text (see AlignedText) will also work, but will give different models, because multiple sentences can occur on a single line.
Examples:
ngram-count -text corpus.en -lm lm.en
ngram-count -interpolate -kndiscount -text corpus.en -lm lm.en
The first example uses default options, thus producing a Good-Turing model; the second uses options suggested by Philipp Koehn. Which is better seems to depend on the corpus. For more information on SRILM, see their own documentation, or do ''man ngram-count'' if you have already installed the package.
If you have sufficient computing resources, especially memory, you might want to use 4- or 5-gram language models instead of the default 3-gram model generated by the examples above. To do so, add, e.g., -order 4 to the command. Using a 4-gram model is highly recommended. Using a 5-gram model tends to be worthwhile mostly with very large corpora (in the billions of words).
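For example, adding -order 4 to the Kneser-Ney command shown earlier yields a 4-gram model (file names are illustrative):
```
ngram-count -order 4 -interpolate -kndiscount -text corpus.en -lm lm.en
```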
To save space, just like with PORTAGE shared programs, simply add the .gz extension to any input or output filename, and the SRILM software will compress/decompress it on the fly. Be warned, however, that the process might crash when it tries to open its output file if you're at the limit of your memory resources; to avoid this problem, make sure you have twice as much swap space as RAM, or produce models uncompressed and compress them manually afterwards.
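For example, both the input corpus and the output model can be compressed on the fly (a minimal sketch; file names are illustrative):
```
ngram-count -order 4 -interpolate -kndiscount -text corpus.en.gz -lm lm.en.gz
```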
If your corpora are very large, see VeryLargeLanguageModels for information on pushing the limits of SRILM.
If your licensing requirements don't allow you to use SRILM, you can train your language models with MITLM instead, and get comparable machine translation results in the end.
This command builds a 4-gram LM using MITLM:
estimate-ngram -order 4 -smoothing ModKN -text corpus.en -write-lm lm.en
Please refer to the MITLM manual for more details.
Since PORTAGE shared only reads the ARPA LM text format, and not IRSTLM's quantized LM format, we can't take advantage of all the benefits of the work put into IRSTLM. But this toolkit can also generate a standard ARPA LM file, which can be used in PORTAGE shared.
This procedure builds a 4-gram LM using IRSTLM:
add-start-end.sh < corpus.en > corpus.en.marked
build-lm.sh -p -n 4 -s kneser-ney -i corpus.en.marked -o corpus.en.ilm.gz
compile-lm --text yes corpus.en.ilm.gz lm.en
gzip lm.en
Please refer to the IRSTLM manual for more details.
PORTAGE shared supports a binary format for language models. Models converted to this format can load almost an order of magnitude faster than standard ARPA LM files, so this is a good way to significantly speed up the loading of decoder models. Note that although our Bin LM format might be similar in philosophy to SRILM's, IRSTLM's and Moses's binary LM file formats, it is based directly on our own in-memory structures, and is therefore incompatible with those formats.
To create a Bin LM file from a standard ARPA LM file:
arpalm2binlm lm.en lm.en.binlm.gz
or
arpalm2binlm lm.en.gz lm.en.binlm.gz
All programs in PORTAGE shared which use LM files support this format. Bin LM files are recognized as such automatically, regardless of file name or extension, and they can be kept compressed on disk, like almost all files used by PORTAGE shared.
PORTAGE shared also supports the Tightly Packed Language Model (TPLM) format everywhere an LM file is accepted.
To convert an LM in ARPA format to the TPLM format:
arpalm2tplm.sh lm.gz lm.tplm
This creates the folder lm.tplm, containing several files that must be kept together. The TPLM should be referred to by the name of this folder in your canoe.ini or anywhere else you want to use it.
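For example, the folder name is what you pass to any LM-aware program, such as the lm_eval program described further down this page (a minimal sketch; the test file name is illustrative):
```
lm_eval lm.tplm test.en
```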
It is possible to train language models that dynamically substitute certain classes of words for others internally before calculating probabilities. This can be a good way to treat entities like numbers, which are not a closed class, and which behave alike. We currently have three kinds of dynamic mapping models implemented: simple number mapping, prefix number mapping and case mapping.
To train a dynamic mapping model, you first apply the mapping to your language model training corpus, and then train a language model in the usual way on the result. In this example, ''apply-map'' stands for the particular mapping program you're using (see below), and we use SRILM's ngram-count, but you could use any LM toolkit to do the equivalent:
''apply-map'' < corpus.en > corpus.mapped
ngram-count -interpolate -kndiscount -text corpus.mapped -lm lm.en.mapped
The simple number mapping model will replace all numbers (anything containing solely digits and punctuation) with tags in which each digit is replaced by a '@', e.g., 123.45 -> @@@.@@. The map_number program performs this mapping for training:
map_number -input corpus.en -map simpleNumber > corpus.mapped
The prefix number mapping model will replace digits which are prefixes of a word by '@'. map_number can also perform this mapping for training:
map_number -input corpus.en -map prefixNumber > corpus.mapped
The lowercasing model will replace all upper case letters by their lower case equivalents. It can work with various locales and encodings in order to apply the case mapping correctly.
Any software that performs lowercasing appropriately can perform this mapping for training; PORTAGE shared includes several options: lc-latin1.pl and lc-utf8.pl perform lowercasing on iso-8859-1 or utf-8 text, respectively, and utf8_casemap can perform various case mapping operations on utf-8 data, including lowercasing:
lc-latin1.pl < corpus.en.latin1 > corpus.en.latin1.lc
lc-utf8.pl < corpus.en.utf8 > corpus.en.utf8.lc
utf8_casemap -c l < corpus.en.utf8 > corpus.en.utf8.lc
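To complete the picture, here is a minimal sketch of training the mapped LM on the lowercased corpus; the file names match the DynMap examples below but are otherwise illustrative:
```
lc-utf8.pl < corpus.en.utf8 > corpus.en.utf8.lc
ngram-count -interpolate -kndiscount -text corpus.en.utf8.lc -lm lm.en.utf8.lc
```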
You can use a dynamic mapping LM in any PORTAGE shared program that expects a language model, for example in the [lmodel-file] section of a canoe.ini file, as the lmfile argument to the lm_eval program, etc. Instead of the LM's actual filename, you provide a string that describes the mapping with this syntax: DynMap;<MAP>;<LMFilename>. For example:
DynMap;simpleNumber;lm.en.mapped
DynMap;prefixNumber;lm.en.mapped
DynMap;lower;lm.en.utf8.lc
DynMap;lower-fr_CA.iso88591;lm.en.latin1.lc
The "lower" mapping assumes utf-8 encoding by default, but the last example above shows how to specify the latin-1 encoding instead, which will work correctly on cp-1252 data as well.
The utf-8 case mapping is done via the ICU library, so that functionality will not work unless you compiled/installed PORTAGE shared with ICU. (The compilation and installation defaults leave out ICU; see the instructions in INSTALL to change this default.) The utf8_casemap program also requires ICU, but the lc-utf8.pl script does not, so you can still lowercase utf-8 data without ICU.
Dynamic mapping models can be tested by calculating their perplexity using the lm_eval program, e.g.:
lm_eval -q 'DynMap;simpleNumber;lm.en.mapped' test.en
Note that there is nothing special about the LM file created and used here. It is the DynMap; prefix that triggers the dynamic mapping, and it is up to the user to ensure that this prefix is appropriate (i.e., corresponds to the mapping used during training). The LM itself can be any kind of LM that PORTAGE shared supports, in ARPA or binary format, compressed or not. It can also be a mixture model LM, should that actually make sense in your experiments, or even an embedded dynamic mapping LM specification.
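Following the DynMap;<MAP>;<LMFilename> syntax above, an embedded specification would presumably look like the line below, where the inner LM was trained on a corpus that was lowercased and then number-mapped. This is an untested sketch; verify that nesting is parsed this way in your installation:
```
DynMap;simpleNumber;DynMap;lower;lm.en.lc.mapped
```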
An open-vocabulary LM is one that includes an estimated probability for unseen words, i.e., OOV's. In SRILM, the -unk switch can be used to generate such an LM. In PORTAGE shared, these are now automatically detected and supported everywhere by default. With closed-vocabulary LM's, PORTAGE shared assigns a very small probability to OOV's (ALMOST_0), whereas with simple open-vocabulary ones, p(<unk>) is used as found in the LM itself.
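For example, training an open-vocabulary LM with SRILM (a minimal sketch; corpus and model names are illustrative):
```
ngram-count -order 4 -interpolate -kndiscount -unk -text corpus.en -lm lm-openvoc.en
```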
There are also "full" open-vocabulary LM's, which provide not only a unigram probability for <unk>, but also probabilities for <unk> in various contexts. These are supported by the LM classes, but are not automatically detected nor used by any program in PORTAGE shared. Currently, if you use one, it will be treated as a simple open-voc LM, i.e., the unigram probability it provides for <unk> will be used, but any other information it provides about <unk> will be ignored. Should you need to use these properly, some coding changes will be required. Note that the dynamic mapping capability described in the previous section could also be used to support "full" open-vocabulary models.
A dynamic mixture LM is a linear word-level mixture of regular ngram LM's adapted for translating a specific source text. The training procedure is as follows:
- Split the parallel training corpus into "components", each corresponding to a source/target file pair.
- Train source and target ngram language models for each component, using the standard procedure described above; a sketch follows this procedure. The models may be in either ARPA format or PORTAGE shared binary format.
- Use the program mx-calc-distances.sh to generate component distances for a given source text from the source-side language models. For example:
mx-calc-distances.sh -v -d cmpts.src/ -e .lm.gz em components \
srcfile > distances
where the cmpts.src directory contains the source-side component LM's, e.g., cmpt1.lm.gz, cmpt2.lm.gz, cmpt3.lm.gz; components is a file that lists the components, one per line, e.g.:
cmpt1
cmpt2
cmpt3
srcfile is the current source file, and distances will contain a distance for each component, one per line.
- Convert the distances into weights by normalizing, e.g.:
mx-dist2weights -v normalize distances > weights
- Finally, create the mixture LM by associating each target-side component LM with the corresponding weight:
mx-mix-models.sh -d cmpts.tgt/ -e .binlm.gz mixlm weights \
components srcfile > srcfile.mixlm
where cmpts.tgt is a directory containing the target-side LM's, and the final mixture model srcfile.mixlm is just a text file associating each of these LM's with a weight, e.g.:
cmpts.tgt/cmpt1.binlm.gz 0.2
cmpts.tgt/cmpt2.binlm.gz 0.5
cmpts.tgt/cmpt3.binlm.gz 0.3
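As promised in step 2 above, here is a minimal sketch of training and converting the per-component LM's, assuming the components are named cmpt1..cmpt3 and their corpora are available as cmpt*.src and cmpt*.tgt (all names are illustrative):
```
mkdir -p cmpts.src cmpts.tgt
for c in cmpt1 cmpt2 cmpt3; do
    # source-side LM, used by mx-calc-distances.sh
    ngram-count -order 4 -interpolate -kndiscount -text $c.src -lm cmpts.src/$c.lm.gz
    # target-side LM, converted to BinLM format for the mixture
    ngram-count -order 4 -interpolate -kndiscount -text $c.tgt -lm cmpts.tgt/$c.lm.gz
    arpalm2binlm cmpts.tgt/$c.lm.gz cmpts.tgt/$c.binlm.gz
done
```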
The mixture model's name should contain srcfile to distinguish it from models adapted to other source texts. It must have the extension .mixlm in order to be recognized as a mixture model by the decoder. If it does, it can be used as-is for decoding, provided the paths cmpts.tgt/cmpt*.binlm.gz are accessible from the location in which the decoder is run. The framelab experimental framework contains scripts for training mixlm's that ensure such access and that work with multiple source files.
A frequently useful modification to the above procedure is to include the whole parallel training corpus as an additional component.
There are many other ways to use mixture models within PORTAGE shared. See the README file in src/adaptation for details (or doc/README.adaptation), and Foster and Kuhn (2007) for a high-level description.
The coarse LM is a new model introduced with PORTAGE shared 3.0 which can improve translations by taking into account longer-distance information during decoding. Instead of modelling sequences of words, as regular LMs do, coarse LMs model sequences of word classes. Since word-class sequences are much less sparse than word sequences, we can reasonably train 8-gram coarse LMs and maintain good decoding speed while getting a useful boost in quality (Stewart et al. 2014).
To train coarse LMs, you must use our framework. They are now enabled by default, with two models being trained: one 200-class coarse LM and one 800-class coarse LM; we found empirically that combining these two granularities gives the best results.
The coarse BiLM feature is a language model that takes into account both the source and target language, looking at coarse classes of words instead of individual words, potentially improving translation quality (Stewart et al. 2014). Training these models requires the framework, as it is fairly complex to do. With PORTAGE shared 3.0, we don't enable them by default, because they are fairly expensive and don't yield enough benefits. But you can still try them by uncommenting the line that reads USE_BILM = 1.
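If we assume the flag lives in the framework's Makefile.params (verify against your copy of the framework), enabling coarse BiLMs would look like this:
```
# in the framework's Makefile.params: remove the leading '#' to enable coarse BiLM training
USE_BILM = 1
```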
Up: PortageII / Models Next: OtherModels