PORTAGE shared Annotated Bibliography
Up: PortageII Previous: ProgrammingGuidelines Next: RequirementsForTrainingAMidSizeSystem
- A Good Textbook about SMT
- Introductory Papers about SMT
  - Brown et al 1993
  - Goodman 2001
  - PHARAOH Manual
  - BLEU Paper
  - Koehn et al 2003
  - Marcu and Wong 2002
  - Och and Ney 2004
  - Och and Ney 2002
  - Och 2003
  - Och, Ueffing and Ney 2001
  - Ueffing, Och and Ney 2002
  - Knight 1999
  - Germann et al 2001
  - JHU Workshop 2003
  - Och and Ney 2003
- Specialized Papers about SMT
  - Badr et al 2007
  - Bisazza et al 2011
  - Chen et al 2009
  - Cherry and Foster 2012
  - Cherry et al 2012
  - Foster et al 2006
  - Foster and Kuhn 2007
  - Foster and Kuhn 2009
  - Fraser and Marcu 2007
  - Galley and Manning 2008
  - Germann et al 2009
  - Guzman et al 2009
  - He 2007
  - Huang and Chiang 2007
  - Johnson et al 2007
  - Koehn et al 2007
  - Kuhn and De Mori 1990-2
  - Kumar and Byrne 2004
  - Liang et al 2006
  - Mangu et al 1999
  - Niehues et al 2011
  - Paul et al 2009
  - Simard and Isabelle 2009
  - Ueffing and Ney 2005
  - Zens and Ney 2004
  - Zens and Ney 2006
- Papers about features added in PortageII 3.0
A Good Textbook about Statistical Machine Translation (SMT)
Philipp Koehn is one of the most highly respected researchers in SMT. Along with Franz Josef Och, he developed the first phrase-based SMT systems; this phrase-based approach now dominates the field of SMT. Koehn is the author of PHARAOH (phrase-based decoding software - see below under "Introductory Papers about SMT") and the principal contributor to Moses (a widely-used, complete open-source SMT system). More recently, he has worked on factored models for SMT and on tools for human translators.
In addition to his many technical accomplishments, Koehn is an excellent communicator. His book "Statistical Machine Translation" (Cambridge University Press, 2010) demonstrates this. The book is highly recommended to anyone who wants to understand SMT in general or PORTAGE shared in particular. It contains the following chapters:
Chapter 1: Introduction
Describes the history of machine translation from the immediate post-World War II period to the present; discusses applications of machine translation, from rough translations suitable for browsing (gisting), through machine translation as an aid to human translators, to the possibly utopian goal of fully automatic, high-quality translation; and lists resources available for SMT research, including software tools, training data, and test data that can be used to benchmark performance.
Chapter 2: Words, Sentences, Corpora
What is a word? How do we represent the hierarchical structure between words in a sentence? Where can one obtain the parallel corpora required to train an SMT system?
Chapter 3: Probability Theory
Discusses the basic concepts of probability theory required to build probabilistic models of language, such as techniques for estimating a probability distribution from data and joint and conditional distributions.
Chapter 4: Word-Based Models
Describes the work at IBM in the late 1980s and early 1990s that defined several classes of word-based translation models, which were associated with a noisy-channel model of translation. These models are not merely of historical interest - they are still often used in the crucial word alignment step of training a phrase-based system on bilingual data.
Chapter 5: Phrase-Based Models
Most SMT systems in practical use today are phrase-based. In the phrase-based approach, a contiguous sequence of N words S in the source language may be associated with a contiguous sequence of M words T in the target language. Conditional probabilities P(S|T) and P(T|S) are estimated from the training data and combined in a loglinear function along with several other components (such as a language model that assesses the probability of target-language word sequences); this loglinear function is applied to score hypothesized translations for each given source-language input sentence. The chapter also describes another important component of the loglinear function: the reordering model.
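For reference, the loglinear scoring function has the following general form (standard notation, not tied to any particular implementation): the chosen translation maximizes a weighted sum of feature functions.

```latex
\hat{T} \;=\; \arg\max_{T} \sum_{i=1}^{n} \lambda_i \, h_i(S, T)
```

Here each h_i(S, T) is a feature such as log P(S|T), log P(T|S), the language model log-probability of T, or a reordering or length penalty, and each lambda_i is the weight assigned to that feature.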
Chapter 6: Decoding
Decoding is the process which actually carries out translation, by generating hypotheses and scoring them according to the loglinear function described in the previous chapter. This is done by segmenting the input, source-language sentence into a sequence of phrases S such that for each S, there is at least one target-language phrase T such that P(T|S) and P(S|T) are non-zero. A hypothesis is made up of a sequence of such T's, one for each phrase S in the input sentence; the T's don't necessarily have to be in the same order as the S's that generated them. Various heuristics are applied during decoding to ensure that the search space is of reasonable size.
Chapter 7: Language models
The loglinear function used to score hypotheses during decoding always includes at least one language model. Language models for SMT pertain only to the target language; for a given sequence of target-language words (e.g., a translation hypothesis) they indicate whether or not that word sequence is a likely one, on the basis of statistics calculated from a large target-language corpus. N-gram language models break the probability of a word sequence into the product of the probabilities of each individual word, where each individual word probability is conditioned on the identity of the words that precede it. It is usually a bad idea to assign a probability of zero to a word sequence, just because you have not seen that exact sequence in the training data; the chapter describes how smoothing, interpolation, and back-off methods are used to address this problem of sparsity in estimating N-gram language models. The chapter also discusses specialized data structures and algorithms for handling the enormous language models that are often a component of today's most advanced SMT systems.
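As a concrete illustration of the N-gram idea, here is a toy bigram model with simple linear interpolation between bigram and unigram estimates (an illustrative sketch only; it does not reflect the specific smoothing methods or data structures the chapter covers):

```python
from collections import Counter

def train_bigram_lm(sentences, lam=0.7):
    """Toy bigram LM with linear interpolation between bigram and unigram estimates."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def prob(word, prev):
        p_uni = unigrams[word] / total
        p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
        # interpolation avoids assigning zero probability to unseen bigrams
        return lam * p_bi + (1 - lam) * p_uni

    return prob

p = train_bigram_lm(["the cat sat", "the dog sat"])
print(p("sat", "cat"))   # seen bigram: high probability
print(p("dog", "cat"))   # unseen bigram: small but non-zero
```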
Chapter 8: Evaluation
Evaluation is one of the most hotly debated topics in the SMT community. For most pattern recognition tasks (e.g., automatic speech recognition) it is clear how to evaluate the output obtained for a given input, but that's not true for machine translation. Given the same input sentence, several different human translators may each produce excellent translations that are so different they don't have even a single word in common. Thus, it's hard to say whether the translation produced by an SMT system for a given input sentence is good or not, without having a human being evaluate it (and even then one human evaluator's judgment may differ from that of another). Unfortunately, human evaluation is very expensive. SMT researchers have thus defined a variety of automatic metrics for judging machine translation quality. Typically, a test document is translated by one or more human translators; the similarity between these "reference" translations and the output of the SMT system is then measured, using the number of overlapping N-grams and similar statistics. These automatic metrics are imperfect and can be fooled in individual cases, but tend to correlate with human judgment when measured over a test text of sufficient length. This chapter describes several different automatic metrics (including the most famous one, known as "BLEU") and how to assess the statistical significance of a given score.
Chapter 9: Discriminative Training
Discriminative training aims at training a system in a way that minimizes some measure of translation error - for instance, one of the automatic metrics described in the previous chapter. In particular, the assignment of weights to the components of the loglinear scoring function for a phrase-based SMT system is carried out by a form of discriminative training. One of the key steps in the process is generation of a data structure that represents a large number of possible translations for each input sentence, such as an N-best list or a word lattice; the discriminative training algorithm searches for component weight settings that favour the best translations stored in these structures. The chapter also describes related work on posterior methods (such as minimum Bayes risk decoding) that choose the translations that are most similar to all other translations in an N-best list or word lattice. It also discusses confidence estimation methods and methods for combining outputs from different systems.
Chapter 10: Integrating Linguistic Information
This chapter handles some special cases where specialized linguistic information may profitably be applied, such as modules for transliterating foreign names to a different writing system. The chapter also discusses modules that carry out morphological analysis, syntactically-inspired reordering, or syntactically-inspired rescoring of N-best lists of candidate translations; it concludes with a discussion of research on factored translation (which Koehn himself helped to initiate).
Chapter 11: Tree-Based Models
Modern linguistic theories represent sentences using tree structures. This chapter describes synchronous grammars, which represent bilingual sentence pairs as pairs of trees. It also shows how synchronous grammars can be learned from training data and how chart parsing can be applied to carry out tree-based decoding.
Introductory Papers about Statistical Machine Translation (SMT)
The order in which these papers are listed is the order recommended for a reader unfamiliar with SMT.
Brown et al 1993
"The Mathematics of Statistical Machine Translation: Parameter Estimation", by Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer. Computational Linguistics, 1993, V. 19(2), pp. 263-311 (ComputationalLinguistics2000).
This paper is the best available introduction to SMT. It was written by the group of IBM researchers who launched modern SMT in the early 1990s. Shortly afterwards, most of them left IBM to work at one of the earliest hedge funds, Renaissance Technologies. Though Renaissance Technologies is very secretive, it is considered to be one of the most successful funds in the industry (according to the "Economist" magazine of Feb. 17, 2005, it yielded returns of more than 40% for over a decade). Thus, the authors of this paper are undoubtedly far too rich to consider returning to work on SMT. Their departure from research caused SMT to lie fallow for a few years until the statistical approach to MT was revived by German and American researchers in the late 1990s (1997-9).
The paper describes five probabilistic models of translation from one language to another, in increasing order of complexity. The description is mathematically rigorous. These models are now familiarly named IBM models 1, 2, and so on. The IBM models describe one-to-many generation of a word sequence in language F from a word sequence in language E. That is, the models imagine a situation where a word in a sentence in language E may generate zero, one, or more than one words in language F; the many-to-one and many-to-many cases, where a group of words in E generates words in F, are not covered by the models. Roughly speaking, the models can be described as follows. IBM model 1 is a "bag of words": each word in E can generate F words, which appear in random order. IBM model 2 is a "position-dependent bag of words": it is like model 1, except that the probability that a randomly generated F word will appear in a given position in the F sequence is affected by the position in the E sequence of the E word that generated it. IBM model 3 is like model 2, except that it is fertility-based: each individual word on the E side has a parameter called "fertility" that determines whether it is likely to generate many or few F words. IBM model 4 is like model 3, except that F words generated by the same E word are more likely to appear in a contiguous sequence than they are in IBM model 3. IBM model 5 is a more sophisticated version of model 4 that avoids certain absurdities (for instance, it explicitly ensures that two F words can't be assigned the same position).
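For readers who want to see the mechanics, the following is a minimal sketch of expectation-maximization training for IBM model 1, with a single NULL token and uniform initialization (illustrative only, not the authors' implementation):

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """corpus: list of (F_sentence, E_sentence) pairs, each a list of words.
    Returns t[(f, e)] = P(f | e), the Model 1 translation table."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in corpus:
            es_null = ["NULL"] + es               # allow an F word to align to nothing
            for f in fs:
                norm = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    c = t[(f, e)] / norm          # expected count (E-step)
                    count[(f, e)] += c
                    total[e] += c
        for (f, e) in count:                      # re-estimate parameters (M-step)
            t[(f, e)] = count[(f, e)] / total[e]
    return t

corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = train_ibm1(corpus)
print(round(t[("la", "the")], 2))  # converges toward a high value
```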
The major innovation in the field since this paper was published is the development of phrase-based models for SMT (these models originated with Franz Josef Och's "Diplomarbeit" - equivalent to an M.Sc. thesis - submitted to the University of Erlangen-Nürnberg in 1998). These models allow many-to-many translation, in which a group of E words generates a group of F words. For instance, phrase-based models allow the group of French words "cul de sac" to generate the group of English words "dead end" or "blind alley" (rather than the English translation which would be logical if one considered the individual French words, "ass of bag"). However, the IBM models are still used by many SMT groups in an initialization step that generates word-word alignments from a bilingual sentence alignment, prior to extraction of bilingual phrase pairs. Note that IBM model 1 is relatively cheap in computational terms, while models 4 and 5 are computationally expensive.
A note for Canadians: these American researchers at IBM carried out all their research on translation between English and French, because the highest-quality bilingual corpus available for them to do research on was the Canadian Hansard, the official record of Canadian parliamentary proceedings. Thus, the experimental work described in this paper depended entirely on data generated in Canada.
Goodman 2001
"A Bit of Progress in Language Modeling (Extended Version)", by Joshua Goodman. Microsoft Technical Report 2001-72, Aug. 2001 (GoodmanPublications).
The most complete survey available of modern language modeling techniques, which are based on N-grams. Although this paper is extremely well-written, in a sardonic style, a reader who is primarily interested in SMT may find that it contains far more information than he or she really needs. Unless the reader is deeply interested in language modeling, we recommend that he or she read just enough of the introductory sections to understand the basic concept of N-gram language modeling, and then stop.
PHARAOH Manual
"PHARAOH - a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models (User Manual and Description)" by Philipp Koehn. USC Information Sciences Institute, Dec. 2003 (PharaohWebPage).
Explains in readable fashion how the decoder of a typical phrase-based SMT system works. The decoder is the system component that actually carries out translation. It does this by growing each target-language hypothesis in continuous left-to-right order from sequences of words called phrases in the source sentence. Each phrase consists of a contiguous sequence of source-language words, but the phrases themselves may be "bitten off" in any order (as far as the source is concerned). For instance, the first few words at the start of a target hypothesis might be translations of words 3-5 in the source, the next few words in the hypothesis might be translations of source words 6-7, the ones after that translations of source words 1-2, and so on. Thus translation is strictly left-to-right as far as the target hypothesis is concerned (making it easy to obtain an N-gram target language model score from a partial hypothesis), but quite free as far as the source is concerned (except that once a group of source words has been translated, it cannot be used again - the words have been consumed). The paper also contains a short introduction to phrase-based SMT, describing the state of the art at the time it was written. We recommend reading this paper before others, published earlier and listed below, because it is much easier to understand phrase-based SMT once you know how it is implemented in a practical system.
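The "any order on the source side, strict left-to-right on the target side" behaviour can be illustrated with a toy exhaustive search over a coverage set (a sketch only; PHARAOH itself uses beam search with hypothesis stacks, recombination, and future-cost estimation, none of which is shown here):

```python
import math

def toy_decode(src, phrase_table, lm_score):
    """Enumerate every way of covering src with phrases from phrase_table,
    building the target left to right; return the best-scoring translation.
    phrase_table maps a tuple of source words to [(tuple of target words, log prob), ...]."""
    best = (-math.inf, None)

    def expand(covered, target, score):
        nonlocal best
        if len(covered) == len(src):
            total = score + lm_score(target)
            if total > best[0]:
                best = (total, " ".join(target))
            return
        for i in range(len(src)):            # any uncovered source span may be "bitten off" next
            for j in range(i, len(src)):
                span = set(range(i, j + 1))
                if span & covered:
                    continue
                for tgt, logp in phrase_table.get(tuple(src[i:j + 1]), []):
                    expand(covered | span, target + list(tgt), score + logp)

    expand(set(), [], 0.0)
    return best

pt = {("le",): [(("the",), -0.1)],
      ("chat", "noir"): [(("black", "cat"), -0.2)],
      ("chat",): [(("cat",), -0.4)],
      ("noir",): [(("black",), -0.4)]}
# best translation is "the black cat", with score about -0.3
print(toy_decode(["le", "chat", "noir"], pt, lambda words: 0.0))
```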
BLEU Paper
"BLEU: a Method for Automatic Evaluation of Machine Translation" by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Proc. ACL, July 2002 (ACL02Proceedings).
The practical development of an SMT system requires metrics for MT quality that can be applied automatically. The SMT community's solution to this problem has been to develop metrics that measure the similarity of the computer-generated translations for the sentences in a given source text with reference translations for the same sentences generated by a professional human translator. These automatic metrics typically compare the choices of words, bigrams, trigrams and so on made by the SMT system in the target language with those made by the human translator. One of the most frequently employed metrics of this type is called BLEU, originally defined in this paper. New variants of BLEU appear quite often, so always verify that the version you are using for experiments is the appropriate one.
NOTE: all metrics of the same type as BLEU share a potential pitfall: suppose that the single reference translation of a sentence into English is "The man was very rude", system A's translation was "He expressed himself with utter tactlessness", and system B's translation was "The man was very polite". Under any metric based on N-gram overlaps, A will receive the lowest possible score, while B will receive a high score. To minimize the impact of this kind of faulty scoring, there should ideally be several different reference translations, made by translators with different writing styles.
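A toy single-sentence, single-reference version of the idea looks like this (illustrative only: real BLEU is a corpus-level measure and its exact definition differs from this add-one-smoothed sketch):

```python
import math
from collections import Counter

def toy_bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
        r_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        # add-one smoothing so one missing n-gram order does not zero the whole score
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))   # penalize short candidates
    return bp * math.exp(log_prec)

# the "polite" translation scores well despite reversing the meaning
print(toy_bleu("the man was very polite", "the man was very rude"))
# the acceptable but differently worded translation scores very low
print(toy_bleu("he expressed himself with utter tactlessness", "the man was very rude"))
```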
Koehn et al 2003
"Statistical Phrase-Based Translation" by Philipp Koehn, Franz Josef Och, and Daniel Marcu. NAACL, 2003 (NAACL03Proceedings).
This paper defines phrase-based SMT. It also describes the "diag-and" phrase extraction algorithm that almost all research groups (including NRC) now use to extract phrase pairs from sentence-aligned bilingual training data for phrase-based SMT. The algorithm, which is somewhat complex, begins by training IBM models, which are used to generate word alignments between words in each source sentence and its matching target sentence. Phrase pairs are extracted by analyzing the patterns formed by groups of word alignments. There are some interesting research results over several language pairs showing that performance appears to be relatively insensitive to word alignment quality and to increases in the maximum phrase length allowed (above three words); allowing only syntactically motivated phrases appears to degrade performance.
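The consistency check at the heart of phrase-pair extraction can be sketched as follows (a simplified version: it works from a single symmetrized word alignment and omits the extension to unaligned boundary words that the full procedure performs):

```python
def extract_phrases(src, tgt, alignment, max_len=7):
    """Extract all phrase pairs consistent with a word alignment.
    alignment: set of (i, j) pairs meaning src[i] is aligned to tgt[j]."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions aligned to any source word in the span [i1, i2]
            tps = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            if j2 - j1 + 1 > max_len:
                continue
            # consistency: no target word inside [j1, j2] may align outside [i1, i2]
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = ["le", "chat", "noir"]
tgt = ["the", "black", "cat"]
alignment = {(0, 0), (1, 2), (2, 1)}
for pair in extract_phrases(src, tgt, alignment):
    print(pair)
# prints ('le', 'the'), ('le chat noir', 'the black cat'),
#        ('chat', 'cat'), ('chat noir', 'black cat'), ('noir', 'black')
```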
Marcu and Wong 2002
"A Phrase-Based, Joint Probability Model for Statistical Machine Translation" by Daniel Marcu and William Wong. EMNLP, 2002 (EMNLP02Proceedings).
This paper, by researchers at ISI, proposes another phrase extraction algorithm, different from Koehn's "diag-and" procedure. Unlike "diag-and", it does not involve training IBM models as a first step. Somewhat more elegant than "diag-and" in conception, it does not appear to work as well in practice, and is rarely used.
Och and Ney 2004
"The Alignment Template Approach to Statistical Machine Translation" by Franz Josef Och and Hermann Ney. Computational Linguistics, V. 30, no. 4, Oct. 2004 (ComputationalLinguistics2004).
This paper was an early step on the road to phrase-based SMT. In 1998, the first author (Och) had originated the idea underlying phrase-based SMT, with its potential for modeling many-to-many relations between words, in his Diplomarbeit at the University of Erlangen-Nürnberg. The "alignment template" system described in the paper performed significantly better than others in the 2002 NIST Chinese-English MT evaluation, leading to widespread interest in this class of models among other researchers.
Och and Ney 2002
"Discriminative Training and Maximum Entropy Models for Statistical Machine Translation" by Franz Josef Och and Hermann Ney. Proc. ACL, July 2002 (ACL02Proceedings).
Shows how information sources for SMT can be treated as feature functions, making it easy to add new information sources. A key reference for people who want to understand how weights on the components of a loglinear combination model can be set so as to maximize the logprob of the model on a given set of data.
Och 2003
"Minimum Error Rate Training in Statistical Machine Translation" by Franz Josef Och. Proc. ACL, July 2003 (ACL03Proceedings).
In this paper, Franz Josef Och extends the discussion in the previous paper to show how to optimize the weights of a loglinear combination with respect to an arbitrary error criterion. The results in this paper show, for instance, how the weights in a loglinear model for SMT can be set so as to yield a locally optimal BLEU score. The experiments described in the paper show that significantly better results can be obtained if the criterion of system performance (e.g., BLEU) is taken into account during training. The original Portage system's weight optimization modules (COW and the rescoring module) were based on ideas expressed in this paper.
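In symbols, minimum error rate training chooses the weight vector that minimizes a corpus-level error count (for example 1 - BLEU) of the translations the decoder would produce under those weights (standard formulation):

```latex
\hat{\lambda}_1^M = \arg\min_{\lambda_1^M} \; \mathrm{Err}\big(\{r_k\}_{k=1}^{K}, \{\hat{e}(f_k;\lambda_1^M)\}_{k=1}^{K}\big),
\qquad
\hat{e}(f_k;\lambda_1^M) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m \, h_m(e, f_k)
```

Here the f_k are the source sentences of the tuning set, the r_k their reference translations, and the h_m the loglinear feature functions.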
Och, Ueffing and Ney 2001
"An Efficient A* Search Algorithm for Statistical Machine Translation" by Franz Josef Och, Nicola Ueffing, and Hermann Ney. Proc. Data-Driven Machine Translation Workshop, July 2001 (WorkshopProceedings).
This paper argues for using the A* search algorithm for SMT decoding; it was a step towards today's SMT search algorithms. One of its contributions was to insist on the importance of including future costs in search, which today's SMT systems do.
Ueffing, Och and Ney 2002
"Generation of Word Graphs in Statistical Machine Translation" by Nicola Ueffing, Franz Josef Och, and Hermann Ney. EMNLP, July 2002 (EMNLP02Proceedings).
The first paper that described how SMT decoding can generate word graphs (also known as word lattices). Word graphs had been used in other domains (e.g., in automatic speech recognition) but not in SMT. The paper describes how the search algorithm should be designed so that word graphs are generated - for instance, it shows how hypotheses are recombined during search. It also describes how word graphs can be rescored, and how N-best lists can be extracted from word graphs. Finally, it compares word graphs and N-best lists in terms of oracle error rates - i.e., how good is the best hypothesis contained in a word graph, vs. the best hypothesis in an N-best list? The metric employed is word error rate (BLEU hadn't been invented yet). Incidentally, the system used for the paper wasn't phrase-based: it was a word-based system that used IBM model 4.
Knight 1999
"Decoding Complexity in Word-Replacement Translation Models" by Kevin Knight. "Squibs and Discussion" column, Computational Linguistics, V. 25, no. 4, 1999 (ComputationalLinguistics99).
Proves that word-replacement machine translation (e.g., machine translation employing IBM model 1) is NP-complete (if arbitrary word reordering is allowed).
Germann et al 2001
"Fast Decoding and Optimal Decoding for Machine Translation" by Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Proc. ACL, 2001 (ACL01Proceedings).
Looks at the question: do the errors made by today's SMT systems arise mainly because the best translations were not generated during greedy search, or because they were seen but wrongly discarded because of defective models? The paper concludes, on the basis of experimental evidence for a system based on IBM model 4, that there is a higher proportion of model errors than search errors. Thus, if you believe the experimental evidence in the paper, you will conclude that the compromises embedded in the greedy search algorithms in today's SMT systems to deal with the NP-completeness of the task (see Knight 1999, above) are acceptable, and SMT research should focus on better models.
JHU Workshop 2003
"Final Report of the Johns Hopkins 2003 Summer Workshop on Syntax for Statistical Machine Translation (revised version)" by Franz Josef Och, Daniel Gildea, et al. Feb. 2004 (JHUWorkshop03).
The proceedings of a JHU workshop designed to find ways of using syntactic information to improve SMT performance, attended by the top SMT researchers. The most striking result of the workshop was that the most brilliant minds in the field were unable to show any improvement yielded by syntax. The only major improvement in BLEU score came from the incorporation of information from IBM model 1 (an entirely statistical, non-syntactic model). A possible explanation for the failure of syntax to help performance is that in the workshop, all new information sources were introduced as features for rescoring, rather than for decoding. Thus, perhaps syntax was applied too late in the translation process to help. Alternatively, perhaps syntax is useless for SMT - a conclusion favoured by Franz Josef Och.
Och and Ney 2003
"A Systematic Comparison of Various Alignment Methods" by Franz Josef Och and Hermann Ney. Computational Linguistics, 2003 (ComputationalLinguistics03).
This paper looks at a variety of statistical and heuristic methods for computing word alignments, including the five IBM word alignment models and the hidden Markov model (HMM) approach to word alignment. The paper also presents a new model, which the authors call Model 6 and which is a log-linear combination of an HMM and IBM Model 4; this model yields the best results among those compared. The paper shows the superiority of all the statistical methods to the heuristic ones tried; among the statistical methods, those incorporating both first-order dependence between word positions and a fertility model work best. The evaluation criterion used in the paper is alignment error rate (AER) measured against two hand-aligned corpora, one drawn from the German-English Verbmobil task and one from the French-English Hansard task. The paper provides an excellent, clear summary of the word alignment techniques it describes. Its weakness is evaluation: subsequent work has shown that AER is unreliable as an indication of the quality of translations produced by the SMT system trained on a given set of alignments, and the paper presents no direct comparison of translation quality yielded by each of the techniques described.
Specialized Papers about Statistical Machine Translation (SMT)
These specialized papers are listed in alphabetical order by first author. They were chosen for inclusion here because they are related to recent features in PORTAGE shared.
Badr et al 2007
"Manageable Phrase-Based Statistical Machine Translation Model" by Ghada Badr, Eric Joanis, Samuel Larkin, and Roland Kuhn. Proc. of CORES Conference on Computer Recognition Systems, Oct. 2007 (FullPaperPDF, ExpandedTechReportPDF).
A state-of-the-art phrase-based SMT system contains one or more N-gram language models (LMs) and one or more phrase translation models (TMs), which together can have a memory footprint of several gigabytes. This paper suggests keeping an N-gram only if all the words it contains can occur together in the translation of at least one source sentence. Typical LM filtering uses one global vocabulary; this technique efficiently keeps track of a separate vocabulary for each input sentence to translate. In experiments on a large Chinese-English task, the technique yielded significant reductions in the amount of information loaded during translation (up to 58% reduction for LMs, and up to 75% reduction for TMs).
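The per-sentence criterion can be sketched in a few lines (hypothetical helper code with toy data; the real filtering operates on compiled LM and phrase-table files rather than Python sets):

```python
def filter_ngrams(ngrams, sentence_vocabularies):
    """Keep an n-gram only if all of its words can co-occur in the translation
    of at least one input sentence (per-sentence vocabularies, not one global set)."""
    kept = []
    for ngram in ngrams:
        words = set(ngram.split())
        if any(words <= vocab for vocab in sentence_vocabularies):
            kept.append(ngram)
    return kept

# sentence_vocabularies[i] = all target words reachable from phrase pairs
# that match source sentence i (hypothetical toy data)
vocabs = [{"the", "black", "cat", "sat"}, {"a", "dog", "ran"}]
print(filter_ngrams(["black cat", "black dog", "a dog"], vocabs))
# -> ['black cat', 'a dog']
```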
Bisazza et al 2011
"Fill-up versus Interpolation Methods for Phrase-based SMT Adaptation" by Arianna Bisazza, Nick Ruiz, and Marcello Federico. IWSLT 2011.
This paper discusses an adaptation strategy where there is a single phrase table derived from both in-domain and out-of-domain corpora. Phrase pairs from out-of-domain corpora are included in the table, with their counts, only if the phrase pair was NOT observed in-domain. An indicator feature keeps track, for each phrase pair, of whether it's in-domain or out-of-domain. The paper describes a cascaded variant with multiple out-of-domain corpora, and other variants (e.g., where out-of-domain phrase pairs are inserted only if the source phrase is not in the original in-domain table, or only if the source phrase has four or fewer words in it). The results in the paper for this "fill-up" approach are good. However, that may largely be due to the approach having fewer features than its competitors, combined with MERT's difficulties with many features.
Chen et al 2009
"Phrase Translation Model Enhanced with Association based Features" by Boxing Chen, George Foster and Roland Kuhn. Machine Translation Summit XII, 2009 (MTSummitProceedings).
This paper suggests several "association" feature functions for scoring phrase pairs, based for the most part not on explicit alignment between the two phrases but on their occurrence in aligned sentences. That is, if a source phrase S and a target phrase T co-occur in aligned sentences more often than chance would predict, the (S,T) phrase pair will be "rewarded" by the association features. The association features are highly correlated; it turns out that there is a gain in using two of them rather than one, but no gain for using more than two. Using two of the association features, Chinese-English performance improvements for a small data condition are in the range 0.7-0.8 BLEU, and for a large data condition 0.6-0.7 BLEU.
Cherry and Foster 2012
"Batch Tuning Strategies for Statistical Machine Translation" by Colin Cherry and George Foster. NAACL 2012.
See "Cherry and Foster 2012 revisited" below for a second presentation of this paper, adding details about the sparse features included in PortageII 3.0.
In the opinion of the writer (R. Kuhn, June 2012) the work described in this paper will eventually be seen as one of the most important contributions of NRC's team to statistical machine translation (SMT). SMT systems utilize several different information sources, traditionally called "features", to tell them which possible translations of a given source-language sentence into the target language are good and which are bad. Each of the features contributes to the overall score of a translation hypothesis. Since some features provide very accurate information about translation quality, and others only weak information, the features are assigned fixed weights before translation starts. To perform well, the assignment of weights to features must reflect the quality of the information from each feature.
The algorithm most commonly used for the assignment of weights to features ("tuning") is MERT. MERT works well for tuning a small number of features, but for more than around 20 or 30 features, it becomes confused and assigns unreliable weights; it also becomes too computationally expensive. The practical effect has been that when one adds new features - even potentially powerful ones - to a system that already has 30 features and is being tuned with MERT, there is little or no improvement, and at a certain point tuning becomes impractically slow. The current paper examines eight tuning strategies: two different versions of MERT, three other strategies from the literature, and three new strategies. Experiments involving these eight tuning strategies are carried out on three different language pairs (English-French, French-English, and Chinese-English) for three different settings: a "Small" setting with 7 features, a "Medium" setting with 18 features, and a "Big" setting with sparse Boolean features added to "Medium", for a maximum of 6,848 features. The two versions of MERT perform well for "Small", do slightly worse than the other six strategies for "Medium", and are too computationally expensive for "Big".
Overall, a new algorithm called "batch lattice MIRA" which is a hybrid of some of the other approaches, emerges as the winner in the "Medium" and "Big" settings: it can be computed in a reasonable amount of time even when there are many features, it almost always yields the best-performing system, and its performance is stable (low standard deviation over a set of randomized trials). Subsequent to the work described in this paper, batch lattice MIRA was used to tune NRC's Chinese-English and Arabic-English submissions to NIST Open MT 2012. These systems performed extremely well, in large part because batch lattice MIRA allowed them to exploit a large number of features productively. To sum up, this paper not only explores the properties of many tuning strategies from the literature in an extremely thorough set of experiments, it also describes a new strategy that has more desirable characteristics than any of the older strategies.
Cherry et al 2012
"On Hierarchical Re-ordering and Permutation Parsing for Phrase-based Decoding" by Colin Cherry, Robert Moore and Chris Quirk. WMT 2012.
This paper describes the type of Hierarchical Lexicalized Distortion Model (HLDM) currently implemented in PORTAGE shared. The concept of an HLDM goes back to the 2008 paper "A Simple and Effective Hierarchical Phrase Reordering Model" by Galley & Manning (see Galley and Manning 2008, below). The basic idea is to model the reordering of words that occurs during translation not by tracking the movement of the actual phrases into which a source sentence was segmented, but by tracking the movement of the longest possible phrases that COULD have been used to generate the translation. This "keeping track" is done by a permutation parser. The current paper is highly technical; much of it is recommended reading only for people who are deeply interested in parsing. The permutation parser described here is a more efficient version of the parser underlying Galley & Manning's HLDM; it also removes the possibility of mismatch between behaviour during HLDM training and use of the HLDM during translation (this mismatch can happen with the Galley & Manning HLDM in certain scenarios). In practice, there is little difference between the BLEU results obtained from the Galley & Manning version of HLDM and the new version presented here, but the latter requires slightly less computation. The paper also shows how the new type of HLDM can correctly enforce ITG constraints that were not always enforced by an older "coverage vector" approach (this aspect of the paper will not be of great interest to typical users of PORTAGE shared).
Foster et al 2006
"Phrasetable Smoothing for Statistical Machine Translation" by George Foster, Roland Kuhn and Howard Johnson. Proc. Empirical Methods in Natural Language Processing (EMNLP) 2006 (EMNLP06Proceedings).
When estimating probabilities from data, it is often a good idea to "smooth" the estimated probabilities to account for randomness in the observations - especially if there have not been many observations. For instance, suppose I know there are four possible types of ducks in a certain region (A-D), and in the course of an afternoon, I have observed 10 different ducks on a lake in that region: 5 were of type A, 4 were of type B, and 1 was of type C. On the basis of these observations, the unsmoothed estimated probabilities of the different types would be 50% for type A, 40% for type B, 10% for type C, and 0% for type D. This is clearly unreasonable - on the basis of these 10 observations, would you be willing to bet $100 that if you went back to the lake the next afternoon, you will not observe any type D duck, or that exactly half the ducks you observe will be of type A? Several smoothing methods exist for obtaining more reasonable probability estimates in cases like this, where the number of observations isn't much bigger (or is perhaps even smaller) than the number of different things that could possibly be observed. Application of a smoothing method to the duck data would yield an estimate for the probability of A that was lower than 50%, and a probability for D that was higher than 0%.
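For example, applying simple add-one (Laplace) smoothing to the duck counts (just one of many smoothing schemes, and not the phrase-table methods the paper studies) gives:

```latex
P(A) = \frac{5+1}{10+4} \approx 0.43,\quad
P(B) = \frac{4+1}{10+4} \approx 0.36,\quad
P(C) = \frac{1+1}{10+4} \approx 0.14,\quad
P(D) = \frac{0+1}{10+4} \approx 0.07
```

The probability of the frequently observed type A drops below 50%, and the unobserved type D receives a small non-zero probability.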
The two key components of a phrase-based SMT system like PORTAGE shared are the target language model, which assigns scores to proposed sequences of target-language words, and the phrase table, which stores counts of co-occurrences of source-language and target-language phrases. Data stored in the phrase table yields conditional probability estimates - for instance, if source phrase s1 occurs 3 times, once co-occurring with target phrase t1 and twice with target phrase t2, one might estimate P(t1|s1) as 33%, and P(t2|s1) as 67%. There is a large literature about smoothing for language modeling. Strangely, this Foster et al. 2006 paper appears to be the first systematic study of methods for smoothing probabilities estimated from phrase tables (even though the number of observations per phrase in the phrase table is typically very small, implying that phrase table smoothing is highly desirable). Several different smoothing methods are tried; some are "black-box" (they only use counts of phrases and phrase pairs), some are "glass-box" (they decompose the source and target phrases into their constituent words). Two experimental settings were used: European language pairs with small training corpora, and Chinese-to-English translation with a large corpus. In both settings, phrase table smoothing led to an improvement of at least 1 BLEU point.
Foster and Kuhn 2007
"Mixture-Model Adaptation for SMT" by George Foster and Roland Kuhn. Proc. Second SMT Workshop, 2007 (WMT07Proceedings).
This paper examines how to adapt an SMT system to a new domain using mixture models, with mixture weights that depend on the distance between the source text to be translated and the corpora used to train the mixture components. Many variations of this approach are investigated. The best methods achieve gains of approximately one percentage point over a state-of-the-art non-adapted baseline system.
Foster and Kuhn 2009
"Stabilizing minimum error rate training" by George Foster and Roland Kuhn. Proc. Fourth SMT Workshop, 2009 (WMT09Proceedings).
The method most commonly used for determining feature weights in an SMT system is Och's minimum error rate training (MERT) procedure. Unfortunately, MERT tends to be unstable, especially when the number of features is large or when the features are highly correlated. This paper analyzes variations in MERT's behaviour by supplying different random seeds to a core component of MERT, Powell's algorithm. It looks at several strategies for obtaining more reliable MERT results, the simplest of which is to run MERT several times with different random seeds and then pick the weights that give the best result on a held-out corpus. The paper also proposes some other approaches to improving MERT, which also yield good results and are less expensive than the simple approach.
Fraser and Marcu 2007
"Measuring Word Alignment Quality for Statistical Machine Translation" by Alexander Fraser and Daniel Marcu. Computational Linguistics 33(3) (ComputationalLinguistics07).
A frustrating aspect of training SMT systems is that the quality of the word alignments produced in the early stages of training does not seem to correlate with the quality of translations produced by the system. The authors argue that this is due to use of AER as a measure of the quality of word alignment; according to them, AER favours precision at the expense of recall. They argue instead for the use of the F-measure, showing experimentally that if the alpha parameter of this measure is properly tuned, the F-measure for a set of word alignments correlates quite well with the BLEU score of the resulting SMT system.
Galley and Manning 2008
"A Simple and Effective Hierarchical Phrase Reordering Model" by Michel Galley and Christopher Manning. EMNLP 2008.
This paper builds on earlier work on lexicalized reordering models in which each phrase being added to a translation hypothesis is classified as having one of three possible orientations with respect to the previous phrase: monotone (M), swap (S), or discontinuous (D) (sometimes, the D orientation is split into left discontinuous and right discontinuous). The probability of each of these orientations can be estimated in a way that depends on the words of the current phrase or of the previous phrase (on either the source or the target side), hence the description of these models as "lexicalized". The problem with this earlier work is that the orientation assigned to a new phrase depends in a rather arbitrary way on the segmentation of the source sentence. Let us consider two French-to-English SMT systems that have identical phrase tables except that the first one has a phrase table with the phrase pair ("chat noir", "black cat") and the second does not, though its phrase table contains the pairs ("chat", "cat") and ("noir", "black"). Suppose both systems have successfully decoded the sentence "le chat noir a mangé". The first system decoded "le" as "the", "chat noir" as "black cat", and "a mangé" as "ate". Thus, all phrases were M with respect to the preceding phrase. The second system decoded "le" as "the" (M), then inserted "black" for "noir" - a D operation. Then it reached backwards to insert "cat" for "chat" - an S operation. Then it inserted "ate" for "a mangé" - a D operation. Thus, the sequence of orientations for the first system is M M M, and for the second system M D S D. It is difficult to believe that a classification scheme that produces such dissimilar sequences of orientations for two very similar cases is the best way of modeling reordering.
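The word-based orientation classification criticized here can be written down in a few lines (a simplified sketch: the initial phrase is scored against the sentence start, and many real models further split D into left- and right-discontinuous):

```python
def orientation(prev_src_span, cur_src_span):
    """Classify the orientation of the current phrase relative to the previous one.
    Spans are (start, end) source-word index pairs, inclusive."""
    prev_start, prev_end = prev_src_span
    cur_start, cur_end = cur_src_span
    if cur_start == prev_end + 1:
        return "M"   # monotone: current phrase immediately follows the previous one
    if cur_end == prev_start - 1:
        return "S"   # swap: current phrase immediately precedes the previous one
    return "D"       # discontinuous

# "le chat noir a mangé" -> "the black cat ate", second system from the example:
# "le"(0,0) -> "noir"(2,2) -> "chat"(1,1) -> "a mangé"(3,4)
spans = [(0, 0), (2, 2), (1, 1), (3, 4)]
print([orientation(p, c) for p, c in zip(spans, spans[1:])])  # ['D', 'S', 'D']
```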
The solution proposed here is to classify orientations based on the longest phrases that MIGHT have been used to cover the source sentence in the past. In this model, by the time the second system in the example translates "a mang�" as "ate", "chat" is part of a block of translated words that includes "noir", so the insertion of "ate" counts as M. The authors point out that the older approach often ended up with a very high proportion of D orientations - e.g., 80% in their Chinese-English data - and thus D ends up being an uninformative default category. The new Hierarchical Lexicalized Distortion Model (HLDM) described here usually increases the proportion of M and S orientations. Experiments in the paper show that HLDM yields significant improvements for Arabic-English and Chinese-English.
Germann et al 2009
"Tightly Packed Tries: How to Fit Large Models into Memory, and Make them Load Fast, Too" by Ulrich Germann, Eric Joanis, and Samuel Larkin. Proc. NAACL HLT Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 2009 (SETQA_NLP_Proceedings).
This paper describes tightly packed tries, a data structure used by PORTAGE shared to store language models and phrase tables - including gigantic ones - as compactly as possible; they can also be used to store many other types of NLP data. Tightly packed tries store data more compactly than flat text files compressed with the gzip utility. They support memory mapping and thus allow very short load times and sharing of models between processes. This paper may be read in conjunction with Paul et al 2009 (below), which describes Portage Live: a VM version of PORTAGE shared made possible by the use of tightly packed tries.
Guzman et al 2009
"Reassessment of the Role of Phrase Extraction in PBSMT" by Francisco Guzman, Qin Gao and Stephan Vogel. MT Summit XII, 2009.
This paper studies the relationship between the number of unaligned words in a phrase pair and translation quality. As one would expect, phrase pairs with more unaligned words tend to be of lower quality. The authors introduce two new phrase-pair features: the number of unaligned source words and the number of unaligned target words. Introducing these two features improved BLEU by up to 2 points.
He 2007
"Using Word Dependent Transition Models in HMM based Word Alignment for Statistical Machine Translation" by Xiaodong He. ACL07 2nd SMT Workshop (WMT07Proceedings).
In a traditional HMM for word alignment, the HMM transition probability depends only on the jump width from the last state to the next state. This paper proposes that transition probabilities be conditioned on the identity of the current source word. In order to compensate for data sparsity, a maximum a posteriori (MAP) framework is proposed, in which word-dependent transition probabilities are smoothed with the word-independent probabilities. The authors obtain a 13% reduction in AER on a Hansard task over a conventional HMM baseline. On the English-French track of the NAACL 2006 Europarl evaluation workshop (688K sentences of training data), the word-dependent HMM yields nearly 1% of BLEU improvement over the baseline HMM, and 0.8% over IBM model 4. However, the improvement is smaller when tested on out-of-domain data: 0.5% improvement over both the baseline HMM and IBM model 4.
Huang and Chiang 2007
"Forest Rescoring: Faster Decoding with Integrated Language Models" by Liang Huang and David Chiang. Proc. ACL, 2007 (ACL07Proceedings).
This paper shows how to achieve an order-of-magnitude speedup in SMT beam-search decoding by avoiding a priori the expansion of nodes in the search graph that would be pruned later anyway. Two methods for doing this are introduced, called cube pruning and cube growing.
Johnson et al 2007
"Improving Translation Quality by Discarding Most of the Phrasetable" by Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. Proc. Conf. on Empirical Methods for Natural Language Processing and Conf. on Computational Natural Language Learning (EMNLP-CONLL), 2007 (EMNLP07Proceedings).
As noted above under Foster et al 2006, the phrase table of a phrase-based SMT system like PORTAGE shared stores counts of co-occurrences of source-language and target-language phrases, from which conditional probability estimates such as P(t1|s1) are derived. This influential paper showed that by applying a statistical significance test (Fisher's exact test), it is possible to discard information about a high proportion of the phrase pairs stored in the phrase table - in some cases, up to 90% of them - without incurring any loss in BLEU score. Sometimes, the pruned phrase table even yields a higher BLEU score than the original one (though this is less likely to happen when smoothing is being applied - see Foster et al 2006, above, for a description of smoothing).
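The significance test is computed from a 2x2 contingency table built from the phrase-pair counts. A sketch using SciPy's Fisher's exact test (illustrative only; the paper computes the p-values itself and discusses specific pruning thresholds not shown here):

```python
from scipy.stats import fisher_exact

def phrase_pair_p_value(c_st, c_s, c_t, n):
    """p-value that source phrase s and target phrase t co-occur only by chance.
    c_st: co-occurrence count, c_s / c_t: marginal counts, n: number of sentence pairs."""
    table = [[c_st,       c_s - c_st],
             [c_t - c_st, n - c_s - c_t + c_st]]
    _, p = fisher_exact(table, alternative="greater")
    return p

# a pair seen 5 times out of 5 occurrences of each phrase, in 10,000 sentence pairs
print(phrase_pair_p_value(5, 5, 5, 10_000))      # very small p-value: keep
# a pair seen only once, with frequent phrases on both sides
print(phrase_pair_p_value(1, 500, 800, 10_000))  # large p-value: candidate for pruning
```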
Koehn et al 2007
"Edinburgh System Description for the 2005 IWSLT Speech Transcription Evaluation" by Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. MT Eval Workshop 2005 (IWSLT05Proceedings).
Also "Moses: Open Source Toolkit for Statistical Machine Translation" by Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin and Evan Herbst. Proc. ACL Demo and Poster Sessions, 2007 (ACL07Proceedings).
The second of these two papers provides a very brief overview of Moses, the open-source SMT toolkit. It touches on some key features of Moses. These include the ability to create a factored representation of words (e.g., a word might be associated with its part of speech and its lemma), confusion network decoding (which can be used, for instance, when the input to the SMT system comes from a speech recognizer), and the use of extremely efficient data structures for storing the translation model and the language model. It briefly mentions lexicalized distortion, but the first paper is a better reference for that topic. "Lexicalized" distortion is a bit of a misnomer, since it doesn't involve conditioning on individual words but rather conditioning on phrases or phrase pairs. The basic idea is this: a phrase pair in the training data can be classified into three categories with respect to the phrase pair that precedes it, and into three categories with respect to the phrase pair that follows it. In both cases, these categories are monotone, swap (the current phrase pair switched positions with the other one), or discontinuous (the current phrase pair and the other one were originally not contiguous). If during training, we record for each phrase pair its category with respect to the preceding and following phrase pair, we can estimate the probability that during decoding, it will have a given orientation with respect to the preceding and following phrase pairs. Many variations of this idea are also possible; most of them lead to performance improvements in most situations. PORTAGE shared incorporates a version of lexicalized distortion.
Kuhn and De Mori 1990-2
"A Cache-Based Natural Language Model for Speech Recognition" by Roland Kuhn and Renato De Mori. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), June 1990; see also corrections to this article in PAMI, June 1992.
The probability that a word will appear in a running text depends not only on the local N-gram statistics, but also on whether it has occurred earlier in the same text: recently employed words have a higher probability of occurring again soon than the N-gram statistics would predict. The paper proposes a cache-based language model that exploits this effect by boosting the probabilities of recently observed words.
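A cache-based model is typically an interpolation of the static N-gram estimate with a "cache" estimate computed from the recent history of the text (general form only, not the exact model of the paper):

```latex
P(w \mid h) \;=\; \lambda \, P_{\text{N-gram}}(w \mid h) \;+\; (1 - \lambda)\, P_{\text{cache}}(w),
\qquad
P_{\text{cache}}(w) = \frac{\text{count of } w \text{ in the last } K \text{ words}}{K}
```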
Kumar and Byrne 2004
"Minimum Bayes-Risk Decoding for Statistical Machine Translation" by Shankar Kumar and William Byrne (2004). Proc. HLT-NAACL 2004 (HLT-NAACL04Proceedings).
The authors show how to carry out Minimum Bayes Risk (MBR) decoding. The aim of this approach is to minimize the expected loss under loss functions that measure translation error. Thus, MBR decoding can be used to tune SMT performance for a specific metric, such as BLEU. This can be done in the context of rescoring N-best lists, resulting in selection of the translation hypothesis that is closest on average to all the other hypotheses, with closeness measured by the metric of interest. Thus, MBR rescoring results in selection of a consensus translation.
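Over an N-best list, MBR selection with BLEU as the similarity measure can be written as follows (general formulation):

```latex
\hat{e} \;=\; \arg\max_{e' \in \mathrm{NBest}(f)} \; \sum_{e \in \mathrm{NBest}(f)} \mathrm{BLEU}(e', e)\; P(e \mid f)
```

The chosen hypothesis is the one that is, on average, most similar to the other hypotheses, weighted by their posterior probabilities.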
Liang et al 2006
"Alignment by Agreement" by Percy Liang, Ben Taskar, and Dan Klein. Proc. HLT-NAACL 2006 (HLT-NAACL06Proceedings).
The authors train two asymmetric HMM models for word alignment between two languages E and F, one in the direction P(F|E) and the other in the direction P(E|F). Instead of just intersecting the predictions of these two models after training, they have devised a training method that encourages agreement of the models during training. They test their joint training method on the English-French NAACL 2003 shared task, involving a training set of 1.1 million Canadian Hansard sentences. Compared to the standard approach of intersecting predictions obtained from independently-trained models, they obtained a 32% reduction in AER. The gain in BLEU was more modest: as measured by training on 100K English-French sentences from Europarl and testing on 3K Europarl sentences, the BLEU score for their method was 0.3051 as compared to 0.3035 obtained by the use of IBM model 4.
Mangu et al 1999
"Finding Consensus Among Words: Lattice-Based Word Error Minimization" by Lidia Mangu, Eric Brill, and Andreas Stolcke. Proc. Eurospeech 1999 (OnlineAbstract).
The paper describes an algorithm for finding the hypothesis in a lattice that minimizes the word error rate (WER). This is done by finding a complete alignment of all the words in the lattice, identifying mutually supporting and competing word hypotheses. A new sentence hypothesis is formed by concatenating the words with maximal posterior probabilities. Expensive!
Niehues et al 2011
"Wider Context by Using Bilingual Language Models in Machine Translation" by Jan Niehues, Teresa Herrmann, Stephan Vogel and Alex Waibel. Proceedings of the 6th Workshop on Statistical Machine Translation, 2011.
This describes Karlsruhe's "bilingual language model". This language model is a clever way of applying extra source-side contextual information (beyond phrase boundaries) at decoding time without disrupting the current architecture of systems like PORTAGE shared. That is, search is the same as in our current system; it is based on phrase pairs. The bilingual language model score, calculated from the sequence of bilingual tokens in the translation hypothesis, is an additional feature function. Each bilingual token consists of a target-language word and the source words it was aligned to in the training data. Source words that aren't aligned to target words are ignored in this model; unaligned target words are just bilingual tokens that happen to have an empty source side. Incorporation of wide-span source-side POS information (potentially much wider than the span of either side of a typical phrase pair) also becomes straightforward with this approach. Improvements for German-English were in the range +0.15-1.0 BLEU (for the word-based version), and for Arabic-English, around +1 BLEU for the word-based version and +1.7 for the (word+POS) version. However, for the WMT 2011 experiments described in the paper, the bilingual LM component only contributed improvement in the range +0.1-0.4 BLEU. An interesting side-effect of using a bilingual LM is that the average length of phrase pairs used for decoding tends to shrink, since the system now has another mechanism for incorporating information about source-side context besides using longer phrase pairs whose statistics may be less reliably estimated than the statistics for short phrases.
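The construction of bilingual tokens can be sketched as follows (toy code with a made-up token format; the actual feature is an N-gram model trained over such token sequences):

```python
def bilingual_tokens(tgt_words, src_words, alignment):
    """One bilingual token per target word: the target word plus the source words
    aligned to it (empty source side for unaligned target words).
    alignment: set of (src_index, tgt_index) pairs."""
    tokens = []
    for j, t in enumerate(tgt_words):
        srcs = [src_words[i] for (i, jj) in sorted(alignment) if jj == j]
        tokens.append(t + "|" + "_".join(srcs))
    return tokens

src = ["la", "maison", "bleue"]
tgt = ["the", "blue", "house"]
alignment = {(0, 0), (2, 1), (1, 2)}
print(bilingual_tokens(tgt, src, alignment))
# -> ['the|la', 'blue|bleue', 'house|maison']
```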
Paul et al 2009
"Portage Live: Delivering Machine Translation Technology via Virtualization" by Patrick Paul, Samuel Larkin, Ulrich Germann, Eric Joanis, and Roland Kuhn. Proc. MT Summit XII, 2009 (MTSummitProceedings).
Virtualization makes it possible to port state-of-the-art machine translation technology, such as PORTAGE shared, to a variety of operating systems with little effort, and to run several instances of the translation decoder on the same machine. This paper describes Portage Live, a virtualized version of PORTAGE shared. Portage Live can run concurrently on a single state-of-the-art desktop or laptop computer with other applications; it can also be used in a network for distributed translation processing. The virtualization of PORTAGE shared was made possible by encoding PORTAGE shared's phrase tables and language models in a data structure called a tightly packed trie. Tightly packed tries are described in Germann et al. 2009 (above); it is strongly recommended that the Portage Live paper be read in conjunction with the Germann et al. paper. Thanks to tightly packed tries, even very large models for translation can be fitted into a reasonable amount of memory; loading of these models is very fast. To summarize, this paper shows how to virtualize PORTAGE shared, allowing fast transfer of new technology to existing and potential customers without significant investments in porting software between operating systems.
Simard and Isabelle 2009
"Phrase-based Machine Translation in a Computer-assisted Translation Environment" by Michel Simard and Pierre Isabelle. Proc. MT Summit XII, 2009 (MTSummitProceedings).
One of the most promising applications of SMT technology is making human translators more productive. Currently, many translators are uncomfortable with SMT, for two reasons: it hasn't been integrated into a computer-assisted translation (CAT) environment, and it wastes the translator's time by often proposing bad translations. This paper tackles both objections to SMT by merging it with a translation memory (TM) - a tool most translators are comfortable with - and by providing a component that filters out SMT-generated translations that are unlikely to be useful. The experiment shows that the merged SMT-TM system outperforms the pure TM on all data sets that were tried, and that the filtering component is very effective at suppressing unhelpful translations.
Ueffing and Ney 2005
"Word-Level Confidence Estimation for Machine Translation using Phrase-Based Translation Models" by Nicola Ueffing and Hermann Ney. Proc. HLT-EMNLP, 2005 (HLT-EMNLP05Proceedings).
This paper describes several different ways of assigning a confidence score to each word in a translation output by a phrase-based SMT system, and compares them experimentally. The most effective proves to be a confidence measure estimated from the phrase table of the SMT system.
Zens and Ney 2004
"Improvements in phrase-based statistical machine translation" by Richard Zens and Hermann Ney. Proc. HLT-NAACL, 2004 (HLT-NAACL04Proceedings).
Describes RWTH's phrase-based system as it was in 2004. Apart from the "relative frequency" estimates of the forward and backward conditional phrase probabilities P(s|t) and P(t|s) based on the counts of co-occurrences of s and t in the training data (where s = source-language phrase, t=target-language phrase), this system incorporates:
- A word penalty feature (count of number of words in a target-language hypothesis)
- A phrase penalty feature (count of number of phrases t in a target-language hypothesis)
- Lexical estimates for P(s|t) and P(t|s) - see below.
The paper describes experiments in which a monotone search strategy (in which the source sentence is translated left-to-right) is compared with normal search (in which reordering of phrases is allowed); monotone search is always faster. As would be expected, monotone search only mildly reduces the quality of translation for language pairs that have rather similar word order (French-English and Spanish-English) but is quite harmful when the two languages have different word order (German-English).
The aspect of this paper that's important for understanding today's PORTAGE shared system is the idea of lexical estimates for P(s|t) and P(t|s). The phrases s and t may only have occurred a few times in the training data; thus, the "relative frequency" estimates of P(s|t) and P(t|s) based on the count of s, the count of t, and the count of co-occurrences of s and t may be quite unreliable. However, if s or t are multi-word phrases, one can also estimate P(s|t) and P(t|s) from the co-occurrence counts of the individual words that make up s and t. This paper proposes a "noisy-or" lexical estimate for P(s|t) and P(t|s); the default setting for PORTAGE shared typically uses this Zens-Ney lexical estimate, as defined in this 2004 paper.
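The following sketch shows one common formulation of the noisy-or lexical estimate; the exact variant implemented in PORTAGE shared may differ in its handling of alignments and smoothing. The word-level table p_word is a hypothetical dictionary of P(source_word | target_word) values estimated from word-aligned training data.

```python
from math import prod

def noisy_or_lexical(src_phrase, tgt_phrase, p_word):
    """Sketch of a noisy-or estimate of P(s|t) for one phrase pair."""
    score = 1.0
    for s in src_phrase:
        # s is "explained" if at least one target word could have produced it
        p_not_explained = prod(1.0 - p_word.get((s, t), 0.0) for t in tgt_phrase)
        score *= 1.0 - p_not_explained
    return score

p_word = {("livre", "book"): 0.8, ("vert", "green"): 0.7, ("vert", "book"): 0.05}
print(noisy_or_lexical(["livre", "vert"], ["green", "book"], p_word))
```

The same computation with the roles of source and target swapped gives the lexical estimate for P(t|s).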
Zens and Ney 2006
"N-Gram Posterior Probabilities for Statistical Machine Translation" by Richard Zens and Hermann Ney. Proc. SMT Workshop, 2006 (WMT06Proceedings).
Given an N-best list of translation hypotheses output by an SMT system, one can estimate language models and sentence length distributions on the list. The resulting posterior probabilities are then used to reorder the hypotheses in the list, or to do a second pass of translation. Instead of an N-best list, a word graph could also have been used. Note that this procedure favours hypotheses that look as much like the other hypotheses as possible - thus, the use of posterior probabilities favours the consensus. The authors claim impressive gains for a Chinese-English task: well over one BLEU percentage point.
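Here is a simplified sketch of estimating n-gram posteriors over an N-best list. It ignores the position-dependent bookkeeping and score scaling used in the paper; the inputs are assumed to be (hypothesis, model log-score) pairs.

```python
from collections import defaultdict
from math import exp

def ngram_posteriors(nbest, n=2, scale=1.0):
    """nbest: list of (hypothesis_words, model_log_score) pairs."""
    weights = [exp(scale * score) for _, score in nbest]
    z = sum(weights)
    post = defaultdict(float)
    for (words, _), w in zip(nbest, weights):
        seen = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        for g in seen:
            post[g] += w / z     # probability mass of hypotheses containing g
    return post

nbest = [("the green book".split(), -2.0), ("the book green".split(), -3.5)]
print(ngram_posteriors(nbest, n=2))
```

Hypotheses are then rescored by how well their n-grams agree with these posteriors, which is why the method favours the consensus.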
Papers about features added in PortageII 3.0
Hopkins and May 2011
"Tuning as Ranking" by Mark Hopkins and Jonathan May, EMNLP 2011, Edinburgh, Scotland (http://www.aclweb.org/anthology/D/D11/D11-1125.pdf).
Most of this paper is devoted to explaining the PRO (Pairwise Ranking Optimization) method for tuning weights on the components of an SMT system. SMT systems utilize several different information sources, traditionally called "features", to tell them which possible translations of a given source-language sentence into the target language are good and which are bad. Each of the features contributes to the overall score of a translation hypothesis. Since some features provide very accurate information about translation quality, and others only weak information, the features are assigned fixed weights before translation starts. To perform well, the assignment of weights to features must reflect the quality of the information from each feature. The process of assigning weights to features is called "tuning".
This paper is one of the "ancestors" of the batch lattice MIRA tuning algorithm employed by PORTAGE shared. If you are interested in learning how PORTAGE shared does tuning, reading the 2012 paper below by members of the NRC team - "Batch Tuning Strategies for Statistical Machine Translation" - is a higher priority than reading this one. Our 2012 paper describes the batch lattice MIRA tuning method used by PORTAGE shared.
However, there is another aspect of this paper by Hopkins & May that is directly relevant to PORTAGE shared. When these researchers designed PRO, their main motivation was to make it possible to exploit thousands of features. MERT, the most commonly-used tuning method at the time the Hopkins-May paper was published, worked well with a handful of features but failed to capture the additional information when more than (roughly) 10 or 20 features were used. The original form of MIRA was capable of exploiting information from hundreds or thousands of features, but had a very complicated architecture. Hopkins and May's PRO tuning method was designed to combine the ease of implementation and use of MERT with the scalability of MIRA (as was our batch lattice MIRA technique, devised the next year).
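For readers who want a feel for the mechanics, here is a heavily simplified sketch of the core PRO step for one source sentence. It is not the authors' implementation: the real method caps the number of sampled and retained pairs and fits an off-the-shelf binary classifier, whereas this sketch uses a perceptron-style update as a stand-in.

```python
import random

def pro_pairs(kbest, n_samples=100, min_gain=0.05):
    """kbest: list of (feature_vector, sentence_bleu). Returns preference pairs."""
    pairs = []
    for _ in range(n_samples):
        a, b = random.sample(kbest, 2)
        if abs(a[1] - b[1]) >= min_gain:          # keep only clearly ordered pairs
            better, worse = (a, b) if a[1] > b[1] else (b, a)
            pairs.append((better[0], worse[0]))
    return pairs

def rank_update(weights, pairs, lr=0.01):
    """Nudge feature weights so the better hypothesis outscores the worse one."""
    for better, worse in pairs:
        if sum(w * f for w, f in zip(weights, better)) <= \
           sum(w * f for w, f in zip(weights, worse)):
            weights = [w + lr * (fb - fw)
                       for w, fb, fw in zip(weights, better, worse)]
    return weights
```

The key point is that tuning is reduced to a binary ranking problem over pairs of hypotheses, which scales to thousands of features.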
The part of the Hopkins & May paper that's directly relevant to today's PORTAGE shared is the description of the sparse features they used (section 5.3). We only use the PBMT (phrase-based MT) features described here, but not the SBMT (syntax-based) ones, because PORTAGE shared is a phrase-based system. The experiments by Hopkins & May used a set-up that allowed up to 6,723 PBMT features in principle, not far off the maximum number of 6,848 features in our "Big Set" experiments in "Batch Tuning Strategies for Statistical Machine Translation" (see below), which are essentially the "hopmay" sparse features in PORTAGE shared.
Cherry and Foster 2012 revisited
"Batch Tuning Strategies for Statistical Machine Translation" by Colin Cherry and George Foster, NAACL 2012, Montr�al, Canada (http://www.aclweb.org/anthology/N12-1047.pdf).
We discuss this paper a second time here to add a focus on elements of it implemented in PortageII 3.0. (See also Cherry_andFoster2012#CherryandFoster2012 above.)
The work described in this paper may eventually be seen as one of the most important contributions of NRC's team to statistical machine translation (SMT). SMT systems utilize several different information sources, traditionally called "features", to tell them which possible translations of a given source-language sentence into the target language are good and which are bad. Each of the features contributes to the overall score of a translation hypothesis. Since some features provide very accurate information about translation quality, and others only weak information, the features are assigned fixed weights before translation starts. To perform well, the assignment of weights to features must reflect the quality of the information from each feature.
The algorithm most commonly used for the assignment of weights to features ("tuning") is MERT. MERT works well for tuning a small number of features, but for more than around 20 or 30 features, it becomes confused and assigns unreliable weights; it also becomes too computationally expensive. The practical effect has been that when one adds new features - even potentially powerful ones - to a system that already has 30 features and is being tuned with MERT, there is little or no improvement, and at a certain point tuning becomes impractically slow. The current paper examines eight tuning strategies: two different versions of MERT, three other strategies from the literature, and three new strategies. Experiments involving these eight tuning strategies are carried out on three different language pairs (English-French, French-English, and Chinese-English) for three different settings: a "Small" setting with 7 features, a "Medium" setting with 18 features, and a "Big" setting with sparse Boolean features added to "Medium", for a maximum of 6,848 features (see below for explanation of sparse features). The two versions of MERT perform well for "Small", do slightly worse than the other six strategies for "Medium", and are too computationally expensive for "Big".
Overall, a new algorithm called "batch lattice MIRA", which is a hybrid of some of the other approaches, emerges as the winner in the "Medium" and "Big" settings: it can be computed in a reasonable amount of time even when there are many features, it almost always yields the best-performing system, and its performance is stable (low standard deviation over a set of randomized trials). Subsequent to the work described in this paper, batch lattice MIRA was used to tune NRC's Chinese-English and Arabic-English submissions to NIST Open MT 2012. These systems performed extremely well, in large part because batch lattice MIRA allowed them to exploit a large number of features productively. To sum up, this paper not only explores the properties of many tuning strategies from the literature in an extremely thorough set of experiments, it also describes a new strategy that has more desirable characteristics than any of the older strategies.
In addition to describing PORTAGE shared's tuning algorithm, batch lattice MIRA, this paper is of interest to users of PORTAGE shared because it describes a set of sparse features available to them. A sparse feature is a feature that is irrelevant most of the time. For instance, it might be a Boolean that is set to 0 for most input source-language sentences, and that is set to 1 only when the word "elephant" appears in the input sentence. Clearly, it will be 0 during the decoding of most sentences.
Here is the paper's description of the sparse features - the so-called "Hopkins-May" features - that are added to the 18 features in the Medium set to create the Big set (section 4.2):
"The Big set adds sparse Boolean features to Medium, for a maximum of 6,848 features. We used sparse feature templates that are equivalent to the PBMT set described in (Hopkins_andMay2011#HopkinsandMay2011): ''tgt unal'' picks out each of the 50 most frequent target words to appear unaligned in the phrase table; ''count bin'' uniquely bins joint phrase pair counts with upper bounds 1,2,4,8,16,32,64,128,1k,10k,1; ''word pair'' fires when each of the 80 most frequent words in each language appear aligned 1-1 to each other, to some other word, or not 1-1; and ''length bin'' captures each possible phrase length and length pair".
The "hopmay" features in the PORTAGE shared code are defined exactly the same way as these Hopmay-May features in the Big Set. Note that PORTAGE shared has another set of sparse features that enable an advanced form of lexicalized distortion, DHDM, as described in "Improved Reordering for Phrase-Based Translation using Sparse Features" by Colin Cherry. See the description of this paper elsewhere in this bibliography. Ideally, these two papers, the Cherry & Foster 2012 paper and the Cherry 2013 paper, should be read together.
In principle, the sparse features in PORTAGE shared could be real numbers. As currently implemented, they must be Booleans or integers.
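To illustrate the quoted feature templates, here is a sketch of two of them ('count bin' and 'length bin'). The feature names are hypothetical; the actual "hopmay" implementation in PORTAGE shared may name and index its features differently.

```python
import bisect

COUNT_BOUNDS = [1, 2, 4, 8, 16, 32, 64, 128, 1000, 10000]   # last bin is unbounded

def count_bin_feature(joint_count):
    """Exactly one Boolean feature fires per phrase pair, by binned joint count."""
    i = bisect.bisect_left(COUNT_BOUNDS, joint_count)
    upper = COUNT_BOUNDS[i] if i < len(COUNT_BOUNDS) else "inf"
    return {f"count_bin<={upper}": 1}

def length_bin_features(src_phrase, tgt_phrase):
    """Indicator features for each phrase length and for the length pair."""
    return {f"src_len={len(src_phrase)}": 1,
            f"tgt_len={len(tgt_phrase)}": 1,
            f"len_pair={len(src_phrase)},{len(tgt_phrase)}": 1}

print(count_bin_feature(5))                                  # {'count_bin<=8': 1}
print(length_bin_features(["livre", "vert"], ["green", "book"]))
```

Each template expands into many Boolean features, which is how the Big set reaches several thousand features while each individual feature remains cheap to compute.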
Cherry 2013
"Improved Reordering for Phrase-Based Translation using Sparse Features" by Colin Cherry, NAACL-HLT 2013, Atlanta, USA (http://www.aclweb.org/anthology/N12-1047.pdf).
The central goal of this paper is to improve lexicalized reordering (sometimes called "lexicalized distortion") in an SMT system. In translating from one language to another, the order of corresponding words may change in a systematic way. The nature of the change varies from language pair to language pair. For instance, verbs in an English sentence often "move" to the right when the sentence is translated into German. Thus, the sentence "I have read the green book" translates into German as "Ich habe das grüne Buch gelesen" - literally, "I have the green book read". English and French sentences typically place a given verb in roughly the same place. On the other hand, English almost always places adjectives in front of the noun, while French often - but not always - places them after the noun (German behaves like English in this respect). "I have read the green book" translates into French as "J'ai lu le livre vert" - literally, "I have read the book green".
All SMT systems have a "reordering" or "distortion" model that tries to ensure words end up in the right place. The oldest, simplest reordering model is called a distortion penalty: it penalizes translations that rearrange words. This does not mean that the translations coming out of a system with a distortion penalty will always have words in the same order as in the source sentence, because the phrase table has memorized many phrase pairs, and the language model (LM) favours frequent sequences of words in the target language. If the English-French phrase table has memorized the translation of "green book" into "livre vert", and the French LM has been trained on sentences that include the word sequence "livre vert" or "le livre vert", "I have read the green book" will probably be translated correctly as "J'ai lu le livre vert" ("I have read the book green") rather than wrongly as "J'ai lu le vert livre" because the signal from the phrase table and the LM is stronger than the signal from the distortion penalty.
The problem with this approach is that it doesn't generalize to word sequences that weren't in the training data for the phrase table or LM. If "yellow book" wasn't in the training data, "I have read the yellow book" will probably be translated as "J'ai lu le jaune livre" (with the adjective incorrectly in front of the noun, as in English). Lexicalized distortion models deal with this problem by memorizing the tendency of words - and sometimes, depending on the model, word classes - to move. Such a model may be able to learn, for instance, that English adjectives like "yellow" have a tendency to "move" to the right of the nouns they were immediately to the left of when being translated into French. If it models both words and word classes, the model may even be able to learn exceptions to general rules of this type (e.g., it can memorize as exceptions the small number of French adjectives such as "petit" and "grand" that tend to precede, not follow, the nouns they modify).
Most lexicalized distortion models are expressed in a way that makes them easy for the decoder to apply. Recall that in phrase-based decoding, the target-language translation hypothesis grows monotonically from left to right, but the source-language phrases chosen for translation may be in any order (though for most language pairs, they also tend to be chosen in roughly left-to-right order, with exceptions). Consider a source sentence made up of the phrase sequence A B C D E F G (we'll pretend there is only one way of segmenting this sentence into phrases). Suppose that the decoder has just translated "D" and is deciding which phrase to translate next.
We know that most often, for most language pairs, the correct choice will be the monotone one: the next phrase in the source sentence, "E". Sometimes, it will be the preceding one, "C" (as when we've just translated the word "book" in "green book" to "livre" and then translate the preceding word "green" into "vert" so that the target hypothesis now contains the word sequence "livre vert"). This is called the swap choice. Finally, the decoder might choose to translate a phrase that isn't adjacent to "D" - a non-adjacent phrase to the left like "A" or "B", or one to the right like "F" or "G". This is called a discontinuous choice. Some distortion models distinguish between the two discontinuous choices, i.e. between a left discontinuous and a right discontinuous choice, but in what follows, we will assume that only three orientations - monotone (M), swap (S), and discontinuous (D) - are being used.
In the preceding paragraph, the orientations M, S, and D are defined in terms of phrases. In the paper, they are defined hierarchically. That is, they are taken with respect to the longest previous source word sequence that could have been translated as a single phrase (whether or not that source phrase was actually in the phrase table). This yields more consistent definitions of M, S, and D that are easier to learn.
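The non-hierarchical, phrase-level version of the orientation decision is simple enough to sketch directly; the hierarchical version in the paper differs only in how the "previous" span is computed, which this sketch ignores.

```python
def orientation(prev_src_span, cur_src_span):
    """Spans are (start, end) source word indices, with end exclusive."""
    prev_start, prev_end = prev_src_span
    cur_start, cur_end = cur_src_span
    if cur_start == prev_end:
        return "M"        # monotone: current phrase immediately follows the previous one
    if cur_end == prev_start:
        return "S"        # swap: current phrase immediately precedes the previous one
    return "D"            # discontinuous: everything else

print(orientation((3, 4), (4, 6)))   # M
print(orientation((4, 6), (3, 4)))   # S
print(orientation((3, 4), (0, 2)))   # D
```

A model that distinguishes left from right discontinuity would simply split the "D" case according to whether cur_end is to the left or right of the previous span.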
This paper compares two different ways of encouraging the decoder to make the appropriate choice between M, S, and D. The first, "maximum entropy" reordering, derives probabilities from the number of times each of these choices has been made in a given situation in the sentence-aligned bilingual training data. For instance, in translating the English sentence "I have read the green book" into French, once the first part - "I have read the ..." - has been translated into French as "J'ai lu le ...", the decoder must decide between the M option - translate "green" next - and a number of D options. In this case, one of the D options - translate "book" next - followed by the S option - which translates "green" after that - yields the correct translation, "J'ai lu le livre vert". In maximum entropy (ME) reordering, three probabilities are applied at each step during decoding: P(M), P(S), and P(D). The predictors are aspects of the phrase pair currently being considered, such as the first and last words in the target and source phrases, and the Mkcls() classes of these four words. The ME model is estimated from millions of counts of M, S, and D cases in the training data; the estimation procedure minimizes perplexity.
The second way of handling the choice between M, S, and D is called "sparse" reordering. It does not use the training data. Instead, the same kinds of predictors as those used by ME reordering are turned into features whose loglinear weights are estimated by the usual tuning process, which maximizes BLEU. In addition to these features, based on the first and last words of the source and target phrases, etc., sparse reordering can also exploit contextual information available during decoding, such as the identity of the words at the current top of the decoding stack. Because this sparse reordering model has access to far less data than the ME model (the former is trained on the "dev" tuning data, which is tiny compared to the full training data on which the latter is estimated), only the 80 most frequent words from the source and target languages are represented explicitly in the sparse reordering model. All other words are represented by their Mkcls() class.
In the Arabic > English experiments reported in the paper, the ME model maintains 530 million probabilities, while the sparse model has only about 4,000 features. The big surprise from these experiments is that despite being estimated from far more data and providing far more information, the ME model turned out to be much less effective in improving performance than the sparse model. The former has little impact on BLEU, while the latter improves BLEU by between +1.0 and +1.8 points. The author believes this is because the sparse model is trained to directly optimize BLEU, while the ME model's focus on reducing perplexity means that it wastes modeling power on rare situations. It is good news that sparse reordering works much better than ME reordering - incorporating sparse features only slows the MT system down slightly, while ME reordering imposes substantial computational cost.
Stewart et al 2014
"Coarse 'split and lump' bilingual language models for richer source information in SMT" by Darlene Stewart, Roland Kuhn, Eric Joanis, and George Foster, AMTA 2014, Vancouver, Canada (http://www.iro.umontreal.ca/~foster/papers/coarse-amta14.pdf).
This paper has two central ideas: a not very innovative one that we (NRC) still routinely incorporate in most of our systems, and a more innovative and interesting one that has largely been superseded by NNJMs (neural net joint models).
The less innovative idea is called "coarse language models" ("coarse LMs"). A coarse LM is exactly like a normal N-gram LM for the target language, except that the basic unit is word classes, not words. In the paper, the open-source Mkcls() program is used to learn the word classes from training data. Subsequently, we have often used a faster word clustering program, word2vec, to obtain the word classes, though the resulting improvement in BLEU is not as great as when the word classes are obtained with Mkcls(). The paper shows that one can often get bigger gains by using more than one coarse LM, at different levels of granularity (in addition to the normal word-based LM). For instance, one can get good results by combining a coarse LM with 100 word classes and another with 1600 word classes, or one with 200 word classes and another with 800 word classes.
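The training data for a coarse LM can be produced by a one-line transformation, sketched below; the word-to-class mapping is a hypothetical stand-in for the output of Mkcls() or word2vec clustering, and the class stream is then fed to an ordinary N-gram LM toolkit.

```python
def to_class_stream(sentence, word2class, unk_class="C_UNK"):
    """Replace each word by its class label before N-gram LM training."""
    return [word2class.get(w, unk_class) for w in sentence]

word2class = {"the": "C17", "green": "C42", "yellow": "C42", "book": "C08"}
print(to_class_stream("the yellow book".split(), word2class))
# ['C17', 'C42', 'C08'] -- "yellow" and "green" now share the same LM statistics
```

Because rare words inherit the statistics of their class, coarse LMs generalize to word sequences that a word-based LM has never seen.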
The more innovative idea is called "coarse biLMs". BiLMs are an earlier concept from another lab. They are like normal N-gram LMs, except that the basic unit of a biLM is not a word in the target language, but a unit called a "bitoken" consisting of a word in the target language together with the source-language words to which it is aligned (which could be zero, one, or more than one source-language word). Incidentally, biLMs ignore source-language words that don't align with target-language words. The advantage of using bitokens instead of words is that word senses in the target language are often disambiguated. E.g., if the target language is English, the word "shower" is ambiguous. However, the English-French bitokens "shower_douche" and "shower_averse" (using a notation of the form "target-word_aligned-source-words") are less ambiguous: the first bitoken refers either to a bathroom fixture or to an act of cleansing performed with that fixture, while the second bitoken refers to rainfall.
The lab that invented biLMs didn't succeed in getting them to have much impact on MT performance. That is because of their main disadvantage: they are hard to estimate accurately, because of data sparsity - for any given bilingual corpus, there will be far more bitokens than target-language words. The key NRC innovation in this paper was to reduce data sparsity by applying Mkcls() to bitokens. The paper explores four ways of doing this: clustering source-language words into classes, clustering target-language words into classes, clustering words in both the source and target languages separately and combining the classes on both sides, and finally, clustering the original word-based bitokens themselves. These various ways of keeping the central idea of biLMs - using LM units that carry information about the source words aligned with each target word - while reducing data sparsity are lumped together under the name of "coarse biLMs" in the paper.
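The sketch below illustrates one of the four coarsening options (replacing both sides of a word-level bitoken with their classes before training the biLM); the class tables are hypothetical, and the paper also explores coarsening only one side or clustering the bitokens themselves.

```python
def coarsen_bitoken(bitoken, src2class, tgt2class):
    """Map a 'target_source' bitoken onto its class-level equivalent."""
    tgt, _, src = bitoken.partition("_")
    src_classes = "_".join(src2class.get(w, "C_UNK") for w in src.split("_") if w)
    return f"{tgt2class.get(tgt, 'C_UNK')}_{src_classes}"

src2class = {"douche": "F31", "averse": "F77"}
tgt2class = {"shower": "E12"}
print(coarsen_bitoken("shower_douche", src2class, tgt2class))   # 'E12_F31'
print(coarsen_bitoken("shower_averse", src2class, tgt2class))   # 'E12_F77'
```

The coarse bitoken vocabulary is far smaller than the word-level bitoken vocabulary, so the resulting biLM can be estimated much more reliably while still carrying source-side information.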
When multiple coarse LMs and coarse biLMs are added as features, the paper shows impressive gains over strong baselines for some language pairs - for instance, up to +1.2 BLEU for Arabic > English MT (using BOLT data). In retrospect, coarse biLMs do some of what is achieved by NNJMs: they are LMs that incorporate more information from the source than was previously possible.
However, honesty compels the admission that coarse biLMs have weaknesses that NNJMs don't share:
- In coarse biLMs, the span of information from the source sentence is limited by the length of the target-language history being considered. For instance, a 3-bitoken coarse biLM can only consider the source words aligned with the current target word or with the two preceding target words. In an NNJM, the span of source information can be much wider than the span considered on the target side, and source words "in the future" - i.e., source words that haven't yet generated words in the target-language hypothesis - can be considered. In the original Devlin et al. (2014) paper on NNJMs (see below), the information exploited on the target side is often merely a 4-gram (the current target word and its three predecessors), but on the source side it spans five words to the left and five words to the right of the source word aligned with the current target word.
- In coarse biLMs, the procedure for reducing the sparsity of bitokens is separate from the way those bitokens will be used in the resulting biLM: there is no feedback from the training of the biLM to the Mkcls() function. NNJM training implicitly combines data compression with modeling in a feedback loop. It seems likely that this makes NNJMs more efficient at the simultaneous exploitation of target and source information than coarse biLMs.
We have never carried out a full-scale experimental comparison of coarse biLMs with NNJMs. However, the considerations above have caused us to lose some interest in coarse biLMs and to focus on NNJMs.
Devlin et al 2014
"Fast and Robust Neural Network Joint Models for Statistical Machine Translation" by Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul, ACL 2014, Baltimore, USA (http://acl2014.org/acl2014/P14-1/pdf/P14-1129.pdf).
This 2014 paper on a "neural net joint model" (NNJM) by a group of authors from Raytheon BBN is already a classic. It should be read by anyone interested in modern machine translation (MT).
There had been many earlier attempts to apply neural net (NN) techniques to MT. For instance, language models (LMs) based on NNs were already in use by some MT groups (and also some groups working on automatic speech recognition). These NN-based LMs had not had a large impact on the field, perhaps because the modest improvements they promised were outweighed by the nuisance of trying to master the fiddly NN training procedure. Nor had other attempts to incorporate NNs into MT systems been unusually successful.
Devlin et al. (2014) changed that. One of this paper's major contributions was its demonstration that incorporating far more contextual information from the source sentence than before when predicting the next target word can result in dramatically improved translation quality. (Our group at NRC had found another way of incorporating wider source-side context, via a data structure called coarse biLMs - see above - but coarse biLMs have limitations that the Devlin et al. approach doesn't have.) In this paper, the NNJM yields a +3.0 BLEU gain over a strong baseline in experiments carried out with NIST 2012 Arabic > English data.
The NNJM described here predicts each target word from its three predecessors (just as a 4-gram LM would) along with a window of 5 words on either side of the single source word affiliated with that target word (i.e., a source context window 11 words wide). The NN has a feed-forward architecture: each of the 14 context words (3 target words, 11 source words) is mapped to a 192-dimensional embedding by a shared embedding layer, and two hidden layers with tanh activation functions map the concatenated embeddings onto a softmax output layer representing the target-language vocabulary. The 32,000 most frequent target-language words are predicted explicitly; the others are only partially predicted (the system predicts their POS tags).
A serious potential problem with this NNJM is that during decoding, the output scores need to be normalized, which requires consulting the entire output layer at each step. That is computationally intensive. An important innovation of this paper is a method for training the NNJM in such a way that the output scores are close to being probabilities, so that normalization at decoding time becomes unnecessary. This training technique is called "self-normalization". The authors also pre-compute the first hidden layer of the NN, and show that using one hidden layer rather than two doesn't significantly hurt BLEU. Together, these innovations make it possible to incorporate the NNJM in an SMT system as a feature without significantly slowing down decoding.
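The self-normalization idea can be summarized in a few lines of training loss. The sketch below is an illustrative PyTorch-style formulation, not the authors' code: the usual cross-entropy term is augmented with a penalty that pushes the log partition function toward zero, so that at decoding time the raw output scores can be used without normalization (alpha is a tuning constant).

```python
import torch
import torch.nn.functional as F

def self_normalized_loss(logits, targets, alpha=0.1):
    # Standard cross-entropy term (this still normalizes during training)...
    ce = F.cross_entropy(logits, targets)
    # ...plus a penalty keeping log Z(x) near zero, so that at test time
    # exp(logit) can be treated directly as a probability.
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + alpha * (log_z ** 2).mean()
```

With this penalty in place, the decoder can skip the expensive softmax over the full output vocabulary at every step.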
There are four ways the NNJM can be used as a feature in an MT system like PORTAGE shared. The way that lends itself most naturally to decoding is scoring each proposed target word, going left-to-right in the target language; the authors denote this the S2T/L2R model. Once a set of target hypotheses has been proposed, one can also score the source sequence on the basis of a particular target hypothesis, and one can score each target hypothesis from right to left rather than from left to right. Thus, there are four sub-types of NNJM: S2T/L2R, which can be used during decoding, and T2S/L2R, T2S/R2L, and S2T/R2L, all of which can easily be applied when rescoring N-best lists.
In addition to these four types of NNJM, the paper describes a fifth model that is not an NNJM: the "Neural Network Lexical Translation Model" (NNLTM). Recall that NNJMs don't explicitly model source words that don't help to generate target words. The NNLTM repairs this omission. For each word in the source sentence, it predicts whether that word generates NULL or some sequence of non-NULL target words (to some extent, it performs the role of the "fertility" component of the old IBM models 3 to 6).
Initially, other MT research groups had some difficulty in replicating the dramatic BLEU gains from NN models described by the Raytheon BBN authors in this paper. NRC researchers were among the first to validate the approach taken here by incorporating it in a non-BBN system, PORTAGE shared, and obtaining gains for PORTAGE shared as dramatic as those described in this paper. This paper does not describe all the fiddly details of training needed to make NNJMs and NNLTMs work well. Jacob Devlin, the first author of the paper (now at Microsoft), deserves our gratitude: he spent many hours on the phone and on email with NRC researchers, helping them to get NNJMs and NNLTMs to perform well when incorporated in PORTAGE shared.
The following is a summary of how we at NRC train NNJMs. Packed rest costs are an NRC innovation (they weren't one of the tips from Jacob Devlin).
Foster 2016 addendum to Devlin
George Foster: ADDENDUM to Devlin et al (2014), January 2016
General tricks for training NNJMs:
- Learning rate decay: multiply the learning rate by 0.5 when dev-set perplexity hasn't dropped for the last two epochs (using BBN's epoch = 20K minibatches of 128 examples).
- Use a high initial learning rate (0.1 or higher, assuming the gradient is normalized by minibatch size), but clip gradients to [-0.1, 0.1]. (These first two tricks are sketched in the code after this list.)
- For self-normalization, use a bias for output word embeddings, and initialize this to uniform probabilities (or, slightly better, unigram probabilities), so the network starts out self-normalized.
- For corpora that can be split into in-domain and out-of-domain portions, train initially on the whole corpus, then perform several "adaptation" iterations on the in-domain data. When adapting, start with 1/10 of the initial learning rate used for the main pass, and stop when no dev-set improvement has been seen for 5 epochs, to avoid overfitting.
- If your decoder requires probabilities for n-gram contexts shorter than 3 (e.g., for future-cost estimates), the model can be efficiently trained to generate these by randomly dropping context prefixes of varying lengths. We go into more detail about this below.
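The following is a rough sketch of the first two tricks above (learning-rate decay on a stalled dev perplexity, plus gradient clipping). The training and evaluation callbacks are hypothetical placeholders for whatever framework is being used.

```python
def train(train_one_epoch, eval_dev_perplexity, lr=0.1, clip=0.1, max_epochs=50):
    history = []
    for epoch in range(max_epochs):
        train_one_epoch(lr=lr, clip=clip)      # clip each gradient to [-clip, clip]
        history.append(eval_dev_perplexity())
        # Halve the learning rate when dev perplexity hasn't improved
        # over the last two epochs.
        if len(history) >= 3 and min(history[-2:]) >= min(history[:-2]):
            lr *= 0.5
    return lr
```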
Packed rest costs for the NNJM:
As described by Devlin et al. (2014), full decoder integration of an NNJM that uses 3 words of target-side context requires training 4 models, each accounting for a different level of available target-side context (from a target 4-gram model down to a target 1-gram model). The lower-order estimates are used to estimate language model probabilities before the appropriate context becomes available, typically when calculating phrase-pair future costs for phrase-based translation, and when estimating left- and right-completion costs for constituents in SCFG-based translation. These lower-order estimates are also referred to as rest costs in the language modeling literature. As NNJM training takes roughly 1 day, and we have a limited number of GPU machines, we investigated ways to train all 4 models at once.
We experimented with a drop-context approach to packing multiple amounts of context into an NNJM. The approach is simple: when training an NNJM with n words of target context, every word visited during stochastic gradient descent is selected to have its context elided with probability p_elide (set to 0.1 in our experiments). If a word is selected for elision, it has between 0 and n-1 of its target context words replaced by a special elision token, with the number of words elided selected uniformly. A vector representation is learned for the special elision token just as if it were any other token. The vast majority of words get the full target context, but the system learns not to rely on that context, since any future example may or may not have truncated context. The dropped context acts as a sort of regularizer, forcing the model to rely on the source context and the remaining target context rather than overfitting to strong signals from more distant target words. At test time, any desired level of context can be achieved by simply replacing the missing context with elision tokens. For example, a 1-gram NNJM is obtained by filling all three target-context positions with elision tokens.
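A minimal sketch of the drop-context step, applied to the target-side context of one training example (the elision token name is hypothetical):

```python
import random

ELIDE = "<elide>"

def maybe_elide_context(target_context, p_elide=0.1):
    """target_context: the n previous target words for one training example."""
    if random.random() >= p_elide:
        return target_context                  # most examples keep their full context
    n = len(target_context)
    keep = random.randint(1, n)                # elide between 0 and n-1 words, uniformly
    # Replace the oldest context positions with the elision token.
    return [ELIDE] * (n - keep) + target_context[-keep:]

print(maybe_elide_context(["the", "green", "book"], p_elide=1.0))
```

At decoding time, shorter contexts are simulated the same way: the missing positions are simply filled with the elision token before querying the single packed model.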
In a BOLT Arabic > English setting, we evaluated the packed NNJM against a version where specialist NNJM''''''s were trained for each level of context. The packed NNJM yields perplexity results comparable to those of specialized models at all levels of context except for no context (1-gram). Crucially, we have given up no power at the 4-gram level in order to pack these models together. We also evaluated the use of packed future costs in our phrase-based decoder. Again, we see only a small advantage for using specialized models, with nearly the same average model score. With the packed model requiring one fourth of the training time and one fourth of the space, and having no significant impact on performance, we recommend its usage in the calculation of future costs.
Up: PortageII Previous: ProgrammingGuidelines Next: RequirementsForTrainingAMidSizeSystem