Chapter 9 Statistical Machine Translation and Word Alignment
Since the late 1990s, aligned bilingual corpora have been the object of various studies aiming to extract translational equivalencies between languages at the word or phrase levels. For example, a popular task of that decade was to extract bilingual lexicons for human translators from available parallel texts. Attempts to produce entirely automatic translation systems through statistical analysis took place simultaneously—the approach that remains most popular today and has given rise to the most pioneering research in the field.
Word alignment is a considerably more complex task than sentence alignment, as one can imagine. While there is often a 1–1 correspondence between the source text and its translation at sentence level (in other words, one sentence from the source text most often corresponds to one sentence in the translation), this does not necessarily apply at word level. It is well known that languages differ significantly and that many words cannot be directly translated. Most correspondences are therefore said to be “asymmetrical”; that is, one word from the source or target language corresponds to 0, 1, or n words in the other language.
Some Examples
Consider the following example: “Thanks to those in the field for their insights,” translated as “Merci à tous ceux qui, sur le terrain, ont fait part de leurs idées” (taken from the website www.unaids.org). The English sentence contains nine words, whereas the French equivalent contains 14! It is thus difficult to propose an alignment at word level, since the two sentences do not have the same structure (the French version introduces a relative clause, whereas English uses simpler and more direct wording). Figure 11 presents an example of an incomplete lexical alignment between the two sentences (incomplete in the sense that some words do not have translational equivalents).
Figure 11 A possible alignment between two sentences.
In this example, the equivalent of the preposition “for” is in some ways “qui ont fait part de”: we could imagine relating “for” to all the words in the French expression, but this would hardly make any sense. This is also very true for the following example.
Figure 12 A possible alignment between two sentences, with several intersecting links.
In figure 12, several words from the source sentence are not directly translated in the target sentence. Furthermore, we see that links between the two sentences intersect several times due to the inversion of the two propositions (the English sequence “…that what he has announced he will actually do…” appears in French as “…qu’il fasse réellement ce qu’il a annoncé”). One word from the source language can furthermore correspond to several words in the target language, and vice versa (“will see” corresponds to the simple verb form in the future tense “veillerons,” whereas, in the opposite direction, a link can be established between the simple word “what” and the French expression “ce que”). Lastly, the end of the sentence is difficult to “align” correctly: it would be more satisfactory to draw a link directly between the phrase “the need for it becomes apparent” and “la nécessité s’en fait ressentir,” because it is hard to see how this very good translation could be decomposed (the phrase “the need” can probably be aligned with “la nécessité,” but nothing in French can be considered a direct translation of “becomes apparent”).
In sum, identifying lexical equivalences is a difficult task to automate because the “search space” (i.e., the number of possibilities to be considered) is huge: each word from the source language can potentially be linked to any single word or group of words in the target language. This is, of course, not true when a human being is performing the task on known languages, but imagine how complex it would be with completely unknown languages! This is the case for computers, which do not have any idea of syntax or semantics and do not have access to a lexical resource like a dictionary. From a linguistic point of view, the task seems even more questionable, since one thing we know about translation is that there are no direct equivalences between languages at word level. The proof is that a word-for-word translation is generally very bad. For the most part, these ideas are right, and we will see that for the past several years, efforts have focused on taking into account more complex sequences of words in machine translation in order to avoid the basic errors that arise from the word-for-word approach.
Nonetheless, toward the end of the 1980s, the statistical approach, based on aligning parallel sentences at the word level, led to remarkable progress in machine translation. This approach naturally takes into account the statistical nature of language: it focuses on the most frequent patterns in a language and, despite its limitations, is able to produce acceptable translations for a significant number of simple sentences. In certain cases, statistical models can also identify idioms thanks to asymmetric alignments (one word from the source language aligned with several words from the target language, for example), which means they can also overcome the word-for-word limitation.
In the following section, we will examine several lexical alignment models developed toward the end of the 1980s and the beginning of the 1990s. The goal of this approach is to use very large bilingual corpora to automatically extract bilingual lexicons. In these lexicons, different translations are proposed for each word, and each of these translations is assigned a score reflecting its probability of being a correct translation. These lexicons are a fundamental part of machine translation systems, as they provide the basis for a word-for-word translation.
The “Fundamental Equation” of Machine Translation
At the end of the 1980s, an IBM research team located in Yorktown Heights, New York, decided to develop a machine translation system based on techniques initially developed for speech transcription. Speech transcription refers to the task of producing a written text from a sound sequence. Translation can be seen as a similar task, the only difference being that the input signal is a sequence of words in the source language instead of a sound sequence.
The IBM experiments were described in a series of papers published at the end of the 1980s and the beginning of the 1990s (see Brown et al., 1988, 1990 and 1993). The authors take as a starting point the fact that there are always several possible translations for a given sentence, whatever the source and target language may be. The choice among these various possibilities is to some extent a matter of taste and personal choice. Bearing this in mind, one can consider that any sequence in the target language is, to a certain extent, a possible translation of a sequence in the source language. Given a pair of sentences (S,T), where S is the source sentence and T the target sentence, it is possible to calculate a probability Pr(T|S) that a human translator would produce translation T from the sentence S. The idea is that Pr(T|S) will be very small for a pair of sentences like (Le matin je me brosse les dents | President Wilson was a good lawyer) and much higher for a pair such as (Le président Wilson était un bon avocat | President Wilson was a good lawyer). In other words, every sequence in the target language can be considered a possible translation of a sentence in the source language, but realistic translations will obtain a score well above 0, whereas the others will remain close to 0.
The IBM team showed that this hypothesis can be modeled using well-known principles from probability theory, namely Bayes' theorem. Bayes' theorem in a way reverses the problem and aims to determine, given various sequences from the target language, which has the most chance of being a translation of the source language. This can be formalized with the following equation:

Pr(T|S) = Pr(T) * Pr(S|T) / Pr(S)    (1)

where Pr(T) is a language model of the target language and Pr(S|T) a translation model. In other words, Pr(S|T) measures the probability of sequence S according to sequence T (meaning the probability that S is a sequence in the source language that corresponds to sequence T; if the probability is close to 1, then the two sentences are probably a translation of one another), whereas Pr(T) measures the probability of the sequence in the target language, without taking the source language into account (i.e., the probability that T forms a valid and well-formed sequence in the target language, thus accounting for word order in the target language).
For example, Pr(T) encodes the fact that the sequence "the red car" is more frequent than "car the red," "car red the," or "red car the." The translation of "la voiture rouge" will then be "the red car" rather than any other sequence formed with these same words. We will see later that the translation model Pr(S|T) makes the translation process possible by breaking down the sentence into small fragments to search for equivalences at word level. When all word-for-word equivalences are put together, different sentences are possible that differ only in their word order. Pr(T) helps "select" among these solutions the sequence that has a chance of being the most correct in the target language, taking into account only considerations of word order. Since the denominator of equation (1) does not depend on T, the equation can be simplified as follows:

T' = argmax_T Pr(T) * Pr(S|T)
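To make the interplay of the two models concrete, here is a minimal Python sketch of noisy-channel scoring applied to a handful of candidate translations of "la voiture rouge." The candidate list and all probability values are invented purely for illustration; a real system would obtain them from a language model and a translation model estimated on a large corpus.

```python
# Minimal sketch of noisy-channel scoring: T' = argmax_T Pr(T) * Pr(S|T).
# All probability values below are invented for illustration only.

source = "la voiture rouge"

# Candidate target sequences with hypothetical model scores:
#                   Pr(T)     Pr(S|T)
candidates = {
    "the red car":   (0.012,   0.40),
    "car the red":   (0.00001, 0.40),  # same words, implausible order
    "the red train": (0.010,   0.02),  # fluent English but a poor translation
}

def noisy_channel_score(lm_prob, tm_prob):
    """Score of a candidate translation T of S: Pr(T) * Pr(S|T)."""
    return lm_prob * tm_prob

best = max(candidates, key=lambda t: noisy_channel_score(*candidates[t]))
print(f"{source} -> {best}")  # -> la voiture rouge -> the red car
```

Note how the language model alone would also accept "the red train" (it is perfectly fluent English); it is the product of the two scores that singles out the correct translation.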
For the IBM team (see Brown et al., 1993), this formula is the "fundamental equation of machine translation" because all statistical models thereafter originate from it. It should be noted that the equation does not explain how to decompose the source sentence during the translation process. The easiest way is for the model to rely on words and perform a word-for-word translation, at least in a first approximation. In this context, the idea is to find, for each word in the source language, an equivalent in the target language using a very large corpus aligned at sentence level, as we saw in chapter 7. In order to do this, the IBM team proposed to decompose the statistical translation process into three different steps:
- determine the length of the target sentence depending on the length of the source sentence;
- identify the best possible alignment between the source sentence and the target sentence; and then
- find correspondences at word level (i.e., find word mt in the target language corresponding to word ms in the source language).
This strategy clearly gives a very simplified picture of the translation process. In particular, the first step assumes that every sentence of length l in the source language will be translated by a sentence of length m in the target language! In fact, at the end of the 1980s, IBM was fully aware of the limitations of the approach: various articles published at the time stressed the fact that such an approach would have to be complemented by efforts to incorporate more linguistic knowledge and more complex matching rules between languages. For the IBM team it was even probable that this approach would not make it possible to go very far given its inherent limitations. But IBM sought to evaluate the quality of the results obtained with a very simple approach compared to more complex systems obtained after years of human effort. We will soon see that the results obtained were incredibly good from this point of view.
The essential part in the IBM model thus resided in the choice of words for the translation in the target language, which means that the essence of the overall model can be found within the lexical alignment strategy (i.e., alignment at word level). The overall approach comprises two very different steps. The first one aims at extracting as much information as possible from very large bilingual corpora; the second one uses this knowledge to translate new sentences. More precisely, the approach can be described as follows:
- A word alignment algorithm is applied to a very large corpus made of bi-texts (texts aligned at sentence level). The result of this analysis is twofold: a bilingual dictionary (i.e., the result of the alignment at word level) and the most likely global alignment at sentence level.
- This huge amount of information is then used to translate new sentences that the end user wants to translate.
The first step is often called the “training step” or “learning step,” and the second step the “processing step” or “testing step.” It is important that the data used for training be similar to the data used for testing in order for the system to produce satisfactory results. As one can imagine, the key point lies in the quality of the information accumulated during the training step, which essentially entails analyzing a very large aligned corpus at word and sentence levels. The seminal paper from IBM in 1993 described five alignment models, each of which is a modification of the previous model.
Different Approaches for Lexical Alignment: The IBM Models
As we have seen, the translation approach developed within IBM in the late 1980s was essentially based on translation choices carried out at word level. So, the crucial ingredient in this approach is an accurate bilingual dictionary. With the statistical framework, a bilingual dictionary consists in fact, for each word, of a list of possible translations in the target language, along with a probability associated with each of these possible translations. For practical reasons, the sum of the probabilities of all the possible translations associated with a given word in the source language must equal 1, as shown in table 1.
Table 1 Example of possible translations in French for the English word "motion"

| English word | Possible translation | Probability |
| --- | --- | --- |
| motion | mouvement | 0.35 |
| | geste | 0.12 |
| | motion | 0.11 |
| | proposition | 0.10 |
| | résolution | 0.10 |
| | marche | 0.05 |
| | signe | 0.04 |
| | … | … |
| | Total | 1 |
Note: Each translation appears with a probability based on the number of times the word was actually translated in this way (compared to the total number of occurrences of the word in the corpus). Thus, “mouvement” is the most likely translation, followed by “geste,” etc. This list has been limited arbitrarily to the first seven possible translations, but it is theoretically possible to list as many as necessary, as long as the sum of the probabilities ultimately equals 1 for one given word in the source language.
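In code, such a probabilistic bilingual dictionary is simply a mapping from each source word to a distribution over target words. The sketch below reuses the illustrative values of table 1 (with the remaining probability mass gathered under "...") just to show the representation and the normalization constraint; the numbers themselves are not taken from any real corpus.

```python
# Probabilistic bilingual lexicon: t[source_word][target_word] = probability.
# The values reproduce the illustrative figures of table 1; the mass not
# listed explicitly is gathered under "..." so the distribution sums to 1.
t = {
    "motion": {
        "mouvement": 0.35, "geste": 0.12, "motion": 0.11, "proposition": 0.10,
        "résolution": 0.10, "marche": 0.05, "signe": 0.04, "...": 0.13,
    },
}

# Each source word must carry a proper probability distribution.
assert abs(sum(t["motion"].values()) - 1.0) < 1e-9

# The most likely translation of "motion" according to this lexicon:
print(max(t["motion"], key=t["motion"].get))  # -> "mouvement"
```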
The different IBM models are in fact not limited to 1 to 1 correspondences at word level, but assume that a source word may be aligned with 0, 1, or n target words. The different models (numbered 1 to 5) include different optimizations to deal with multiword expressions in the target language, or with words with no equivalent in the other language (for example, determiners that appear in one language but not in the other one). We give a quick overview of the various models below, without all the mathematical details.
Model 1
The first model developed by IBM was extremely simple. It considered that initially, by default, any word from the target language could be the translation of any word in the source language (within two sentences that are translations of each other, taken from a given bi-text). This starting point may seem too crude, but one should keep in mind that the system initially does not have any linguistic knowledge (no dictionary is provided) and will base its analysis on very large corpora (several million aligned sentences are used in most systems nowadays). To illustrate the approach, in what follows, we take the example of a single, isolated sentence, but it should be borne in mind that the approach can only work if regularities are identified from millions of examples.
In order to roughly determine the probability that a target word mt is the likely translation of a source word ms, one could collect all the words appearing in all the translations of sentences where ms appears and then calculate the translation probability of each word from the relative frequencies of the words collected this way. This means, intuitively, that in the absence of any linguistic knowledge, the system assumes that all the words in the target sentence are possible translations of all the words in the source sentence. Thus, for the sentence pair “the cat is on the mat” ⇔ “le chat est sur le paillasson,” the six French words “le,” “chat,” “est,” “sur,” “le,” and “paillasson” will be considered as equally probable translations of “cat,” as of any other word in the English sentence. Clearly, this strategy does not work for one sentence in isolation (“paillasson” is not a proper translation for the English word “cat”), but the analysis of a huge number of sentences will reinforce the association “chat” ⇔ “cat” (since these two words very frequently appear together in bi-texts), whereas the association between “cat” and “paillasson” will remain very marginal (meaning that in the end the association will have a probability close to 0) and as a result will likely be ignored thereafter.
However, this simple solution poses a major problem: if the target sentence contains 20 words, each of them will be a possible translation that will have the same weight as if the target sentence contained five words. Yet, as there is only one single equivalent mt to find for each ms in this case, it is obvious that a longer sentence will lead to more noise (meaning it will generate more erroneous possibilities) than a shorter sentence. In other words, the number of words in the sentence should be taken into account in order to increase the probability of each of the five words in the shorter sentence.
Simultaneously, the system also considers the global probability of all word alignments at sentence level: the reinforcement of some connections at word level (like between “cat” and “chat”) will reinforce some possibilities at sentence level, and vice versa, as will be described below.
Following this principle, IBM defined a process that uses a classic learning algorithm called the expectation-maximization algorithm, or EM, which gradually calculates the probabilities associated with each pair ms-mt, as well as the probabilities associated with each possible alignment at sentence level. As already noted, these two probabilities depend on and gradually reinforce each other. The EM algorithm calculates this joint probability in two steps: (i) arbitrary initial values are first assigned to each of the parameters (typically, every word in the source sentence can be connected with every word in the target sentence with the same probability); (ii) the system then calculates in an iterative manner the probabilities of the overall alignment at sentence level and then again at word level until convergence is achieved (the process is iterative since the alignment at sentence level changes the probabilities at word level, and vice versa, until the system reaches a stable state).
Let us take an example. Each alignment and each lexical correspondence has the same probability at the beginning. The fact that two words appear regularly together in bi-texts (in a source sentence and in the corresponding target sentence) will gradually strengthen their probability of being a translation of each other, as well as the probability of possible alignments at sentence level where these two words are connected. The figures below (figure 13 to figure 16, based on Koehn, 2009) demonstrate clearly the alignment process at word level.
Figure 13 Initialization of the alignments. Each English word is linked with equal probability to all the words in the French translation.
Figure 14 After the first iteration, the algorithm identifies the link between “la” and “the” as being the most likely, based on their frequency in the source language and in the target language. These links are strengthened (shown in bold) to the detriment of other links and therefore also other possible alignments.
Figure 15 After another iteration, the algorithm identifies the other most probable links between "voiture" and "car," then between "chaise" and "chair" and between "rouge" and "red." The other possible links and alignments become less and less probable.
Figure 16 The process ends when there is convergence, meaning a stable structure has been found. The other links in the figure are removed, but in fact they remain available with a very low probability.
It is possible to filter the alignment using a threshold in order to select only a limited number of possibilities, as shown in this figure, where the alternative links have been completely deleted. The reader is referred to IBM's initial publication (Brown et al., 1993) for all the mathematical details. Applying such an algorithm to very large corpora requires careful memory management and efficient computational methods. Moreover, our presentation is simplified here: in practice, not only 1–1 word correspondences are considered but also 1–m correspondences (where one word in the source language corresponds to m words in the target language), which makes the problem even more complex. Finally, it should be noted that for this simple model the algorithm is guaranteed to converge to an optimal solution (the quantity being optimized has a single optimum, so the system always reaches a stable state after a certain number of iterations), whereas the EM algorithm in general may get trapped in a merely local optimum.
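To give a concrete, if highly simplified, picture of this training procedure, the following Python sketch implements the core EM loop of IBM Model 1 on a toy bi-text in the spirit of figures 13 to 16. The sentence pairs are invented for the example; the sketch handles only word-level translation probabilities (no fertility, no distortion), uses a fixed number of iterations instead of a real convergence test, and ignores all the engineering needed for corpora of millions of sentence pairs.

```python
from collections import defaultdict

# Toy bi-text (French source, English target), invented for illustration.
bitext = [
    ("la maison".split(),        "the house".split()),
    ("la voiture".split(),       "the car".split()),
    ("la voiture rouge".split(), "the red car".split()),
    ("la chaise".split(),        "the chair".split()),
]

# Initialization: every target word is an equally likely translation of
# every source word (uniform t(f | e) probabilities).
f_vocab = {f for fs, _ in bitext for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))  # t[(f, e)] = Pr(f | e)

for _ in range(20):                    # fixed number of EM iterations
    count = defaultdict(float)         # expected counts c(f, e)
    total = defaultdict(float)         # expected counts c(e)
    # E-step: collect expected counts over all possible word alignments.
    for fs, es in bitext:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    # M-step: re-estimate t(f | e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("la", "the")], 2))     # strong link, reinforced at each iteration
print(round(t[("rouge", "red")], 2))  # likewise for "rouge" / "red"
print(round(t[("rouge", "the")], 4))  # competing link, driven toward 0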
The other models are all derived from this initial model. They make it more complex in order to take better account of certain language particularities and to provide better translations.
Model 2
As we have seen, IBM Model 1 considers that all initial alignments at word level have an equal probability (i.e., all the source words ms can be linked to all target words mt with an equal probability at the beginning of the alignment process). This is clearly wrong: one can easily observe that word order is often roughly similar between different languages, especially between typologically related languages (such as French and English). This is, of course, not always true, but there is nonetheless a strong correlation between the word order in the source and target languages in this case.
Model 2 thus modifies Model 1 by taking into account in its calculation the relative position of word mt in relation to the source word ms. This does not fundamentally change the previously mentioned algorithm, but leads to better results and speeds up the learning process before reaching convergence.
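For readers who want a slightly more formal picture, the following LaTeX fragment sketches the usual textbook presentation of the two likelihoods (after Koehn, 2009), where e is the sentence on the "given" side, f the sentence being generated, a an alignment, t(.) the lexical translation probabilities, and a(.|.) the position-dependent alignment probabilities introduced by Model 2. This is a simplified rendering, not the exact notation of Brown et al. (1993).

```latex
% Simplified likelihoods of IBM Models 1 and 2 (after Koehn, 2009).
% l_e, l_f: lengths of the two sentences; a(j): position aligned with j.
% (Requires the amsmath package.)
\begin{align*}
\text{Model 1:}\quad
  p(\mathbf{f}, a \mid \mathbf{e})
    &= \frac{\epsilon}{(l_e + 1)^{l_f}}
       \prod_{j=1}^{l_f} t\bigl(f_j \mid e_{a(j)}\bigr) \\
\text{Model 2:}\quad
  p(\mathbf{f}, a \mid \mathbf{e})
    &= \epsilon
       \prod_{j=1}^{l_f} t\bigl(f_j \mid e_{a(j)}\bigr)\,
       a\bigl(a(j) \mid j, l_e, l_f\bigr)
\end{align*}
```

The only difference between the two lines is the extra factor a(a(j) | j, l_e, l_f), which rewards alignments where linked words occupy comparable positions in the two sentences.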
Model 3
Model 3 is notably more complicated than Model 2. Its goal is mainly to better formalize the question of the 1-n correspondences (where one word in the source language is translated by several words in the target language; see, for example, "potato" in English, which corresponds to "pomme de terre" in French). This issue is not addressed by the previous models but is very common when translating between any pair of languages. IBM Model 3 also addresses other related problems: the article "the" is often translated as one word in French ("le," "la," "les," "l'") but is also often omitted; "only" may be translated as "seulement" but also by the expression "ne… que" (two non-contiguous words), etc.
The IBM team thus proposed to add to Model 2 “fertility probabilities” that indicate, for each word in the source sentence, the possible number of words in the target sentence (by default, each single word is translated by another simple word, but for “potato,” the translation is “pomme de terre,” which means there are systematically three words in French in this case; for “the,” 0 or 1 word can be generated, etc.).
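A fertility table can be pictured as just another probability distribution, this time over the number of target words each source word produces. The sketch below uses invented numbers purely to illustrate the idea; it is not trained on any corpus.

```python
# Hypothetical fertility probabilities n(phi | source_word): the probability
# that a source word is translated by phi words in the target language.
# All values are invented for illustration.
fertility = {
    "potato": {0: 0.01, 1: 0.15, 2: 0.04, 3: 0.80},  # usually "pomme de terre"
    "the":    {0: 0.35, 1: 0.63, 2: 0.02},           # often dropped in French
    "cat":    {0: 0.02, 1: 0.95, 2: 0.03},           # usually one word ("chat")
}

# Each row is itself a probability distribution over fertilities.
for word, dist in fertility.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```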
A related mechanism makes it possible to generate words in the target sentence that have no direct counterpart in the source sentence. This addresses one of the limitations of the previous models, since they always need a word in the source language in order to be able to generate a word in the target language. For example, to translate "il est avocat" into "he is a lawyer," the English article "a" must be added, which was not possible with the previous model (IBM Model 2). Model 3 handles this by allowing an "empty" word in the source sentence, to which such added target words can be aligned; in addition, "distortion probabilities" model how the positions of words change between the source and the target sentence.
Model 4
The IBM team then observed that some parts of a sentence can be moved more or less freely, and that this may be the cause of structure variation between a sentence and its translation. A pair of sentences may have the same structure in the source and the target language, but differ because one phrase has been moved (see, for example, “He has lived in New York since last year” vs “Il habite depuis l’année dernière à New York”: the two sentences have the same structure, except that “in New York” / “à New York” is at the end of the sentence in the French version).
The previous models, mainly based on correspondences at word level, were not very good at tackling this kind of problem. Model 4 therefore modified the distortion probabilities proposed in Model 3 to account for these blocks of text that can move within the sentence.
Model 5
Model 5 did not make any fundamental modifications to Model 4, but it corrected a formal defect of the earlier models, which waste part of their probability mass on configurations that cannot correspond to any real alignment. Model 5 is thus more accurate mathematically, but the calculations required are in fact much more complicated, and the results are similar to, or even worse than, those of the previous model since Model 5 requires more training data. In brief, this model can be ignored here, as it mainly concerned calculation issues and added nothing new from a linguistic point of view.
The Translation (or Processing) Stage
At this point, we need to remind the reader that with the statistical approach, the translation process consists of two fundamental steps. The system first uses a very large corpus of bi-texts (texts aligned at sentence level) to automatically acquire information about the translations of words and about the possible alignment of all the words at sentence level. This is what we have described in the previous section (this phase is often called the “training” or “learning” phase, or even “encoding phase,” since this stage involves encoding information about the language).
The knowledge extracted from the bilingual corpus is then used during the processing phase (also known as the "test" or "online" phase) to translate new sentences submitted to the system. This stage is also known as "decoding," since the system tries to "decode" the input sentence as one would decode a secret message. Each time a new sentence in the source language is submitted as input, the system splits the sentence into words, looks for the most likely translation for each word, and takes into account the constraints on word order as given by the translation model.
The language model makes it possible to evaluate the probability of different candidate translations in the target language. This is crucial for the quality of the result: thanks to the language model, it is possible to consider translations that are not necessarily based on the most probable equivalences at word level. A translation obtained simply by concatenating the most probable translation of each single word may have a very low probability as a whole, whereas a sequence built from individually less probable word translations may have a higher probability at sentence level. Let us take an example. The sentence "The motion fails" contains the word "motion," for which the most probable translation is "mouvement" (see the previous section). By default, a translation (based uniquely on the most probable words) would therefore be "le mouvement est rejeté," which does not mean anything in French. The translation "la motion est rejetée" is much more probable at sentence level (even if "motion" is a less probable translation of the English word "motion" than "mouvement"). This translation is the correct one, and it is rightly predicted by the model.
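The sketch below replays this "motion" example with invented probabilities: the word-level translation model alone prefers "mouvement," but once a hypothetical language-model score for each full candidate sentence is factored in, "la motion est rejetée" wins. The sentence-level probabilities are made up for the illustration and are not the output of any real language model.

```python
# Combining a translation model and a language model, with invented numbers.
# tm: word-level translation probabilities; lm: sentence-level probabilities.
tm = {("motion", "mouvement"): 0.35, ("motion", "motion"): 0.11}
lm = {
    "le mouvement est rejeté": 1e-9,  # grammatical but semantically odd
    "la motion est rejetée":   5e-7,  # idiomatic in a parliamentary context
}

candidates = {
    "le mouvement est rejeté": tm[("motion", "mouvement")],
    "la motion est rejetée":   tm[("motion", "motion")],
}

# Score each candidate as Pr(T) * Pr(S|T), keeping only the factors that vary.
scores = {t: lm[t] * p for t, p in candidates.items()}
print(max(scores, key=scores.get))  # -> "la motion est rejetée"
```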
Finding the best translation actually involves sorting through a multitude of possible choices, each word having multiple translation equivalents, or even not being translated into the target language. The module known as the “decoder” is responsible for finding the best translation possible. Its role is to find the solution that maximizes the score at sentence level, taking the translation model and the language model into account.
The techniques employed to solve this kind of problem can be relatively complex from a computational point of view. The goal is to gradually eliminate the least probable local hypotheses to efficiently converge toward the most probable global solution. This type of algorithm is not specific to machine translation, but is already frequently used for speech analysis. As for speech, finding a good translation involves calculating an optimal score from a very large number of partial analyses that overlap and are often incompatible with one another.
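As a very rough illustration of this pruning idea, the sketch below translates a sentence left to right, keeping only the k best partial hypotheses at each step (a simplified "beam"). The lexicon and probabilities are invented, and the target word order simply follows the source order, so the output still shows the word-order errors that a language model and reordering would fix; real decoders handle reordering, multi-word units, and proper language-model scoring.

```python
import heapq

# Hypothetical word-level lexicon: source word -> [(target word, probability)].
lexicon = {
    "la":      [("the", 0.7), ("it", 0.1)],
    "voiture": [("car", 0.8), ("vehicle", 0.15)],
    "rouge":   [("red", 0.9), ("ruddy", 0.05)],
}

def decode(source_words, beam_size=2):
    """Monotone beam search: extend each hypothesis word by word and keep
    only the beam_size most probable partial translations at each step."""
    beam = [(1.0, [])]                    # (score, partial translation)
    for word in source_words:
        expanded = []
        for score, partial in beam:
            for target, prob in lexicon.get(word, [("<unk>", 1e-6)]):
                expanded.append((score * prob, partial + [target]))
        # Prune: keep only the beam_size best hypotheses.
        beam = heapq.nlargest(beam_size, expanded, key=lambda h: h[0])
    return beam[0]

score, words = decode("la voiture rouge".split())
print(" ".join(words), round(score, 3))
# -> "the car red" 0.504 (word order follows the source in this toy sketch)
```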
Back to the Roots of the Domain?
The IBM models are in some ways a return to the roots of the domain, since the techniques proposed very directly echo several of Weaver’s 1949 propositions. The fact that the module aiming at producing the translation is sometimes called a “decoder” is no coincidence. This is a reminder of the general model of communication, and the goal is to “decode” the source text by translating it into the target language (see chapter 5).
The models designed by IBM achieved phenomenal success. They were revised, modified, and improved. They remain the basis of most machine translation systems used today, although all the major players in the field are now moving toward a new family of techniques known as deep learning (see chapter 12).
However, these models also have their own limitations, the main one being that they require huge quantities of data to achieve reasonable performance. In the following chapter, we will take a look at recent developments around these models, but also at the situation of rare languages for which not enough data are available to achieve accurate statistical translation systems.