Chapter 13: The Evaluation of Machine Translation Systems
As we have seen, translation systems have been the subject of intensive research since the renewal of the field during the 1990s following IBM’s experiments, described in chapter 9. The development of the web drove the main Internet companies to look into the problem, which also helped revive research. The question then arose of how to measure the quality of the systems. How can two systems be compared? How can the development of a single system over time be measured and its improvement tracked?
Additionally, we saw in chapter 2 the difficulty of defining what makes a good translation. It is thus clearly difficult to evaluate the quality of a translation, since any evaluation involves some degree of subjectivity and strongly depends on the needs and point of view of the user. The IBM team, in the seminal 1988 article (see chapter 9), quickly raised the issue by mentioning literary translation. The last word of Proust’s In Search of Lost Time echoes the first word of the first volume (the novel begins with “longtemps” and ends with “temps”). Literary translators must attend to these kinds of details, which are fundamental for the interpretation of a work, but IBM immediately set the problem aside by making it clear that machine translation has nothing to do with literary translation. The IBM team therefore did not address such details, which remain beyond the scope of current research.
Despite the difficulties we have mentioned, it appeared necessary to devise evaluation methods that are reliable, quick, reproducible, and if possible inexpensive. To do this, specific evaluation datasets were produced and evaluation campaigns were organized.
The First Evaluation Campaigns
Since the beginnings of machine translation, evaluation has been perceived as necessary, more so than in other fields of natural language processing, probably because machine translation was seen from the start as an applied field from which very concrete results were expected. We have seen in this regard that the ALPAC report was very negative and rather skeptical about the quality that could be hoped for from such systems (see chapter 6).
At the beginning of the 1990s, with the renewal of research based on the statistical approach originally proposed by IBM, the need to measure the quality of machine translation systems was felt again. As is often the case in natural language processing, it was an American funding agency, the Advanced Research Projects Agency (ARPA, later known as DARPA [1]), that initiated research in this area. A 1994 article (White et al., 1994) reviewed the first attempts at evaluation since the beginnings of research on machine translation. The article reported in particular on the various possible strategies and their limits, described below.
Comprehension Evaluation
To assess comprehension, professional human translators first translated English newspaper articles into different languages. Machine translation systems then translated the text back into English, and human analysts answered “multiple choice questions about the content of the articles” to evaluate the automatic translations, as explained by White and his colleagues. The number of questions the reader of the translation was capable of answering correctly determined the quality of a system. Because the first campaigns focused on a limited number of systems capable of translating into English from different languages, this method was well suited to the task: the text to be translated was provided in various languages, and the translations into English could then be compared. This test was initially named “direct comparability” because it was supposed to allow for a direct comparison of different systems from different source languages.
White et al.’s review of this kind of evaluation was rather mixed. The translations provided by the human translators, although they were supposed to be translations of the same text, were in fact all different and may have posed specific problems for a machine translation system. It was therefore difficult to know if the comprehension errors were to be attributed to the way the original text was phrased or to the translation system itself (not to mention the potential problems related to the interpretation of the text by the reader in charge of the evaluation). The method was eventually abandoned as a means of overall evaluation, but was, however, kept for evaluating the “informativeness” of the text with regard to the original text.
Evaluation Panel
The most obvious way to evaluate the quality of translations is to appeal to human judgment, notwithstanding the great degree of subjectivity it involves. DARPA resorted to this in the early 1990s: the judges had to evaluate the quality of the translation produced, taking into consideration the lexical, grammatical, semantic, and stylistic aspects of the translated texts. As White and his colleagues pointed out, this method seemed attractive in that it also served to evaluate the quality of human translations. However, this type of evaluation encountered two major difficulties. First, from a practical point of view, it was very difficult and expensive to bring together a group of experts for the entire duration of an evaluation campaign. More importantly, the types of errors in the texts produced automatically were so diverse that it was extremely difficult for an expert to assign an overall score to a text (in practice, this score varied enormously from one expert to another, depending on the importance a given expert attached to one type of error or another).
Despite various attempts to homogenize scoring strategies, wide variations between the scores assigned by experts remained. This approach was thus not judged fully satisfactory, and the quality panel evaluation method was abandoned.
Adequacy and Fluency
After the previous attempts involving human experts, DARPA then resorted to two evaluation scores: adequacy and fluency. As White and colleagues described this machine translation (MT) evaluation method: “In an adequacy evaluation, literate, monolingual English speakers make judgments determining the degree to which the information in a professional translation can be found in an MT (or control) output of the same text.” These pieces of information were generally fragments “containing sufficient information to permit the location of the same information in the MT output.” The fluency score aimed to verify correct sentence formation, the task being “to determine whether each sentence is well formed and fluent in context.” These criteria proved easier to use than those previously mentioned and became the standard set of methodologies for the DARPA MT evaluation. However, these measures remained subjective, and it has been shown that the scores assigned by experts still varied significantly.
Human-Assisted Translation
A final evaluation strategy takes as a starting point the fact that no automatic system is capable of producing a perfect translation. It therefore seems relevant to evaluate to what extent a machine translation can help a human translator obtain a good translation. The experiments conducted in the early 1990s involved a novice human translator, who was supposed to derive greater benefit from an imperfect translation than an experienced human translator (who was assumed to be more capable of seeing how to “properly” translate a sentence without the help of an automated process). The evaluation focused on comparing the raw output of the automatic process with the improved translation produced by the translator.
White et al. (1994) reported that this type of evaluation seemed to give interesting results. Nonetheless, several factors made it very difficult to use in practice. First, it was very difficult to control for the “beginner” status of the human translator. There is a great deal of variation from one individual to the next, which makes any comparison very subjective. Second, the added value of the various components of the automated system (especially the components managing the interaction with the translator, which were not directly part of the machine translation system) was difficult to assess. Finally, the majority of the automated systems evaluated already included modules that required some kind of interaction with the user, which made the result of the purely automated translation system difficult to isolate. [2]
In the mid-1990s, three measures were mainly used for evaluation: comprehension, adequacy, and fluency of the generated text. These three measures were interesting but relied largely on human judgment, which is known to be costly and partially inconsistent. This led experts in the field, toward the end of the 1990s, to try to find entirely automatic measures without human intervention.
Looking for Automatic Measures
Automatic evaluation measures aim at answering a simple question: given one (or several) reference translation(s), how can the quality of an automatic translation be measured? Similar questions arose around the same time, at the end of the 1990s and the beginning of the 2000s, and remain relevant today in related areas such as automatic summarization. While the question seems simple, finding an answer is obviously much more complicated. Several measures were defined; we briefly present the main ones below, without going into further mathematical details.
BLEU
The principles of the Bilingual Evaluation Understudy, or BLEU, score (Papineni et al., 2002) are relatively simple. The idea is to compare a reference translation TRef with an automatically produced translation TAuto. The BLEU score is calculated by splitting TRef and TAuto into segments of length 1 to n, called n-grams (it is generally assumed that the most reliable result is obtained when n = 4), and by comparing the number of segments shared between TRef and TAuto. The BLEU formula also includes a factor that takes into account the length of the sentences in the automatically produced translation (the brevity penalty), so as not to favor systems producing sentences that are too short.
If the two texts TRef and TAuto are identical, then the BLEU score is 1 (all segments from TAuto are also part of TRef). If no segment is shared, the score is 0. In other words, the closer TAuto is to the reference translation, the greater the number of shared segments will be, and the closer the BLEU score will be to 1. To improve the robustness of the result, it is possible to compare TAuto with several reference translations without changing the overall idea.
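To make the idea concrete, here is a minimal sketch of a sentence-level BLEU computation in Python. It is not the official implementation (real BLEU is usually computed at the corpus level, with smoothing options and possibly several references), but it shows the two ingredients described above: clipped n-gram precision and a brevity penalty.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous subsequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions (n = 1..4) multiplied by a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped counts: a candidate n-gram is credited at most as many
        # times as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: do not reward translations shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

ref = "the cat sat on the mat near the door"
print(bleu(ref, ref))                      # identical texts -> 1.0
print(bleu(ref, "a dog ran in the park"))  # no shared bigram -> 0.0
```

With several reference translations, the only change would be to clip each candidate n-gram against its maximum count over all references, as in the original formulation.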
NIST
The National Institute of Standards and Technology (NIST), an American organization that runs evaluation campaigns in various fields, developed the NIST score during the same period as the BLEU score (Doddington, 2002). It is based on the same principles: the two texts to be compared, TRef and TAuto, are split into segments (n-grams), and the measure is based on the number of segments from TAuto that also appear in TRef.
The main difference is the inclusion of an informativeness factor: the rarer a segment is, the higher its weight becomes. The NIST score is generally correlated with the BLEU score, which is logical given their broad similarity. The NIST score is meant to take better account of the informational diversity in the texts to be translated.
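As a hedged illustration of what distinguishes NIST from BLEU, the sketch below estimates the information weight of each reference n-gram. In the actual metric (Doddington, 2002), matched n-grams contribute these weights rather than a flat count of 1, and a different brevity penalty is used; the counts here come from a single toy sentence purely for illustration, whereas in practice they are estimated over the full reference data.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous subsequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def nist_info_weights(reference_tokens, max_n=4):
    """Informativeness of each reference n-gram, in the spirit of NIST:
    info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)).
    The rarer an n-gram is relative to its prefix, the higher its weight."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(reference_tokens, n))
    total = len(reference_tokens)
    weights = {}
    for gram, c in counts.items():
        prefix_count = total if len(gram) == 1 else counts[gram[:-1]]
        weights[gram] = math.log2(prefix_count / c)
    return weights

tokens = "the cat sat on the mat near the door".split()
weights = nist_info_weights(tokens)
print(weights[("the",)])        # frequent word, low weight: log2(9/3)
print(weights[("cat",)])        # rarer word, higher weight: log2(9/1)
print(weights[("the", "cat")])  # log2(count("the") / count("the cat"))
```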
METEOR
The METEOR score (“metric for evaluation of translation with explicit ordering”; Banerjee and Lavie, 2005) was developed more recently and tries to better account for semantics. METEOR is based on the identification of semantically “full” words (essentially nouns, verbs, and adjectives) shared between the text to be evaluated and the reference. The idea is then to identify longer sequences of text around these full words that are shared between the two texts. As with other measures, the greater the number of shared segments, and the longer these segments are, the closer the METEOR score will be to 1.
The search for similar segments is not always based on surface forms. Words can be replaced by their stem or their lemma (changing “running” to “run”) or even by synonyms if a semantic resource (such as WordNet) is used. This makes the method more reliable and more robust, but requires adequate semantic resources, which may not be available for all languages. This is why the packages implementing this measure are generally provided with a list of “supported” languages (languages for which such a resource is available).
The authors of METEOR report results that are better correlated to human evaluations than the BLEU or NIST scores. However, METEOR is more difficult to operate than other scores, and the different options (e.g., whether a given linguistic resource has been used or not during the evaluation) make the results more difficult to interpret and compare over time. METEOR is therefore less frequently used than the other scores, especially BLEU, which remains the most widely used measure despite its limitations.
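The sketch below gives a rough idea of the METEOR recipe: align unigrams in several stages (exact match first, then a crude suffix-stripping “stemmer” standing in for a real stemmer or WordNet synonym lookup), compute a recall-weighted harmonic mean of precision and recall, and apply a penalty that grows when the matched words are scattered into many separate chunks. The constants (10, 9, 0.5, 3) follow the original paper (Banerjee and Lavie, 2005), but the matching itself is deliberately simplified.

```python
def simple_stem(word):
    """Crude stand-in for a real stemmer (e.g., Porter) or a synonym
    lookup: strips a few common English suffixes so that 'running'
    can match 'runs'."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def meteor_sketch(reference, candidate):
    """Very simplified METEOR: align unigrams (exact match first, then
    stem match), take a recall-weighted harmonic mean of precision and
    recall, and apply a fragmentation penalty based on the number of
    contiguous matched chunks."""
    ref, cand = reference.split(), candidate.split()
    used = [False] * len(ref)      # reference words already matched
    alignment = []                 # (candidate index, reference index) pairs
    for stage in ("exact", "stem"):
        for i, cw in enumerate(cand):
            if any(ci == i for ci, _ in alignment):
                continue           # candidate word already matched
            for j, rw in enumerate(ref):
                if used[j]:
                    continue
                same = cw == rw if stage == "exact" else simple_stem(cw) == simple_stem(rw)
                if same:
                    used[j] = True
                    alignment.append((i, j))
                    break
    m = len(alignment)
    if m == 0:
        return 0.0
    precision, recall = m / len(cand), m / len(ref)
    fmean = 10 * precision * recall / (recall + 9 * precision)
    # Chunks: maximal runs of matches that are adjacent in both sentences.
    alignment.sort()
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return fmean * (1 - penalty)

print(meteor_sketch("the cat is running in the garden",
                    "the cat runs in the garden"))   # about 0.85
```

Note how the stem stage rescues the pair “runs”/“running”, which a purely surface-based n-gram measure would simply count as a mismatch.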
Comments on Automatic Evaluation Measures
All the measures presented here rely on the comparison of short sequences of words (n-grams, with n varying generally from one to four words) between a reference text and an automatically produced translation. We have seen that some measures try to take into account richer information (lemma, synonyms) but most of the time the evaluation is just based on surface forms (i.e., words as they appear in the text). The reader may be surprised by the poverty of the information used for evaluation, which completely eliminates notions such as style, fluency, or even the grammaticality of sentences. Since evaluation simply takes into account short sequences of words, it is clear that a completely illegible text consisting of random and meaningless sentences could obtain a rather good score, provided that the sentences are made of sequences of four words shared with the text used as a reference.
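A small demonstration of this point, assuming the nltk package is installed: the “translation” below is plainly unreadable, yet it reuses long chunks of the reference and therefore obtains a high BLEU score.

```python
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat sat on the mat and the dog slept near the door".split()
# A clearly ill-formed rearrangement that nonetheless reuses long chunks
# of the reference sentence.
scrambled = "the dog slept near the door the cat sat on the mat and".split()

# sentence_bleu expects a list of tokenized references and a tokenized hypothesis.
print(sentence_bleu([reference], scrambled))  # roughly 0.85 despite being unreadable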
These biases are well known, but are not as problematic as it may seem at first sight. Since the target is to develop operational systems, there is no incentive to pursue a system that only seeks to obtain good results without worrying about the quality of the text produced. In evaluation campaigns, the output of the system is made public, so a team that obtained good results with semantically preposterous texts would not gain any benefit from them.
More fundamentally, the gap between the complexity of the translation task and the relative simplicity of the automated evaluation methods reveals that evaluation is a real issue. It is difficult to formalize notions such as that of “a good translation,” since nobody knows how to define this, let alone notions such as coherence or style. The methods used for evaluation are thus rather poor but obtain decent degrees of correlation with evaluations performed by human experts, which is considered the crucial factor for the task.
The Proliferation of Evaluation Campaigns
While machine translation was a moderately active research field during the 1980s, the renewal that took place following IBM’s work during the 1990s contributed to the increased number of evaluation campaigns. Since the early 2000s, several campaigns have been organized each year throughout the world.
DARPA has organized evaluation campaigns from Chinese and Arabic into English since 2001. The texts used for evaluation are stories from news agencies. Each year since 2005, the Workshop on Machine Translation (WMT) conference has also organized an evaluation campaign covering certain European languages; for example, in 2013, the evaluation focused on the following language pairs: French-English, Spanish-English, German-English, Czech-English, and Russian-English. For each pair, both directions of translation are evaluated (for example, French to English and English to French). The organizers provide participants with an initial collection of texts for development (generally a collection of aligned sentences that participants can use to adapt their system to the task), but participants can also use their own data (other corpora, bilingual or monolingual dictionaries, etc.). Upon submitting their results, participants must state whether they used only the provided data for the evaluation or whether they also used other resources.
The European Commission has strongly supported the WMT evaluation campaigns from the beginning. WMT relies largely on the availability of the Europarl corpus, which contains the transcriptions of the European Parliament debates. The corpus, available in 21 languages, is specifically dedicated to machine translation: texts are aligned semiautomatically with great precision. It is an unrivaled training corpus for the development of automatic systems (see chapter 7).
It should be noted that the WMT campaign is not interested only in the evaluation of systems. A task is specifically dedicated to evaluation measures; the quest for automatic measures that correlate more closely with manual evaluation remains an active research area. More recently, the evaluation of the textual quality of the automatically generated translations has also become a major concern, since traditional evaluation methods based only on small fragments of text (n-grams) leave completely open the question of the quality of the produced text, as well as its readability.
Lessons Learned from Automatic Evaluation
Automatic evaluation is important to measure the performance and evolution of systems over time. Even if automatic measures are not completely satisfactory, they make it possible to measure evolutions that generally correlate with the perception a human has of the overall quality of the systems. In other words, it has been shown that a system obtaining scores that improve over time produces translations that indeed seem to improve according to human experts. The great differences observed in the results obtained when translating among different language pairs should also be addressed. Several features may affect performance: for example, a limited amount of training data, morphologically-rich languages that are known to be harder to process automatically, or translations between genetically distant languages.
Measuring the Difficulty of the Task According to Language Pairs
Figure 19 (Koehn et al., 2009) shows the result of an experiment on 22 European languages using the same basic translation system (Moses) and comparable training data for each language pair. The training data came from the JRC-Acquis corpus, which consists of texts translated and aligned across 22 European languages. The kind of text and the quantity of data were therefore the same for each language pair taken into account in this experiment. The scores displayed in figure 19 are BLEU scores.
Figure 19 Performance obtained with the same standard statistical translation system applied over 22 different European languages. The translation system is based on the standard Moses toolbox, the corpus used is the JRC-Acquis corpus (see chapter 7), and the metric used is the BLEU score. Dark grey cells correspond to a BLEU score performance over 0.5, and light grey cells to a BLEU score performance under 0.4 (blank: between 0.4 and 0.49). Language abbreviations: bg: Bulgarian; cs: Czech; da: Danish; de: German; el: Greek; en: English; es: Spanish; et: Estonian; fi: Finnish; fr: French; gr: Greek; hu: Hungarian; it: Italian; lt: Lithuanian; lv: Latvian; mt: Maltese; nl: Dutch; pl: Polish; pt: Portuguese; ro: Romanian; sk: Slovak; sl: Slovene; sv: Swedish (note that et, fi, and hu are Finno-Ugric, mt is Semitic, and all other languages are Indo-European). Figure taken from Koehn et al., 2009. Reproduced with the authorization of the authors.
The scores are not significant by themselves, but their comparison is particularly instructive, since they reveal the difficulty of the translation task depending upon the languages under consideration.
While some language pairs would no doubt obtain better results if language specificities were taken into account, the goal of the experiment was precisely to emphasize differences between languages (through evaluation scores) when using a standard translation algorithm, such as the traditional IBM model, without language-dependent optimization.
The results obtained are interesting. For example, they show the difficulty of processing languages that are very distant from English (for example, Finnish, Hungarian, and Estonian all obtained poor scores, just like Maltese). A more thorough observation of the results shows that morphologically-rich languages are more difficult to translate, since they can add—or “agglutinate”—several morphemes at the end of a lexical form, such as case markers expressing the function of the word in the sentence (along with morphemes expressing possession, determination, etc.). Some Indo-European languages, such as Slavic languages or even German, although not considered as agglutinative, have a rich morphology and do not obtain very good scores. Verbs with a separable prefix and compound words are also difficult to process, which explains why German obtains poor results. For morphologically-rich languages, a proper syntactic analysis is necessary to provide an accurate translation; for example, one must know if a noun is the subject or the object in order to decide if it is the nominative or the accusative form that should be selected for the translation. Figure 20 also shows some simple examples for Finnish, a language in which it is possible to generate an almost infinite number of word forms from a simple word because various morphemes can be added to the basic word form.
Figure 20 Variations of the word “book” in Finnish, depending on its grammatical function.
Statistical methods can identify correct translations even without undertaking a deep syntactic analysis, especially with the “segment-based approach.” Since a segment is a sequence of words, this approach directly takes advantage of the context (the context being nothing more than the sequence of words around a given word) and avoids the problems of a purely word-for-word approach. Nevertheless, the probability of finding a correct translation is inevitably lower for a morphologically-rich language than for a language whose words vary little with context and that relies instead mainly on prepositions and determiners. Languages like English or French are called “analytical” or “isolating” languages, since they show little morphological variation and have a complex system of prepositions. Chinese is also an analytical language, though Finnish is not, as we have just seen.
For the same reason, it is clear that there is a major bias in the evaluation procedure: evaluation is based on the number of sequences of words shared between an automatic translation and one or several reference translations. Agglutinative languages are thus clearly disadvantaged because, for this kind of language, morphemes are concatenated (i.e., “glued” or “agglutinated”) to the basic word forms. The result is that, for these languages, one long word may include several morphemes corresponding to many different types of linguistic information, whereas French or English can deliver the same information merely through a sequence of several small invariable words. This kind of sequence in French or English is then obviously the source of lots of relevant segments (“n-grams”) for evaluation. The disadvantage is thus twofold for morphologically-rich languages: analytical languages present more long sequences of words that are likely to improve the evaluation scores (such as frequent prepositional phrases, for example), whereas agglutinative languages present complex linguistic forms that are therefore difficult to analyze accurately.
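A deliberately tiny worked example of this twofold effect, under the assumption that both texts are evaluated on whitespace-separated tokens: the Finnish form kirjassani (kirja ‘book’ + -ssa ‘in’ + -ni ‘my’, roughly “in my book”) packs into a single token what English spreads over three invariable words, so the English side simply offers more n-grams to match, while a single wrong suffix on the Finnish side destroys its only matchable unit.

```python
def ngrams(tokens, n):
    """All contiguous subsequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# English spreads the information over several short invariable words...
english = "in my book".split()
# ...while Finnish packs it into one agglutinated form:
# kirja ('book') + -ssa ('in') + -ni ('my').
finnish = "kirjassani".split()

for tokens in (english, finnish):
    total = sum(len(ngrams(tokens, n)) for n in range(1, 5))
    print(tokens, "->", total, "n-grams of length 1 to 4")

# English offers 3 + 2 + 1 = 6 candidate segments for matching; the single
# Finnish word offers only 1, and an error in one suffix (e.g. kirjassasi,
# 'in your book') makes even that one unmatchable.
```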
Let’s now turn to the case of English. English has, without doubt, a very poor morphology (especially when compared to other languages), which contributes to the good scores obtained for this language. Most of the time, automatic systems do not have to calculate the right word form in context for English, since words vary little. The availability of very large amounts of data is also a considerable advantage, of course (calling to mind Mercer’s “there is no data like more data”; see chapter 7), but, beyond that, it is also the specificities of English, especially its poor morphology, that explain the good results obtained for this language. This naturally brings us to take a look at the errors produced by translation systems.
Typology of Translation Errors
There are very few studies proposing a typology of errors made by machine translation systems. Such a task is in any case difficult and subjective, partly because it depends on the language and on the translation system considered, and partly because the errors are difficult to classify and often vary.
Vilar et al. (2006) tried nonetheless to propose such a typology. Their typology included the following categories: unknown words (words in the source language unknown to the translation system), poorly translated words (wrong meaning, incorrect word form, badly translated idiomatic expression, etc.), word-order problems (problems related to word order in the target language), and missing words in the target sentence. They showed that such an analysis is possible in specific cases (especially when the language pair involves closely related languages) and can help identify certain weaknesses in the system so that they can be resolved later on (systematic word-meaning errors, etc.). This type of analysis is especially useful in the case of rule-based systems developed manually, because it allows the developer of the system to correct certain rules or formulate new rules when faced with the main weaknesses observed.
As for statistical systems, the sources of error are more widespread and much more difficult to correct since the systems are not intended to be modified manually. In practice, the system must be “retrained” with new data to have a hope of correcting the identified errors, but the procedure is cumbersome. Moreover, since training is done on very large quantities of data, errors cannot be corrected one by one, and the learning procedure cannot be fully controlled since the process is by definition global and automatic. It is thus hard to correct a specific error in the case of a statistical machine translation. Hybrid systems, as we have seen, try to keep the best of both worlds, making it possible to make generalizations from large amounts of data while keeping as far as possible the ability to make accurate and local corrections.
Finally, it should also be kept in mind that it is indeed the language pair that is the key variable: the types of errors depend, above all else, on the characteristics of each language considered, for the reasons outlined in the previous section (availability of large or small amounts of data for training, morphologically-rich or -poor language, etc.). Machine translation is sometimes criticized for more fundamental reasons: the techniques used in the field remain to a large extent very close to the text, meaning that the result will also be close to a word-for-word translation (or phrase-for-phrase translation). However, we have seen some cases where a proper translation requires the analysis of the complete sentence and cannot simply be based on local equivalencies between words or phrases; see, for example, the case in chapter 10, where “the poor don’t have any money” is translated by “les pauvres sont démunis” (literally, “the poor are destitute”). Kay (2013) cites the more complex case of “please take all your belongings with you when you leave the train,” which corresponds to the French “veuillez vous assurer que vous n'avez rien oublié dans le train” (literally, “please make sure you have not forgotten anything on the train”). These two sentences are semantically similar and are often heard in trains before arriving at a destination.
They adopt a different logic, however, with English insisting on the bags to take, and French on the fact of not forgetting anything. Kay considers that this type of translation is extremely frequent and is out of reach of automatic systems. While we may agree with Kay on this last point, it is perhaps not as fundamental as he claims. A correct translation could be found in French that is much closer to the English original sentence, for example: “veuillez vous assurer que vous prenez tous vos effets avec vous au moment de quitter le train.” This is the type of translation that an automatic system would aim for. More idiomatic translations (even literary ones) are a sign of human translation, but this is not necessarily the goal of machine translation systems.
Notes
[1] The Advanced Research Projects Agency (ARPA) is an American agency founded in 1958 and responsible for the development of emerging technologies in the USA. The name of the agency has changed several times, and the agency is probably better known now under the acronym DARPA (where D stands for Defense), its name since 1972 (except between 1993 and 1996).
[2] This can also be seen as somewhat contradictory: if the approach is designed as needing to be interactive from the beginning, it is not necessarily relevant to evaluate the translations produced completely automatically. It is, rather, the capacity of the system to provide relevant translational elements that should be evaluated.