Chapter 12: Deep Learning Machine Translation
Over the past several years, a new type of statistical learning called “deep learning” or “hierarchical learning” has emerged in the wake of neural networks. Neural networks were originally inspired by the biological brain: neurons transmit and process basic information, from which the brain is able to build complex concepts and ideas.
Artificial neural networks, like the brain, are supposed to be able to build complex concepts from different pieces of information assembled in a hierarchical manner. But, as outlined in Goodfellow et al. (2016, p. 13): “the modern term ‘deep learning’ goes beyond the neuroscientific perspective on the current breed of machine learning models. It appeals to a more general principle of learning multiple levels of composition, which can be applied in machine learning frameworks that are not necessarily neurally inspired.”
This approach has received extensive press coverage. This was particularly the case in March 2016, when Google DeepMind’s system AlphaGo—based on deep learning—beat the world champion in the game of Go. This approach is especially efficient in complex environments such as Go, where it is impossible to systematically explore all the possible combinations due to combinatorial explosion (i.e., there are very quickly too many possibilities to explore them all).
The complexity of human languages is somewhat different: the overall meaning of a sentence or of a text is based on ambiguous words, with no clear-cut boundaries between word senses, and all in relation to one another. Moreover, word senses do not directly correspond across different languages, and the same notion can be expressed by a single word or by a group of words, depending on the context and language considered. This explains why it is impossible to manually specify all the information that would be necessary for an automatic machine translation system, but also why the translation task has remained highly challenging and computationally expensive up to the present time. In this context, deep learning provides an interesting approach that seems especially fitted for the challenges involved in improving human language processing.
An Overview of Deep Learning for Machine Translation

Deep learning achieved its first success in image recognition. Rather than using a group of predefined characteristics, deep learning generally operates on a very large set of examples (hundreds of thousands of images of faces, for example) to automatically extract the most relevant characteristics (called features in machine learning). Learning is hierarchical, since it starts with basic elements (pixels in the case of an image, characters or words in the case of a language) in order to identify more complex structures (segments or lines in an image; sequences of words or phrases in the case of a language) until it obtains an overall analysis of the object to be analyzed (a form, a sentence). An analogy is often drawn with human perception: on the one hand, the brain analyzes groups of simple items very rapidly in order to identify higher-level characteristics, and on the other hand, it recognizes complex forms from characteristic features, and can even extrapolate a complex representation from partial information (this is essentially what happens in the case of the Necker cube, where the brain infers a three-dimensional representation from a two-dimensional drawing; see figure 1 in chapter 2).

In the case of machine translation, deep learning makes it possible to envision systems where very few elements are specified manually, the idea being to let the system infer by itself the best representation from the data. This was, in a way, already the idea behind purely statistical models, but we have seen that in practice many parameters had to be adjusted manually. For example, five models were proposed in the early 1990s by IBM for machine translation, each model introducing a different manually defined representation to correct certain defects of the previous model. Deep learning, on the contrary, makes it possible, at least in theory, to learn complex characteristics fully autonomously and gradually from the data, without any prior human effort.
A translation system based solely on deep learning (aka “deep learning machine translation” or “neural machine translation”) thus simply consists of an “encoder” (the part of the system that analyzes the training data) and a “decoder” (the part of the system that automatically produces a translation from a given sentence, based on the data analyzed by the encoder). We have already seen this vocabulary for the traditional statistical approach (see chapter 9), but here the encoder and the decoder are based entirely on a neural network, whereas traditional statistical approaches use a combination of modules (typically, a language model and a translation model for the encoder part) in order to apply different optimization strategies. In a neural network, each word is encoded as a vector of numbers, and all the word vectors are gradually combined to provide a representation of the whole sentence. In a way, we can say that deep learning machine translation adopts a more traditional architecture than statistical machine translation, since the encoder can be seen as the analyzer of the source language, whereas the decoder generates the translation in the target language (as in Vauquois’ triangle; see figure 2 in chapter 3).
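To make the encoder-decoder idea more concrete, here is a minimal sketch in PyTorch (the vocabulary sizes, vector dimensions, and the choice of GRU layers are illustrative assumptions, not a description of any particular production system; real systems add attention, subword vocabularies, and beam search):

```python
# A toy encoder-decoder: the encoder turns a sequence of word indices into a
# single hidden vector, and the decoder produces target-word scores from it.
import torch
import torch.nn as nn

VOCAB_SRC, VOCAB_TGT, EMB, HID = 1000, 1000, 64, 128  # illustrative sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SRC, EMB)      # each source word -> vector
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src_ids):                        # src_ids: (batch, src_len)
        _, hidden = self.rnn(self.embed(src_ids))
        return hidden                                  # summary of the whole sentence

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_TGT, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB_TGT)           # scores over the target vocabulary

    def forward(self, tgt_ids, hidden):                # tgt_ids: previously produced words
        output, hidden = self.rnn(self.embed(tgt_ids), hidden)
        return self.out(output), hidden

# Toy usage: "encode" a 5-word source sentence, then score 4 target positions.
src = torch.randint(0, VOCAB_SRC, (1, 5))
tgt = torch.randint(0, VOCAB_TGT, (1, 4))
state = Encoder()(src)
logits, _ = Decoder()(tgt, state)
print(logits.shape)                                    # torch.Size([1, 4, 1000])
```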
With deep learning, the simultaneous management of various types of information enables more reliable decision making. These models are said to be hierarchical, but they are in fact multidimensional, meaning that each element (word, phrase, etc.) is placed within a richer context. Following the famous motto “you shall know a word by the company it keeps” (from the British linguist Firth), the approach is based on the hypothesis that words appearing in similar contexts may have a similar meaning. The system thus tries to identify and group words appearing in similar translational contexts into what are called “word embeddings.” This approach makes the process a lot more general and thus more robust than what we have seen so far: it may not be a problem if a word is rare, since other words appearing in similar contexts may indicate a valuable translation. The fact that a word has different meanings is not a problem either, since it can belong to different embeddings, reflecting different contexts of use.

A second characteristic of deep learning approaches is that these models are said to be “continuous.” This was already partially the case with statistical machine translation, since in that framework words can be considered “more or less” similar to each other (meaning that every pair of words has a similarity score between 0 and 1). This representation seems cognitively more plausible than the one given, for example, by traditional synonym dictionaries: there are indeed plenty of cases where words have a “more or less” strong similarity without being strictly synonymous. The deep learning approach generalizes this idea, so that words, but also higher linguistic units, like phrases, sentences, or simply groups of words, can be compared in a continuous space, which makes the approach highly flexible and able to identify, for example, paraphrases.
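As a toy illustration of how such continuous representations can be compared, the sketch below measures the cosine similarity between hand-made word vectors (the words, dimensions, and values are invented for illustration; real embeddings have hundreds of dimensions and are learned automatically from corpora):

```python
import numpy as np

# Invented 4-dimensional "embeddings"; real ones are learned, not hand-written.
embeddings = {
    "cat":   np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.3, 0.1]),
    "table": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(u, v):
    # Values close to 1 mean the words occur in similar contexts.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["cat"], embeddings["dog"]))    # high: similar contexts
print(cosine(embeddings["cat"], embeddings["table"]))  # low: different contexts
```

The same comparison can in principle be applied to vectors representing phrases or whole sentences, which is what makes the continuous space useful for detecting paraphrases.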
Lastly, it should be noted that closely related words inside a sentence are also gradually identified and grouped together during the analysis. This is why the deep learning approach is said to be hierarchical, since it is able to discover structure (relations between words or groups of words) inside a sentence, based on regularities observed in the thousands of examples given to the system during training: although deep learning systems do not directly encode syntax, they are supposed to be able to automatically identify relevant syntactic relations.
In brief, rather than having different modules considering different parts of the problem at a time, the deep learning approach to machine translation considers directly the whole sentence without having to decompose it into smaller segments, and also considers all kinds of relations in context at the same time. The fact that these relations can be vertical (groups of similar words that can fill a position in a sentence) or horizontal (syntactically related groups of words in a sentence) makes the approach highly flexible and cognitively interesting, but also computationally challenging.
There have in fact been several generations of artificial neural networks (the approach has only recently come to be called “deep learning”). Neural networks were actually invented in the 1950s and flourished again in the 1980s—but the computational power of the machines at the time did not allow for managing the complexity of the representations involved (Goodfellow et al., 2016, pp. 13–28). Even today, the training phase of such a system may last for days. Specific processors and programming techniques (GPU-accelerated programming) are used to speed up the process. The approach is also, in reality, a lot more complex and abstract than what we have just described. A context is, for example, encoded as a vector of numbers, each number representing a feature (an abstract property automatically discovered by the neural network from the regularities in the corpus), with the length of the vector corresponding to a predefined value. A recent development consists in dynamically adjusting this value so that more or less information can be encoded depending on the complexity of the task.
It should also be noted that the approach remains empirical, especially when it comes to defining the architecture of the neural network used (e.g., the number of layers in the neural network, the length of the vectors used) as well as other parameters (e.g., the way unknown words are processed); there is little theoretical basis for these choices, which are mainly based on system performance and efficiency. These systems are sometimes criticized as lacking theoretical foundation for this reason.
Nonetheless, deep learning is a real step forward and has enabled significant improvements in the field of image recognition, speech processing, and, more recently, natural language processing. Some researchers today are going so far as to challenge traditional disciplines such as syntax, because through deep learning it is possible to infer structure from the data. In other words, it would be better to let the system determine on its own the best representation for a given sentence!1 It remains, however, necessary to put these claims into perspective: probably because of the dramatic amount of variation in sentences, systems still frequently fail to recognize the overall sentence structure, which can lead to major translation errors. Still, deep learning opens a window toward the resolution of such problems, hence the great success of this technique among researchers in the domain.
Current Challenges for Deep Learning Machine Translation
Until recently, machine translation systems based on deep learning performed well on simple sentences but still lagged behind traditional statistical systems for more complex sentences. There were different reasons for that, as explained by the Google team working on the question: first, training neural networks for the task is still difficult due to their complexity, especially the number of parameters that have to be automatically adjusted. This led to various efficiency problems. Second, unknown words (i.e., words not included in the training data) are generally not processed accurately (or are just ignored) in this approach. Finally, groups of words are sometimes not translated, leading to strange and inaccurate translations. For some time, this prevented purely neural approaches from being effectively deployed in commercial systems. This is, however, no longer the case, since efficient solutions are emerging.
Optimization techniques have been used to reduce learning complexity in the encoder as well as the decoder. What are called “attention” mechanisms also play a growing role in neural network architectures, especially for machine translation. The “attention module” helps create connections between the encoder and the decoder, a bit like the way in which a transfer rule in a rule-based machine translation system formalizes how a linguistic structure in the source language must be rendered in the target language. However, the analogy should not be taken too far: here, again, the process is a lot more abstract than what can be found in traditional transfer rules.

Intuitively, the approach is based on the fact that some words in the source sentence are especially important when it comes to translating a specific word in the target language (or, put differently, not all the words in the source sentence are equally relevant at every point in the translation process). When one translates from French to English, the two languages have relatively similar structures, so that the translation process can be relatively sequential, especially when dealing with short sentences (10 words or fewer): knowing the n previous words in the source sentence is often enough to produce the next word in the target language. Longer sentences have a more variable word order; attention mechanisms then help the system dynamically focus at any time on the most relevant parts of the sentence to be translated. It can be useful, for example, to keep in memory the fact that there is a link between a verb and its subject (especially if a long sequence of words is inserted between them): this link may play a prominent role, for example, in controlling agreement when the system generates the verb in the target language. It has been shown that attention mechanisms capable of focusing on relevant source words considerably increase overall translation quality.
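The sketch below shows one simple way such an attention step can be computed, namely a dot-product score between the current decoder state and each encoder state, turned into weights that sum to 1 (the dimensions and random values are illustrative; actual systems learn these vectors and often use more elaborate scoring functions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    # Score every source position against the current decoder state,
    # then build a weighted summary (the "context") of the source sentence.
    scores = encoder_states @ decoder_state   # one score per source word
    weights = softmax(scores)                 # attention distribution (sums to 1)
    context = weights @ encoder_states        # focus on the most relevant words
    return context, weights

# Toy example: 6 source positions, hidden vectors of size 4.
rng = np.random.default_rng(0)
enc = rng.standard_normal((6, 4))
dec = rng.standard_normal(4)
context, weights = attend(dec, enc)
print(np.round(weights, 3), round(weights.sum(), 3))
```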
Attention mechanisms are especially useful to deal with long sentences, but it is also assumed that they will play an even stronger role when dealing with typologically diverse languages, for example when translating between English and Japanese, since in Japanese the verb is located at the end of the sentence.
The unknown word problem is a real challenge for deep learning approaches, since only words contained in the training data are part of the model and can thus be translated. When statistical systems were modular, it was easy to integrate a module dealing with unknown words. It is more difficult with the deep learning model, which offers a more holistic approach. Some “patches” have, however, recently been found. Because unknown words are in fact often proper nouns or numbers, some systems just “copy” the unknown word from the source to the target language. When writing systems are different (for example, when translating from Arabic or Chinese to English), transliteration works reasonably well and can be a valuable solution.
Unfortunately, it is also quite frequent to find unknown words that are neither proper names nor numbers. A working solution then consists in trying to decompose unknown words into smaller units so as to find relevant cues to help the translation process, but this approach is not fully satisfactory; unknown words remain an open problem for deep learning approaches.
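To illustrate what such a decomposition can look like, the sketch below segments an unseen word into smaller known pieces with a greedy longest-match strategy (the subword vocabulary here is invented; real systems typically learn it automatically from the training corpus, for example with byte-pair encoding):

```python
# Invented subword inventory; a real one is learned from data.
SUBWORDS = {"un", "re", "translat", "able", "ness", "work", "ed", "ing", "s"}

def segment(word, vocab=SUBWORDS):
    """Greedy longest-match segmentation; unknown characters are kept as-is."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # fall back to a single character
            i += 1
    return pieces

print(segment("untranslatableness"))  # ['un', 'translat', 'able', 'ness']
```

Even if the whole word was never seen during training, its pieces may have been, which gives the system at least partial cues for translating it.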
Lastly, it is necessary to verify that the system translates the whole sentence and does not omit sequences of words from the source sentence. This may seem surprising for such sophisticated machine translation systems, but the truth is that since the sentence is analyzed globally and not decomposed into segments, as in previous statistical approaches, the system can fail to translate some words or phrases because they are loosely related to the core of the sentence or for other more mysterious reasons. To solve the problem, the Google team proposes a length penalty that helps the system favor longer translations, so as to decrease the weight of candidate translations in which part of the initial sentence is left untranslated. This trick is simple and effective, but the problem shows how hard it is to understand and analyze the way a neural system works, since the internal representation of the data is purely numerical, huge, and complex, and more importantly not directly readable by a human being. A very promising line of research, however, consists in trying to obtain meaningful representations of the internal model computed by the neural network, so as to better understand how the whole approach works.
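As an illustration of the idea, the sketch below compares two candidate translations with and without length normalization, using the penalty formulation reported for Google's 2016 neural translation system (the alpha value and the toy scores are purely illustrative):

```python
def length_penalty(length, alpha=0.6):
    # Normalization term that grows with the candidate's length.
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def score(log_prob, length, alpha=0.6):
    # Without normalization, longer candidates accumulate more negative
    # log-probabilities, so the decoder is tempted to drop parts of the
    # source sentence; dividing by the penalty removes that bias.
    return log_prob / length_penalty(length, alpha)

partial = score(log_prob=-6.0, length=5)   # candidate that drops half the sentence
full = score(log_prob=-7.5, length=11)     # candidate covering the whole sentence
print(partial, full)                       # the full translation now scores higher
```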
The deep learning approach to machine translation (or neural machine translation) has proven efficient, first on short sentences in closely related languages, and more recently on long sentences as well as on more diverse languages. Progress is very quick, and the deep learning approach can be considered a revolution for the domain, as the statistical approach was at the beginning of the 1990s. It is interesting to note how quickly deep learning approaches have spread. All the major players in the domain (Google, Bing, Facebook, Systran, etc.) are moving toward deep learning, and 2016 saw the deployment of the first online systems based on this approach. This can be contrasted with the advent of the statistical approach, which took several years to dominate the market and supersede rule-based systems. The deployment of deep learning solutions has been much faster. It also means that the approach is now robust and mature enough to outperform statistical approaches.
However, this approach is still in its infancy, and rapid progress can be expected in the near future. More efficient solutions will be proposed for the problems described above, for example to deal with unknown words. It should also be noted that some players in the field still favor a more modular solution so that specific issues can be solved more accurately (neural networks are then just introduced locally, in some modules of a traditional statistical machine translation system, for example). In a way, this may go against the philosophy of the neural approach, since processing data at the sentence level is the source of most of the improvements described in this chapter. The future will tell which approach is the best. As a conclusion to this chapter, we should remember that the world chess champion was beaten by a computer in 1997 and the world Go champion was beaten by a computer in 2016, but no computer is able to translate accurately between two languages even today! This shows the complexity of natural languages.
Note
1. This is what happens in practice, anyway. The question is then, in fact, whether the structure inferred by the computer makes more sense than the syntactic structure a human would provide.