Home : Machine Translation (The MIT Press) - yanghaocsg/machine_translation GitHub Wiki

## 1 Introduction

In Douglas Adams’ humorous saga The Hitchhiker’s Guide to the Galaxy,1 all one needs to do to understand any language is to introduce a small fish (the Babel fish) into one’s ear. This improbable invention is of course related to the idea of a universal translation device,2 and more generally to the key problem of language diversity and comprehension. The name of the fish is a transparent allusion to the Biblical episode of Babel, when God scrambled language so that humans could no longer understand one another.

A significant number of thinkers, philosophers, and linguists—and, more recently, computer scientists, mathematicians, and engineers— have tackled the question of language diversity. Moreover, they have imagined theories and devices intended to solve the problems caused by this diversity. Since the advent of computers (after the Second World War), this research program has materialized through the design of machine translation tools—in other words, computer programs capable of automatically producing in a target language the translation of a text in a source language.

This research program is very ambitious: it is even one of the most fundamental in the field of artificial intelligence. The analysis of languages cannot be separated from the analysis of knowledge and reasoning, which explains the interest in this field shown by philosophers and specialists of artificial intelligence as well as the cognitive sciences. This brings to mind the test proposed by Turing3 in 1950: the test is successfully completed if a person dialoguing (through a screen) with a computer is unable to say whether her discussion partner is a computer or a human being. This test is foundational, because developing an operational conversational agent presupposes not only understanding what the discussion partner says (at least to some extent), but also inferring from what has been said a relevant utterance that helps the whole conversation move forward. For Turing, if the test is successful, it means that the machine has a certain degree of intelligence. This question has fueled considerable debate, but we can at least agree on the fact that a robust conversational system would involve formalizing some mechanisms of understanding and reasoning.

Machine translation involves different processes that make it at least as challenging as developing an automatic dialogue system. The degree of “understanding” shown by the machine can be very partial: for example, the Eliza system developed by Weizenbaum in 1966 was able to simulate a dialogue between a psychotherapist and his patient. The system in fact just derived questions from the patient’s utterances (for example, the system was able to produce the question “why are you afraid of X?” from the sentence “I am afraid of X”). The system also included a series of ready-made sentences that were used when no predefined pattern seemed to be applicable (for example “could you specify what you have in mind?” or “really?”). Despite its simplicity, Eliza had great success, and some patients really thought they were conversing with a real doctor through a computer.
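The pattern-matching idea behind Eliza can be sketched in a few lines. The following is a minimal illustration, not Weizenbaum's original program: the rules and fallback sentences here are invented examples in the spirit of the ones quoted above.

```python
import re

# Illustrative Eliza-style rules: each pairs a pattern with a question
# template. More specific rules come first, since the first match wins.
RULES = [
    (re.compile(r"i am afraid of (.+)", re.I), "Why are you afraid of {0}?"),
    (re.compile(r"i am (.+)", re.I), "How long have you been {0}?"),
]

# Ready-made sentences used when no pattern applies.
FALLBACKS = ["Could you specify what you have in mind?", "Really?"]

def respond(utterance, fallback_index=0):
    """Derive a question from the utterance, or fall back to a canned reply."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Fill the template with the matched fragment, minus punctuation.
            return template.format(match.group(1).rstrip(".!?"))
    return FALLBACKS[fallback_index % len(FALLBACKS)]

print(respond("I am afraid of spiders"))  # -> Why are you afraid of spiders?
print(respond("The weather is nice."))    # -> Could you specify what you have in mind?
```

No understanding is involved: the system only reflects the surface form of the input back to the user, which is exactly why this level of "intelligence" falls so far short of what translation requires.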

The situation is completely different when considering machine translation. Translation requires in-depth understanding of the text to be translated. Moreover, transposition into another language is a delicate and difficult process, even with news or technical texts. The aim of machine translation is not, of course, to address literature or poetry; rather, the idea is to give the most accurate translation of everyday texts. Even so, the task is immensely difficult, and current systems are still far from satisfactory.

However, and despite its limitations, from a more theoretical point of view, machine translation also makes us take a fresh look at old and widely investigated questions: What does it mean to translate? What kind of knowledge is involved in the translation process? How can we transpose a text from one language to another? These are some of the questions that are addressed in this book.

This short book aims at providing an overview of the progress in machine translation since the Second World War. Some pioneers will be mentioned, but it is mainly the research implemented with computers that will be addressed. The content of the book is thus partly historical, since the main approaches to the problem will be presented in an intuitive manner: the idea is to make sure that the reader can understand the main principles without having to know all the technical details. Specifically, recent approaches based on the statistical analysis of very large corpora of texts will be presented, but these approaches are highly technical and we will skip the mathematical details that are not necessary to grasp the overall idea. More technical books exist for those who are interested in the full details of the different approaches.

The book begins with a presentation of the main problems one has to solve when developing a machine translation system (chapter 2). The journey continues with a quick overview of the evolution of machine translation (chapter 3), followed by a more detailed presentation of the history of the field, from its beginnings before the advent of computers (chapter 4) to the most recent advances based on deep learning (chapter 12). Along the way, we will encounter all the main approaches developed since the field’s beginning: rule-based approaches (chapter 5) up to the ALPAC report and its consequences (chapter 6); and the advent of parallel corpora (chapter 7), which fueled research in the field after the 1980s, first through the example-based paradigm (chapter 8), then through the most popular statistical paradigm (chapter 9) along with its more recent developments—the segment-based approach (chapter 10) and the introduction of more linguistic knowledge to the systems (chapter 11). This book is not limited to a presentation of the main approaches to the problem: we will also address evaluation issues (chapter 13), which can be either manual or automatic, and the closing chapter will give some details about the commercial situation of the field as well as its main actors worldwide (chapter 14). Although the domain is evolving quickly, including from a commercial point of view, we think it is important to address industrial issues since machine translation is now a key technology for telecommunications. Lastly, we conclude with some observations on the current state of the field (chapter 15) and provide some references for further reading.

## Notes

  1. The Hitchhiker’s Guide to the Galaxy was originally a radio comedy broadcast (1978) before giving birth to different adaptations, including comics, novels, TV series, and plays.
  2. Babelfish is also the name of a machine translation system that was very popular on the web in the late 1990s.
  3. Alan Turing was a British mathematician, logician, and computer scientist. He played a major role in the development of computer science, and his life has recently been popularized in the movie The Imitation Game (2014).