5 The Beginnings of Machine Translation: The First Rule-Based Systems

The postwar period saw the advent of the first computers, and machine translation was immediately considered a key application. Several factors explain this keen interest: first and foremost, a pressing need (i.e., the need to automatically translate texts from foreign sources in the context of the Cold War), and secondly, strong theoretical issues (i.e., the question of how language works). Furthermore, progress in the field of cryptology during the war gave a glimpse of a possible solution: couldn’t a document in a foreign language be considered an encrypted document that needed to be translated into an intelligible language? However, the first practitioners in the field quickly faced the limitations of the first computers. As a result, they developed a pragmatic approach based on bilingual dictionaries and transfer rules, making it possible to change word order according to the specificities of the target language. These systems can include thousands of rules and are thus highly sophisticated, but they are consequently hard to maintain. This approach, known as the rule-based approach, was the dominant one for decades, and it continues to be used today.

The Precursors

The first research attempts in the domain of machine translation were made in the United Kingdom, where Andrew Booth was concerned with data storage, and then in the United States with Warren Weaver, who sketched out a strategy for the domain with his seminal memorandum.

Early Experiments

Toward the end of the 1940s, Andrew Booth, from London University’s Birkbeck College, became specifically interested in language processing by automatic means. His thinking was purely theoretical at the outset, since the first computers were being developed at the same time. The laboratory at Birkbeck College was an important research center on data storage and access. The size of electronic dictionaries was to cause major issues for decades due to the small storage capacity available on early computers. Booth also furthered research concerning machine translation and voice recognition.

In order to limit the number of entries within a dictionary (as in a standard dictionary, for example, where only the infinitive of verbs is recorded rather than all their inflected forms), Booth also took an interest in morphology. His algorithm searched only for sequences of characters: if a word was unknown—that is, if it was not included as such in the dictionary—the system tried to gradually remove letters from the end of the word in order to eventually find a known word form (for example, “run” from “running”). Despite its apparent simplicity, this technique works relatively well for English and continues to be used, particularly by search engines. The technique, called “stemming,” makes it possible to get pseudo-roots for words without having to perform an advanced morphological analysis. Martin Porter popularized this technique in 1980 for search engines, and the technique is thus now known as “the Porter stemming algorithm.”
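As a rough illustration of this kind of suffix stripping (a minimal sketch, not Booth’s or Porter’s actual algorithm), the following Python snippet shortens an unknown word form from the right until it matches an entry in a small dictionary. The dictionary entries are invented for the example.

```python
# A minimal sketch of dictionary lookup with suffix stripping, in the spirit of
# the approach described above (not an actual historical implementation).
# The dictionary entries below are invented for illustration.

DICTIONARY = {"run": "courir", "walk": "marcher", "translate": "traduire"}

def lookup_with_stripping(word):
    """Strip letters from the end of `word` until a known form is found."""
    form = word.lower()
    while form:
        if form in DICTIONARY:
            return form, DICTIONARY[form]
        form = form[:-1]  # drop the final letter and try again
    return None  # no known root could be recovered

print(lookup_with_stripping("running"))  # -> ('run', 'courir')
```

The appeal of the technique is that it requires no linguistic analysis at all, only a dictionary and character-by-character comparison, which is why it remained attractive on machines with very little memory.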

This research was, to some extent, a continuation of Artsrouni’s and Trojanskij’s work on how to store multilingual dictionaries using mechanical means. Booth improved on these early attempts by adding a search algorithm that foreshadowed research on dictionary storage and management. With Richard H. Richens, he also created a word-for-word translation system based on bilingual dictionaries. These propositions were the first step toward a global approach to automatic translation but were quickly recognized as too simplistic, particularly by Weaver.

Weaver’s Memorandum

The father of machine translation—and more generally of natural language processing—is unquestionably Warren Weaver. Along with Claude Shannon, he was the author of a mathematical model of communication in 1949. His proposal was very general and therefore applicable to many contexts.

In Weaver and Shannon’s model, a message is first encoded by a source (which can be a human or a machine), sent, and then decoded by a receiver. For example, a message can be coded in Morse code, transmitted by radio, and then decoded in order to be comprehensible by a human. This model is the foundation of cryptography (encoding, transmission, and then decoding of the message) but can also be applied to communication in general: an idea, in order to be shared, must be “encoded,” that is, “put into words,” and transmitted to a hearer, who must then “decode” the message in order to understand its meaning. The same goes for translation, which can be seen as decoding a given text (the text is considered “encoded” in an unknown language: in order to be comprehensible, it must therefore be translated; in other words, decoded in the target language).

Beginning in 1947, Weaver corresponded with the cyberneticist Norbert Wiener concerning machine translation. He proposed that translation could be considered a “decoding” problem: “One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”1

For Wiener, the automation of translation was not directly possible, because language is made up of a large number of words that are either too vague or too ambiguous (in other words, one cannot translate by assuming simple and direct equivalences at word level).

He wrote: “As to the problem of mechanical translation, I frankly am afraid that the boundaries of words in different languages are too vague … to make any quasi-mechanical translation scheme very hopeful.”2 Wiener’s notion of “boundaries of words” refers to the fact that a word like “avocat” in French has at least two meanings and thus at least two possible translations in English: “avocado” or “lawyer.” This scenario, which is far from exceptional, is in fact omnipresent in language, since the majority of words have several meanings, and since word meanings are carved up differently in each language (moreover, we can observe that whenever the word “avocat” refers to a man who practices law, it translates as “lawyer,” but that, conversely, “lawyer” does not always correspond to an “avocat”: the word can also refer to other kinds of legal professionals!). As a result, determining the meaning of a word and its possible translation in a given language seemed to be an almost insurmountable problem for Wiener, as it involved handling tens of thousands of “word meaning” pairs as well as actively determining the meaning of each word in context, at a time when computers still had very limited computational power and memory capacity.

Despite Wiener’s doubts, Weaver carried on with his idea, and in 1949 he drafted a brief text expressing his thoughts on the subject. He specifically mentioned that words are often ambiguous, that their meaning depends on context, and that word-for-word translation is not a sufficient basis for high-quality results (Weaver was also corresponding with Booth about his research, and as a result became aware of the limitations of word-for-word translation). Weaver’s reservations were not completely ignored but were largely discounted, which would have consequences in the future.

Weaver’s text, entitled “Translation,” is generally considered the starting point of research in this field. The memorandum was very influential, because in it Weaver developed ideas that were highly innovative for the time, but also because he was closely involved with an organization that financed research.3 His influence was as much scientific as it was political.

Weaver proposed four specific principles in order to avoid the basic errors of a word-for-word translation:

  1. Analyzing the context of words should make it possible to determine their precise meaning. The size of the context to be taken into account should vary according to the nature of the word (Weaver claimed that only a few nouns, verbs, and adjectives need to be disambiguated), but also possibly according to the topic and the genre of the text to be translated, if these elements are known.
  2. It should be possible to determine a set of logical and recursive rules to solve the problem of machine translation, he wrote, “insofar as written language is an expression of logical character.” According to Weaver, this excludes “alogical elements in language” such as “intuitive sense of style, emotional content, etc.,” but machine translation can nevertheless be considered for the most part as a logical deduction problem.
  3. Shannon’s model of communication could probably provide useful methods for machine translation, since it had already proven useful “for solving almost any cryptographic problem.” In Weaver’s words: “It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the Chinese code.”
  4. Languages can be described with universal elements that may help facilitate the translation process. Rather than directly translating from Chinese to Arabic or from Russian to Portuguese, it is probably best to search for a more universal and abstract representation that avoids any errors due to a verbatim rendering or to ambiguity.4

Each of these points deserves a closer look, as Weaver’s suggestions are for the most part still being explored today. The first principle highlights the fact that most ambiguities can be solved by looking at the near context, which is the approach still used today. This is not enough to solve all kinds of ambiguities, but it is enough to solve most of them. However, the memorandum underestimated the problem of ambiguity. Weaver wrote: “Ambiguity, moreover, attaches primarily to nouns, verbs, and adjectives; and actually (at least so I suppose) to relatively few nouns, verbs, and adjectives.” We now know that ambiguity is the most pervasive problem in natural language processing and applies to nearly all kinds of words, which makes ambiguity a much bigger problem than initially thought.

The second principle was based on work done in logic and had a profound influence on the concept of formal grammar, which is used for analyzing artificial languages (particularly programming languages) as well as natural languages.

The third principle focuses on the comparison with cryptography, which at the time was a very popular research area due to the war. It highlights the statistical nature of language and the fact that computers could help solve difficult problems, especially in semantics. The following decades saw the development of logical approaches in language processing, and statistics were generally assumed to be too crude or even useless for the problem. The revival of statistical approaches in natural language processing in the 1990s showed how right Weaver was, but this kind of approach requires large amounts of data, which explains why this proposal did not become popular before then.

Finally, the last principle inspired numerous research projects aiming at developing interlingual representations, addressing the semantic content of sentences, and disregarding the particularities of each language.

Weaver mentioned several times in the memorandum that his point of view reflected his personal thoughts, which were not those of a linguist (“I have worried a good deal about the probable naïveté of the ideas here presented”). For him they were food for thought, most likely naïve, to be reviewed by experts on the subject. Yet the memorandum was in fact very far-sighted, and that is what has ensured its remarkable posterity. It highlighted ideas that were explored in the decades to come by symbolic approaches (i.e., the need for accurate semantic representations or for formal rules) as well as statistical ones (i.e., the fact that statistics are more powerful than symbolic rules for resolving ambiguities).

The implementation of the proposed techniques, however, required efforts that went beyond anything the pioneers of machine translation had ever imagined. In particular, the inherent ambiguity of natural languages showed that traditional encryption models were not sufficient to render the complexity of automatic translation.

The Real Beginnings of Machine Translation (1950–1960)

Weaver’s memorandum and the perspectives it opened, as well as the proximity of the author to funding agencies, were the driving forces in the rapid development of research within this domain.

The Early Days

In the early 1950s, several researchers started to become interested in machine translation, which seemed to be both a useful and logical application at the time. As already mentioned, two elements in particular played a determining role: (i) the work done on cryptography seemed then, following Weaver’s ideas, to form a solid foundation for machine translation seen as a coding and decoding problem; (ii) the context of the Cold War also contributed to emphasizing the need for translation, especially from Russian into English in the Western world (and vice versa in the Soviet world). It was in this context that an Israeli researcher, Joshua Bar-Hillel, played a leading role in the development of machine translation in the United States during the 1950s. Bar-Hillel spent two years at MIT in 1951–1953, working as a postdoctoral fellow under Rudolf Carnap. Bar-Hillel had actually first corresponded with Carnap while working on his thesis in Israel back in the 1940s. Carnap, the German philosopher who later became a naturalized American, had developed a “logical syntax of language,” which seemed to pave the way toward a logical formalization of natural languages.

Bar-Hillel then naturally became interested in machine translation. He quickly became a major figure in the field and benefited from grants that allowed him to visit the main laboratories in the United States (research teams were being formed at the time and were relatively scattered among various American universities). Upon his return to MIT, Bar-Hillel drafted a document pointing out the promise of the field but also highlighting the difficulty of the task (this document in some ways echoed the exchange between Wiener and Weaver that had occurred only a few years earlier). Immediately afterwards, he organized the field’s first conference at MIT in June 1952.

The majority of researchers active in the field attended the conference at MIT. The attendees were clearly excited and emphasized the need to attract large amounts of funding, given that machine translation required human resources and, above all, access to computers, which were extremely expensive at the time. In order to promote machine translation, the representative from Georgetown University (a major research center and pioneer in the field) suggested that a demonstration be organized as soon as possible in order to show the feasibility of the project and attract funding.

In 1954, the research team at Georgetown University, along with IBM, led the first demonstration in support of machine translation, based on a system developed jointly by the two teams. A set of 49 Russian sentences was translated into English using a very limited dictionary (only 250 words) and six grammar rules. The impact of the demonstration was considerable and contributed to the increase in financial support for machine translation. There was also extensive media coverage of the event, which helped attract public attention.

American funding agencies gradually began to support a number of groups working on machine translation, primarily in the United States and the United Kingdom. The 1954 demonstration also grabbed the attention of the U.S.S.R. and several Soviet research teams, who became involved in the field from 1955 on. The field of machine translation was institutionalized with regular conferences and a specialized journal, Mechanical Translation, first issued in 1954.

The Development of the First Rule-Based Systems: Turmoil and Enthusiasm

The majority of research teams at the time had very limited access to computers, which were not widespread, especially in the U.S.S.R. In fact, most of the work remained theoretical and offered approaches that could “mechanize” the translation process, without ever being put into practice.

Schematically, it can be said that two lines of research were pursued. The first, “pragmatic” route aimed at quickly producing results, even if those results were not perfect. The systems were essentially based upon a direct translation approach: bilingual dictionaries first provided a verbatim translation, and then reordering rules were applied to accommodate the word order of the target language. In other words, a dictionary was first used to find equivalences between words, and then basic reordering rules were used to handle certain phenomena, such as noun-adjective phrases in French that must be translated as adjective-noun phrases in English (“voiture rouge” → “red car”), as sketched in the example below.
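The following Python sketch is a deliberately naive illustration of this direct approach, not a reconstruction of any particular 1950s system: a toy bilingual lexicon provides word-for-word equivalents, and a single reordering rule swaps French noun-adjective pairs. The lexicon and part-of-speech tags are invented for the example.

```python
# A minimal, purely illustrative sketch of the "direct" approach:
# word-for-word dictionary lookup followed by a local reordering rule.

LEXICON = {
    "la": ("the", "DET"),
    "voiture": ("car", "NOUN"),
    "rouge": ("red", "ADJ"),
}

def translate_direct(french_sentence):
    # Step 1: word-for-word lookup (unknown words are copied as-is).
    tagged = [LEXICON.get(w, (w, "UNK")) for w in french_sentence.lower().split()]
    # Step 2: reordering rule -- French NOUN + ADJ becomes English ADJ + NOUN.
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "NOUN" and tagged[i + 1][1] == "ADJ":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in tagged)

print(translate_direct("la voiture rouge"))  # -> "the red car"
```

Even this toy example makes the weakness of the approach visible: everything hinges on the dictionary choosing the right equivalent for each word, with no analysis of meaning or sentence structure.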

At the same time, those in favor of a more theoretical approach highlighted the limits of the direct approach. Numerous proposals were then made to promote an analysis of the source text before the translation process, and to develop transfer rules operating at the syntactic or semantic level (and not just at word level). In this regard, it is relevant that the notion of formal grammar became prominent in the 1950s, mainly with work by Noam Chomsky. Certain research centers were also interested in the idea of a pivot language (i.e., an approach in which a particular language is used as a kind of intermediate representation between the source language and a target language), or even in the idea of an interlingua (i.e., an artificial language that offers an abstract representation of the sentences to be translated). In both cases, the approach consisted in encoding all the necessary information needed for translation in a specific representation model. The interlingua is thus an artificial language that has nothing to do with any existing language, whereas a pivot language uses an existing language (generally English) for this representation.

Several research groups (in Washington and at Harvard and the RAND Corporation, for example) made every effort to develop large bilingual dictionaries (Russian-English), either manually or with the help of a statistical analysis of specific corpora, which helped ensure that the most frequent or the most important words would be processed first. Polysemy—the fact that a single word can have several meanings, such as “bank,” which can refer to a financial institution or the side of a river5—was seen from the beginning as one of the major problems to solve. The simplest approach was to include only the most expected meaning of each word in the dictionary. While doing so removes the ambiguity, it is clearly too crude: the results of the direct approach (with no semantic disambiguation process) are therefore unsatisfactory. The translation fragments produced, even if highly imperfect, can nonetheless be useful if the reader has no knowledge of the source language, or can serve as a basic “translation memory” by providing regular equivalences between languages.6

In order to solve the ambiguity issue, many research teams gradually enriched the content of their electronic dictionaries. For example, the University of Washington added contextual information to words so that ambiguities could be resolved without a full syntactic analysis. Vocabulary was also partitioned by domain (the idea being that “bank” will probably not have the same meaning in a financial corpus as in an environmental corpus, or at least that significant statistical differences will help in the disambiguation task), and multiword expressions were gradually added (to avoid some sources of ambiguity affecting single words). The approach may sometimes seem ad hoc, but it should be noted that current techniques still bear significant similarities to the strategies identified in the 1950s: a local analysis is often enough to determine the category of a word, and even its meaning. Storing multiword expressions and taking the domain into account does indeed help to drastically reduce the ambiguity problem: “one sense per discourse” even became a popular slogan in the field during the 1990s.
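As a rough, modern-flavored illustration of this kind of local, domain-based disambiguation (a minimal sketch, not a description of the systems of the period), the following Python snippet picks the sense of an ambiguous word by counting overlaps between invented cue-word lists and the surrounding context.

```python
# A minimal sketch of local, cue-based sense selection. The senses and cue
# words below are invented for illustration.

SENSE_CUES = {
    "bank": {
        "financial institution": {"money", "loan", "deposit", "interest"},
        "river bank": {"river", "water", "shore", "fishing"},
    }
}

def choose_sense(word, context_words):
    """Pick the sense whose cue words overlap most with the surrounding context."""
    senses = SENSE_CUES.get(word)
    if not senses:
        return None
    context = {w.lower() for w in context_words}
    return max(senses, key=lambda sense: len(senses[sense] & context))

print(choose_sense("bank", "he sat by the river near the bank fishing".split()))
# -> "river bank"
```

The same intuition, that nearby words and the general domain of a text carry most of the disambiguating signal, still underlies many practical systems.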

At the same time, as already seen, more fundamental work on parsing—that is, the automatic syntactic analysis of sentences—began to appear. Chomsky independently developed his own work on syntax but did not have any significant influence on machine translation until the 1960s. However, the need for a formal analysis of source languages gradually became a mainstream idea. Several research groups actively developed this strategy for machine translation in the late 1950s. One must keep in mind that, at the time, the formal analysis of languages targeted both programming (i.e., artificial) and natural languages. It was not yet self-evident that natural language processing ultimately has very little to do with programming languages: the ambiguity inherent in natural languages meant that very specific strategies had to be designed for this field. The first specific formalisms were then defined: they were “stratified” (using an expression coined by Sydney Lamb, who defined a “stratificational grammar”), ranging from low-level information (word categories, morphosyntactic features) to high-level information (the meaning of words and their possible context). These pioneering studies were quite valuable and instructive. A number of research groups recognized the great difficulty of the task, especially the “semantic barrier” that Wiener, and particularly Bar-Hillel, had anticipated since the beginning of the 1950s (see above).

Beyond the United States

Before continuing to the next period, we must point out the research done outside the United States. Since the mid-1950s, Cambridge University, through the Cambridge Language Research Unit, had benefited from American subsidies and developed one of the first interlingual systems, called NUDE. According to its designer, Richard Richens, NUDE was a “notational interlingua ... constructed so as to represent the ideas of any base [source language] passage divested of all lexical and syntactical peculiarities; for which reason it [was] called Nude” (Richens, 1956, cited in Sparck Jones, 2000). The NUDE interlingua aimed to define each word by means of a set of universal primitives (core meanings that can be assembled to represent the meaning of complex ideas expressed in various ways, depending on the natural language in question). The implementation of this approach remained limited and seems to have suffered from a poor link to syntax (so that it was not clear how a NUDE representation could be derived from an actual text). Nonetheless, this proposal remained important since it opened a new strand of research and popularized the idea of universal semantic primitives that can be found in numerous linguistic theories all over the world. More generally, the Cambridge group prioritized the development of semantic resources (word lattices) and techniques that were partially rediscovered years later for semantic disambiguation (i.e., choosing the meaning of ambiguous words according to the context). Of course, it did not develop definite answers to questions that are still largely debated today, but it was a pioneering and influential group in the study of semantics during a time when attention was primarily focused on syntax.

Other research groups in the field of machine translation appeared toward the end of the 1950s, for example in 1956 in Japan and 1957 in China. In France, the interest was clear from the late 1950s on, and two centers were then created by the French National Center for Scientific Research (CNRS), in Paris and Grenoble. The interest in machine translation coincided with the arrival of the first computers in French university centers and therefore marked the real beginning of computer science in the country. The two centers were called Centre d’Études sur la Traduction Automatique, or CETA: CETAP was located in Paris, and CETAG in Grenoble. The Parisian center encountered financial problems from very early on and had to bear the consequences of the criticism of machine translation that was emerging in the United States. In fact, the center closed a few years later and some researchers, such as Maurice Gross, turned to computational linguistics, stressing the need to first develop rich linguistic resources offering a broad and systematic description of language. The Grenoble center has survived to the present day and developed an original interlingual approach. Bernard Vauquois, who led the CETAG center in Grenoble and proposed several influential ideas, remained one of the major figures in the field until his death in 1985, even though machine translation was by then no longer as popular elsewhere in the world.

Finally, a few words must be said about the research being carried out during the same period in the USSR. The Georgetown-IBM demonstration in 1954 made a strong impression in the Soviet world, which immediately decided to launch research in this domain. Several groups rapidly began working on problems in machine translation, primarily in Moscow but also in Leningrad and in other “sister countries.” The first congress on machine translation organized in Moscow in 1958 was attended by about 340 participants from 79 different institutions. The approaches were as diverse as in the United States, but the majority of the research remained theoretical due to the unavailability of computers. The few groups lucky enough to have access to computers essentially developed empirical and direct approaches based on bilingual dictionaries. At the same time, many theoretical studies specified strategies for an automatic syntactic analysis, but also for the coding of semantic information. Linguistic theories dating from this period still have a large audience to this day.

The work of Igor Mel’čuk and Yuri Apresjan, in particular, is well known today, including outside of the former Soviet world, especially because Mel’čuk settled in Canada in the late 1970s.

A Period of Disenchantment (1960–1964)

The end of the 1950s saw the first doubts expressed about the feasibility and even possibility of obtaining correct translations as the outcome of an automated process.

Bar-Hillel’s Criticism

Bar-Hillel, who had returned to Israel at the end of his postdoctoral position in 1953, had the opportunity to return to the United States a few years later for a new research residency (1958–1960). In September 1958, during his trip to the United States, he presented a paper at the University of Namur entitled “Some linguistic obstacles to machine translation.” In this text, Bar-Hillel lists some linguistic issues that he considered to be fundamental problems for machine translation, since no system was then able to solve them. In his opinion, the models of the time were too simple and needed to be replaced by models that would better account for the structure of the sentences to be analyzed.7 Furthermore, according to Bar-Hillel, the transfer rules required to translate between genetically distant languages had to be complex and required formalisms yet to be invented. After his conference in Namur, Bar-Hillel continued his trip to the United States to assess the research being conducted in the field.

There, he drafted the famous technical report entitled “Report on the State of Machine Translation in the United States and Great Britain” (February 1959) on behalf of the U.S. Office of Naval Research. The report delivered an extremely negative assessment of the ongoing work, without taking into account the very short history of the field (most of the groups had existed for only a few years). All the research groups were listed by name and severely criticized.

In practical terms, Bar-Hillel noted that, on the one hand, translation requires a complete syntactic analysis of the text, which was not yet obvious to all the groups involved in the field at the time. On the other hand, translation requires resolving semantic ambiguities, which was beyond the state of the art at the time and did not seem solvable in the medium term. An appendix of the report had an evocative title (“A demonstration of the non-feasibility of fully automatic, high-quality translation,” see Bar-Hillel, 1958 and 1959) and was intended to show that the meaning of some ambiguous words cannot be determined even when the context is taken into account, which suffices to invalidate the goal of high-quality machine translation. Bar-Hillel used the following well-known example: “Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.”

In order to understand the passage, one must realize that the word “pen” here refers to a small enclosure in which a child plays, and by no means to a writing utensil. Yet nothing in the immediate context enables a system to infer this meaning for “pen,” which is much less common than the writing-implement sense. According to Bar-Hillel, such an example demonstrated the impossibility for any system to solve this kind of problem, which he believed arises quite frequently. It was therefore impossible to envision a completely automatic, high-quality translation in the short or medium term (FAHQT, or fully automated high-quality translation; also found as FAHQMT, or fully automated high-quality machine translation).

Instead of automatic translation, Bar-Hillel recommended that researchers turn toward computer-assisted translation systems, which constitute a relatively different project, clearly less exciting scientifically speaking than the idea of an entirely automatic system. Bar-Hillel called for the development of translation aids, which would significantly improve the productivity of translators by proposing suitable and efficient tools, specifically for the pre- and post-editing stages (preparing the text for translation; correcting translation errors). Since the goal is then to help translators, system outputs must be relatively different from those of traditional machine translation systems: for example, it is generally better to present the translator with translation suggestions rather than directly produce a text, which would be difficult to correct.

Discussion

As we have seen, the 1950s, which had gotten off to a flying start, subsequently ended on the first doubts regarding the feasibility of machine translation. Bar-Hillel’s report focused on real problems that had been underestimated up until that point. The approaches considered initially failed largely due to oversimplification: the hopes of advancing rapidly were too optimistic, and initial results proved disappointing. The 1954 demonstration was based on sentences that were prepared in advance, with a familiar vocabulary and limited ambiguity, which clearly had little to do with the reality of the task, namely translating previously unseen texts from any domain. Similarly, most research groups in the 1950s did not realize the need for a syntactic or semantic analysis, and therefore did not evaluate the difficulty of the task properly. Finally, the idea of a translation aid was more realistic if the goal was to provide a quick operational response, but this had little to do with the advancement of machine translation.

The research in the 1950s nonetheless established the field of machine translation. The setbacks, or at least the limitations, of these early systems revealed the complexity of natural language processing. In some ways, they sparked off numerous research projects pursued in the following decades. Although machine translation was too ambitious at the time, the research was not useless. We must also keep in mind the relative lack of computers and their limited capabilities—in the era of punch cards—which drastically restricted possibilities for experimentation.

However, Bar-Hillel’s report raised doubts not only for those funding the research, but also for researchers themselves. Several leading figures left the field at the beginning of the 1960s and moved to research in linguistics, computer science, or information theory. Certain researchers were even more negative than Bar-Hillel himself on machine translation.

In addition, later demonstrations gave a glimpse of the numerous problems the first projects had underestimated: Georgetown’s and IBM’s attempts to industrialize practical solutions yielded very poor results.

All of this brought funding agencies in 1964 to ask an independent committee for an evaluation report. Their request led to the famous ALPAC report, published in 1966.

Notes

  1. Weaver, letter to Wiener, March 4, 1947.
  2. Wiener’s response to Weaver’s letter, April 30, 1947.
  3. Weaver worked at the Rockefeller Foundation, where he was responsible for launching new research projects.
  4. “Thus may it be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication—the real but as yet undiscovered universal language—and then re-emerge by whatever particular route is convenient” (Weaver, “Translation,” 1955, 23).
  5. The fact that most words can belong to several categories, such as the word “bank,” which can be a noun or a verb, also poses an important problem for automatic systems. The correct analysis of a given sentence requires at the very least a proper recognition of the main verb, since it is the verb that structures the whole sentence. But even this is not a trivial task for a computer!
  6. A translation memory is a database that contains previously translated fragments of texts in order to help professional translators quickly find equivalences, while ensuring more regular and consistent translations.
  7. “The model … was … too crude and has to be replaced by a much more complex but also much better fitting model of linguistic