chapter 14 The Machine Translation Industry: Between Professional and Mass-Market Applications
Machine translation is a popular application because it answers a very direct and simple need. Everybody can clearly see the value of a system capable of automatically translating texts from a source language into a target language. It is now possible to access foreign newspapers online without mastering a foreign language, and even to exchange messages on social networks across language barriers. All this also means that significant economic interests are at stake.
A Major Market, Difficult to Assess
The needs and budgets related to translation (by humans or by machines) are largely unknown and almost impossible to estimate. Companies and public administrations very rarely disclose their translation budgets. Furthermore, the market is extremely fragmented, since translators are often self-employed. Estimates on the order of several billion dollars ($14 to $100 billion) are mentioned here and there, but they are rather unreliable, as reflected in the considerable variations in published figures.
A Quick Overview of the Market
The European Commission is often cited for its translation budget.
This budget is indeed quite large, as some documents must be available in more than 20 languages. According to the Directorate General for Translation,1 the European Commission’s internal translation budget was around 330 million euros in 2013. The number of pages translated has increased to more than two million per year, and more than 93% of these translations are done entirely manually.
In fact, according to the same source, less than 5% of translations benefit from automated translation aids (via the web or internal tools).2 All of the texts translated for the European Commission are of a technical nature, but the variety of genres and topics addressed is very broad, even if legislative texts predominate. Among the translated documents are technical reports (“white papers”) on various topics, correspondence with member states, and websites. In a context such as the European Commission, it is easy to imagine that machine translation could provide valuable services. This is certainly the case when one has to translate specialized and recurring documents, for which the current state of the art could give reasonable results. The European Commission has indeed been a longtime investor in machine translation, particularly with Systran, as we have already seen (see below in this chapter for a more detailed presentation of commercial systems, as well as the history of Systran).
More recently, the Commission has funded the production of free software and resources, which has allowed for significant advances in the field. Worthy of mention are the parallel corpora Europarl and JRC Acquis (see chapter 7), as well as the development of the software platform Moses3 thanks to several European funded projects.
Beyond the case of the European Commission, there are a multitude of industrial contexts where technical texts must be translated and updated regularly. Environment Canada’s weather forecasts are a good example: several versions of the forecasts have to be produced daily in both French and English. The production of these bilingual forecasts has been automated since the 1970s with some success (see chapter 6). It is somewhat surprising that this application remained the field’s flagship for so long; no comparably iconic system has emerged in other application domains.
The production of multilingual leaflets and manuals, as well as localization (i.e., the adaptation of a piece of software for various countries), are important markets for machine translation. Everyone has at some point struggled to understand a leaflet so poorly written that it was without a doubt the result of an automatic translation. Producing a document (a leaflet, a manual, etc.) in several languages and keeping it up to date comes at a high price, especially for manufacturers selling cheap products with a low budget for translation. In this context, machine translation is seen as an attractive technology, one that produces texts that can then be reviewed by a human translator. Of course, when the process is entirely automatic, with nobody involved to check the results, translations are often of very poor quality.
Another important market for machine translation is access to international patents, which requires specific resources. Patents can be written in a wide variety of languages. Manufacturers launching a new product on the market must ensure that it does not violate a patent in one of the countries concerned. There is thus a specific need to break the language barrier, since a patent in a specific language can be the source of major problems with high financial impact. Another related issue is that patents are written in a very specific jargon. Systems must thus be tuned in order to be able to deal efficiently with the specific terms and phraseology of the domain. What is already a great challenge when dealing with one language only becomes even more crucial and difficult in the context of multilingual systems.
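The tuning mentioned above can be as simple as a terminology pass over the system’s output. The sketch below is a generic illustration, not any vendor’s actual method; the glossary entries, the function, and the example rendering are all invented. It shows how an approved patent glossary can override a generic system’s literal renderings:

```python
# Illustrative sketch (assumed design, not a real product's API):
# enforcing a domain glossary on machine translation output. Patent
# jargon such as "claims" has a fixed legal equivalent that a generic
# system may miss, so a lightweight tuning step is to post-edit the
# output with a curated terminology table.

# Hypothetical English -> French patent glossary.
PATENT_GLOSSARY = {
    "prior art": "état de la technique",
    "claims": "revendications",
    "embodiment": "mode de réalisation",
}

def enforce_glossary(source: str, raw_translation: str,
                     literal_guesses: dict[str, str]) -> str:
    """Replace a generic system's literal guesses with approved terms
    whenever the corresponding source term appears in the input."""
    out = raw_translation
    for term, approved in PATENT_GLOSSARY.items():
        if term in source.lower() and term in literal_guesses:
            out = out.replace(literal_guesses[term], approved)
    return out

# Example: a generic system rendered "claims" literally.
fixed = enforce_glossary(
    "The claims define the scope of protection.",
    "Les réclamations définissent la portée de la protection.",
    {"claims": "réclamations"},
)
print(fixed)  # -> Les revendications définissent la portée de la protection.
```

Real systems use more sophisticated mechanisms (constrained decoding, inflection handling), but the principle of privileging a curated terminology over the system’s default choices is the same.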
Indeed, this field has attracted a great deal of interest, since the commercial and financial stakes are high. Large companies are also working on the topic: for example, the European Patent Office is working with Google to propose a machine translation system adapted to the domain. The World Intellectual Property Organization has developed its own translation system based on neural networks in order to translate Chinese, Japanese, and Korean patent documents into English. Most vendors in the field of machine translation offer commercial solutions for the patent domain.
Finally, government intelligence services must be mentioned, as nowadays they are among the largest consumers of machine translation products. This market is little known and even more difficult to assess because, by definition, intelligence services communicate sparingly about their activities. Machine translation has of course to do with the interception of communications. Intelligence agencies cannot analyze all intercepted messages, nor have specialists available for all relevant languages: it is thus crucial to be able to automatically identify the language used in these messages and translate some of them automatically, at least superficially. It is easy to understand why machine translation is a very useful technology for the field, whether applied to written documents or spoken transcripts. Translation needs often concern languages of so-called sensitive countries and often fluctuate according to international affairs. Bearing in mind national interests, the majority of Western countries have developed more or less discrete partnerships with machine translation companies. The ability to produce efficient systems for new languages in a very short time in order to counter new threats quickly and efficiently is one of the key challenges for the domain.
Clearly, the machine translation market is very fragmented, ranging from modules that are freely available on the web (Bing Translator, Google Translate, Systranet, etc.) to purely commercial tools. Additionally, commercial tools are frequently sold in several versions: it is, for example, common to find a free version of a given piece of software on the web along with a professional version sold through more traditional sales channels, most of the time with related services (especially for the development of specialized dictionaries, terminologies, and phraseologies). It should be noted that most companies do not make much money directly from software sales but get most of their revenue through advertisement or services. The services sold around machine translation mean there is still some convergence between technology and professional translators, who are still needed even in this automation context.
Recently the market has also diversified with the growth of speech translation, a field that is still emerging but very promising in terms of concrete applications, particularly on mobile devices. Lastly, we will see that machine translation also provides useful tools for professional translators, even if most automatic tools are not developed directly with this in mind.
Free Online Software
Since the 1990s, several free machine translation systems have emerged on the Internet. One of the first systems, Babelfish, appeared at the end of the 1990s provided by the search engine AltaVista (the most popular search engine at the time). Babelfish was in fact the result of an agreement between AltaVista and Systran, the technology provider for Babelfish. Babelfish was later sold to Yahoo in 2003 and eventually replaced in 2012 by Bing Translator, a product developed and owned by Microsoft.
Today the most well-known free translation service on the Internet is without a doubt Google Translate. Google has conducted research on machine translation since the beginning of the 2000s in order to develop its own solution. The online translator proposed by Google was initially Systran-based, but Systran was gradually phased out as Google deployed its own technology: first for Russian, Chinese, and Arabic in 2005, then an online system capable of translating between 25 language pairs in October 2007. Google’s system now handles more than 100 languages, with highly variable quality depending on the language pair considered.
Google Translate is based on a statistical approach, following the model originally developed by IBM. However, it is clear that these models have since evolved tremendously, even if we don’t know the details of the algorithms used, which remain secret. One of the major strengths of Google is that it can rely on its search engine and on its incredible computing power to make the best of the bilingual corpora available on the web. Google’s translation system also integrates terminologies and semantic resources when available and has recently begun to deploy a new generation of systems based on the deep learning approach.
Beyond Google Translate, there are plenty of other free automatic translation software packages available on the Internet. As mentioned above, Microsoft’s Bing Translator has been adopted by Yahoo to replace its earlier system Babelfish. Meanwhile, Systran offers its own online service called Systranet, and Promt, Systran’s main competitor, also offers free translation services online. A multitude of other systems are available directly on the Internet, some specialized in less common languages. A 2010 document compiled by John Hutchins for the European Association for Machine Translation, a compendium of translation software,4 lists dozens of products available on the Internet. New software and websites appear each week.
Some websites or mobile applications, particularly social networks, also integrate machine translation services to give their users access to content in foreign languages. Facebook and Twitter use Bing Translator to allow end users to access content in foreign languages; recently, Facebook began developing its own in-house technology. Other social networks also integrate machine translation technology. Users can sometimes be unaware that they are reading a machine translation, when this has been displayed automatically, without their intervention (this generally depends on the settings of the social network).
Companies propose these online services for different purposes and get different kinds of revenue from them. For Google and Microsoft, machine translation is considered a key technology in an ecosystem of services aiming to offer better access to information. Machine translation is thus a key component, beyond direct return on investment. Google’s main revenue is from advertising, whereas Microsoft receives most of its revenue from software sales (while seeking at the same time to diversify revenue to advertising). For companies such as Systran or Promt, Internet presence is essential, first and foremost to ensure exposure relative to competing products.
Advertising and online product sales (including the integration of translation services into other websites, generally generating a revenue proportional to the number of translations per month) is another source of income for software companies.
One can also note that these tools are no longer just standalone online applications. It is now often possible to correct the translations obtained directly online. The system can in turn use these manual corrections to identify some errors and correct itself dynamically, and at no cost, simply by integrating the user’s proposed corrections. User feedback is still marginal, but the more a tool has an active community of users, the more it will benefit from this type of feedback. This source of information could prove valuable in the future, especially if automatic approaches reach a plateau (i.e., if improvements slow down after initial rapid progress). In this context, the main source of progress will probably consist in integrating local improvements proposed by users themselves. However, it is generally very difficult and very costly to have access to a community of users, since software customers tend to give little feedback. From this point of view, an online translation service with a large audience is an extremely valuable product.
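The feedback loop just described can be sketched in a few lines. This is an assumed design for illustration, not the mechanism of any particular service: user corrections are stored and simply take precedence over the engine’s output when the same source sentence recurs.

```python
# Minimal sketch of a correction-aware translation front end.
# All names are invented for illustration.

class CorrectingTranslator:
    def __init__(self, engine):
        self.engine = engine      # any callable: source text -> translation
        self.corrections = {}     # source sentence -> user-approved fix

    def translate(self, source: str) -> str:
        # A stored user correction overrides the automatic output.
        if source in self.corrections:
            return self.corrections[source]
        return self.engine(source)

    def record_correction(self, source: str, fixed: str) -> None:
        """Integrate an end user's proposed correction at no cost."""
        self.corrections[source] = fixed

# Toy engine standing in for a real MT system.
mt = CorrectingTranslator(lambda s: "machine output for: " + s)
print(mt.translate("Hello"))              # machine output for: Hello
mt.record_correction("Hello", "Bonjour")
print(mt.translate("Hello"))              # Bonjour
```

Production systems generalize corrections beyond exact matches (for example, by retraining on them), but the exact-match cache already shows why a large, active user community is such a valuable asset.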
Finally, online products in no way guarantee the confidentiality of translated data, which are, on the contrary, generally saved and stored by machine translation systems. Most systems keep track of the texts proposed by end users as well as of the proposed translation and use this as a translation memory in which past translations can be found. Thus, companies that need to translate confidential data should by no means use these free products, but should preferably resort to commercial products.
Commercial Products
Along with free products, a multitude of commercial products coexist to respond to various needs and to the different languages represented on the Internet.
Several companies, like Systran and Promt, market solutions for machine translation. Beyond these two companies, many other software companies propose “off-the-shelf” machine translation solutions, sometimes for only a few dollars. These systems are hard to adapt, and their quality is generally quite questionable. This kind of product is now rather marginal and will probably become even more marginal in the future due to the availability of free translation tools online.
A larger market involves the sale of machine translation solutions that can be integrated into websites. We have already seen that Facebook and Twitter first resorted to Bing for translating messages exchanged online, and Facebook is now developing its own “in-house” solution. Almost all large software integrators propose a machine translation solution that can be integrated into a website. IBM, for example, has developed its own product that is sold as a module in the IBM WebSphere platform. Oracle relies on an agreement with Systran. As we have already seen, the European Patent Office turned to Google and signed agreements with other patent offices in order to improve their machine translation technology (for Chinese, in particular).
Beyond these large and well-known worldwide companies, several other companies propose more focused commercial products for different language pairs. Some regional markets are dominated by local companies, such as Promt in Russia or CSLi in Asia. One can also find companies specialized in specific rare languages or more regional areas. The quality of these systems is highly variable. Moreover, as we saw in the previous chapter, performance is highly dependent on the existence of bilingual parallel corpora and lexicons.
As already noted, online sales are generally limited, even if advertising can be a valuable source of additional income. Most of the income for traditional software companies in this domain comes from big companies and large administrations. In this regard, the defense sector is extremely important, especially with the generalization of the interception of communications (via telephones or the Internet). In an interview in a French magazine,5 the former CEO of Systran, Dimitris Sabatakakis, once said that Systran would not exist without the American intelligence agencies. Indeed, Systran’s first revenues came from an initial contract with the US Army in the 1970s. Systran still benefits today from large contracts with various American defense agencies, as we will now see.
The Case of Systran
The oldest and most well-known company in machine translation is without a doubt Systran (whose name comes from the abbreviation “system translation”). Peter Toma, a researcher who had previously worked at Georgetown University during the early 1960s, founded Systran in 1968. The company initially had American defense organizations (like the US Air Force) as its main customers and was naturally interested in the Russian-English language pair.
The company is also known for having worked with the European Community for several years. A demonstration was first carried out in 1975 at the request of the European Commission. This led to a request for a demonstrator that was subsequently installed in Brussels in 1981. The number of languages covered gradually increased, and this contract ensured regular revenue for Systran. Relations with the European Commission deteriorated when the Commission, wishing to part company with the vendor, launched a call for bids in 2003 in order to improve the translation system and its dictionaries. Systran filed a lawsuit for copyright violation (on software and related information) and disclosure of confidential data to third parties. Systran finally won its case against the European Commission in 2010.
This lawsuit is not anecdotal. It shows that the quality of a translation system is fundamentally related to the resources it uses, especially for a system relying mostly on dictionaries and rules developed by linguists, as was the case for Systran. In this field, it is crucial to be able to respond quickly to new needs, which means being able to cover new languages and new specialized fields without necessarily having very large corpora available. Indeed, specialized companies like Systran and Promt still primarily offer systems that rely on dictionaries and transfer rules (this was especially the case in the 1980s and 1990s for Systran, before the revolution of the statistical approach in the domain). After the success of Google Translate, Systran developed a “hybrid” approach by adding statistical information to the system, but the basis of the translation model remained relatively traditional, and Systran is now focusing on deep learning, like all the major players in the field. The advantage of keeping a relatively traditional approach is that, even without a training corpus, bilingual dictionaries can be developed, as well as transfer rules from one language to another. Depending on the language pair, it may even be possible to recycle some data for a given language, which is a considerable advantage.
This brings us to the defense market. The CEO of Systran, in the interview previously cited, revealed that a quarter of Systran’s revenue during the year 2000 came from US defense industries. The French and Korean markets were also fundamental for the company. We can thus estimate that, in 2000, more than half of the company’s revenue related to the defense and intelligence markets (since the US defense industries were not the only defense and intelligence markets where Systran was active). In this context, it is often difficult to have access to training data, as corpora in this domain are highly confidential. Moreover, the world of military and intelligence services wants to be able to adapt a system itself without disclosing data. It is therefore still relevant to work with dictionaries and rules, since it is easy to add new words to an existing dictionary for example, whereas retraining a statistical system is complex and requires large amounts of bilingual data that may not be available. This largely explains why many commercial systems are still based on a traditional approach, using manually developed bilingual dictionaries, even though statistical approaches now dominate the research landscape.
A Worldwide Market
The importance of this strategic market has driven large companies in the telecommunications field to strengthen their teams in speech analysis and machine translation. Several buyouts have taken place recently: Systran was bought in 2014 by a Korean company, CSLi, which developed the voice analysis and translation systems used by Samsung’s connected devices (cell phones, tablets, and other gadgets). Facebook has bought several companies specialized in machine translation (such as Jibbigo in 2013, for voice messages in particular). Apple and Google also regularly buy startups in the communication and information technology domains. Most importantly, all these large companies are hiring engineers and researchers (mainly in machine learning and artificial intelligence) in order to produce their own machine translation solutions. They are also opening new research centers worldwide in order to attract the best talent.
New Applications of Machine Translation
The machine translation market is growing fast. Over the last few years we have witnessed the emergence of new applications, particularly on mobile devices. Speech translation has become a hot topic: “speech-to-speech” applications aim at making it possible to speak in one’s own language with an interlocutor speaking a foreign language, using live automated translation.
Cross-Language Information Retrieval
Cross-language information retrieval aims to give access to documents initially written in different languages. Consider research on patents: when a company seeks to know if an idea or a process has already been patented, it must ensure that its research is exhaustive and covers all parts of the world. It is therefore fundamental to cross the language barrier, for both the query (i.e., the information need expressed through keywords) and the analysis of the responses (i.e., documents relevant to the information need).
A cross-language system is a system that manages multilingualism: it accepts queries in a given language and identifies relevant documents in languages other than the query language. A machine translation system can then propose a translation of the identified documents into the user’s language. This field is the topic of much research at the moment, combining search engines with machine translation to obtain the most accurate results possible.
The main problem is at the level of the information need expressed by the query: Internet queries are, for the most part, composed of one or two keywords, which means that there is too little context to disambiguate keywords. To solve this problem, one possible strategy is to identify (by means of a dictionary) the degree of ambiguity of the words in the query and ask the user to better specify his or her query if necessary (interactively, if efficient strategies are available). An alternative approach involves showing documents answering the query directly (i.e., with no disambiguation stage), before asking the end user to evaluate their relevance according to information need.
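The dictionary-based strategy just mentioned can be illustrated with a minimal sketch (the bilingual entries and the threshold are invented for the example): count the candidate translations a dictionary offers for each keyword, and flag the ambiguous ones so the user can be asked to refine the query.

```python
# Hedged sketch of dictionary-based ambiguity detection for short
# cross-language queries. The dictionary entries are invented.

BILINGUAL_DICT = {                       # English -> French candidates
    "bank": ["banque", "rive"],          # financial institution / riverbank
    "interest": ["intérêt", "taux"],
    "rate": ["taux"],
}

def ambiguous_keywords(query: str, threshold: int = 2) -> list[str]:
    """Return query words with at least `threshold` candidate translations,
    i.e., words the user should be asked to disambiguate interactively."""
    return [w for w in query.lower().split()
            if len(BILINGUAL_DICT.get(w, [])) >= threshold]

print(ambiguous_keywords("bank rate"))   # ['bank'] -> ask the user which sense
```

With only one or two keywords there is no context to disambiguate automatically, which is exactly why such an interactive step (or the alternative of showing candidate documents directly) is needed.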
The automatic analysis of the selected documents can then in turn be used to enhance the search to interactively obtain more accurate results in the target languages. Researchers in this field have developed several products that are primarily integrated in commercial solutions for “key corporate accounts” (large companies or administrations). In order to be efficient, the proposed solutions require the use of specialized lexicons depending on the target field.
Automatic Subtitling and Captioning
Automatic subtitling is an application that automatically produces a transcription of the audio portion of a program. It can be used in a monolingual environment but is now also used to provide live translated subtitles. Automatic subtitling can often be seen in noisy environments (train stations, airports, etc.) and is already deployed by mass media around the world. This type of application also makes mass media accessible to hearing-impaired people and to people who do not know the source language.
The quality of automatic speech transcription has made these applications possible in monolingual contexts. Today, the techniques used for automatic subtitling, coupled with machine translation, allow for the production of subtitles in various languages, live and at no additional cost. The quality of the result, however, remains a problem, and such solutions are not yet deployed on a large scale.
Direct Translation in Multilingual Dialogue
Automatic speech translation is seen as a major opportunity by most information technology companies. Skype, for example, owned by Microsoft, developed a prototype that was incorporated into its communication platform. The trend is now widespread: the mobile messaging application WeChat has also announced the integration of a machine translation system. WeChat is first and foremost an interactive service for written messages, but it also allows for exchanging voice messages: these will be automatically translated in the same way once the quality of the system is considered sufficient. Finally, Google introduced a voice translation application for mobile devices as part of its Google Translate system on Android platforms. All mobile operators are working on these types of applications to allow for “transparent” multilingual calls thanks to direct translation (that is, calls with a speaker of a foreign language without the interlocutor even noticing that another language is being spoken). However, it should be noted that the quality of these systems is unlikely to be sufficient to allow real conversations between humans in the short term: even though the quality of speech transcription improves regularly, the current error rate, compounded by the translation modules, risks turning the conversation into a dialogue of the deaf!
As for the American giant AT&T, it has developed Watson,6 a project that enables “speech-to-speech” applications, or live multilingual interaction with simultaneous translation. The application seems more capable of performing successfully than the applications described in the previous paragraph. In fact, in addition to traditional conversations between people, the American company targets audio-interactive services. In this context, translation is highly focused, since the goal of the system is mainly to manage access to large databases of company information. The system must be able to understand a query (expressed in some specific language) and provide an answer (a telephone number, for example) in the speaker’s language. This kind of application seems more achievable in the short term than multilingual conversations on any topic between people.
Cell Phones and Connected Objects
New technologies and new applications now play a leading role in machine translation. “Speech-to-speech” applications are inseparable from the development of mobile phones. The majority of the applications we have seen are available today for cell phones (as applications), even if it is not really possible to have a direct multilingual conversation by telephone yet. For practical reasons, mobile phone applications are now geared more toward the direct translation of a few sentences in a conversation between people in the same room for example, but the eventual target is of course the direct translation of distant conversations through mobile phones.
Developers of such applications make use of all the possibilities of modern cell phones. To give one practical example, a specific application makes it possible to take a picture of a restaurant menu and immediately get the translation of the menu (though it seems the system is still unable to say whether the food will be good!). Through this specific application, one can see the convergence of different research fields: image analysis (in order to identify and extract text zones from the image), automatic character recognition, and machine translation.
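The convergence described above is essentially a pipeline of three components. A minimal sketch, with stub functions standing in for the real image-analysis, OCR, and translation engines (all names and data structures here are hypothetical):

```python
# Sketch of the menu-photo translation pipeline: image analysis finds
# text zones, OCR reads them, machine translation renders them. The
# three stages are stubs; real engines would replace them.

def detect_text_regions(image):
    """Image analysis: locate zones likely to contain text."""
    return image["regions"]             # stub: pretend zones were found

def recognize_characters(region):
    """OCR: turn a text zone into a character string."""
    return region["text"]               # stub: pretend OCR succeeded

def translate(text, target="en"):
    """MT: translate the recognized string."""
    return f"[{target}] {text}"         # stub: placeholder translation

def translate_photo(image, target="en"):
    """Chain the three components over every detected text zone."""
    return [translate(recognize_characters(r), target)
            for r in detect_text_regions(image)]

menu_photo = {"regions": [{"text": "soupe du jour"}, {"text": "plat principal"}]}
print(translate_photo(menu_photo))
```

Each stage introduces its own errors, and they compound: a misread character can derail the translation entirely, which is one reason such applications work best on short, typographically clean texts like menus and signs.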
Internet-connected objects (such as watches or glasses) will serve to support new applications in which multilingual speech will also be included. The Japanese company NTT Docomo has introduced a model of glasses with enhanced vision that incorporates machine translation features: the user can look at a text in Japanese and obtain a translation in English. At the moment, it is just a prototype whose quality and robustness have not been tested, but these examples illustrate the range of applications that exist for both text and speech.
Today these gadgets seem to suffer from a lack of interest from the general public, as a result of their high price and their uncertain positioning in terms of applications (Google Glass generated massive media coverage, only to be pulled from the market due to lack of commercial interest). The future of these objects is without a doubt more promising in professional contexts requiring people to work hands-free, for maintenance in particular (such as in the nuclear, aeronautic, and computer science fields). Other professional contexts could also provide opportunities, such as applications in medicine or sales or in the cultural domain (e.g., visits of museums with augmented reality devices).
Translation Aid Tools
While there has been renewed interest in machine translation since the 2000s, translation aid tools still lag behind. Companies now provide efficient specialized tools, especially “translation memories,” or databases where translators can find examples based on previous translations. Translation memories are being increasingly used and sometimes even imposed by companies on translators to ensure the coherence of translations.
Statistical translation models are based on the analysis of large bilingual corpora that can be considered a huge translation memory. However, we must not go too far with the analogy: the work of a human translator has little to do with how automatic systems operate. Another question is whether machine translation tools, which have made great progress in recent years, can help human translators in their work. Since most tools provide complete translations (and not merely fragments of translations), the only possible strategy consists in post-editing the translation to obtain a quality result. The outcome of this approach is mixed and difficult to generalize. It is necessary for the translations proposed by the machine to be of good quality so that the translator can work quickly and efficiently. The approach is only possible if the system has been tuned to fit the target domain and if the domain has a regular terminology and phraseology. A good example is the system developed for Environment Canada: the target was a very specific field (weather forecasts) with specific pieces of information (temperature lists for each city, etc.) to fill regular text templates. In this context, post-editing is very limited. In comparison, the translation of a technical text with a standard tool risks giving inoperable results.
Machine translation systems sometimes provide a translation that is just sufficient to get the gist of a document. This quality, while acceptable in some contexts, generally falls far short of what a human translator needs. It also regularly happens that the proposed translation fragments are impossible to reuse. The solution is then to completely rephrase the sentence, in which case the automatic system is simply useless. Consequently, professional translators often prefer traditional work methods, which in the end are faster than automated ones. It is also worth recalling that the European Commission has poured a lot of money into machine translation, but that, as mentioned previously, at most 5% of the translations produced rely on automatic tools. This shows that automatic translation is still far from usable in real-world industrial or administrative contexts when the target is near-professional translation quality.
Machine translation post-editing has, despite everything, recently become a full-fledged field of research. Conferences are currently organized around this single theme, showing the scientific and economic potential of the field. The interest is actually twofold. First, improving the productivity of translators: this involves efficient systems and strategies to make the best of the output of machine translation tools. Second, improving machine translation systems directly: this means being able to dynamically reuse end-user feedback to make the system evolve and propose more accurate translations in the future.
Beyond these experiments using standard machine translation tools, there is broad consensus today that translation aid tools should not supply a single translation at sentence level, but fragments of translations from which the translator can then choose. Trojanskij’s assisted translation environment (see chapter 4) remains in this regard a clear-sighted invention that has still never been explored in depth. We may also recall Bar-Hillel’s recommendations, or the 1966 ALPAC report: high-quality machine translation was seen as an illusion or at least an elusive goal for a long time. In the meantime, human translators need specific tools (and not standard commercial machine translation systems) to improve their productivity as well as the quality and homogeneity of the translations produced.
This is in fact a difficult problem, since no one knows exactly what would truly be helpful for a translator. Enhancing translation memories is the easiest path, since it displays the most relevant segments of texts according to the context of translation. But even this seemingly modest enhancement poses problems, insofar as continuously updating the displayed translation fragments can make the application relatively slow and burdensome to use. However, translation memory modules are widely used and remain the main application employed by professional translators in their work environments.
Notes
1. Data available online (site visited May 20, 2016); see http://ec.europa.eu/dgs/translation/faq/index_en.htm#faq_4/.
2. The rest is marginal and corresponds to work such as post-editing, translations of summaries, etc.
3. Moses is an open system for machine translation that implements some of the main algorithms of statistical machine translation. Moses incorporates another tool, Giza++, which implements various IBM models, and plenty of other algorithms have since been included in this platform, which is free online (http://www.statmt.org/moses).
4. http://www.hutchinesweb.me.uk/Compendium-16.pdf; site visited September 15, 2014.
5. In an article from the magazine Le Point, September 2013; see http://www.lepoint.fr/editos-du-point/jean-guisnel/dimitrissabatakakis-systran-n-existerait-pas-sans-les-agences-derenseignements-americaines-18-09-2013-1732865_53.php.
6. See http://www.research.att.com/projects/WATSON/.