Lecture 7

Lecture video: link

Machine translation

Machine translation is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).

Pre-neural machine translation

During the Cold War in the 1950s, researchers tried to translate Russian into English. The experiments did not go well: people didn't yet understand either computer science or linguistics very well, and the systems relied on rule-based methods and dictionary look-up.

People resumed trying machine translation in the mid-1990s, this time trying to learn a probabilistic model from data. Suppose we want to find the best English sentence y given a French sentence x: argmax[y]P(y|x). We can use Bayes' rule to break this down into two components to be learned separately: argmax[y]P(x|y)P(y).
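
Spelled out, the Bayes' rule step looks like this (P(x) does not depend on y, so it can be dropped from the argmax):

```latex
\hat{y} \;=\; \arg\max_y P(y \mid x)
        \;=\; \arg\max_y \frac{P(x \mid y)\, P(y)}{P(x)}
        \;=\; \arg\max_y P(x \mid y)\, P(y)
```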

P(x|y) is a translation model, focused on fidelity: it gives the probability of words or phrases being translated between the two languages, without worrying about word order in the target language. It is learned from parallel data (sentence pairs in both languages). P(y) is a language model, focused on fluency, which can be learned from monolingual data.

Translation model

Training data

First, we need a large amount of parallel data: pairs of human-translated sentences in the source and target languages. In the modern world there are many places where parallel data is produced in large quantities, e.g. the European Union across European languages, the Canadian parliament in French and English, Hong Kong in English and Chinese, etc.

Alignment

We introduce an alignment variable a, which captures the word-level correspondence between particular words in the source sentence x and the target sentence y.

image

Typological differences between languages (e.g. SVO vs. SOV word order) lead to complicated alignments. Some words have no counterpart: e.g. in French, Japan is called "le Japon", and the "le" just goes away in English. There are also many-to-one alignments, e.g. "aboriginal people" in English translates to "autochtones" in French. The reverse, one-to-many alignments, also exists, such as "implemented" in English being translated as "mis en application". And finally there are many-to-many alignments, e.g. "don't have any money" in English is translated as "sont démunis", with no good direct correspondence between the individual words. We could render the English sentence "The poor don't have any money." as "The poor are moneyless.", which aligns more closely to the French sentence "Les pauvres sont démunis."

Learning alignment

We learn P(x,a|y) as a combination of many factors:

  • the probability of particular words aligning (which also depends on the position in the sentence)
  • the probability of particular words having a particular fertility (number of corresponding words)
  • etc.

Alignments a are latent variables: they aren't explicitly specified in the data. Learning them requires special algorithms, such as Expectation-Maximization, for estimating the parameters of distributions with latent variables.
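
Concretely, the translation model is obtained by summing out the latent alignment (this is the standard formulation used by the classic IBM alignment models):

```latex
P(x \mid y) \;=\; \sum_{a} P(x, a \mid y)
```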

image

For decoding, we could enumerate every possible y and calculate its probability, but that is exponential in the length of the sentence: far too expensive. For language models, we generated words one at a time, but here we also have to deal with the fact that words occur in different orders in the source and target sentences.

The answer is to impose strong independence assumptions in the model and use dynamic programming for globally optimal solutions (e.g. the Viterbi algorithm).

Consider the German sentence "er geht ja nicht nach hause" ("he does not go home"). We start with translations of individual words or multi-word phrases as "lego pieces", so to speak, and then generate the translation piece by piece, as we did with the neural language models.

We start with an empty translation, then select one of the pieces to use. We can explore different possible pieces: we could translate "er" as "he", or "geht" as "are", as the first word in the target sentence; starting the sentence with "he" is more likely. We also record which source words have already been covered by the translation. Next we could translate "geht" as "goes", or "ja nicht" as "does not". We continue this search-and-prune process until we have translated the entire sentence.

image

From the 1990s to the 2010s, Statistical Machine Translation (SMT) was a huge research field. The best systems were extremely complex, with many details we haven't mentioned here and many separately-designed subcomponents. They involved a lot of feature engineering to capture particular language phenomena, and required compiling and maintaining extra resources such as tables of equivalent phrases. They also took a lot of human effort to maintain, with the effort repeated for every pair of languages. Nonetheless they were fairly successful: Google Translate launched in the mid-2000s and worked fairly well.

Neural Machine Translation

In 2014 neural machine translation took the world by storm. This specifically means building a single neural network that does translation end-to-end.

Sequence-to-sequence

The neural architecture for these models is called sequence-to-sequence ("seq2seq") and involves two neural networks. At runtime, the Encoder RNN produces an encoding of the source sentence; the final hidden state of the Encoder RNN represents the source sentence. We feed this hidden state into the Decoder RNN. The Decoder RNN is a Language Model which generates a target sentence, conditioned on the encoding.
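
As a concrete picture of the two networks, here is a minimal sketch in PyTorch. The class names, embedding size, and hidden size are illustrative assumptions, not the lecture's implementation.

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, src_vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(src_vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                    # src_ids: (batch, src_len)
        emb = self.embed(src_ids)                  # (batch, src_len, emb_dim)
        outputs, (h, c) = self.rnn(emb)            # h, c: (1, batch, hidden_dim)
        return outputs, (h, c)                     # final state encodes the source

class DecoderRNN(nn.Module):
    def __init__(self, tgt_vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, tgt_ids, state):             # tgt_ids: (batch, tgt_len)
        emb = self.embed(tgt_ids)
        outputs, state = self.rnn(emb, state)      # conditioned on the encoder's state
        return self.out(outputs), state            # logits over the target vocabulary
```

The decoder is just a language model whose initial hidden state is the encoder's final hidden state; that hand-off is the conditioning described above.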

Sequence-to-sequence is very versatile, used for many tasks besides machine translation, such as

  • Summarization (long text -> short text)
  • Dialogue (previous utterances -> next utterance)
  • Parsing (input text -> output parse as a sequence)
  • Code generation (natural language -> Python code)

The sequence-to-sequence model is an example of a Conditional Language Model. It predicts the next word of the target sentence y (a Language Modeling task), and is conditional because its predictions are also conditioned on the source sentence x.

Neural machine translation directly calculates P(y|x):

image
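
The probability in the figure factorizes by the chain rule, with every term conditioned on the source sentence x:

```latex
P(y \mid x) \;=\; P(y_1 \mid x)\, P(y_2 \mid y_1, x) \cdots P(y_T \mid y_1, \dots, y_{T-1}, x)
            \;=\; \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)
```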

How to train a Neural Machine Translation system

First, we get a large parallel corpus. We take batches of source sentences and target sentences, encode each source sentence with the encoder RNN, feed its final hidden state into the decoder RNN, and compare, word by word, the output of the decoder RNN with the truth from the corpus. We use the teacher forcing algorithm described earlier: at each step the decoder predicts a single word, and we then force it to continue as though it had chosen the correct word from the corpus.

Seq2seq is optimized as a single system, and backpropagation operates end-to-end, updating all the parameters of both the encoder and decoder model.
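
A minimal sketch of one training step with teacher forcing, assuming the hypothetical EncoderRNN/DecoderRNN classes sketched above and batches of padded word-index tensors; the function name and padding id are illustrative.

```python
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src_ids, tgt_ids, pad_id=0):
    """One seq2seq training step with teacher forcing (sketch).
    The optimizer should hold the parameters of BOTH networks, e.g.
    torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))."""
    optimizer.zero_grad()
    _, enc_state = encoder(src_ids)                  # encode the source batch
    # Teacher forcing: feed the gold prefix, predict the next gold word.
    decoder_input = tgt_ids[:, :-1]                  # <START> w1 ... w_{T-1}
    gold_output = tgt_ids[:, 1:]                     # w1 ... w_T <END>
    logits, _ = decoder(decoder_input, enc_state)    # (batch, T, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        gold_output.reshape(-1),
        ignore_index=pad_id,                         # don't penalize padding positions
    )
    loss.backward()                                  # gradients flow through decoder AND encoder
    optimizer.step()
    return loss.item()
```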

image

Multi-layer RNNs

The RNNs we've been looking at are already "deep" in one dimension (they unroll over many time steps), but they are shallow in the sense that there has been only a single layer of recurrent structure above our sentences. We can make them "deep" in another dimension by applying multiple RNNs, also known as "stacked RNNs". This allows the network to compute more complex representations: the lower RNNs should compute lower-level features, and the higher RNNs should compute higher-level features.

Lower-level features are basic facts about words and phrases, such as a word's part of speech or named entity type, while higher-level features concern the overall structure of a sentence, the sentiment of a phrase, idioms, etc.

At each time step, after calculating a hidden representation, we feed it into the next LSTM layer, and feed that layer's output into a third LSTM layer. The output of the encoder is then a stack of three encodings (hidden states).
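
In PyTorch this stacking is just the num_layers argument; a small illustrative snippet (sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Stacked (multi-layer) LSTM: each layer's hidden states are the inputs to the layer above.
rnn = nn.LSTM(input_size=256, hidden_size=512, num_layers=3, batch_first=True)
x = torch.randn(8, 20, 256)        # (batch, time steps, features)
outputs, (h_n, c_n) = rnn(x)
print(outputs.shape)               # (8, 20, 512) -- top layer's hidden state at every time step
print(h_n.shape)                   # (3, 8, 512)  -- final hidden state of each of the 3 layers
```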

image

Multi-layer stacked LSTMs perform much better than single-layer LSTMs. High-performing RNNs are usually multi-layer (though not as deep as convolutional or feed-forward networks). A 2017 paper found that for Neural Machine Translation, 2 to 4 layers is best for the encoder RNN and 4 layers is best for the decoder RNN. Generally, 2 layers is a lot better than 1, 3 might be a little better than 2, and after that performance flattens out or gets worse. Deeper RNNs need skip connections / dense connections. Transformer-based networks such as BERT are deeper, e.g. 12 or 24 layers.

Decoding

The simplest way of decoding is to take the argmax over words at each step, then move on to the next word. This is greedy decoding. There is no way to undo decisions.

Input: "il a m'entarté" (French for "he hit me with a pie"). Greedy decoding: -> he ___ -> he hit ___ -> he hit a ___ (no going back now!)
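
A sketch of greedy decoding for a single sentence, assuming the hypothetical DecoderRNN from the earlier sketch and integer ids for the <START> and <END> tokens:

```python
import torch

def greedy_decode(decoder, enc_state, start_id, end_id, max_len=50):
    """Greedy decoding: take the argmax word at every step, with no backtracking."""
    ys = [start_id]
    state = enc_state
    inp = torch.tensor([[start_id]])            # (batch=1, 1)
    for _ in range(max_len):
        logits, state = decoder(inp, state)     # logits: (1, 1, vocab)
        next_id = int(logits[0, -1].argmax())   # argmax = the greedy choice
        if next_id == end_id:
            break                               # stop when the model produces <END>
        ys.append(next_id)
        inp = torch.tensor([[next_id]])         # feed the prediction back in
    return ys
```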

Ideally, we want to find a (length T) translation y that maximizes P(y|x):

image

We could enumerate all possible sequences y and compute their probabilities, but there are exponentially many translations: far too expensive.

Beam search decoding

The core idea is that on each step of the decoder, we keep track of the k most probable partial translations (hypotheses). k is the beam size (in practice around 5-10).

A hypothesis (the prefix of a translation) is scored: its score is its log probability under the language model. Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search. For each word added to a hypothesis, we calculate the score as the sum of the scores of the preceding words plus the log probability of the current word (all of these terms are negative, since they are logs of probabilities).

image

In greedy decoding, we decode until the model produces an <END> token. In beam search decoding, hypotheses may terminate at different lengths. When a hypothesis produces <END>, we put it aside and continue exploring the other hypotheses via beam search. We continue until either we reach some time step T (T is a pre-defined cutoff), or we have at least n completed hypotheses (n is a pre-defined cutoff).

We now want to select one hypothesis. Naively we could take the highest-scoring one, but longer hypotheses will have lower scores (they are sums of more negative log probabilities). Therefore we normalize by length, i.e. divide each hypothesis's log-probability score by its length.
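
Putting the pieces together, a sketch of beam search for a single sentence, again assuming the hypothetical DecoderRNN from above; parameter names are illustrative.

```python
import torch

def beam_search(decoder, enc_state, start_id, end_id, k=5, max_len=50, n_finished=5):
    """Beam search sketch (batch size 1). Each hypothesis is
    (token ids, summed log-probability, decoder state)."""
    beams = [([start_id], 0.0, enc_state)]
    finished = []                                    # completed hypotheses (hit <END>)
    for _ in range(max_len):                         # max_len plays the role of the cutoff T
        candidates = []
        for tokens, score, state in beams:
            inp = torch.tensor([[tokens[-1]]])
            logits, new_state = decoder(inp, state)
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_ids = log_probs.topk(k)      # expand each hypothesis k ways
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp, new_state))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score, state in candidates[:k]:  # keep only the k best hypotheses
            if tokens[-1] == end_id:
                finished.append((tokens, score))     # set completed hypotheses aside
            else:
                beams.append((tokens, score, state))
        if len(finished) >= n_finished or not beams:
            break
    if not finished:                                 # nothing terminated: fall back to open beams
        finished = [(tokens, score) for tokens, score, _ in beams]
    # Length normalization: divide each score (a sum of log-probabilities) by its length.
    best_tokens, _ = max(finished, key=lambda c: c[1] / len(c[0]))
    return best_tokens
```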

Compared to statistical machine translation, neural machine translation has many advantages:

  • Better performance
    • More fluent
    • Better use of context
    • Better use of phrase similarities (phrases that mean approximately the same thing)
  • A single neural network to be optimized end-to-end
    • No subcomponents to be individually optimized (which are not optimal when combined)
  • Require much less human engineering effort
    • No feature engineering
    • Same method/code for all language pairs

Neural machine translation also has some disadvantages:

  • Less interpretable
    • Hard to debug
  • Difficult to control
    • Can't easily specify rules or guidelines for translation (e.g. more casual style)
    • Safety concerns (can't predict what will be said)

Evaluation

BLEU (Bilingual Evaluation Understudy) compares the machine-written translation to one or several human-written translations and computes a similarity score based on

  • n-gram precision (usually for 1, 2, 3, and 4-grams)
  • plus a penalty for too-short system translations

BLEU is useful but imperfect. There are many valid ways to translate a sentence. A good translation may get a poor BLEU score because it has low n-gram overlap with the human translation.
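
To make the score concrete, here is a simplified single-reference BLEU sketch in Python (real implementations such as sacrebleu additionally handle multiple references, tokenization, and smoothing; the function name is illustrative):

```python
import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU sketch: clipped n-gram precision for n = 1..4 against a
    single reference translation, multiplied by a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        p_n = overlap / total
        if p_n == 0:
            return 0.0                      # any zero precision zeroes the (unsmoothed) score
        log_precisions.append(math.log(p_n))
    # Brevity penalty: punish system translations that are too short.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(log_precisions) / max_n)   # geometric mean of the precisions

print(simple_bleu("he hit me with a pie".split(), "he hit me with a pie".split()))  # 1.0
```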

Progress on machine translation over time:

image

In the early 2010s little progress was being made with phrase- and syntax-based SMT, and most of what there was came from training models on more and more data. (We haven't learned about syntax-based SMT.) The first modern attempt to build a neural MT system came in 2014, when the first seq2seq paper was published. NMT went from a fringe research effort in 2014 to the leading standard method in 2016: by 2016 Google Translate had switched from SMT to NMT, and by 2018 so had everyone else. SMT systems built by hundreds of engineers were outperformed by NMT systems trained by small groups of engineers in a few months.

Machine translation is still not solved, though. Many difficulties remain:

  • Out-of-vocabulary words
  • Domain mismatch between train and test data (e.g. news articles vs text messages)
  • Maintaining context over longer text
  • Low-resource language pairs
  • Failures to accurately capture sentence meaning
  • Pronoun (or zero pronoun) resolution errors
  • Morphological agreement errors
  • Biases in training data, e.g. translating non-gendered sentences into English can apply stereotypical gender biases
  • Hallucinating strange outputs, given uninterpretable inputs

image

image

NMT research continues to thrive, with many, many improvements to the "vanilla" seq2seq NMT system we've just learned about. But one is so integral that it is the new vanilla...

Attention

One problem with the seq2seq architecture we've seen is that the final encoder RNN hidden state has to contain all information about the source sentence - an information bottleneck. We'd like to get more information from the source sentence while running the decoder RNN - kind of like how a human would look back and forth at the source sentence.

Attention provides a solution to the bottleneck problem. Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.

We compare the hidden state of the decoder with the hidden state of the encoder at each position and generate an attention score (a similarity score, e.g. a dot product), then take a softmax to turn the scores into a probability distribution over the encoder hidden states. The attention output is a weighted average of the encoder RNN's hidden states (weighted by that distribution), and this is concatenated with the decoder's hidden state to generate the next word.
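
A sketch of this computation in PyTorch, for basic dot-product attention (the function name and tensor shapes are illustrative):

```python
import torch

def dot_product_attention(dec_hidden, enc_hiddens):
    """Dot-product attention sketch.
    dec_hidden:  (batch, hidden)          -- current decoder hidden state
    enc_hiddens: (batch, src_len, hidden) -- all encoder hidden states
    Returns the attention output and the attention distribution over source positions."""
    # Attention scores: dot product of the decoder state with each encoder state.
    scores = torch.bmm(enc_hiddens, dec_hidden.unsqueeze(2)).squeeze(2)      # (batch, src_len)
    # Softmax turns the scores into a probability distribution over source positions.
    attn_dist = torch.softmax(scores, dim=1)                                 # (batch, src_len)
    # Attention output: weighted average of the encoder hidden states.
    attn_output = torch.bmm(attn_dist.unsqueeze(1), enc_hiddens).squeeze(1)  # (batch, hidden)
    # Downstream, this is typically concatenated with the decoder hidden state:
    # combined = torch.cat([attn_output, dec_hidden], dim=1)
    return attn_output, attn_dist
```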

image
