Lecture 8

Lecture video: link

The majority of this lecture covers the final project, which doesn't apply to us.

Attention

Recall our sequence-to-sequence model with attention from the previous lecture. We use the encoder as before. At each time step of the decoder, we compute a new hidden representation and use it to look back at the encoder: we apply some similarity function between the decoder hidden state and each encoder hidden state to get attention scores, put those scores through a softmax to get an attention distribution (probability weights), and take the weighted average of the encoder RNN hidden states to get an attention output. We then use both the attention output and the decoder RNN's hidden state to calculate the next output word.
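As a concrete illustration, here is a minimal numpy sketch of a single decoder time step with dot-product attention. The variable names, toy dimensions, and random stand-in weights are invented for the example; a real model would learn these parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

# Toy dimensions (hypothetical): N source words, hidden size d, vocab size V.
N, d, V = 5, 8, 20
rng = np.random.default_rng(0)

enc_hiddens = rng.normal(size=(N, d))    # encoder hidden states h_1..h_N
dec_hidden  = rng.normal(size=(d,))      # decoder hidden state at this time step
W_out       = rng.normal(size=(V, 2*d))  # output projection (would be learned)

# 1. Similarity between the decoder state and each encoder state (dot product here).
scores = enc_hiddens @ dec_hidden                  # shape (N,)
# 2. Softmax turns the scores into an attention distribution (probability weights).
attn_dist = softmax(scores)                        # shape (N,), sums to 1
# 3. Weighted average of the encoder hidden states = attention output.
attn_output = attn_dist @ enc_hiddens              # shape (d,)
# 4. Use the attention output together with the decoder state to pick the next word.
logits = W_out @ np.concatenate([attn_output, dec_hidden])  # shape (V,)
next_word = int(np.argmax(logits))
```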

Why do we need separate encoder and decoder RNNs rather than a single RNN? One reason is that for machine translation, one sequence is in the source language and the other is in the target language. The argument for the LSTM is that it's good at maintaining history over many steps, but attention has been shown to be more effective at accessing elements of past states.

Why not have self-attention from the decoder RNN back to previous states of itself? That's actually a good idea, and will be covered later in this series.

Attention in equations

We have the encoder hidden states. On time step t we have the decoder hidden state. We want attention scores that measure how much the decoder should attend to each encoder hidden state. We put those scores through a softmax to get a probability distribution. We then construct a new vector as the weighted sum of the encoder hidden states; this is the attention output. Finally, we concatenate the attention output with the decoder hidden state and proceed as in the ordinary seq2seq model.


This is the equation version of the algorithm we have been looking at visually so far:

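In standard notation (a sketch; dot-product scores are assumed here, with h_1, ..., h_N the encoder hidden states and s_t the decoder hidden state on step t):

```latex
% Attention scores on decoder step t (dot product with each encoder state)
e^t = [\, s_t^\top h_1, \;\dots,\; s_t^\top h_N \,] \in \mathbb{R}^N
% Attention distribution (softmax over the scores)
\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N
% Attention output: weighted sum of the encoder hidden states
a_t = \sum_{i=1}^{N} \alpha_i^t \, h_i \in \mathbb{R}^{d}
% Concatenate with the decoder hidden state and proceed as in the
% non-attention seq2seq model
[\, a_t ; s_t \,] \in \mathbb{R}^{2d}
```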

Efficacy of attention

Attention significantly improves neural machine translation performance. It's very useful to allow the decoder to focus on certain parts of the source.

Attention provides a more "human-like" model of the machine translation process. The RNN can look back at the source sentence while translating, rather than needing to "remember" it all.

Attention solves the bottleneck problem (having to cram the entire source sentence into a single encoding vector), because we allow the decoder to look directly at the source.

It also mitigates the vanishing gradient problem by providing a shortcut to faraway states.

Attention also provides some interpretability to seq2seq models. By inspecting the attention distribution, we can see what the decoder was focusing on - (soft) alignment. This is neat because we never explicitly trained an alignment system; the network learned alignment between the source and target languages by itself.

Attention variants

Commonly we have some values used as our memory, and a query vector.

Attention involves three steps (sketched in code after this list):

  • Computing the attention scores
  • Taking the softmax to get an attention distribution
  • Using the attention distribution to take the weighted sum of values, thus obtaining the attention output (sometimes called the context vector).
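The same three steps can be written as a generic function that is independent of seq2seq (a sketch; the function and argument names are invented for illustration). The scoring function is left as a parameter, which is exactly the knob the variants below adjust.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, values, score_fn):
    """General attention: the query attends to the values.

    query:    vector of shape (d_q,)
    values:   matrix of shape (N, d_v), one value vector per row
    score_fn: any function mapping (query, value) -> scalar score
    Returns the attention distribution and the attention output (context vector).
    """
    scores = np.array([score_fn(query, v) for v in values])  # 1. attention scores
    attn_dist = softmax(scores)                              # 2. attention distribution
    attn_output = attn_dist @ values                         # 3. weighted sum of values
    return attn_dist, attn_output

# Example: dot-product scoring (assumes d_q == d_v).
dot_score = lambda q, v: q @ v
rng = np.random.default_rng(0)
dist, output = attention(rng.normal(size=4), rng.normal(size=(6, 4)), dot_score)
```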

There is more than one way to compute the attention score. Let h1...hn be the hidden states from the encoder RNN, and s be the current state of the decoder RNN.

The simplest way is to use the dot product. But the dot product uses the entire hidden state, and not all of it is necessarily about what to attend to: an LSTM hidden state is simultaneously carrying information from the past to use in the future, information about what output to generate next, and information that serves as a query or key for attention. We may only want to use some of that information to calculate the attention score.

Another idea (multiplicative attention) puts an extra matrix W in the middle of the product, which lets us learn which parts of s and which parts of h to pay attention to. But does the W matrix have too many parameters? We're putting in d^2 new parameters (d being the dimension of the hidden states), which lets us combine any element of s with any element of h. We might like it to have fewer parameters.

Reduced-rank multiplicative attention instead replaces W with the product of two low-rank matrices U and V, which together have far fewer parameters. This is exactly what happens in transformer models.

Additive attention, the original formulation of attention, does something more complicated, which works out to using a small neural network (one hidden layer) to calculate each attention score.

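In formula form (a sketch using standard notation; s is the decoder state, h_i an encoder hidden state, and W, U, V, W_1, W_2, v are learned parameters):

```latex
% Dot-product attention (requires s and h_i to have the same dimension d)
e_i = s^\top h_i
% Multiplicative (bilinear) attention: W is d x d, so d^2 new parameters
e_i = s^\top W h_i
% Reduced-rank multiplicative attention: U, V are k x d with k << d,
% so W is replaced by U^\top V with far fewer parameters
e_i = s^\top U^\top V h_i = (U s)^\top (V h_i)
% Additive attention: a one-hidden-layer network produces each score
e_i = v^\top \tanh(W_1 h_i + W_2 s)
```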

Attention is a general Deep Learning technique

We can use attention in many architectures (not just seq2seq), and many tasks (not just machine translation).

A more general definition of attention: given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query. We sometimes say that the query attends to the values.

The weighted sum gives us a selective summary of the information contained in the values, where the query determines which values to focus on. Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query). Attention has become a powerful, flexible, general way of doing pointer-like and memory-like access in deep learning models, somewhat like RAM.

Much of the progress in deep learning in recent years consists of ideas from the 80s and 90s that were given new life in the 2010s. Attention is a genuinely new idea, from after 2010!