Lecture 9

Lecture video: link

This lecture covers self-attention and transformers.

Circa 2016, the de facto strategy in NLP was to

  • encode sentences with a bidirectional LSTM (e.g. the source sentence in a translation)
  • define your output (the parse, sentence, summary, etc.) as a sequence, and use an LSTM to generate it
  • use attention to allow flexible access to the "memory" of the LSTM

Today, we're not trying to motivate entirely new ways of looking at problems (like machine translation). Instead, we're trying to find the best building blocks to plug into our models and enable broad progress.

Issues with recurrent models

Linear interaction distance

RNNs are unrolled "left-to-right". This encodes linear locality: a useful heuristic! Nearby words often affect each other's meanings.

The problem is that RNNs take O(sequence length) steps for distant word pairs to interact.

image

This makes it hard to learn long-distance dependencies (due to gradient problems). The linear order of words is "baked in" to the model; we already know linear order isn't the right way to think about sentences.

Lack of parallelizability

An RNN's forward and backward passes have O(sequence length) unparallelizable operations. GPUs can perform a bunch of independent computations at once. But future RNN hidden states can't be computed in full before past RNN hidden states have been computed. This inhibits training on very large datasets. This lack of parallelizability is an inherent part of the recurrent architecture.

If not recurrence, what can we use?

Word windows

A word window model aggregates local contexts. This is also known as 1D convolution, which we will cover later. The number of unparallelizable operations does not increase in proportion to the sequence length.

What about long-distance dependencies? Stacking multiple word window layers allows interaction between farther words. The maximum interaction distance = sequence length / window size. If your sequences are too long, you'll just ignore long-distance context.

image

Attention

Recall that attention treats each word's representation as a query to access and incorporate information from a set of values. We saw attention from the decoder to the encoder; today we'll think about attention within a single sentence.

In this model, all words in the second layer attend to all words in the previous layer. We can't parallelize in depth, but we can in time (sequence length). The maximum interaction distance is O(1), since all words interact at every layer.

Self-attention

Attention operates on queries, keys, and values. For now, assume each query/key/value has the same dimensionality d, and that the number of queries, keys, and values is the same. (In practice the number of queries can differ from the number of keys and values.) In self-attention, the queries, keys, and values are drawn from the same source: we can use the same vector for all three roles (e.g. the output vector for a single word from some previous layer of the network).

First we compute the (dot-product) attention scores e[i,j] = q[i]^T k[j]. These are unbounded scalar values.

image

Then we take a softmax over the scores to get the attention weights alpha[i,j], a probability distribution over j.

image

Finally, output[i] is the sum of the values v[j] weighted by the attention weights alpha[i,j]. There is one output per query.

image
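To make the three steps concrete, here is a minimal NumPy sketch of dot-product self-attention for one sentence, under the simplifying assumption above that queries, keys, and values are all the same vectors x (the function and variable names are just illustrative):

```python
import numpy as np

def softmax_rows(scores):
    # Subtract the row max for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def basic_self_attention(x):
    """Dot-product self-attention with queries = keys = values = x.

    x: (T, d) array, one row per word. Returns a (T, d) array,
    one output vector per query.
    """
    e = x @ x.T              # e[i, j] = q_i . k_j: unbounded scores
    alpha = softmax_rows(e)  # attention weights: each row sums to 1
    return alpha @ x         # output_i = sum_j alpha[i, j] * v_j

T, d = 5, 8
x = np.random.default_rng(0).normal(size=(T, d))
print(basic_self_attention(x).shape)  # (5, 8)
```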

Q: If we're connecting everything to everything else, how is this different from a fully connected layer? A: The interaction weights are computed dynamically as a function of the inputs. In a fully connected layer, the weights are learned slowly during training and then fixed. In attention, the interactions between the query and key vectors depend on the actual content and vary from position to position: the attention weights are allowed to change as a function of the input.

Self-attention as an NLP building block

Let's stack self-attention layers as we have stacked LSTM layers in the past. Can this be a drop-in replacement for recurrence?

No. There are a few issues.

No inherent notion of order

First, self-attention is an operation on sets. It has no inherent notion of order.

We need to encode the order of the sentence in our keys, queries, and values. We could represent each sequence index i as a vector p[i] of dimensionality d. (Don't worry about what the p[i] are made of yet.) It will be easy to incorporate this information into our self-attention block: just add the p[i] to our inputs. (You could concatenate them instead, but people mostly just add.)

Sinusoidal position representations

Concatenate sinusoidal functions of varying periods.

image

Pros: the periodicity indicates that maybe "absolute position" isn't as important. Maybe it can extrapolate to longer sequences as periods restart.

Cons: It's not learnable. And the extrapolation doesn't really work.
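A minimal sketch of sinusoidal position vectors. The exact form (a base of 10000, sines in the even dimensions, cosines in the odd ones) follows the original Transformer paper's recipe and is an assumption here, not something spelled out in the notes:

```python
import numpy as np

def sinusoidal_positions(T, d, base=10000.0):
    """Position vectors p[i] built from sines and cosines of varying periods.

    Assumes d is even. Returns a (T, d) array; row i is added to the
    input vector at sequence index i.
    """
    positions = np.arange(T)[:, None]           # (T, 1)
    dims = np.arange(0, d, 2)[None, :]          # (1, d/2)
    angles = positions / (base ** (dims / d))   # longer periods in later dims
    p = np.zeros((T, d))
    p[:, 0::2] = np.sin(angles)
    p[:, 1::2] = np.cos(angles)
    return p

print(sinusoidal_positions(4, 6).round(2))
```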

Position representation vectors learned from scratch

Learned absolute position representations: Let all p[i] be learnable parameters.

We will learn a matrix p of dimensionality d times T (sequence length).

Pros: Flexibility: each position gets to be learned to fit the data.

Cons: Definitely can't extrapolate to indices beyond 1, ..., T.

Most systems use this! Sometimes people try more flexible representations of position: relative linear position attention; dependency syntax-based position.
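Going back to learned absolute positions: they amount to a T x d parameter matrix whose rows are added to the inputs. A NumPy stand-in is below (the matrix is only initialized here; in a real framework it would be a trainable parameter, e.g. an embedding table):

```python
import numpy as np

rng = np.random.default_rng(0)
T_max, d = 512, 8

# Learned absolute positions: one d-dimensional vector per index 1..T_max.
# Here the matrix is only initialized; in a real model it would be a
# trainable parameter updated by gradient descent.
learned_positions = rng.normal(scale=0.02, size=(T_max, d))

x = rng.normal(size=(5, d))        # word embeddings for a 5-word input
x = x + learned_positions[:5]      # add p[i] to each input vector
print(x.shape)                     # (5, 8)
```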

No nonlinearities

There are no elementwise nonlinearities in self-attention; stacking more self-attention layers just averages the value vectors.

Solution: add a feed-forward network to post-process each output vector - each token. The intuition is that the feed-forward network processes the result of attention. We apply the same feed-forward network to each output.
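A rough sketch of that position-wise feed-forward step. The ReLU nonlinearity and the hidden width d_ff are standard choices but are assumptions here, not specified in the notes:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply the same two-layer MLP independently to each position.

    x: (T, d), W1: (d, d_ff), W2: (d_ff, d). The ReLU provides the
    elementwise nonlinearity that stacked self-attention lacks.
    """
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
T, d, d_ff = 5, 8, 32
x = rng.normal(size=(T, d))
out = position_wise_ffn(x,
                        rng.normal(size=(d, d_ff)), np.zeros(d_ff),
                        rng.normal(size=(d_ff, d)), np.zeros(d))
print(out.shape)  # (5, 8)
```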

"Looking at the future" when predicting a sequence

In language modeling or machine translation, we're trying to predict words in the future. With a recurrent model, we can just not unroll the LSTM further than the current word. To use self-attention in decoders, we need to ensure we can't peek at the future.

One idea: at each timestep, change the set of keys and values to only include past words. This is inefficient and can't be parallelized.

Instead, we mask out the attention to future words by artificially setting their attention scores to negative infinity. We do this at each layer of the decoder.
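A small sketch of the masking step: build the raw score matrix, then overwrite every entry where the key position j is ahead of the query position i with negative infinity, so the softmax gives those positions exactly zero weight:

```python
import numpy as np

def mask_future_scores(e):
    """Set scores for future positions (j > i) to -inf before the softmax."""
    T = e.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where j > i
    return np.where(future, -np.inf, e)

e = np.arange(9, dtype=float).reshape(3, 3)   # pretend raw attention scores
print(mask_future_scores(e))
# [[ 0. -inf -inf]
#  [ 3.   4. -inf]
#  [ 6.   7.   8.]]
```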

The Transformer model

Let's look at the Transformer Encoder and Decoder Blocks at a high level.

We start with our (input) word embeddings and add in the position representations. We have a sequence of Transformer Encoder blocks. For the output sequence, we again have word embeddings and position representations, followed by the Transformer Decoder blocks. The output of the last encoder block is used in each layer of the decoder: the decoder attends to the encoder states. Then the decoder outputs predictions.

What's left in a Transformer Encoder Block that we haven't covered?

  1. Key-query-value attention. How do we get the k, q, v vectors from a single word embedding?
  2. Multi-headed attention: Attend to multiple places in a single layer.
  3. Tricks to help with training
    • Residual connections
    • Layer normalization
    • Scaling the dot product
    • These tricks don't improve what the model is able to do; they help improve the training process, which is equally important.

We saw that self-attention is when keys, queries, and values come from the same source. The Transformer encoder does this in a particular way.

Key-Query-Value Attention

Let x1 ... xT be input vectors to the Transformer encoder - vectors of dimension d.

The keys, queries, and values are

  • ki = K*xi, where K is the d x d key matrix.
  • qi = Q*xi, where Q is the d x d query matrix.
  • vi = V*xi, where V is the d x d value matrix.

These matrices allow different aspects of the x vectors to be used/emphasized in each of the three roles.

Let X = [x1;...;xT] of dimension T x d be the concatenation of input vectors.

First, note that XK, XQ, and XV are also of dimension T x d.

The output tensor is defined as output = softmax(XQ(XK)^T) x XV.

N.B. X^T means X transposed. x[i] is x with subscript i.

image
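In NumPy, the matrix form is only a few lines (scipy's softmax is applied row-wise; the random inputs are just for shape-checking):

```python
import numpy as np
from scipy.special import softmax

def kqv_attention(X, K, Q, V):
    """output = softmax(XQ (XK)^T) XV, with X: (T, d) and K, Q, V: (d, d)."""
    scores = (X @ Q) @ (X @ K).T       # (T, T) matrix of dot products
    alpha = softmax(scores, axis=-1)   # row-wise attention weights
    return alpha @ (X @ V)             # (T, d)

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
K, Q, V = (rng.normal(size=(d, d)) for _ in range(3))
print(kqv_attention(X, K, Q, V).shape)  # (5, 8)
```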

Multi-headed self-attention

What if we want to look at multiple places in the sentence at once? For word i, self-attention "looks" where x[i]^T Q^T K x[j] is high - those are the i, j pairs that end up interacting with each other. But maybe we want to focus on different j for different reasons.

We will define multiple attention "heads" through multiple Q, K, V matrices. Let Q[l], K[l], V[l] be matrices of dimension d x d/h, where h is the number of attention heads and l ranges from 1 to h. They still apply to the X matrix, but each maps it to a lower-dimensional space of dimension d/h.

Each attention head performs attention independently: output[l] = softmax(X Q[l] K[l]^T X^T) X V[l], where output[l] has dimension T x d/h.

The outputs of all the heads are combined: output = [output[1];...;output[h]] Y (concatenate all the heads together, then mix them), where Y is of dimensionality d x d.

Each head gets to "look" at different things and construct value vectors differently. We're still doing the same amount of computation as before.

Pictorially:

image
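Here is a sketch of multi-headed self-attention with h heads, looping over the heads for clarity (real implementations batch the heads into a single tensor operation; the names here are illustrative):

```python
import numpy as np
from scipy.special import softmax

def multi_head_self_attention(X, Q_heads, K_heads, V_heads, Y):
    """X: (T, d); Q_heads/K_heads/V_heads: h matrices of shape (d, d/h); Y: (d, d)."""
    outputs = []
    for Q_l, K_l, V_l in zip(Q_heads, K_heads, V_heads):
        scores = (X @ Q_l) @ (X @ K_l).T       # (T, T): one attention pattern per head
        alpha = softmax(scores, axis=-1)
        outputs.append(alpha @ (X @ V_l))      # (T, d/h)
    return np.concatenate(outputs, axis=-1) @ Y   # concat heads, then mix: (T, d)

rng = np.random.default_rng(0)
T, d, h = 5, 8, 2
X = rng.normal(size=(T, d))
Qs, Ks, Vs = ([rng.normal(size=(d, d // h)) for _ in range(h)] for _ in range(3))
Y = rng.normal(size=(d, d))
print(multi_head_self_attention(X, Qs, Ks, Vs, Y).shape)  # (5, 8)
```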

Training tricks

Residual connections

Residual connections are a trick to help models train better. Instead of X[i] = Layer(X[i-1]), where i indexes the layer, we let X[i] = X[i-1] + Layer(X[i-1]): the output of layer i is the output of layer i-1 plus the layer function applied to it.

So we only have to learn "the residual" from the previous layer. Intuition: we should be learning only how layer i should be different from layer i-1.

image vs. image

The gradient of the second one is much better. Residual connections are thought to make the "loss landscape" considerably smoother.

image
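As a sketch, with an arbitrary stand-in sublayer (the ReLU layer here is only a placeholder for attention or the feed-forward network):

```python
import numpy as np

def sublayer(x, W):
    # Stand-in for any sublayer (self-attention, feed-forward, ...).
    return np.maximum(0.0, x @ W)

def residual_step(x, W):
    # X[i] = X[i-1] + Layer(X[i-1]): the sublayer only learns the residual.
    return x + sublayer(x, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 8))
print(residual_step(x, W).shape)  # (5, 8)
```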

Layer normalization

Layer normalization is a trick to help models train faster. At different times during the forward pass of training, there is a lot of uninformative variation which can harm training. The idea is to cut down on uninformative variation in the hidden layer vector values by normalizing each vector to zero mean and unit standard deviation within each layer. LayerNorm's success may be due to its normalizing gradients.

image

image

epsilon is a small constant added in case the standard deviation approaches 0. * is the Hadamard (elementwise) product. The learned gain and bias may not be necessary, but they're frequently used.

This is very important to transformers - unless we do this, they don't train well.
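A minimal sketch of layer normalization matching the description above, assuming per-token statistics over the d hidden dimensions and learned gain and bias vectors:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each token's hidden vector to mean 0 and std 1, then rescale.

    x: (T, d); gain, bias: (d,). Statistics are computed per token across
    the d hidden dimensions, not across the batch or the sequence.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gain * (x - mu) / (sigma + eps) + bias

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = layer_norm(x, gain=np.ones(8), bias=np.zeros(8))
print(out.mean(axis=-1).round(6))  # ~0 for every token
print(out.std(axis=-1).round(3))   # ~1 for every token
```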

Scaled dot product

When the dimensionality d becomes large, the dot products between vectors become large. Because of this, the inputs to the softmax function can be large, making the gradients small: the softmax becomes very peaked, putting most of the probability mass on a few entries and zeroing out the attention (and thus the gradients) to everything else.

Solution: Divide all scores by sqrt(d/h).

Why do this even if we are going to use layer norm? Layer norm keeps the vectors from getting too large or too small, but it comes later in the pipeline, after the softmax has already been computed.

Why sqrt(d/h) instead of some other function of d/h? The magnitude of the dot product between random vectors grows roughly with the square root of their dimensionality - check out the original paper for the details.

image
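The change itself is one line: divide the raw scores by sqrt(d/h) before the softmax. A quick sketch with random per-head queries and keys, just to show the shapes:

```python
import numpy as np
from scipy.special import softmax

def scaled_attention_weights(Xq, Xk, d_head):
    """Divide scores by sqrt(d/h) so the softmax doesn't saturate as d grows."""
    scores = (Xq @ Xk.T) / np.sqrt(d_head)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
T, d, h = 5, 64, 8
Xq = rng.normal(size=(T, d // h))   # per-head queries, dimension d/h
Xk = rng.normal(size=(T, d // h))   # per-head keys, dimension d/h
print(scaled_attention_weights(Xq, Xk, d // h).round(3))
```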

Updated encoder-decoder

To sum up, here is the new Encoder block using these tips:

image

And the Decoder block:

image

Multi-head cross-attention is the connection to the Transformer encoder block, so the decoder block contains multiple attention sublayers (masked self-attention and cross-attention). The residual + LayerNorm comes after each step to help the gradients pass through.

Cross-attention (details)

We've seen cross-attention from the decoder to the encoder. (We don't attend to encoder blocks other than the last one.)

  • Let h[1] ... h[T] be output vectors from the Transformer encoder, each of dimensionality d.
  • Let z[1] ... z[T] be output vectors from the Transformer decoder, each of dimensionality d.
  • Keys and values are drawn from the encoder (like a memory):
    • k[i] = K h[i], and v[i] = V h[i]
  • The queries are drawn from the decoder, q[i] = Q z[i].

Very similarly to the key-query-value attention section above,

  • Let H = [h1;...;hT] of dimension T x d be the concatenation of encoder vectors.
  • Let Z = [z1;...;zT] of dimension T x d be the concatenation of decoder vectors.
  • output = softmax(ZQ(HK)^T) x HV

image
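A minimal sketch of the cross-attention formula, with decoder states supplying the queries and encoder states supplying the keys and values (random inputs for shape-checking only):

```python
import numpy as np
from scipy.special import softmax

def cross_attention(Z, H, K, Q, V):
    """output = softmax(ZQ (HK)^T) HV.

    Z: (T_dec, d) decoder vectors supply the queries;
    H: (T_enc, d) encoder vectors supply the keys and values.
    """
    scores = (Z @ Q) @ (H @ K).T       # (T_dec, T_enc)
    alpha = softmax(scores, axis=-1)   # each decoder position attends over encoder states
    return alpha @ (H @ V)             # (T_dec, d)

rng = np.random.default_rng(0)
d = 8
H = rng.normal(size=(6, d))   # encoder outputs
Z = rng.normal(size=(4, d))   # decoder states
K, Q, V = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(Z, H, K, Q, V).shape)  # (4, 8)
```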

Results with transformers

Machine translation results from the original Transformer paper: they got higher BLEU scores, and also more efficient training!

Summarization: seq2seq with attention got a perplexity of 5.04, and a series of Transformer architectures improved on that repeatedly, down to 1.90, leading to RNNs falling out of practice.

Transformers allow pretraining, which we'll go over next time. Transformers' parallelizability allows for efficient pretraining and has made them the de facto standard.

There's a popular aggregate benchmark, GLUE, on which Transformer-based models eventually took every top spot.

What we would like to fix about the Transformer

  • Quadratic compute in self-attention (today)
    • Computing all pairs of interactions means our computation grows quadratically with the sequence length.
    • For recurrent models, it only grew linearly!
  • Position representations
    • Are simple absolute indices the best we can do to represent position?
    • Relative linear position attention
    • Dependency syntax-based position

Quadratic compute

The total number of operations grows as O(T^2 d), where T is the sequence length and d is the dimensionality.

image

In practice we may set a bound to T = 512 or so. But what if we want to work on a long document with T = 10,000 words?

Can we build models like transformers without paying all the O(T^2) all-pairs self-attention cost?

Example: Linformer. We map the sequence length dimension to a lower-dimensional space for values and keys.

image

Another example: BigBird. Replace all-pairs interactions with a family of other interactions, like local windows, global tokens that look at everything, and random interactions.

image

The normal transformer is still the most popular variant currently. It was thought that RNNs might still be better on smaller-data problems, but with pretraining that may no longer be the case. There are some use cases that depend on recurrence, but they are rare.