Lecture 10 - bancron/stanford-cs224n GitHub Wiki

Lecture video: link

This lecture covers pretraining Transformers.

Subword modeling

So far we've made an assumption about the language's vocabulary. We assume there is a fixed vocabulary of tens of thousands of words, and any novel word seen at test time is mapped to a single UNK token. We can't interpret unseen words such as variations ("taaaaasty"), misspellings ("laern"), or novel items ("Transformerify").

Finite vocabulary assumptions make even less sense in languages with complex morphology. Swahili verbs can have hundreds of conjugations, each encoding a wide variety of information (tense, mood, definiteness, negation, information about the object, ...).

The byte-pair encoding algorithm

Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level - parts of words, characters, bytes. The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens). At training and testing time, each word is split into a sequence of known subwords.

Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary.

  1. Start with a vocabulary containing only characters and an "end-of-word" symbol.
  2. Using a corpus of text, find the most common adjacent characters "a,b"; add "ab" as a subword.
  3. Replace instances of the character pair with the new subword; repeat until the desired vocab size is reached.

Example: starting characters {a, b, ..., z}. Ending vocab: {a, ..., z, ..., apple, app##, ..., ##ly, ...}.
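To make the loop concrete, here is a minimal Python sketch of BPE merge learning on a toy corpus. The data structures and the "</w>" end-of-word marker are simplifications; real implementations run this over a large pre-tokenized corpus.

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merges from a list of words; returns the merges in order."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the pair with the merged subword.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe("low low low lower lowest newest newest".split(), 5))
```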

Common words end up being part of the subword vocabulary, while rarer words are split into (sometimes intuitive, sometimes not) components. We will never see an UNK, since in the worst case a word is split into as many subwords as it has characters. This can result in a much longer sequence.

This encoding scheme was originally used in NLP for machine translation. Now a similar method (WordPiece, not covered here) is used in pretrained models.

To go back to our previous examples, "taaaaasty" -> "taa## aaa## sty"; "laern" -> "la## ern##"; "Transformerify" -> "Transformer## ify".

The pretrained transformer does not distinguish between words and subwords when doing its self-attention operations.

The "##" in "sub##" indicates that it is the first part of a larger word, as distinguished from the complete word "sub".

Motivating model pretraining from word embeddings

"You shall know a word by the company it keeps" (J. R. Firth 1957). This quote as a summary of distributional semantics, and motivated word2vec.

But: "... the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." (J.R. Firth 1935).

Consider "I record the record": the two instances of record mean different things. They would be given the same word embedding, though.

Circa 2017, we would start with pretrained word embeddings (no context!). Then we would learn how to incorporate context in an LSTM or Transformer while training on the task (supervised training for a task such as translation, sentiment, or question answering).

The word embeddings would take up some of the parameters of the model, and then the later (learned) LSTM states would constitute the other parameters, along with the learned output function.


This puts the onus on our downstream data to be sufficient to teach the contextual aspects of language. If we have only a little labeled data for fine-tuning our downstream task, we are expecting a lot from a model with randomly initialized parameters.

Pretraining whole models

In modern NLP (circa this class in 2021), all (or almost all) parameters are initialized via pretraining. Pretraining methods hide parts of the input from the model, and train the model to reconstruct these parts.

In word2vec, an individual word only "knows itself": we have the embedding for the center word, with all of its neighbors hidden from it, and we ask the center word to predict its neighbors.

With full model pretraining, we don't learn the embedding for a single word, but instead the embeddings for longer word sequences. These are all pretrained jointly.

This has been exceptionally effective at building strong:

  • representations of language that map similar word sequences to similar encodings
  • parameter initializations for strong NLP models, starting with these pretrained models
  • probability distributions over language that we can sample from (e.g. in language modeling)

What pretraining can learn

What can we learn by hiding part of the input and training a model to reconstruct the parts that we hid?

"Stanford University is located in _____, California." We expect the loss function to train the model to predict Palo Alto.

"I put ___ fork down on the table." This one is underspecified - it could be "the", "my", "his", etc. We are learning syntactic categories of words that could appear in this context.

"The woman walked across the street, checking for traffic over ___ shoulder." "Her", a coreference to "the woman", is quite likely.

"I went to the ocean to see the fish, turtles, seals, and ___." The model could learn a lexical semantic category of things in the set of fish, turtles, and seals in the context of the ocean.

"Overall, the value I got from the two hours watching it was the sum total of the popcorn of the drink. The movie was ___." We might predict "bad" - learning about sentiment.

"Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ___." We are learning about locality, and the relationships between actors and locations.

"I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ___." In theory we would need to learn the formula for the Fibonacci sequence (although in practice long examples of the Fibonacci sequence may appear in the training data).

We need a very large amount of data to train these extremely large models well. Interestingly, models like BERT (upcoming) are in fact underfitting rather than overfitting.

Models also learn, and can exacerbate, racism, sexism, and all manner of bad biases.

Transformer Encoder-Decoders

Let's review our Transformer Encoder-Decoder from last time. Our Encoder takes a sequence of subwords. Each subword gets a word embedding and each index gets a position embedding. (Our sequence can only be at most a finite length such as 512.) The Encoder is comprised of submodules - Multi-Head Attention, Residual + LayerNorm, Feed-Forward, and another Residual + LayerNorm. This is then stacked onto another identical Transformer Encoder block - e.g. 16 deep.
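As a rough illustration (not the lecture's code), one such post-norm Encoder block can be sketched in PyTorch as follows; the hidden sizes and head count are placeholders, not values from the lecture.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One post-norm Encoder block: Multi-Head Attention -> Residual + LayerNorm
    -> Feed-Forward -> Residual + LayerNorm."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, seq_len, d_model), word + position embeddings
        attn_out, _ = self.attn(x, x, x)  # bidirectional self-attention
        x = self.ln1(x + attn_out)        # residual + LayerNorm
        x = self.ln2(x + self.ff(x))      # feed-forward, then residual + LayerNorm
        return x

# Stack identical blocks, e.g. 16 deep.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(16)])
```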

The Transformer Decoder is comprised of Masked Multi-Head Self-Attention (where we can't look at the future), Residual + LayerNorm, Multi-Head Cross-Attention (which looks back to the last layer of the Transformer Encoder), another Residual + LayerNorm, Feed-Forward, and a third Residual + LayerNorm. If we don't have an Encoder, we take out the Cross-Attention and its Residual + LayerNorm.
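A matching sketch of one Decoder block, again with placeholder sizes; note the causal mask in the self-attention and the optional cross-attention branch that is dropped in a decoder-only model.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One post-norm Decoder block: masked self-attention, cross-attention to the
    encoder output, then feed-forward, each followed by Residual + LayerNorm."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ln3 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x, enc_out=None):   # x: (batch, T, d_model)
        T = x.size(1)
        # Causal mask: True marks positions we are NOT allowed to attend to (the future).
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.ln1(x + a)
        if enc_out is not None:            # decoder-only models simply drop this branch
            a, _ = self.cross_attn(x, enc_out, enc_out)
            x = self.ln2(x + a)
        x = self.ln3(x + self.ff(x))
        return x
```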

Pretraining through language modeling

Recall the language modeling task. Model p(word t | words 1 ... t-1). The input is a large amount of (unlabeled) text. There is a lot of data for this in English.

Pretraining through language modeling: we train a neural network to perform language modeling, and save the network parameters.

Step 1: Pretrain (on language modeling). Lots of text; learns general things.

Step 2: Finetune (on our task). Not many labels; adapts to the task.

Why should pretraining and finetuning help, from a "training neural nets" perspective? We are minimizing the loss in two steps. First, pretraining gives us parameters θ-hat of the neural network (word embeddings, position embeddings, etc.) by approximately minimizing the pretraining loss over θ. Then, finetuning approximately minimizes the finetuning loss over θ, starting from θ-hat.
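In symbols (the loss names are illustrative, following the θ-hat convention above):

```latex
\hat{\theta} \;\approx\; \arg\min_{\theta} \, \mathcal{L}_{\text{pretrain}}(\theta),
\qquad\text{then}\qquad
\min_{\theta} \, \mathcal{L}_{\text{finetune}}(\theta)
\ \text{ approximated by gradient descent starting at } \theta = \hat{\theta}.
```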

Pretraining may matter because, in practice, stochastic gradient descent sticks (relatively) close to θ-hat during finetuning. So, maybe the finetuning local minima near θ-hat tend to generalize well! And/or, maybe the gradients of the finetuning loss near θ-hat propagate nicely!

Model pretraining three ways

Decoders

These are language models - what we've seen so far. They are nice to generate from, but can't condition on future words.

When using language model pretrained decoders, we can ignore that they were pretrained to model p(w[t]|w[1...t-1]). We can finetune them by training a classifier on the last word's hidden state.

The hidden states h[1], ..., h[T] = Decoder(w[1], ..., w[T]). (w[1] ... w[T] are "words" in the input, really subwords.)

The very last layer y = A h[T] + b, where A and b are randomly initialized and specified by the downstream task (e.g. sentiment), has not been pretrained. We backpropagate the gradients through the entire pretrained network to finetune all of these parameters.
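A hedged PyTorch sketch of this setup; the class and argument names are hypothetical, and the pretrained decoder is assumed to return one hidden state per token.

```python
import torch.nn as nn

class DecoderClassifier(nn.Module):
    """Hypothetical finetuning setup: a pretrained decoder plus a fresh linear head."""
    def __init__(self, pretrained_decoder, d_model, num_classes):
        super().__init__()
        self.decoder = pretrained_decoder            # parameters come from pretraining
        self.head = nn.Linear(d_model, num_classes)  # A, b: randomly initialized

    def forward(self, tokens):
        h = self.decoder(tokens)       # h: (batch, T, d_model) hidden states
        return self.head(h[:, -1, :])  # y = A h[T] + b, from the last word's hidden state

# Finetuning backpropagates the task loss through the head AND the whole decoder.
```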

The contract with the pretrained model was to model probability distributions. But we can keep the parameters without treating it as a probability distribution, and use it as a generic decoder that was trained in some useful way on we-don't-care-what.

Another natural way to interact with decoders is to use one as a generator, finetuning its pθ(w[t]|w[1...t-1]). This is helpful for tasks like dialogue (using the dialogue history as the context), or summarization (where the context is the document).

Again h[1], ..., h[T] = Decoder(w[1], ..., w[T]), and word w[t] is predicted from A h[t-1] + b.

The last (linear) layer of the network, unlike before, has been pretrained.

Generative Pretrained Transformer (GPT)

2018's GPT was a huge success in pretraining a decoder.

  • Transformer Decoder with 12 layers.
  • 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
  • Subword vocabulary with 40,000 merges.
  • Trained on BooksCorpus: over 7000 unique books.
    • Contains long spans of contiguous text, for learning long-distance dependencies.
  • The acronym "GPT" never showed up in the original paper; it could stand for "Generative PreTraining" or "Generative Pretrained Transformer".

How do we format inputs to our decoder for finetuning tasks?

The Natural Language Inference task is to label pairs of sentences as entailing/contradictory/neutral.

  • Premise: The man is in the doorway.
  • Hypothesis: The person is near the door.
  • Answer: entailment

Here's roughly how the input was formatted, as a sequence of tokens for the decoder.

[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]

The linear classifier is applied to the representation of the final [EXTRACT] token to produce one of the labels "entailing/contradictory/neutral". The task specification is changed to match the pretrained architecture.
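A tiny illustrative helper for this formatting; the special-token strings are just the placeholders used in the example above, not the exact strings from the paper.

```python
def format_nli(premise: str, hypothesis: str) -> str:
    # Special-token strings here mirror the example above; treat them as placeholders.
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

print(format_nli("The man is in the doorway", "The person is near the door"))
# The linear classifier reads the decoder's hidden state at the final [EXTRACT] position.
```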

Encoders

These get bidirectional context - all pairs of interactions; they can condition on the future. But how do we pretrain them? We can't pretrain them as language models: since the model can condition on the future, it can simply look at the next word, so the loss goes to 0 without learning anything useful.

Idea: Masked LM. We replace some fraction of the words in the input with a special [MASK] token, and predict these words.

"I [MASK] to the [MASK]". We predict "I went to the store", and calculate loss only for the indices which were masked.

BERT: Bidirectional Encoder Representations from Transformers

In 2018, Devlin et al. proposed the "Masked LM" objective and released the weights of a pretrained Transformer, a model they labeled BERT. How this worked:

  • Predict a random 15% of (sub)word tokens.
    • Replace the input word with [MASK] 80% of the time.
    • Replace the input word with a random token 10% of the time.
    • Leave the input word unchanged 10% of the time (but still predict it).

Why? So the model doesn't get complacent and fail to build strong representations of non-masked words. (No masks are seen at fine-tuning time.)

The model sees "I pizza to the [MASK]". "pizza" has been replaced; "to" is unchanged; "[MASK]" is masked. The model has to predict those three tokens as "went", "to" and "store" respectively. In the end, we won't care about the representations of MASK, only the other tokens.

The pretraining input to BERT was two separate contiguous chunks of text.

"[CLS] my dog is cute [SEP] he likes play ##ing [SEP]".

Those two short sentences would really be much longer contiguous chunks of text, so that the entire input is 512 (sub)words. The second chunk is sometimes the text that directly follows the first in the dataset, and sometimes a chunk randomly sampled from somewhere else. The model must predict which case it is - next sentence prediction. Later work argued that this "next sentence prediction" is not necessary - we would rather have an input sequence that's twice as long, to learn long-distance dependencies.

BERT-large was a large model for its time: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params. It was trained on BooksCorpus (800 million words) and English Wikipedia (2,500 million words).

Pretraining is expensive and impractical on a single GPU. BERT was pretrained with 64 TPU chips for a total of 4 days. Finetuning is practical and common on a single GPU. "Pretrain once, finetune many times."

The Hugging Face Transformers library makes this very easy.
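For example, loading pretrained BERT weights plus a fresh classification head takes a few lines; the model name and label count below are just an example.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained BERT weights plus a randomly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("my dog is cute", "he likes playing", return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds the scores for the two classes
```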

We evaluate these models on things like paraphrase questions (QQP), natural language inference over question answering data (QNLI), determining whether sentences are grammatical (CoLA), semantic textual similarity (STS-B), etc.

Limitations of pretrained encoders

If our task involves generating sequences, consider using a pretrained decoder. Pretrained encoders don't naturally lead to nice autoregressive (one-word-at-a-time) generation methods.

Extensions of BERT

You'll see a lot of BERT variants like RoBERTa, SpanBERT, etc.

  • RoBERTa: mainly just train BERT for longer and remove the next sentence prediction.
  • SpanBERT: masking contiguous spans of (sub)words makes a harder, more useful pretraining task.

Encoder-Decoders

These have the good parts of decoders and encoders. What's the best way to pretrain them?

For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted.

  • h[1],...,h[T] = Encoder(w[1],...,w[T]). Predict on none of these.
  • h[T+1],...,h[2T] = Decoder(w[T+1],...,w[2T], h[1],...,h[T]).
  • y[i] = A h[i] + b, where i > T.

Here we're performing language modeling on the second half of the sequence only. The encoder portion benefits from the bidirectional context; the decoder portion is used to train the whole model through language modeling.

What Raffel et al., 2019 found to work best was span corruption. Their model: T5.

Original text: "Thank you for inviting me to your party last week".

  • Input to encoder: "Thank you <X> me to your party <Y> week." Note: each sentinel token (<X>, <Y>, ...) says something is missing there, but not how many subwords it spans.
  • The decoder predicts: "<X> for inviting <Y> last <Z>".

We don't need to change the LM/decoder task, just change the form of the output of the decoder. This is an improvement over previous models.
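A simplified sketch of producing such input/target pairs on whitespace tokens. Real T5 operates on subwords and samples the spans; the <extra_id_N> sentinel naming follows the Hugging Face T5 tokenizer, and the span choices here are hard-coded to reproduce the example above.

```python
def span_corrupt(words, spans):
    """words: list of tokens; spans: sorted, non-overlapping (start, end) indices to drop."""
    inp, tgt, cursor = [], [], 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += words[cursor:s] + [sentinel]   # keep the text up to the span, mark the hole
        tgt += [sentinel] + words[s:e]        # the target spells out what was dropped
        cursor = e
    inp += words[cursor:]
    tgt.append(f"<extra_id_{len(spans)}>")    # final sentinel marks the end of the targets
    return " ".join(inp), " ".join(tgt)

words = "Thank you for inviting me to your party last week".split()
print(span_corrupt(words, [(2, 4), (8, 9)]))
# ('Thank you <extra_id_0> me to your party <extra_id_1> week',
#  '<extra_id_0> for inviting <extra_id_1> last <extra_id_2>')
```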

A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving the knowledge from its parameters relatively well. At 11 billion parameters, it performed as well as some systems that were allowed to look at more than just the parameters.

GPT3, very large models, and in-context learning

So far we've interacted with pretrained models in two ways:

  • Sample from the distributions they define (maybe providing a prompt).
  • Fine-tune them on a task we care about, and take their predictions.

Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts/history.

The largest T5 model had 11 billion parameters; GPT-3 has 175 billion parameters. (N.B. this class was in 2021.)

Given an untuned model,

  • Input (within a single Transformer decoder context): "thanks -> merci; hello -> bonjour; mint -> menthe; otter ->"
  • Output: "loutre"

This same idea works with addition problems, or spelling errors.
