Lecture 12 - bancron/stanford-cs224n GitHub Wiki

Lecture video: link

This lecture covers Natural Language Generation. Lecturer: Antoine Bosselut.

Natural language generation

Natural Language Generation (NLG) is a sub-field of NLP focused on building systems that automatically produce coherent and useful written or spoken text for human consumption.

Machine translation is the classical example of an NLG task. Dialog systems, like Siri and Alexa, are another example that contain neural NLG systems. Another example is document summarization, which gathers data from multiple sources and generates a meaningful summary, e.g. summarizing emails, meetings, or scientific papers.

These modalities aren't limited to text-in, text-out. The original formulation was data-to-text generation, starting from a table, knowledge graph, or data stream. There has also been a lot of recent work in visual description, e.g. a paragraph-level description of a photo, or a stream of descriptions of a video.

We have also seen NLG systems being developed in creative generation, e.g. helping to write short stories, blog posts, poems, or even full books.

What is NLG?

Any task involving text production for human consumption requires natural language generation. Deep Learning is powering next-gen (as of 2021) NLG systems.

Formalizing NLG: a simple model and training algorithm

Recap

In autoregressive text generation models, at each time step t, our model takes a sequence of tokens {y}<t as input and outputs a new token y-hat[t] conditioned on that input. The new token is fed back into the input to continue generation.

At each time step t, our model computes a vector of scores S (of dimensionality |V|), one score for each token in our vocabulary V.

S = f({y}<t; θ), where f is the model and θ its parameters

Then, we compute a probability distribution P over tokens w by applying a softmax to these scores.

P(y[t] = w | {y}<t) = exp(S_w) / Σ_{w' in V} exp(S_w')

(Sometimes the w is omitted from this equation, but it is still the probability that y[t] = w.)

At inference time, our decoding algorithm defines a function to select a token from this distribution.

y-hat[t] = g(P(y[t] = w | {y}<t)), where g is the decoding algorithm

We train the model to minimize the negative log likelihood of predicting the next token in the sequence.

L_MLE = -Σ_t log P(y*[t] | {y*}<t)

This is just a multi-class classification task where each word w in V is a class. This algorithm is often called teacher forcing.

The * token, e.g. y*, refers to the gold (ground truth) token.

Then we can compute gradients of the summed loss with respect to each parameter in the model and update the parameters.
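A minimal sketch of one teacher-forcing training step in PyTorch (the model interface and helper names here are hypothetical, not code from the course):

```python
import torch
import torch.nn.functional as F

def teacher_forcing_step(model, tokens, optimizer):
    """One teacher-forcing update on a batch of gold token ids.

    tokens: LongTensor of shape (batch, seq_len); `model` is assumed to be any
    autoregressive LM returning logits of shape (batch, seq_len, vocab).
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict the next gold token
    logits = model(inputs)                            # scores S at every position
    # Negative log likelihood of each gold next token (multi-class classification).
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                   # gradients w.r.t. every parameter
    optimizer.step()
    return loss.item()
```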

Decoding

Greedy methods

Recall argmax decoding from machine translation: select the highest probability token.

y-hat[t] = argmax_{w in V} P(y[t] = w | {y}<t)

Recall beam search: also a greedy algorithm, but one that keeps a wider set of candidate hypotheses.
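A minimal sketch of greedy (argmax) decoding, assuming a hypothetical model that maps a prefix of token ids to next-token logits:

```python
import torch

@torch.no_grad()
def greedy_decode(model, prefix_ids, max_len=50, eos_id=2):
    """Repeatedly pick the single highest-probability next token."""
    ids = list(prefix_ids)
    for _ in range(max_len):
        logits = model(torch.tensor([ids]))     # (1, len(ids), vocab)
        next_id = int(logits[0, -1].argmax())   # argmax over the vocabulary
        ids.append(next_id)
        if next_id == eos_id:                   # stop at end-of-sequence
            break
    return ids
```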

These greedy methods work well, but they often get into a repetitive loop.

Why does repetition happen? If we repeat the same phrase, the negative log likelihood for those tokens goes down and down.

This kind of makes sense: if you say "I'm tired" 15 times in a row, very likely you will say "I'm tired" the 16th time, but it's not a useful output for this task.

Note that this is less of a problem for RNN architectures than for Transformer LMs: the LSTM's tendency to repeat flattens out after a while. Removing the RNN's temporal bottleneck of compressing the past into a single state makes the Transformer more prone to repetitive behavior when using greedy algorithms to decode.

How can we reduce repetition?

Hacky but surprisingly effective:

  • Don't repeat n-grams at decoding time (a small sketch of this heuristic appears after these lists).

More complex:

  • Minimize embedding distance between consecutive sentences. This doesn't help with intra-sentence repetition.
  • Coverage loss that penalizes attending to the same tokens over time: prevents attention mechanism from attending to the same words.
  • Unlikelihood objective: penalize generation of already-seen tokens.
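A minimal sketch of the no-repeat-n-gram heuristic (hypothetical helper; real decoders apply this as a mask over the scores at each step):

```python
def banned_next_tokens(ids, n=3):
    """Tokens that would complete an n-gram already present in `ids`.

    If the last n-1 generated tokens have appeared before, ban whatever
    token followed them previously, so no n-gram is generated twice.
    """
    if len(ids) < n:
        return set()
    banned = set()
    prefix = tuple(ids[-(n - 1):])
    for i in range(len(ids) - n + 1):
        if tuple(ids[i:i + n - 1]) == prefix:
            banned.add(ids[i + n - 1])
    return banned

# At each decoding step, set the scores of banned tokens to -inf before
# taking the argmax or sampling.
```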

Are greedy methods reasonable? Beam-search decoded text tends to have uniformly high probability at every step, which contrasts strongly with human text, whose per-token probability fluctuates considerably.

Stochasticity

To match the uncertainty of human language, we will sample randomly.

y-hat[t] ~ P(y[t] = w | {y}<t)

One problem is that vanilla sampling makes every token in the vocabulary an option. Even if most of the probability mass in the distribution is over a limited set of options, the tail of the distribution could be very long. Many tokens are probably irrelevant in the current context.

Each of these irrelevant tokens individually has only a tiny chance of being selected, but as a group they carry a substantial amount of probability mass, so one of them gets selected fairly often.

Solution: top-k sampling. Only sample from the top k tokens in the probability distribution. Common values of k are 5, 10, 20, or 100; this is a hyperparameter we can tune. Increasing k yields more diverse/risky outputs; decreasing it yields more generic/safe outputs.

Top-k sampling can cut off too quickly or too slowly. If the distribution is very flat, truncating to k tokens removes many perfectly reasonable options; if the distribution is very peaked, fewer than k tokens are actually suitable and we still keep some unreasonable ones.

Solution: top-p (nucleus) sampling: sample from the smallest set of tokens that covers the top p cumulative probability mass (i.e. where the mass is concentrated). This effectively varies k depending on how uniform P[t] is.
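A minimal sketch of both truncation schemes, as a hypothetical helper operating on a single next-token score vector:

```python
import torch

def sample_top_k_top_p(logits, k=0, p=1.0):
    """Sample a token id after top-k and/or top-p (nucleus) truncation.

    logits: 1-D tensor of next-token scores. k=0 and p=1.0 disable each filter.
    """
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    keep = torch.ones_like(probs, dtype=torch.bool)
    if k > 0:
        keep[k:] = False                      # keep only the k highest-scoring tokens
    if p < 1.0:
        cumulative = torch.cumsum(probs, dim=-1)
        keep &= (cumulative - probs) < p      # smallest set covering probability mass p
    filtered = sorted_logits.masked_fill(~keep, float("-inf"))
    choice = torch.multinomial(torch.softmax(filtered, dim=-1), 1)
    return int(sorted_idx[choice])
```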


Temperature scaling divides each token's score by a temperature coefficient τ before the softmax: a higher temperature (τ > 1) flattens the distribution and gives more diverse samples, while a lower temperature (τ < 1) makes it more peaked. Softmax temperature scaling can be combined with any decoding scheme.
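A minimal sketch of temperature scaling:

```python
import torch

def temperature_softmax(logits, tau=1.0):
    """Softmax over logits / tau.

    tau > 1 flattens the distribution (more diverse samples);
    tau < 1 sharpens it (closer to argmax decoding).
    """
    return torch.softmax(logits / tau, dim=-1)
```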

We may want to change the model's distribution if it's not well-calibrated to the task.

Solution #1: re-balance P[t] using retrieval from a database of n-gram phrase statistics (k-nearest-neighbors LMs).

  • Cache a database of phrases from your training corpus (or some other corpus)
  • At decoding time, search for the most similar phrases in the database. Compute a distribution over the most similar phrases and add statistics about these phrases to the model: re-balance P[t] using the induced distribution P[phrase] over words that follow these phrases.

How do we know what to cache? In this work they actually cached everything, but pruned the number of phrases used to make the phrase distribution (rather than using the entire corpus).
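A minimal sketch of the re-balancing step, assuming the retrieval-induced distribution has already been computed (in the actual kNN-LM it comes from nearest-neighbor search over a cached datastore; the names here are hypothetical):

```python
import torch

def rebalance_with_retrieval(p_lm, p_knn, lam=0.25):
    """Interpolate the LM's next-token distribution with a retrieval-induced one.

    p_lm, p_knn: 1-D tensors over the vocabulary, each summing to 1.
    lam: weight on the retrieval distribution (a tunable hyperparameter).
    """
    return lam * p_knn + (1.0 - lam) * p_lm
```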

Can we re-balance the LM's distribution to encourage other behaviors? For example, when applying a trained model to a different domain, we may not have a good database of phrases to draw from. Instead, we can define an external objective using a classifier called a discriminator (called an attribute model in the original paper) that approximates some property we'd like the generated text to exhibit as we decode, e.g. a sentiment classifier that encourages positive-sounding comments, or perplexity.

image

We update the intermediate activations at each layer of the model. This allows real-time distribution updating based on an outside discriminator.

These are both fairly expensive, and neither stops us from decoding bad sequences.

In practice, we often decode multiple candidate sequences using sampling or a wider beam search (e.g. 10 candidates), define a score that approximates sequence quality, and re-rank the candidates by this score rather than backpropagating gradients into the main model. The simplest score is perplexity, but keep in mind that repetitive text generally gets low perplexity.

Re-rankers can score a variety of properties such as style, discourse, entailment/factuality, logical consistency, and many more. Beware poorly-calibrated re-rankers. We can use multiple re-rankers in parallel.
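A minimal sketch of sample-and-re-rank, using perplexity under a scoring LM as the quality score (the scoring model and helper names are hypothetical):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(scorer, ids):
    """Perplexity of a token-id sequence under a scoring LM."""
    inputs, targets = torch.tensor([ids[:-1]]), torch.tensor([ids[1:]])
    logits = scorer(inputs)                        # (1, len-1, vocab)
    nll = F.cross_entropy(logits[0], targets[0])   # mean NLL per token
    return math.exp(nll.item())

def rerank(candidates, scorer):
    """Sort candidate sequences from lowest to highest perplexity."""
    return sorted(candidates, key=lambda ids: perplexity(scorer, ids))
```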

Decoding is still a challenging problem in NLG. Human language distribution is noisy and doesn't reflect the simple properties that our decoder uses (i.e. probability maximization). Different decoding algorithms can allow us to inject biases that encourage different properties of coherent NLG. Some of the most impactful advances in NLG of the last few years (circa 2021) have come from simple but effective modifications to decoding algorithms.

Q: How do we evaluate whether a re-balanced distribution is better? We can't just look at the probability.

A: If we don't trust that the ranker is giving us a good estimate of how good the text is, we should reconsider using that ranker. More on this later, although there is a lot of room for interpretation.

Q: We don't know how to make a model choose words like a human. How do we model humans from different backgrounds, etc.?

A: We could try to fine-tune the language distribution of a particular human from a pretrained LM, or we could try to do these gradient-based rebalancing methods with the corpus from a particular speaker.

Training

Recall that we are using teacher forcing to minimize the negative log likelihood of the next token given the preceding tokens. This works well for training autoregressive models of human language, but it discourages diverse text generation.

Unlikelihood training

One approach is unlikelihood training: given a set of undesired tokens C, lower their likelihood in context.

L_UL = -Σ_{y_neg in C} log(1 - P(y_neg | {y}<t))

Keep the teacher forcing objective and combine them for the final loss function.

L = L_MLE + α * L_UL
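A minimal sketch of the combined objective at a single time step (token-level variant; the names are hypothetical):

```python
import torch

def unlikelihood_step_loss(log_probs, gold_id, neg_ids, alpha=1.0):
    """Teacher-forcing NLL plus an unlikelihood penalty on undesired tokens.

    log_probs: 1-D tensor of log P(w | context) over the vocabulary.
    gold_id: the gold next token; neg_ids: already-seen / undesired tokens C.
    """
    nll = -log_probs[gold_id]                               # teacher forcing term
    p_neg = log_probs[list(neg_ids)].exp()                  # P(y_neg | context)
    ul = -torch.log1p(-p_neg.clamp(max=1 - 1e-6)).sum()     # -sum log(1 - P(y_neg))
    return nll + alpha * ul
```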

Exposure bias

Training with teacher forcing leads to exposure bias at generation time. During training, our model's inputs are gold context tokens from real, human-generated text. At generation time, our model's inputs are previously-decoded tokens, which are often not a very close approximation of human language patterns in the training set.

Some solutions:

[These first two are skipped over.]

Scheduled sampling: with some probability p, decode a token and feed that as the next input, rather than the gold token. Increase p over the course of training.

This leads to improvement in practice, but can lead to strange training objectives.
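A minimal sketch of the scheduled-sampling choice of the next input token (a hypothetical training-loop fragment):

```python
import random
import torch

def next_input_token(gold_id, logits, p):
    """With probability p feed the model's own sample, else the gold token.

    p is increased over the course of training (the 'schedule').
    """
    if random.random() < p:
        probs = torch.softmax(logits, dim=-1)
        return int(torch.multinomial(probs, 1))   # model's own prediction
    return gold_id                                # teacher forcing
```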

Dataset Aggregation (DAgger): At various intervals during training, generate sequences from our current model. Add those sentences to our training set as additional examples.

Sequence re-writing: the model learns to retrieve a sequence from an existing cached corpus of human-written prototypes (e.g., dialogue responses). It learns to edit the retrieved sequence by adding, removing, and modifying tokens in the prototype to more accurately reflect the current context.

Reinforcement learning: cast our text generation model as a Markov decision process.

  • State s is the model's representation of the preceding context
  • Actions a are the words that can be generated
  • Policy π is the decoder
  • Rewards r are provided by an external score
  • We can learn behaviors by rewarding the model when it exhibits them.

We are rewarding each token that we generate. We scale the sample loss by this reward.

L_RL = -Σ_t r(y-hat[t]) * log P(y-hat[t] | {y-hat}<t)
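A minimal sketch of this reward-scaled loss for one sampled sequence (a plain score-function / REINFORCE-style estimator; the names are hypothetical):

```python
import torch

def rl_loss(log_probs, sampled_ids, reward):
    """Scale the NLL of the sampled tokens by the reward they earned.

    log_probs: (seq_len, vocab) log-probabilities produced while sampling.
    sampled_ids: the tokens that were actually sampled; reward: a scalar score.
    """
    positions = torch.arange(len(sampled_ids))
    token_log_probs = log_probs[positions, torch.tensor(sampled_ids)]
    return -(reward * token_log_probs).sum()
```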

What can we use as a reward function? We can just use our final evaluation metric.

  • BLEU (machine translation)
  • ROUGE (summarization)
  • CIDEr, SPIDEr (image captioning)

This doesn't work as well as we might hope: evaluation metrics are merely proxies for generation quality. "even though RL refinement can achieve better BLEU scores, it barely improves the human impression of the translation quality" --Wu et al., 2016.

So what behaviors can we tie to rewards?

  • Cross-modality consistency in image captioning
  • Sentence simplicity
  • Temporal Consistency
  • Utterance Politeness
  • Paraphrasing
  • Sentiment
  • Formality

Unfortunately, we need to pretrain the model with teacher forcing before doing RL training, since the reward function expects reasonably coherent language inputs. We also need to subtract an appropriate baseline b from the reward so that learning is stable.

L_RL = -Σ_t (r(y-hat[t]) - b) * log P(y-hat[t] | {y-hat}<t)

Examples:

  • Use linear regression to predict the baseline from the state s
  • Decode a second sequence and use its reward as the baseline.

The model will learn the easiest way to exploit the reward function. We can try to mitigate these shortcuts, or hope it's aligned with the behavior we want.

Takeaways

Teacher forcing is still the premier algorithm for training text generation models.

Diversity is an issue with sequences generated from teacher forced models. New approaches focus on mitigating the effects of common words.

Exposure bias causes text generation to lose coherence easily. Models must learn to recover from their own bad samples (e.g., scheduled sampling, DAgger), or not be allowed to generate bad text to begin with (e.g., retrieval + generation).

Training with RL can allow models to learn behaviors that are challenging to formalize, but learning can be very unstable.

Q: Are there online language simulators that can be used to train with RL online, or is this only offline training? A: No, only offline training.

Evaluation

We should think about how to do this before we even start training a model.

Content overlap metrics

Compute a score that indicates the similarity between generated and gold-standard (human-written) text. This is fast, efficient, and widely used. There are two broad categories:

  • N-gram overlap metrics: BLEU, ROUGE, METEOR, CIDEr, etc.
  • Semantic overlap metrics: PYRAMID, SPICE, SPIDEr, etc.

These provide a good starting point for evaluating the quality of generated text, but are not good enough on their own.

N-gram overlap metrics

Most n-gram overlap metrics are not ideal for machine translation, and they get progressively much worse for tasks that are more open-ended than machine translation. They are worse for summarization, as longer output texts are harder to measure. They are much worse for dialogue, which is more open-ended than summarization, since it can have multiple responses that mean the same thing but don't share any words.

N-gram overlap metrics have no concept of semantic relatedness, and these metrics often don't correlate with human judgments at all.
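A small illustration of the problem, using plain unigram precision (not a full BLEU implementation): two dialogue responses with the same meaning can share no words and score zero.

```python
def unigram_precision(generated, reference):
    """Fraction of generated tokens that also appear in the reference."""
    gen, ref = generated.lower().split(), set(reference.lower().split())
    return sum(tok in ref for tok in gen) / len(gen)

reference = "for sure , i'll be there"
print(unigram_precision("yes of course !", reference))         # 0.0: no overlap, same meaning
print(unigram_precision("i'll be there for sure", reference))  # 1.0: exact word overlap
```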

Much, much worse is story generation, which is also open-ended, but whose sequence length can make it seem like we're getting decent scores.

Semantic overlap metrics

Semantic overlap metrics such as PYRAMID, SPICE, and SPIDEr compare the semantic content of the generated and reference texts rather than exact n-gram matches.

Model-based metrics

We use learned representations of words and sentences to compute semantic similarity between generated and reference texts. There is no more n-gram bottleneck because text units are represented as embeddings. Even though the embeddings are pretrained, the distance metrics used to measure the similarity can be fixed.

Word Mover's Distance measures the distance between two sequences (sentences, paragraphs, etc.) using word embedding similarity: each word vector is matched with a word vector in the opposite sequence, and the metric is computed from this matching. BERTSCORE is similar but uses pretrained BERT contextual embeddings. Two more examples are Sentence Mover's Similarity and BLEURT.
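A minimal sketch of the greedy-matching idea behind these metrics, using cosine similarity between token embeddings; the embedding matrices are assumed to come from some pretrained encoder (e.g. BERT for BERTSCORE):

```python
import torch
import torch.nn.functional as F

def greedy_match_score(gen_emb, ref_emb):
    """Average, over generated tokens, of the best cosine similarity to any reference token.

    gen_emb: (n_gen, d) and ref_emb: (n_ref, d) token embedding matrices
    produced by a pretrained encoder (not included here).
    """
    gen = F.normalize(gen_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    sim = gen @ ref.T                              # (n_gen, n_ref) cosine similarities
    return sim.max(dim=-1).values.mean().item()    # best match for each generated token
```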

Model-based metrics are more correlated with human judgment, but their behavior is not interpretable.

Human evaluations

Automatic metrics fall short of matching human opinions of the quality of generated text. More than 75% of generation papers at ACL 2019 included human evaluations. Human evaluation is the gold standard in developing new automatic metrics: those metrics must correlate well with human evaluations.

TL;DR Ask humans to evaluate the quality of generated text, overall or along some specific dimension:

  • fluency
  • coherence/consistency
  • factuality and correctness
  • common sense
  • style/formality
  • grammaticality
  • typicality
  • redundancy

N.B. Don't compare human evaluation scores across differently-conducted studies, even if they claim to evaluate the same dimensions. They are unstandardized for many reasons, e.g., the task was explained differently to the evaluators.

We know that human evaluations are slow, expensive, and unstandardized. But there are more issues with conducting human evaluation effectively. Humans are inconsistent, can be illogical, lose concentration, misinterpret the question, and can't always explain why they feel the way they do. Human evaluators are also incentivized to complete the task as quickly as they can.

Human judgments are critical: they are the only ones that can directly evaluate factuality, i.e. whether the model is saying correct things. In many cases, the best judge of output quality is you. You should look at your model generations rather than relying solely on any numerical score.

Ethical considerations

Tay was a chatbot released by Microsoft in 2016. Within 24 hours, it started making toxic racist, sexist, white supremacist, etc. comments. What went wrong?

Text generation models are often constructed from pretrained LMs, which learn harmful patterns of bias from large language corpora. When prompted for this information, they repeat negative stereotypes.

The learned behaviors of text generation models are opaque, and adversarial prompts can trigger very toxic content. These models can be exploited in open-world contexts by ill-intentioned users, and pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. Models should not be deployed without proper safeguards to control for toxic content, and not without careful consideration of how users will interact with them.

Large-scale pretrained LMs allow us to build NLG systems for many new applications. But does the content we're building the system to automatically generate really need to be generated automatically? Any tool you create could be used in a negative way, e.g. "fake news" generation.

Final thoughts

Interacting with NLG systems quickly shows their limitations (N.B. in 2021). Even in tasks with more progress, there are still many improvements ahead. Evaluation remains a huge challenge, and we rely on human evaluation. Finding better automatic evals would bootstrap improvements in other areas of NLG. With the advent of LLMs, deep NLG research has been reset, and it's never been easier to jump in the space.