Lecture 5

Lecture video: link

This lecture covers language models and recurrent neural networks (as well as the remainder of neural dependency parsing).

Neural dependency parsing (continued)

Reminder: transition-based dependency parsers are an efficient linear-time method for giving the syntactic structure of natural language text. Their biggest disadvantage is their use of indicator features - checks of some condition, e.g. "the word on top of the stack is 'good' and it's an adjective".

  1. The features are sparse.
  2. The features are incomplete. Certain features that should exist won't, because the relevant configurations never appeared in the training data.
  3. The computation is expensive. More than 95% of parsing time is consumed by computing the values of these features.

With a neural approach, we can learn a dense and compact feature representation. We will still have a stack and a buffer, and run the same transition sequence, but rather than representing the configuration of the stack and buffer with several million symbolic features, we will summarize the configuration as a dense vector (with dimensionality of maybe one thousand). This gets unlabeled and labeled attachment scores (UAS/LAS) on par with graph-based parsers but is two orders of magnitude faster.

First win: distributed representations

We represent each word as a d-dimensional vector (word embedding). Similar words are expected to have close vectors. The part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors. NNS (plural noun) should be close to NN (singular noun).

The classification decision for a transition is based on a few elements of our configuration: the top word of the stack, the second word of the stack, the first word of the buffer, and the leftmost and rightmost dependents of the stack elements that are already in the set of arcs. For each of those elements, we have the word, its POS tag, and possibly a dependency arc with a label. We concatenate all of these embeddings together to get a neural representation of the configuration.
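
As a rough illustration (not the actual assignment code; the table sizes and element choices here are made up), the dense configuration vector can be built by looking up and concatenating embeddings:

```python
import numpy as np

d = 50  # embedding dimensionality for words, POS tags, and arc labels (assumed)
rng = np.random.default_rng(0)

# Hypothetical embedding tables; in the real parser these are learned.
word_emb = rng.normal(size=(10_000, d))   # one row per vocabulary word
pos_emb = rng.normal(size=(50, d))        # one row per POS tag
label_emb = rng.normal(size=(40, d))      # one row per dependency label

def configuration_vector(word_ids, pos_ids, label_ids):
    """Concatenate the embeddings of the chosen configuration elements
    (top of stack, second on stack, front of buffer, their leftmost/rightmost
    dependents, ...) into one dense input vector."""
    parts = [word_emb[i] for i in word_ids]
    parts += [pos_emb[i] for i in pos_ids]
    parts += [label_emb[i] for i in label_ids]
    return np.concatenate(parts)

# e.g. 3 words + 3 POS tags + 2 arc labels -> a (3 + 3 + 2) * 50 = 400-dim vector
x = configuration_vector([12, 7, 301], [3, 3, 9], [5, 0])
print(x.shape)  # (400,)
```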


Second win: Deep learning classifiers are non-linear classifiers

A very simple classifier is a softmax classifier. Traditional ML classifiers such as Naive Bayes, Support Vector Machines, and logistic regression (in their basic forms) can only give linear decision boundaries.


Neural networks can provide nonlinear decision boundaries, which is much more powerful.


Neural networks have a softmax layer at the top, but below that they have other layers of neural net. The classification decisions are linear w.r.t. the softmax at the top, but nonlinear w.r.t. the original space.

With a simple feed-forward neural net, we start with a dense representation of the input, put it through a hidden layer h = ReLU(Wx + b1), then an output layer y = softmax(Uh + b2). We calculate the log loss and back-propagate it all the way to the embeddings. The hidden layer moves the inputs around in an intermediate vector space so that they can be easily classified with a (linear) softmax.
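
A minimal numpy sketch of this forward pass (all dimensions here are arbitrary, for illustration only):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_in, d_hidden, n_classes = 400, 200, 3   # assumed sizes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d_hidden, d_in))
b1 = np.zeros(d_hidden)
U = rng.normal(scale=0.01, size=(n_classes, d_hidden))
b2 = np.zeros(n_classes)

x = rng.normal(size=d_in)      # dense input (concatenated embeddings)
h = relu(W @ x + b1)           # hidden layer: h = ReLU(Wx + b1)
y = softmax(U @ h + b2)        # output layer: y = softmax(Uh + b2)
loss = -np.log(y[0])           # log loss (cross-entropy), assuming the true class is 0
```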

For NLP tasks we usually convert the 1-hot features (words) into our embeddings (dense input layer).


Our neural dependency parser is very similar. Chen and Manning (2014) found that this worked very well despite being very simple. The dense representation allowed it to outperform other greedy parsers in both accuracy and speed.


Since then, others, especially Google, built bigger, deeper networks with better hyperparameters, beam search, and conditional random field (CRF)-style inference. SyntaxNet and Parsey McParseFace (2016) had even better accuracy.

Graph-based dependency parsers

The alternative to a transition-based dependency parser is a graph-based dependency parser. This works by considering every pair of words (including ROOT) and computing a score for how likely one is to be a dependent of the other, then finding the best parse over those scores, e.g. with a minimum spanning tree (MST) algorithm. To score a candidate dependency we need to know more than just what the pair of words is - we want their "context" as well.

In 2017 Manning's group built a neural graph-based dependency parser. It scored about 1% better than the best Google neural transition-based dependency parser, but graph-based parsers are much slower (O(n^2) rather than linear).

More about neural networks

Regularization

We are building models with a huge number of parameters. Almost all neural models use a regularized loss function over all parameters θ, e.g. L2 regularization. The regularization term sums the square of every parameter in the model, so parameters only end up far from zero if they're useful. (The penalty is calculated once per parameter, not once per example.) We do this to prevent overfitting when we have a lot of features. Overfitting is when the training error keeps decreasing, but on new data (test data) the error eventually starts getting worse - the parameters of the model become very good at predicting the training examples but don't generalize to new examples.
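
In symbols, an L2-regularized loss has roughly this form (writing the unregularized part as an average negative log-likelihood; the exact per-example loss depends on the model):

```latex
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log P\left(y_i \mid x_i; \theta\right) \;+\; \lambda \sum_{k} \theta_k^2
```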

The classic view of regularization is slightly outmoded and wrong for neural networks. For modern large neural networks, hugely overfitting on the training data isn't a problem, but we nonetheless need regularization to make sure that the models generalize well to independent test data.


We need to work out how much to regularize via the λ parameter. L2 regularization is not powerful enough on its own, so we use another technique called dropout to avoid feature co-adaptation. When training a model, for each batch, for each neuron in the model, you randomly drop 50% of its inputs (zero out elements of the layer). At test time you don't drop anything, but you halve the model weights (to compensate for having twice as many active inputs). Because features are randomly dropped, the model cannot rely on features that are only useful in the presence of other specific features. This makes each layer a kind of middle ground between Naive Bayes (where all the weights are set independently) and logistic regression models (where weights are set in the context of all the others).

This can be thought of as a form of model bagging (like an ensemble model). Another way to look at it is that dropout is a strong, feature-dependent regularizer.

For the backward pass, we preserve the dropout mask - no gradient flows through the values that were dropped. In a particular batch, we are only training the weights that were not dropped out. We can regularize different layers by different amounts, e.g. drop the first layer's inputs only a little (15%, or not at all) while dropping other layers more (50%), or use even more complicated schemes.
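
A rough numpy sketch of this (classic, non-inverted) dropout scheme, with the scaling done at test time as described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_drop=0.5):
    """Training time: zero out each element of the layer with probability p_drop,
    and remember the mask so the backward pass drops the same units."""
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask, mask

def dropout_test(h, p_drop=0.5):
    """Test time: keep every unit but scale by the keep probability
    (equivalent to halving the weights when p_drop = 0.5)."""
    return h * (1.0 - p_drop)

h = rng.normal(size=5)
h_train, mask = dropout_train(h)     # roughly half the entries become 0
h_test = dropout_test(h)             # all entries kept, scaled by 0.5

# Backward pass: gradients flow only through the units that were kept.
grad_upstream = np.ones_like(h)
grad_h = grad_upstream * mask
```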

Vectorization

Use vectors and matrices, not for loops! You get a speed gain of an order of magnitude on a CPU (and two orders of magnitude on a GPU).
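
For example, multiplying a weight matrix against many vectors at once instead of looping (a rough illustration; the exact speedup depends on hardware and sizes):

```python
import numpy as np
import time

rng = np.random.default_rng(0)
W = rng.normal(size=(500, 500))
X = rng.normal(size=(500, 1000))   # 1000 input column vectors

# Loop version: multiply W by one column vector at a time.
t0 = time.time()
cols = [W @ X[:, i] for i in range(X.shape[1])]
loop_result = np.stack(cols, axis=1)
t_loop = time.time() - t0

# Vectorized version: one matrix-matrix multiply.
t0 = time.time()
vec_result = W @ X
t_vec = time.time() - t0

assert np.allclose(loop_result, vec_result)
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s")
```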

Non-linearities

We have to have non-linearities in neural nets because multiple linear transformations, composed, can be collapsed into a single linear transformation. We get no additional power by having more linear layers.

The classic non-linearity is the logistic/sigmoid function, which maps any real number into the range (0, 1). One disadvantage is that it moves everything into positive space, so a variant of the sigmoid is tanh (hyperbolic tangent). Tanh is just a rescaled and shifted logistic curve which maps symmetrically into (-1, 1).

These have exponentials in them and are slow and expensive to compute. Hard tanh flatlines at y = -1 for x < -1, is y = x for -1 ≤ x ≤ 1, and is y = 1 for x > 1, so we get no discrimination above a certain value. This led to the most widely used nonlinearity, ReLU (Rectified Linear Unit): y = max(x, 0).


Logistic and tanh are still used in various places (e.g. to get a probability) but are no longer the defaults for building deep networks. ReLU networks train quickly and perform well thanks to good gradient backflow. People have also tried "leaky ReLU" (where the negative side has a small slope), "parametric ReLU" (where the slope of the negative part is learned), and "Swish" (which dips slightly below zero before becoming linear), but they don't help all that much, so people usually just use ReLU.
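
For reference, the nonlinearities mentioned above, written as simple numpy functions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # maps R -> (0, 1)

def tanh(x):
    return np.tanh(x)                       # rescaled/shifted sigmoid, maps R -> (-1, 1)

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)            # -1 for x < -1, x in between, 1 for x > 1

def relu(x):
    return np.maximum(x, 0.0)               # y = max(x, 0)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)    # small slope for x < 0
```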

Parameter initialization

In almost all cases, we must initialize the weights to small random values rather than zero. This avoids symmetries that prevent learning and specialization. It's fine to set bias weights to zero. We want to choose the other weights uniformly in the range (-r, r), with r chosen so that values are neither too big nor too small. Xavier initialization chooses r based on the fan-in n(in) (the previous layer's size) and the fan-out n(out) (the next layer's size). The need for this is removed when using layer normalization.
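
A small sketch of Xavier (Glorot) uniform initialization, using the common choice r = sqrt(6 / (n_in + n_out)):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Sample an (n_in, n_out) weight matrix from Uniform(-r, r), with r chosen
    from the fan-in and fan-out so activations and gradients keep a sensible scale."""
    rng = rng or np.random.default_rng()
    r = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_in, n_out))

W = xavier_uniform(400, 200)
b = np.zeros(200)   # bias weights can simply start at zero
```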

Optimizers

Usually plain stochastic gradient descent will work just fine. However, this often requires hand-tuning the learning rate. There is also a family of more sophisticated "adaptive" optimizers that keep track of how much gradient there has been for each parameter and use that to decide how much to adjust each weight when performing an update. Some of these are Adagrad, RMSprop, Adam, SparseAdam, etc. (No details given on these.) Adam is a fairly good, safe choice to start with in many cases.

On the other hand, with plain SGD we can just use a constant learning rate. Try around 0.001 to start - we want to get the right order of magnitude. Too big and your model may diverge or not converge; too small and your model may train very, very slowly. Better results can generally be obtained by allowing the learning rate to decrease as you train, e.g. by halving it every k epochs. (An epoch is one pass through the entire shuffled training data.) There are also fancier methods like cyclic learning rates. Note that Adam also needs an initial learning rate.
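
A PyTorch sketch of the two options above (plain SGD with a halve-every-k-epochs schedule vs. Adam); the model here is just a placeholder:

```python
import torch
from torch import nn

model = nn.Linear(100, 10)   # placeholder model for illustration

# Option 1: plain SGD with a hand-tuned learning rate, halved every k epochs.
k = 5
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=k, gamma=0.5)

# Option 2: an adaptive optimizer; Adam is a safe default (it still needs an initial lr).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Inside the training loop:
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
# and, if using the scheduler, scheduler.step() once per epoch.
```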

Language models (LMs)

Language modeling refers to the task of predicting what word comes next. E.g. "the students opened their ___". Books? Laptops? Exams?

More formally, given a sequence of words x(1), x(2), ..., x(t), compute the probability distribution of the next word x(t+1): P(x(t+1) | x(t), ..., x(1)). A system that does this is a language model.

We can also think of a language model as assigning a probability to a piece of text: by the chain rule, P(x(1), ..., x(T)) = P(x(1)) × P(x(2) | x(1)) × ... × P(x(T) | x(T-1), ..., x(1)), i.e. a product of next-word probabilities.

Language models are the cornerstone of language technology. For example, suggesting the next word when typing on your phone or in a Google Doc. The one in Google Docs works much better than the one on your phone, because the phone version has to run quickly and with low memory usage.

N-grams

How can we learn a language model (traditionally, without deep learning)? For several decades we used n-gram language models. An n-gram is a chunk of n consecutive words. These are called unigrams (n=1), bigrams (n=2), trigrams (n=3), and from n=4 up just the number, e.g. 4-grams.

First we make the Markov assumption that the word x(t+1) depends only on the preceding n-1 words rather than all previous words.

P(x(t+1) | x(t), ..., x(1)) ≈ P(x(t+1) | x(t), ..., x(t-n+2)) = P(x(t+1), x(t), ..., x(t-n+2)) / P(x(t), ..., x(t-n+2))

To get these probabilities, we count words in a large corpus and calculate count(x(t+1), x(t), ... x(t-n+2)) / count(x(t), ..., x(t-n+2)), i.e. the count of the n-gram divided by the count of the (n-1)-gram.

Suppose we are learning a 4-gram language model. Our input is "as the proctor started the clock, the students opened their ____". First we discard everything before "students". P(w | students opened their) = count(students opened their w) / count(students opened their).


Note that for a 4-gram language model, the numerator uses 4-grams and the denominator uses trigrams. (This differs from the terminology for Markov models, so this would correspond to a third order Markov model.) This is related to Naive Bayes, but differs in a couple of ways. First, Naive Bayes works out the probability of a word independent of its neighbors - it is a unigram model. Second, the classifier learns a different set of unigram counts for every class in the classifier, so effectively it has class-specific unigram language models.

Realistically we will have sparsity problems - many word sequences never occur in the training data, so the numerator is zero. A partial solution is to add a small delta to every count (smoothing). The denominator may also be zero if the (n-1)-gram context never occurs; in that case we can shorten the context (backoff), backing off to a trigram, then bigram, then unigram model. Note that increasing n makes sparsity worse. Typically we can't use n greater than 5.

There is also a problem with storage space: we need counts for every single n-gram we saw in the corpus, so large n-gram models had to run in the cloud just to hold these huge count tables. Neural net models can be massively more compact. But n-gram models are very simple and fast - we can build a simple trigram language model over a 1.7-million-word corpus (Reuters) in a few seconds on a laptop, as sketched below.
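
A toy version of such a count-based trigram model (no smoothing or backoff; the tiny corpus here is a stand-in, not Reuters), just to show how the counting and sampling work:

```python
import random
from collections import Counter, defaultdict

corpus = "today the price of gold rose while the price of silver fell".split()

# Count trigrams: for each bigram context, count the words that follow it.
counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

def next_word_distribution(w1, w2):
    """P(w | w1 w2) = count(w1 w2 w) / count(w1 w2)."""
    c = counts[(w1, w2)]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def sample_next(w1, w2):
    dist = next_word_distribution(w1, w2)
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

print(next_word_distribution("the", "price"))   # {'of': 1.0}
print(sample_next("the", "price"))              # 'of'
```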

Once we've done that, we can start generating text. "today the ___" -> company: 0.153, bank: 0.153, price: 0.077, italian: 0.039, emirate: 0.039, .... (We can see that these are pretty coarse estimates of something that occurred 3 times, something that occurred twice, and something that occurred once in the corpus.) Based on this, we sample a word from that probability distribution - let's say we sample "price".

"the price ___" -> of: 0.308, for: 0.050, it: 0.046, .... Let's say we sample "of". Now condition on "price of ___". Eventually we get a sentence like "today the price of gold per ton , while production of shoe lasts and shoe industry , the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks , sept 30 end primary 76 cts a share ."

This is surprisingly grammatical, but incoherent and nonsensical. We will need more than three words at a time if we want to model language well.

Neural language models

So far the only neural classifier we know about is the fixed-window-based classifier we saw for NER previously. We discard the words further back, convert the remaining window of words to embeddings, concatenate the word embeddings, put them through a hidden layer, and put a softmax classifier over our vocabulary at the end.


This precise model was introduced in 2000 (Bengio et al.). It didn't allow for a bigger context than an n-gram model, but it did have the advantages of distributed representations. Rather than relying on counts of word sequences, which are very sparse, it uses distributed representations of words, so semantically similar words give similar probability distributions.
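
A minimal PyTorch sketch of a fixed-window neural LM in this spirit (the window size, dimensions, and class name are illustrative, not the original model's hyperparameters):

```python
import torch
from torch import nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size=10_000, window=4, d_emb=100, d_hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.hidden = nn.Linear(window * d_emb, d_hidden)   # W, b1
        self.out = nn.Linear(d_hidden, vocab_size)          # U, b2

    def forward(self, window_ids):
        # window_ids: (batch, window) indices of the last `window` words
        e = self.embed(window_ids)                  # (batch, window, d_emb)
        x = e.reshape(e.shape[0], -1)               # concatenate the embeddings
        h = torch.relu(self.hidden(x))
        return torch.log_softmax(self.out(h), dim=-1)   # distribution over the next word

lm = FixedWindowLM()
logprobs = lm(torch.randint(0, 10_000, (8, 4)))     # shape (8, 10000)
```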

Improvements over n-gram LM:

  • no sparsity problem
  • less storage space

Remaining problems:

  • fixed window is too small
  • enlarging this window enlarges W
  • the window can never be large enough
  • words in different positions, x(1) and x(2), are multiplied by completely different weights in W - there is no symmetry in how the words are treated across different positions

We need a neural architecture that can process any length input, and have more sharing of parameters, while still being sensitive to proximity.

Recurrent neural networks (RNNs)


We have the hidden layer ("hidden state"), but we maintain it over time and feed it back into itself, hence "recurrent". Based on the first word, we compute the hidden representation and predict the next word. Then we feed in the second word and the hidden layer from the previous word to predict the hidden layer above the second word. We repeat this pattern at every time step.

Again, first we convert our 1-hot word vectors into word embeddings. We start with h(0), the initial hidden state, often taken to be a vector of zeros. The formulas for the hidden-state update and the prediction are as follows:

e(t) = E x(t) (the embedding of the word at time step t)
h(t) = σ(W_h h(t-1) + W_e e(t) + b_1)
ŷ(t) = softmax(U h(t) + b_2)

where σ is a nonlinearity (e.g. the sigmoid), W_h and W_e are the recurrent and input weight matrices (reused at every time step), and ŷ(t) is the predicted distribution over the next word.
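
A rough numpy sketch of unrolling these equations over a short sequence (sigmoid as the nonlinearity; all sizes illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, h = 10_000, 100, 200           # vocab size, embedding dim, hidden dim (assumed)
rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(V, d))      # embedding matrix
W_h = rng.normal(scale=0.01, size=(h, h))
W_e = rng.normal(scale=0.01, size=(h, d))
b1 = np.zeros(h)
U = rng.normal(scale=0.01, size=(V, h))
b2 = np.zeros(V)

word_ids = [12, 511, 9, 42]          # the input sequence x(1)..x(4)
h_t = np.zeros(h)                    # h(0): initial hidden state
for wid in word_ids:                 # the recurrence is inherently sequential
    e_t = E[wid]                                 # e(t) = E x(t)
    h_t = sigmoid(W_h @ h_t + W_e @ e_t + b1)    # h(t)
    y_t = softmax(U @ h_t + b2)                  # ŷ(t): distribution over the next word
```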

RNN advantages:

  • can process any length input
  • computation for step t can (in theory) use information from many steps back
  • the model size doesn't increase for longer input context
  • the same weights are applied at every time step, so there is symmetry in how inputs are processed

RNN disadvantages:

  • recurrent computation is slow - although each time step is just matrix multiplies, each step depends on the previous one, so the steps can't be parallelized; it's basically a for loop over time
  • in practice, it's difficult to access information from many steps back

Next time we will talk about more advanced neural networks that are able to more effectively access past context.