Lecture 3 - bancron/stanford-cs224n GitHub Wiki

Lecture video: link

This lecture covers the mathematical details of neural net learning: first computing gradients by hand using matrix calculus, then computing them algorithmically with the backpropagation algorithm.

Named Entity Recognition

Named Entity Recognition (NER) is the task of finding and classifying names in text. For example,

Paris Hilton was arrested in the Hilton Hotel in Paris in April 1989.
PER   PER                        LOC    LOC      LOC      DATE  DATE

There are more advanced ways of doing this, but one simple approach with a small neural network is to classify each word within its context window of neighboring words: take the word vectors for the window, put them through a neural network layer, and feed the result to a logistic classifier.

For example, if Paris is the center word and we have a window of size 2:

the museums in [Paris] are amazing to see
    1       2  3       4   5

We concatenate the vectors for these 5 words into a column vector x ∈ ℝ^5d. We then feed this vector to a classifier which outputs the probability of the center word being a location; another classifier outputs the probability of it being a person's name, and so on.

We put x through a neural network layer to get h: multiply by a matrix and add a bias vector, then put the result through a nonlinearity such as the sigmoid function. h is the hidden vector, which may have a smaller dimensionality. We then take the dot product of h and an extra vector u to get a single number (the score s), and put that number through a logistic transformation to turn it into a probability between 0 and 1.
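
As a concrete sketch of this forward pass (a minimal NumPy version; the toy dimensions, the random initialization, and the choice of sigmoid for both the hidden nonlinearity and the final logistic transform are assumptions for illustration):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Toy sizes: d-dimensional word vectors, a window of 5 words, a hidden layer of size 8.
d, n_window, n_hidden = 4, 5, 8
rng = np.random.default_rng(0)

# Concatenate the 5 window word vectors into x ∈ R^{5d}.
window_vectors = [rng.normal(size=d) for _ in range(n_window)]
x = np.concatenate(window_vectors)              # shape (5d,)

# Parameters: W ∈ R^{n_hidden x 5d}, b ∈ R^{n_hidden}, u ∈ R^{n_hidden}.
W = rng.normal(size=(n_hidden, n_window * d))
b = np.zeros(n_hidden)
u = rng.normal(size=n_hidden)

z = W @ x + b                                   # linear layer
h = sigmoid(z)                                  # elementwise nonlinearity f
s = u @ h                                       # score: the dot product u^T h
prob_location = sigmoid(s)                      # logistic transform -> probability in (0, 1)
print(prob_location)
```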

We update the parameter vector θ, starting from its current value, based on the gradient of the current loss J(θ): we move a small distance in the negative direction of the gradient. You can think of this as working out the partial derivative of the loss with respect to each parameter and moving a little bit in the direction that decreases the loss.
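
In symbols, with step size (learning rate) α, this is the usual gradient descent update:

$$
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
$$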

We will update all the parameters of the model as we learn. In particular, in contrast to what commonly happens in statistics, in addition to updating the weights of the classifier, we will also be changing the data representation (our word vectors) as we learn.

We need to be able to compute the gradients with respect to these parameters in order to update the weights and train the model efficiently. To start we will learn to do that by hand, which requires a review of matrix calculus.

Matrix calculus

Matrix calculus uses fully vectorized gradients. We can start by working out a non-vectorized version to build intuition. "Multi-variable calculus is just like single-variable calculus if you use matrices."

Given a function with 1 output and 1 input, e.g. f(x) = x^3, its gradient (slope) is its derivative: df/dx = 3x^2.

"How much will the output change if we change the input a bit?" At x=1, 0.01 change in input results in 1.01^3 = 1.03, or 3 times as much. At x=4 it changes 4.01^3 = 64.48, or about 48 times as much.

Now consider a function with n inputs and m outputs. Its Jacobian matrix is an m x n matrix containing all the partial derivatives: entry (i, j) is the partial derivative of the i-th output with respect to the j-th input.
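
Written out, for a function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ with components $f_1, \dots, f_m$:

$$
\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix},
\qquad
\left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)_{ij} = \frac{\partial f_i}{\partial x_j}
$$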

The chain rule for the composition of one-variable functions is to multiply the derivatives. For the composition of multi-variable functions, we multiply their Jacobians. Let z = Wx + b (a very simple neural network layer), then put z through a nonlinearity: h = f(z), where f could be e.g. the sigmoid function applied elementwise.

The function f here has n inputs and n outputs, so we are calculating an n x n Jacobian: the partial derivative of each output w.r.t. each input. Since the transformation is elementwise, if i = j we have something to compute, but if i != j the j-th input has no influence on the i-th output, so the derivative is 0. We get a diagonal matrix with the derivative of f at each element along the diagonal.
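
In symbols, since each output $h_i = f(z_i)$ depends only on $z_i$:

$$
\left( \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \right)_{ij}
= \frac{\partial h_i}{\partial z_j}
= \begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}
\qquad\Longrightarrow\qquad
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \operatorname{diag}\!\left( f'(\mathbf{z}) \right)
$$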

The Jacobian of Wx + b with respect to x is W. This looks just like regular calculus! The Jacobian with respect to b is I (the identity matrix). The Jacobian of the dot product u^T h with respect to u is h^T.
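
Collected in one place, the three identities used below are:

$$
\frac{\partial}{\partial \mathbf{x}} (\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{W},
\qquad
\frac{\partial}{\partial \mathbf{b}} (\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{I},
\qquad
\frac{\partial}{\partial \mathbf{u}} (\mathbf{u}^\top \mathbf{h}) = \mathbf{h}^\top
$$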

Going back to our neural net, let's calculate some partial derivatives. s is the score, and the parameters of the model are W, b, u, and the input x (since we also update the word vectors for the different words). Let's start with ∂s/∂b.

We can decompose the computation into intermediate steps (z = Wx + b, h = f(z), s = u^T h) to make this a bit easier. It's worth writing out the dimensionality of every variable and making sure the answers you calculate have the correct dimensionality. s is the composition of three functions, so the gradient is the product of three partial derivatives (Jacobians).
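
Written out with the Jacobians from the previous section:

$$
\frac{\partial s}{\partial \mathbf{b}}
= \frac{\partial s}{\partial \mathbf{h}} \,
  \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \,
  \frac{\partial \mathbf{z}}{\partial \mathbf{b}}
= \mathbf{u}^\top \, \operatorname{diag}\!\left( f'(\mathbf{z}) \right) \, \mathbf{I}
$$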

The identity matrix disappears, so this decomposes down to u^T ⊙ f'(z), where ⊙ is the Hadamard product, also known as an elementwise product.

Let's now calculate ∂s/∂W. Everything is the same except that we are taking the gradient w.r.t. W. We should re-use the computation shared with the previous calculation (the first two factors of the chain) to save time during training.

What do we want for ∂s/∂W? s has 1 output and W has n x m inputs (its entries), so the Jacobian would be a 1 x nm matrix. But this is inconvenient for updating θ - we want the gradient to have the same shape as the W matrix. So we will depart from pure math and use the shape convention: the shape of the gradient is the same as the shape of the parameter.

So what is ∂s/∂W? With a slightly hacky bit of math, it turns out to be an outer product of x and the vector δ = u^T ⊙ f'(z) that we already computed above.

Why is this true? Consider the derivative w.r.t. a single weight W[ij]. W[ij] only contributes to z[i], where its contribution is W[ij] x[j], so ∂z[i]/∂W[ij] = x[j].

Putting these together column by column, the first column is x[1] times δ, ..., and the m-th column is x[m] times δ. It's a slightly hacky argument, but it makes the dimensions work out.
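
In symbols, with δ = u^T ⊙ f'(z) as above, each entry of the gradient is δ[i] x[j], so the whole thing is an outer product:

$$
\frac{\partial s}{\partial W_{ij}} = \delta_i \, x_j
\qquad\Longrightarrow\qquad
\frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta}^\top \mathbf{x}^\top \in \mathbb{R}^{n \times m}
$$

which has the same shape as W, as the shape convention requires.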

The Jacobian form is useful for doing the calculus, but stochastic gradient descent only works conveniently if the gradient you compute has the same shape as the matrix it updates.

There are two options for handling this: (1) work through all the math using Jacobians and then reshape at the very end to get the answer, or (2) always follow the shape convention, working out what shape we want the result to be at every step. Option 2 is a bit hackier, and we always have to figure out when to transpose.

Backpropagation

The backpropagation algorithm is judiciously taking and propagating derivatives using the matrix chain rule. When we have these neural networks, we have a lot of shared structure and shared derivatives, so we want to maximally efficiently re-use derivatives of higher layers when computing derivatives for lower layers to minimize computation.

We construct computation graphs - a tree or, more generally, a directed graph of computations. The source nodes are inputs and parameters like x, W, and b, the interior nodes are operations, and we pass the results of the operations along the edges.

Forward propagation computes, for given inputs, what the output is. The essential additional element of neural networks is also sending gradients back through the graph, which tell us how to update the parameters of the model. Given a loss function, this is what gives the model the ability to learn, i.e. to minimize the loss. That is called backpropagation.

We progressively pass the gradients back along the graph. The node receives an upstream gradient, and the goal is to pass on the correct downstream gradient. For our function f, we work out the local gradient ∂h/∂z, then use the chain rule ∂s/∂z = ∂s/∂h * ∂h/∂z.

[downstream gradient] = [upstream gradient] * [local gradient]

A function can also have multiple inputs, e.g. the matrix multiply z = Wx. We still have an upstream gradient, but we need to work out a local gradient w.r.t. each input - ∂z/∂W and ∂z/∂x. Then we do the same thing as before to work out each downstream gradient using the chain rule.

Here's a simple example (unrelated to neural nets): f(x, y, z) = (x + y) max(y, z) with x = 1, y = 2, z = 0. First we build an expression tree for the forward propagation phase, with intermediate values a = x + y = 3, b = max(y, z) = 2, and output f = ab = 6.

Next is backpropagation. First we work out our local gradients.

  • ∂a/∂x = 1
  • ∂a/∂y = 1
  • ∂b/∂y = 1 if y > z, else 0; here y > z, so it is 1
  • ∂b/∂z = 1 if z > y, else 0; here z < y, so it is 0
  • ∂f/∂a = b = 2
  • ∂f/∂b = a = 3

Downstream = upstream * local as we propagate backwards. For this example that gives ∂f/∂x = 2 * 1 = 2, ∂f/∂y = 2 * 1 + 3 * 1 = 5 (y feeds two branches, see below), and ∂f/∂z = 3 * 0 = 0. This is what we saw at the beginning of this lecture - if you wiggle the input by a bit, how much does the output change?

If you have multiple output branches, you sum the upstream gradients.

  • + "distributes" the upstream gradient to each thing summed
  • max "routes" the upstream gradient - one branch gets all the gradient, others get 0
  • * "switches" the upstream gradient - inputs (3, 2) -> outputs (2, 3)

Efficiency: compute all gradients at once

The incorrect way to do backprop is to independently compute each gradient, which has a bunch of duplicate computation.

The correct way is to compute all the gradients at once, computing each step in the graph once - forward propagation and then backpropagation.

We have a single scalar output z, computed from inputs and parameters. We create a computation graph, starting from the source nodes, which do not depend on anything else. Then we compute forward along the graph in topological order. Next we run backprop: initialize the output gradient to 1 and, visiting nodes in reverse topological order, compute the gradient of z w.r.t. each node using the gradients of its successors (the nodes it feeds into) and the generalized chain rule - for each successor, take the product of the upstream gradient and the local gradient, and sum these contributions.

Done correctly, the big-O complexity of forward and backpropagation is the same. This algorithm works for arbitrary DAGs. Generally neural networks have a regular layer-structure so we can use matrices and Jacobians to parallelize the computation.
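
Here is a minimal sketch of that procedure in plain Python; the Node, Input, Add, and Mul classes and the run helper are made up for illustration (real frameworks are much more elaborate), and the example graph is the simpler function f = (x + y) * y:

```python
class Node:
    """One operation in a computation graph; grad will hold d(output)/d(this node)."""
    def __init__(self, *inputs):
        self.inputs = inputs
        self.value = None
        self.grad = 0.0
    def forward(self): pass
    def backward(self): pass

class Input(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value

class Add(Node):
    def forward(self):
        self.value = self.inputs[0].value + self.inputs[1].value
    def backward(self):
        for inp in self.inputs:           # "+" distributes the upstream gradient
            inp.grad += self.grad * 1.0

class Mul(Node):
    def forward(self):
        self.value = self.inputs[0].value * self.inputs[1].value
    def backward(self):
        a, b = self.inputs                # "*" switches: each local gradient is the
        a.grad += self.grad * b.value     # other input's value
        b.grad += self.grad * a.value

def run(topo_order):
    for node in topo_order:               # forward pass, in topological order
        node.forward()
    for node in topo_order:
        node.grad = 0.0
    topo_order[-1].grad = 1.0             # d(output)/d(output) = 1
    for node in reversed(topo_order):     # backward pass, reverse topological order
        node.backward()                   # downstream += upstream * local

# f = (x + y) * y with x = 1, y = 2
x, y = Input(1.0), Input(2.0)
a = Add(x, y)
f = Mul(a, y)
run([x, y, a, f])
print(f.value, x.grad, y.grad)   # 6.0, df/dx = 2.0, df/dy = 2.0 + 3.0 = 5.0
```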

Automatic differentiation: if you know the computation graph, you can automatically calculate the derivatives and apply backprop to update the parameters and learn. Symbolic computation packages like Mathematica can take symbolic forms of computation graphs and calculate all the derivatives for you. Modern deep learning frameworks like TensorFlow and PyTorch do 90% of this work for you, but they don't symbolically compute derivatives: for each node and layer of a deep learning system, someone has hand-written the local derivatives. The chain rule combinations are then automated.

The software takes care of running the forward pass and the backward chain-rule combination over the whole graph for you.

Then, for each node or gate, you need to implement forward(x, y) -> ... and backward(dz) -> ....
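
For example, a multiply gate in this style might look like the sketch below (the class name and the convention of returning a list of input gradients are assumptions, following the forward(x, y) / backward(dz) signatures above):

```python
class MultiplyGate:
    def forward(self, x, y):
        # Cache the inputs: they are exactly the local gradients needed in backward.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # downstream = upstream * local; "*" switches the cached input values
        dx = dz * self.y
        dy = dz * self.x
        return [dx, dy]

gate = MultiplyGate()
print(gate.forward(3.0, 2.0))   # 6.0
print(gate.backward(1.0))       # [2.0, 3.0]
```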

You can numerically estimate the gradient to spot-check your implementation by calculating f'(x) ≈ (f(x+h) - f(x-h)) / (2h) for small h - very similar to good old calculus, but with - f(x-h) instead of - f(x) to make the estimate more accurate (a two-sided difference). This is approximate and very slow - you have to recompute f for each parameter of the model (which could number in the millions).
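
A sketch of such a check (the helper is hypothetical; it perturbs one parameter at a time with the two-sided difference, which is why it is so slow for big models):

```python
import numpy as np

def numeric_gradient(f, x, h=1e-5):
    """Estimate df/dx elementwise via (f(x+h) - f(x-h)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; f_plus = f(x)
        x.flat[i] = old - h; f_minus = f(x)
        x.flat[i] = old                      # restore
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Spot check: f(x) = sum(x^3) has analytic gradient 3x^2.
x = np.array([1.0, 4.0])
print(numeric_gradient(lambda v: np.sum(v ** 3), x))   # ≈ [ 3. 48.]
```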

Summary

  • Backpropagation is recursively (and hence efficiently) applying the chain rule along a computation graph.
    • [downstream gradient] = [upstream gradient] * [local gradient]
  • The forward pass computes the result of each operation and saves the intermediate values.
  • The backward pass applies the chain rule to compute the gradients.

Why did we learn all this, when modern libraries will do basically all of this for you (e.g. PyTorch)? It's useful to have a deep understanding of things. It doesn't always work perfectly, and when it doesn't, the deeper understanding is crucial for debugging as well as improving models.