Lecture 1

Lecture video: link

This lecture is about the surprising result that word meaning can be represented really well by a large vector of real numbers.

Human language and word meaning

Language is a social system, constructed and interpreted by people. Human language is a great adaptive system for human beings, but it is difficult for computers to understand.

In the last decade, machine translation has gotten to a place where it works pretty well. Previously this was a difficult human task.

The single biggest development in NLP recently was GPT-3 - a large language model (LLM). It’s the first step on the path to “universal models” - if you train up one very large model on all the data in the world, it will have knowledge of the world, human languages, how to do tasks, etc., and you can apply it to many different things rather than having a separate classifier and model for each task.

It basically works by taking text and predicting the next word, one word at a time. This can do many surprising things, e.g. translate human language sentences into SQL.

What is meaning? Denotational semantics is the idea that meaning is a pairing between a signifier (a symbol) and the signified (an idea or thing). NLP has often used dictionaries and thesauruses. A common one is WordNet (available through NLTK), which organizes words into synonym sets and hypernyms (e.g. "bird" and "animal" are hypernyms of "pigeon"). This approach misses nuance, is impossible to keep up to date, and can't compute accurate word similarity.

In traditional NLP, we regard words as discrete symbols - "hotel", "conference", "motel" - a localist representation, represented by one-hot vectors (the vector dimension is the size of the vocabulary), e.g.

motel = [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
hotel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal - their dot product is zero. There is no natural notion of similarity for one-hot vectors.
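A minimal numpy sketch of the problem (the toy vocabulary size and word indices here are made up):

```python
import numpy as np

# One-hot vectors for "motel" and "hotel" in an 18-word toy vocabulary.
motel = np.zeros(18)
motel[9] = 1.0
hotel = np.zeros(18)
hotel[10] = 1.0

# The dot product of two distinct one-hot vectors is always 0,
# so one-hot encodings carry no similarity signal at all.
print(motel @ hotel)  # 0.0
```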

People have tried to calculate similarity based on synonyms, but they tended to fail badly from incompleteness. Instead we’re going to learn to encode similarity in the vectors themselves.

Distributional semantics is the idea that a word's meaning is given by the words that frequently appear close to it. "You shall know a word by the company it keeps" - J.R. Firth, 1957. This is a very computational sense of semantics, and it is used successfully in many deep learning systems.

When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-sized window).
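A tiny sketch of what "context" means here, assuming a fixed window of size 2 (the sentence fragment is the "banking" example from the lecture slide):

```python
def context(words, t, window=2):
    """Return the context ("outside") words around position t."""
    left = words[max(0, t - window):t]
    right = words[t + 1:t + 1 + window]
    return left + right

text = "government debt problems turning into banking crises as happened".split()
print(context(text, t=5))  # ['turning', 'into', 'crises', 'as']
```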

(figure: example context windows around the center word "banking")

A particular instance of a word in text, e.g. "banking" above, is a token of the word. The word "banking" in the abstract, across all its instances, is a type.

We want to build up a dense real-valued vector that represents the meaning of the word. It will be useful for predicting other words that appear in similar contexts.

banking = [0.286, 0.792, -0.177, ...] (a common size is 300 dimensions).

These are also known as word embeddings because when we have a whole bunch of words, these representations place them all into a high-dimensional vector space - they are "embedded" in that space. They are distributed representations (rather than localist).

It is difficult to visualize a 300-dimensional space, but you can look at a 2-dimensional projection of it. Keep in mind that the projection loses almost all of the information in the original space and crushes together points that are actually far apart.

Here's a visualization of part of the nationality word space.


Word2vec introduction

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors. It is still widely used.

Idea:

  • We have a large corpus ("body") of text
  • Every word in a fixed vocabulary is represented by a vector - say, the 400k most common words
  • We go through each position t in the text, which has a center word c and context ("outside") words o
  • We will use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  • Keep adjusting the word vectors to maximize this probability

How do we calculate P(w_{t-2} | w_t)?

The likelihood is the product over each word in the corpus used as the center word, and the product over each context word in the window around it, of the probability of that context word given the center word.
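In symbols, with T words in the corpus, window size m, and θ standing for all the parameters (the word vectors):

$$
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)
$$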

Our objective function (also called the cost or loss function) is the negative log-likelihood: we take the log of the likelihood (so products become sums, which are easier and faster to work with), average it (1/T, where T is the number of words in the corpus), and negate it (so we can minimize it).
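Written out:

$$
J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)
$$

Minimizing J(θ) is the same as maximizing the likelihood.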

To simplify building the model, we will use two vectors per word: v_w when w is a center word, and u_w when it is a context ("outside") word.

The dot product is a simple way to measure the "similarity" of two vectors. We turn the dot products into a probability distribution by exponentiating them (to get positive numbers) and then normalizing them (dividing by the sum over the vocabulary, so they sum to 1). This is an example of the softmax function. It returns a distribution that amplifies the probability of the largest x[i], rather than picking a single "max" as the name might imply.
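With the notation above (v_c for the center word c, u_o for the outside word o), the probability is:

$$
P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}
$$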

We want to train the model - change the parameters to minimize the loss. The only parameters are our word vectors: we have d-dimensional vectors and V-many words, and each word has two vectors. Our job is to compute all the vector gradients.
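Concretely, θ stacks every center and context vector into one long vector (the aardvark-to-zebra ordering follows the lecture slide):

$$
\theta = \begin{bmatrix} v_{\text{aardvark}} \\ \vdots \\ v_{\text{zebra}} \\ u_{\text{aardvark}} \\ \vdots \\ u_{\text{zebra}} \end{bmatrix} \in \mathbb{R}^{2dV}
$$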

Multivariate Calculus

We have J(θ) and P(o|c) from above. We want to take the partial derivative of log P(o|c) w.r.t. each word vector - center and outside. Watching the lecture for a bit at this timestamp is a good way to learn the calculus if you don't already know how to do it.

We end up with a gradient of the form "observed minus expected": the actually observed context word vector, minus the model's expected context vector (the average of all the context vectors, weighted by their current predicted probabilities).
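In symbols:

$$
\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c)\, u_w
$$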

Interesting Word2vec analogy task: If you take the vector [king] and subtract [man] and add [woman], you will get [queen].
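A quick way to try this yourself is with gensim's pretrained vectors. A sketch ("word2vec-google-news-300" is gensim's pretrained download; treat the exact similarity score as illustrative):

```python
import gensim.downloader as api

# Pretrained word2vec vectors (a large download on first use).
wv = api.load("word2vec-google-news-300")

# Vector arithmetic: king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected top result: ('queen', ~0.71)
```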

Student Q&A

The most common way to collapse the center vector representation of a word and the context vector is just to average them.

Word2vec is bad at sentiment analysis and antonyms, since such words tend to appear in very similar contexts (e.g. "the movie was fantastic" as well as "the movie was terrible").

Word2vec ignores the position of words - the word just before the center word is treated identically to the word just after. We need to use a language model to start to work on problems that depend on the specific ordering of words.