GPT

Generative Pre-Trained Transformer

GPT, Attention is All You Need

  • Generative: Models that generate new text.
  • Pre-Trained: Models trained on a massive amount of data. The prefix suggests there is room for additional training.
  • Transformer: A transformer is a specific kind of neural network architecture.

The original transformer introduced by the paper Attention is All You Need was intended just for text translation. Current models like GPT are transformer-based models trained to take text as input and produce a prediction for what comes next in a passage. That prediction takes the form of a probability distribution over many chunks of text that may follow.

Predicting the next word is, in this context, not that different from generating sentences: a sentence is just a sequence of words, and we have many examples of text that makes sense. So we can keep predicting the next word, appending it to the input, and predicting again, and so on.
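As a rough illustration, here is a minimal sketch of that loop in Python. The toy vocabulary, the `predict_next_token_distribution` function, and the uniform probabilities are stand-ins, not how a real model computes the distribution.

```python
# Minimal sketch of autoregressive generation with a placeholder model.
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def predict_next_token_distribution(tokens):
    # Placeholder for the real transformer forward pass: here we just
    # return a uniform distribution over the toy vocabulary.
    return [1.0 / len(VOCAB)] * len(VOCAB)

def generate(prompt_tokens, num_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        probs = predict_next_token_distribution(tokens)
        # Sample the next token and append it, so it becomes part of
        # the input for the following prediction.
        next_token = random.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

print(generate(["the", "cat"], 4))
```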

You may still wonder how we get from word prediction to reasonable logic and more. It is my opinion that this capability comes from the fact that examples of reasoning can be learnt from the way sentences are built. Iterative prediction can then enable emergent behaviors that look like reasoning.

High-Level Preview of How Data Flows Through a Transformer

In broad strokes, this is what happens when a transformer model generates the next word:

  • First, the input is broken up into tokens (whole words or pieces of words, patches of images, etc.).
  • Each token is associated with a vector encoding its meaning.
  • These vectors are passed through an attention block that modifies each vector based on the information of the other vectors in the input.
  • The attention block is responsible for finding which tokens in the context are relevant to other tokens in the same context, and for how to update them to reflect that.
  • After that, the vectors pass through a Multilayer Perceptron/Feed-Forward Network. While this section is not easily interpretable, it is hypothesized that this block contains the general knowledge of the LLM. Each vector is then modified to be affected by this information too.
  • In "Deep Learning Style", this process is repeated many times, in the hope that whatever makes the sentence what it is has been baked into a vector that can be used to predict the next token. More precisely, that vector produces a probability distribution from which we pick the chunk of text that comes next; this is where temperature comes into play (a temperature of 0 always picks the most probable token; see the sketch after this list).
  • To make a tool like this into a chatbot, you can add some text that establishes the setting, such as "What follows is a conversation between a user and a helpful, knowledgeable AI assistant." (often called the system prompt). You then append the user input and let the model predict a response.
  • Note that there are several other steps and blocks necessary to make this work well (normalization, fine-tuning, ...).
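Here is a minimal sketch of the sampling step mentioned above, assuming we already have raw scores (logits) for each candidate next token; the function name and the toy logits are illustrative only.

```python
# Minimal sketch of temperature sampling over a list of logits.
import numpy as np

def sample_next_token(logits, temperature=1.0):
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        # Temperature 0: always pick the most probable token.
        return int(np.argmax(logits))
    # Higher temperature flattens the distribution, lower temperature sharpens it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]
print(sample_next_token(logits, temperature=0.0))   # deterministic
print(sample_next_token(logits, temperature=1.0))   # stochastic
```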

Embedding

  • The model has a predefined vocabulary, a list of all possible tokens. The embedding matrix (generally labeled $W_E$) has a column for each of these that determines what each token turns into.

  • $W_E$ values begin random but are then adjusted based on training data. The model learns relationships between tokens during training that can then be used to encode/embed inputs.

  • It is important to understand that embeddings do not represent singular tokens; they try to embed the general context in which a token was observed in the training data.

  • The training also "geometrically balances" the space of the embeddings to improve prediction.

  • That is why individual dimensions shouldn't be inspected on their own to figure out what they express. The global relationships are what matter.

  • The meaning of a token is often determined by other tokens that may be far away (think about describing something and then giving it a name). The model therefore needs a component to figure out which parts of the input are relevant to obtaining a better prediction.

  • Initially, $W_E$ only encodes the tokens. Through subsequent layers, the model will attempt to change this so that tokens influence each other.

  • An embedding initially encodes all possible meanings of a word that the model has learnt.

  • The aim of a transformer is to progressively adjust these embeddings so they don't merely encode an individual word but instead bake in much richer contextual meaning.
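As a toy illustration of the embedding step, here is a sketch assuming a tiny vocabulary and a randomly initialized $W_E$; in a real model the matrix is learnt and far larger.

```python
# Minimal sketch of embedding lookup with a toy vocabulary.
import numpy as np

vocab = ["hello", "world", "transformer"]
d_model = 4                                # embedding dimension (toy size)
rng = np.random.default_rng(0)

# One column per token in the vocabulary, as described above.
W_E = rng.normal(size=(d_model, len(vocab)))

def embed(tokens):
    ids = [vocab.index(t) for t in tokens]
    # Selecting column i of W_E gives the initial embedding of token i.
    return W_E[:, ids].T                   # shape: (num_tokens, d_model)

print(embed(["hello", "transformer"]).shape)   # (2, 4)
```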

Attention Block

  • The model can only process a subset of the tokens at a time, known as the context size (GPT-3, for example, had ~2K tokens). The context size limits how much input the model can use to make a prediction (see the sketch after this list).
  • The attention block enables moving information encoded in one embedding into that of another (potentially far away, and potentially with information that is much richer than a single word).
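A minimal sketch of what the context-size limit means in practice, assuming a hypothetical helper that simply drops everything older than the window:

```python
# Minimal sketch of a context window: only the most recent `context_size`
# tokens are visible to the model when predicting the next one.
def clip_to_context(tokens, context_size=2048):
    return tokens[-context_size:]

tokens = list(range(5000))                 # pretend these are token ids
print(len(clip_to_context(tokens)))        # 2048
```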

Single-headed attention

  • Initially, each embedding only encodes the meaning of its specific token, regardless of the context and the position of the word.

  • Imagine wanting to create another embedding that bakes nouns and adjectives of the input together.

  • Imagine each noun asking whether there is any adjective in front of it. Two other tokens may be able to say yes to this.

  • This question is encoded in a weight matrix that the model learnt during training. By multiplying the embeddings by these weights we obtain a representation of the question called the Query Vector.

  • In reality, in deep-learning style, the question doesn't have a single objective; it is trying to optimize a number of aspects/questions of the problem that will help the model make better predictions (this characteristic of deep learning is often called superposition).

  • There is a second matrix called the Key Matrix that the model multiplies with the embeddings to obtain the Key Vectors. These vectors try to potentially answer the previous question.

  • The more the Key of one embedding matches the Query of another embedding, the more the first embedding answers the question of the second.

  • If we take the dot product between the two, the higher the product the better they match; in the lingo, the second embedding attends to the first (when they are related).

We get a vector of scores that tells us how relevant each token is to updating the meaning of another token.

Finally, we have a third matrix called the Value Matrix, which is used to produce the update applied to an embedding that attends to another.
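Putting the Query, Key, and Value matrices together, here is a minimal NumPy sketch of single-headed attention with toy sizes; the random matrices stand in for the learnt weights, and real implementations add masking and other details.

```python
# Minimal sketch of single-headed attention.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, d_head = 3, 8, 4

X = rng.normal(size=(num_tokens, d_model))     # token embeddings
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q = X @ W_Q                                    # queries: "questions" asked by each token
K = X @ W_K                                    # keys: potential "answers"
V = X @ W_V                                    # values: the information to move

# Dot products measure how well each key answers each query.
scores = Q @ K.T / np.sqrt(d_head)

# Softmax turns the scores into attention weights for each query token.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's update is a weighted mix of the value vectors.
update = weights @ V
print(update.shape)                            # (3, 4)
```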

Multi-headed attention

The single-head attention process is done in parallel multiple times. By running many attention heads in parallel, the model can learn many ways in which the context changes the meaning of tokens.
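A minimal sketch of that idea, assuming a couple of toy heads with independent random weights; real models use many more heads with learnt parameters and typically mix the concatenated result through an additional output matrix.

```python
# Minimal sketch of multi-headed attention: the single-head computation
# repeated in parallel with independent weights, then concatenated.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, d_head, num_heads = 3, 8, 4, 2
X = rng.normal(size=(num_tokens, d_model))

def attention_head(X):
    # Each head has its own independently learnt W_Q, W_K, W_V (random here).
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Run every head (conceptually in parallel) and concatenate the results.
heads = [attention_head(X) for _ in range(num_heads)]
multi_head_output = np.concatenate(heads, axis=-1)
print(multi_head_output.shape)                 # (3, 8)
```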

Multi-Layer Perceptron

Although a full account of how facts are stored remains an open question, there are some interesting partial results, including the high-level conclusion that facts seem to live inside a very specific part of these networks, known fancifully as the Multi-Layer Perceptron (MLP).

The multi-headed attention block allowed input tokens to pass information between one another.

By passing through several Attention and MLP blocks, the hope is that the input will have soaked up enough context from the other tokens, and enough of the "general knowledge" baked into the model through training, to be used to predict what token comes next.

A majority of the model parameters live inside the MLP blocks.
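A minimal sketch of such a feed-forward block, with toy sizes and random weights standing in for learnt parameters; the ReLU here is a stand-in for the nonlinearities (such as GELU) used in practice.

```python
# Minimal sketch of the MLP / feed-forward block, applied independently to
# each token's vector: expand, apply a nonlinearity, project back down.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, d_hidden = 3, 8, 32       # hidden layer is typically ~4x wider

X = rng.normal(size=(num_tokens, d_model))     # output of the attention block
W_in = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))

hidden = np.maximum(0.0, X @ W_in)             # ReLU nonlinearity (GELU in practice)
mlp_output = hidden @ W_out

# In real transformers the block's output is added back to the token vectors
# (a residual connection).
X = X + mlp_output
print(X.shape)                                 # (3, 8)
```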

To get an intuitive understanding, consider the embedding space where

  • There is a vector representing the idea of the first name Michael.
  • There is, in a perpendicular direction, a vector representing the idea of the last name Jordan.
  • There is another direction that represents the idea of Basketball.

If we have an input vector for the full name Michael Jordan, its dot products with the Michael and Jordan directions should both be close to 1 after the attention block.

By going through the MLP block, we can add to that resulting vector knowledge that the model was trained on, such as that Michael Jordan is associated with the idea of Basketball.

This is done in parallel for all the outputs of the multi-headed attention block, adding a variety of information.
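A toy illustration of this intuition, with hand-picked 3-D "directions" rather than a real embedding space; the threshold test is only a cartoon of what the MLP's learnt weights and nonlinearity accomplish.

```python
# Toy illustration: detect the Michael + Jordan combination and add Basketball.
import numpy as np

michael    = np.array([1.0, 0.0, 0.0])         # direction for the first name Michael
jordan     = np.array([0.0, 1.0, 0.0])         # perpendicular direction for the last name Jordan
basketball = np.array([0.0, 0.0, 1.0])         # direction for Basketball

# After attention, suppose the token's vector encodes both Michael and Jordan.
token_vector = michael + jordan

# The MLP can test for the combination and, if present, add the Basketball direction.
if token_vector @ michael > 0.5 and token_vector @ jordan > 0.5:
    token_vector = token_vector + basketball

print(token_vector)                            # [1. 1. 1.]
```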

Unembedding

The unembedding matrix $W_U$ has a row for each token in the vocabulary and each row has the same number of elements as the embedding dimension.
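A minimal sketch of the unembedding step under that layout, with random numbers standing in for the learnt $W_U$ and the final token vector:

```python
# Minimal sketch of unembedding: mapping the final vector of the last token
# to one score (logit) per token in the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 8

W_U = rng.normal(size=(vocab_size, d_model))   # one row per vocabulary token
final_vector = rng.normal(size=d_model)        # last token's vector after all blocks

logits = W_U @ final_vector                    # one score per possible next token
print(logits.shape)                            # (6,)
```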

SoftMax

  • SoftMax is a standard way to turn an arbitrary list of numbers into a probability distribution, in such a way that the largest values end up closest to 1.
  • It takes the output of the unembedding step and creates the probability distribution (see the sketch below).
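A minimal sketch of softmax itself:

```python
# Minimal sketch of softmax: turning a list of logits into a probability
# distribution where larger logits get probabilities closer to 1.
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    exps = np.exp(logits - logits.max())       # subtract max for numerical stability
    return exps / exps.sum()

print(softmax([2.0, 1.0, 0.1]))                # sums to 1, largest logit dominates
```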