AI‐24sp‐2024‐05‐22‐Morning - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki
AI Self-Hosting
Week 08
Ways to Think About Progress in AI
AI / ML "Tasks"
- computer vision (image -> text)
- speech synthesis (text -> sound)
- chat (text -> text)
- image generation (text -> image)
Typically, single modality input, single modality output.
GPT-4o ("omni", or multi-modal), released in the past week:

- New modalities for input (encoding video, audio, etc. as (tensors of) numerical data)
- New modalities for output (decoding (tensors of) numerical data to video, audio, text)
- Inference time (delay from when you submit your query to getting a response)
- Fewer shots / examples needed
- Larger models, more parameters (including weights and biases)
- (image, text) -> text
  - "This is a picture I took of a flower while hiking on Mt. Rainier, can you tell me what it is?"
- (image, text) -> (image, text)
  - "This is a photo of a house with a collapsed roof, can you generate an image of what the house will look like with a new roof?"
Stable diffusion images
In some tasks, we reduce one problem to another:
for speech synthesis, a model first takes text to a spectrogram (an image of sound frequencies), then transforms that image into sound.
Attention Mechanism
Connecting Different Modalities
$$ \text{Modality} = \{ \text{Audio}, \text{Video}, \text{Text} \} $$
Something as simple as our MNIST examples already links together two modalities:
- Images -> Text (a single numeric digit, 0-9)
What is one way of thinking about how this connection occurs?
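Concretely, the MNIST link above can be pictured as a function from the image modality to the text modality. Below is a minimal sketch (not a trained model; the weights are random, so only the shapes and the flow of data are the point):

```python
import numpy as np

# Sketch: an MNIST-style classifier as a map from image -> text.
# Random weights stand in for a trained model; the "prediction" is
# meaningless, but the shape of the mapping is real.
rng = np.random.default_rng(0)

image = rng.random((28, 28))            # grayscale image, pixels in [0, 1)
W = rng.standard_normal((10, 28 * 28))  # one score row per digit class 0-9

scores = W @ image.reshape(-1)             # 10 class scores
digit_text = str(int(np.argmax(scores)))   # decode the winning class as text

print(digit_text)  # a single character "0".."9"
```

A trained network replaces `W` with learned layers, but the modality bridge is the same: tensor of pixels in, token of text out.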
From 3blue1brown's Chapter 5: What is a GPT?
Vectors, Vector Spaces, Word Embeddings, Similarity / Difference and Dimensions
Dot Products (Vector Overlap, or Similarity)
Overlap, or cosine similarity, is the amount two vectors coincide when one is projected onto the other.
This projection is equivalent to drawing a line from the tip of one vector until it meets the other at a 90-degree (perpendicular) angle.
For two normalized vectors (of length 1), the dot product is exactly the cosine of the angle between them.
Cosine similarity images from Towards Data Science
Vectors pointing in opposing directions have a negative dot product.
Orthogonal vectors have zero overlap.
- What is a normalized vector's dot product with itself?
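The three cases above (opposing, orthogonal, and identical normalized vectors) can be checked in a few lines of NumPy. A minimal sketch, using 2-D vectors chosen for readability:

```python
import numpy as np

def cosine_similarity(u, v):
    """Overlap of two vectors: dot product divided by the product of their lengths."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # orthogonal to a
c = np.array([-1.0, 0.0])  # points opposite to a

print(cosine_similarity(a, b))  # 0.0  (orthogonal: zero overlap)
print(cosine_similarity(a, c))  # -1.0 (opposing directions: negative)
print(cosine_similarity(a, a))  # 1.0  (a normalized vector with itself)
```

The last line answers the question above: a normalized vector's dot product with itself is 1.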
Attention Mechanism
From 3blue1brown's Chapter 6: Attention mechanism, visually explained
Query Matrix
While the embedding matrix "uplifts" words (token IDs) in a vocabulary to a (usually) higher-dimensional embedding space, where there are many different directions in which words can relate to one another (similar / different), capturing expressiveness,
the query and key matrices "project" down to a (usually) lower-dimensional space, where it is cheaper to measure overlaps (dot products) between each word's query and every other word's key.
Key Matrix
Questions:
- How many query / key matrices are there? How do we choose the number, and how do we train them differently?
- Is the number of query / key matrices another hyperparameter of training?
- Are query and key matrices paired one-to-one?
- How is this number related to single- vs. multi-headed attention mechanisms?
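The down-projection and overlap measurement can be sketched with toy matrix shapes (all sizes here are made up for readability; real models use far larger dimensions, and the convention of tokens-as-columns follows 3blue1brown):

```python
import numpy as np

rng = np.random.default_rng(1)

d_embed, d_head, n_tokens = 8, 4, 5  # toy sizes; real models are much larger
E = rng.standard_normal((d_embed, n_tokens))  # one embedding column per token

W_Q = rng.standard_normal((d_head, d_embed))  # projects down: 8 -> 4 dims
W_K = rng.standard_normal((d_head, d_embed))

Q = W_Q @ E  # queries, one 4-dim column per token
K = W_K @ E  # keys,    one 4-dim column per token

# Overlap of every query with every key: one dot product per token pair.
overlaps = K.T @ Q     # entry (i, j) = key_i . query_j
print(overlaps.shape)  # (5, 5): one score for each of the 5x5 token pairs
```

Each head of attention would carry its own `W_Q` / `W_K` pair, which is one way to read the single- vs. multi-headed question above.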
Masking Attention
We want to disallow later words in a context length (chunk) from influencing earlier words (i.e., earlier words must not "attend to" later ones), so in our query/key overlap matrix we mask out the lower-triangle entries that represent these unwanted influences (in practice, by setting them to negative infinity before the softmax, which sends them to zero).
This is similar to if you are using an AI to predict the next value in time-series data, like housing prices, animal populations, etc. You only want to train on past data, because when you run inference, you don't have access to the future.
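A minimal sketch of the masking step, assuming the tokens-as-columns convention from the previous examples (entry (i, j) of the score grid is key_i . query_j, so the forbidden "later influences earlier" entries fall in the strict lower triangle):

```python
import numpy as np

def softmax_columns(x):
    """Softmax down each column; subtracting the max keeps exp() stable."""
    x = x - x.max(axis=0, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=0, keepdims=True)

n = 4
scores = np.arange(n * n, dtype=float).reshape(n, n)  # stand-in overlap scores

# Token j may only attend to keys i <= j; set the disallowed entries
# (strict lower triangle) to -inf so the softmax sends them to zero.
mask = np.tril(np.full((n, n), -np.inf), k=-1)
attention = softmax_columns(scores + mask)

print(np.allclose(np.tril(attention, k=-1), 0))  # True: later words zeroed out
print(np.allclose(attention.sum(axis=0), 1))     # True: columns still sum to 1
```

The same idea applies to the time-series analogy: masking is how training on a whole chunk at once avoids "peeking at the future."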
Value Matrix / Vector
The (masked) product of the query and key matrices in the previous step tells you how much influence earlier words in a context length (chunk) have on later words.
Multiplying a word's embedding by a value matrix gives a value vector.
Interpretation: if a (previous) word is relevant to (attends to) a later word, the value vector gives the shift to add to the later word's embedding so that it better reflects the two words together (in order).
Questions: (similar to the ones for key-query matrices above)
- How many value matrices are there?
- Are they one-to-one for each key-query matrix, or is there just one for all words? (seems unlikely)
Multiplying a single value matrix by the block of column word vectors gives the value vectors; weighting these by the (masked) attention pattern gives the updates $$\Delta \hat{E}$$ that we add to our embeddings $$\hat{E}$$.
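This last step can be sketched with the same toy shapes as before (a full-rank value matrix for simplicity; real heads usually factor it into a low-rank pair, and the attention pattern here is a random stand-in for the masked, softmaxed scores):

```python
import numpy as np

rng = np.random.default_rng(2)
d_embed, n_tokens = 8, 5
E = rng.standard_normal((d_embed, n_tokens))  # embeddings, one column per token

W_V = rng.standard_normal((d_embed, d_embed))  # value matrix (full-rank sketch;
                                               # real heads factor it low-rank)
V = W_V @ E                                    # one value vector per token

# Stand-in (already masked + softmaxed) attention pattern: column j weights
# how much each earlier token i <= j influences token j.
A = np.triu(rng.random((n_tokens, n_tokens)))
A = A / A.sum(axis=0, keepdims=True)  # each column sums to 1

delta_E = V @ A          # weighted sum of value vectors per token
E_updated = E + delta_E  # nudge each embedding toward its context

print(delta_E.shape)  # (8, 5): same shape as E, one update per token
```

Note that the update for token j mixes in only the value vectors of tokens i <= j, because the attention pattern was masked.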
One Head of Attention
Each head of attention is parameterized by these three matrices: key, query, value, which update all the word embeddings in a context length (chunk) at once.
Parameters / Matrix Shape Check
Our single head of attention so far was based on the example of adjectives updating nouns.