AI‐24sp‐2024‐05‐22‐Morning - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki
AI Self-Hosting
Week 08
Ways to Think About Progress in AI
AI / ML "Tasks"
- computer vision (image -> text)
- speech synthesis (text -> sound)
- chat (text -> text)
- image generation (text -> image)
Typically, single modality input, single modality output.
GPT-4o ("omni", or multi-modal), released in the past week:

- New modalities for input (encoding video, audio, etc. as (tensors of) numerical data)
- New modalities for output (decoding (tensors of) numerical data to video, audio, text)
- Inference time (delay from when you submit your query to getting a response)
- Fewer shots / examples needed
- Larger models, more parameters (including weights and biases)
- (image, text) -> text
  - "This is a picture I took of a flower while hiking on Mt. Rainier, can you tell me what it is?"
- (image, text) -> (image, text)
  - "This is a photo of a house with a collapsed roof, can you generate an image of what the house will look like with a new roof?"
Stable diffusion images
In some tasks, we reduce one problem to another:
for speech synthesis, a model first takes text to a spectrogram (an image of sound frequencies), then transforms that image into sound.
Attention Mechanism
Connecting Different Modalities
$$ \text{Modality} = \{ \text{Audio}, \text{Video}, \text{Text} \} $$
Something as simple as our MNIST examples already links together two modalities:
- Images -> Text (a single numeric digit, 0-9)
What is one way of thinking about how this connection occurs?
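Concretely, the MNIST link above can be pictured as a function from the image modality to the text modality. Below is a minimal sketch (not a trained model; the weights are random, so only the shapes and the flow of data are the point):

```python
import numpy as np

# Sketch: an MNIST-style classifier as a map from image -> text.
# Random weights stand in for a trained model; the "prediction" is
# meaningless, but the shape of the mapping is real.
rng = np.random.default_rng(0)

image = rng.random((28, 28))            # grayscale image, pixels in [0, 1)
W = rng.standard_normal((10, 28 * 28))  # one score row per digit class 0-9

scores = W @ image.reshape(-1)             # 10 class scores
digit_text = str(int(np.argmax(scores)))   # decode the winning class as text

print(digit_text)  # a single character "0".."9"
```

A trained network replaces `W` with learned layers, but the modality bridge is the same: tensor of pixels in, token of text out.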
From 3blue1brown's Chapter 5: What is a GPT?
Vectors, Vector Spaces, Word Embeddings, Similarity / Difference and Dimensions
Dot Products (Vector Overlap, or Similarity)
Overlap, or cosine similarity, is the amount two vectors coincide when one is projected onto the other.
This projection is equivalent to drawing a line from the tip of one vector until it meets the other at a 90-degree (perpendicular) angle.
For two normalized vectors (of length 1), the dot product is exactly the cosine of the angle between them.
Cosine similarity images from Towards Data Science
Vectors pointing in opposing directions have a negative dot product.
Orthogonal vectors have zero overlap.
- What is a normalized vector's dot product with itself?
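The three cases above (opposing, orthogonal, and identical normalized vectors) can be checked in a few lines of NumPy. A minimal sketch, using 2-D vectors chosen for readability:

```python
import numpy as np

def cosine_similarity(u, v):
    """Overlap of two vectors: dot product divided by the product of their lengths."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # orthogonal to a
c = np.array([-1.0, 0.0])  # points opposite to a

print(cosine_similarity(a, b))  # 0.0  (orthogonal: zero overlap)
print(cosine_similarity(a, c))  # -1.0 (opposing directions: negative)
print(cosine_similarity(a, a))  # 1.0  (a normalized vector with itself)
```

The last line answers the question above: a normalized vector's dot product with itself is 1.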
Attention Mechanism
From 3blue1brown's Chapter 6: Attention mechanism, visually explained
Query Matrix
While the embedding matrix "uplifts" words (token IDs) in a vocabulary to a (usually) higher-dimensional embedding space, where there are many different directions in which words can relate to one another (similar / different), capturing expressiveness,
the query and key matrices "project" down to a (usually) lower-dimensional space, where it is cheaper to measure overlaps (dot products) between each word's query and every other word's key.
Key Matrix
Questions:
- How many query / key matrices are there? How do we choose the number, and how do we train them differently?
- Is the number of query / key matrices another hyperparameter of training?
- Are query and key matrices paired one-to-one?
- How is this number related to single- vs. multi-headed attention mechanisms?
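The down-projection and overlap measurement can be sketched with toy matrix shapes (all sizes here are made up for readability; real models use far larger dimensions, and the convention of tokens-as-columns follows 3blue1brown):

```python
import numpy as np

rng = np.random.default_rng(1)

d_embed, d_head, n_tokens = 8, 4, 5  # toy sizes; real models are much larger
E = rng.standard_normal((d_embed, n_tokens))  # one embedding column per token

W_Q = rng.standard_normal((d_head, d_embed))  # projects down: 8 -> 4 dims
W_K = rng.standard_normal((d_head, d_embed))

Q = W_Q @ E  # queries, one 4-dim column per token
K = W_K @ E  # keys,    one 4-dim column per token

# Overlap of every query with every key: one dot product per token pair.
overlaps = K.T @ Q     # entry (i, j) = key_i . query_j
print(overlaps.shape)  # (5, 5): one score for each of the 5x5 token pairs
```

Each head of attention would carry its own `W_Q` / `W_K` pair, which is one way to read the single- vs. multi-headed question above.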
Masking Attention
We want to disallow later words in a context length (chunk) from influencing earlier words (i.e., earlier words must not "attend to" later ones), so in our query/key overlap matrix we mask out the lower-triangle entries that represent these unwanted influences (in practice, by setting them to negative infinity before the softmax, which sends them to zero).
This is similar to if you are using an AI to predict the next value in time-series data, like housing prices, animal populations, etc. You only want to train on past data, because when you run inference, you don't have access to the future.
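A minimal sketch of the masking step, assuming the tokens-as-columns convention from the previous examples (entry (i, j) of the score grid is key_i . query_j, so the forbidden "later influences earlier" entries fall in the strict lower triangle):

```python
import numpy as np

def softmax_columns(x):
    """Softmax down each column; subtracting the max keeps exp() stable."""
    x = x - x.max(axis=0, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=0, keepdims=True)

n = 4
scores = np.arange(n * n, dtype=float).reshape(n, n)  # stand-in overlap scores

# Token j may only attend to keys i <= j; set the disallowed entries
# (strict lower triangle) to -inf so the softmax sends them to zero.
mask = np.tril(np.full((n, n), -np.inf), k=-1)
attention = softmax_columns(scores + mask)

print(np.allclose(np.tril(attention, k=-1), 0))  # True: later words zeroed out
print(np.allclose(attention.sum(axis=0), 1))     # True: columns still sum to 1
```

The same idea applies to the time-series analogy: masking is how training on a whole chunk at once avoids "peeking at the future."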
Value Matrix / Vector
The (masked) product of the query and key matrices in the previous step tells you how much influence earlier words in a context length (chunk) have on later words.
Multiplying a word's embedding by a value matrix gives a value vector.
Interpretation: if a (previous) word is relevant to (attends to) a later word, the value vector gives the shift to add to the later word's embedding so that it better reflects the two words together (in order).
Questions: (similar to the ones for key-query matrices above)
- How many value matrices are there?
- Are they one-to-one for each key-query matrix, or is there just one for all words? (seems unlikely)
Multiplying a single value matrix by the block of column word vectors gives the value vectors; weighting these by the (masked) attention pattern gives the updates $$\Delta \hat{E}$$ that we add to our embeddings $$\hat{E}$$.
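This last step can be sketched with the same toy shapes as before (a full-rank value matrix for simplicity; real heads usually factor it into a low-rank pair, and the attention pattern here is a random stand-in for the masked, softmaxed scores):

```python
import numpy as np

rng = np.random.default_rng(2)
d_embed, n_tokens = 8, 5
E = rng.standard_normal((d_embed, n_tokens))  # embeddings, one column per token

W_V = rng.standard_normal((d_embed, d_embed))  # value matrix (full-rank sketch;
                                               # real heads factor it low-rank)
V = W_V @ E                                    # one value vector per token

# Stand-in (already masked + softmaxed) attention pattern: column j weights
# how much each earlier token i <= j influences token j.
A = np.triu(rng.random((n_tokens, n_tokens)))
A = A / A.sum(axis=0, keepdims=True)  # each column sums to 1

delta_E = V @ A          # weighted sum of value vectors per token
E_updated = E + delta_E  # nudge each embedding toward its context

print(delta_E.shape)  # (8, 5): same shape as E, one update per token
```

Note that the update for token j mixes in only the value vectors of tokens i <= j, because the attention pattern was masked.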
One Head of Attention
Each head of attention is parameterized by these three matrices: key, query, value, which update all the word embeddings in a context length (chunk) at once.
Parameters / Matrix Shape Check
Our single head of attention so far was based on the example of adjectives updating nouns.