10 Attention Mechanisms

1. Word Embeddings - Vector Representation of Meaning

1.1 The Concept: From One-Hot to Dense Vectors

Before embeddings, we used One-Hot Encoding. If our vocabulary size ($V$) was 50,000, the word "Apple" was a vector of 49,999 zeros and a single one.

Problems with One-Hot Encoding:

  1. Sparsity: every word costs a $V$-dimensional vector that is almost entirely zeros, and any word-by-word structure built from them (e.g., a co-occurrence matrix) is $V \times V$.
  2. Lack of Semantics: The dot product of "Apple" and "Orange" is $0$. The model thinks they are as unrelated as "Apple" and "Physics."

Word Embeddings solve this by mapping words into a continuous, low-dimensional space $\mathbb{R}^d$ (where $d$ is typically 50 to 1024).

1.2 How Vectors are Generated: The Word2Vec Framework

We don't "assign" numbers to words; we train them. The core intuition is the Distributional Hypothesis: "A word is characterized by the company it keeps."

A. The Architecture (Skip-gram)

In the Skip-gram model, we use a center word ($w_t$) to predict its surrounding context words.

  1. Input: A one-hot encoded vector $x \in \mathbb{R}^V$.
  2. Hidden Layer (The Embedding): A weight matrix $W_{embed} \in \mathbb{R}^{V \times d}$.
    • When you multiply the one-hot vector by $W_{embed}$, you are effectively "selecting" one row of the matrix. This row is the Word Embedding.
  3. Output Layer: A second matrix $W_{output} \in \mathbb{R}^{d \times V}$ followed by a Softmax to predict context probabilities.
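A minimal numpy sketch of this forward pass (toy sizes; `W_embed`, `W_output`, and the vocabulary index are illustrative assumptions, not trained values):

```python
import numpy as np

V, d = 10, 4                          # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(V, d))     # hidden layer: the V x d embedding matrix
W_output = rng.normal(size=(d, V))    # output layer: d x V context-prediction matrix

x = np.zeros(V)
x[3] = 1.0                            # one-hot vector for the center word w_t (index 3)

e = x @ W_embed                       # multiplying by a one-hot "selects" row 3 of W_embed
logits = e @ W_output                 # one raw score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax: P(context word | center word)
print(probs.shape)                    # (10,)
```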

B. The Objective Function

We maximize the log-likelihood of observing the context words $c$ given the target word $w$:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m, j \neq 0} \log P(w_{t+j} | w_t)$$

1.3 Mathematical Intuition: The "Lookup Table"

Technically, an Embedding layer is just a Linear Layer without a bias or activation function.

$$e_w = x_w \cdot W_{embed}$$

Where:

  • $x_w$ is the One-Hot vector $[0, 0, 1, ..., 0]$.
  • $W_{embed}$ contains the learnable weights.
  • $e_w$ is the resulting dense vector.
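In code, the "multiply by a one-hot vector" and "index a row" views are interchangeable; here is a quick toy check (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4
W_embed = rng.normal(size=(V, d))     # the learnable weights

idx = 2
x_w = np.zeros(V)
x_w[idx] = 1.0                        # one-hot vector [0, 0, 1, ..., 0]

# The dense vector e_w from the matrix product equals the idx-th row of W_embed.
assert np.allclose(x_w @ W_embed, W_embed[idx])
```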

During Backpropagation: If the model sees "Coffee" and "Tea" in similar contexts (e.g., "...drank hot ___"), the gradients will push the vectors for "Coffee" and "Tea" closer together in the $d$-dimensional space to minimize the loss.

1.4 Geometric Properties

Once trained, these vectors exhibit spatial relationships, typically measured with Cosine Similarity: $$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Note: This leads to the famous linguistic analogies: $$Vector(\text{"King"}) - Vector(\text{"Man"}) + Vector(\text{"Woman"}) \approx Vector(\text{"Queen"})$$
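A direct translation of the formula; the four 3-d "embeddings" below are made-up toy numbers, not real Word2Vec output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (|A| |B|)"""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([0.9, 0.8, 0.1])     # toy vectors: dim 0 ~ "royalty", dim 1 ~ "male", ...
man   = np.array([0.5, 0.9, 0.1])
woman = np.array([0.5, 0.1, 0.9])
queen = np.array([0.9, 0.1, 0.9])

analogy = king - man + woman          # Vector("King") - Vector("Man") + Vector("Woman")
print(cosine_similarity(analogy, queen))   # ~0.997 for these toy numbers
```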


2. Positional Encoding (PE)

2.1 The Need for Position

As established in Section 1, embeddings provide semantic meaning but lack structural order. Because the Transformer's attention mechanism is permutation invariant (treating a sequence like a "bag of words"), we must inject positional information.

  • The Problem: Without PE, "Dog bites man" and "Man bites dog" result in identical hidden representations.
  • The Solution: We add a unique positional signal to each token embedding.

2.2 Sinusoidal Positional Encoding

The original Transformer uses fixed sinusoidal functions of varying frequencies to map each position $pos$ and dimension index $i$ to a unique value.

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Integration Method:

Unlike some architectures that concatenate, Transformers add the encoding directly to the embedding.

$$\text{Input to Model} = \text{Token Embedding} + \text{Positional Encoding}$$

Note: $i$ refers to the dimension index within the embedding vector. This ensures that every "slot" in the $d_{model}$ vector is modified by a unique wave frequency.
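A compact numpy implementation of the encoding table (assumes an even `d_model`; the function and variable names are my own):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Return the (max_len, d_model) sinusoidal positional encoding table."""
    pos = np.arange(max_len)[:, None]                # positions 0 .. max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angles = pos / (10000.0 ** (two_i / d_model))    # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even slots: sine
    pe[:, 1::2] = np.cos(angles)                     # odd slots: cosine
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)
# Usage: input_to_model = token_embeddings + pe[:seq_len]
```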

2.3 Why Sinusoids Work

Sinusoidal encodings were chosen for several mathematical advantages:

  1. Uniqueness: Every position $(0, 1, 2, \dots)$ results in a unique vector signature.
  2. Smooth Variation: Nearby positions have similar encoding vectors, helping the model generalize distance.
  3. Relative Positions: Because of trigonometric identities, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This allows the model to easily learn to attend by relative distances (See 2.4).
  4. Extrapolation: Theoretically, this method allows the model to handle sequence lengths longer than those seen during training (unlike learned embeddings).

2.4 Mathematical Proof: Relative Position Property

A key property of sinusoids is that $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$. This allows the model to easily learn and attend to relative distances.

The Rotation Matrix

For each pair of dimensions $(2i, 2i+1)$ with frequency $\omega_i = 1/10000^{2i/d_{model}}$, the transformation from position $pos$ to position $pos+k$ is a Rotation Matrix:

$$\begin{bmatrix} PE_{(pos+k,\,2i)} \\ PE_{(pos+k,\,2i+1)} \end{bmatrix} = \underbrace{\begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}}_{M_k} \begin{bmatrix} PE_{(pos,\,2i)} \\ PE_{(pos,\,2i+1)} \end{bmatrix}$$

Why this matters:

  1. Linearity: The transformation $M_k$ consists only of constants ($\sin(\omega_i k)$ and $\cos(\omega_i k)$).
  2. Position Independence: The matrix $M_k$ depends only on the distance $k$, not the absolute position $pos$.
  3. Intuition: The model can learn to attend to "the word 3 positions to the left" using the same weights regardless of where in the sentence it is.
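The property is easy to verify numerically for a single $(2i, 2i+1)$ pair (numpy, arbitrary toy values):

```python
import numpy as np

d_model, i = 16, 2
omega = 1.0 / 10000 ** (2 * i / d_model)       # frequency omega_i of this dimension pair
pos, k = 7, 3                                  # absolute position and offset

pe_pos = np.array([np.sin(omega * pos), np.cos(omega * pos)])
M_k = np.array([[ np.cos(omega * k), np.sin(omega * k)],
                [-np.sin(omega * k), np.cos(omega * k)]])   # depends on k only, not pos

pe_pos_k = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])
assert np.allclose(M_k @ pe_pos, pe_pos_k)     # M_k rotates PE_pos onto PE_{pos+k}
```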

2.5 Comparison: Fixed vs. Learned Encodings

Unlike the fixed formula, Learned Positional Encodings treat positions as trainable parameters. We initialize a random embedding matrix of size $(\text{MaxLen} \times d)$. During training, the model learns a unique vector for "Index 0," "Index 1," etc., just as it learns vectors for "Apple" or "Orange." While this allows the model to adapt specifically to the training data's structure, it cannot generalize to sequences longer than the predefined $\text{MaxLen}$.

| Feature | Fixed (Sinusoidal) | Learned (Embedding) |
| --- | --- | --- |
| Parameters | $0$ (calculated via formula) | $\text{MaxLen} \times d$ (trainable) |
| Extrapolation | Can generalize to longer sequences | Limited to the max length seen in training |
| Usage | Original Transformer | BERT, GPT |

3. The Attention Mechanism (Sequence-to-Sequence)

3.1 The Encoder-Decoder Bottleneck

In a standard Sequence-to-Sequence (Seq2Seq) model, the Encoder must compress the entire input sentence into a single fixed-size context vector ($c$).

The Problem:

  1. Information Loss: For long sentences, a single vector cannot capture all the nuances.
  2. Vanishing Gradients: By the time the decoder starts generating, the signal from the beginning of the input sentence has often faded, and gradients flowing back through the long recurrent chain shrink toward zero.
  3. Rigidity: The model must "force-fit" variable-length input into a fixed-length bottleneck.

3.2 The Attention Intuition: "Looking Back"

Instead of forcing the Decoder to rely on one static vector, Attention allows the Decoder to "look back" at all hidden states produced by the Encoder.

  • Human Analogy: When a human translates "The cat sat on the mat." (English) into "Le chat était assis sur le tapis." (French), then as they write the French word "chat", they are looking specifically at the English word "cat".
  • Mechanism: The model assigns a "weight" to each input word based on its relevance to the current word being generated.

3.3 Mathematical Formulation of Attention

Attention is computed in three distinct steps for every timestep $t$ in the Decoder.

Step 1: Scoring ($e_{ti}$)

Compare the current Decoder hidden state ($s_t$) with every Encoder hidden state ($h_i$) to find a "match" score. $$e_{ti} = \text{score}(s_t, h_i)$$

Step 2: Normalization ($\alpha_{ti}$)

Convert these raw scores into a probability distribution using the Softmax function. $$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j} \exp(e_{tj})}$$

  • Property: $\sum_{i} \alpha_{ti} = 1$. This ensures the "attention" is distributed across the input.

Step 3: The Context Vector ($c_t$)

Calculate a weighted sum of all Encoder states. Unlike the bottleneck model, this $c_t$ is dynamic and changes at every timestep $t$. $$c_t = \sum_{i} \alpha_{ti} h_i$$
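The three steps for one decoder timestep, written out in numpy. The notes leave the `score` function generic; this sketch uses a plain dot product for concreteness (Bahdanau and Luong each define their own variants), and random vectors stand in for real LSTM states:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                      # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
n, d = 5, 8                              # input length, hidden size
H = rng.normal(size=(n, d))              # encoder hidden states h_1 .. h_n
s_t = rng.normal(size=d)                 # current decoder hidden state

e_t = H @ s_t                            # Step 1: scores e_ti = score(s_t, h_i)
alpha_t = softmax(e_t)                   # Step 2: attention weights (sum to 1)
c_t = alpha_t @ H                        # Step 3: context vector, recomputed every step
```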

3.4 Architecture Flow

```mermaid
graph LR
    subgraph Encoder
    h1[h1: The]
    h2[h2: cat]
    h3[h3: sat]
    end

    subgraph Attention_Layer
    Score[Score & Softmax]
    end

    subgraph Decoder
    st[st: Decoder State]
    ct[ct: Context Vector]
    end

    h1 --> Score
    h2 --> Score
    h3 --> Score
    st --> Score
    Score --> ct
    ct --> Output[Output: chat]
```

💡 Tutor Deep Dive: Word Embeddings vs. Attention Context Vectors

One common point of confusion in exams is the difference between the Word Embedding ($e_w$) we discussed in Section 1 and the Context Vector ($c_t$) discussed here. Here is the breakdown:

1. The Inputs (Static)

  • What: Word Embeddings (from Word2Vec/GloVe).
  • Source: A pre-trained lookup table.
  • Property: The vector for "bank" is always the same, regardless of whether it's a "river bank" or a "savings bank."
  • Role: These are the inputs fed into the Encoder RNN.

2. The Hidden States (Encoded)

  • What: Encoder Hidden States ($h_i$).
  • Source: The output of an LSTM/RNN at each timestep.
  • Property: These are context-aware. The $h_i$ for "bank" will look different if the previous words were "river" vs. "money."

3. The Context Vector (Dynamic)

  • What: The Attention Context Vector ($c_t$).
  • Source: A weighted sum of all $h_i$ ($c_t = \sum \alpha_{ti} h_i$).
  • Property: This is task-specific. At decoder step $t=1$, $c_t$ might focus on the Subject. At $t=2$, it might focus on the Verb.
  • Role: It effectively "solves" the bottleneck by allowing the Decoder to selectively ignore or emphasize specific parts of the input sentence at every step of the translation.

| Feature | Word Embedding | Hidden State ($h_i$) | Context Vector ($c_t$) |
| --- | --- | --- | --- |
| Level | Word-level | Sequence-level | Decoder-level |
| Context? | No (Static) | Yes (Local/Past) | Yes (Global/Relevant) |
| Changes? | Fixed after training | Changes per word | Changes per output word |

4. Self-Attention (QKV)

4.1 Concept: A Sequence "Talking to Itself"

Unlike the standard Attention in Section 3 (where a Decoder looks at an Encoder), Self-Attention allows a single sequence to interact with itself. This enables the model to create context-aware representations of words.

  • The Problem with RNNs: In an RNN, the "meaning" of a word is limited by how much information can survive the sequential chain.
  • The Self-Attention Solution: Every word in a sentence is compared to every other word simultaneously.
  • Example: In the sentence "The animal didn't cross the street because it was too tired," self-attention allows the model to associate "it" specifically with "animal" rather than "street."

4.2 The Database Analogy: Query, Key, and Value

To calculate these relationships, each input vector $x_i$ (which is Word Embedding + Positional Encoding) is projected into three distinct spaces using learned weight matrices $W_Q, W_K, W_V$.

| Component | Variable | Analogy | Technical Role |
| --- | --- | --- | --- |
| Query | $q_i = x_i W_Q$ | "The Search Term" | What the current word is looking for in others. |
| Key | $k_i = x_i W_K$ | "The Label" | How the current word identifies itself to others. |
| Value | $v_i = x_i W_V$ | "The Content" | The actual information the word contributes. |

4.3 The Scaled Dot-Product Attention Formula

For a sequence of words represented as a matrix $X$, the attention output is calculated as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Step-by-Step Breakdown:

  1. Similarity ($QK^T$): We compute the dot product of the Query matrix and the Transposed Key matrix. This produces a "Score Matrix" representing how much every word should attend to every other word.
  2. Scaling ($\frac{1}{\sqrt{d_k}}$): We divide by the square root of the dimension of the keys to keep the score variance stable (motivated in 4.4 below).
  3. Normalization (Softmax): We apply Softmax to ensure all attention weights for a given word sum to 1.
  4. Weighted Sum ($\times V$): We multiply the weights by the Value matrix to get the final context-aware representation.
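The four steps map almost line-for-line onto numpy (toy shapes; the random projections stand in for learned $W_Q, W_K, W_V$):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # Steps 1-2: similarity + scaling
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                # Step 3: each row sums to 1
    return w @ V                                         # Step 4: weighted sum of values

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))                        # embeddings + positional encodings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)   # shape (n, d_k)
```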

4.4 The Scaling Factor ($\sqrt{d_k}$)

A critical component of the Attention formula is the division by $\sqrt{d_k}$ (the square root of the dimension of the Key vectors). This is not just a heuristic; it is an engineering requirement for stable training.

The Mathematical Problem:

As the dimension $d_k$ increases, the magnitude of the dot product $Q \cdot K^T$ tends to grow large. If we assume the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$, their dot product has:

  • Mean: $0$
  • Variance: $d_k$

The Consequence (Softmax Saturation):

When the variance grows linearly with $d_k$, the raw scores fed into the Softmax function can become extremely large in magnitude.

  1. Softmax converts these into values very close to $1$ or $0$.
  2. In these "flat" regions of the Softmax curve, the gradient is extremely small.
  3. This leads to the Vanishing Gradient problem, where the model stops learning because the weights $W_Q$ and $W_K$ cannot be updated effectively.

The Solution: By scaling by $\frac{1}{\sqrt{d_k}}$, we bring the variance of the dot product back to $1$, ensuring the Softmax stays in a region where gradients are steep enough for backpropagation.
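The variance claim can be checked empirically in a few lines (numpy; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 512, 10_000
q = rng.normal(size=(trials, d_k))       # components: mean 0, variance 1
k = rng.normal(size=(trials, d_k))
dots = (q * k).sum(axis=1)               # one q . k dot product per trial

print(dots.var())                        # ~ 512 (= d_k): softmax would saturate
print((dots / np.sqrt(d_k)).var())       # ~ 1: softmax stays in its steep region
```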

💡 Tutor Deep Dive: The Library Analogy

To internalize $Q, K, V$ for your exams, think of a modern digital library system:

  1. Query ($Q$): You type a search term into the system: "How to train a dog." This is what the current word is looking for.
  2. Key ($K$): The system looks at the "Metadata" or "Index Labels" on the spine of every book in the library. One book is labeled "Canine Behavior", another is labeled "Baking Bread".
  3. Similarity Matrix ($QK^T$): The system calculates how well your search term matches every label. "How to train a dog" is a $0.95$ match for "Canine Behavior" but a $0.02$ match for "Baking Bread".
  4. Value ($V$): Once the match is confirmed, the system doesn't just give you the label; it gives you the actual pages/knowledge inside the book.

Key Takeaway: In Self-Attention, every word is simultaneously the Searcher ($Q$), the Index Label ($K$), and the Knowledge ($V$). After one layer of Attention, each word's original meaning is "updated" by the knowledge it pulled from the other words in the sentence.

4.5 Why Self-Attention is Powerful

Self-attention is the primary reason Transformers replaced RNNs (LSTMs/GRUs). Its superiority is defined by three key metrics:

1. Path Length (Long-Range Dependencies)

In an RNN, to connect the first word to the last word in a sequence of length $n$, the signal must pass through $n$ hidden state updates. This often leads to "forgetting" (vanishing gradients). In Self-Attention, the "path" between any two words is always $O(1)$—every word is just one dot-product away from every other word.

2. Parallelizability

RNNs are sequential; you cannot compute the 10th word until the 9th is finished. Self-Attention relies on matrix multiplications ($Q$, $K$, $V$), which modern GPUs can compute for the entire sentence simultaneously: the number of sequential operations drops to $O(1)$, versus $O(n)$ for RNNs.

3. Computational Complexity

While Self-Attention is faster to parallelize, it has a "hidden cost" regarding sequence length ($n$):

| Layer Type | Complexity per Layer | Max Path Length | Sequential Operations |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent (RNN) | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(\log_k(n))$ | $O(1)$ |

  • $n$: Sequence length.
  • $d$: Representation dimension ($d_{model}$).
  • $k$: Kernel size (for CNNs).

The Trade-off: Self-Attention is efficient when the dimension $d$ is larger than the sequence length $n$. However, because of the $O(n^2)$ complexity, Self-Attention becomes very "expensive" as sentences get extremely long (e.g., long documents or books).

💡 Tutor Deep Dive: The "N-Squared" Curse

When your professor asks about the "weakness" of Self-Attention, this is it: Quadratic Complexity. Because every word must look at every other word, if you double the length of your sentence, the computation doesn't double—it quadruples. This is why models like GPT or BERT have a "Context Window" limit (like 512 or 2048 tokens). Beyond that, the memory requirements for the $N \times N$ attention matrix ($QK^T$) exceed the capacity of even the most powerful GPUs.
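Back-of-the-envelope arithmetic for the $n \times n$ weight matrix (float32, one head, one layer; real models multiply this by heads, layers, and batch size):

```python
# Doubling n quadruples the attention-matrix memory.
for n in (512, 2048, 8192, 32768):
    mib = n * n * 4 / 2**20              # n^2 float32 entries, 4 bytes each
    print(f"n = {n:>5}: {mib:>6.0f} MiB")
# n =   512:      1 MiB
# n =  2048:     16 MiB
# n =  8192:    256 MiB
# n = 32768:   4096 MiB
```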


5. Multi-Head Attention (MHA)

5.1 The Limitation of a Single Head

A single attention head produces only one attention pattern for a given input. However, natural language is complex, and words often share multiple types of relationships at the same time:

  • Syntactic: Focusing on grammar, such as subject-verb agreement (e.g., "The cat sat").
  • Coreference: Connecting pronouns back to their nouns (e.g., "cat ... it").
  • Semantic: Identifying situational context, such as locations (e.g., "sat ... mat").

A single head must "choose" which of these relationships to prioritize. By using multiple heads, the model can capture all these nuances in parallel.

5.2 The Multi-Head Mechanism

Multi-Head Attention splits the model's representation into several smaller subspaces to perform independent attention operations.

  1. Projection: The input $X$ is projected into $h$ different sets of Queries, Keys, and Values using learned weights.
  2. Parallel Attention: Each "head" calculates its own self-attention independently.
  3. Concatenation: The results from all $h$ heads are joined (concatenated) together.
  4. Final Linear Layer: A final weight matrix ($W_O$) projects the concatenated output back into the original model dimension.

5.3 The Multi-Head Formula

Mathematically, Multi-Head Attention is defined as:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W_O$$

Where each individual head is computed as:

$$\text{head}_i = \text{Attention}(QW_Q^i, KW_K^i, VW_V^i)$$

Dimensional Efficiency

To keep the computational cost similar to single-head attention, we divide the total model dimension ($d_{model}$) by the number of heads ($h$): $$d_k = \frac{d_{model}}{h}$$

For example, if the model has a dimension of $768$ and $12$ heads, each head only works with a dimension of $64$. This ensures the total parameter count remains manageable.
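A numpy sketch of the whole pipeline. For brevity it slices one fused projection per $Q$, $K$, $V$ into per-head subspaces instead of keeping $h$ separate $W^i$ matrices; the two formulations are equivalent, and the fused form is common in practice:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Project, attend in h parallel subspaces, concatenate, project with W_O."""
    n, d_model = X.shape
    d_k = d_model // h                                   # d_k = d_model / h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)                # this head's subspace
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)            # row-wise softmax
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_O          # Concat(head_1..head_h) W_O

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)     # shape (n, d_model)
```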

5.4 Typical Configurations

The following table shows how popular models configure their attention heads:

| Model | $d_{model}$ | Heads ($h$) | $d_k$ per head |
| --- | --- | --- | --- |
| BERT-base | 768 | 12 | 64 |
| BERT-large | 1024 | 16 | 64 |
| GPT-2 | 768 | 12 | 64 |
| GPT-3 | 12288 | 96 | 128 |

💡 Tutor Deep Dive: The "Committee of Experts"

Think of Multi-Head Attention like a team of specialized researchers analyzing a document:

  • Researcher 1 (Head 1): Only looks for Grammar errors.
  • Researcher 2 (Head 2): Only looks for Factual inconsistencies.
  • Researcher 3 (Head 3): Only looks for Emotional tone.

Each researcher is "smaller" and more focused than a single person trying to find everything. By the end of the day, you concatenate their individual reports into one comprehensive summary ($W_O$). This committee approach is why Transformers are so much better at understanding the "flavor" and "structure" of language than previous models.