11 The Transformer Architecture (Encoder-Decoder)
1. High-Level Architecture Overview
The Transformer follows an Encoder-Decoder structure:
- Encoder: Processes the input sequence into a continuous representation (contextual embeddings).
- Decoder: Uses the encoder's representation to generate an output sequence, one token at a time.
```mermaid
graph TD
    In[Input Tokens] --> PE1[Embeddings + Positional Encoding]
    PE1 --> Enc[Encoder Stack]
    Enc --> Cross[Cross-Attention Bridge]
    Out[Shifted Output Tokens] --> PE2[Embeddings + Positional Encoding]
    PE2 --> Dec[Decoder Stack]
    Cross --> Dec
    Dec --> Lin[Linear + Softmax]
    Lin --> Prob[Output Probabilities]
```
The Information Flow: From Context to Generation
The Transformer architecture represents a fundamental departure from the sequential processing of Recurrent Neural Networks (RNNs). At its core, the model is designed to handle long-range dependencies and enable massive parallelization by replacing recurrence with a mechanism known as Attention. The diagram above illustrates the macro-level interaction between the two primary components: the Encoder and the Decoder.
The process begins at the Input Stage, where raw tokens are converted into high-dimensional vectors via Embeddings. Because the Transformer lacks an inherent sense of sequence, Positional Encodings are injected into these embeddings to preserve the structural order of the sentence. This combined representation then flows into the Encoder Stack. The Encoder acts as a deep feature extractor; it processes the entire input sequence simultaneously using bidirectional self-attention. This allows every token to "attend" to every other token in the sequence, resulting in a rich, contextualized latent representation where the meaning of a specific word (e.g., "bank") is informed by its surrounding context ("river" vs. "money").
Once the Encoder has distilled the input into these complex feature vectors, they are passed across the Cross-Attention Bridge. This is the critical intersection where the model aligns the source data with the target output. The Decoder uses this information to begin the generation process. Unlike the Encoder, the Decoder is auto-regressive—it generates the output sequence one token at a time. It utilizes its own previous outputs as its current input, ensuring that the generation remains coherent based on what has already been produced, while simultaneously "looking back" at the Encoder's notes to stay grounded in the original input.
Finally, the Decoder’s internal representations are passed through a Linear layer and a Softmax function. This converts abstract hidden states into a probability distribution across the entire vocabulary, allowing the model to select the most likely next token. This cycle repeats—appending each new token to the decoder input—until an end-of-sequence signal is reached. By decoupling the "understanding" phase from the "generation" phase, the Transformer achieves a level of efficiency and accuracy that forms the basis of all modern Large Language Models (LLMs).
Tutor Insight: Think of the Encoder as the "reader" that understands the full context of the source text, and the Decoder as the "writer" that generates the target text while periodically looking back at the "reader's" notes.
2. Input Processing: Positional Encoding
Since the Transformer does not use recurrence or convolution, it has no inherent way of knowing the relative or absolute positions of tokens in a sequence. To address this, we inject Positional Encodings into the input embeddings.
The Problem of Permutation Invariance
Without positional info, the self-attention mechanism treats the sequence as a "bag of words." The attention score between "Dog" and "Bites" would be identical regardless of whether the sentence is "Dog bites man" or "Man bites dog."
Implementation: Sinusoidal Encodings
The original Transformer uses fixed trigonometric functions to generate a unique signature for each position $pos$ and dimension $i$:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ $$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
- Why Sine/Cosine? These functions allow the model to easily learn to attend by relative positions, as for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
- Integration: These encoding vectors are added (not concatenated) directly to the input word embeddings.
Deep Dive: Why add instead of concatenate? Adding preserves the dimensionality ($d_{model}$), keeping the model computationally efficient. Even though it "corrupts" the embedding slightly, the model learns during training to separate the semantic signal from the positional signal.
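Below is a minimal PyTorch sketch of these formulas. The function name `sinusoidal_positional_encoding` and the toy shapes are illustrative choices, not from the original paper:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos encodings from Section 2; returns shape [max_len, d_model]."""
    position = torch.arange(max_len).unsqueeze(1)                 # [max_len, 1]
    # 1 / 10000^(2i/d_model) for each even dimension index 2i
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)                  # odd dims: cosine
    return pe

# The encodings are *added* to the embeddings, preserving d_model:
embeddings = torch.randn(1, 10, 512)                              # [batch, seq, d_model]
x = embeddings + sinusoidal_positional_encoding(10, 512).unsqueeze(0)
```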
3. The Encoder: Multi-Head Self-Attention (MHSA)
The Encoder's job is to create a high-level representation of the input. It consists of a stack of $N$ identical layers (typically $N=6$). The output of each layer serves as the input to the next, allowing the model to learn increasingly abstract features of the language.
```mermaid
graph BT
    subgraph Encoder_Block
        MHSA[Multi-Head Self-Attention] --> AN1[Add & Norm]
        AN1 --> FFN[Step C: Feed Forward]
        FFN --> AN2[Add & Norm]
    end
    In[Input Tokens + Positional Encoding] --> MHSA
    AN2 --> Out[Encoder Output]
```
Step A: Scaled Dot-Product Attention
The fundamental building block is the Attention mechanism. For every input, we generate three vectors via learned weight matrices $W^Q, W^K, W^V$:
- Query ($Q$): Represents the current token looking for context.
- Key ($K$): Represents all tokens being looked at (acts as an index).
- Value ($V$): The actual content information to be extracted.
The attention output is calculated as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
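As a concrete reference, here is a minimal PyTorch sketch of this formula (the `mask` argument anticipates the Decoder in Section 4.1; names and shapes are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with Q/K/V: [..., seq, d_k]."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)        # [..., seq_q, seq_k]
    if mask is not None:                                     # used by the Decoder (4.1)
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                      # rows sum to 1
    return weights @ V
```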
Step B: The "Multi-Head" Concept
Rather than performing a single attention function, we project $Q, K, V$ into $h$ separate lower-dimensional subspaces called "heads" (each of dimension $d_k = d_{model}/h$) and run attention on each in parallel.
- Parallel Learning: Each head can focus on a different type of relationship (e.g., one head focuses on grammar, another on coreference resolution like "he" referring to "John").
- Concatenation: The outputs of all heads are concatenated and projected back to the original dimension.
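A hedged sketch of the multi-head wrapper, reusing the `scaled_dot_product_attention` function from Step A (the default sizes $d_{model}=512$, $h=8$ follow the original paper; the class name is illustrative):

```python
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Project into h subspaces, attend in parallel, concatenate, project back."""
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)   # learned projections W^Q, W^K, W^V
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection after concatenation

    def forward(self, x, mask=None):
        B, S, _ = x.shape
        # Reshape so each head sees its own d_k-dimensional slice: [B, h, S, d_k]
        q, k, v = (W(x).view(B, S, self.h, self.d_k).transpose(1, 2)
                   for W in (self.W_q, self.W_k, self.W_v))
        out = scaled_dot_product_attention(q, k, v, mask)      # sketch from Step A
        out = out.transpose(1, 2).contiguous().view(B, S, -1)  # concatenate the heads
        return self.W_o(out)
```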
Tutor Insight: In exams, you might be asked why we scale by $\sqrt{d_k}$. Remember: as dimensions grow, the dot product $QK^\top$ can grow very large in magnitude, pushing the softmax into regions with extremely small gradients. The scaling factor "tames" these values to ensure stable backpropagation.
Step C: Position-wise Feed-Forward Network (FFN)
While the Attention mechanism is responsible for "routing" information between tokens, the FFN is responsible for "processing" that information. It is applied to each token position separately and identically.
The Formula
The FFN consists of two linear transformations with a ReLU activation in between:
$$\text{FFN}(x) = \left[ \text{ReLU}(xW_1 + b_1) \right] W_2 + b_2$$
Key Characteristics
- Expansion-Contraction: It follows an "expand-then-contract" structure (an inverted bottleneck). The first layer expands the input dimension ($d_{model} = 512$) to a larger latent space ($d_{ff} = 2048$), and the second layer projects it back to $d_{model}$.
- Position-wise: The same weights ($W_1, W_2$) are applied to every token in the sequence. There is no interaction between different tokens in this step.
Why is it needed?
- Non-Linearity: For a fixed set of attention weights, self-attention is just a weighted average of the value vectors (a linear combination). The FFN introduces the non-linearity (via ReLU) necessary for the model to learn complex, high-order functions.
- Knowledge Storage: If Attention is the "search engine" that finds relevant context, the FFN is the "processor" that transforms those raw signals into meaningful features. It acts as a local memory for the model to store factual and syntactic patterns.
Deep Dive: In practice, the FFN accounts for roughly two-thirds of the parameters in each encoder/decoder block (with $d_{ff} = 4 \cdot d_{model}$, embeddings aside). This confirms its role as the primary "computational engine" where the actual transformation of data occurs.
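A minimal sketch of this module (the expansion/contraction sizes follow the original paper; the class name is illustrative):

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied identically at every position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expansion: 512 -> 2048
            nn.ReLU(),                  # the non-linearity discussed above
            nn.Linear(d_ff, d_model),   # contraction: 2048 -> 512
        )

    def forward(self, x):               # x: [batch, seq, d_model]
        return self.net(x)              # nn.Linear acts on the last dim only,
                                        # so tokens never interact in this step
```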
Step D: Bidirectional Nature & Stacking
Unlike previous sequence models (like standard LSTMs), the Encoder is natively Bidirectional.
- Full Context: In Step A, when calculating the attention for a token, no causal mask is applied (every position may attend to every other). This means a word at the start of the sentence can "see" a word at the end of the sentence and vice versa.
- The Stack: By stacking $N$ layers, the model performs "Iterative Refinement." Layer 1 might understand basic syntax, while Layer 6 understands complex semantic relationships across the entire paragraph.
3.5 The Add & Norm Layer: Residuals and LayerNorm
Every sub-layer in the Transformer (Attention and FFN) is wrapped in a Residual Connection followed by Layer Normalization. This "Add & Norm" structure is what makes training extremely deep Transformer stacks possible.
3.5.1 Residual (Skip) Connections
The output of a sub-layer is added back to its input: $\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))$.
- Gradient Flow: By adding the original input $x$ to the result, we create a "highway" for gradients to flow directly during backpropagation, solving the Vanishing Gradient Problem.
- Information Preservation: It allows the model to retain original information from lower layers while only learning "residual" changes in higher layers.
3.5.2 Layer Normalization vs. Batch Normalization
While Batch Normalization (BN) is the standard for Computer Vision, Transformers rely on Layer Normalization (LN) instead, because BN is ill-suited for the dynamic nature of sequential data.
| Feature | Batch Normalization (BN) | Layer Normalization (LN) |
|---|---|---|
| Normalization Axis | Across the Batch (vertical). | Across the Features (horizontal). |
| Sequence Length | Struggles with variable lengths/padding. | Robust to any sequence length. |
| Min. Batch Size | Requires large batches to be stable. | Works perfectly with Batch Size 1. |
| Phase Behavior | Different: Uses running stats at inference. | Identical: Same calculation in all phases. |
Why Phase Consistency Matters
A major advantage of Layer Normalization is that it is deterministic and independent of other samples. Because it calculates the mean ($\mu$) and variance ($\sigma^2$) using only the current token's features, the mathematical operation remains exactly the same during both training and real-time inference. In contrast, Batch Normalization must "estimate" stats during training to use later at inference, which often leads to performance discrepancies.
Tutor Insight: This "Phase Consistency" is a lifesaver for production models. It means you don't have to worry about your model behaving differently when a single user chats with it versus when you were training it on a cluster of 8 GPUs. If it worked in training, the normalization will work the same way in the app.
3.5.3 The LayerNorm Formula
Layer Normalization stabilizes training by keeping activations in a reasonable range. It is computed per sample, across features:
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
- Mean ($\mu$) and Variance ($\sigma^2$): Calculated across the hidden dimensions for each individual token.
- Learnable Parameters ($\gamma, \beta$): These allow the model to shift and scale the normalized values back if necessary.
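The formula translates almost line-for-line into code; a minimal sketch, assuming `gamma` and `beta` are learnable vectors of size $d_{model}$:

```python
import torch

def layer_norm(x, gamma, beta, eps: float = 1e-5):
    """Normalize each token's feature vector independently; x: [batch, seq, d_model]."""
    mu = x.mean(dim=-1, keepdim=True)                  # per-token mean over features
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # per-token variance over features
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

# Usage with freshly initialized parameters (identity scale, zero shift):
x = torch.randn(1, 10, 512)
out = layer_norm(x, gamma=torch.ones(512), beta=torch.zeros(512))
```

Because the statistics come only from the last dimension, the result is identical whether the batch holds one sample or a million, which is exactly the "Phase Consistency" property discussed above.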
3.5.4 Architecture Variations: Post-LN vs. Pre-LN
Modern Transformers (like GPT) often use Pre-LN, while the original Transformer used Post-LN.
- Post-LN (Original): Normalization occurs after the addition.
- Pre-LN (Modern): Normalization occurs before the sub-layer (Attention or FFN). This is generally considered more stable for training very large models.
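The difference is only in where the normalization sits; a pseudocode-style sketch (function names are illustrative):

```python
# Post-LN (original Transformer): normalize *after* the residual addition
def post_ln_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-LN (GPT-style): normalize *before* the sub-layer; the residual path
# from input to output stays untouched, which eases optimization at depth
def pre_ln_block(x, sublayer, norm):
    return x + sublayer(norm(x))
```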
Tutor Insight: Why is Batch Norm so bad for LLMs? In a batch of sentences, some might be 5 words long and others 500. Padding the short sentences to match the long ones creates "fake data" that ruins the statistical mean and variance calculation in Batch Norm. Layer Norm avoids this entirely by treating every token's vector independently.
4. The Decoder: Auto-regressive Generation
While the Encoder is designed to understand the entire input at once, the Decoder is designed to generate the output sequence one token at a time. This is known as auto-regressive generation.
```mermaid
graph BT
    subgraph Decoder_Block
        MHA[4.1 Masked Self-Attention] --> AN1[Add & Norm]
        AN1 --> CA[4.2 Cross-Attention]
        EncOut[Encoder Output] -.-> CA
        CA --> AN2[Add & Norm]
        AN2 --> FFN[4.3 Feed Forward]
        FFN --> AN3[Add & Norm]
    end
    In[Shifted Output Tokens] --> MHA
    AN3 --> Out[To Linear/Softmax]
```
4.1 Masked Multi-Head Self-Attention
While the Encoder is bidirectional, the Decoder's self-attention is Unidirectional (Causal).
- No Future Sight: In the Decoder, we cannot allow bidirectional attention. If the model is trying to predict the 3rd word, it must not be able to "see" the 4th or 5th word.
- The Mask: We use a triangular mask (setting future scores to $-\infty$). This ensures that for any token $i$, the attention weights for all $j > i$ are zero.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + \text{Mask}}{\sqrt{d_k}}\right)V$$
- The Exception: Note that Cross-Attention (4.2) does look at the entire Encoder output. This is safe because the Encoder's input is the source language (which is already fully known), not the target words the model is currently trying to generate.
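Building the triangular mask is a one-liner; a sketch that plugs into the `scaled_dot_product_attention` function from Section 3, Step A:

```python
import torch

seq_len = 5
# Lower-triangular ones: token i may attend only to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
# In the Step A sketch, the zeros above the diagonal are filled with -inf
# before the softmax, so the attention weights for all j > i become exactly 0.
```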
4.2 Multi-Head Cross-Attention (Encoder-Decoder Attention)
This layer is the "bridge" that allows the decoder to access the information processed by the encoder. It works similarly to standard attention, but the source of the vectors changes:
- Queries ($Q$): Come from the previous layer of the Decoder (representing the current word being generated).
- Keys ($K$) and Values ($V$): Come from the final output of the Encoder stack (representing the context of the entire input sentence).
Deep Dive on Stacking: The Keys ($K$) and Values ($V$) for the Cross-Attention in every one of the $N$ decoder layers come from the final output of the last Encoder layer. The Decoder layers "reuse" the same high-level context from the Encoder stack throughout the entire generation process. By using the Decoder's query to probe the Encoder's keys, the model identifies which parts of the input sentence are most relevant to the word currently being produced.
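A hedged single-head sketch of the changed plumbing, again reusing `scaled_dot_product_attention` from Section 3, Step A (here `W_q`, `W_k`, `W_v` stand for learned `nn.Linear` projections; the names are illustrative):

```python
def cross_attention(decoder_hidden, encoder_output, W_q, W_k, W_v):
    """Queries from the Decoder; Keys/Values from the final Encoder output."""
    Q = W_q(decoder_hidden)   # [B, tgt_len, d_model]: "what am I generating now?"
    K = W_k(encoder_output)   # [B, src_len, d_model]: index over the source tokens
    V = W_v(encoder_output)   # [B, src_len, d_model]: source content to extract
    # No causal mask here: the full source sentence is already known (see 4.1)
    return scaled_dot_product_attention(Q, K, V)
```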
4.3 Position-wise Feed-Forward Network (FFN)
After the two attention stages, the decoder uses the same FFN architecture as the encoder to process the combined information.
$$\text{FFN}(x) = \left[ \text{ReLU}(xW_1 + b_1) \right] W_2 + b_2$$
Function in the Decoder
- Integration: It takes the output of the cross-attention layer—which contains both "what has been said" and "what was in the input"—and transforms it into a richer representation.
- Preparation: This representation is then ready for the final Linear and Softmax layers to convert the hidden state into an actual word prediction.
Deep Dive: Why have two attention layers in the decoder but only one in the encoder? The encoder only needs to understand the input. The decoder, however, has two tasks: it must understand the internal consistency of the output it has generated so far (Masked Self-Attention) and it must maintain faithfulness to the original input (Cross-Attention).
5. Output Stage: Linear and Softmax
Once the final Decoder block has finished processing the sequence, the resulting vectors must be converted into a format that allows us to select a specific word from our vocabulary. This is handled by a two-step final layer.
5.1 The Linear Layer
The output of the Decoder stack is a vector of size $d_{model}$ (e.g., 512). The Linear Layer is a simple fully connected (dense) layer that projects this vector into a much larger space: the Vocabulary Size.
- Mapping: If your model knows 50,000 words, this layer expands the 512-dimension vector into a 50,000-dimension vector.
- Logits: The raw numbers produced by this layer are called "Logits." These represent the "score" for each word in the vocabulary, but they are not yet probabilities.
5.2 The Softmax Layer
To turn these scores into a readable format, we apply the Softmax function. This transforms the logits into a probability distribution:
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
- Sum to 1: Every word in the vocabulary is assigned a probability between 0 and 1, and the sum of all probabilities equals 1.
- Selection: During inference, the word with the highest probability is typically selected as the model's prediction for that position.
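Put together, the output stage is only a few lines; a sketch with toy values (the 50,000-word vocabulary matches the example in 5.1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 50_000
to_vocab = nn.Linear(d_model, vocab_size)   # the final Linear layer

hidden = torch.randn(1, 1, d_model)         # last decoder position (toy values)
logits = to_vocab(hidden)                   # [1, 1, 50000] raw scores ("Logits")
probs = F.softmax(logits, dim=-1)           # probabilities summing to 1
next_token = probs.argmax(dim=-1)           # greedy selection of the next word
```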
Final Output Summary Table
| Stage | Input Shape | Output Shape | Purpose |
|---|---|---|---|
| Decoder Output | $[Batch, Seq, d_{model}]$ | $[Batch, Seq, d_{model}]$ | Rich contextual hidden state. |
| Linear Layer | $[Batch, Seq, d_{model}]$ | $[Batch, Seq, Vocab]$ | Maps hidden state to word scores (Logits). |
| Softmax | $[Batch, Seq, Vocab]$ | $[Batch, Seq, Vocab]$ | Converts scores to probabilities. |
Tutor Insight: In real-world data science, we don't always just take the highest probability word (Greedy Search). We often use techniques like Beam Search or Nucleus Sampling (Top-p) to make the generation more diverse and less repetitive. However, for understanding the architecture, the Softmax layer is the definitive "finish line."
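For completeness, here is a minimal sketch of Nucleus (Top-p) sampling, assuming a 1-D `probs` tensor taken from the Softmax layer above (the function name and threshold are illustrative):

```python
import torch

def nucleus_sample(probs: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum()) + 1                       # size of the nucleus
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    choice = torch.multinomial(nucleus, num_samples=1)             # sample, don't argmax
    return sorted_idx[choice]
```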