12 Transformer Variants & Foundation Models
Modern Transformer Variants: BERT & GPT
1. BERT: Bidirectional Encoder Representations from Transformers
BERT marks a pivotal shift toward Transfer Learning in NLP. Instead of training a specialized model from scratch for every new task, we pre-train an Encoder-only architecture on a massive, unlabeled corpus to learn general linguistic patterns.
For different specific downstream applications, this pre-trained BERT model serves as a high-quality "starting point." Through Fine-Tuning, we add a small task-specific layer on top and train the entire stack on a much smaller, labeled dataset. This allows the model to leverage its existing "knowledge" of language to excel at specific tasks with far less data.
graph TD
subgraph BERT_Architecture [BERT: Encoder-Only / Bidirectional]
B_In[Input Tokens: CLS, The, cat, sat, SEP] --> B_Enc[BERT Encoder: 12-24 Layers]
B_Enc --> B_Attn{Bidirectional Attention}
B_Attn --> B_Out["Contextualized Embeddings: h_0(CLS), h_1, h_2, ..., h_n(SEP)"]
B_Out -- "Slice at index 0" --> B_Task[Task: Classification/NER]
end
1.1 Architecture: Encoder-Only
- Structure: Composed of 12 (Base) to 24 (Large) layers of Transformer Encoders.
- Key Mechanism: Bidirectional Self-Attention. Unlike traditional models that read text left-to-right, BERT looks at the entire sequence simultaneously.
- Output: Generates Contextualized Embeddings ($h_{CLS}, h_1, \dots, h_{SEP}$) where the representation of a word (e.g., "bank") changes based on its surrounding context ("river bank" vs. "bank account").
- Indexing: Special tokens are positioned at the boundaries: $h_{CLS}$ is always at position $0$ ($h_0$), and $h_{SEP}$ is at the final position $n$ ($h_n$).
1.2 Pre-training: Masked Language Modeling (MLM)
Standard language modeling (predicting the next word) is impossible for a bidirectional model because the model could "see" the answer in the future context. BERT solves this via Masking:
- The Task: Randomly mask $15\%$ of input tokens (replacing them with the `[MASK]` token) and force the model to predict them.
- Objective: Forces the model to learn deep semantic relationships and syntax to "fill in the blanks."
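A minimal PyTorch sketch of this masking step, assuming a tensor of token IDs. The real BERT recipe also leaves 10% of the selected tokens unchanged and swaps another 10% for random tokens, and it excludes special tokens from masking; those refinements are omitted here for clarity:

```python
import torch

def mask_tokens(input_ids, mask_token_id, mlm_prob=0.15):
    """Pick ~15% of positions as prediction targets and replace them with [MASK]."""
    labels = input_ids.clone()
    mask = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~mask] = -100                  # -100 = "ignore" index in PyTorch cross-entropy
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id      # the model must reconstruct these positions
    return masked_ids, labels
```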
Tutor Insight: Before BERT, we used static embeddings like Word2Vec. The problem? "Bank" always had the same vector regardless of meaning. BERT’s transfer learning approach changed the game because the "starting point" isn't just a vocabulary list; it's a model that understands how words interact in a sentence.
1.3 The [CLS] and [SEP] Tokens
- `[CLS]` (Classification): A special token always placed at Position 0. Its final hidden state ($h_{CLS}$) serves as a summary of the entire sequence for sentence-level tasks.
- `[SEP]` (Separator): Used to denote the end of a sentence or to separate two different sentences in tasks like Natural Language Inference (NLI).
1.4 Downstream Fine-Tuning
| Task | Approach |
|---|---|
| Sentiment Analysis | Attach a classifier head to the [CLS] token output. |
| NER (Named Entity Recognition) | Predict a label for every individual token output ($h_1, h_2...$). |
| Q&A (SQuAD) | Predict the start and end token positions within the text. |
| Sentence Similarity | Compare the cosine similarity between two [CLS] embeddings. |
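As a concrete illustration of the first row, here is a hedged sketch using the Hugging Face `transformers` library (the checkpoint name and label count are illustrative). `AutoModelForSequenceClassification` places a small, randomly initialized classification head on top of the pre-trained encoder's pooled `[CLS]` representation; the whole stack is then fine-tuned on labeled data:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2      # fresh classifier head for binary sentiment
)

inputs = tokenizer("The movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape (1, 2); meaningful only after fine-tuning
pred = logits.argmax(dim=-1)               # predicted class index
```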
Tutor Insight: Think of BERT as a "Context Expert." It is excellent at extracting features and understanding nuance, making it the industry standard for search, classification, and entity extraction. However, it is fundamentally bad at generating long-form text because it wasn't trained to predict the "next" word.
1.5 Deep Dive: The [CLS] Token
The [CLS] token is a "dummy" token added to the start of every input sequence. It does not represent a real word, but rather acts as a pooling mechanism for the entire sentence.
Why is it at the beginning?
Technically, because the Transformer uses Self-Attention, the [CLS] token could be placed anywhere (beginning, middle, or end) and it would still "see" every other word in the sentence. However, placing it at Position 0 offers several advantages:
- Consistency: It provides a fixed, predictable location for the model to extract a single vector representing the whole sequence, regardless of how long the sentence is.
- Engineering Simplicity: When writing code (e.g., in PyTorch or TensorFlow), you can always slice the output tensor at index `0` (`output[:, 0, :]`) to get the sentence-level representation.
How does it relate to Classification?
In a Transformer Encoder, every token in a layer attends to every other token in the previous layer. This means:
- By the time the signal reaches the final layer, the hidden state of the `[CLS]` token ($h_{CLS}$) has "absorbed" information from every other token in the sentence via self-attention.
- The Math: For a classification task (like Sentiment Analysis), we take $h_{CLS} \in \mathbb{R}^d$ and pass it through a simple linear layer with weights $W$ (see the sketch after this list):
$$\hat{y} = \text{Softmax}(\underbrace{W \cdot h_{CLS} + b}_{\text{Logits}})$$
- The model is trained to compress the "essence" of the entire sentence into this single vector during the pre-training phase (specifically for the Next Sentence Prediction task).
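A minimal PyTorch sketch of the two steps above (slicing $h_{CLS}$ at index 0, then the linear layer plus Softmax); the batch size, sequence length, and class count are illustrative:

```python
import torch
import torch.nn as nn

hidden_states = torch.randn(8, 128, 768)   # (batch, seq_len, d) from the final encoder layer
h_cls = hidden_states[:, 0, :]             # slice at index 0 -> (8, 768)

classifier = nn.Linear(768, 3)             # W and b for, say, 3 sentiment classes
probs = torch.softmax(classifier(h_cls), dim=-1)   # per-class probabilities
```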
Tutor Insight: Think of the `[CLS]` token as the "Class Representative." While the other tokens are busy figuring out their own contextual meanings, the `[CLS]` token's only job is to sit in the front of the room, listen to everyone else, and summarize the collective "vibe" of the sentence for the final classifier.
2. GPT: Generative Pre-trained Transformer
Unlike the original Transformer (which had both an Encoder and a Decoder), GPT is a Decoder-only architecture. It does not receive input from an external encoder; instead, it treats the initial user prompt as the starting "context."
graph TD
subgraph GPT_Process [GPT: Autoregressive Decoder-Only]
direction TB
Prompt["Input Context (Prompt): 'The cat'"] --> G_Block["GPT Decoder Stack"]
G_Block --> Mask["Causal Masking (Prevents seeing future)"]
Mask --> Out1["Output: 'sat'"]
Out1 -- "Append to Sequence" --> Feedback["New Context: 'The cat sat'"]
Feedback --> G_Block
G_Block --> Out2["Output: 'on'"]
end
2.1 Architecture: Decoder-Only
- Mechanism: Uses Causal (Masked) Self-Attention.
- No Cross-Attention: Since there is no encoder, the "Cross-Attention" layers found in the original Transformer's decoder are removed. Every attention layer in GPT is a (causally masked) self-attention layer.
- Directionality: Each token can only attend to its past (left context). It is strictly prohibited from seeing the "future" during training.
- Training Objective: Predict the next token ($token_t$) given all previous tokens. $$P(token_t | token_1, \dots, token_{t-1})$$
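A hedged sketch of this objective in PyTorch (vocabulary size and shapes are illustrative): the target at each position is simply the input sequence shifted one step to the left, scored with cross-entropy.

```python
import torch
import torch.nn.functional as F

vocab, B, T = 50257, 2, 10
tokens = torch.randint(0, vocab, (B, T))    # a batch of training sequences
logits = torch.randn(B, T, vocab)           # decoder outputs (stand-in for a real model)

loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab),   # predictions at positions 0 .. T-2
    tokens[:, 1:].reshape(-1),              # targets: the "next" token at each position
)
```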
2.2 Autoregressive Generation Process
GPT generates text one step at a time, where the output of step $n$ becomes part of the input for step $n+1$.
- Input (The Prompt): "The cat" (Tokens $x_1, x_2$)
- Predict: The model outputs a distribution; we sample "sat" ($x_3$).
- Feedback Loop: The new input sequence is now ["The", "cat", "sat"].
- Repeat: The model now predicts $x_4$ based on $x_1, x_2, x_3$.
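A greedy-decoding sketch of this loop, assuming `model` is any decoder-only network that maps a token-ID tensor to logits of shape `(batch, seq_len, vocab)`; real systems usually sample from the distribution rather than take the argmax:

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=20):
    """Autoregressive generation: the token predicted at step n is appended
    to the context used at step n+1."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # (1, T, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        input_ids = torch.cat([input_ids, next_id], dim=1)        # feedback loop
    return input_ids
```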
2.3 Why "Decoder-Only"?
By stripping away the encoder, GPT is optimized for unconstrained generation. While BERT is great at "filling in the blanks" (understanding), GPT is optimized for "continuing the thought" (generation).
Tutor Insight: If you see a diagram with two towers connected in the middle, that's an Encoder-Decoder (like T5 or the original Transformer for translation). If you see a single tower that only looks backward, that's a Decoder-only (GPT). The "context" isn't coming from another model; it's just the history of the conversation so far.
2.4 Causal Masking: Preventing "Cheating"
To train GPT effectively, we use a Triangular Mask in the self-attention calculation. This ensures that when the model is learning to predict the 3rd word, it physically cannot see the 4th, 5th, or 6th words in the training data.
| Token | Can See |
|---|---|
| The | [The] |
| cat | [The, cat] |
| sat | [The, cat, sat] |
Deep Dive: Why the "Masked" Attention in GPT? If we didn't mask the future tokens in the self-attention matrix, the model would simply "cheat" by looking at the word it is supposed to predict. The causal mask ensures the model actually learns the underlying patterns of language.
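A minimal sketch of the triangular mask described above: future positions receive $-\infty$ before the Softmax, so their attention weights become exactly zero.

```python
import torch

T = 4                                                  # e.g. ["The", "cat", "sat", "on"]
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True = "future"

scores = torch.randn(T, T)                             # raw attention scores (Q·Kᵀ/√d)
scores = scores.masked_fill(mask, float("-inf"))       # block attention to the future
attn = torch.softmax(scores, dim=-1)                   # each row attends only to its past
```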
2.5 From GPT to ChatGPT: The Role of RLHF
While a standard GPT model is an expert at predicting the next word, it doesn't naturally know how to be a "helpful assistant." To bridge this gap, ChatGPT uses Reinforcement Learning from Human Feedback (RLHF).
- The Intuition: Imagine the base GPT model is a student who has read the entire internet but has no manners. RLHF is like a "polishing" phase where human tutors rank the model's various answers from "Best" to "Worst."
- The Goal: This process teaches the model to prioritize answers that are Helpful, Honest, and Harmless, moving it from simply "predicting text" to "following instructions."
graph LR
Base[Base GPT Model] -- "Trained on: 'The Internet'" --> Knowledge["Result: Knowledgeable but Unpredictable"]
Knowledge -- "Human Trainers Rank Outputs" --> RLHF{RLHF Process}
RLHF -- "Learns Human Preferences" --> Assistant["Result: ChatGPT (Instruction Follower)"]
style Base fill:#f9f,stroke:#333
style Assistant fill:#bbf,stroke:#333
Tutor Insight: For your exams, remember that "Pre-training" (the next-token prediction) provides the knowledge, while "RLHF" (the reinforcement learning) provides the behavior and safety. This is why ChatGPT feels much more conversational than the raw GPT-3 model.
2.6 Scaling Laws & The GPT Family
A core finding of the GPT project is that model performance (measured by Loss) improves predictably as we scale three factors: Parameters, Data Size, and Compute.
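Published scaling-law results (e.g., Kaplan et al., 2020) express this as an approximate power law in each factor. A representative form, with $N$ the parameter count and $N_c$, $\alpha_N$ empirically fitted constants, is:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$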
| Model | Year | Parameters | Notable Feature |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Proved generative pre-training works. |
| GPT-2 | 2019 | 1.5B | Zero-shot capabilities emerged. |
| GPT-3 | 2020 | 175B | Emergent abilities: In-context learning (few-shot). |
| ChatGPT | 2022 | ~GPT-3.5 | Added RLHF (Reinforcement Learning from Human Feedback). |
| GPT-4 | 2023 | ~1T? | Multimodal and massive reasoning jump. |
3. Comparison: BERT vs. GPT vs. T5
The modern NLP landscape is divided based on which part of the original Transformer architecture is utilized.
graph TD
subgraph Comparison [Transformer Architecture Families]
EO[Encoder-Only / BERT] -- "Optimized for" --> Und[Natural Language Understanding NLU]
DO[Decoder-Only / GPT] -- "Optimized for" --> Gen[Natural Language Generation NLG]
ED[Encoder-Decoder / T5] -- "Optimized for" --> Trans[Translation / Seq2Seq]
end
style EO fill:#d4edda,stroke:#155724
style DO fill:#fff3cd,stroke:#856404
style ED fill:#d1ecf1,stroke:#0c5460
| Feature | BERT | GPT | T5 / BART |
|---|---|---|---|
| Architecture | Encoder-Only | Decoder-Only | Encoder-Decoder |
| Context | Bidirectional (Full) | Causal (Left-to-Right) | Bidirectional + Causal |
| Goal | Understanding (NLU) | Generation (NLG) | Sequence-to-Sequence |
| Pre-training | Masked Language (MLM) | Next Token Prediction | Span Masking / Denoising |
| Use Cases | Classification, NER, Search | Chatbots, Creative Writing | Translation, Summarization |
Tutor Insight: If you are asked on an exam which model to use for Legal Document Analysis, choose BERT (it needs to understand every word in context). If asked what to use for a Poetry Generator, choose GPT. If asked for English-to-French Translation, choose an Encoder-Decoder model.
4. ViT: Vision Transformer
The Vision Transformer (ViT) represents a paradigm shift in computer vision by proving that the Transformer architecture—originally designed for NLP—is "modality-agnostic." It effectively replaces traditional Convolutional Neural Networks (CNNs) with a standard Transformer Encoder.
4.1 The Key Insight: Image Patches as Tokens
The fundamental breakthrough of ViT is treating an image exactly like a sentence.
- Tokenization: An input image (e.g., $224 \times 224$) is split into fixed-size square patches (e.g., $16 \times 16$).
- Visual Words: These patches are flattened and treated as a sequence of "visual words."
- Mantra: "An Image is Worth 16x16 Words."
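A short PyTorch sketch of this "tokenization" step, assuming a $224 \times 224$ RGB image and $16 \times 16$ patches (so $14 \times 14 = 196$ patches, each flattened to $16 \cdot 16 \cdot 3 = 768$ values):

```python
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 224, 224)                   # (batch, channels, height, width)
P = 16
patches = F.unfold(img, kernel_size=P, stride=P)    # (1, 3*16*16, 196) non-overlapping blocks
patches = patches.transpose(1, 2)                   # (1, 196, 768): a sequence of "visual words"
```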
4.2 Architecture and Workflow
The architecture remains almost identical to the BERT encoder, which allows it to scale efficiently on massive datasets. The workflow is strictly sequential: the image is "serialized" into a sequence of tokens, so the Transformer can treat visual data with the same mathematical framework used for text.
graph TD
subgraph Input_Stage [1. Input Processing]
Img[Input Image: 224x224x3] --> Split[Split into Fixed-size Patches: 16x16]
Split --> Flat[Flatten Patches: 16*16*3 = 768]
end
subgraph Embedding_Stage [2. Linear Projection & Positional Encoding]
Flat --> Proj[Linear Projection to Dimension D]
Proj --> Add_CLS[Prepend Learnable CLS Token]
Add_CLS --> Add_Pos[Add 1D Position Embeddings]
end
subgraph Encoder_Stage [3. Transformer Backbone]
Add_Pos --> L_Layers[Transformer Encoder: L Layers]
L_Layers --> MSA[Multi-Head Self-Attention]
MSA --> Norm[Layer Normalization]
Norm --> MLP_Inside[MLP Block]
end
subgraph Output_Stage [4. Head]
MLP_Inside --> Extract["Extract h_0 (CLS Output Vector)"]
Extract --> MLP_Head[MLP Classification Head]
MLP_Head --> Class[Final Prediction: Image Class]
end
style Input_Stage fill:#f5f5f5,stroke:#333
style Embedding_Stage fill:#e1f5fe,stroke:#01579b
style Encoder_Stage fill:#fff3e0,stroke:#e65100
style Output_Stage fill:#f1f8e9,stroke:#33691e
- Linear Projection: Raw patches are projected into a constant latent vector size $D$ using a learnable matrix.
- Special [CLS] Token: Similar to BERT, a learnable `[CLS]` embedding is prepended to the sequence. The final hidden state of this token ($h_{CLS}$) is used for the image classification task.
- Position Embeddings: Since Transformers have no inherent sense of spatial order, learnable 1D position embeddings are added to the patch embeddings to retain the geometry of the image.
- Transformer Encoder: The sequence passes through $L$ layers of Multi-Head Self-Attention (MSA) and MLP blocks.
- MLP Head: A final classification head is attached to $h_{CLS}$ to predict the object category.
4.3 Patch Embedding: The Math
To convert a 3D image patch into a 1D vector that the Transformer can process, we use the following projection:
- Flattening: A $16 \times 16 \times 3$ (RGB) patch is flattened into a vector $x_p$ of size $768$.
- Initial Input ($z_0$): The input sequence to the first layer is defined as: $$z_0 = [x_{cls}; x_p^1 E; x_p^2 E; \dots; x_p^N E] + E_{pos}$$
- Variables:
- $x_{cls}$: The learnable [CLS] token.
- $E$: The Patch Embedding projection matrix.
- $E_{pos}$: The Position Embedding matrix.
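A minimal sketch of how $z_0$ is assembled in PyTorch, assuming the 768-dimensional flattened patches from Section 4.2 and an illustrative latent size $D = 768$:

```python
import torch
import torch.nn as nn

N, D, patch_dim = 196, 768, 16 * 16 * 3

E = nn.Linear(patch_dim, D)                          # patch-embedding projection matrix E
x_cls = nn.Parameter(torch.zeros(1, 1, D))           # learnable [CLS] token
E_pos = nn.Parameter(torch.zeros(1, N + 1, D))       # learnable 1D position embeddings

x_p = torch.randn(1, N, patch_dim)                   # flattened image patches
z0 = torch.cat([x_cls, E(x_p)], dim=1) + E_pos       # input sequence to the first encoder layer
```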
4.4 ViT vs. CNN: Inductive Bias and Scaling
CNNs (like ResNet) possess Inductive Bias (Translation Invariance and Locality). ViT lacks these, which leads to distinct performance trade-offs:
- Data Hunger: ViT generally requires significantly more training data (e.g., JFT-300M) because it must learn the concept of spatial relationships from scratch.
- The Tipping Point: On small datasets (ImageNet-1k), CNNs often outperform ViT. However, as data scales to "foundation model" sizes, ViT performance overtakes CNNs and scales much more efficiently.
4.5 Why ViT Works: Global Context
The primary advantage of the attention mechanism over convolution is the Global Receptive Field.
- Global Attention: In a CNN, a pixel only "sees" its immediate neighbors. In ViT, every patch can attend to every other patch in the entire image from the very first layer.
- Feature Evolution:
- Early Layers: The model learns to focus on both local edges and global structures simultaneously.
- Later Layers: The model focuses on complex object-level reasoning across the entire image.
Tutor Insight: Think of a CNN as someone exploring a dark room with a small flashlight—they see edges and corners first and slowly piece the room together. ViT is like someone who turns on the overhead lights instantly; they see the "big picture" immediately, but they might need to look at thousands of different rooms to understand what a "kitchen" generally looks like.
Deep Dive: Why 1D Position Embeddings? Interestingly, the ViT paper found that 2D position embeddings (explicitly telling the model X and Y coordinates) didn't provide a significant boost over simple 1D sequences. The model is powerful enough to learn the 2D structure of the image just from the 1D order!
4.6 Deep Dive: Learnable Positional Embeddings
In the original ViT paper, the authors experimented with various ways to tell the model "where" a patch is located. They reached a surprising conclusion that simplified the architecture significantly.
- Learnable vs. Fixed: Unlike the original Transformer which used fixed sine/cosine functions, ViT uses learnable parameters ($E_{pos} \in \mathbb{R}^{(N+1) \times D}$). These are initialized randomly and optimized via backpropagation.
- 1D vs. 2D: Even though an image is 2D, ViT typically uses a 1D sequence of embeddings. You might think this would hide the 2D structure, but it doesn't.
- The "Discovery": Because the model is trained on millions of images, the self-attention mechanism learns that patch #1 is spatially close to patch #2 (horizontally) and patch #17 (vertically, if the grid is 16x16).
Evidence of 2D Learning
If we visualize the Cosine Similarity between the learned positional embeddings, we see a clear 2D grid pattern:
- Embeddings for patches in the same row have high similarity.
- Embeddings for patches in the same column have high similarity.
- The model effectively "reconstructs" the 2D map of the image inside its 1D vector space.
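This visualization is easy to reproduce; a hedged sketch follows, where a random tensor stands in for a trained ViT's position embeddings and a $14 \times 14$ patch grid is assumed:

```python
import torch
import torch.nn.functional as F

pos_emb = torch.randn(196, 768)            # placeholder for trained position embeddings
normed = F.normalize(pos_emb, dim=-1)
sim = normed @ normed.T                    # (196, 196) cosine-similarity matrix

# Reshape one row into the 14x14 grid: in a trained model, that patch's row and
# column "light up" with high similarity, revealing the learned 2D structure.
row_0_as_grid = sim[0].reshape(14, 14)
```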
Tutor Insight: This is a perfect example of Representation Learning. We don't have to hard-code the rules of geometry into the model (like we do with the "sliding window" in CNNs). Given enough data, the Transformer inductively learns that images have a 2D structure because that is the most efficient way to reduce the loss function.
5. CLIP: Contrastive Language-Image Pre-training
CLIP (OpenAI, 2021) represents a shift from "learning to label" to "learning to match." Traditionally, computer vision models were limited to a fixed set of classes. CLIP breaks this by learning from 400 million image-text pairs found on the internet.
5.1 The Process: Pre-training vs. Inference
The CLIP workflow is divided into two distinct phases:
- Contrastive Pre-training: The model learns a shared embedding space using a batch of $N$ image-text pairs.
- Optimizing the Diagonal: In a batch of $N$ pairs, there are $N^2$ possible combinations. However, only the pairs where the indices match (e.g., $I_1$ with $T_1$) are correct. These form the diagonal of the similarity matrix. Training focuses on maximizing these diagonal values while minimizing all off-diagonal "noise" (incorrect pairings).
- Zero-Shot Prediction: At test time, the model predicts which text prompt best matches the input image vector by calculating the highest cosine similarity.
graph TD
subgraph Training_Process [1. Training: Contrastive Alignment]
direction TB
Batch_I[Image Batch i=1..N] --> I_Enc[Image Encoder]
Batch_T[Text Batch j=1..N] --> T_Enc[Text Encoder]
I_Enc --> Matrix[NxN Similarity Matrix]
T_Enc --> Matrix
Matrix -- "InfoNCE Loss" --> Update[Optimize Diagonal]
end
%% Added a spacer link to ensure vertical stacking
Update --- Spacer(( )) --- New_Img
style Spacer label:none,fill:none,stroke:none
subgraph Inference_Process [2. Inference: Zero-Shot Classification]
direction TB
New_Img[Input Image] --> I_Enc_Inf[Image Encoder]
Prompts["'A photo of a cat' <br> 'A photo of a dog' <br> 'A photo of a car'"] --> T_Enc_Inf[Text Encoder]
I_Enc_Inf --> Sim_Rank[Ranking by Cosine Similarity]
T_Enc_Inf --> Sim_Rank
Sim_Rank --> Top1[Predicted Label]
end
style Training_Process fill:#fdf2f2,stroke:#a94442
style Inference_Process fill:#f2f9ff,stroke:#31708f
5.2 Mathematical Core: InfoNCE Loss
To train the shared space, CLIP uses the InfoNCE (Information Noise Contrastive Estimation) loss. It is essentially a Softmax over the similarity scores:
$$\mathcal{L}_{i}^{(I)} = -\log \frac{\exp(\text{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j) / \tau)}$$
- $\text{sim}(I, T)$: Cosine similarity between vectors.
- Numerator (The Match): This value is high when the image $I_i$ and text $T_i$ are semantically close. As they align in the embedding space, their cosine similarity approaches $1$, maximizing the numerator.
- Denominator (The Noise): This is the sum of scores for all text prompts in the batch. It is large because it includes the correct match plus $N-1$ incorrect matches. The model learns to "push" the numerator so high that it outweighs the collective "noise" of the denominator.
- Temperature ($\tau$): A learnable parameter that scales the distribution.
- A small $\tau$ (< 1) makes the distribution "sharper," forcing the model to be more confident and punishing small errors severely.
- A large $\tau$ makes the distribution "flatter," allowing for more overlap/uncertainty in the embedding space.
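A minimal PyTorch sketch of this loss; the symmetric text-to-image half is included as well, as in CLIP, and a fixed $\tau$ stands in for the learnable temperature:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, tau=0.07):
    """image_emb, text_emb: (N, d) batches where row i of each is a matching pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / tau          # (N, N) similarity matrix
    targets = torch.arange(len(logits))            # correct matches lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)    # text -> image direction
    return (loss_i + loss_t) / 2
```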
5.3 Why Contrastive Learning Works
Unlike "Generative" training (predicting pixels), Contrastive Learning is efficient because the model only needs to learn to distinguish concepts rather than reconstruct them from scratch.
CLIP is significantly more robust than traditional models like ResNet-101:
- Generalization: CLIP maintains high accuracy on sketches, cartoons, or distorted images because it has learned the linguistic concept (e.g., "dog-ness") rather than just specific pixel textures.
| Feature | Standard Computer Vision (CV) | CLIP |
|---|---|---|
| Labels | Fixed (e.g., 1,000 classes) | Open (Natural Language) |
| Learning Task | Predicting a Class Index | Matching Image to Text |
| Flexibility | Needs retraining for new labels | Zero-shot (No retraining) |
| Strength | High precision on specific tasks | High robustness and generalization |
Tutor Insight: While a standard CV model is like a specialist who only knows 1,000 words, CLIP is like a generalist who has read the whole dictionary and can apply that knowledge to pictures.
5.4 Zero-Shot Classification & Prompt Engineering
CLIP's classification relies on "Prompting."
- The Issue: A single word like "boxer" is ambiguous (is it an athlete or a dog?).
- The Fix: Using a template like "A photo of a {label}, a type of dog" provides the Text Encoder with context, significantly boosting accuracy over using raw labels.
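A hedged end-to-end sketch of zero-shot classification with prompt templates, using the Hugging Face `transformers` CLIP wrappers (the checkpoint name, labels, and blank placeholder image are illustrative; substitute a real photo):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["boxer", "beagle", "poodle"]
prompts = [f"A photo of a {label}, a type of dog." for label in labels]
image = Image.new("RGB", (224, 224))               # placeholder; use a real image here

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image      # (1, 3) scaled cosine similarities
prediction = labels[logits.argmax(dim=-1).item()]  # highest-scoring prompt wins
```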
5.5 CLIP’s Impact & Downstream Applications
CLIP is rarely used as a standalone classifier; its true impact is acting as a Multimodal Bridge:
- Image Search: Search a database of images using natural language queries without manual tagging.
- Generative AI (DALL-E / Stable Diffusion): The CLIP text encoder guides the image generation process to ensure the output matches the user's prompt.
- Robustness to Distribution Shift: Because CLIP learns language-based concepts, it is far more robust to "Sketches," "Cartoons," or distortions than models trained on standard datasets like ImageNet.
| Metric | ImageNet Model (ResNet) | CLIP (Zero-Shot) |
|---|---|---|
| Accuracy on Photos | Very High | High |
| Accuracy on Sketches | Low (Fails) | High |
| Accuracy on Distortions | Low (Fails) | High |
Tutor Insight: The "Secret Sauce" of CLIP isn't just the architecture—it's the Scale of the data and the Contrastive Loss. By comparing 400 million pairs, the model learns a "universal" understanding of how the visual world is described by humans.
6. Summary: Model Paradigms and Comparisons
The above models can be categorized by their "paradigm" — the fundamental way they process and relate to data.
6.1 Three Paradigms of Learning
- Generative Models: These models learn the underlying probability distribution of a dataset to produce entirely new, synthetic samples. By understanding how data is "made," they can generate original content, such as GPT predicting the next token in a sentence.
- Discriminative Models: These models learn the decision boundaries between different categories to classify or label input data. Rather than creating data, they focus on identifying what a specific input is, such as BERT classifying sentiment or ViT identifying an object in a photo.
- Contrastive Models: These models learn to align different data types (modalities) into a shared mathematical space by maximizing the similarity of matching pairs. Instead of labeling or creating, they focus on "matchmaking" between different inputs, like CLIP connecting a caption to its corresponding image.
6.2 Comparison Table: BERT, GPT, ViT, and CLIP
| Model | Paradigm | Primary Function | Core Strength |
|---|---|---|---|
| BERT | Discriminative | Natural Language Understanding (NLU) | Deep contextual understanding of entire sentences. |
| GPT | Generative | Natural Language Generation (NLG) | Creating highly fluent, human-like original content. |
| ViT | Discriminative | Image Classification | Scaling vision tasks using the Transformer's global attention. |
| CLIP | Contrastive | Multimodal Alignment | Connecting text and images in a robust, zero-shot way. |
Tutor Insight: If you’re ever confused on an exam, remember the "Art Room" analogy. GPT is the student drawing a new picture; BERT and ViT are the judges identifying what is in the picture; and CLIP is the librarian matching the right title to the right book.