12 Transformer Variants & Foundation Models
Modern Transformer Variants: BERT & GPT
1. BERT: Bidirectional Encoder Representations from Transformers
BERT marks a pivotal shift toward Transfer Learning in NLP. Instead of training a specialized model from scratch for every new task, we pre-train an Encoder-only architecture on a massive, unlabeled corpus to learn general linguistic patterns.
For different specific downstream applications, this pre-trained BERT model serves as a high-quality "starting point." Through Fine-Tuning, we add a small task-specific layer on top and train the entire stack on a much smaller, labeled dataset. This allows the model to leverage its existing "knowledge" of language to excel at specific tasks with far less data.
graph TD
subgraph BERT_Architecture [BERT: Encoder-Only / Bidirectional]
B_In[Input Tokens: CLS, The, cat, sat, SEP] --> B_Enc[BERT Encoder: 12-24 Layers]
B_Enc --> B_Attn{Bidirectional Attention}
B_Attn --> B_Out["Contextualized Embeddings: h_0(CLS), h_1, h_2, ..., h_n(SEP)"]
B_Out -- "Slice at index 0" --> B_Task[Task: Classification/NER]
end
1.1 Architecture: Encoder-Only
- Structure: Composed of 12 (Base) to 24 (Large) layers of Transformer Encoders.
- Key Mechanism: Bidirectional Self-Attention. Unlike traditional models that read text left-to-right, BERT looks at the entire sequence simultaneously.
- Output: Generates Contextualized Embeddings ($h_{CLS}, h_1, \dots, h_{SEP}$) where the representation of a word (e.g., "bank") changes based on its surrounding context ("river bank" vs. "bank account").
- Indexing: Special tokens are positioned at the boundaries: $h_{CLS}$ is always at position $0$ ($h_0$), and $h_{SEP}$ is at the final position $n$ ($h_n$).
1.2 Pre-training: Masked Language Modeling (MLM)
Standard language modeling (predicting the next word) is impossible for a bidirectional model because the model could "see" the answer in the future context. BERT solves this via Masking:
- The Task: Randomly mask $15\%$ of input tokens (replacing them with the `[MASK]` token) and force the model to predict them.
- Objective: Forces the model to learn deep semantic relationships and syntax to "fill in the blanks."
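A minimal PyTorch sketch of this masking step, assuming a tensor of token IDs. The real BERT recipe also leaves 10% of the selected tokens unchanged and swaps another 10% for random tokens, and it excludes special tokens from masking; those refinements are omitted here for clarity:

```python
import torch

def mask_tokens(input_ids, mask_token_id, mlm_prob=0.15):
    """Pick ~15% of positions as prediction targets and replace them with [MASK]."""
    labels = input_ids.clone()
    mask = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~mask] = -100                  # -100 = "ignore" index in PyTorch cross-entropy
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id      # the model must reconstruct these positions
    return masked_ids, labels
```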
Tutor Insight: Before BERT, we used static embeddings like Word2Vec. The problem? "Bank" always had the same vector regardless of meaning. BERT’s transfer learning approach changed the game because the "starting point" isn't just a vocabulary list; it's a model that understands how words interact in a sentence.
1.3 The [CLS] and [SEP] Tokens
- `[CLS]` (Classification): A special token always placed at Position 0. Its final hidden state ($h_{CLS}$) serves as a summary of the entire sequence for sentence-level tasks.
- `[SEP]` (Separator): Used to denote the end of a sentence or to separate two different sentences in tasks like Natural Language Inference (NLI).
1.4 Downstream Fine-Tuning
| Task | Approach |
|---|---|
| Sentiment Analysis | Attach a classifier head to the [CLS] token output. |
| NER (Named Entity Recognition) | Predict a label for every individual token output ($h_1, h_2...$). |
| Q&A (SQuAD) | Predict the start and end token positions within the text. |
| Sentence Similarity | Compare the cosine similarity between two [CLS] embeddings. |
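As a concrete illustration of the first row, here is a hedged sketch using the Hugging Face `transformers` library (the checkpoint name and label count are illustrative). `AutoModelForSequenceClassification` places a small, randomly initialized classification head on top of the pre-trained encoder's pooled `[CLS]` representation; the whole stack is then fine-tuned on labeled data:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2      # fresh classifier head for binary sentiment
)

inputs = tokenizer("The movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape (1, 2); meaningful only after fine-tuning
pred = logits.argmax(dim=-1)               # predicted class index
```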
Tutor Insight: Think of BERT as a "Context Expert." It is excellent at extracting features and understanding nuance, making it the industry standard for search, classification, and entity extraction. However, it is fundamentally bad at generating long-form text because it wasn't trained to predict the "next" word.
1.5 Deep Dive: The [CLS] Token
The [CLS] token is a "dummy" token added to the start of every input sequence. It does not represent a real word, but rather acts as a pooling mechanism for the entire sentence.
Why is it at the beginning?
Technically, because the Transformer uses Self-Attention, the [CLS] token could be placed anywhere (beginning, middle, or end) and it would still "see" every other word in the sentence. However, placing it at Position 0 offers several advantages:
- Consistency: It provides a fixed, predictable location for the model to extract a single vector representing the whole sequence, regardless of how long the sentence is.
- Engineering Simplicity: When writing code (e.g., in PyTorch or TensorFlow), you can always slice the output tensor at index `0` (`output[:, 0, :]`) to get the sentence-level representation.
How does it relate to Classification?
In a Transformer Encoder, every token in a layer attends to every other token in the previous layer. This means:
- By the time the signal reaches the final layer, the hidden state of the `[CLS]` token ($h_{CLS}$) has "absorbed" information from every other token in the sentence via self-attention.
- The Math: For a classification task (like Sentiment Analysis), we take $h_{CLS} \in \mathbb{R}^d$ and pass it through a simple linear layer with weights $W$ (see the sketch after this list):
$$\hat{y} = \text{Softmax}(\underbrace{W \cdot h_{CLS} + b}_{\text{Logits}})$$
- The model is trained to compress the "essence" of the entire sentence into this single vector during the pre-training phase (specifically for the Next Sentence Prediction task).
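A minimal PyTorch sketch of the two steps above (slicing $h_{CLS}$ at index 0, then the linear layer plus Softmax); the batch size, sequence length, and class count are illustrative:

```python
import torch
import torch.nn as nn

hidden_states = torch.randn(8, 128, 768)   # (batch, seq_len, d) from the final encoder layer
h_cls = hidden_states[:, 0, :]             # slice at index 0 -> (8, 768)

classifier = nn.Linear(768, 3)             # W and b for, say, 3 sentiment classes
probs = torch.softmax(classifier(h_cls), dim=-1)   # per-class probabilities
```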
Tutor Insight: Think of the `[CLS]` token as the "Class Representative." While the other tokens are busy figuring out their own contextual meanings, the `[CLS]` token's only job is to sit in the front of the room, listen to everyone else, and summarize the collective "vibe" of the sentence for the final classifier.
2. GPT: Generative Pre-trained Transformer
Unlike the original Transformer (which had both an Encoder and a Decoder), GPT is a Decoder-only architecture. It does not receive input from an external encoder; instead, it treats the initial user prompt as the starting "context."
graph TD
subgraph GPT_Process [GPT: Autoregressive Decoder-Only]
direction TB
Prompt["Input Context (Prompt): 'The cat'"] --> G_Block["GPT Decoder Stack"]
G_Block --> Mask["Causal Masking (Prevents seeing future)"]
Mask --> Out1["Output: 'sat'"]
Out1 -- "Append to Sequence" --> Feedback["New Context: 'The cat sat'"]
Feedback --> G_Block
G_Block --> Out2["Output: 'on'"]
end
2.1 Architecture: Decoder-Only
- Mechanism: Uses Causal (Masked) Self-Attention.
- No Cross-Attention: Since there is no encoder, the "Cross-Attention" layers found in the original Transformer's decoder are removed. Every attention layer in GPT is a (causally masked) self-attention layer.
- Directionality: Each token can only attend to its past (left context). It is strictly prohibited from seeing the "future" during training.
- Training Objective: Predict the next token ($token_t$) given all previous tokens. $$P(token_t | token_1, \dots, token_{t-1})$$
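A hedged sketch of this objective in PyTorch (vocabulary size and shapes are illustrative): the target at each position is simply the input sequence shifted one step to the left, scored with cross-entropy.

```python
import torch
import torch.nn.functional as F

vocab, B, T = 50257, 2, 10
tokens = torch.randint(0, vocab, (B, T))    # a batch of training sequences
logits = torch.randn(B, T, vocab)           # decoder outputs (stand-in for a real model)

loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab),   # predictions at positions 0 .. T-2
    tokens[:, 1:].reshape(-1),              # targets: the "next" token at each position
)
```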
2.2 Autoregressive Generation Process
GPT generates text one step at a time, where the output of step $n$ becomes part of the input for step $n+1$.
- Input (The Prompt): "The cat" (Tokens $x_1, x_2$)
- Predict: The model outputs a distribution; we sample "sat" ($x_3$).
- Feedback Loop: The new input sequence is now ["The", "cat", "sat"].
- Repeat: The model now predicts $x_4$ based on $x_1, x_2, x_3$.
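A greedy-decoding sketch of this loop, assuming `model` is any decoder-only network that maps a token-ID tensor to logits of shape `(batch, seq_len, vocab)`; real systems usually sample from the distribution rather than take the argmax:

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=20):
    """Autoregressive generation: the token predicted at step n is appended
    to the context used at step n+1."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # (1, T, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        input_ids = torch.cat([input_ids, next_id], dim=1)        # feedback loop
    return input_ids
```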
2.3 Why "Decoder-Only"?
By stripping away the encoder, GPT is optimized for unconstrained generation. While BERT is great at "filling in the blanks" (understanding), GPT is optimized for "continuing the thought" (generation).
Tutor Insight: If you see a diagram with two towers connected in the middle, that's an Encoder-Decoder (like T5 or the original Transformer for translation). If you see a single tower that only looks backward, that's a Decoder-only (GPT). The "context" isn't coming from another model; it's just the history of the conversation so far.
2.4 Causal Masking: Preventing "Cheating"
To train GPT effectively, we use a Triangular Mask in the self-attention calculation. This ensures that when the model is learning to predict the 3rd word, it physically cannot see the 4th, 5th, or 6th words in the training data.
| Token | Can See |
|---|---|
| The | [The] |
| cat | [The, cat] |
| sat | [The, cat, sat] |
Deep Dive: Why the "Masked" Attention in GPT? If we didn't mask the future tokens in the self-attention matrix, the model would simply "cheat" by looking at the word it is supposed to predict. The causal mask ensures the model actually learns the underlying patterns of language.
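A minimal sketch of the triangular mask described above: future positions receive $-\infty$ before the Softmax, so their attention weights become exactly zero.

```python
import torch

T = 4                                                  # e.g. ["The", "cat", "sat", "on"]
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True = "future"

scores = torch.randn(T, T)                             # raw attention scores (Q·Kᵀ/√d)
scores = scores.masked_fill(mask, float("-inf"))       # block attention to the future
attn = torch.softmax(scores, dim=-1)                   # each row attends only to its past
```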
2.5 From GPT to ChatGPT: The Role of RLHF
While a standard GPT model is an expert at predicting the next word, it doesn't naturally know how to be a "helpful assistant." To bridge this gap, ChatGPT uses Reinforcement Learning from Human Feedback (RLHF).
- The Intuition: Imagine the base GPT model is a student who has read the entire internet but has no manners. RLHF is like a "polishing" phase where human tutors rank the model's various answers from "Best" to "Worst."
- The Goal: This process teaches the model to prioritize answers that are Helpful, Honest, and Harmless, moving it from simply "predicting text" to "following instructions."
graph LR
Base[Base GPT Model] -- "Trained on: 'The Internet'" --> Knowledge["Result: Knowledgeable but Unpredictable"]
Knowledge -- "Human Trainers Rank Outputs" --> RLHF{RLHF Process}
RLHF -- "Learns Human Preferences" --> Assistant["Result: ChatGPT (Instruction Follower)"]
style Base fill:#f9f,stroke:#333
style Assistant fill:#bbf,stroke:#333
Tutor Insight: For your exams, remember that "Pre-training" (the next-token prediction) provides the knowledge, while "RLHF" (the reinforcement learning) provides the behavior and safety. This is why ChatGPT feels much more conversational than the raw GPT-3 model.
2.6 Scaling Laws & The GPT Family
A core finding of the GPT project is that model performance (measured by Loss) improves predictably as we scale three factors: Parameters, Data Size, and Compute.
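Published scaling-law results (e.g., Kaplan et al., 2020) express this as an approximate power law in each factor. A representative form, with $N$ the parameter count and $N_c$, $\alpha_N$ empirically fitted constants, is:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$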
| Model | Year | Parameters | Notable Feature |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Proved generative pre-training works. |
| GPT-2 | 2019 | 1.5B | Zero-shot capabilities emerged. |
| GPT-3 | 2020 | 175B | Emergent abilities: In-context learning (few-shot). |
| ChatGPT | 2022 | ~GPT-3.5 | Added RLHF (Reinforcement Learning from Human Feedback). |
| GPT-4 | 2023 | ~1T? | Multimodal and massive reasoning jump. |
3. Comparison: BERT vs. GPT vs. T5
The modern NLP landscape is divided based on which part of the original Transformer architecture is utilized.
graph TD
subgraph Comparison [Transformer Architecture Families]
EO[Encoder-Only / BERT] -- "Optimized for" --> Und[Natural Language Understanding NLU]
DO[Decoder-Only / GPT] -- "Optimized for" --> Gen[Natural Language Generation NLG]
ED[Encoder-Decoder / T5] -- "Optimized for" --> Trans[Translation / Seq2Seq]
end
style EO fill:#d4edda,stroke:#155724
style DO fill:#fff3cd,stroke:#856404
style ED fill:#d1ecf1,stroke:#0c5460
| Feature | BERT | GPT | T5 / BART |
|---|---|---|---|
| Architecture | Encoder-Only | Decoder-Only | Encoder-Decoder |
| Context | Bidirectional (Full) | Causal (Left-to-Right) | Bidirectional + Causal |
| Goal | Understanding (NLU) | Generation (NLG) | Sequence-to-Sequence |
| Pre-training | Masked Language (MLM) | Next Token Prediction | Span Masking / Denoising |
| Use Cases | Classification, NER, Search | Chatbots, Creative Writing | Translation, Summarization |
Tutor Insight: If you are asked on an exam which model to use for Legal Document Analysis, choose BERT (it needs to understand every word in context). If asked what to use for a Poetry Generator, choose GPT. If asked for English-to-French Translation, choose an Encoder-Decoder model.
4. ViT: Vision Transformer
The Vision Transformer (ViT) represents a paradigm shift in computer vision by proving that the Transformer architecture—originally designed for NLP—is "modality-agnostic." It effectively replaces traditional Convolutional Neural Networks (CNNs) with a standard Transformer Encoder.
4.1 The Key Insight: Image Patches as Tokens
The fundamental breakthrough of ViT is treating an image exactly like a sentence.
- Tokenization: An input image (e.g., $224 \times 224$) is split into fixed-size square patches (e.g., $16 \times 16$).
- Visual Words: These patches are flattened and treated as a sequence of "visual words."
- Mantra: "An Image is Worth 16x16 Words."
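A short PyTorch sketch of this "tokenization" step, assuming a $224 \times 224$ RGB image and $16 \times 16$ patches (so $14 \times 14 = 196$ patches, each flattened to $16 \cdot 16 \cdot 3 = 768$ values):

```python
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 224, 224)                   # (batch, channels, height, width)
P = 16
patches = F.unfold(img, kernel_size=P, stride=P)    # (1, 3*16*16, 196) non-overlapping blocks
patches = patches.transpose(1, 2)                   # (1, 196, 768): a sequence of "visual words"
```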
4.2 Architecture and Workflow
The architecture remains almost identical to the BERT encoder, which allows it to scale efficiently on massive datasets. The workflow is strictly sequential: the image is "serialized" into a sequence of tokens, so the Transformer can treat visual data with the same mathematical framework used for text.
graph TD
subgraph Input_Stage [1. Input Processing]
Img[Input Image: 224x224x3] --> Split[Split into Fixed-size Patches: 16x16]
Split --> Flat[Flatten Patches: 16*16*3 = 768]
end
subgraph Embedding_Stage [2. Linear Projection & Positional Encoding]
Flat --> Proj[Linear Projection to Dimension D]
Proj --> Add_CLS[Prepend Learnable CLS Token]
Add_CLS --> Add_Pos[Add 1D Position Embeddings]
end
subgraph Encoder_Stage [3. Transformer Backbone]
Add_Pos --> L_Layers[Transformer Encoder: L Layers]
L_Layers --> MSA[Multi-Head Self-Attention]
MSA --> Norm[Layer Normalization]
Norm --> MLP_Inside[MLP Block]
end
subgraph Output_Stage [4. Head]
MLP_Inside --> Extract["Extract h_0 (CLS Output Vector)"]
Extract --> MLP_Head[MLP Classification Head]
MLP_Head --> Class[Final Prediction: Image Class]
end
style Input_Stage fill:#f5f5f5,stroke:#333
style Embedding_Stage fill:#e1f5fe,stroke:#01579b
style Encoder_Stage fill:#fff3e0,stroke:#e65100
style Output_Stage fill:#f1f8e9,stroke:#33691e
- Linear Projection: Raw patches are projected into a constant latent vector size $D$ using a learnable matrix.
- Special [CLS] Token: Similar to BERT, a learnable `[CLS]` embedding is prepended to the sequence. The final hidden state of this token ($h_{CLS}$) is used for the image classification task.
- Position Embeddings: Since Transformers have no inherent sense of spatial order, learnable 1D position embeddings are added to the patch embeddings to retain the geometry of the image.
- Transformer Encoder: The sequence passes through $L$ layers of Multi-Head Self-Attention (MSA) and MLP blocks.
- MLP Head: A final classification head is attached to $h_{CLS}$ to predict the object category.
4.3 Patch Embedding: The Math
To convert a 3D image patch into a 1D vector that the Transformer can process, we use the following projection:
- Flattening: A $16 \times 16 \times 3$ (RGB) patch is flattened into a vector $x_p$ of size $768$.
- Initial Input ($z_0$): The input sequence to the first layer is defined as: $$z_0 = [x_{cls}; x_p^1 E; x_p^2 E; \dots; x_p^N E] + E_{pos}$$
- Variables:
- $x_{cls}$: The learnable [CLS] token.
- $E$: The Patch Embedding projection matrix.
- $E_{pos}$: The Position Embedding matrix.
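A minimal sketch of how $z_0$ is assembled in PyTorch, assuming the 768-dimensional flattened patches from Section 4.2 and an illustrative latent size $D = 768$:

```python
import torch
import torch.nn as nn

N, D, patch_dim = 196, 768, 16 * 16 * 3

E = nn.Linear(patch_dim, D)                          # patch-embedding projection matrix E
x_cls = nn.Parameter(torch.zeros(1, 1, D))           # learnable [CLS] token
E_pos = nn.Parameter(torch.zeros(1, N + 1, D))       # learnable 1D position embeddings

x_p = torch.randn(1, N, patch_dim)                   # flattened image patches
z0 = torch.cat([x_cls, E(x_p)], dim=1) + E_pos       # input sequence to the first encoder layer
```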
4.4 ViT vs. CNN: Inductive Bias and Scaling
CNNs (like ResNet) possess Inductive Bias (Translation Invariance and Locality). ViT lacks these, which leads to distinct performance trade-offs:
- Data Hunger: ViT generally requires significantly more training data (e.g., JFT-300M) because it must learn the concept of spatial relationships from scratch.
- The Tipping Point: On small datasets (ImageNet-1k), CNNs often outperform ViT. However, as data scales to "foundation model" sizes, ViT performance overtakes CNNs and scales much more efficiently.
4.5 Why ViT Works: Global Context
The primary advantage of the attention mechanism over convolution is the Global Receptive Field.
- Global Attention: In a CNN, a pixel only "sees" its immediate neighbors. In ViT, every patch can attend to every other patch in the entire image from the very first layer.
- Feature Evolution:
- Early Layers: The model learns to focus on both local edges and global structures simultaneously.
- Later Layers: The model focuses on complex object-level reasoning across the entire image.
Tutor Insight: Think of a CNN as someone exploring a dark room with a small flashlight—they see edges and corners first and slowly piece the room together. ViT is like someone who turns on the overhead lights instantly; they see the "big picture" immediately, but they might need to look at thousands of different rooms to understand what a "kitchen" generally looks like.
Deep Dive: Why 1D Position Embeddings? Interestingly, the ViT paper found that 2D position embeddings (explicitly telling the model X and Y coordinates) didn't provide a significant boost over simple 1D sequences. The model is powerful enough to learn the 2D structure of the image just from the 1D order!
4.6 Deep Dive: Learnable Positional Embeddings
In the original ViT paper, the authors experimented with various ways to tell the model "where" a patch is located. They reached a surprising conclusion that simplified the architecture significantly.
- Learnable vs. Fixed: Unlike the original Transformer which used fixed sine/cosine functions, ViT uses learnable parameters ($E_{pos} \in \mathbb{R}^{(N+1) \times D}$). These are initialized randomly and optimized via backpropagation.
- 1D vs. 2D: Even though an image is 2D, ViT typically uses a 1D sequence of embeddings. You might think this would hide the 2D structure, but it doesn't.
- The "Discovery": Because the model is trained on millions of images, the self-attention mechanism learns that patch #1 is spatially close to patch #2 (horizontally) and patch #17 (vertically, if the grid is 16x16).
Evidence of 2D Learning
If we visualize the Cosine Similarity between the learned positional embeddings, we see a clear 2D grid pattern:
- Embeddings for patches in the same row have high similarity.
- Embeddings for patches in the same column have high similarity.
- The model effectively "reconstructs" the 2D map of the image inside its 1D vector space.
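This visualization is easy to reproduce; a hedged sketch follows, where a random tensor stands in for a trained ViT's position embeddings and a $14 \times 14$ patch grid is assumed:

```python
import torch
import torch.nn.functional as F

pos_emb = torch.randn(196, 768)            # placeholder for trained position embeddings
normed = F.normalize(pos_emb, dim=-1)
sim = normed @ normed.T                    # (196, 196) cosine-similarity matrix

# Reshape one row into the 14x14 grid: in a trained model, that patch's row and
# column "light up" with high similarity, revealing the learned 2D structure.
row_0_as_grid = sim[0].reshape(14, 14)
```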
Tutor Insight: This is a perfect example of Representation Learning. We don't have to hard-code the rules of geometry into the model (like we do with the "sliding window" in CNNs). Given enough data, the Transformer inductively learns that images have a 2D structure because that is the most efficient way to reduce the loss function.
5. CLIP: Contrastive Language-Image Pre-training
CLIP (OpenAI, 2021) represents a shift from "learning to label" to "learning to match." Traditionally, computer vision models were limited to a fixed set of classes. CLIP breaks this by learning from 400 million image-text pairs found on the internet.
5.1 The Process: Pre-training vs. Inference
The CLIP workflow is divided into two distinct phases:
- Contrastive Pre-training: The model learns a shared embedding space using a batch of $N$ image-text pairs.
- Optimizing the Diagonal: In a batch of $N$ pairs, there are $N^2$ possible combinations. However, only the pairs where the indices match (e.g., $I_1$ with $T_1$) are correct. These form the diagonal of the similarity matrix. Training focuses on maximizing these diagonal values while minimizing all off-diagonal "noise" (incorrect pairings).
- Zero-Shot Prediction: At test time, the model predicts which text prompt best matches the input image vector by calculating the highest cosine similarity.
graph TD
subgraph Training_Process [1. Training: Contrastive Alignment]
direction TB
Batch_I[Image Batch i=1..N] --> I_Enc[Image Encoder]
Batch_T[Text Batch j=1..N] --> T_Enc[Text Encoder]
I_Enc --> Matrix[NxN Similarity Matrix]
T_Enc --> Matrix
Matrix -- "InfoNCE Loss" --> Update[Optimize Diagonal]
end
%% Added a spacer link to ensure vertical stacking
Update --- Spacer(( )) --- New_Img
style Spacer label:none,fill:none,stroke:none
subgraph Inference_Process [2. Inference: Zero-Shot Classification]
direction TB
New_Img[Input Image] --> I_Enc_Inf[Image Encoder]
Prompts["'A photo of a cat' <br> 'A photo of a dog' <br> 'A photo of a car'"] --> T_Enc_Inf[Text Encoder]
I_Enc_Inf --> Sim_Rank[Ranking by Cosine Similarity]
T_Enc_Inf --> Sim_Rank
Sim_Rank --> Top1[Predicted Label]
end
style Training_Process fill:#fdf2f2,stroke:#a94442
style Inference_Process fill:#f2f9ff,stroke:#31708f
5.2 Mathematical Core: InfoNCE Loss
To train the shared space, CLIP uses the InfoNCE (Information Noise Contrastive Estimation) loss. It is essentially a Softmax over the similarity scores:
$$\mathcal{L}_{i}^{(I)} = -\log \frac{\exp(\text{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j) / \tau)}$$
- $\text{sim}(I, T)$: Cosine similarity between vectors.
- Numerator (The Match): This value is high when the image $I_i$ and text $T_i$ are semantically close. As they align in the embedding space, their cosine similarity approaches $1$, maximizing the numerator.
- Denominator (The Noise): This is the sum of scores for all text prompts in the batch. It is large because it includes the correct match plus $N-1$ incorrect matches. The model learns to "push" the numerator so high that it outweighs the collective "noise" of the denominator.
- Temperature ($\tau$): A learnable parameter that scales the distribution.
- A small $\tau$ (< 1) makes the distribution "sharper," forcing the model to be more confident and punishing small errors severely.
- A large $\tau$ makes the distribution "flatter," allowing for more overlap/uncertainty in the embedding space.
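A minimal PyTorch sketch of this loss; the symmetric text-to-image half is included as well, as in CLIP, and a fixed $\tau$ stands in for the learnable temperature:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, tau=0.07):
    """image_emb, text_emb: (N, d) batches where row i of each is a matching pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / tau          # (N, N) similarity matrix
    targets = torch.arange(len(logits))            # correct matches lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)    # text -> image direction
    return (loss_i + loss_t) / 2
```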
5.3 Why Contrastive Learning Works
Unlike "Generative" training (predicting pixels), Contrastive Learning is efficient because the model only needs to learn to distinguish concepts rather than reconstruct them from scratch.
CLIP is significantly more robust than traditional models like ResNet-101:
- Generalization: CLIP maintains high accuracy on sketches, cartoons, or distorted images because it has learned the linguistic concept (e.g., "dog-ness") rather than just specific pixel textures.
| Feature | Standard Computer Vision (CV) | CLIP |
|---|---|---|
| Labels | Fixed (e.g., 1,000 classes) | Open (Natural Language) |
| Learning Task | Predicting a Class Index | Matching Image to Text |
| Flexibility | Needs retraining for new labels | Zero-shot (No retraining) |
| Strength | High precision on specific tasks | High robustness and generalization |
Tutor Insight: While a standard CV model is like a specialist who only knows 1,000 words, CLIP is like a generalist who has read the whole dictionary and can apply that knowledge to pictures.
5.4 Zero-Shot Classification & Prompt Engineering
CLIP's classification relies on "Prompting."
- The Issue: A single word like "boxer" is ambiguous (is it an athlete or a dog?).
- The Fix: Using a template like "A photo of a {label}, a type of dog" provides the Text Encoder with context, significantly boosting accuracy over using raw labels.
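A hedged end-to-end sketch of zero-shot classification with prompt templates, using the Hugging Face `transformers` CLIP wrappers (the checkpoint name, labels, and blank placeholder image are illustrative; substitute a real photo):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["boxer", "beagle", "poodle"]
prompts = [f"A photo of a {label}, a type of dog." for label in labels]
image = Image.new("RGB", (224, 224))               # placeholder; use a real image here

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image      # (1, 3) scaled cosine similarities
prediction = labels[logits.argmax(dim=-1).item()]  # highest-scoring prompt wins
```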
5.5 CLIP’s Impact & Downstream Applications
CLIP is rarely used as a standalone classifier; its true impact is acting as a Multimodal Bridge:
- Image Search: Search a database of images using natural language queries without manual tagging.
- Generative AI (DALL-E / Stable Diffusion): The CLIP text encoder guides the image generation process to ensure the output matches the user's prompt.
- Robustness to Distribution Shift: Because CLIP learns language-based concepts, it is far more robust to "Sketches," "Cartoons," or distortions than models trained on standard datasets like ImageNet.
| Metric | ImageNet Model (ResNet) | CLIP (Zero-Shot) |
|---|---|---|
| Accuracy on Photos | Very High | High |
| Accuracy on Sketches | Low (Fails) | High |
| Accuracy on Distortions | Low (Fails) | High |
Tutor Insight: The "Secret Sauce" of CLIP isn't just the architecture—it's the Scale of the data and the Contrastive Loss. By comparing 400 million pairs, the model learns a "universal" understanding of how the visual world is described by humans.
6. Summary: Model Paradigms and Comparisons
The above models can be categorized by their "paradigm" — the fundamental way they process and relate to data.
6.1 Three Paradigms of Learning
- Generative Models: These models learn the underlying probability distribution of a dataset to produce entirely new, synthetic samples. By understanding how data is "made," they can generate original content, such as GPT predicting the next token in a sentence.
- Discriminative Models: These models learn the decision boundaries between different categories to classify or label input data. Rather than creating data, they focus on identifying what a specific input is, such as BERT classifying sentiment or ViT identifying an object in a photo.
- Contrastive Models: These models learn to align different data types (modalities) into a shared mathematical space by maximizing the similarity of matching pairs. Instead of labeling or creating, they focus on "matchmaking" between different inputs, like CLIP connecting a caption to its corresponding image.
6.2 Comparison Table: BERT, GPT, ViT, and CLIP
| Model | Paradigm | Primary Function | Core Strength |
|---|---|---|---|
| BERT | Discriminative | Natural Language Understanding (NLU) | Deep contextual understanding of entire sentences. |
| GPT | Generative | Natural Language Generation (NLG) | Creating highly fluent, human-like original content. |
| ViT | Discriminative | Image Classification | Scaling vision tasks using the Transformer's global attention. |
| CLIP | Contrastive | Multimodal Alignment | Connecting text and images in a robust, zero-shot way. |
Tutor Insight: If you’re ever confused on an exam, remember the "Art Room" analogy. GPT is the student drawing a new picture; BERT and ViT are the judges identifying what is in the picture; and CLIP is the librarian matching the right title to the right book.