How Transformer Networks Work

Tokenization

  • BERT-demo
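
Before text reaches the network it is split into tokens, and unfamiliar words are broken into subword pieces. A minimal sketch of the greedy longest-match idea behind subword tokenizers such as BERT's WordPiece (the tiny vocabulary below is an illustrative assumption, not BERT's real one):

```python
# WordPiece-style greedy longest-match tokenization (sketch).
# The toy vocabulary is an illustrative assumption, not BERT's real one.
VOCAB = {"trans", "##form", "##er", "##s", "play", "##ing"}

def wordpiece(word):
    """Split one word into subword tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:                 # continuation pieces are marked "##"
                piece = "##" + piece
            if piece in VOCAB:
                tokens.append(piece)
                break
            end -= 1
        if end == start:                  # no vocabulary piece matched
            return ["[UNK]"]
        start = end
    return tokens

print(wordpiece("transformers"))  # ['trans', '##form', '##er', '##s']
print(wordpiece("playing"))       # ['play', '##ing']
print(wordpiece("zebra"))         # ['[UNK]']
```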

Word Embeddings

  • WordEmbeddingDemo
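
Each token is then mapped to a vector of numbers. A minimal sketch with made-up 4-dimensional vectors (real embeddings have hundreds of learned dimensions); it shows similarity measured by cosine and the classic king − man + woman ≈ queen relation:

```python
import numpy as np

# Toy 4-dimensional embeddings (made-up values for illustration;
# real models learn vectors with hundreds of dimensions).
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.5, 0.9, 0.0, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Related words get similar vectors, unrelated words don't:
print(round(cosine(embeddings["king"], embeddings["queen"]), 2))  # 0.66
print(round(cosine(embeddings["king"], embeddings["apple"]), 2))  # 0.0

# Directions can encode relations: king - man + woman is closest to queen.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(max(embeddings, key=lambda w: cosine(embeddings[w], target)))  # queen
```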

N-gram models for prediction

  • Bigrams
  • Trigrams
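
A bigram model predicts the next word purely from counts of adjacent word pairs; a trigram model extends the same idea by conditioning on the previous two words. A minimal bigram sketch over a toy corpus (the corpus is an illustrative assumption):

```python
from collections import Counter, defaultdict

# Toy corpus (an illustrative assumption); real n-gram models are
# built from millions of sentences.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict(prev):
    """Most likely next word given the previous word, with its probability."""
    counts = bigram_counts[prev]
    word, n = counts.most_common(1)[0]
    return word, n / sum(counts.values())

print(predict("the"))   # ('cat', 0.5) -- "the" is followed by "cat" 2/4 times
```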

Attention Heads

  • BERT-demo
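
An attention head lets each token weigh every other token when computing its new representation. A minimal numpy sketch of scaled dot-product attention; the query/key/value projections Wq, Wk, Wv are standard, but their random toy values here are assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

# Three token positions, 4-dimensional vectors (random toy values).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))             # token representations
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
print(np.round(w, 2))   # each row sums to 1: where each token "looks"
```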

One-Layer Networks

  • Transformations possible with one layer of weights
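
A single layer of weights can only draw a linear boundary through its inputs, so some functions are within reach and others are not. A minimal sketch with hand-picked weights (an illustrative assumption): one layer handles AND and OR, but no single layer can compute XOR.

```python
import numpy as np

def one_layer(x, W, b):
    """A single layer of weights: a linear map followed by a threshold."""
    return (x @ W + b > 0).astype(int)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# One layer can compute AND and OR (weights chosen by hand here)...
print(one_layer(inputs, np.array([[1.0], [1.0]]), -1.5).ravel())  # AND: 0 0 0 1
print(one_layer(inputs, np.array([[1.0], [1.0]]), -0.5).ravel())  # OR:  0 1 1 1
# ...but no single layer separates XOR (0 1 1 0): that needs two layers.
```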

Transformer Architecture

  • GPT-3, LaMDA, ...
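
A transformer stacks many identical blocks, each combining self-attention with a small feed-forward network, plus residual ("skip") connections and layer normalization. A simplified single-head sketch with random toy weights (the layer sizes and initialization here are assumptions; GPT-3-class models use thousands of dimensions and dozens of such blocks):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                      # model dimension (toy size)

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Random toy weights; a real model learns billions of these values.
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1

def transformer_block(x):
    """One block: self-attention, then a feed-forward layer, each wrapped
    in a residual connection with layer normalization."""
    h = layer_norm(x)
    att = softmax((h @ Wq) @ (h @ Wk).T / np.sqrt(d)) @ (h @ Wv) @ Wo
    x = x + att                              # residual connection
    x = x + np.maximum(0.0, layer_norm(x) @ W1) @ W2   # feed-forward (ReLU)
    return x

tokens = rng.normal(size=(5, d))             # 5 token vectors enter the block
print(transformer_block(tokens).shape)       # (5, 8): same shape out as in
```

Because each block maps token vectors to token vectors of the same shape, blocks can be stacked; a final output layer turns the last block's vectors into word probabilities.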

Training

  • Word prediction training data
  • Fine-tuning of BERT on specific tasks
  • GPT-3 training: https://www.youtube.com/watch?v=VPRSBzXzavo
    • Generative pre-training
    • Supervised fine-tuning from human examples
    • RLHF (Reinforcement Learning from Human Feedback)
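
In generative pre-training, every position in the text supplies one training example: predict the next word from the words so far, scored by cross-entropy. A minimal sketch (the sentence and the model's output probabilities are made up for illustration):

```python
import numpy as np

# Word-prediction training data: every position in a sentence yields one
# (context, next-word) example.  Toy sentence for illustration:
words = ["the", "cat", "sat", "on", "the", "mat"]
examples = [(words[:i], words[i]) for i in range(1, len(words))]
for ctx, target in examples:
    print(ctx, "->", target)

# The model is trained to give the true next word a high probability.
# Cross-entropy loss on one example, with made-up model probabilities:
probs = {"mat": 0.4, "dog": 0.3, "sofa": 0.3}    # assumed model output
loss = -np.log(probs["mat"])                     # lower when model is confident
print(round(loss, 3))                            # 0.916
```

Supervised fine-tuning and RLHF then adjust the pre-trained weights further, using human-written examples and human preference judgments rather than raw next-word counts.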