How Transformer Networks Work
Tokenization
- BERT-demo
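
A minimal sketch of the idea behind BERT's subword tokenization: greedy longest-match against a vocabulary. The tiny vocabulary and the `tokenize_word` helper here are made up for illustration; real BERT uses a WordPiece vocabulary of about 30,000 entries.

```python
# Toy WordPiece-style tokenizer: greedy longest-match against a small
# hypothetical vocabulary. Continuation pieces are marked with "##".
VOCAB = {"play", "##ing", "##ed", "the", "dog", "un", "##happy", "[UNK]"}

def tokenize_word(word):
    """Split one word into subword tokens by greedy longest-match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1                          # shrink the match and retry
        if piece is None:                     # nothing in the vocabulary fits
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

print(tokenize_word("playing"))   # ['play', '##ing']
print(tokenize_word("unhappy"))   # ['un', '##happy']
```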
Word Embeddings
- WordEmbeddingDemo
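
The WordEmbeddingDemo shows words as points in a high-dimensional space, where nearby points have related meanings. A rough numpy sketch with made-up 4-dimensional vectors (real embeddings use hundreds of dimensions, learned from data rather than hand-chosen):

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, hand-built for illustration.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.5, 0.9, 0.0, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.1, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, near 0 = unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high
print(cosine(embeddings["king"], embeddings["apple"]))  # low

# The classic analogy: king - man + woman is closest to queen.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max(embeddings, key=lambda w: cosine(embeddings[w], target))
print(best)  # 'queen' with these toy vectors
```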
N-gram models for prediction
- Bigrams
- Trigrams
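
An n-gram model predicts the next word purely from counts over a corpus: a bigram model conditions on one previous word, a trigram model on two. A minimal sketch over a tiny made-up corpus:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Bigram model: count how often each word follows each context word.
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def predict_next(word):
    """Probability of each follower of `word`, from raw counts."""
    followers = bigrams[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

print(predict_next("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}

# Trigram model: same idea, but the context is the previous two words.
trigrams = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigrams[(w1, w2)][w3] += 1

print(trigrams[("the", "cat")])  # Counter({'sat': 1, 'ate': 1})
```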
Attention Heads
- BERT-demo
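
An attention head scores every token against every other token, then mixes their value vectors according to those scores. A single-head numpy sketch with random toy weights; the sizes are assumptions for illustration (BERT-base uses 12 heads of 64 dimensions each):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # how strongly each query matches each key
    weights = softmax(scores)       # each row is a distribution summing to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d = 4, 8                       # toy sizes
x = rng.normal(size=(n_tokens, d))       # token vectors entering the head
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = attention(x @ Wq, x @ Wk, x @ Wv)
print(weights.round(2))   # one row per token: its attention distribution
```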
One-Layer Networks
- Transformations possible with one layer of weights
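
A single layer of weights computes a linear map, so it can rotate, scale, shear, or project its input, but nothing more. A numpy sketch of two such transformations, and of why stacking linear layers alone adds no power:

```python
import numpy as np

# One layer of weights computes y = W x: a linear transformation.
theta = np.pi / 2
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])   # 90-degree rotation
scale = np.array([[2.0, 0.0],
                  [0.0, 0.5]])                         # stretch x, squash y

x = np.array([1.0, 1.0])
print(rotate @ x)   # approximately [-1, 1]
print(scale @ x)    # [2, 0.5]

# Composing two linear layers is still a single linear map (no new power):
print((scale @ rotate) @ x, scale @ (rotate @ x))      # identical results
```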
Transformer Architecture
- GPT-3, LaMDA, ...
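
A rough numpy sketch of one transformer layer, single-headed and untrained for brevity: self-attention and a two-layer MLP, each wrapped in a residual connection with layer normalization. Real models stack many such layers (GPT-3 uses 96, with multi-head attention and a model dimension of 12,288); the sizes below are toy assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def transformer_block(x, params):
    """One pre-norm transformer layer, single head for simplicity."""
    Wq, Wk, Wv, Wo, W1, W2 = params
    h = layer_norm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    att = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = x + att @ Wo                       # residual around attention
    h = layer_norm(x)
    x = x + np.maximum(0, h @ W1) @ W2     # residual around ReLU MLP
    return x

rng = np.random.default_rng(0)
d, n = 16, 5                               # toy model width and token count
params = [rng.normal(size=s) * 0.1 for s in
          [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
x = rng.normal(size=(n, d))                # 5 token vectors
print(transformer_block(x, params).shape)  # (5, 16): same shape in and out
```

Because each layer maps n token vectors to n token vectors of the same shape, layers can be stacked to any depth.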
Training
- Word prediction training data (see the sketch after this list)
- Fine tuning of BERT on specific tasks
- GPT-3 training: https://www.youtube.com/watch?v=VPRSBzXzavo
  - Generative pre-training
  - Supervised fine-tuning from human examples
  - RLHF (Reinforcement Learning from Human Feedback)
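
Word-prediction training data needs no human labels: sliding a window over ordinary text yields one (context, next-word) example per position. A minimal sketch of how such pairs are constructed (context size chosen arbitrarily here):

```python
# Each position in the text supplies one self-supervised training example.
text = "the cat sat on the mat".split()
context_size = 3

pairs = [(text[i:i + context_size], text[i + context_size])
         for i in range(len(text) - context_size)]
for context, target in pairs:
    print(context, "->", target)
# ['the', 'cat', 'sat'] -> on
# ['cat', 'sat', 'on'] -> the
# ['sat', 'on', 'the'] -> mat
```

Supervised fine-tuning and RLHF then adjust the pre-trained weights using much smaller, human-curated datasets.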