How Transformer Networks Work

Tokenization

  • BERT-demo
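
Before text reaches the network it is split into tokens, and unfamiliar words are broken into subword pieces. A minimal sketch of the greedy longest-match idea behind subword tokenizers such as BERT's WordPiece (the tiny vocabulary below is an illustrative assumption, not BERT's real one):

```python
# WordPiece-style greedy longest-match tokenization (sketch).
# The toy vocabulary is an illustrative assumption, not BERT's real one.
VOCAB = {"trans", "##form", "##er", "##s", "play", "##ing"}

def wordpiece(word):
    """Split one word into subword tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:                 # continuation pieces are marked "##"
                piece = "##" + piece
            if piece in VOCAB:
                tokens.append(piece)
                break
            end -= 1
        if end == start:                  # no vocabulary piece matched
            return ["[UNK]"]
        start = end
    return tokens

print(wordpiece("transformers"))  # ['trans', '##form', '##er', '##s']
print(wordpiece("playing"))       # ['play', '##ing']
print(wordpiece("zebra"))         # ['[UNK]']
```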

Word Embeddings

  • WordEmbeddingDemo
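
Each token is then mapped to a vector of numbers. A minimal sketch with made-up 4-dimensional vectors (real embeddings have hundreds of learned dimensions); it shows similarity measured by cosine and the classic king − man + woman ≈ queen relation:

```python
import numpy as np

# Toy 4-dimensional embeddings (made-up values for illustration;
# real models learn vectors with hundreds of dimensions).
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.5, 0.9, 0.0, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Related words get similar vectors, unrelated words don't:
print(round(cosine(embeddings["king"], embeddings["queen"]), 2))  # 0.66
print(round(cosine(embeddings["king"], embeddings["apple"]), 2))  # 0.0

# Directions can encode relations: king - man + woman is closest to queen.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(max(embeddings, key=lambda w: cosine(embeddings[w], target)))  # queen
```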

N-gram models for prediction

  • Bigrams
  • Trigrams
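
A bigram model predicts the next word purely from counts of adjacent word pairs; a trigram model extends the same idea by conditioning on the previous two words. A minimal bigram sketch over a toy corpus (the corpus is an illustrative assumption):

```python
from collections import Counter, defaultdict

# Toy corpus (an illustrative assumption); real n-gram models are
# built from millions of sentences.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict(prev):
    """Most likely next word given the previous word, with its probability."""
    counts = bigram_counts[prev]
    word, n = counts.most_common(1)[0]
    return word, n / sum(counts.values())

print(predict("the"))   # ('cat', 0.5) -- "the" is followed by "cat" 2/4 times
```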

Attention Heads

  • BERT-demo
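
An attention head lets each token weigh every other token when computing its new representation. A minimal numpy sketch of scaled dot-product attention; the query/key/value projections Wq, Wk, Wv are standard, but their random toy values here are assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

# Three token positions, 4-dimensional vectors (random toy values).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))             # token representations
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
print(np.round(w, 2))   # each row sums to 1: where each token "looks"
```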

One-Layer Networks

  • Transformations possible with one layer of weights
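
A single layer of weights can only draw a linear boundary through its inputs, so some functions are within reach and others are not. A minimal sketch with hand-picked weights (an illustrative assumption): one layer handles AND and OR, but no single layer can compute XOR.

```python
import numpy as np

def one_layer(x, W, b):
    """A single layer of weights: a linear map followed by a threshold."""
    return (x @ W + b > 0).astype(int)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# One layer can compute AND and OR (weights chosen by hand here)...
print(one_layer(inputs, np.array([[1.0], [1.0]]), -1.5).ravel())  # AND: 0 0 0 1
print(one_layer(inputs, np.array([[1.0], [1.0]]), -0.5).ravel())  # OR:  0 1 1 1
# ...but no single layer separates XOR (0 1 1 0): that needs two layers.
```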

Transformer Architecture

  • GPT-3, LaMDA, ...
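
A transformer stacks many identical blocks, each combining self-attention with a small feed-forward network, plus residual ("skip") connections and layer normalization. A simplified single-head sketch with random toy weights (the layer sizes and initialization here are assumptions; GPT-3-class models use thousands of dimensions and dozens of such blocks):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                      # model dimension (toy size)

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Random toy weights; a real model learns billions of these values.
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1

def transformer_block(x):
    """One block: self-attention, then a feed-forward layer, each wrapped
    in a residual connection with layer normalization."""
    h = layer_norm(x)
    att = softmax((h @ Wq) @ (h @ Wk).T / np.sqrt(d)) @ (h @ Wv) @ Wo
    x = x + att                              # residual connection
    x = x + np.maximum(0.0, layer_norm(x) @ W1) @ W2   # feed-forward (ReLU)
    return x

tokens = rng.normal(size=(5, d))             # 5 token vectors enter the block
print(transformer_block(tokens).shape)       # (5, 8): same shape out as in
```

Because each block maps token vectors to token vectors of the same shape, blocks can be stacked; a final output layer turns the last block's vectors into word probabilities.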

Training

  • Word prediction training data
  • Fine-tuning of BERT on specific tasks
  • GPT-3 training: https://www.youtube.com/watch?v=VPRSBzXzavo
    • Generative pre-training
    • Supervised fine-tuning from human examples
    • RLHF (Reinforcement Learning from Human Feedback)
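
In generative pre-training, every position in the text supplies one training example: predict the next word from the words so far, scored by cross-entropy. A minimal sketch (the sentence and the model's output probabilities are made up for illustration):

```python
import numpy as np

# Word-prediction training data: every position in a sentence yields one
# (context, next-word) example.  Toy sentence for illustration:
words = ["the", "cat", "sat", "on", "the", "mat"]
examples = [(words[:i], words[i]) for i in range(1, len(words))]
for ctx, target in examples:
    print(ctx, "->", target)

# The model is trained to give the true next word a high probability.
# Cross-entropy loss on one example, with made-up model probabilities:
probs = {"mat": 0.4, "dog": 0.3, "sofa": 0.3}    # assumed model output
loss = -np.log(probs["mat"])                     # lower when model is confident
print(round(loss, 3))                            # 0.916
```

Supervised fine-tuning and RLHF then adjust the pre-trained weights further, using human-written examples and human preference judgments rather than raw next-word counts.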