Attention is all you need

  • All images from Andrew Ng's Deeplearning.ai course 5: sequence models

  • Existing language models (RNN encoder-decoder)

    • Problems:
      • Encoder compresses the whole input into a single fixed-length vector
      • Network forgets early inputs when sequences get too long
  • Intuition of attention

    • Self attention
    • Look at a weighted window of the input sequence, plus the previous output, to generate the next output
    • w<t,t'> => weight on input position t' when generating output t
    • Weights are produced by a small neural network whose inputs are the previous decoder output/state and the encoder activations, then normalized with a softmax (see the sketch after this list)
    • Other variants exist for non-NLP applications

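A minimal NumPy sketch of the weight computation described above. The one-hidden-layer scoring network, the toy sizes, and the random inputs are illustrative assumptions, not values from the course.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes (assumptions): input length, encoder activation size, decoder state size
Tx, n_a, n_s = 5, 4, 4

rng = np.random.default_rng(0)
a = rng.normal(size=(Tx, n_a))   # encoder activations, one row per input position t'
s_prev = rng.normal(size=n_s)    # previous decoder state / output for step t-1

# Small scoring network (assumed form): e<t,t'> = v . tanh(W [s<t-1>; a<t'>])
W = rng.normal(size=(8, n_s + n_a))
v = rng.normal(size=8)

e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a_tp])) for a_tp in a])
w = softmax(e)      # attention weights w<t,t'>: non-negative, sum to 1
context = w @ a     # weighted sum of encoder activations, used to generate output t
```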

  • Transformer model

    • From 'Attention is all you need', https://arxiv.org/abs/1706.03762
    • Google Brain
    • No encoder RNN; attention only (no recurrence anywhere in the model)
    • Feed-forward architecture:
      • input => activation (hidden state)
      • activation + previous output => output
    • Multi-head attention
      • For each input word, compute h different attentions in parallel (h = 8 in the Transformer)
      • Allows each head to focus on a different part of the sentence (multi-head sketch after this list)
    • Positional encoding added to the input embeddings
      • Encodes position explicitly, since without recurrence the model has no notion of order (positional-encoding sketch after this list)
    • Masked attention in the decoder
      • Prevents attending to future outputs during training (mask sketch after this list)
    • Attention in the encoder (self-attention) as well as the decoder (masked self-attention, plus attention over the encoder output)

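A NumPy sketch of the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, wrapped in a multi-head self-attention layer. The random matrices stand in for the learned projections W_Q, W_K, W_V, W_O, and the token count and d_model = 64 are toy assumptions (the paper uses d_model = 512 with h = 8).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~0 weight
    return softmax(scores) @ V

def multi_head_self_attention(X, h=8, seed=0):
    """One multi-head self-attention layer over token matrix X of shape
    (T, d_model); random matrices stand in for the learned projections."""
    T, d_model = X.shape
    d_k = d_model // h
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo  # back to shape (T, d_model)

X = np.random.default_rng(1).normal(size=(10, 64))  # 10 tokens, d_model = 64
out = multi_head_self_attention(X)                  # (10, 64)
```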
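The paper's sinusoidal positional encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), added to the token embeddings before the first layer. The sizes below are toy assumptions.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings from the paper (assumes even d_model):
    even columns get sin, odd columns get cos, at geometrically spaced frequencies."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 64)   # toy sizes (assumptions)
# Added to the token embeddings before the first layer: X = embed(tokens) + pe[:T]
```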
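A sketch of the decoder's causal mask, meant to be passed as the `mask` argument of scaled_dot_product_attention above; the boolean convention (True = may attend) is an assumption of these sketches.

```python
import numpy as np

T = 5
# Lower-triangular boolean mask: True where output position t may attend to t' <= t
causal_mask = np.tril(np.ones((T, T), dtype=bool))

# Passed to scaled_dot_product_attention, this drives scores for future
# positions to -1e9, so their softmax weight is ~0 and the decoder cannot
# peek at later training targets.
print(causal_mask.astype(int))
```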
