What is the difference between ELMo and GPT in how they apply pre-trained representations to downstream tasks?
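As a hint, here is a minimal PyTorch sketch (not from the paper) contrasting the feature-based approach (a frozen pre-trained encoder whose outputs feed a task-specific model, ELMo-style) with fine-tuning (all pre-trained weights are updated on the task, GPT-style). The names `TaskHead` and `build_optimizer` are illustrative assumptions.

```python
import torch
from torch import nn

class TaskHead(nn.Module):
    """Small task-specific classifier placed on top of pre-trained features."""
    def __init__(self, hidden_size, n_classes):
        super().__init__()
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, features):
        return self.fc(features)

def build_optimizer(encoder, head, fine_tune):
    if fine_tune:
        # GPT-style fine-tuning: every pre-trained weight is updated on the task.
        params = list(encoder.parameters()) + list(head.parameters())
        return torch.optim.Adam(params, lr=2e-5)
    # ELMo-style feature extraction: the encoder is frozen, only the task head learns.
    for p in encoder.parameters():
        p.requires_grad = False
    return torch.optim.Adam(head.parameters(), lr=1e-3)

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=128, nhead=4), num_layers=2)
head = TaskHead(128, 2)
optimizer = build_optimizer(encoder, head, fine_tune=False)  # feature-based variant
```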
Is MLM (Masked Language Modeling) a language model? Why or why not?
How many [MASK] tokens do we encounter when training an MLM on a 3,300M-token corpus, according to Section 3.1? What is the purpose of the different replacement options for tokens selected for prediction in the MLM training task?
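As a concrete illustration, here is a minimal sketch (not the paper's code) of the corruption procedure from Section 3.1: 15% of positions are selected for prediction, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged. The constants `MASK_ID` and `VOCAB_SIZE` are assumed to match the standard bert-base-uncased WordPiece vocabulary, and the helper name `mask_tokens` is illustrative.

```python
import random

MASK_ID = 103          # [MASK] id in the bert-base-uncased vocabulary (assumption)
VOCAB_SIZE = 30522     # BERT-base vocabulary size (assumption)

def mask_tokens(token_ids, select_prob=0.15, seed=None):
    """Apply the 80/10/10 MLM corruption: 15% of positions are selected for
    prediction; of those, 80% become [MASK], 10% a random token, 10% stay as-is."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)       # -100 = position not predicted (ignored by the loss)
    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue
        labels[i] = tok                    # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID            # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

# Roughly 15% * 80% = 12% of positions end up as [MASK].
ids = list(range(1000, 1020))
corrupted, targets = mask_tokens(ids, seed=0)
```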
How is MLM different from denoising autoencoders?
What does NSP (Next Sentence Prediction) capture that language modeling doesn't? Is NSP needed, and if so, for which tasks?
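A minimal sketch (illustrative, not the authors' pipeline) of how NSP training pairs are built: half the time segment B is the actual next sentence (IsNext), half the time a random sentence from the corpus (NotNext). The helper name `make_nsp_pair` is an assumption.

```python
import random

def make_nsp_pair(doc_sentences, corpus_sentences, idx, rng=random):
    """Build one NSP pair: with probability 0.5 the second segment is the true
    next sentence (label 1, IsNext), otherwise a random sentence (label 0, NotNext)."""
    sent_a = doc_sentences[idx]
    if rng.random() < 0.5 and idx + 1 < len(doc_sentences):
        sent_b, label = doc_sentences[idx + 1], 1          # IsNext
    else:
        sent_b, label = rng.choice(corpus_sentences), 0    # NotNext
    return sent_a, sent_b, label

doc = ["the man went to the store", "he bought a gallon of milk"]
corpus = doc + ["penguins are flightless birds"]
sent_a, sent_b, label = make_nsp_pair(doc, corpus, 0)
```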
The authors mention using the [CLS] token vector as a sentence embedding for downstream classification tasks. Why? What other options are there for representing a sentence?
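For comparison, here is a small NumPy sketch of three common sentence-representation choices given token-level hidden states: the [CLS] vector, mean pooling, and max pooling over non-padding tokens. The function name `sentence_representations` and the toy shapes are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def sentence_representations(hidden_states, attention_mask):
    """Turn token-level hidden states of shape (seq_len, hidden_dim) into a
    single sentence vector in three common ways."""
    mask = attention_mask.astype(bool)
    cls_vec = hidden_states[0]                    # [CLS] token sits at position 0
    mean_vec = hidden_states[mask].mean(axis=0)   # mean over real (non-padding) tokens
    max_vec = hidden_states[mask].max(axis=0)     # element-wise max pooling
    return cls_vec, mean_vec, max_vec

# Toy example: 6 token positions (last two are padding), hidden size 4.
h = np.random.randn(6, 4)
m = np.array([1, 1, 1, 1, 0, 0])
cls_v, mean_v, max_v = sentence_representations(h, m)
```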