BERT (and variants) notes - USC-LHAMa/CSCI544_Project GitHub Wiki

"BERT"

What is BERT?

  • BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT is the first self-supervised, deeply bidirectional system for pre-training NLP.

  • Self-supervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.

  • Pre-trained representations can also either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free (static embedding) models such as word2vec or GloVe generate a single "word embedding" representation for each word in the vocabulary, so bank would have the same representation in bank deposit and in river bank. Contextual models instead generate a representation of each word that is based on the other words in the sentence.

  • Concept - mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
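
A quick way to see this masking objective in action is a fill-mask sketch. The snippet below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is specified in these notes); we mask one token ourselves and let the pre-trained model rank candidate fillers.

```python
from transformers import pipeline

# Minimal sketch (assumes Hugging Face transformers): predict a token that we mask ourselves.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the top candidate tokens for the single [MASK] position, with scores.
for prediction in fill_mask("the man went to the [MASK] to buy a gallon of milk."):
    print(prediction["token_str"], round(prediction["score"], 3))
```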

  • Using BERT has two stages: Pre-training and fine-tuning.

    • Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language (current models are English-only).

    • Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.
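
To make the two-stage idea concrete, here is a hypothetical minimal fine-tuning step, assuming the Hugging Face transformers library (the original BERT release ships its own TensorFlow scripts instead): the pre-trained encoder is loaded unchanged, a freshly initialized span-prediction head is stacked on top, and one labeled example updates all weights.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Hypothetical sketch: one SQuAD-style fine-tuning step on top of pre-trained BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")  # adds a new QA head
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

question = "What did the man buy?"
context = "The man went to the store. He bought a gallon of milk."
inputs = tokenizer(question, context, return_tensors="pt")

# Toy answer-span labels (token start/end indices); a real run derives these from SQuAD.
outputs = model(**inputs,
                start_positions=torch.tensor([13]),
                end_positions=torch.tensor([16]))
outputs.loss.backward()   # loss = sum of start/end cross-entropy
optimizer.step()
```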

Pre-Training

  • BERT pre-trains with two losses: Masked Language Model (MLM) and Next Sentence Prediction (NSP); a short sketch of the two corresponding heads follows this list
  • For specific NLP tasks such as question answering, an additional task-specific layer is usually placed on top of the model, and fine-tuning then updates the weights in all layers
  • ALBERT replaces NSP with a Sentence Order Prediction (SOP) loss; SOP is a harder task than NSP, so the resulting pre-trained model performs better on downstream tasks
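
As referenced above, a minimal sketch of the two pre-training heads, assuming the Hugging Face transformers implementation (its BertForPreTraining class exposes both the MLM and NSP outputs):

```python
from transformers import BertForPreTraining, BertTokenizer

# Hypothetical sketch: inspect BERT's two pre-training heads (MLM and NSP).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Sentence-pair input as used for NSP; the [MASK] token feeds the MLM head.
inputs = tokenizer("the man went to the [MASK] .",
                   "he bought a gallon of milk .",
                   return_tensors="pt")
outputs = model(**inputs)

print(outputs.prediction_logits.shape)        # MLM head: (batch, seq_len, vocab_size)
print(outputs.seq_relationship_logits.shape)  # NSP head: (batch, 2) -> is-next vs. not-next
```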

Parameters

  • BERT-Base has 110M parameters, BERT-Large has 340M
  • ALBERT-base has 12M parameters, ALBERT-Large 18M (the sketch below shows how to verify these counts)
    • The reduction comes mainly from cross-layer parameter sharing (plus a factorized embedding parameterization)
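
The counts above can be sanity-checked by loading the public checkpoints; a minimal sketch assuming Hugging Face transformers and the bert-base-uncased / albert-base-v2 checkpoints:

```python
from transformers import AutoModel

# Hypothetical sketch: count trainable parameters of the public BERT and ALBERT base checkpoints.
for checkpoint in ["bert-base-uncased", "albert-base-v2"]:
    model = AutoModel.from_pretrained(checkpoint)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {n_params / 1e6:.0f}M parameters")
```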

Code Samples
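
Nothing project-specific yet; as a placeholder, here is a minimal extractive-QA sketch. It assumes the Hugging Face transformers library and the publicly released bert-large-uncased-whole-word-masking-finetuned-squad checkpoint, neither of which has been decided on for the project.

```python
from transformers import pipeline

# Hypothetical sketch: extractive QA with a BERT checkpoint already fine-tuned on SQuAD.
qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

result = qa(question="What did the man buy?",
            context="The man went to the store. He bought a gallon of milk.")
print(result["answer"], result["score"])
```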

Other Notes

  • The ALBERT paper lists the hyperparameters it used for SQuAD and the other downstream tasks; we should start from those (see the config sketch below)
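
A possible way to keep those settings in one place, assuming we fine-tune with the Hugging Face Trainer (an assumption, not a decision); every value below is a generic placeholder, not a number from the paper.

```python
from transformers import TrainingArguments

# Hypothetical sketch: record the ALBERT paper's SQuAD hyperparameters in one place.
# All values are generic placeholders -- replace them with the paper's reported settings.
squad_args = TrainingArguments(
    output_dir="albert-squad",
    learning_rate=5e-5,               # placeholder
    per_device_train_batch_size=32,   # placeholder
    num_train_epochs=2,               # placeholder
    warmup_ratio=0.1,                 # placeholder
    weight_decay=0.01,                # placeholder
)
```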

Reference Papers

  • Devlin, Chang, Lee, Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/abs/1810.04805)
  • Lan, Chen, Goodman, Gimpel, Sharma, Soricut. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations" (https://arxiv.org/abs/1909.11942)