BERT (and variants) notes

What is BERT?
-
BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT is the first self-supervised, deeply bidirectional system for pre-training NLP.

Self-supervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.

Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single "word embedding" for each word in the vocabulary, so "bank" would have the same representation in "bank deposit" and "river bank". Contextual models instead generate a representation of each word that is based on the other words in the sentence (see the sketch below).
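
To see the "bank" example concretely, here is a minimal sketch (our addition, assuming the HuggingFace `transformers` and `torch` packages and the `bert-base-uncased` checkpoint): a context-free model would return the same vector for both sentences, while BERT's contextual embeddings of "bank" differ.

```python
# Sketch: contextual vs. context-free representations of the word "bank".
# Assumes: pip install torch transformers
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence: str) -> torch.Tensor:
    """Return BERT's contextual embedding of the token 'bank' in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_deposit = bank_vector("he made a bank deposit")
v_river = bank_vector("they sat on the river bank")
# word2vec/GloVe would give identical vectors here; BERT's differ because
# each embedding is conditioned on the surrounding words.
similarity = torch.nn.functional.cosine_similarity(v_deposit, v_river, dim=0).item()
print(f"cosine similarity between the two 'bank' embeddings: {similarity:.3f}")
```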

Concept - mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

    Input:  the man went to the [MASK1] . he bought a [MASK2] of milk.
    Labels: [MASK1] = store; [MASK2] = gallon
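
A hedged illustration of the same masked-word prediction with the HuggingFace fill-mask pipeline (our addition; note that BERT itself uses a single `[MASK]` token, and the `[MASK1]`/`[MASK2]` labels above are just notation):

```python
# Sketch: masked language modeling with a pre-trained BERT.
# Assumes: pip install transformers torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# One [MASK] per input keeps the output format simple.
for text in [
    "the man went to the [MASK] .",
    "he bought a [MASK] of milk .",
]:
    print(text)
    for pred in fill_mask(text, top_k=3):
        print(f"  {pred['token_str']:>10}  (score {pred['score']:.3f})")
```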

Using BERT has two stages: pre-training and fine-tuning.

Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but it is a one-time procedure for each language (current models are English-only).

Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single-system state of the art.
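
To make the fine-tuning stage concrete, here is a minimal sketch (our addition, separate from the notebooks linked under Code Samples) that fine-tunes `bert-base-uncased` for sentence classification on SST-2 with the HuggingFace `Trainer`; the task, dataset, and hyperparameters are illustrative choices, not the project's settings.

```python
# Sketch: fine-tune all BERT weights plus a small classification head.
# Assumes: pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # head is new; encoder is pre-trained

dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-base-sst2",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,  # in the 5e-5/3e-5/2e-5 range recommended by the BERT paper
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
print(trainer.evaluate())
```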

Pre-Training
-
- BERT pre-trains on Masked Language Model (MLM) and Next Sentence Prediction (NSP) losses
- For specific NLP tasks such as question answering (QA), an additional task-specific layer is placed on top of the model, and fine-tuning then updates the weights in all layers (see the sketch after this list)
- ALBERT replaces NSP with a Sentence Order Prediction (SOP) loss; SOP is more difficult than NSP, so the pre-trained model performs better on downstream tasks
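
A hedged sketch of that "additional layer" for question answering (our addition): HuggingFace's `AutoModelForQuestionAnswering` stacks a span-prediction head (start/end logits) on top of the pre-trained encoder, and the head is randomly initialized until the model is fine-tuned, e.g. on SQuAD.

```python
# Sketch: pre-trained BERT encoder + task-specific QA head.
# Assumes: pip install transformers torch
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "What did he buy?"
context = "The man went to the store. He bought a gallon of milk."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Most likely answer span; meaningless until the QA head is fine-tuned.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```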

Parameters
-
- BERT-base has 110M parameters, BERT-large has 340M
- ALBERT-base has 12M parameters, ALBERT-large has 18M
- ALBERT gets most of this reduction from sharing parameters across layers (plus a factorized embedding parameterization); see the count check below
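
The counts above are easy to sanity-check; here is a small sketch (our addition) that loads each checkpoint from the HuggingFace hub and counts parameters. ALBERT's totals stay small because the shared Transformer layer weights are stored once and reused at every depth.

```python
# Sketch: count parameters of BERT vs. ALBERT checkpoints.
# Assumes: pip install transformers torch (downloads each checkpoint)
from transformers import AutoModel

for name in ["bert-base-uncased", "bert-large-uncased",
             "albert-base-v2", "albert-large-v2"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name:20s} {n_params / 1e6:6.1f}M parameters")
```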

Code Samples
-
- BERT-base fine-tuning with a Colab TPU - https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb
- HuggingFace PyTorch + TensorFlow comparison for BERT-base - https://github.com/huggingface/transformers/blob/master/notebooks/Comparing-PT-and-TF-models.ipynb

Other Notes
-
- The ALBERT paper lists the hyperparameters used for SQuAD and the other downstream tasks; we should start our fine-tuning runs from those (see the placeholder template below)
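
A placeholder template (our addition) showing where those hyperparameters would plug in when fine-tuning with HuggingFace's `TrainingArguments`; every numeric value below is a generic stand-in, not the ALBERT paper's setting, and should be replaced with the values from the paper's hyperparameter table.

```python
# Sketch: hyperparameter slots for a SQuAD fine-tuning run.
# All values are placeholders -- copy the real ones from the ALBERT paper.
from transformers import TrainingArguments

squad_args = TrainingArguments(
    output_dir="albert-squad",
    learning_rate=3e-5,              # placeholder
    num_train_epochs=2,              # placeholder
    per_device_train_batch_size=16,  # placeholder
    warmup_ratio=0.1,                # placeholder
    weight_decay=0.01,               # placeholder
)
```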