BERT (and variants) notes

What is BERT?
-
BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT is the first self-supervised, deeply bidirectional system for pre-training NLP.

Self-supervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.

Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single "word embedding" for each word in the vocabulary, so "bank" would have the same representation in "bank deposit" and "river bank". Contextual models instead generate a representation of each word that is based on the other words in the sentence (see the sketch below).
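
To see the "bank" example concretely, here is a minimal sketch (our addition, assuming the HuggingFace `transformers` and `torch` packages and the `bert-base-uncased` checkpoint): a context-free model would return the same vector for both sentences, while BERT's contextual embeddings of "bank" differ.

```python
# Sketch: contextual vs. context-free representations of the word "bank".
# Assumes: pip install torch transformers
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence: str) -> torch.Tensor:
    """Return BERT's contextual embedding of the token 'bank' in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_deposit = bank_vector("he made a bank deposit")
v_river = bank_vector("they sat on the river bank")
# word2vec/GloVe would give identical vectors here; BERT's differ because
# each embedding is conditioned on the surrounding words.
similarity = torch.nn.functional.cosine_similarity(v_deposit, v_river, dim=0).item()
print(f"cosine similarity between the two 'bank' embeddings: {similarity:.3f}")
```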

Concept - mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

    Input:  the man went to the [MASK1] . he bought a [MASK2] of milk.
    Labels: [MASK1] = store; [MASK2] = gallon
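
A hedged illustration of the same masked-word prediction with the HuggingFace fill-mask pipeline (our addition; note that BERT itself uses a single `[MASK]` token, and the `[MASK1]`/`[MASK2]` labels above are just notation):

```python
# Sketch: masked language modeling with a pre-trained BERT.
# Assumes: pip install transformers torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# One [MASK] per input keeps the output format simple.
for text in [
    "the man went to the [MASK] .",
    "he bought a [MASK] of milk .",
]:
    print(text)
    for pred in fill_mask(text, top_k=3):
        print(f"  {pred['token_str']:>10}  (score {pred['score']:.3f})")
```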

Using BERT has two stages: pre-training and fine-tuning.

Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but it is a one-time procedure for each language (current models are English-only).

Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single-system state of the art.
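
To make the fine-tuning stage concrete, here is a minimal sketch (our addition, separate from the notebooks linked under Code Samples) that fine-tunes `bert-base-uncased` for sentence classification on SST-2 with the HuggingFace `Trainer`; the task, dataset, and hyperparameters are illustrative choices, not the project's settings.

```python
# Sketch: fine-tune all BERT weights plus a small classification head.
# Assumes: pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # head is new; encoder is pre-trained

dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-base-sst2",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,  # in the 5e-5/3e-5/2e-5 range recommended by the BERT paper
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
print(trainer.evaluate())
```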

Pre-Training
-
- BERT pre-trains on Masked Language Model (MLM) and Next Sentence Prediction (NSP) losses
- For specific NLP tasks such as question answering (QA), an additional task-specific layer is placed on top of the model, and fine-tuning then updates the weights in all layers (see the sketch after this list)
- ALBERT replaces NSP with a Sentence Order Prediction (SOP) loss; SOP is more difficult than NSP, so the pre-trained model performs better on downstream tasks
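
A hedged sketch of that "additional layer" for question answering (our addition): HuggingFace's `AutoModelForQuestionAnswering` stacks a span-prediction head (start/end logits) on top of the pre-trained encoder, and the head is randomly initialized until the model is fine-tuned, e.g. on SQuAD.

```python
# Sketch: pre-trained BERT encoder + task-specific QA head.
# Assumes: pip install transformers torch
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "What did he buy?"
context = "The man went to the store. He bought a gallon of milk."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Most likely answer span; meaningless until the QA head is fine-tuned.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```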

Parameters
-
- BERT-base has 110M parameters, BERT-large has 340M
- ALBERT-base has 12M parameters, ALBERT-large has 18M
- ALBERT gets most of this reduction from sharing parameters across layers (plus a factorized embedding parameterization); see the count check below
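
The counts above are easy to sanity-check; here is a small sketch (our addition) that loads each checkpoint from the HuggingFace hub and counts parameters. ALBERT's totals stay small because the shared Transformer layer weights are stored once and reused at every depth.

```python
# Sketch: count parameters of BERT vs. ALBERT checkpoints.
# Assumes: pip install transformers torch (downloads each checkpoint)
from transformers import AutoModel

for name in ["bert-base-uncased", "bert-large-uncased",
             "albert-base-v2", "albert-large-v2"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name:20s} {n_params / 1e6:6.1f}M parameters")
```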

Code Samples
-
- BERT-base fine-tuning with a Colab TPU - https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb
- HuggingFace PyTorch + TensorFlow comparison for BERT-base - https://github.com/huggingface/transformers/blob/master/notebooks/Comparing-PT-and-TF-models.ipynb

Other Notes
-
- The ALBERT paper lists the hyperparameters used for SQuAD and the other downstream tasks; we should start our fine-tuning runs from those (see the placeholder template below)
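
A placeholder template (our addition) showing where those hyperparameters would plug in when fine-tuning with HuggingFace's `TrainingArguments`; every numeric value below is a generic stand-in, not the ALBERT paper's setting, and should be replaced with the values from the paper's hyperparameter table.

```python
# Sketch: hyperparameter slots for a SQuAD fine-tuning run.
# All values are placeholders -- copy the real ones from the ALBERT paper.
from transformers import TrainingArguments

squad_args = TrainingArguments(
    output_dir="albert-squad",
    learning_rate=3e-5,              # placeholder
    num_train_epochs=2,              # placeholder
    per_device_train_batch_size=16,  # placeholder
    warmup_ratio=0.1,                # placeholder
    weight_decay=0.01,               # placeholder
)
```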