Questions: Pretraining Language Models with Human Preferences
- According to this paper, what method for pretraining with human feedback is the optimal one? How does it work? (See the first sketch after this list.)
- Section 2 defines six objective functions (MLE, MLE with Filtering, Conditional Training, Unlikelihood, RWR, AWR). Which of these six objectives can take negative values L(x) for "toxic" training examples x, i.e. examples where R(x) < t? How does R(x) influence the value of L(x) under each of the six objectives, i.e. how does the R(x) < t case differ from the R(x) > t case, assuming x itself stays the same? (A simplified rendering of the objectives is sketched after this list.)
- How does the nucleus sampling strategy for text generation work? (A minimal top-p sampling sketch is given after this list.)
- According to Figure 2, UL (Unlikelihood) has a low misalignment score (i.e. a low frequency of undesirable content, here toxicity) for the Toxicity task, but a high one for PII and PEP8. At the same time, UL has a very high KL divergence from GPT-3 (i.e. very low "LM’s general capabilities", according to the paper) for the Toxicity task, but a very low one for PII. Do you have any hypotheses why? (A sketch of how such a KL can be estimated is given after this list.)
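
For the first question: conditional training is the objective the paper highlights as a simple, Pareto-optimal choice, and the sketch below shows the basic mechanism under simplifying assumptions — each training text is scored by a reward model, gets a `<|good|>` or `<|bad|>` control token prepended depending on whether its score clears a threshold, and pretraining then proceeds with ordinary MLE on the prefixed text. The `reward_model` function, the threshold value, and the document-level scoring are illustrative assumptions and may differ from the paper's exact granularity.

```python
# Minimal, hypothetical sketch of conditional training: prepend a control token
# based on a thresholded reward score, then train with standard MLE on the result.

GOOD, BAD = "<|good|>", "<|bad|>"

def reward_model(text: str) -> float:
    """Hypothetical stand-in for a classifier scoring how acceptable `text` is."""
    return 0.0  # placeholder score

def add_control_tokens(texts, threshold=0.5):
    """Prefix each text with <|good|> or <|bad|> depending on its reward score."""
    return [
        (GOOD if reward_model(t) >= threshold else BAD) + t
        for t in texts
    ]

# The prefixed corpus is then used for ordinary next-token-prediction pretraining.
corpus = ["a harmless sentence.", "some text the reward model dislikes."]
print(add_control_tokens(corpus))
```

The usual idea is that at inference time generation is conditioned on `<|good|>`, so the model can still learn from the "bad" portion of the corpus without being trained to reproduce it unconditionally.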
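For the second question, here is a deliberately simplified, sequence-level rendering of the six objectives as functions of log π(x) (always ≤ 0), the reward R(x), and the threshold t. The paper's exact definitions operate at the token/segment level and include details (the unlikelihood coefficient, the value head used by AWR) that are only crudely approximated here, so treat these forms as assumptions for reasoning, not as the paper's formulas.

```python
import math

# Simplified, sequence-level caricatures of the Section 2 objectives (assumptions,
# not the paper's exact token-level definitions).
# logp    ~ log pi(x), the LM log-likelihood of x (always <= 0)
# log1m_p ~ sum_i log(1 - pi(x_i | x_<i)), the "unlikelihood" term (also <= 0)
# R, t    ~ reward score of x and the threshold separating good from toxic examples

def L_mle(logp, R, t):
    return logp                                   # ignores R(x) entirely

def L_filtering(logp, R, t):
    return logp if R >= t else 0.0                # below-threshold examples are dropped

def L_conditional(logp_given_ctrl, R, t):
    # logp_given_ctrl = log pi(x | <|good|>) if R >= t else log pi(x | <|bad|>)
    return logp_given_ctrl                        # R(x) only selects the control token

def L_unlikelihood(logp, log1m_p, R, t, alpha=1.0):
    return logp if R >= t else alpha * log1m_p    # push toxic tokens' probability down

def L_rwr(logp, R, t, beta=1.0):
    return math.exp(R / beta) * logp              # reward rescales the log-likelihood

def L_awr(logp, R, t, value_estimate=0.0, beta=1.0):
    return math.exp((R - value_estimate) / beta) * logp   # advantage-weighted variant
```

Plugging a negative logp and an R below or above t into each function is enough to see which objectives ignore R(x), which zero out, and which switch to a different term for toxic examples.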
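For the nucleus sampling question, here is a minimal self-contained sketch of top-p (nucleus) sampling: at each generation step, keep the smallest set of highest-probability tokens whose cumulative probability reaches p, renormalize within that set, and sample from it. The function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token id via top-p (nucleus) sampling.

    probs: next-token probability distribution (non-negative, sums to 1).
    p:     cumulative-probability cutoff defining the nucleus.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative probability reaches p (always >= 1 token).
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy example: with p = 0.9 only the three most probable tokens can ever be sampled.
example = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
print(nucleus_sample(example, p=0.9))
```

Lowering p shrinks the nucleus and makes generation more conservative, while p = 1.0 recovers plain sampling from the full distribution.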
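As background for the Figure 2 question: "KL from GPT-3" is a divergence between the trained LM's distribution over texts and GPT-3's, and one standard way to estimate such a KL is by Monte Carlo over samples from the trained model. The sketch below is meant only to unpack what the metric measures; the `ToyLM` class and its `sample`/`log_prob` interface are hypothetical stand-ins, not the paper's evaluation protocol.

```python
import math
import random

class ToyLM:
    """Hypothetical stand-in for an LM over a tiny vocabulary of whole 'texts'."""
    def __init__(self, probs: dict):
        self.probs = probs
    def sample(self) -> str:
        texts, weights = zip(*self.probs.items())
        return random.choices(texts, weights=weights)[0]
    def log_prob(self, text: str) -> float:
        return math.log(self.probs[text])

def estimate_kl(policy: ToyLM, reference: ToyLM, num_samples: int = 10_000) -> float:
    """Monte Carlo estimate of KL(policy || reference):
    E_{x ~ policy}[ log policy(x) - log reference(x) ]."""
    total = 0.0
    for _ in range(num_samples):
        x = policy.sample()
        total += policy.log_prob(x) - reference.log_prob(x)
    return total / num_samples

# Toy check: a "fine-tuned" model that has drifted from its "reference" model.
finetuned = ToyLM({"a": 0.7, "b": 0.2, "c": 0.1})
reference = ToyLM({"a": 0.4, "b": 0.4, "c": 0.2})
print(estimate_kl(finetuned, reference))   # larger drift -> larger KL
```

A larger estimate means the fine-tuned model has drifted further from the reference distribution, which the paper treats as a proxy for lost general LM capabilities.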