Questions Cloze

  1. Propose at least one PVP for a task of predicting the age range of the author of a given text (possible ranges being 15-25, 25-55, 55+).
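For orientation: a PVP = (P, v) consists of a pattern P that maps an input to a cloze sentence with a single masked slot, and a verbalizer v that maps each label to one token. A minimal sketch of one possible PVP for this task (the pattern wording and verbalizer tokens below are our own illustration, not from the paper):

```python
# One possible PVP for the age-range task (illustrative wording only).

def pattern(text: str) -> str:
    """P(x): embed the input text in a cloze sentence with one masked slot."""
    return f'"{text}" - The author of this text is [MASK].'

verbalizer = {
    "15-25": "young",  # v(y0)
    "25-55": "adult",  # v(y1) -- must be a single token, unlike e.g. "middle-aged"
    "55+":   "old",    # v(y2)
}
```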

  2. Imagine a task as in the "Mia likes pie" example, i.e. binary classification with labels y0 = contradiction and y1 = no contradiction. Suppose we have an unlabeled example x = (Mia likes apples, Mia hates apples) and three models M_p1, M_p2 and M_p3 (fine-tuned on PVPs p1, p2 and p3, respectively):

M_p1(v1(y0)|P1(x)) = 8
M_p1(v1(y1)|P1(x)) = 5
accuracy(p1) = 0.5

M_p2(v2(y0)|P2(x)) = 2 
M_p2(v2(y1)|P2(x)) = 1
accuracy(p2) = 0.4

M_p3(v3(y0)|P3(x)) = 20 
M_p3(v3(y1)|P3(x)) = 30
accuracy(p3) = 0.1

The soft-labeled training set T_C will then contain the pair (x, q). What is q for weighted and for uniform PET?
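As a reminder, the ensemble score is s(y|x) = (1/Z) * sum_p w(p) * M_p(v(y)|P(x)), with w(p) = 1 for uniform PET and w(p) = accuracy(p) for weighted PET; q is then a softmax over s with temperature T = 2 (the value used in the paper). A minimal sketch of this computation on the numbers above:

```python
import math

# Raw scores s_p(y|x) = M_p(v(y)|P(x)) from the example above.
scores   = {"p1": (8, 5), "p2": (2, 1), "p3": (20, 30)}   # (y0, y1)
accuracy = {"p1": 0.5, "p2": 0.4, "p3": 0.1}

def soft_label(weights, T=2.0):
    """Weighted average of the raw scores, then a temperature-T softmax."""
    Z = sum(weights.values())
    s = [sum(w * scores[p][y] for p, w in weights.items()) / Z for y in (0, 1)]
    exps = [math.exp(v / T) for v in s]
    return [e / sum(exps) for e in exps]

q_uniform  = soft_label({p: 1.0 for p in scores})  # s = (10, 12)
q_weighted = soft_label(accuracy)                  # s = (6.8, 5.9)
print(q_uniform, q_weighted)
```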

  3. What is the difference between PML and MLM (both in general and how they are used in the paper)?

  4. Explain why and how distillation is beneficial in PET and in iPET.
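For context: in the distillation step, the final classifier C is trained on T_C by minimizing cross-entropy against the soft labels q (equivalent to KL divergence up to a constant) rather than against one-hot labels. A minimal sketch of that loss (function and variable names are our own, not from any PET implementation):

```python
import numpy as np

def soft_label_cross_entropy(logits: np.ndarray, q: np.ndarray) -> float:
    """Cross-entropy between the ensemble's soft label q and the
    classifier's predicted distribution softmax(logits)."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return float(-np.sum(q * log_probs))

# e.g. classifier logits for (y0, y1), trained against a soft label q:
print(soft_label_cross_entropy(np.array([0.3, 1.1]), np.array([0.27, 0.73])))
```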

  5. Read the last sentence of Section 3.2. Why not train directly on each x in D? Why is the LM never asked to predict the masked slot?

Two bonus questions

  1. The paper shows how PET can be used for Sentiment Analysis (Yelp restaurant rating), Text Classification (AG's News, Yahoo) and Natural Language Inference (MNLI). Can it also be used for Question Answering, Next Sentence Prediction (see the BERT paper for definitions) and Summarization? If yes, how?
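As one concrete starting point for the NSP case: NSP is binary classification over sentence pairs, so an NLI-style pattern with a single-token yes/no verbalizer could work. The sketch below mirrors the paper's MNLI patterns, but the exact wording is our own illustration, not something the paper proposes:

```python
# A possible PVP for Next Sentence Prediction (illustrative only).

def pattern(sent_a: str, sent_b: str) -> str:
    """P(x): join the pair into a cloze question with one masked slot."""
    return f'"{sent_a}"? [MASK], "{sent_b}"'

verbalizer = {
    "is_next":  "Yes",  # sent_b actually follows sent_a
    "not_next": "No",   # sent_b is a random sentence
}
```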

  2. PET/iPET leverages knowledge from additional unsupervised data that seems to differ from what a normal MLM learns. What do you suppose PET is learning that an MLM isn't?