Questions BERTnesia
1. Why are all the graphs in Figure 2 (up to small deviations) non-decreasing?
2. The "Training" part of subsection 3.3 starts with the statement "We train a new decoding head for each layer the same way as standard pre-training using MLM." In total, how many decoding heads had to be trained to obtain the results of the paper?
3. Suppose for simplicity that we investigate a 4-layer version of BERT in the way described in the paper. The goal is to fill in the [MASK] position in questions of the form "The capital of X is [MASK]." For each of the questions below, we probe each layer and list its 10 most probable predicted answers (starting with the most probable one). These are the results:
Q: "The capital of Slovakia is [MASK]."
Layer 1: Moscow, Rome, Beijing, Bratislava, Canberra, Istanbul, Prague, Ottawa, Berlin, London
Layer 2: Paris, Bratislava, Beirut, Moscow, Warsaw, Prague, Washington, Istanbul, Rome, Canberra
Layer 3: Bratislava, Paris, Washington, Prague, Canberra, Beirut, Beijing, Rome, Istanbul, Warsaw
Layer 4: Bratislava, Istanbul, Vienna, Ottawa, Rome, Beijing, Washington, Moscow, Canberra, Beirut
Q: "The capital of Austria is [MASK]."
Layer 1: Vienna, Istanbul, Ottawa, Berlin, Beijing, Canberra, London, Paris, Bratislava, Rome
Layer 2: Canberra, Ottawa, Rome, Paris, Prague, Berlin, Warsaw, Moscow, London, Vienna
Layer 3: Canberra, Ottawa, Beijing, Vienna, Warsaw, Moscow, Istanbul, Washington, Beirut, Paris
Layer 4: Canberra, Warsaw, Beijing, Bratislava, Beirut, Istanbul, Berlin, Paris, London, Ottawa
Q: "The capital of Poland is [MASK]."
Layer 1: Rome, Moscow, Beirut, Istanbul, Ottawa, Vienna, Paris, London, Prague, Washington
Layer 2: Paris, Vienna, Berlin, Ottawa, Istanbul, Moscow, Washington, Beirut, Beijing, Canberra
Layer 3: Canberra, London, Washington, Warsaw, Istanbul, Beirut, Moscow, Paris, Ottawa, Bratislava
Layer 4: Warsaw, Moscow, Beirut, Prague, Paris, Ottawa, Washington, Beijing, Berlin, Bratislava
Compute the following (a small Python sketch for checking your results follows this list):
- P^l@1 for each layer l
- 𝓟@1
- P^4@10
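If you want to sanity-check your by-hand computations, here is a minimal Python sketch. It assumes one reading of the paper's metrics: P^l@k as the fraction of queries whose correct answer appears among the layer-l probe's top-k predictions, and 𝓟@k as counting a query as answered if at least one layer ranks the correct answer in its top k; double-check these definitions against Section 3 of the paper. All variable and function names below are illustrative, not taken from the paper's code.

```python
# Sanity-check sketch for question 3 (assumed metric definitions, see above).

# Gold answers for the three probe questions.
gold = {
    "Slovakia": "Bratislava",
    "Austria": "Vienna",
    "Poland": "Warsaw",
}

# Top-10 predictions per layer, copied from the lists above.
predictions = {
    "Slovakia": {
        1: ["Moscow", "Rome", "Beijing", "Bratislava", "Canberra",
            "Istanbul", "Prague", "Ottawa", "Berlin", "London"],
        2: ["Paris", "Bratislava", "Beirut", "Moscow", "Warsaw",
            "Prague", "Washington", "Istanbul", "Rome", "Canberra"],
        3: ["Bratislava", "Paris", "Washington", "Prague", "Canberra",
            "Beirut", "Beijing", "Rome", "Istanbul", "Warsaw"],
        4: ["Bratislava", "Istanbul", "Vienna", "Ottawa", "Rome",
            "Beijing", "Washington", "Moscow", "Canberra", "Beirut"],
    },
    "Austria": {
        1: ["Vienna", "Istanbul", "Ottawa", "Berlin", "Beijing",
            "Canberra", "London", "Paris", "Bratislava", "Rome"],
        2: ["Canberra", "Ottawa", "Rome", "Paris", "Prague",
            "Berlin", "Warsaw", "Moscow", "London", "Vienna"],
        3: ["Canberra", "Ottawa", "Beijing", "Vienna", "Warsaw",
            "Moscow", "Istanbul", "Washington", "Beirut", "Paris"],
        4: ["Canberra", "Warsaw", "Beijing", "Bratislava", "Beirut",
            "Istanbul", "Berlin", "Paris", "London", "Ottawa"],
    },
    "Poland": {
        1: ["Rome", "Moscow", "Beirut", "Istanbul", "Ottawa",
            "Vienna", "Paris", "London", "Prague", "Washington"],
        2: ["Paris", "Vienna", "Berlin", "Ottawa", "Istanbul",
            "Moscow", "Washington", "Beirut", "Beijing", "Canberra"],
        3: ["Canberra", "London", "Washington", "Warsaw", "Istanbul",
            "Beirut", "Moscow", "Paris", "Ottawa", "Bratislava"],
        4: ["Warsaw", "Moscow", "Beirut", "Prague", "Paris",
            "Ottawa", "Washington", "Beijing", "Berlin", "Bratislava"],
    },
}

LAYERS = [1, 2, 3, 4]


def p_layer_at_k(layer: int, k: int) -> float:
    """P^l@k: fraction of queries whose gold answer is in the top k at this layer."""
    hits = sum(gold[q] in predictions[q][layer][:k] for q in gold)
    return hits / len(gold)


def p_any_layer_at_k(k: int) -> float:
    """Assumed 𝓟@k: fraction of queries answered in the top k by at least one layer."""
    hits = sum(
        any(gold[q] in predictions[q][layer][:k] for layer in LAYERS) for q in gold
    )
    return hits / len(gold)


for layer in LAYERS:
    print(f"P^{layer}@1 = {p_layer_at_k(layer, 1):.3f}")
print(f"𝓟@1   = {p_any_layer_at_k(1):.3f}")
print(f"P^4@10 = {p_layer_at_k(4, 10):.3f}")
```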
4 (optional). What are your ideas about why the NER-CoNLL model shows the largest loss of BERT's knowledge out of all the considered models?