The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives
- Can you explain the formula `p(y_i | X, y_{1,i-1}, θ)` mentioned in Section 2.1? Is there anything unclear? What is `θ`? What is `y_{1,i-1}`?
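  For reference, a reconstruction of the factorization this notation presumably abbreviates (the standard left-to-right chain-rule decomposition; not quoted verbatim from the paper):

  $$p(Y \mid X, \theta) = \prod_{i=1}^{|Y|} p(y_i \mid X, y_1, \ldots, y_{i-1}, \theta)$$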
- Guess how an MLM would change if it were trained using 10% / 10% / 80% (MASK / random / keep) instead of the 80% / 10% / 10% used in Section 2.3.
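  A minimal sketch of BERT-style MLM corruption as described in Section 2.3, assuming a plain token list and vocabulary (the function name and defaults are illustrative, not taken from the paper):

  ```python
  import random

  MASK = "[MASK]"

  def mask_tokens(tokens, vocab, select_prob=0.15,
                  mask_prob=0.8, random_prob=0.1):
      """BERT-style MLM corruption: pick ~15% of positions as prediction
      targets, then replace 80% of the picked tokens with [MASK],
      10% with a random vocabulary token, and keep 10% unchanged."""
      corrupted = list(tokens)
      targets = [None] * len(tokens)   # loss is computed only where not None
      for i, tok in enumerate(tokens):
          if random.random() >= select_prob:
              continue                  # position not selected for prediction
          targets[i] = tok              # model must recover the original token
          r = random.random()
          if r < mask_prob:                      # e.g. 80%: replace with [MASK]
              corrupted[i] = MASK
          elif r < mask_prob + random_prob:      # e.g. 10%: random token
              corrupted[i] = random.choice(vocab)
          # else: e.g. 10%: keep the original token unchanged
      return corrupted, targets
  ```

  The variant in the question corresponds to calling this with `mask_prob=0.1, random_prob=0.1`, so that 80% of the selected tokens stay unchanged.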
- Were you surprised by the shape of the three curves in Figure 1? Why? What were your expectations? (Note that the authors were surprised by the MLM curve, but not by the LM and MT curves.)
- Is there any difference between the magenta (src) curve in Figure 2a and the red (LM) curve in Figure 1? And is there any difference between the magenta (src) curve in Figure 2b and the yellow (MLM) curve in Figure 1? Why?
- Bonus: Let's take a toy dataset following the experiments in Figure 1, but sampling only 4 tokens and using N=4 clusters (called A, B, C, D):

  |        | token 1 | token 2 | token 3 | token 4 |
  |--------|---------|---------|---------|---------|
  | Input  | A       | B       | A       | C       |
  | Layer2 | C       | D       | C       | A       |
  | Layer3 | A       | C       | A       | C       |

  Compute MI(In, L2) and MI(In, L3), i.e. the mutual information between the input token and the Layer2/Layer3 cluster.
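  A small checker for the hand computation, assuming the empirical joint distribution over the 4 positions (the helper name is mine):

  ```python
  from collections import Counter
  from math import log2

  def mutual_information(xs, ys):
      """Empirical mutual information (in bits) between two
      equal-length sequences of cluster labels."""
      n = len(xs)
      px, py = Counter(xs), Counter(ys)
      pxy = Counter(zip(xs, ys))
      return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
                 for (x, y), c in pxy.items())

  inp    = ["A", "B", "A", "C"]
  layer2 = ["C", "D", "C", "A"]
  layer3 = ["A", "C", "A", "C"]

  print("MI(In, L2) =", mutual_information(inp, layer2))  # 1.5 bits: the mapping is one-to-one
  print("MI(In, L3) =", mutual_information(inp, layer3))  # 1.0 bits: inputs B and C collapse into one cluster
  ```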
- Bonus: What is canonical correlation analysis (CCA or PWCCA)?
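  A minimal CCA demo with scikit-learn on random data, just to show the idea of finding maximally correlated linear projections of two views (library choice and toy data are mine; the paper's layer comparisons use PWCCA, the projection-weighted variant of Morcos et al., 2018):

  ```python
  import numpy as np
  from sklearn.cross_decomposition import CCA

  # Two "views" of the same 100 examples, e.g. activations of two layers.
  rng = np.random.default_rng(0)
  X = rng.standard_normal((100, 8))
  Y = X @ rng.standard_normal((8, 6)) + 0.1 * rng.standard_normal((100, 6))

  # Find paired projections of X and Y with maximal correlation.
  cca = CCA(n_components=4)
  X_c, Y_c = cca.fit_transform(X, Y)

  # Canonical correlations: per-component correlation of the projections.
  print([round(float(np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1]), 3)
         for i in range(4)])
  ```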