Questions Transformer

  1. Is the Transformer faster than RNN-based sequence-to-sequence models? When and why is it faster (or not)? Hint: the RNN-based models are the recurrent models (such as LSTM or GRU) mentioned in the Introduction. (A sketch contrasting the two computations appears after the questions.)

  2. What does "auto-regressive" mean? Is the Transformer auto-regressive? (See the decoding sketch after the questions.)

  3. Section 3.2.3 states that "In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder." Can we therefore simplify Equation (1) from Section 3.2.1 to Attention(L) = softmax(LL^T / sqrt(d_k)) L, where L is the output of the previous layer? Why, or why not? (A runnable form of this expression appears after the questions.)

  4. BONUS: Why is the positional encoding added to the word embeddings rather than concatenated with them? (See the positional-encoding sketch after the questions.)
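
For question 1, a minimal NumPy sketch (not from the paper; all shapes, weights, and names are illustrative assumptions) contrasting the step-by-step dependency of a recurrent encoder with the position-parallel matrix computation of self-attention:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4                       # sequence length, model dimension (illustrative)
X = rng.standard_normal((n, d))   # input sequence

# Recurrent encoder: step t needs the hidden state from step t-1,
# so the n steps cannot run in parallel along the sequence.
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h = np.zeros(d)
for t in range(n):
    h = np.tanh(X[t] @ W + h @ U)

# Self-attention: all n positions are computed with a few matrix
# products, which parallelizes well (at O(n^2 * d) cost per layer).
scores = X @ X.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ X                                     # all positions at once
```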
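
For question 2, a hedged sketch of auto-regressive generation: each new token is predicted from all tokens produced so far and is then fed back as input. `next_token_logits` is a hypothetical stand-in for a trained decoder, not anything from the paper:

```python
import numpy as np

VOCAB, BOS, EOS = 10, 1, 0        # illustrative vocabulary and special tokens

def next_token_logits(prefix):
    """Hypothetical stand-in for a decoder: scores every possible next
    token given the tokens generated so far."""
    rng = np.random.default_rng(hash(tuple(prefix)) % 2**32)
    return rng.standard_normal(VOCAB)

tokens = [BOS]
for _ in range(20):
    nxt = int(np.argmax(next_token_logits(tokens)))   # greedy choice
    tokens.append(nxt)                                # output becomes input
    if nxt == EOS:
        break
print(tokens)
```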
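
For question 3, a NumPy sketch of Equation (1) and of the simplified form proposed in the question, so the two can be compared experimentally; the dimensions are illustrative assumptions, and whether the simplification is valid is left for the discussion (Section 3.2.2 of the paper is a good place to look):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, Equation (1) of the paper:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
L = rng.standard_normal((6, 8))   # previous layer's output (illustrative shape)

# The simplification from the question: plug L in directly as Q, K and V.
simplified = attention(L, L, L)   # softmax(L L^T / sqrt(d_k)) L
```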
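
For question 4, a sketch of the sinusoidal encoding from Section 3.5 being added to word embeddings; the dimensions are illustrative, and the final comment only states the shape consequence of concatenating, not the paper's answer:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encoding (Section 3.5):
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / 10000.0 ** (i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
E = rng.standard_normal((6, 8))            # word embeddings (illustrative)
X = E + positional_encoding(6, 8)          # addition keeps the shape (6, 8)
# Concatenation would give shape (6, 16) instead, changing d_model downstream.
```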