What are the "authentic" and "synthetic" data in the paper? How do we obtain them? Which of these two is better for training NMT? Why?
What are the differences between model ensembling (as used, e.g., in the Cloze paper) and checkpoint averaging (as used in this paper)? Which is better and why?
Both Figures 3 and 6 show that CUBBITT produces (relatively) high-quality translations. Which of the two figures do you find more convincing regarding translation quality, and why?
Write at least three questions about the paper (e.g. about what is not clear to you).