Questions BLEURT - ufal/NPFL095 GitHub Wiki

What function does BLEURT learn? Explain its components.
Why do traditional overlap-based metrics not always correlate well with human judgments, and how does BLEURT address these limitations?
What does the paper mean by ‘quality drift’? Why is it a problem for learned evaluation metrics?
How does the synthetic pre-training scheme try to anticipate certain errors produced by text generation systems?