Questions BLEU - ufal/NPFL095 GitHub Wiki

You can skip sections 4 and 5 in the paper.

Section 2.1.1 defines p_n as a fraction where the denominator is "the number of candidate n-grams in the test corpus". Compute this denominator for p_3 and a test corpus of three sentences with lengths 3, 4 and 5.
Do we need source-language sentences for computing BLEU?
Let's have a corpus with two sentences: (1) Die Katze ist auf der Matte (2) Lesegruppe ist meine Lieblingsklasse

Reference translation 1: (1) The cat is on the mat (2) Reading group is my favourite class

Reference translation 2: (1) There is a cat on the mat (2) I love RG

Machine translation: (1) cat is cat (2) Reading group is my nightmare

Compute BLEU and BP (for the whole corpus) of the machine translation compared to the two references. Use the standard BLEU definition, i.e. case insensitive, N=4, w_n=1/4, log(x) is the natural logarithm (ln(x)).
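
To check a hand computation, the corpus-level definition can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the names `ngrams` and `corpus_bleu` are hypothetical, tokenisation is assumed to be plain whitespace splitting after lowercasing, and the effective reference length r is assumed to be the sum of the closest reference lengths per sentence (ties broken towards the shorter reference, which the paper does not specify).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform weights w_n = 1/max_n.

    candidates: list of candidate sentences (strings).
    references: list of lists; references[i] holds the reference
                strings for candidates[i].
    Returns (bleu, bp, precisions).
    """
    matched = [0] * max_n   # clipped n-gram matches, summed over the corpus
    total = [0] * max_n     # candidate n-grams, summed over the corpus
    c = r = 0               # candidate length, effective reference length

    for cand, refs in zip(candidates, references):
        cand_tok = cand.lower().split()
        refs_tok = [ref.lower().split() for ref in refs]
        c += len(cand_tok)
        # Assumption: best match length = closest reference length,
        # ties broken towards the shorter reference.
        r += min((len(ref) for ref in refs_tok),
                 key=lambda length: (abs(length - len(cand_tok)), length))
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand_tok, n)
            max_ref = Counter()
            for ref in refs_tok:
                max_ref |= ngrams(ref, n)   # element-wise maximum over references
            # Counter intersection takes the minimum, i.e. clips the counts.
            matched[n - 1] += sum((cand_counts & max_ref).values())
            total[n - 1] += sum(cand_counts.values())

    precisions = [m / t if t else 0.0 for m, t in zip(matched, total)]
    if c == 0:
        return 0.0, 0.0, precisions
    bp = 1.0 if c > r else math.exp(1 - r / c)
    if min(precisions) == 0:
        return 0.0, bp, precisions   # log(0) is undefined; BLEU is taken as 0
    bleu = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bleu, bp, precisions


candidates = ["cat is cat", "Reading group is my nightmare"]
references = [
    ["The cat is on the mat", "There is a cat on the mat"],
    ["Reading group is my favourite class", "I love RG"],
]
bleu, bp, precisions = corpus_bleu(candidates, references)
print(bleu, bp, precisions)
```

Note that the n-gram counts and lengths are summed over the whole corpus before the precisions and BP are formed, so the result is generally not the average of per-sentence scores.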

We computed a BLEU score for a given test set with three reference translations. Then a new reference translation became available, so we computed a new BLEU score for the same test set with four references (three old, one new). Can the new BLEU score be lower than the old score? Can it be higher? Why?
Can you think of any problems with the BLEU metric (for Czech or any other language)? Name them.
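
For the first question, it may help to see how the p_n denominator depends only on sentence lengths: a sentence of L tokens contains L - n + 1 n-grams when L >= n, and none otherwise. The sketch below uses a hypothetical helper name, `candidate_ngram_count`, and hypothetical sentence lengths chosen to differ from the exercise.

```python
def candidate_ngram_count(sentence_lengths, n):
    """Total number of n-grams in a corpus, given only its sentence lengths.

    N-grams never cross sentence boundaries, so each sentence is counted
    separately and sentences shorter than n contribute nothing.
    """
    return sum(max(0, length - n + 1) for length in sentence_lengths)


# Hypothetical lengths (not the ones from the exercise): 0 + 4 + 8 trigrams.
print(candidate_ngram_count([2, 6, 10], 3))  # prints 12
```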