HNMT evaluation

Preliminary conclusions

Ensembling

Ensembling helps in the following ways; in short, use method 2 for speed or method 3 for accuracy.

  1. A proper ensemble of the last 3 savepoints (saved at 1-hour intervals) gains about 1 BLEU/chrF3 point.
  2. An averaging "ensemble" (simply averaging the model parameters, as sketched below) of the last 3 savepoints gives roughly the same result as a proper ensemble.
  3. A proper ensemble of 3 independently initialized and trained models is about 2 BLEU/chrF3 points above the baseline, i.e. 1 point above the savepoint ensembles of methods 1 and 2.
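
A minimal sketch of the two ensembling styles, assuming savepoints serialized as name-to-array dicts (a hypothetical format; hnmt's own model files and decoder interface differ):

```python
import numpy as np

def average_parameters(savepoints):
    """Method 2: build a single model by averaging the parameters of
    several savepoints element-wise. Each savepoint is assumed to be a
    dict mapping parameter names to NumPy arrays (hypothetical format)."""
    return {name: np.mean([sp[name] for sp in savepoints], axis=0)
            for name in savepoints[0]}

def ensemble_predict(step_distributions):
    """Methods 1 and 3: a proper ensemble runs all models in parallel
    during beam search and combines their per-step output distributions,
    e.g. by averaging the probabilities before expanding the beam."""
    return np.mean(step_distributions, axis=0)
```

Method 2 only requires running a single model at decoding time, which is why it is the recommended option for speed.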

Various notes

  • Word-based decoders are pretty bad in both directions. Why?
  • Layer Normalization seems to hurt.
  • Dropout seems to hurt (Austin says this is particularly true for the decoder side).
  • Attention loss at most helps a little bit, perhaps not at all.
  • A large source vocabulary helps even with a hybrid encoder (Luong & Manning found the same thing).

newstest2015-enfi

Results are sorted by chrF3, which is the measure that correlates best with human judgement for English-Finnish according to the WMT shared task on evaluation metrics.
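
For orientation, chrF3 is the character n-gram F-score with β = 3, i.e. recall weighted three times as heavily as precision (Popović, 2015). A simplified single-reference sketch, averaging precision and recall over n-gram orders 1-6 before taking the F-score (the official implementation differs in details such as whitespace handling and multiple references):

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified chrF: character n-gram F-beta score, with precision
    and recall averaged uniformly over n-gram orders 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i+n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i+n] for i in range(len(reference) - n + 1))
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    prec = sum(precisions) / max_n
    rec = sum(recalls) / max_n
    if prec + rec == 0.0:
        return 0.0
    # F-beta: beta = 3 weights recall 3x over precision
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
```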

| Configuration | BLEU | chrF3 |
|---|---|---|
| Online-B (WMT15) | ... | 49.45 |
| HNMT-run21+22+24 | 15.08 | 48.77 |
| Google Translate (2016-11-06) | 13.65 | 48.76 |
| Abumatran (unconstrained WMT15) | 16.0 | 46.89 |
| UU (unconstrained WMT15) | 14.8 | 45.82 |
| HNMT-run16+17+18 | 12.94 | 45.44 |
| Abumatran (constrained WMT15) | 13.0 | 45.26 |
| HNMT-run16-average3 | 11.61 | 44.57 |
| HNMT-run16-ensemble3 | 11.56 | 44.47 |
| HNMT-run16 | 10.71 | 43.40 |
| HNMT-run8-average-ensemble4 | 10.75 | 41.83 |
| HNMT-run8-ensemble4 | 11.21 | 41.81 |
| HNMT-run8-char-long-align1.0:0.9999 | 10.41 | 41.64 |
| HNMT-run9-char-long-align1.0:0.999 | 10.75 | 40.96 |
| HNMT-run10-char-long-align1.0:0.999 | 9.23 | 39.50 |
| HNMT-run5-char-long-ln | 10.33 | 39.42 |
| HNMT-run1-char-short-ln-svoc1k | 9.00 | 38.01 |
| HNMT-run7-word-ln | 4.85 | 23.86 |

newstest2016-enfi

| Configuration | BLEU | chrF3 |
|---|---|---|
| Abumatran (constrained WMT16) | 17.5 | 50.55 |
| HNMT-run21+22+24 | 16.07 | 50.00 |
| UH-OPUS (WMT16) | 16.97 | 49.96 |
| UH-factored (constrained WMT16) | 13.53 | 47.29 |
| HNMT-run16+17+18 | 14.03 | 46.93 |
| HNMT-run16-average3 | 12.78 | 46.05 |
| HNMT-run16-ensemble3 | 12.86 | 46.03 |
| HNMT-run16 | 11.91 | 45.02 |
| HNMT-run8-ensemble4 | 12.52 | 43.26 |
| HNMT-run8-char-long-align1.0:0.9999 | 11.55 | 42.91 |

newstest2015-fien

Results are sorted by BLEU.

| Configuration | BLEU | chrF3 |
|---|---|---|
| HNMT-run12-char-long | 12.38 | 36.70 |
| HNMT-run13-word-long | 10.47 | 31.95 |
| HNMT-run14-word-long-dropout | 9.92 | 30.90 |

Details

The HNMT variants used above are:

  • run21+22+24: ensemble of 3 independently trained models, trained on Europarl + 1M of Turku's parallel sentences (1011 version, hopefully the final one) + 1M tokens of backtranslated news.
  • run16+17+18: ensemble of 3 independently trained models, trained on Europarl + 1M of Turku's parallel sentences (0811 version, not the final one). From this run onwards, cased and untokenized corpora are used.
  • ensemble4: ensemble of the 4 latest savepoints (3-hour intervals)
  • average-ensemble4: model parameters of the 4 latest savepoints (3-hour intervals) averaged
  • char: character-based decoder
  • word: word-based decoder (no UNK replacement or anything, always 50k target vocabulary)
  • long: sentence length limit of 60 words/360 characters (longer sentences are removed entirely)
  • short: sentence length limit of 30 words/180 characters
  • ln: layer normalization is used
  • dropout: dropout factor 0.2 is used (default is no dropout)
  • svocXXX: size of the source vocabulary (always with character backoff; default is 10k)
  • alignXXX:YYY: attention loss is used, with initial weight XXX and exponential decay factor YYY per batch (see the sketch after this list)
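
As an illustration of the align schedule, the attention-loss weight decays exponentially per batch. A sketch with hypothetical names (the actual loss term in hnmt.py is not spelled out here):

```python
def attention_loss_weight(initial, decay, batch_nr):
    """Weight of the auxiliary attention (alignment) loss at a given
    batch; e.g. align1.0:0.9999 means initial=1.0, decay=0.9999."""
    return initial * decay ** batch_nr

# Roughly, the training objective mixes the two terms as:
#   loss = translation_loss + attention_loss_weight(1.0, 0.9999, t) * attention_loss
# With decay 0.9999 the weight halves about every 6900 batches
# (ln 2 / -ln 0.9999 ~ 6931), so alignment supervision fades out slowly.
```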

Other parameter values:

  • lowercase: yes on both sides (trained on lowercased + tokenized data); note that run16 and later use cased, untokenized data as described above
  • 512-dim output LSTM for character-based models, 256 for word-based
  • 256-dim input LSTM
  • 256-dim attention hidden layer
  • batch size 128 (the run16 script below uses 64)

Example SLURM script (from run16)

```bash
#!/bin/bash -l

#SBATCH -J hnmt
#SBATCH -o hnmt.stdout.%j
#SBATCH -e hnmt.stderr.%j
#SBATCH -t 72:00:00
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --mem=16384
#SBATCH --gres=gpu:1
#SBATCH --constraint=k80

module purge
module load python-env/3.4.1
module load cuda/8.0
module list

cd ${SLURM_SUBMIT_DIR:-.}
pwd

echo "Starting at `date`"

SOURCE="en"
TARGET="fi"

MODEL=/wrk/rostling/models/hnmt/run16-$SOURCE-$TARGET-70h

THEANO_FLAGS=optimizer=fast_run,device=gpu,floatX=float32 python3 \
    hnmt.py \
    --save-model "$MODEL".model \
    --log-file "$MODEL".log \
    --source /wrk/rostling/wmt16/ep-turku1m."$SOURCE" \
    --target /wrk/rostling/wmt16/ep-turku1m."$TARGET" \
    --beam-size 4 \
    --source-tokenizer word \
    --target-tokenizer char \
    --max-source-length 100 \
    --max-target-length 600 \
    --source-lowercase no \
    --target-lowercase no \
    --dropout 0 \
    --word-embedding-dims 256 \
    --char-embedding-dims 64 \
    --encoder-state-dims 256 \
    --decoder-state-dims 512 \
    --attention-dims 256 \
    --source-vocabulary 25000 \
    --min-char-count 2 \
    --batch-size 64 \
    --save-every 2500 \
    --test-every 50 \
    --translate-every 1000 \
    --training-time 71

echo "Finishing at `date`"