HNMT evaluation - Helsinki-NLP/hnmt GitHub Wiki
Ensembling helps, as detailed below. In short: use parameter averaging (the second point) for speed, or an ensemble of independently trained models (the third point) for accuracy.
- A proper ensemble of the last 3 savepoints (1 hour interval) gives about 1 BLEU/chrF3 point.
- An averaging "ensemble" (just average model parameters) of the last 3 savepoints gives roughly the same result as a proper ensemble.
- A proper ensemble of 3 independently initialized and trained models is about 2 BLEU/chrF3 points above the baseline, and about 1 point above the savepoint ensembles above.
- Word-based decoders are pretty bad in both directions. Why?
- Layer Normalization seems to hurt.
- Dropout seems to hurt (Austin says this is particularly true for the decoder side).
- Attention loss at most helps a little bit, perhaps not at all.
- A large source vocabulary helps even with a hybrid encoder (Luong & Manning found the same thing).
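The parameter-averaging "ensemble" mentioned above can be sketched as follows. This is a minimal illustration with NumPy arrays standing in for model parameters; it does not assume hnmt's actual savepoint format:

```python
import numpy as np

def average_savepoints(savepoints):
    """Average each named parameter array across a list of savepoints.

    `savepoints` is a list of dicts mapping parameter names to arrays;
    the result is a single dict usable as one model's parameters.
    """
    names = savepoints[0].keys()
    return {name: sum(sp[name] for sp in savepoints) / len(savepoints)
            for name in names}

# Three toy "savepoints", each with a single 2-element weight vector:
sps = [{"W": np.array([1.0, 2.0])},
       {"W": np.array([3.0, 4.0])},
       {"W": np.array([5.0, 6.0])}]
avg = average_savepoints(sps)  # {"W": array([3.0, 4.0])}
```

Unlike a proper ensemble, the averaged model is decoded like a single model, so translation costs no more than with one savepoint.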
Results are sorted by chrF3, which is the measure that correlates best with human judgement for English-Finnish according to the WMT shared task on evaluation metrics.
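For reference, chrF3 is a character n-gram F-score with recall weighted 3× over precision. The sketch below is a simplified illustration (whitespace removal and uniform averaging over n-gram orders are assumptions; use an official implementation for scores comparable to the tables):

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of `text` with whitespace removed."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Character n-gram F-beta score, averaged over orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        b2 = beta * beta  # beta = 3 weights recall 3x over precision
        scores.append((1 + b2) * prec * rec / (b2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0
```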
Configuration | BLEU | chrF3 |
---|---|---|
Online-B (WMT15) | ... | 49.45 |
HNMT-run21+22+24 | 15.08 | 48.77 |
Google Translate (2016-11-06) | 13.65 | 48.76 |
Abumatran (unconstrained WMT15) | 16.0 | 46.89 |
UU (unconstrained WMT15) | 14.8 | 45.82 |
HNMT-run16+17+18 | 12.94 | 45.44 |
Abumatran (constrained WMT15) | 13.0 | 45.26 |
HNMT-run16-average3 | 11.61 | 44.57 |
HNMT-run16-ensemble3 | 11.56 | 44.47 |
HNMT-run16 | 10.71 | 43.40 |
HNMT-run8-average-ensemble4 | 10.75 | 41.83 |
HNMT-run8-ensemble4 | 11.21 | 41.81 |
HNMT-run8-char-long-align1.0:0.9999 | 10.41 | 41.64 |
HNMT-run9-char-long-align1.0:0.999 | 10.75 | 40.96 |
HNMT-run10-char-long-align1.0:0.999 | 9.23 | 39.50 |
HNMT-run5-char-long-ln | 10.33 | 39.42 |
HNMT-run1-char-short-ln-svoc1k | 9.00 | 38.01 |
HNMT-run7-word-ln | 4.85 | 23.86 |
Configuration | BLEU | chrF3 |
---|---|---|
Abumatran (constrained WMT16) | 17.5 | 50.55 |
HNMT-run21+22+24 | 16.07 | 50.00 |
UH-OPUS (WMT16) | 16.97 | 49.96 |
UH-factored (constrained WMT16) | 13.53 | 47.29 |
HNMT-run16+17+18 | 14.03 | 46.93 |
HNMT-run16-average3 | 12.78 | 46.05 |
HNMT-run16-ensemble3 | 12.86 | 46.03 |
HNMT-run16 | 11.91 | 45.02 |
HNMT-run8-ensemble4 | 12.52 | 43.26 |
HNMT-run8-char-long-align1.0:0.9999 | 11.55 | 42.91 |
Results are sorted by BLEU.
Configuration | BLEU | chrF3 |
---|---|---|
HNMT-run12-char-long | 12.38 | 36.70 |
HNMT-run13-word-long | 10.47 | 31.95 |
HNMT-run14-word-long-dropout | 9.92 | 30.90 |
The HNMT variants used above are:
- run21+22+24: ensemble of 3 independently trained models, using Europarl + 1M of Turku's parallel sentences (1011 version, hopefully the final one) + 1M tokens of backtranslated news.
- run16+17+18: ensemble of 3 independently trained models, using Europarl + 1M of Turku's parallel sentences (0811 version, not the final one). From this point on, cased and untokenized corpora are used.
- ensemble4: 4 latest savepoints (3 hour intervals) ensembled
- average-ensemble4: 4 latest savepoints (3 hour intervals) model parameters averaged
- char: character-based decoder
- word: word-based decoder (no UNK replacement or anything, always 50k target vocabulary)
- long: sentence length limit of 60 words/360 chars (longer sentences are removed entirely)
- short: 30 words/180 chars
- ln: layer normalization is used
- dropout: dropout factor 0.2 is used (default is no dropout)
- svocXXX: size of source vocabulary (always with character backoff, default is 10k)
- alignXXX:YYY: attention loss is used, initial value XXX with exponential decay factor YYY per batch
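The alignXXX:YYY schedule decays the attention-loss weight exponentially per batch, i.e. the weight after `batch` updates is XXX · YYY^batch (function name below is hypothetical, for illustration only):

```python
def attention_loss_weight(initial, decay, batch):
    """Weight of the attention (alignment) loss after `batch` updates."""
    return initial * decay ** batch

# align1.0:0.999 after 1000 batches: 1.0 * 0.999**1000, roughly 0.37,
# so the attention loss has mostly faded out by then; with decay 0.9999
# it persists roughly ten times longer.
w = attention_loss_weight(1.0, 0.999, 1000)
```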
Other parameter values:
- lowercase yes (on both sides, trained on lowercased + tokenized data)
- 512 dim output LSTM for character-based models, 256 for word based
- 256 dim input LSTM
- 256 dim attention hidden layer
- batch size 128
The SLURM batch script used for training (run16):

```bash
#!/bin/bash -l
#SBATCH -J hnmt
#SBATCH -o hnmt.stdout.%j
#SBATCH -e hnmt.stderr.%j
#SBATCH -t 72:00:00
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --mem=16384
#SBATCH --gres=gpu:1
#SBATCH --constraint=k80

module purge
module load python-env/3.4.1
module load cuda/8.0
module list

cd ${SLURM_SUBMIT_DIR:-.}
pwd
echo "Starting at $(date)"

SOURCE="en"
TARGET="fi"
MODEL=/wrk/rostling/models/hnmt/run16-$SOURCE-$TARGET-70h

THEANO_FLAGS=optimizer=fast_run,device=gpu,floatX=float32 python3 \
    hnmt.py \
    --save-model "$MODEL".model \
    --log-file "$MODEL".log \
    --source /wrk/rostling/wmt16/ep-turku1m."$SOURCE" \
    --target /wrk/rostling/wmt16/ep-turku1m."$TARGET" \
    --beam-size 4 \
    --source-tokenizer word \
    --target-tokenizer char \
    --max-source-length 100 \
    --max-target-length 600 \
    --source-lowercase no \
    --target-lowercase no \
    --dropout 0 \
    --word-embedding-dims 256 \
    --char-embedding-dims 64 \
    --encoder-state-dims 256 \
    --decoder-state-dims 512 \
    --attention-dims 256 \
    --source-vocabulary 25000 \
    --min-char-count 2 \
    --batch-size 64 \
    --save-every 2500 \
    --test-every 50 \
    --translate-every 1000 \
    --training-time 71

echo "Finishing at $(date)"
```