HNMT evaluation

Preliminary conclusions

Ensembling

Ensembling helps in the following ways; in short, use method 2 for speed or method 3 for accuracy.

  1. A proper ensemble of the last 3 savepoints (saved at 1-hour intervals) gains about 1 BLEU/chrF3 point.
  2. An averaging "ensemble" (simply averaging the model parameters, as sketched below) of the last 3 savepoints gives roughly the same result as a proper ensemble.
  3. A proper ensemble of 3 independently initialized and trained models is about 2 BLEU/chrF3 points above the baseline, i.e. 1 point above the savepoint ensembles of methods 1 and 2.
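
A minimal sketch of the two ensembling styles, assuming savepoints serialized as name-to-array dicts (a hypothetical format; hnmt's own model files and decoder interface differ):

```python
import numpy as np

def average_parameters(savepoints):
    """Method 2: build a single model by averaging the parameters of
    several savepoints element-wise. Each savepoint is assumed to be a
    dict mapping parameter names to NumPy arrays (hypothetical format)."""
    return {name: np.mean([sp[name] for sp in savepoints], axis=0)
            for name in savepoints[0]}

def ensemble_predict(step_distributions):
    """Methods 1 and 3: a proper ensemble runs all models in parallel
    during beam search and combines their per-step output distributions,
    e.g. by averaging the probabilities before expanding the beam."""
    return np.mean(step_distributions, axis=0)
```

Method 2 only requires running a single model at decoding time, which is why it is the recommended option for speed.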

Various notes

  • Word-based decoders are pretty bad in both directions. Why?
  • Layer Normalization seems to hurt.
  • Dropout seems to hurt (Austin says this is particularly true for the decoder side).
  • Attention loss at most helps a little bit, perhaps not at all.
  • A large source vocabulary helps even with a hybrid encoder (Luong & Manning found the same thing).

newstest2015-enfi

Results are sorted by chrF3, which is the measure that correlates best with human judgement for English-Finnish according to the WMT shared task on evaluation metrics.
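
For orientation, chrF3 is the character n-gram F-score with β = 3, i.e. recall weighted three times as heavily as precision (Popović, 2015). A simplified single-reference sketch, averaging precision and recall over n-gram orders 1-6 before taking the F-score (the official implementation differs in details such as whitespace handling and multiple references):

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified chrF: character n-gram F-beta score, with precision
    and recall averaged uniformly over n-gram orders 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i+n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i+n] for i in range(len(reference) - n + 1))
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    prec = sum(precisions) / max_n
    rec = sum(recalls) / max_n
    if prec + rec == 0.0:
        return 0.0
    # F-beta: beta = 3 weights recall 3x over precision
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
```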

| Configuration | BLEU | chrF3 |
|---|---|---|
| Online-B (WMT15) | ... | 49.45 |
| HNMT-run21+22+24 | 15.08 | 48.77 |
| Google Translate (2016-11-06) | 13.65 | 48.76 |
| Abumatran (unconstrained WMT15) | 16.0 | 46.89 |
| UU (unconstrained WMT15) | 14.8 | 45.82 |
| HNMT-run16+17+18 | 12.94 | 45.44 |
| Abumatran (constrained WMT15) | 13.0 | 45.26 |
| HNMT-run16-average3 | 11.61 | 44.57 |
| HNMT-run16-ensemble3 | 11.56 | 44.47 |
| HNMT-run16 | 10.71 | 43.40 |
| HNMT-run8-average-ensemble4 | 10.75 | 41.83 |
| HNMT-run8-ensemble4 | 11.21 | 41.81 |
| HNMT-run8-char-long-align1.0:0.9999 | 10.41 | 41.64 |
| HNMT-run9-char-long-align1.0:0.999 | 10.75 | 40.96 |
| HNMT-run10-char-long-align1.0:0.999 | 9.23 | 39.50 |
| HNMT-run5-char-long-ln | 10.33 | 39.42 |
| HNMT-run1-char-short-ln-svoc1k | 9.00 | 38.01 |
| HNMT-run7-word-ln | 4.85 | 23.86 |

newstest2016-enfi

| Configuration | BLEU | chrF3 |
|---|---|---|
| Abumatran (constrained WMT16) | 17.5 | 50.55 |
| HNMT-run21+22+24 | 16.07 | 50.00 |
| UH-OPUS (WMT16) | 16.97 | 49.96 |
| UH-factored (constrained WMT16) | 13.53 | 47.29 |
| HNMT-run16+17+18 | 14.03 | 46.93 |
| HNMT-run16-average3 | 12.78 | 46.05 |
| HNMT-run16-ensemble3 | 12.86 | 46.03 |
| HNMT-run16 | 11.91 | 45.02 |
| HNMT-run8-ensemble4 | 12.52 | 43.26 |
| HNMT-run8-char-long-align1.0:0.9999 | 11.55 | 42.91 |

newstest2015-fien

Results are sorted by BLEU.

| Configuration | BLEU | chrF3 |
|---|---|---|
| HNMT-run12-char-long | 12.38 | 36.70 |
| HNMT-run13-word-long | 10.47 | 31.95 |
| HNMT-run14-word-long-dropout | 9.92 | 30.90 |

Details

The HNMT variants used above are:

  • run21+22+24: ensemble of 3 independently trained models, trained on Europarl + 1M of Turku's parallel sentences (1011 version, hopefully the final one) + 1M tokens of backtranslated news.
  • run16+17+18: ensemble of 3 independently trained models, trained on Europarl + 1M of Turku's parallel sentences (0811 version, not the final one). From this run onwards, cased and untokenized corpora are used.
  • ensemble4: ensemble of the 4 latest savepoints (3-hour intervals)
  • average-ensemble4: model parameters of the 4 latest savepoints (3-hour intervals) averaged
  • char: character-based decoder
  • word: word-based decoder (no UNK replacement or anything, always 50k target vocabulary)
  • long: sentence length limit of 60 words/360 characters (longer sentences are removed entirely)
  • short: sentence length limit of 30 words/180 characters
  • ln: layer normalization is used
  • dropout: dropout factor 0.2 is used (default is no dropout)
  • svocXXX: size of the source vocabulary (always with character backoff; default is 10k)
  • alignXXX:YYY: attention loss is used, with initial weight XXX and exponential decay factor YYY per batch (see the sketch after this list)
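
As an illustration of the align schedule, the attention-loss weight decays exponentially per batch. A sketch with hypothetical names (the actual loss term in hnmt.py is not spelled out here):

```python
def attention_loss_weight(initial, decay, batch_nr):
    """Weight of the auxiliary attention (alignment) loss at a given
    batch; e.g. align1.0:0.9999 means initial=1.0, decay=0.9999."""
    return initial * decay ** batch_nr

# Roughly, the training objective mixes the two terms as:
#   loss = translation_loss + attention_loss_weight(1.0, 0.9999, t) * attention_loss
# With decay 0.9999 the weight halves about every 6900 batches
# (ln 2 / -ln 0.9999 ~ 6931), so alignment supervision fades out slowly.
```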

Other parameter values:

  • lowercase: yes on both sides (trained on lowercased + tokenized data); note that run16 and later use cased, untokenized data as described above
  • 512-dim output LSTM for character-based models, 256 for word-based
  • 256-dim input LSTM
  • 256-dim attention hidden layer
  • batch size 128 (the run16 script below uses 64)

Example SLURM script (from run16)

```bash
#!/bin/bash -l

#SBATCH -J hnmt
#SBATCH -o hnmt.stdout.%j
#SBATCH -e hnmt.stderr.%j
#SBATCH -t 72:00:00
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --mem=16384
#SBATCH --gres=gpu:1
#SBATCH --constraint=k80

module purge
module load python-env/3.4.1
module load cuda/8.0
module list

cd ${SLURM_SUBMIT_DIR:-.}
pwd

echo "Starting at `date`"

SOURCE="en"
TARGET="fi"

MODEL=/wrk/rostling/models/hnmt/run16-$SOURCE-$TARGET-70h

THEANO_FLAGS=optimizer=fast_run,device=gpu,floatX=float32 python3 \
    hnmt.py \
    --save-model "$MODEL".model \
    --log-file "$MODEL".log \
    --source /wrk/rostling/wmt16/ep-turku1m."$SOURCE" \
    --target /wrk/rostling/wmt16/ep-turku1m."$TARGET" \
    --beam-size 4 \
    --source-tokenizer word \
    --target-tokenizer char \
    --max-source-length 100 \
    --max-target-length 600 \
    --source-lowercase no \
    --target-lowercase no \
    --dropout 0 \
    --word-embedding-dims 256 \
    --char-embedding-dims 64 \
    --encoder-state-dims 256 \
    --decoder-state-dims 512 \
    --attention-dims 256 \
    --source-vocabulary 25000 \
    --min-char-count 2 \
    --batch-size 64 \
    --save-every 2500 \
    --test-every 50 \
    --translate-every 1000 \
    --training-time 71

echo "Finishing at `date`"