TrainingEuroparl - SamuelLarkin/LizzyConversion GitHub Wiki

Up: PortageII Previous: AnnotatedBibliography Next: FrequentlyAskedQuestions


'''Warning''': this page is out of date, as the statistics were gathered years ago for Portage 1.4.2. Since then, machines are much faster, but PORTAGE shared is more complex. These statistics are still a potentially useful indication, but not necessarily reliable.

##Training a mid-size system (what kind of machine do I need) You can train a mid-size system using the corpora available from http://www.statmt.org/wmt09/translation-task.html. More precisely, download the following:

Rename your file appropriately and place them under corpora/ in the framework. Here are the size of each file.

#Lines      #Words        #Char         shortest line   longest line   mean      sdev         filename
L:2000      W:53088       C:317356      s:1             l:107          m:26.54   sdev:15.13   dev2006_en.raw
L:2000      W:55088       C:357257      s:1             l:114          m:27.54   sdev:15.82   dev2006_fr.raw
L:2000      W:52388       C:312927      s:1             l:119          m:26.19   sdev:15.07   devtest2006_en.raw
L:2000      W:54149       C:352709      s:1             l:113          m:27.07   sdev:15.45   devtest2006_fr.raw
L:2000      W:53627       C:320432      s:1             l:119          m:26.81   sdev:15.49   test2006_en.raw
L:2000      W:55764       C:361702      s:1             l:116          m:27.88   sdev:16.05   test2006_fr.raw
L:2000      W:53535       C:318972      s:1             l:112          m:26.77   sdev:15.38   test2007_en.raw
L:2000      W:55157       C:358557      s:1             l:114          m:27.58   sdev:15.73   test2007_fr.raw
L:2000      W:54301       C:324417      s:1             l:125          m:27.15   sdev:15.77   test2008_en.raw
L:2000      W:56217       C:365342      s:1             l:129          m:28.11   sdev:16.66   test2008_fr.raw
L:1658841   W:40624075    C:243409294   s:1             l:668          m:24.49   sdev:14.80   europarl-v4_en.raw.gz
L:1428799   W:36198774    C:216718721   s:0             l:407          m:25.34   sdev:15.25   europarl-v4.fr-en_en.raw.gz
L:1428799   W:37649967    C:249125357   s:0             l:465          m:26.35   sdev:15.96   europarl-v4.fr-en_fr.raw.gz
L:1676435   W:42618754    C:281943473   s:1             l:465          m:25.42   sdev:15.45   europarl-v4_fr.raw.gz

L:6212874   W:157634884   C:994586516   s:0             l:668          m:25.37   sdev:15.37   TOTAL

Without rescoring and Lexicalized Distortion Model

Changes to your Makefile.params

Configure you Makefile.params located at the root of the framework as follow:

SRC_LANG ?= en
TGT_LANG ?= fr

TRAIN_LM      ?= europarl-v4
TRAIN_TM      ?= europarl-v4.fr-en
TUNE_DECODE   ?= dev2006
TUNE_RESCORE  ?= devtest2006
TUNE_CE       ?= test2006
TEST_SET      ?= test2008

LM_TOOLKIT = SRI
#DO_RESCORING = 1
DO_CE = 1
DO_TRUECASING = 1
ICU = 1
#USE_LDM =

BLEU score

You should obtain a BLEU similar to this:

Human readable value: 35.28 +/- 0.96

Resources Required

You can find out how much resources that was required by running the following command from the root of the framework:

make summary

Portage 1.4.2, NRC-CNRC, (c) 2004 - 2010, Her Majesty in Right of Canada
Please run "portage_info -notice" for Copyright notices of 3rd party libraries.

Running in cluster mode.

     log.europarl-v4_fr-kn-5g.lm:TIME-MEM                  WALL TIME: 4m34s       CPU TIME: 4m2s          VSZ: 3.508G    RSS: 3.457G
     log.europarl-v4_en-kn-5g.lm:TIME-MEM                  WALL TIME: 6m19s       CPU TIME: 5m53s         VSZ: 3.362G    RSS: 3.312G
     log.europarl-v4_en-kn-5g.binlm:TIME-MEM               WALL TIME: 6m47s       CPU TIME: 5m38s         VSZ: 0.364G    RSS: 0.311G
     log.europarl-v4_fr-kn-5g.binlm:TIME-MEM               WALL TIME: 4m27s       CPU TIME: 4m24s         VSZ: 0.404G    RSS: 0.351G
  lm:TIME-MEM                                              WALL TIME: 22m7s       CPU TIME: 19m57s        VSZ: 3.508G    RSS: 3.457G
     log.jpt.ibm2.europarl-v4.fr-en.en-fr:TIME-MEM         WALL TIME: 9m31s       CPU TIME: 37m39s        VSZ: 1.827G    RSS: 1.580G
     log.jpt.hmm3.europarl-v4.fr-en.en-fr:TIME-MEM         WALL TIME: 34m1s       CPU TIME: 1h58m0s       VSZ: 5.659G    RSS: 4.951G
     log.hmm3.europarl-v4.fr-en.en_given_fr:TIME-MEM       WALL TIME: 2h10m47s    CPU TIME: 12h30m10s     VSZ: 0.507G    RSS: 0.271G
     log.cpt.hmm3-rf-zn.europarl-v4.fr-en.en2fr:TIME-MEM   WALL TIME: 2h26m57s    CPU TIME: 2h35m9s       VSZ: 10.141G   RSS: 8.819G
     log.ibm1.europarl-v4.fr-en.en_given_fr:TIME-MEM       WALL TIME: 25m3s       CPU TIME: 1h35m26s      VSZ: 1.048G    RSS: 0.812G
     log.ibm1.europarl-v4.fr-en.fr_given_en:TIME-MEM       WALL TIME: 22m24s      CPU TIME: 1h31m30s      VSZ: 1.056G    RSS: 0.812G
     log.hmm3.europarl-v4.fr-en.fr_given_en:TIME-MEM       WALL TIME: 2h33m31s    CPU TIME: 13h59m34s     VSZ: 0.501G    RSS: 0.265G
     log.ibm2.europarl-v4.fr-en.fr_given_en:TIME-MEM       WALL TIME: 14m32s      CPU TIME: 1h8m19s       VSZ: 0.496G    RSS: 0.261G
     log.ibm2.europarl-v4.fr-en.en_given_fr:TIME-MEM       WALL TIME: 13m35s      CPU TIME: 1h8m15s       VSZ: 0.564G    RSS: 0.266G
     log.cpt.ibm2-rf-zn.europarl-v4.fr-en.en2fr:TIME-MEM   WALL TIME: 16m47s      CPU TIME: 19m25s        VSZ: 2.901G    RSS: 2.531G
  tm:TIME-MEM                                              WALL TIME: 9h27m8s     CPU TIME: 1d13h23m25s   VSZ: 10.141G   RSS: 8.819G
     log.europarl-v4_fr-kn-3g.binlm:TIME-MEM               WALL TIME: 1m7s        CPU TIME: 1m3s          VSZ: 0.175G    RSS: 0.121G
     log.europarl-v4_fr.map:TIME-MEM                       WALL TIME: 55s         CPU TIME: 40s           VSZ: 0.086G    RSS: 0.043G
     log.europarl-v4_fr-kn-3g.lm:TIME-MEM                  WALL TIME: 1m31s       CPU TIME: 1m42s         VSZ: 0.746G    RSS: 0.698G
  tc:TIME-MEM                                              WALL TIME: 3m33s       CPU TIME: 3m26s         VSZ: 0.746G    RSS: 0.698G
     log.canoe.ini.cow:TIME-MEM                            WALL TIME: 4h26m30s    CPU TIME: 12h44m46s     VSZ: 1.659G    RSS: 1.547G
  decode:TIME-MEM                                          WALL TIME: 4h26m30s    CPU TIME: 12h44m46s     VSZ: 1.659G    RSS: 1.547G
     log.ce_model.cem:TIME-MEM                             WALL TIME: 12m28s      CPU TIME: 2h59m13s      VSZ: 1.031G    RSS: 0.758G
  confidence:TIME-MEM                                      WALL TIME: 12m28s      CPU TIME: 2h59m13s      VSZ: 1.031G    RSS: 0.758G
models:TIME-MEM                                            WALL TIME: 14h31m46s   CPU TIME: 2d5h30m46s    VSZ: 10.141G   RSS: 8.819G
  ./log.test2008.ce:TIME-MEM                               WALL TIME: 17m55s      CPU TIME: 3h17m38s      VSZ: 1.369G    RSS: 1.064G
  ./log.test2008.out:TIME-MEM                              WALL TIME: 17m4s       CPU TIME: 3h1m56s       VSZ: 1.287G    RSS: 1.195G
  ./log.test2008.out.tc:TIME-MEM                           WALL TIME: 27s         CPU TIME: 4s            VSZ: 0.047G    RSS: 0.009G
translate:TIME-MEM                                         WALL TIME: 35m26s      CPU TIME: 6h19m39s      VSZ: 1.369G    RSS: 1.195G

686M    corpora
96K     models/confidence/ce_model.cem
40K     models/ldm
97M     models/lm/europarl-v4_en-kn-5g.binlm.gz
112M    models/lm/europarl-v4_en-kn-5g.lm.gz
108M    models/lm/europarl-v4_fr-kn-5g.binlm.gz
123M    models/lm/europarl-v4_fr-kn-5g.lm.gz
108M    models/tm/ibm1.europarl-v4.fr-en.en_given_fr.gz
106M    models/tm/ibm1.europarl-v4.fr-en.fr_given_en.gz
85M     models/tm/ibm2.europarl-v4.fr-en.en_given_fr.gz
176K    models/tm/ibm2.europarl-v4.fr-en.en_given_fr.pos.gz
88M     models/tm/ibm2.europarl-v4.fr-en.fr_given_en.gz
176K    models/tm/ibm2.europarl-v4.fr-en.fr_given_en.pos.gz
4.0K    models/tm/hmm3.europarl-v4.fr-en.en_given_fr.dist.gz
50M     models/tm/hmm3.europarl-v4.fr-en.en_given_fr.gz
4.0K    models/tm/hmm3.europarl-v4.fr-en.fr_given_en.dist.gz
52M     models/tm/hmm3.europarl-v4.fr-en.fr_given_en.gz
1.1G    models/tm/jpt.hmm3.europarl-v4.fr-en.en-fr.gz
264M    models/tm/jpt.ibm2.europarl-v4.fr-en.en-fr.gz
3.6G    models/tm/cpt.hmm3-rf-zn.europarl-v4.fr-en.en2fr.gz
676M    models/tm/cpt.ibm2-rf-zn.europarl-v4.fr-en.en2fr.gz
101M    models/tc
2.8M    translate
7.3G    total

From the previous table, we see that, to train a system using Europarl's data we need 10.141G or more of memory. It took 14 hours 31 minutes and 46 seconds on our multicore machine. It would have taken close to 2 days 5 hours 30 minutes and 46 seconds to train the same system if we only had one core. This mid-size system requires at least 7.3G of disk space.

With rescoring and Lexicalized Distortion Model

Changes to your Makefile.params

Configure you Makefile.params located at the root of the framework as follow:

SRC_LANG ?= en
TGT_LANG ?= fr

TRAIN_LM      ?= europarl-v4
TRAIN_TM      ?= europarl-v4.fr-en
TUNE_DECODE   ?= dev2006
TUNE_RESCORE  ?= devtest2006
TUNE_CE       ?= test2006
TEST_SET      ?= test2008

LM_TOOLKIT = SRI
DO_RESCORING = 1
DO_CE = 1
DO_TRUECASING = 1
ICU = 1
USE_LDM = 1

BLEU score

You should obtain a BLEU similar to this:

test2008.out.bleu:Human readable value: 34.87 +/- 0.94
test2008.rat.bleu:Human readable value: 34.82 +/- 0.94

Resources Required

You can find out how much resources that was required by running the following command from the root of the framework:

make summary

Portage 1.4.2, NRC-CNRC, (c) 2004 - 2010, Her Majesty in Right of Canada
Please run "portage_info -notice" for Copyright notices of 3rd party libraries.

Running in cluster mode.

Resource summary for /home/larkins/exp/uqam:
         log.ce_model.cem:TIME-MEM                             WALL TIME: 22m21s      CPU TIME: 5h24m14s      VSZ: 1.003G    RSS: 0.698G
      confidence:TIME-MEM                                      WALL TIME: 22m21s      CPU TIME: 5h24m14s      VSZ: 1.003G    RSS: 0.698G
         log.canoe.ini.cow:TIME-MEM                            WALL TIME: 6h41m15s    CPU TIME: 1d1h19m3s     VSZ: 2.226G    RSS: 2.148G
      decode:TIME-MEM                                          WALL TIME: 6h41m15s    CPU TIME: 1d1h19m3s     VSZ: 2.226G    RSS: 2.148G
         log.ldm.counts.hmm3:TIME-MEM                          WALL TIME: 14m48s      CPU TIME: 2h19m48s      VSZ: 2.372G    RSS: 2.215G
         log.ldm.counts.ibm2:TIME-MEM                          WALL TIME: 3m19s       CPU TIME: 35m44s        VSZ: 1.042G    RSS: 0.845G
         log.ldm.hmm3+ibm2.en2fr:TIME-MEM                      WALL TIME: 1h28m37s    CPU TIME: 55m8s         VSZ: 32.946G   RSS: 29.570G
      ldm:TIME-MEM                                             WALL TIME: 1h46m44s    CPU TIME: 3h50m40s      VSZ: 32.946G   RSS: 29.570G
         log.europarl-v4_en-kn-5g.binlm:TIME-MEM               WALL TIME: 6m47s       CPU TIME: 5m38s         VSZ: 0.364G    RSS: 0.311G
         log.europarl-v4_en-kn-5g.lm:TIME-MEM                  WALL TIME: 6m19s       CPU TIME: 5m53s         VSZ: 3.362G    RSS: 3.312G
         log.europarl-v4_fr-kn-5g.binlm:TIME-MEM               WALL TIME: 4m27s       CPU TIME: 4m24s         VSZ: 0.404G    RSS: 0.351G
         log.europarl-v4_fr-kn-5g.lm:TIME-MEM                  WALL TIME: 4m34s       CPU TIME: 4m2s          VSZ: 3.508G    RSS: 3.457G
      lm:TIME-MEM                                              WALL TIME: 22m7s       CPU TIME: 19m57s        VSZ: 3.508G    RSS: 3.457G
         log.rescore-model:TIME-MEM                            WALL TIME: 1h19m22s    CPU TIME: 5h43m7s       VSZ: 1.633G    RSS: 1.513G
      rescore:TIME-MEM                                         WALL TIME: 1h19m22s    CPU TIME: 5h43m7s       VSZ: 1.633G    RSS: 1.513G
         log.europarl-v4_fr-kn-3g.binlm:TIME-MEM               WALL TIME: 1m7s        CPU TIME: 1m3s          VSZ: 0.175G    RSS: 0.121G
         log.europarl-v4_fr-kn-3g.lm:TIME-MEM                  WALL TIME: 1m31s       CPU TIME: 1m42s         VSZ: 0.746G    RSS: 0.698G
         log.europarl-v4_fr.map:TIME-MEM                       WALL TIME: 55s         CPU TIME: 40s           VSZ: 0.086G    RSS: 0.043G
      tc:TIME-MEM                                              WALL TIME: 3m33s       CPU TIME: 3m26s         VSZ: 0.746G    RSS: 0.698G
         log.cpt.hmm3-rf-zn.europarl-v4.fr-en.en2fr:TIME-MEM   WALL TIME: 2h26m57s    CPU TIME: 2h35m9s       VSZ: 10.141G   RSS: 8.819G
         log.cpt.ibm2-rf-zn.europarl-v4.fr-en.en2fr:TIME-MEM   WALL TIME: 16m47s      CPU TIME: 19m25s        VSZ: 2.901G    RSS: 2.531G
         log.hmm3.europarl-v4.fr-en.en_given_fr:TIME-MEM       WALL TIME: 2h10m47s    CPU TIME: 12h30m10s     VSZ: 0.507G    RSS: 0.271G
         log.hmm3.europarl-v4.fr-en.fr_given_en:TIME-MEM       WALL TIME: 2h33m31s    CPU TIME: 13h59m34s     VSZ: 0.501G    RSS: 0.265G
         log.ibm1.europarl-v4.fr-en.en_given_fr:TIME-MEM       WALL TIME: 25m3s       CPU TIME: 1h35m26s      VSZ: 1.048G    RSS: 0.812G
         log.ibm1.europarl-v4.fr-en.fr_given_en:TIME-MEM       WALL TIME: 22m24s      CPU TIME: 1h31m30s      VSZ: 1.056G    RSS: 0.812G
         log.ibm2.europarl-v4.fr-en.en_given_fr:TIME-MEM       WALL TIME: 13m35s      CPU TIME: 1h8m15s       VSZ: 0.564G    RSS: 0.266G
         log.ibm2.europarl-v4.fr-en.fr_given_en:TIME-MEM       WALL TIME: 14m32s      CPU TIME: 1h8m19s       VSZ: 0.496G    RSS: 0.261G
         log.jpt.hmm3.europarl-v4.fr-en.en-fr:TIME-MEM         WALL TIME: 34m1s       CPU TIME: 1h58m0s       VSZ: 5.659G    RSS: 4.951G
         log.jpt.ibm2.europarl-v4.fr-en.en-fr:TIME-MEM         WALL TIME: 9m31s       CPU TIME: 37m39s        VSZ: 1.827G    RSS: 1.580G
      tm:TIME-MEM                                              WALL TIME: 9h27m8s     CPU TIME: 1d13h23m25s   VSZ: 10.141G   RSS: 8.819G
   models:TIME-MEM                                             WALL TIME: 20h2m30s    CPU TIME: 3d6h3m52s     VSZ: 32.946G   RSS: 29.570G
      log.test2008.ce:TIME-MEM                                 WALL TIME: 39m29s      CPU TIME: 6h30m43s      VSZ: 1.097G    RSS: 0.797G
      log.test2008.rat:TIME-MEM                                WALL TIME: 1h2m17s     CPU TIME: 6h18m17s      VSZ: 1.672G    RSS: 1.552G
      log.test2008.rat.tc:TIME-MEM                             WALL TIME: 12s         CPU TIME: 5s            VSZ: 0.048G    RSS: 0.009G
   translate:TIME-MEM                                          WALL TIME: 1h41m58s    CPU TIME: 12h49m5s      VSZ: 1.672G    RSS: 1.552G
TIME-MEM                                                       WALL TIME: 21h44m28s   CPU TIME: 3d18h52m57s   VSZ: 32.946G   RSS: 29.570G


Disk usage for all models:
96K   models/confidence/ce_model.cem
4.9G  models/ldm
97M   models/lm/europarl-v4_en-kn-5g.binlm.gz
112M  models/lm/europarl-v4_en-kn-5g.lm.gz
108M  models/lm/europarl-v4_fr-kn-5g.binlm.gz
123M  models/lm/europarl-v4_fr-kn-5g.lm.gz
108M  models/tm/ibm1.europarl-v4.fr-en.en_given_fr.gz
106M  models/tm/ibm1.europarl-v4.fr-en.fr_given_en.gz
85M   models/tm/ibm2.europarl-v4.fr-en.en_given_fr.gz
176K  models/tm/ibm2.europarl-v4.fr-en.en_given_fr.pos.gz
88M   models/tm/ibm2.europarl-v4.fr-en.fr_given_en.gz
176K  models/tm/ibm2.europarl-v4.fr-en.fr_given_en.pos.gz
4.0K  models/tm/hmm3.europarl-v4.fr-en.en_given_fr.dist.gz
50M   models/tm/hmm3.europarl-v4.fr-en.en_given_fr.gz
4.0K  models/tm/hmm3.europarl-v4.fr-en.fr_given_en.dist.gz
52M   models/tm/hmm3.europarl-v4.fr-en.fr_given_en.gz
1.1G  models/tm/jpt.hmm3.europarl-v4.fr-en.en-fr.gz
264M  models/tm/jpt.ibm2.europarl-v4.fr-en.en-fr.gz
3.6G  models/tm/cpt.hmm3-rf-zn.europarl-v4.fr-en.en2fr.gz
676M  models/tm/cpt.ibm2-rf-zn.europarl-v4.fr-en.en2fr.gz
101M  models/tc
267M  translate
12G   total

Up: PortageII Previous: AnnotatedBibliography Next: FrequentlyAskedQuestions