Tutorial: Launch of an MTUOC server based on training with Marian
1. Introduction
This tutorial explains the procedure for launching an MTUOC server from a previous training with Marian, assuming that the preprocessing of the corpus has been performed with MTUOC-corpus-preprocessing using [SentencePiece](https://github.com/google/sentencepiece). For other configurations the procedure is similar, but some details will have to be adjusted.
2. Required files
In this tutorial we provide links to all the necessary files of a Spanish-English training based on the NTEU corpus. We also provide some intermediate files in case someone wants to replicate all the training steps. In particular, we have used the following two corpora:
These corpora have been converted from TMX to tab-separated text, concatenated, and deduplicated, obtaining the corpus NTEU-uniq-eng-spa.txt of 18,759,269 segments. This corpus has then been rescored using MTUOC-pcorpus-rescorer-txt.py and MTUOC-pcorpus-selector-txt.py; using a threshold of 0.75 for all the parameters, we have obtained the corpus NTEU-rescored-075075075-eng-spa.txt of 14,484,106 segments.
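As an illustrative sketch only (the input file names corpus1-eng-spa.txt and corpus2-eng-spa.txt are placeholders; the actual MTUOC workflow may use its own scripts), the concatenation and deduplication steps can be done with standard shell tools:

```bash
# Concatenate the two tab-separated corpora and keep only the first
# occurrence of each segment pair, preserving the original order.
cat corpus1-eng-spa.txt corpus2-eng-spa.txt | awk '!seen[$0]++' > NTEU-uniq-eng-spa.txt

# Check the number of resulting segments.
wc -l NTEU-uniq-eng-spa.txt
```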
For the validation corpus we have used Flores+: we have pasted dev.eng_Latn and dev.spa_Latn to obtain the validation set val-eng-spa.txt.
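For reference, this is the standard paste invocation (assuming both Flores+ dev files are in the current directory):

```bash
# Join the English and Spanish dev files line by line with a tab,
# producing the tab-separated bilingual validation set.
paste dev.eng_Latn dev.spa_Latn > val-eng-spa.txt
```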
The necessary files from the preprocessing can be obtained from the following compressed folder: preprocessing-NTEU-eng-spa.zip.
The files resulting from the training can be obtained from this compressed folder: training-NTEU-spa-eng.zip. Note that we have removed most intermediate models and the training corpus to reduce its size.
3. Important files resulting from training
Once the training is done, we will have a series of very important files needed to launch the server:
- Vocabulary: vocab-es.yml and vocab-en.yml.
- Models resulting from the different checkpoints: model.iterXXXXX.npz (where XXXXX indicates the checkpoint step).
- Final models: model.npz and, since we have used two validation metrics, the best models according to these metrics, in our case model.npz.best-bleu-detok.npz and model.npz.best-cross-entropy.npz.
It is common to recover a number of model.iter checkpoints (for example, 3) with the best values of a given validation metric, in order to evaluate the engine using an ensemble of these models.
This can be done with the program [getBestCheckpoint.py](https://raw.githubusercontent.com/mtuoc/MTUOC-Marian-training/refs/heads/main/getBestCheckpoint.py), as follows:
python3 getBestCheckpoint.py valid.log bleu-detok 3
This returns the three models with the best bleu-detok scores:
1290000 32.636
1270000 32.4479
1310000 32.4363
model.iter1290000.npz model.iter1270000.npz model.iter1310000.npz
In the provided training files you will find only these models, along with the final ones.
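For orientation only, a rough shell equivalent of what getBestCheckpoint.py computes might look like this; it assumes Marian's usual valid.log line format ("[valid] Ep. X : Up. STEP : bleu-detok : SCORE ..."), which may differ in your setup:

```bash
# List the three checkpoint steps with the highest bleu-detok score,
# assuming validation log fields are separated by " : ".
grep 'bleu-detok' valid.log \
  | awk -F' : ' '{gsub(/Up\. /, "", $2); print $2, $4}' \
  | sort -k2,2 -gr | head -3
```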
4. Preprocessing necessary files
In order to launch the server, we will need the following files resulting from the preprocessing (which you can download from the link indicated in the introduction):
- spmodel.model
- tc.es
5. Setting of the server
Now, to set up the server, it is necessary to obtain the latest available version of [MTUOC-server](https://github.com/mtuoc/MTUOC-server) by doing:
git clone https://github.com/mtuoc/MTUOC-server.git
An MTUOC-server folder will be created, which can be renamed to indicate which engine it is. In our case, for example, MTUOC-NTEU-spa-eng.
We enter the directory and copy there the following files: model.npz, model.iter1290000.npz, model.iter1270000.npz, model.iter1310000.npz, vocab-es.yml, vocab-en.yml, spmodel.model and tc.es.
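As a sketch, assuming the zip files from section 2 have been unpacked next to the cloned folder (the ../training-NTEU-spa-eng and ../preprocessing-NTEU-eng-spa paths are hypothetical):

```bash
# Rename the cloned folder to reflect the engine it will serve.
mv MTUOC-server MTUOC-NTEU-spa-eng
cd MTUOC-NTEU-spa-eng

# Copy the models and vocabularies from the training folder.
cp ../training-NTEU-spa-eng/model.npz .
cp ../training-NTEU-spa-eng/model.iter{1290000,1270000,1310000}.npz .
cp ../training-NTEU-spa-eng/vocab-es.yml ../training-NTEU-spa-eng/vocab-en.yml .

# Copy the SentencePiece model and the truecasing model.
cp ../preprocessing-NTEU-eng-spa/spmodel.model ../preprocessing-NTEU-eng-spa/tc.es .
```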
We will also need the tokenizers [MTUOC_tokenizer_spa.py](https://raw.githubusercontent.com/mtuoc/MTUOC-tokenizers/refs/heads/main/MTUOC_tokenizer_spa.py) and [MTUOC_tokenizer_eng.py](https://raw.githubusercontent.com/mtuoc/MTUOC-tokenizers/refs/heads/main/MTUOC_tokenizer_eng.py).
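Both can be fetched with wget (the second URL is inferred by analogy with the first, since the original link was truncated):

```bash
wget https://raw.githubusercontent.com/mtuoc/MTUOC-tokenizers/refs/heads/main/MTUOC_tokenizer_spa.py
wget https://raw.githubusercontent.com/mtuoc/MTUOC-tokenizers/refs/heads/main/MTUOC_tokenizer_eng.py
```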
Now we have to edit the file config-server.yaml and change the lines indicated below:
```yaml
MTEngine:
  MTengine: Marian

Preprocess:
  truecase: upper
  #one of always, never, upper
  truecaser: MTUOC
  #one of MTUOC, Moses
  truecaser_tokenizer: MTUOC_tokenizer_spa
  #one of None, MTUOC_tokenizer_xxx, Moses
  tcmodel: tc.es
  srxlang: Spanish

Marian:
  startMarianServer: True
  startMarianCommand: "./marian-server-CPU -m model.npz -v vocab-es.yml vocab-en.yml -p 8250 --n-best --alignment hard --normalize 1 -b 20 --word-penalty 10 --max-length-factor 1.24 --quiet &"
  IP: localhost
```
Finally, we will need Marian compiled for our GPU, if we have one, or a CPU version that can be downloaded from lpg.uoc.edu/marian-server-v1.11.0/marian-server-CPU. Don't forget to give it execution permissions:
chmod +x marian-server-CPU
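If you are working on a remote machine, the download step itself can also be done from the command line (the https scheme is an assumption here, since the tutorial gives the address without one):

```bash
# Download the CPU build of marian-server; https is assumed.
wget https://lpg.uoc.edu/marian-server-v1.11.0/marian-server-CPU
```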
Now we can start the server:
python3 MTUOC-server.py
ATTENTION: the requirements listed in the requirements.txt file have to be installed first.
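This is the usual pip invocation for installing them:

```bash
# Install the Python dependencies of MTUOC-server.
pip3 install -r requirements.txt
```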
Once launched, the following message will appear:
2025-01-03 18:03:17.150846 Connection with Marian Server created
2025-01-03 18:03:17.245553 3 MTUOC server started using MTUOC protocol
MTUOC server IP: 192.168.1.51
MTUOC server port: 8000
MTUOC server type: MTUOC
Now we can use the engine.
We can also configure the engine to use an ensemble of the three best models, by editing the line:
startMarianCommand: "./marian-server-CPU --models model.iter1290000.npz model.iter1270000.npz model.iter1310000.npz --weights 0.34 0.33 0.33 -v vocab-es.yml vocab-en.yml -p 8250 --n-best --alignment hard --normalize 1 -b 20 --word-penalty 10 --max-length-factor 1.24 --quiet &"
Remember to stop the engine before starting it again:
python3 MTUOC-stop-server.py
And start it again with:
python3 MTUOC-server.py
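If you want the server to keep running after you close the terminal, one option (an assumption on our part, not part of the MTUOC scripts) is to relaunch it detached with nohup; server.log below is just an arbitrary file name for the output:

```bash
# Stop the running engine, then relaunch it in the background.
python3 MTUOC-stop-server.py
nohup python3 MTUOC-server.py > server.log 2>&1 &
```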