Tutorial: Launch of an MTUOC server based on training with Marian
1. Introduction
This tutorial explains the procedure for launching an MTUOC server from a previous training with Marian, assuming that the preprocessing of the corpus has been performed with MTUOC-corpus-preprocessing using [SentencePiece](https://github.com/google/sentencepiece). For other configurations the procedure is similar, but some details will have to be adjusted.
2. Required files
In this tutorial we provide links to all the necessary files of a Spanish-English training based on the NTEU corpus. We also provide some intermediate files in case someone wants to replicate all the training steps. In particular, we have used the following two corpora:
These corpora have been converted from TMX to tab-separated text, concatenated, and deduplicated, obtaining the corpus NTEU-uniq-eng-spa.txt of 18,759,269 segments. This corpus has then been rescored using MTUOC-pcorpus-rescorer-txt.py and MTUOC-pcorpus-selector-txt.py; using a threshold of 0.75 for all the parameters, we have obtained the corpus NTEU-rescored-075075075-eng-spa.txt of 14,484,106 segments.
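As an illustrative sketch only (the input file names corpus1-eng-spa.txt and corpus2-eng-spa.txt are placeholders; the actual MTUOC workflow may use its own scripts), the concatenation and deduplication steps can be done with standard shell tools:

```bash
# Concatenate the two tab-separated corpora and keep only the first
# occurrence of each segment pair, preserving the original order.
cat corpus1-eng-spa.txt corpus2-eng-spa.txt | awk '!seen[$0]++' > NTEU-uniq-eng-spa.txt

# Check the number of resulting segments.
wc -l NTEU-uniq-eng-spa.txt
```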
For the validation corpus we have used Flores+: we have pasted dev.eng_Latn and dev.spa_Latn to obtain the validation set val-eng-spa.txt.
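For reference, this is the standard paste invocation (assuming both Flores+ dev files are in the current directory):

```bash
# Join the English and Spanish dev files line by line with a tab,
# producing the tab-separated bilingual validation set.
paste dev.eng_Latn dev.spa_Latn > val-eng-spa.txt
```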
The necessary files from the preprocessing can be obtained from the following compressed folder: preprocessing-NTEU-eng-spa.zip.
The files resulting from the training can be obtained from this compressed folder: training-NTEU-spa-eng.zip. Note that we have removed most intermediate models and the training corpus to reduce its size.
3. Important files resulting from training
Once the training is done, we will have a series of very important files needed to launch the server:
- Vocabulary: vocab-es.yml and vocab-en.yml.
- Models resulting from the different checkpoints: model.iterXXXXX.npz (where XXXXX indicates the checkpoint step).
- Final models: model.npz and, since we have used two validation metrics, the best models according to these metrics, in our case model.npz.best-bleu-detok.npz and model.npz.best-cross-entropy.npz.
It is common to recover a number of model.iter checkpoints (for example, 3) with the best values of a given validation metric, in order to evaluate the engine using an ensemble of these models.
This can be done with the program [getBestCheckpoint.py](https://raw.githubusercontent.com/mtuoc/MTUOC-Marian-training/refs/heads/main/getBestCheckpoint.py), as follows:
python3 getBestCheckpoint.py valid.log bleu-detok 3
This returns the three models with the best bleu-detok scores:
1290000 32.636
1270000 32.4479
1310000 32.4363
model.iter1290000.npz model.iter1270000.npz model.iter1310000.npz
In the provided training files you will find only these models, along with the final ones.
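For orientation only, a rough shell equivalent of what getBestCheckpoint.py computes might look like this; it assumes Marian's usual valid.log line format ("[valid] Ep. X : Up. STEP : bleu-detok : SCORE ..."), which may differ in your setup:

```bash
# List the three checkpoint steps with the highest bleu-detok score,
# assuming validation log fields are separated by " : ".
grep 'bleu-detok' valid.log \
  | awk -F' : ' '{gsub(/Up\. /, "", $2); print $2, $4}' \
  | sort -k2,2 -gr | head -3
```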
4. Preprocessing necessary files
In order to launch the server, we will need the following files resulting from the preprocessing (which you can download from the link indicated in the introduction):
- spmodel.model
- tc.es
5. Setting of the server
Now, to set up the server, it is necessary to obtain the latest available version of [MTUOC-server](https://github.com/mtuoc/MTUOC-server) by doing:
git clone https://github.com/mtuoc/MTUOC-server.git
An MTUOC-server folder will be created, which can be renamed to indicate which engine it is. In our case, for example, MTUOC-NTEU-spa-eng.
We enter the directory and copy there the following files: model.npz, model.iter1290000.npz, model.iter1270000.npz, model.iter1310000.npz, vocab-es.yml, vocab-en.yml, spmodel.model and tc.es.
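As a sketch, assuming the zip files from section 2 have been unpacked next to the cloned folder (the ../training-NTEU-spa-eng and ../preprocessing-NTEU-eng-spa paths are hypothetical):

```bash
# Rename the cloned folder to reflect the engine it will serve.
mv MTUOC-server MTUOC-NTEU-spa-eng
cd MTUOC-NTEU-spa-eng

# Copy the models and vocabularies from the training folder.
cp ../training-NTEU-spa-eng/model.npz .
cp ../training-NTEU-spa-eng/model.iter{1290000,1270000,1310000}.npz .
cp ../training-NTEU-spa-eng/vocab-es.yml ../training-NTEU-spa-eng/vocab-en.yml .

# Copy the SentencePiece model and the truecasing model.
cp ../preprocessing-NTEU-eng-spa/spmodel.model ../preprocessing-NTEU-eng-spa/tc.es .
```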
We will also need the tokenizers [MTUOC_tokenizer_spa.py](https://raw.githubusercontent.com/mtuoc/MTUOC-tokenizers/refs/heads/main/MTUOC_tokenizer_spa.py) and [MTUOC_tokenizer_eng.py](https://raw.githubusercontent.com/mtuoc/MTUOC-tokenizers/refs/heads/main/MTUOC_tokenizer_eng.py).
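Both can be fetched with wget (the second URL is inferred by analogy with the first, since the original link was truncated):

```bash
wget https://raw.githubusercontent.com/mtuoc/MTUOC-tokenizers/refs/heads/main/MTUOC_tokenizer_spa.py
wget https://raw.githubusercontent.com/mtuoc/MTUOC-tokenizers/refs/heads/main/MTUOC_tokenizer_eng.py
```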
Now we have to edit the file config-server.yaml and change the lines indicated below:
```yaml
MTEngine:
  MTengine: Marian

Preprocess:
  truecase: upper
  #one of always, never, upper
  truecaser: MTUOC
  #one of MTUOC, Moses
  truecaser_tokenizer: MTUOC_tokenizer_spa
  #one of None, MTUOC_tokenizer_xxx, Moses
  tcmodel: tc.es
  srxlang: Spanish

Marian:
  startMarianServer: True
  startMarianCommand: "./marian-server-CPU -m model.npz -v vocab-es.yml vocab-en.yml -p 8250 --n-best --alignment hard --normalize 1 -b 20 --word-penalty 10 --max-length-factor 1.24 --quiet &"
  IP: localhost
```
Finally, we will need Marian compiled for our GPU, if we have one, or a CPU version that can be downloaded from lpg.uoc.edu/marian-server-v1.11.0/marian-server-CPU. Don't forget to give it execution permissions:
chmod +x marian-server-CPU
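If you are working on a remote machine, the download step itself can also be done from the command line (the https scheme is an assumption here, since the tutorial gives the address without one):

```bash
# Download the CPU build of marian-server; https is assumed.
wget https://lpg.uoc.edu/marian-server-v1.11.0/marian-server-CPU
```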
Now we can start the server:
python3 MTUOC-server.py
ATTENTION: the requirements listed in the requirements.txt file have to be installed first.
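This is the usual pip invocation for installing them:

```bash
# Install the Python dependencies of MTUOC-server.
pip3 install -r requirements.txt
```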
Once launched, the following message will appear:
2025-01-03 18:03:17.150846 Connection with Marian Server created
2025-01-03 18:03:17.245553 3 MTUOC server started using MTUOC protocol
MTUOC server IP: 192.168.1.51
MTUOC server port: 8000
MTUOC server type: MTUOC
Now we can use the engine.
We can also configure the engine to use an ensemble of the three best models, by editing the line:
startMarianCommand: "./marian-server-CPU --models model.iter1290000.npz model.iter1270000.npz model.iter1310000.npz --weights 0.34 0.33 0.33 -v vocab-es.yml vocab-en.yml -p 8250 --n-best --alignment hard --normalize 1 -b 20 --word-penalty 10 --max-length-factor 1.24 --quiet &"
Remember to stop the engine before starting it again:
python3 MTUOC-stop-server.py
And start it again with:
python3 MTUOC-server.py
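If you want the server to keep running after you close the terminal, one option (an assumption on our part, not part of the MTUOC scripts) is to relaunch it detached with nohup; server.log below is just an arbitrary file name for the output:

```bash
# Stop the running engine, then relaunch it in the background.
python3 MTUOC-stop-server.py
nohup python3 MTUOC-server.py > server.log 2>&1 &
```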