Tutorial: Conversion from TMX to tabulated text
This tutorial explains how to use the programs in the repository https://github.com/mtuoc/MTUOC-TMX2tabtxt to convert TMX files into tabulated text files. The programs come in the following versions:
- Versions for use in Terminal.
- Versions with a graphical user interface (these have GUI in their name).
They can also be distinguished by what they act on:
- Versions that act on a single file.
- Versions that process all the TMX files in a directory.
And there are programs for two actions:
- Detect the language codes of the TMX file or files.
- Convert the TMX file(s) into a tabulated text file.
The latest available release also distributes Windows executable versions of graphical user interface programs.
If we are not sure which languages the TMX we want to convert contains, or which language codes it uses for each language, it is advisable to detect the language codes present in the TMX. Recall that English, for example, could have the codes en, en-GB, en-US, eng, etc.
The detection can be done on a single file in Terminal by means of the program MTUOC-TMXdetectlanguages.py, which displays the help with the -h option:
python3 MTUOC-TMXdetectlanguages.py -h
usage: MTUOC-TMXdetectlanguages.py [-h] -i INPUTFILE
MTUOC program for detecting the language codes of a TMX file.
options:
-h, --help show this help message and exit
-i INPUTFILE, --in INPUTFILE
The input TMX file.
As we can see, we simply have to specify the input file using the -i option:
python3 MTUOC-TMXdetectlanguages.py -i archivo.tmx
and the present codes will appear on screen:
en
es
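To make the detection step more concrete, here is a minimal sketch of how language codes can be collected from a TMX file with the Python standard library. It is not the code of MTUOC-TMXdetectlanguages.py; the function name detect_language_codes and the sample data are hypothetical, and we assume that the codes are stored in the xml:lang attribute of each tuv element (or in a plain lang attribute in older files).

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny in-memory TMX used only for illustration (hypothetical data).
SAMPLE_TMX = b"""<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello</seg></tuv>
    <tuv xml:lang="es"><seg>Hola</seg></tuv></tu>
<tu><tuv xml:lang="en-GB"><seg>Colour</seg></tuv>
    <tuv lang="es"><seg>Color</seg></tuv></tu>
</body></tmx>"""


def detect_language_codes(tmx_source):
    """Return the sorted language codes found on <tuv> elements.

    tmx_source may be a file path or an open binary file object.
    """
    codes = set()
    # iterparse reads the file incrementally, which helps with large TMX files.
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag == "tuv":
            # Assumption: xml:lang is used, with plain lang as a fallback.
            code = elem.get(XML_LANG) or elem.get("lang")
            if code:
                codes.add(code)
            elem.clear()  # release the element once its code has been read
    return sorted(codes)


print(detect_language_codes(io.BytesIO(SAMPLE_TMX)))
```

With a real file we would pass its path instead of the in-memory sample, for example detect_language_codes("archivo.tmx").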
We can also launch the graphical user interface version, MTUOC-TMXdetectlanguages-GUI.py, or, if we are on Windows, directly its executable version MTUOC-TMXdetectlanguages-GUI.exe. An interface like the following one will appear, where we indicate the input file by means of the Input file button. When we click on the Go! button, the program shows the detected codes, as in the following image:
It is also possible to detect the language codes of all the TMX files in a directory. In Terminal this can be done with the MTUOC-TMXdetectlanguagesDIR.py program, which displays the help with the -h option:
python MTUOC-TMXdetectlanguagesDIR.py -h
usage: MTUOC-TMXdetectlanguagesDIR.py [-h] -d INPUTDIR
MTUOC program for detecting the language code of all TMX files in a given directory.
options:
-h, --help show this help message and exit
-d INPUTDIR, --dir INPUTDIR
The input directory where the TMX files are located.
Simply indicate the input directory using the -d option and the program will display all the language codes:
python MTUOC-TMXdetectlanguagesDIR.py -d directorio
en-es.tmx
gnome.tmx
en
es
en-GB
es-ES
eng
spa
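The directory variant can be pictured as the same detection applied to every TMX file in a folder. The following is only a sketch under that assumption, not the actual code of MTUOC-TMXdetectlanguagesDIR.py; the function names codes_in_file and codes_in_directory are hypothetical, and we assume that the files of interest have the .tmx extension.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def codes_in_file(path):
    """Return the set of language codes found on <tuv> elements of one file."""
    codes = set()
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "tuv":
            # Assumption: xml:lang is used, with plain lang as a fallback.
            code = elem.get(XML_LANG) or elem.get("lang")
            if code:
                codes.add(code)
            elem.clear()
    return codes


def codes_in_directory(directory):
    """Map each *.tmx file name in the directory to its sorted codes."""
    result = {}
    for path in sorted(Path(directory).glob("*.tmx")):
        result[path.name] = sorted(codes_in_file(path))
    return result
```

Calling codes_in_directory("directorio") would return one entry per TMX file. The full set of codes across the directory can then be obtained by joining the lists of all the entries.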
We also have the graphical version MTUOC-TMXdetectlanguagesDIR-GUI.py and MTUOC-TMXdetectlanguagesDIR-GUI.exe, which has the following graphical interface where we have to indicate the input directory:
If we already know the language codes of the file or files we want to convert, we have the option of doing it in Terminal or with a graphical interface and on a single file or on the whole directory.
To convert a single file in Terminal we can use the MTUOC-TMX2tabtxt.py program, which displays the help with the -h option:
python MTUOC-TMX2tabtxt.py -h
usage: MTUOC-TMX2tabtxt.py [-h] -i INPUTFILE -o OUTPUTFILE -s SLCODE [SLCODE ...] -t TLCODE [TLCODE ...] [--noTags]
[--simpleTags] [--noEntities] [--fixencoding]
MTUOC program for converting a TMX into a tab text.
options:
-h, --help show this help message and exit
-i INPUTFILE, --in INPUTFILE
The input TMX file.
-o OUTPUTFILE, --out OUTPUTFILE
The output text file.
-s SLCODE [SLCODE ...], --sl SLCODE [SLCODE ...]
The code for the source language.
-t TLCODE [TLCODE ...], --tl TLCODE [TLCODE ...]
The code for the target language.
--noTags Removes the internal tags.
--simpleTags Replaces tags with <t>, </t> or <t/>.
--noEntities Replaces html/xml entities by corresponding characters.
--fixencoding Tries to restore errors in encoding.
With -i we indicate the input file and with -o the output file.
With -s or --sl we indicate the code or codes of the source language, and with -t or --tl those of the target language. To indicate more than one code, we separate them with spaces, for example: -s en eng en-GB en-US.
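The reason for accepting several codes per language is that one TMX may label the same language in different ways. The sketch below illustrates the idea with the standard library: for each translation unit it takes the first segment whose code is in the source list and the first whose code is in the target list. It is an illustration only, not the code of MTUOC-TMX2tabtxt.py; the function name tmx_to_tab_lines, the whitespace normalisation, and the sample data are assumptions.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny in-memory TMX used only for illustration (hypothetical data).
SAMPLE_TMX = b"""<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en-GB"><seg>Hello
    world</seg></tuv>
    <tuv xml:lang="es-ES"><seg>Hola mundo</seg></tuv></tu>
<tu><tuv xml:lang="eng"><seg>File</seg></tuv>
    <tuv xml:lang="spa"><seg>Archivo</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Only English</seg></tuv>
    <tuv xml:lang="fr"><seg>Seulement</seg></tuv></tu>
</body></tmx>"""


def tmx_to_tab_lines(tmx_source, sl_codes, tl_codes):
    """Yield 'source<TAB>target' lines for units that contain both languages."""
    sl_codes, tl_codes = set(sl_codes), set(tl_codes)
    for _event, tu in ET.iterparse(tmx_source, events=("end",)):
        if tu.tag != "tu":
            continue
        source = target = None
        for tuv in tu.iter("tuv"):
            code = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if seg is None:
                continue
            # itertext() concatenates all text inside <seg>, including text
            # nested in inline elements; split/join collapses tabs and
            # newlines so they cannot break the tab-separated format.
            text = " ".join("".join(seg.itertext()).split())
            if code in sl_codes and source is None:
                source = text
            elif code in tl_codes and target is None:
                target = text
        if source and target:
            yield f"{source}\t{target}"
        tu.clear()  # release the unit once it has been processed


for line in tmx_to_tab_lines(io.BytesIO(SAMPLE_TMX),
                             ["en", "en-GB", "eng"], ["es", "es-ES", "spa"]):
    print(line)
```

The third unit of the sample is skipped because it has no segment with one of the target codes.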
The --noTags option removes all HTML/XML tags from the segments. The --simpleTags option, on the other hand, replaces each tag with <t>, </t> or <t/>.
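The difference between the two tag options can be illustrated with a short regular-expression sketch. This is one possible way to obtain the described effect, not necessarily how the program does it; the names TAG, remove_tags and simplify_tags are hypothetical, and the pattern is deliberately simple (it does not handle a > character inside an attribute value).

```python
import re

# Matches one XML/HTML-style tag: </name>, <name ...> or <name .../>.
TAG = re.compile(r"<(/?)[A-Za-z][^<>]*?(/?)>")


def remove_tags(segment):
    """Delete every tag, keeping only the text around it."""
    return TAG.sub("", segment)


def simplify_tags(segment):
    """Replace opening, closing and self-closing tags with <t>, </t>, <t/>."""
    def repl(match):
        closing, self_closing = match.group(1), match.group(2)
        if closing:
            return "</t>"
        if self_closing:
            return "<t/>"
        return "<t>"
    return TAG.sub(repl, segment)


segment = 'Click <b>Save</b> or press <ph x="1"/> now'
print(remove_tags(segment))
print(simplify_tags(segment))
```

Simplified tags keep the position of the markup in the segment while hiding its details, whereas removing the tags leaves plain text only.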
The --noEntities option replaces HTML/XML entities with their corresponding characters.
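As an illustration of what replacing entities means, the following sketch uses html.unescape from the Python standard library. Whether the program relies on this function is an assumption; the wrapper name replace_entities is hypothetical.

```python
import html


def replace_entities(segment):
    """Turn named and numeric HTML/XML entities into the characters they denote."""
    return html.unescape(segment)


print(replace_entities("Tom &amp; Jerry &lt;3 caf&eacute; &#8364;5"))
```

Note that a single pass only undoes one level of escaping, so a doubly escaped sequence such as &amp;lt; becomes &lt; rather than <.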
The --fixencoding option tries to repair any character encoding errors that may exist in the file.
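A typical encoding error in translation memories is UTF-8 text that was read as Windows-1252 or Latin-1 at some point, so that España appears as EspaÃ±a. The tutorial does not say how the program repairs such errors, so the following is only a sketch of one common repair strategy; the function name fix_mojibake is hypothetical.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252 or Latin-1.

    If the text cannot be reinterpreted cleanly, it is returned unchanged.
    """
    for wrong_encoding in ("cp1252", "latin-1"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(wrong_encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


print(fix_mojibake("Espa\u00c3\u00b1a"))  # damaged input, written with escapes
print(fix_mojibake("Espa\u00f1a"))        # correct input stays as it is
```

This heuristic can alter text that legitimately contains such character sequences, so it is worth checking the result on a sample of segments.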
Now we can write the command, for example:
python MTUOC-TMX2tabtxt.py -i gnome.tmx -o gnome-eng-spa.txt -s en en-GB eng -t es es-ES spa --noTags --noEntities --fixencoding
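If we need to launch the conversion from another Python script, the command can be assembled from the options listed in the help. The sketch below uses only those documented options; the function names build_command and run_conversion are hypothetical, and we assume that MTUOC-TMX2tabtxt.py is in the current working directory.

```python
import subprocess
import sys


def build_command(input_file, output_file, sl_codes, tl_codes,
                  no_tags=False, simple_tags=False,
                  no_entities=False, fix_encoding=False,
                  script="MTUOC-TMX2tabtxt.py"):
    """Build the argument list for the converter from its documented options."""
    command = [sys.executable, script, "-i", input_file, "-o", output_file]
    command += ["-s", *sl_codes]   # one or more source language codes
    command += ["-t", *tl_codes]   # one or more target language codes
    if no_tags:
        command.append("--noTags")
    if simple_tags:
        command.append("--simpleTags")
    if no_entities:
        command.append("--noEntities")
    if fix_encoding:
        command.append("--fixencoding")
    return command


def run_conversion(*args, **kwargs):
    """Run the converter and raise an error if it exits with a failure code."""
    subprocess.run(build_command(*args, **kwargs), check=True)
```

For instance, run_conversion("gnome.tmx", "gnome-eng-spa.txt", ["en", "en-GB", "eng"], ["es", "es-ES", "spa"], no_tags=True, no_entities=True, fix_encoding=True) would launch the same conversion as the command above.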
We can also use the GUI version: MTUOC-TMX2tabtxt-GUI.py or MTUOC-TMX2tabtxt-GUI.exe, which presents the following interface:
We can also use the DIR version, which processes all the TMX files in a directory:
python MTUOC-TMX2tabtxtDIR.py -d directorio -o corpus-eng-spa.txt -s en en-GB eng -t es es-ES spa --noTags --noEntities --fixencoding
or the version with a graphical interface, MTUOC-TMX2tabtxtDIR-GUI.py or MTUOC-TMX2tabtxtDIR-GUI.exe, which presents the following interface: