Tutorial: Conversion from TMX to tabulated text - mtuoc/tutorials GitHub Wiki

1. Introduction

This tutorial explains how to use the programs in the repository https://github.com/mtuoc/MTUOC-TMX2tabtxt to convert TMX files into tabulated text files. The programs are available in the following versions:

  • Versions for use in the terminal.
  • Versions with a graphical user interface (these have GUI in their name).

The programs can also be distinguished by what they act on:

  • Programs that act on a single file.
  • Programs that process all TMX files in a directory.

There are programs for two actions:

  • Detect the language codes of the TMX file or files.
  • Convert the TMX file(s) into a tabulated text file.

The latest available release also includes Windows executables of the graphical user interface programs.

2. Detection of language codes

If we are not sure which languages the TMX we want to convert contains, or which language codes it uses for each language, it is advisable to detect the language codes present in the TMX. Recall that English, for example, could have the codes en, en-GB, en-US, eng, etc.

The detection can be done on a single file in the terminal with the program MTUOC-TMXdetectlanguages.py, whose -h option displays the help:

python3 MTUOC-TMXdetectlanguages.py -h
usage: MTUOC-TMXdetectlanguages.py [-h] -i INPUTFILE

MTUOC program for detecting the language codes of a TMX file.

options:
  -h, --help            show this help message and exit
  -i INPUTFILE, --in INPUTFILE
                        The input TMX file.

As we can see, we simply have to specify the input file with the -i option:

python3 MTUOC-TMXdetectlanguages.py -i archivo.tmx

and the codes present in the file will appear on screen:

en
es
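
To make clear what "detecting the language codes" involves, here is a minimal sketch in Python. It is not the code of MTUOC-TMXdetectlanguages.py; it only assumes that each translation unit variant (tuv element) carries its language code in the xml:lang attribute, or in a plain lang attribute in older files. The function name and the sample TMX are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Fully qualified name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def detect_language_codes(tmx_text):
    """Return the sorted list of language codes found on <tuv> elements."""
    root = ET.fromstring(tmx_text)
    codes = set()
    for tuv in root.iter("tuv"):
        # Assumption: the code is in xml:lang, or in lang for older files.
        code = tuv.get(XML_LANG) or tuv.get("lang")
        if code:
            codes.add(code)
    return sorted(codes)


# Tiny hypothetical TMX used only for this demonstration.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello</seg></tuv>
      <tuv xml:lang="es"><seg>Hola</seg></tuv>
    </tu>
  </body>
</tmx>"""

print(detect_language_codes(SAMPLE_TMX))
```

The codes are compared exactly as written, so en and en-GB count as two different codes, which is why the conversion programs accept several codes per language.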

We can also launch the version with graphical user interface, MTUOC-TMXdetectlanguages-GUI.py, or, if we are on Windows, directly its executable version MTUOC-TMXdetectlanguages-GUI.exe. In the interface that appears we indicate the input file by means of the Input file button, and after clicking on the Go! button the program will show us the detected codes.

It is also possible to detect the language codes of all TMX files in a directory. In the terminal this can be done with the MTUOC-TMXdetectlanguagesDIR.py program, whose -h option displays the help:

python MTUOC-TMXdetectlanguagesDIR.py -h
usage: MTUOC-TMXdetectlanguagesDIR.py [-h] -d INPUTDIR

MTUOC program for detecting the language code of all TMX files in a given directory.

options:
  -h, --help            show this help message and exit
  -d INPUTDIR, --dir INPUTDIR
                        The input directory where the TMX files are located.

Simply indicate the input directory with the -d option, and the program will display the TMX files found and all the language codes they contain:

python MTUOC-TMXdetectlanguagesDIR.py -d directorio
en-es.tmx
gnome.tmx
en
es
en-GB
es-ES
eng
spa
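
The directory variant can be pictured as the same detection repeated over every TMX file in a folder. The following is a rough sketch under the same assumption as before (codes stored in xml:lang or lang on tuv elements); the function name, the grouping of codes per file and the demo files are hypothetical and do not reproduce the actual program or its output format.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def detect_codes_in_dir(directory):
    """Map each .tmx file name in `directory` to the language codes it contains."""
    found = {}
    for path in sorted(Path(directory).glob("*.tmx")):
        codes = set()
        for tuv in ET.parse(str(path)).getroot().iter("tuv"):
            code = tuv.get(XML_LANG) or tuv.get("lang")
            if code:
                codes.add(code)
        found[path.name] = sorted(codes)
    return found


# Demo on a temporary directory holding two tiny, made-up TMX files.
TEMPLATE = (
    '<tmx version="1.4"><header/><body><tu>'
    '<tuv xml:lang="{a}"><seg>x</seg></tuv>'
    '<tuv xml:lang="{b}"><seg>y</seg></tuv>'
    "</tu></body></tmx>"
)

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "en-es.tmx").write_text(TEMPLATE.format(a="en", b="es"), encoding="utf-8")
    Path(tmp, "gnome.tmx").write_text(TEMPLATE.format(a="eng", b="spa"), encoding="utf-8")
    demo = detect_codes_in_dir(tmp)

print(demo)
```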

We also have the graphical versions MTUOC-TMXdetectlanguagesDIR-GUI.py and MTUOC-TMXdetectlanguagesDIR-GUI.exe, in whose interface we have to indicate the input directory.

3. Conversion from TMX to tabulated text

If we already know the language codes of the file or files we want to convert, we can do the conversion in the terminal or with a graphical interface, and on a single file or on a whole directory.

To convert a single file in the terminal we can use the MTUOC-TMX2tabtxt.py program, whose -h option displays the help:

python MTUOC-TMX2tabtxt.py -h
usage: MTUOC-TMX2tabtxt.py [-h] -i INPUTFILE -o OUTPUTFILE -s SLCODE [SLCODE ...] -t TLCODE [TLCODE ...] [--noTags]
                           [--simpleTags] [--noEntities] [--fixencoding]

MTUOC program for converting a TMX into a tab text.

options:
  -h, --help            show this help message and exit
  -i INPUTFILE, --in INPUTFILE
                        The input TMX file.
  -o OUTPUTFILE, --out OUTPUTFILE
                        The output text file.
  -s SLCODE [SLCODE ...], --sl SLCODE [SLCODE ...]
                        The code for the source language.
  -t TLCODE [TLCODE ...], --tl TLCODE [TLCODE ...]
                        The code for the target language.
  --noTags              Removes the internal tags.
  --simpleTags          Replaces tags with <t>, </t> or <t/>.
  --noEntities          Replaces html/xml entities by corresponding characters.
  --fixencoding         Tries to restore errors in encoding.

With -i we indicate the input file and with -o the output file.

With -s or --sl we indicate the code or codes of the source language, and with -t or --tl those of the target language. If we want to indicate more than one code, we separate them with spaces, for example: -s en eng en-GB en-US.
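
The role of the source and target code lists can be illustrated with a short sketch. It is an approximation written for this tutorial, not the code of MTUOC-TMX2tabtxt.py: it assumes that a translation unit is written out only when it has one segment whose code is in the source list and one whose code is in the target list, and that the two segments are separated by a tab. The function name and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tab_lines(tmx_text, sl_codes, tl_codes):
    """Yield 'source<TAB>target' lines for units that have both languages."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        source = target = None
        for tuv in tu.iter("tuv"):
            code = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if seg is None:
                continue
            # itertext() joins all text inside the segment, including any
            # text nested in inline elements.
            text = "".join(seg.itertext()).strip()
            if code in sl_codes:
                source = text
            elif code in tl_codes:
                target = text
        if source and target:
            yield f"{source}\t{target}"


# Hypothetical TMX: the second unit has no Spanish segment and is skipped.
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en-GB"><seg>Open file</seg></tuv><tuv xml:lang="es-ES"><seg>Abrir archivo</seg></tuv></tu>
<tu><tuv xml:lang="eng"><seg>Close</seg></tuv><tuv xml:lang="fr"><seg>Fermer</seg></tuv></tu>
</body></tmx>"""

lines = list(tmx_to_tab_lines(SAMPLE_TMX, {"en", "en-GB", "eng"}, {"es", "es-ES", "spa"}))
print(lines)
```

Because the match is exact, a code that is missing from the list (for example en-US in a file that uses it) would silently leave those units out, which is the reason for detecting the codes first.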

The --noTags option removes all HTML/XML tags from segments. The --simpleTags option, on the other hand, replaces any tag with <t>, </t> or <t/>.
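
The difference between the two tag options can be sketched with a regular expression. This is only an illustration of the described behaviour, assuming tags appear in the segment text as ordinary HTML/XML markup; the pattern and the function names are hypothetical and the real program may handle tags differently.

```python
import re

# Matches an opening, closing or self-closing tag such as <b>, </b> or <br/>.
TAG_RE = re.compile(r"<(/?)[A-Za-z][^<>]*?(/?)>")


def remove_tags(segment):
    """Delete every tag, in the spirit of --noTags."""
    return TAG_RE.sub("", segment)


def simplify_tags(segment):
    """Replace every tag with <t>, </t> or <t/>, in the spirit of --simpleTags."""
    def replacement(match):
        if match.group(1):   # leading slash: closing tag
            return "</t>"
        if match.group(2):   # trailing slash: self-closing tag
            return "<t/>"
        return "<t>"
    return TAG_RE.sub(replacement, segment)


segment = 'Click <b class="x">here</b><br/>'
print(remove_tags(segment))
print(simplify_tags(segment))
```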

The option --noEntities replaces HTML/XML entities with their corresponding characters.
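
As an illustration of what replacing entities means, the Python standard library function html.unescape performs this kind of conversion. Whether the program uses this exact function is an assumption; the example string is made up.

```python
import html

# Named and numeric entities are turned into the characters they stand for.
raw = "Tom &amp; Jerry &lt;3 caf&eacute; &#169;"
clean = html.unescape(raw)
print(clean)
```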

The option --fixencoding tries to repair any character encoding errors that may exist in the file.
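
A typical encoding error is text that was stored as UTF-8 but read as Latin-1, which turns accented letters into pairs of odd characters. The sketch below repairs only that specific case and is not necessarily what --fixencoding does internally; the function name is hypothetical.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1, when possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or already correct): leave unchanged.
        return text


# "Espa\u00c3\u00b1a" is how "España" looks after the wrong decoding.
broken = "Espa\u00c3\u00b1a"
print(fix_mojibake(broken))
print(fix_mojibake("Espa\u00f1a"))  # already correct, returned as is
```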

Now we can write the command, for example:

python MTUOC-TMX2tabtxt.py -i gnome.tmx -o gnome-eng-spa.txt -s en en-GB eng -t es es-ES spa --noTags --noEntities --fixencoding
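
Once the file has been created, it can be read back for further processing. The following sketch assumes that each line of the output holds the source segment and the target segment separated by a single tab and that the file is encoded in UTF-8; the function name is hypothetical and the sample lines are made up.

```python
import csv
import io


def read_tab_pairs(handle):
    """Return (source, target) tuples from a tab-separated text stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Keep only well-formed rows with exactly two columns.
    return [(row[0], row[1]) for row in reader if len(row) == 2]


# In-memory stand-in for a converted file.
sample = io.StringIO("Open file\tAbrir archivo\nClose\tCerrar\n")
pairs = read_tab_pairs(sample)
print(pairs)
```

With a real file, the handle would come from open("gnome-eng-spa.txt", encoding="utf-8", newline="").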

We can also use the GUI version, MTUOC-TMX2tabtxt-GUI.py or MTUOC-TMX2tabtxt-GUI.exe.

We can also use the DIR version, which processes all TMX files in a directory:

python MTUOC-TMX2tabtxtDIR.py -d directorio -o corpus-eng-spa.txt -s en en-GB eng -t es es-ES spa --noTags --noEntities --fixencoding

or the version with graphical interface, MTUOC-TMX2tabtxtDIR-GUI.py or MTUOC-TMX2tabtxtDIR-GUI.exe.
