Tutorial: Automatic alignment of documents with Hunalign - mtuoc/tutorials GitHub Wiki
In this tutorial we will learn to align documents with Hunalign, an old but still widely used program for building parallel corpora. It is a command-line program, run from the Terminal, which gives us finer control over the whole process and lets us align a large number of documents at once.
D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005). Parallel corpora for medium density languages. In Proceedings of the RANLP 2005, pages 590-596.
In the previous section we learned how to use LF-Aligner, a program based on hunalign. In this section we will learn to use hunalign directly, which will allow us to align a large number of documents in a single step. Unlike LF-Aligner, however, hunalign only performs the alignment itself, so we will have to carry out the remaining steps ourselves. The generic steps for document alignment are:
- Conversion of files to text.
- Segmentation of text files. In general we want an alignment at the sentence or segment level, so the text of the document, which is usually organized into paragraphs, must be divided into segments or sentences.
- Alignment itself, which we will perform with hunalign. The program will link the segments of the original file to the segments of the translated file.
- Selection of segments, based on a quality index that hunalign provides for each pair of segments.
- Conversion of alignment to required format. At the end of the process we will have a tabulated text file that we may need to convert into some other format, such as TMX.
You can align any files you want, but for this test I propose the same files that we aligned with LF-Aligner: N2130752-eng.docx and N2130755-spa.docx.
### Step 1. Conversion of files to text
You can do this first step with any program you prefer. The MTUOC toolkit includes a program that converts a large number of formats to text: MTUOC-any2text.
All details on how to use this tool can be found in its Wiki, so here I will simply reproduce the instructions that we have to use to perform the conversion. Before running the programs do not forget to install the prerequisites.
To convert the file N2130752-eng.docx to text we will write in Terminal:
python3 MTUOC-any2text.py -i N2130752-eng.docx
And to convert the file N2130755-spa.docx we will write:
python3 MTUOC-any2text.py -i N2130755-spa.docx
You will see that text files have been created by adding the .txt extension to the original file names.
### Step 2. Files segmentation
This step can also be done with many programs, but I suggest using https://github.com/mtuoc/MTUOC-segmenter. Here we will use a segmenter based on an SRX (Segmentation Rules eXchange) file. The program provides a segment.srx file, but you can use any other SRX file. Open segment.srx with a suitable text editor and look at its content. You will see that a number of languages have specific rules. If your working languages are not included in this SRX file, you can use the Generic language, look for another SRX file, or take a look at the More about segmentation section later this week.
The explanation of the program can be found on its Wiki, so here we will just provide the necessary instructions. Again, do not forget to install the prerequisites as previously explained.
To segment the file N2130752-eng.docx.txt we will write in terminal:
python3 MTUOC-segmenter.py -i N2130752-eng.docx.txt -o N2130752-seg-eng.docx.txt -s segment.srx -l English -p
And to segment the file N2130755-spa.docx.txt we will write:
python3 MTUOC-segmenter.py -i N2130755-spa.docx.txt -o N2130755-seg-spa.docx.txt -s segment.srx -l Spanish -p
Look at the segmented files (with more or by opening them in a text editor). Note that we used the -p option, which adds a paragraph mark (<p>) at each paragraph break. This information is useful for hunalign.
### Step 3. Alignment
The Hunalign website provides all the details of how to use this tool. You can also download the hunalign binaries for Linux, Windows and Mac from https://github.com/mtuoc/hunalign. These files are:
- For Linux: hunalign
- For Windows: hunalign.exe, as well as msvcp100.dll and msvcr100.dll
- For Mac: hunalignMAC (you may have to rename it to hunalign, or just remember to run the program with the correct name)
In order to align with hunalign, a bilingual dictionary must be available in the appropriate format. You can obtain dictionaries, or create them yourself, from the repository https://github.com/aoliverg/hunapertium or with the MUSE2Hunalign.py program distributed with MTUOC-aligner. Although dictionaries improve the accuracy of the alignment, if none is available for the working language pair you can work with an empty file (MTUOC-aligner is distributed with the empty dictionary null.dic). On Unix you can create an empty dictionary by writing:
touch null.dic
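On systems without `touch` (for example Windows), the same empty file can be created from Python. The minimal sketch below also writes a tiny dictionary. It assumes the `target @ source` line format described in the Hunalign documentation, and the function names, word pairs and the file name `tiny-en-es.dic` are invented for illustration.

```python
from pathlib import Path


def create_empty_dictionary(path="null.dic"):
    """Create an empty dictionary file, equivalent to `touch null.dic`."""
    dictionary = Path(path)
    dictionary.touch()
    return dictionary


def write_dictionary(pairs, path):
    """Write (source, target) pairs as one 'target @ source' item per line.

    Assumption: hunalign expects the target-language phrase first.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for source, target in pairs:
            handle.write(f"{target} @ {source}\n")
```

For an English-Spanish pair, `write_dictionary([("house", "casa")], "tiny-en-es.dic")` would write the single line `casa @ house`.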
This week's files include the dictionary hunapertium-en-es.dic. To align the two segmented files with it, we write:
./hunalign hunapertium-en-es.dic N2130752-seg-eng.docx.txt N2130755-seg-spa.docx.txt -text -utf -realign > alineacion-eng-spa.txt
In the Hunalign documentation you can find the explanation for each of these parameters. Now alineacion-eng-spa.txt contains the result of the automatic alignment. Here is a fragment:
<p>	<p>	0
I.	I.	1.8
Stocktaking of key global health and foreign policy commitments	Balance of key global health and foreign policy commitments	1.09821
Paragraph marks <p> are aligned with paragraph marks but have an index of 0, since this alignment is of no use to us. For the rest of the segments, each line contains the English segment, a tab, the Spanish segment, a tab, and the reliability index.
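To make the line format concrete, here is a minimal sketch of how one line of this output could be read in Python. The function name `parse_alignment_line` is hypothetical, and the sketch assumes that every line has exactly three tab-separated fields, as in the fragment above.

```python
def parse_alignment_line(line):
    """Split one alignment line into (source, target, score)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    source, target, score = fields
    # The reliability index is the last field and is read as a float.
    return source, target, float(score)
```

Calling `parse_alignment_line("I.\tI.\t1.8")` returns the two segments and the index as a number, so that the index can later be compared with a threshold.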
### Step 4. Selection of segments
Now what we want to do is select the pairs of segments with a higher reliability than a certain threshold, for example, 0. We can do this with the selectAlignmentsFile.py program that is distributed with [MTUOC-aligner](https://github.com/mtuoc/MTUOC-aligner). The -h option shows you the help. We can write:
`python3 selectAlignmentsFile.py -i alineacion-eng-spa.txt -o alineacion-seleccionada-eng-spa.txt -c 0`
Only pairs of segments with a reliability index above 0 will be selected. Look at the result with more or by opening it in a text editor.
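The idea behind this selection can be sketched in a few lines of Python. This is not the code of selectAlignmentsFile.py, only an illustration under two assumptions: that a pair is kept when its index is strictly greater than the threshold, and that malformed lines are silently skipped. The function name `select_alignments` is hypothetical.

```python
def select_alignments(lines, threshold=0.0):
    """Yield (source, target) pairs whose index is above the threshold."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            continue  # assumption: ignore lines that are not source/target/index
        source, target, score = fields
        try:
            value = float(score)
        except ValueError:
            continue  # assumption: ignore lines with a non-numeric index
        if value > threshold:
            yield source, target
```

With a threshold of 0, the `<p>` pairs (index 0) and any pair with a negative index are discarded.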
### Step 5. Conversion to final format
Now we have our parallel corpus in tab-separated text format, which is already suitable for training engines. This corpus may contain repeated segments; remember that we can remove them with the following instruction:
```cat alineacion-seleccionada-eng-spa.txt | sort | uniq | shuf > alineacion-unic-eng-spa.txt```
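Where `sort`, `uniq` and `shuf` are not available, the same effect (remove duplicates, then shuffle) can be approximated in Python. This is a minimal sketch that loads the whole corpus into memory. The function name `unique_shuffled` and the optional `seed` parameter are not part of the original instruction.

```python
import random


def unique_shuffled(lines, seed=None):
    """Return the distinct lines in random order."""
    unique = sorted(set(lines))  # sorting first makes a seeded shuffle reproducible
    random.Random(seed).shuffle(unique)
    return unique
```

Passing a fixed `seed` gives the same order on every run, which can be convenient when the corpus has to be rebuilt later.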
If we want to convert these unique segments to Moses format we can do:
`cut -f 1 alineacion-unic-eng-spa.txt > alineacion-unic.en-es.en`

`cut -f 2 alineacion-unic-eng-spa.txt > alineacion-unic.en-es.es`
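The same split into two line-aligned files can be sketched in Python. Unlike `cut`, this version skips lines that do not contain at least two tab-separated fields, so that both output files keep the same number of lines. The function name `split_tabbed_corpus` is hypothetical.

```python
def split_tabbed_corpus(tab_path, source_path, target_path):
    """Write column 1 to source_path and column 2 to target_path.

    Returns the number of segment pairs written.
    """
    written = 0
    with open(tab_path, encoding="utf-8") as tab, \
            open(source_path, "w", encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as tgt:
        for line in tab:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 2:
                continue  # assumption: drop lines without a target segment
            src.write(fields[0] + "\n")
            tgt.write(fields[1] + "\n")
            written += 1
    return written
```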
If the goal is not to train engines but to use this parallel corpus as a translation memory in our computer-assisted translation tools, we may want to convert the alignment into TMX format. For this we can use the program [MTUOC-tabtxt2TMX.py](https://github.com/mtuoc/MTUOC-tabtxt2TMX). The -h option shows the help:
    python3 MTUOC-tabtxt2TMX.py -h
    usage: MTUOC-tabtxt2TMX.py [-h] -i FENTRADA -o FSORTIDA -s L1 -t L2

    MTUOC-tabtxt2TMX: A script to convert a parallel corpus in tabbed text into a TMX file.

    options:
      -h, --help            show this help message and exit
      -i FENTRADA, --input FENTRADA
                            The input file to convert.
      -o FSORTIDA, --output FSORTIDA
                            Fix some issues in PDF conversion.
      -s L1, --L1code L1    The language code for the source language.
      -t L2, --L2code L2    The language code for the target language.
To convert the file alineacion-unic-eng-spa.txt to TMX we can write:
`python3 MTUOC-tabtxt2TMX.py -i alineacion-unic-eng-spa.txt -o alineacion-eng-spa.tmx -s en-US -t es-ES`
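To see what such a conversion involves, here is a minimal sketch that builds a TMX 1.4 document from segment pairs with the Python standard library. It is not the code of MTUOC-tabtxt2TMX.py. The function name `build_tmx` and the header values (`creationtool`, `o-tmf` and so on) are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, source_lang, target_lang):
    """Build a TMX tree with one translation unit per (source, target) pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",  # placeholder value
        "creationtoolversion": "0.1",      # placeholder value
        "segtype": "sentence",
        "o-tmf": "tabbed-text",            # placeholder value
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, source), (target_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)
```

The resulting tree can be saved with `build_tmx(pairs, "en-US", "es-ES").write("example.tmx", encoding="utf-8", xml_declaration=True)`, where `example.tmx` is an arbitrary file name.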
# 5. Alignment of multiple documents with hunalign
The advantage of using hunalign over LF-Aligner, besides having more control over all parameters, is the ability to align hundreds or thousands of document pairs very quickly. In this section we will learn to align 260 pairs of docx documents that you can find in the following compressed files:
* [2023-en.zip](https://github.com/mtuoc/tutorials/blob/main/Hunalign/2023-en.zip)
* [2023-es.zip](https://github.com/mtuoc/tutorials/blob/main/Hunalign/2023-es.zip)
Download and decompress these files.
The programs we have presented for the different steps of aligning two documents also exist in a DIR version, which processes all the files in a given directory.
**VERY IMPORTANT:** In order to automatically align multiple files, the names of the source and target language files:
* must be exactly the same, for example file1.txt and file1.txt (since they are in different directories, this causes no problem);
* or must be the same except for a code at the end that indicates the language of the file, for example file1-en.txt and file1-es.txt.
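The naming rule can be checked before starting with a short Python sketch. The function names and the suffix-replacement logic are assumptions made for illustration; the MTUOC tools may match names differently.

```python
def target_name_for(source_name, source_suffix="", target_suffix=""):
    """Return the target file name expected for a given source file name."""
    if source_suffix and source_name.endswith(source_suffix):
        return source_name[: -len(source_suffix)] + target_suffix
    return source_name


def match_names(source_names, target_names, source_suffix="", target_suffix=""):
    """Return (pairs, unmatched) for two lists of file names."""
    available = set(target_names)
    pairs, unmatched = [], []
    for name in sorted(source_names):
        candidate = target_name_for(name, source_suffix, target_suffix)
        if candidate in available:
            pairs.append((name, candidate))
        else:
            unmatched.append(name)
    return pairs, unmatched
```

With identical names no suffixes are needed. With language codes, `match_names(en_names, es_names, "-en.txt", "-es.txt")` pairs file1-en.txt with file1-es.txt and lists the source files that have no counterpart.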
We will now see all the steps to align multiple pairs of documents:
### Step 1. Conversion of files to text
We will use the DIR version of MTUOC-any2text, MTUOC-any2textDIR.py. You can see the instructions for use with the -h option.
`python3 MTUOC-any2textDIR.py -i 2023-en/ -o 2023-txt-en`

`python3 MTUOC-any2textDIR.py -i 2023-es/ -o 2023-txt-es`
In directories 2023-txt-en and 2023-txt-es we have files converted to text.
### Step 2. Files segmentation
We will use the DIR version of MTUOC-segmenter, MTUOC-segmenterDIR.py. You can see the instructions for use with the -h option.
`python3 MTUOC-segmenterDIR.py -i 2023-txt-en/ -o 2023-seg-en -s segment.srx -l English -p`

`python3 MTUOC-segmenterDIR.py -i 2023-txt-es/ -o 2023-seg-es -s segment.srx -l Spanish -p`
In the 2023-seg-en and 2023-seg-es directories we have the segmented text files.
### Step 3. Alignment
The alignment step with hunalign will be carried out using the batch mode offered by the program. Hunalign can be given a batch file in which each line contains three tab-separated fields: the segmented file in the source language, the segmented file in the target language, and the file that will contain the alignment.
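The batch file format just described is simple enough to generate by hand. The sketch below builds its lines for files that have the same name in both directories. The function name `batch_lines` is hypothetical, and reusing the source file name for the alignment file is an assumption.

```python
from pathlib import Path


def batch_lines(names, source_dir, target_dir, align_dir):
    """Return one 'source<TAB>target<TAB>alignment' line per file name."""
    lines = []
    for name in names:
        fields = (
            Path(source_dir) / name,  # segmented source-language file
            Path(target_dir) / name,  # segmented target-language file
            Path(align_dir) / name,   # file where the alignment will be written
        )
        lines.append("\t".join(str(field) for field in fields))
    return lines
```

Writing `"\n".join(batch_lines(...))` to a text file gives a batch file in the three-column layout described above.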
[MTUOC-aligner](https://github.com/mtuoc/MTUOC-aligner) provides a program, MTUOC-create-batchfile.py, which automatically creates this batch file. The -h option shows the following help:
    python3 MTUOC-create-batchfile.py -h
    usage: MTUOC-create-batchfile.py [-h] --dirSL DIRSL --dirTL DIRTL --dirALI DIRALI --batchfile BATCHFILE [--r1 R1] [--r2 R2]

    A script to create the batch file to be used with hunalign.

    options:
      -h, --help            show this help message and exit
      --dirSL DIRSL         The input dir containing the segmented text files for the source language.
      --dirTL DIRTL         The input dir containing the segmented text files for the target language.
      --dirALI DIRALI       The output dir to save the aligned files.
      --batchfile BATCHFILE
                            The name of the alignment script.
      --r1 R1               The first string for name replacement.
      --r2 R2               The second string for name replacement.
Look carefully at the files we are going to align: the file names are exactly the same for the source language and the target language. If we execute the following command:
`python3 MTUOC-create-batchfile.py --dirSL 2023-seg-en/ --dirTL 2023-seg-es --dirALI 2023-ali-en-es --batchfile batchfile.txt`
The program will create the batch file called batchfile.txt. It also prints the following information on screen:
*** como-zimbabue-esta-construyendo-un-estado-de-vigilancia-del-gran-hermano.txt.docx.txt
*** en-turquia-tribunal-condena-a-prision-a-popular-alcalde.txt.docx.txt
*** este-juego-en-linea-expone-los-peligros-de-la-mineria-en-alta-mar.txt.docx.txt
This indicates that these source-language files have no match in the target language.
If the files have language codes at the end of the name, for example if all the source-language files end in "-en.txt" and all the target-language files in "-es.txt", we would run the program with the --r1 and --r2 options, as follows:
`python3 MTUOC-create-batchfile.py --dirSL 2023-seg-en/ --dirTL 2023-seg-es --dirALI 2023-ali-en-es --batchfile batchfile.txt --r1 en.txt --r2 es.txt`
Look at the contents of batchfile.txt.
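Besides inspecting the batch file by eye, a short Python check can confirm that every line has three tab-separated fields and that both input files exist. This is only a sketch; the function name `check_batch_file` and the wording of its messages are invented here.

```python
import os


def check_batch_file(path):
    """Return a list of problems found in a hunalign batch file."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != 3:
                problems.append(
                    f"line {number}: expected 3 tab-separated fields, found {len(fields)}"
                )
                continue
            # Only the two input files must already exist; the third is output.
            for field in fields[:2]:
                if not os.path.isfile(field):
                    problems.append(f"line {number}: missing input file {field}")
    return problems
```

An empty list means that the batch file has the expected layout and that all input files were found.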
Now we can align all files with a single instruction:
`./hunalign -batch hunapertium-en-es.dic -text -utf -realign batchfile.txt`
In the 2023-ali-en-es directory we will have all the aligned files.
### Step 4. Selection of segments
The selection of segments will be done with the selectAlignmentsDir.py program, as follows (remember that the -h option shows the program's help).
`python3 selectAlignmentsDir.py -i 2023-ali-en-es/ -o alineaciones-seleccionadas-eng-spa.txt -c 0`
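Conceptually, this step applies the same threshold to every alignment file in the directory and gathers the selected pairs in one output file. The sketch below illustrates that idea; it is not the code of selectAlignmentsDir.py. It assumes that the output contains only the two segments, without the index, and that the output file is written outside the input directory. The function name `select_from_directory` is hypothetical.

```python
from pathlib import Path


def select_from_directory(align_dir, output_path, threshold=0.0):
    """Collect pairs above the threshold from all files in align_dir.

    Returns the number of pairs written to output_path.
    """
    kept = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(align_dir).iterdir()):
            if not path.is_file():
                continue
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    fields = line.rstrip("\r\n").split("\t")
                    if len(fields) != 3:
                        continue  # assumption: skip malformed lines
                    try:
                        score = float(fields[2])
                    except ValueError:
                        continue
                    if score > threshold:
                        out.write(fields[0] + "\t" + fields[1] + "\n")
                        kept += 1
    return kept
```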
### Step 5. Conversion to final format
In this step nothing changes regarding the alignment of two documents, so you can repeat the actions previously explained.