Tutorial: Search for translated segments in comparable corpora

1. Introduction

So far we have aligned parallel documents, that is, documents in which most segments of the document in one language also appear in the document in the other language. The programs we have seen can detect missing segments and even look for alignments with ratios other than 1:1.

In this section we will see a technique that allows us to check whether two non-parallel documents contain segments that are translation equivalents of each other. This is possible thanks to so-called sentence embeddings, a concept similar to word embeddings, except that instead of representing a word with a vector, they represent a whole sentence. For this task we will use multilingual embedding models, which can represent sentences in different languages in the same vector space. This way, sentences that are close in the vector space have a high chance of being translation equivalents.
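As a minimal sketch of this idea (assuming the sentence_transformers library is installed; the model name below is just one example of a multilingual model), we can embed a sentence, its translation and an unrelated sentence, and check that the translation pair gets much closer vectors:

from sentence_transformers import SentenceTransformer, util

# One example of a multilingual sentence embedding model
model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = [
    "The patient was discharged after two days.",    # English
    "El paciente fue dado de alta a los dos días.",  # Spanish translation
    "The weather is nice today.",                    # unrelated English
]
embeddings = model.encode(sentences)

# Cosine similarity: the translation pair should score much higher
print(util.cos_sim(embeddings[0], embeddings[1]))  # close vectors
print(util.cos_sim(embeddings[0], embeddings[2]))  # distant vectors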

2. Recommended reading

Schwenk, H., Wenzek, G., Edunov, S., Grave, É., Joulin, A., & Fan, A. (2021, August). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 6490-6500).

3. Procedure for searching for translated segments

The steps to search for translation-equivalent segments in non-parallel corpora are as follows:

  • We have a file with segments in one language (L1) and another file with segments in another language (L2).
  • We represent the segments of each file using a multilingual sentence embedding model.
  • We compare the vectors of the L1 segments with the vector representations of all the L2 segments. If a segment in L1 has an L2 segment with sufficiently similar vectors, the two segments are probably translation equivalents (see the sketch after this list).
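The following is a minimal sketch of this comparison step, not the actual implementation of the MTUOC programs. It assumes sentence_transformers and faiss-cpu are installed, uses LaBSE as an example model, and uses the file names that appear later in this tutorial:

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

# Read the L1 and L2 segments, one per line
with open("text-uniq-en.txt", encoding="utf-8") as f:
    l1 = [line.strip() for line in f if line.strip()]
with open("text-uniq-es.txt", encoding="utf-8") as f:
    l2 = [line.strip() for line in f if line.strip()]

# Represent every segment as a normalized vector
emb1 = model.encode(l1, normalize_embeddings=True)
emb2 = model.encode(l2, normalize_embeddings=True)

# Index the L2 vectors; with normalized vectors, inner product = cosine
index = faiss.IndexFlatIP(emb2.shape[1])
index.add(emb2)

# For each L1 segment, find its nearest L2 neighbour
scores, ids = index.search(emb1, 1)
for i in range(len(l1)):
    score, j = scores[i][0], ids[i][0]
    if score > 0.8:  # example threshold, to be tuned by inspection
        print(f"{l1[i]}\t{l2[j]}\t{score:.4f}")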

This process can be done with the programs MTUOC-bitext_mining.py and MTUOC-bitext_mining-GPU.py from the repository https://github.com/mtuoc/MTUOC-aligner. Both programs are the same, except that the second one can use a GPU. Computing the vector representations and searching for similar vectors is computationally expensive, so it is often necessary to use a GPU.

To test these algorithms we will use the result of converting the Medline download to text with the generic program, which, if you remember, created two text files: one with all the segments in English and the other with all the segments in Spanish. You can download these files from the following links:

http://lpg.uoc.edu/seminarioTAN/semana_3/text-en.txt

http://lpg.uoc.edu/seminarioTAN/semana_3/text-es.txt

We can write:

wget http://lpg.uoc.edu/seminarioTAN/semana_3/text-en.txt
wget http://lpg.uoc.edu/seminarioTAN/semana_3/text-es.txt

We will delete repeated segments from these files:

sort -u text-en.txt | shuf > text-uniq-en.txt
sort -u text-es.txt | shuf > text-uniq-es.txt

And now we can run the program as follows:

python3 MTUOC-bitext_mining.py text-uniq-en.txt text-uniq-es.txt aliSBERT-uniq-brut-en-es.txt

(you may be missing some prerequisites; if you get an error, install them: sentence_transformers, and faiss-cpu if you do not have a GPU)
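If you need to install them, something like the following should work (these are the package names on PyPI):

pip install sentence-transformers faiss-cpu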

The process without a GPU is rather slow, so be patient.
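If you do have a GPU, the GPU version can presumably be invoked with the same arguments, since the two programs differ only in GPU support:

python3 MTUOC-bitext_mining-GPU.py text-uniq-en.txt text-uniq-es.txt aliSBERT-uniq-brut-en-es.txt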

You can download the result directly from: http://lpg.uoc.edu/seminarioTAN/week_4/aliSBERT-uniq-brut-eng-spa.txt

Each line of the alignment file consists of the L1 segment, a tab, the L2 segment, a tab, and a confidence score. The results are sorted by this confidence score.

It is important to do a visual inspection here. Open the alignment file with a text editor and scroll down until the results become bad, to determine the lowest score above which we will keep the results. Then you can use the selectAlignmentsFile.py program to select the segments whose score is higher than the one you found.
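We do not detail the options of selectAlignmentsFile.py here; if you prefer, the same selection can be done with a few lines of Python (the file names match the ones above, and the threshold of 1.2 is only an example; use the value you found in your inspection):

# Keep only alignments whose confidence score passes the threshold
threshold = 1.2  # example value; set it from your visual inspection
with open("aliSBERT-uniq-brut-en-es.txt", encoding="utf-8") as inp, \
     open("aliSBERT-uniq-filtered-en-es.txt", "w", encoding="utf-8") as out:
    for line in inp:
        fields = line.rstrip("\n").split("\t")
        # expected format: L1 segment, L2 segment, confidence score
        if len(fields) == 3 and float(fields[2]) >= threshold:
            out.write(line)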

Note also that there are L1 - L2 alignments that are in fact English - English. This happens because some English segments appear in the Spanish texts. For now, consider them good alignments. In week 6 we will learn to clean corpora and to check the language of the segments automatically.