Tutorial: creation of comparable corpora from Wikipedia
1. Introduction
This tutorial is dedicated to learning how to create comparable corpora from Wikipedia. Wikipedia can be a very interesting resource, as the encyclopedia is available in many languages and covers a wide range of subjects.
2. Recommended reading
In this tutorial I recommend reading an article that explains how the WikiMatrix corpus was created, as it is closely related to what we are going to do.
3. Wikipedia
Wikipedia is a collaborative encyclopedia available in some 300 different languages. The following should be taken into account about Wikipedia:
- Wikipedias in each language are independent. Even if two Wikipedias have an article on the same subject, that does not mean the articles are translations of each other. Articles are produced independently, although in practice many of them do start out as translations of the article in another language. Therefore, articles on the same subject in two languages cannot be expected to be translations.
- Wikipedia articles carry interlanguage links, so from an article in one language you can go directly to the article on the same subject in another language (see the sketch after this list).
- Articles are classified by topic using categories. Categories are in principle free, but there is a set of widely used categories that are generally shared, and these can also be related to areas of knowledge. Check: Wikipedia:Contents/Categories.
- There are projects that convert Wikipedia content into databases, for example DBpedia.
- All Wikipedia content can be downloaded as dumps. These are very voluminous files; in the second part of this tutorial we will learn to work with them.
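As a small illustration of the interlanguage links mentioned above, the following sketch queries the public MediaWiki API for the links of an example article (Medicine on the English Wikipedia). It assumes the Python requests package is installed and is only meant to show where the cross-lingual information lives; it is not needed for the rest of the tutorial.
import requests

# Ask the English Wikipedia for the interlanguage links of the article "Medicine".
params = {
    "action": "query",
    "titles": "Medicine",
    "prop": "langlinks",
    "lllimit": "500",
    "format": "json",
}
response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
pages = response.json()["query"]["pages"]
for page in pages.values():
    for link in page.get("langlinks", []):
        # "lang" is the Wikipedia language code, "*" the title in that language.
        print(link["lang"], link["*"])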
4. Creation of comparable corpora with CCWikipedia
For this part of the activity we will use https://github.com/aoliverg/CCWikipedia. This application is explained in detail in its wiki. As you can see, there is a version with a graphical interface. You can use it, but I advise you to use the terminal version, since it can be used even on servers without a graphical interface.
Remember that before starting you must obtain the database CPfromWiki.sqlite from http://lpg.uoc.edu/smarterp/CPfromWiki.sqlite. Do not create it yourself, even though you could! To download it, run:
wget http://lpg.uoc.edu/smarterp/CPfromWiki.sqlite
Now we can get CCWikipedia by cloning the repository:
git clone https://github.com/aoliverg/CCWikipedia.git
We will use the createCCWCorpus.py program, which offers the -h option to show its help:
python3 createCCWCorpus.py -h
usage: createCCWCorpus.py [-h] -d FILENAME -c CATEGORIA --level LEVEL --lang LANG -o OUTDIR [-a ARTICLELIST]
Script for the creation of parallel corpora from Wikipedia
options:
-h, --help show this help message and exit
-d FILENAME, --database FILENAME
The CCW sqlite database to use.
-c CATEGORIA, --categories CATEGORIA
The categories to search for (a category or a list of categories separated by ,
--level LEVEL The category level depth.
--lang LANG The language (two letter ISO code used in Wikipedia.
-o OUTDIR, --output OUTDIR
The name of the sqlite database to be created.
-a ARTICLELIST, --articlelist ARTICLELIST
The name of the text file containing the list of files.
If we want to download the medical articles in English, going up to two category levels deep, we can write:
python3 createCCWCorpus.py -d CPfromWiki.sqlite -c Medicine --level 2 --lang en -o medicine-eng -a medicine-eng.txt
Important: the output directories must exist, so we have to create them first:
mkdir medicine-eng
mkdir medicine-spa
The list of articles (which will also be saved in the medicine-eng.txt file) and the total number of articles will appear on the screen. To confirm that we want to download them, type Y.
...
Desmethylchlorotrianisene
Alpha-Hydroxyetizolam
Cannabielsoin
Deuterated drug
TOTAL PAGES 13554
Download? (Y/N)
And now for Spanish (note that the category or categories must be given in English):
python3 createCCWCorpus.py -d CPfromWiki.sqlite -c Medicine --level 2 --lang es -o medicine-spa -a medicine-spa.txt
TOTAL CATEGORIES 350
TOTAL PAGES 3707
Download? (Y/N)
Download these articles or those corresponding to the languages and topics that interest you most.
Once downloaded, we will segment all the files in the directories corresponding to the source and target languages:
python3 MTUOC-segmenterDIR.py -i medicine-eng -o medicine-seg-eng -s segment.srx -l English
python3 MTUOC-segmenterDIR.py -i medicine-spa -o medicine-seg-spa -s segment.srx -l Spanish
Now we concatenate all the segments of each language, remove duplicates and shuffle:
cat ./medicine-seg-eng/* | sort | uniq | shuf > medicine-uniq-eng.txt
cat ./medicine-seg-spa/* | sort | uniq | shuf > medicine-uniq-spa.txt
We can count the segments obtained:
wc -l medicine-uniq-*
742062 medicine-uniq-eng.txt
185324 medicine-uniq-spa.txt
And now we can "align" them, or rather, search for pairs of segments that are likely translation equivalents.
If we have a GPU on our computer, we run:
python3 MTUOC-bitext_mining-GPU.py medicine-uniq-eng.txt medicine-uniq-spa.txt medicine-aligned-brut-eng-spa.txt
If we do not have a GPU, we run the following instead (but remember that the process can be very slow):
python3 MTUOC-bitext_mining.py medicine-uniq-eng.txt medicine-uniq-spa.txt medicine-aligned-brut-eng-spa.txt
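To give an idea of what this kind of bitext mining does, the following sketch uses the sentence-transformers package with the multilingual LaBSE model: both monolingual files are embedded and, for each English segment, the most similar Spanish segment above an arbitrary threshold is printed. The MTUOC scripts may be implemented differently, so treat this only as an illustration of the general technique.
from sentence_transformers import SentenceTransformer, util

# LaBSE gives language-agnostic sentence embeddings, which is what makes
# cross-lingual similarity search possible.
model = SentenceTransformer("sentence-transformers/LaBSE")

with open("medicine-uniq-eng.txt", encoding="utf-8") as f:
    eng = [line.strip() for line in f if line.strip()][:1000]  # small slice for the demo
with open("medicine-uniq-spa.txt", encoding="utf-8") as f:
    spa = [line.strip() for line in f if line.strip()][:1000]

emb_eng = model.encode(eng, convert_to_tensor=True, normalize_embeddings=True)
emb_spa = model.encode(spa, convert_to_tensor=True, normalize_embeddings=True)

# For every English segment keep the most similar Spanish one above a threshold.
similarities = util.cos_sim(emb_eng, emb_spa)
for i, row in enumerate(similarities):
    score, j = row.max(dim=0)
    if score.item() > 0.8:  # arbitrary confidence cut-off
        print(f"{score.item():.3f}\t{eng[i]}\t{spa[int(j)]}")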
You can download the results of the processes from the following links:
- Articles in English: http://lpg.uoc.edu/seminarioTAN/semana_5/medicine-eng.zip
- Articles in Spanish: http://lpg.uoc.edu/seminarioTAN/semana_5/medicine-spa.zip
- Segmented articles in English: http://lpg.uoc.edu/seminarioTAN/semana_5/medicine-seg-eng.zip
- Segmented articles in Spanish: http://lpg.uoc.edu/seminarioTAN/semana_5/medicine-seg-spa.zip
- Result of the alignment: http://lpg.uoc.edu/seminarioTAN/week_5/medicine-aligned-brut-eng-spa.txt
If you look at the result of the alignment, you will see that the first positions (remember that the alignment file is sorted with the highest-confidence alignments first) contain English-English alignments. This is because the Spanish articles also contain some English segments, which therefore get aligned. We will be able to filter this out automatically using corpus cleaning techniques, which we will see next week.
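As a rough preview of that filtering (the proper cleaning techniques come next week), the sketch below drops pairs whose "Spanish" side is detected as English. It assumes, and you should verify this against your own file, that each line of the alignment output is tab-separated as score, English segment, Spanish segment, and that the langdetect package is installed.
from langdetect import detect

with open("medicine-aligned-brut-eng-spa.txt", encoding="utf-8") as inp, \
     open("medicine-aligned-filtered-eng-spa.txt", "w", encoding="utf-8") as out:
    for line in inp:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed format
        score, eng_segment, spa_segment = fields
        try:
            # Drop the English-English alignments mentioned above.
            if detect(spa_segment) == "en":
                continue
        except Exception:
            continue  # very short or unanalysable segments
        out.write(line)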
5. Direct use of Wikipedia dumps
In the previous section we explained how to create comparable corpora from Wikipedia using a specific programme. The programme uses a large database to know which articles to download, but the articles are downloaded directly from Wikipedia. This causes massive queries to be made to the Wikipedia website, which in the end can create a problem for Wikipedia itself. To avoid such massive queries, you can work directly with Wikipedia dumps, which are very large files containing all Wikipedia articles in a given language.
Dumps for Wikipedia, and all other Wikimedia projects, can be downloaded from https://dumps.wikimedia.org/backup-index.html.
For example, if we want to download the English wikipedia we search in this page ‘enwiki’ and follow the link. We can do the same with the Spanish wikipedia by searching for ‘eswiki’. The problem is that these files are HUGE and therefore the download takes a long time and we are going to use a lot of disk space.
To make this part more agile, let's practice with two languages that have smaller Wikipedias. Specifically, I propose to work right now with the Wikipedia in Asturian and Occitan. Everything we do will be exactly the same as for larger Wikipedias, such as the English and Spanish ones. But the test will be much more agile with small Wikipedias.
We download the Wikipedias:
wget https://dumps.wikimedia.org/astwiki/20240501/astwiki-20240501-pages-articles.xml.bz2
wget https://dumps.wikimedia.org/ocwiki/20240501/ocwiki-20240501-pages-articles.xml.bz2
The downloaded files are compressed in bz2. DO NOT DECOMPRESS THEM! We'll work directly with the compressed files. This is especially important when working with large Wikipedias. For example, we can view the content with bzmore or bzcat, instead of more and cat.
Observe the contents of these files. As you can see, they are XML files that contain various metadata about the articles as well as the text of the articles themselves, which is in wiki markup.
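For instance, a quick way to peek at the beginning of the compressed Asturian dump from Python, without decompressing it, is the following sketch (it only uses the standard library):
import bz2

# Print the first lines of the compressed dump: the XML header (<siteinfo>)
# comes first, followed by <page> elements containing <title>, <revision> and <text>.
with bz2.open("astwiki-20240501-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
    for number, line in enumerate(dump):
        print(line.rstrip())
        if number >= 100:  # the file is large, stop after a few lines
            break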
5.1. Dump conversion into text files
Once we have the dumps, we are interested in extracting a text file for each of the articles. We may want to extract the texts of all articles, or we may want to limit ourselves to a number of categories.
In the repository https://github.com/aoliverg/dumpsWikipedia you have available a series of scripts that will facilitate this task.
The wikipedia2text.py script, which has the -h option showing the help, allows us to convert a wikipedia dump into text files, one for each article:
python3 wikipedia2text.py -h
usage: wikipedia2text.py [-h] -d DUMP_PATH -l LANGUAGE -o OUTDIR [-c CATEGORIES] [-t TITLESFILE]
Script to convert Wikipedia dumps to text files according to a set of categories
options:
-h, --help show this help message and exit
-d DUMP_PATH, --dump DUMP_PATH
The wikipedia dump.
-l LANGUAGE, --language LANGUAGE
The language code (en, es, fr ...).
-o OUTDIR, --outdir OUTDIR
The output directory.
-c CATEGORIES, --categories CATEGORIES
A file with one category per line.
-t TITLESFILE, --titlesfile TITLESFILE
A file where the converted article titles will be stored. By default titles-list.txt.
If we want to convert the whole Wikipedia dump (adapt the name of the dump to the one you have downloaded):
python3 wikipedia2text.py -d astwiki-20240501-pages-articles.xml.bz2 -l ast -o wikipedia-ast/
The output directory, wikipedia-ast in the example, will be created if it does not exist and will contain a text file for each article. The titles of the converted articles will be saved in the file titles-list.txt; we can specify another name for this file with the -t option.
In many cases, however, we will want to limit the articles converted to text to a set of categories. To do this we can create a text file, for example categorias.txt, containing one category per line. The category names have to be in the language of the dump we are processing. For now, create a text file categorias.txt containing, for example, Medicina (if you are working with the Asturian dump). In the next section we will see how to explore the categories of a given language. To limit the conversion to articles in the Medicina category (or whichever categories you have in your categorias.txt file), you can write:
python3 wikipedia2text.py -d astwiki-20240501-pages-articles.xml.bz2 -l ast -o wikipedia-medicina-ast/ -c categorias.txt
5.2. Category exploration
In order to know which categories to put in the categories file and thus limit the generated files, we have several options:
- Manual option I. The English Wikipedia offers a page with categories organised by academic discipline: https://en.wikipedia.org/wiki/Outline_of_academic_disciplines. If you look at the top, this same page is available in other languages. If the language you are interested in is among these languages, you can get interesting information on the categories you are interested in there.
- Manual option II. You can search for articles related to the topics you are interested in, and look at the bottom to see which categories these pages are related to. If you click on that category, a category page will open that often contains information about subcategories. Exploring all this you can generate the category file.
- Automatic option: in the same repository we have the exploreCategories.py program, which lets us explore the categories of the articles belonging to one or more categories of interest. This program also has the -h option that shows the help:
python3 exploreCategories.py -h
usage: exploreCategories.py [-h] -d DUMP_PATH -l LANGUAGE -c CATEGORY -o OUTFILE [--limit LIMIT]
Script to explore categories from a Wikipedia dump
options:
-h, --help show this help message and exit
-d DUMP_PATH, --dump DUMP_PATH
The wikipedia dump.
-l LANGUAGE, --language LANGUAGE
The language code (en, es, fr ...).
-c CATEGORY, --categories CATEGORY
A category or a list of categories separated by :.
-o OUTFILE, --outfile OUTFILE
The output directory.
--limit LIMIT The limit in number of articles found.
For example, if we want to see the categories associated with Medicine, we can write:
python3 exploreCategories.py -d astwiki-20240501-pages-articles.xml.bz2 -l ast -c Medicina --limit 100 -o categorias-medicina-ast.txt
The process is quite slow, so you can use the --limit option, which stops the process once it has found the given number of pages with the given category (100 in the example). If no limit is indicated, the programme scans the entire dump. The programme will create the file categorias-medicina-ast.txt containing the list of related categories in descending order of frequency of occurrence, for example:
Medicine
Medical specialities
Biology
Anatomy
Biographies by activity
Biochemistry
Drugs
Ethics
Biophysics
Chemistry
Greek mythology
Linguistics
Nobel laureates
Physiology
Psychology
Anthropology
Sociology
On screen, the categories are also shown together with their frequencies:
Medicine 33
Medical specialities 5
Biology 3
Anatomy 2
Biographies by activity 1
Biochemistry 1
Drugs 1
Ethics 1
Biophysics 1
Chemistry 1
Greek mythology 1
Linguistics 1
Nobel Laureates 1
Physiology 1
Psychology 1
Anthropology 1
Sociology 1
We can edit this file to suit our needs and use it with the wikipedia2text.py program explained above.
Instead of a single category, it is possible to indicate a number of categories separated by ":", for example:
python3 exploreCategories.py -d astwiki-20240501-pages-articles.xml.bz2 -l ast -c "Medicina:Anatomía:Especialidaes médiques" --limit 100 -o categorias-medicina-ast.txt