Case Study: Aligning Môhan Wijayaratna FR translations - sc-voice/ms-dpd GitHub Wiki

Overview

One of the key innovations of SuttaCentral is document alignment. Alignment makes use of a canonical naming for parts of a Pali document. In the following aligned segment of a document, the alignment is given by "MN8:1.1", which is the unique identifier for that text segment across the entire Tipitaka:

MN8:1.1 Evaṁ me sutaṁ—

Alignment is a key innovation because it is fine-grained enough to reveal the structure of a document as well as its rhythmic repetition of key phrases.
Structure is revealed in the "text segment identifier" (e.g., "mn8:1.1") with its numerical hierarchy (i.e., "1.1"). Repetition is revealed by isolating each repeated phrase in its own text segment.

MN8:4.2: Tassa evamassa:
MN8:5.2: Tassa evamassa:
MN8:6.2: Tassa evamassa:
MN8:7.2: Tassa evamassa:
MN8:8.2: Tassa evamassa:
MN8:9.2: Tassa evamassa:
MN8:10.2: Tassa evamassa:
MN8:11.2: Tassa evamassa:
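The numerical hierarchy in a segment identifier can be made explicit in code. The following is a minimal sketch, assuming identifiers of the simple form "sutta:n.n" with purely numeric parts; the names `SegmentId` and `parse_scid` are hypothetical and are not taken from the ms-dpd code base.

```python
from typing import NamedTuple, Tuple


class SegmentId(NamedTuple):
    sutta: str             # e.g. "mn8"
    path: Tuple[int, ...]  # e.g. (1, 1) for "1.1"


def parse_scid(scid: str) -> SegmentId:
    """Split an identifier such as "MN8:1.1" into sutta and numeric path.

    Assumption: every dot-separated part after the colon is an integer.
    """
    sutta, _, numbers = scid.strip().lower().partition(":")
    return SegmentId(sutta, tuple(int(n) for n in numbers.split(".")))


ids = [parse_scid(s) for s in ["MN8:10.2", "mn8:4.2", "MN8:1.1"]]

# Tuples compare element by element, so sorting follows document order
# (4.2 before 10.2), which a plain string sort would get wrong.
ordered = sorted(ids)
```

Parsing the path into integers is what lets the hierarchy drive ordering and grouping; identifiers with ranges or letters would need a richer parser.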

Alignment is also a key innovation in that it allows translations to be compared with each other as well as with the Pali root text.

scid: mn8:1.1
 pli: Evaṁ me sutaṁ—
 ref: So I have heard. 
 fr: Ainsi ai-je entendu.
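Because every layer shares the same segment identifier, a side-by-side view like the one above reduces to a lookup by key. This is a small sketch under the assumption that each layer is held as a mapping from identifier to text; the `layers` structure and `aligned_view` helper are illustrative names only.

```python
# Each layer maps a segment identifier to that layer's text.
layers = {
    "pli": {"mn8:1.1": "Evaṁ me sutaṁ—"},
    "ref": {"mn8:1.1": "So I have heard."},
    "fr": {"mn8:1.1": "Ainsi ai-je entendu."},
}


def aligned_view(scid, layers):
    """Collect every layer's text for one segment; missing text becomes None."""
    view = {"scid": scid}
    for name, segments in layers.items():
        view[name] = segments.get(scid)
    return view


for key, value in aligned_view("mn8:1.1", layers).items():
    print(f"{key}: {value}")
```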

Contemporary Pali translations can be aligned or non-aligned. Given the utility of aligned translations, the question arises: "Can non-aligned translations be aligned?"

FR Môhan Wijayaratna

The FR Môhan Wijayaratna translations (FRMW) are non-aligned. Yet with some effort, they can indeed be aligned. There are, however, some challenges:

Many-to-1 alignment
Differences in contemporary translations
Obscured key phrases

These challenges are all solvable, but each warrants careful consideration.

Many-to-1 alignment

For MN8, the FRMW translation comprises 67 paragraphs to be aligned with 170 segments. Since the alignment is not 1-to-1, a paragraph will typically be mapped to one or more segments. In addition, there may be text segments without a matching paragraph. Finally, there may be annotative paragraphs in the translation with no correspondence to the Tipitaka.
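The three cases described here (a paragraph covering several segments, a segment with no paragraph, and an annotative paragraph with no segment) can be represented with a simple mapping. The sketch below uses made-up alignment data purely for illustration; it is not the actual MN8 alignment.

```python
segment_ids = ["mn8:1.1", "mn8:1.2", "mn8:2.1", "mn8:3.1"]

# Hypothetical alignment: paragraph index -> segment ids it covers.
alignment = {
    0: ["mn8:1.1", "mn8:1.2"],  # one paragraph spanning two segments
    1: [],                      # annotative paragraph, no Tipitaka counterpart
    2: ["mn8:3.1"],
}

covered = {scid for scids in alignment.values() for scid in scids}

# Segments that no paragraph maps to.
unmatched_segments = [s for s in segment_ids if s not in covered]

# Paragraphs that map to no segment at all.
annotations = [p for p, scids in alignment.items() if not scids]
```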

Differences in contemporary translations

Although contemporary aligned suttas can assist alignments, the task is complicated when the vocabulary differs. E.g., "sallekha" has been translated as "effacement" as well as "déracinement". Although the Digital Pali Dictionary (DPD) has been enormously helpful in addressing these differences, in this particular case, the FR translation of the DPD via DeepL does not cover both words. Indeed, the two words are not considered FR synonyms.

Obscured key phrases

Repetition is mnemonic, and the Pali suttas are designed to be mnemonic with their use of either literal repetition or template-driven repetition, where the latter calls attention to certain key phrases. E.g., "Others will X, here we will not X". When translations rely on paragraph structure that collects common information into a single paragraph, the Pali emphasis on key phrases often gets lost deep within a paragraph.

Key phrase in Pali repeated eight times by itself in MN8:

MN8:4.2: Tassa evamassa:

Obscured key phrase embedded within a sentence within a paragraph:

Se voit, Ô Cunda, la situation où un certain bhikkhu, s’étant séparé du désir, s’étant séparé des pensées inefficaces, entre dans le premier jhāna pourvu de raisonnement et de réflexion, qui est joie et bonheur, né de la séparation des choses mauvaises et il y demeure. CHEZ LUI PEUT SE PRODUIRE UNE PENSÉE orgueilleuse en se disant : “Je demeure en ayant déraciné [les souillures mentales]”. Ces jhānas, ô Cunda, ne sont pas appelés “les états déracinés [des souillures mentales]” dans cette discipline noble. Par contre, dans cette discipline noble, ils sont appelés “les demeures heureuses où l’on vit dans cette vie présente”.

Alignment Techniques

Alignment is basically a challenge of comparison, a detection of similarity between two sequences of text. In our case we need to align paragraphs with text segments. Here are some useful techniques:

similarity functions
detecting repetition

Similarity Functions

Given two texts T1 and T2, we would like to have a function "similarity" which generates:

Zero: if T1 is completely different from T2
A real number in (0..1) that increases with increasing similarity between T1 and T2
1: if T1 is identical to T2

What formula should we use? Similarity can be detected in many ways, including AI. However, it is simple, effective and reliable to use word vectors. AI tends to be opaque: the provenance of its recommendations is buried deep within neural nets. In contrast, word vectors are transparently and verifiably computed. For example, we can define a similarity function that asserts that "a cat went walking" is 0.75 similar to "a dog went walking" or "a cat went jogging". Indeed, there is already such a function for word vectors, called cosine similarity. Although the formula itself looks daunting, it produces results that match intuition. For two texts of four distinct words each:

if 0 out of 4 words are the same in T1 and T2, then the similarity is 0
if 1 out of 4 words are the same in T1 and T2, then the similarity is 0.25
if 2 out of 4 words are the same in T1 and T2, then the similarity is 0.5
if 3 out of 4 words are the same in T1 and T2, then the similarity is 0.75
if 4 out of 4 words are the same in T1 and T2, then the similarity is 1
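Cosine similarity divides the dot product of two word vectors by the product of their lengths. For two texts of four distinct words each, both lengths are 2, so k shared words give k/4, which is the progression listed above. The following is a minimal sketch, assuming a plain bag-of-words vector with lower-casing and no stemming or weighting; the actual similarity function used for the MN8 alignment may differ.

```python
import math
import re
from collections import Counter


def word_vector(text: str) -> Counter:
    """Bag-of-words vector: lower-cased word -> number of occurrences."""
    return Counter(re.findall(r"\w+", text.lower()))


def similarity(t1: str, t2: str) -> float:
    """Cosine similarity of the two word vectors, in the range 0..1."""
    v1, v2 = word_vector(t1), word_vector(t2)
    # Counter returns 0 for words missing from v2.
    dot = sum(count * v2[word] for word, count in v1.items())
    norm = math.sqrt(
        sum(c * c for c in v1.values()) * sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


print(similarity("a cat went walking", "a dog went walking"))  # 0.75
print(similarity("a cat went walking", "a cat went jogging"))  # 0.75
print(similarity("a cat went walking", "a cat went walking"))  # 1.0
```

Every score can be checked by hand from the word counts, which is the transparency argument made for word vectors.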

With a similarity formula we can do coarse alignment of two bodies of text. But relying only on a similarity formula is insufficient because we are comparing two text collections with different sizes. For MN8/fr, we are comparing 67 paragraphs with 170 text segments. This means that we could align each of those paragraphs with more than one segment. But which segment would we choose if several adjacent text segments are similar? We could simply choose "the first similar segment", but we then fall into the trap of mistaking a similar segment for a relevant one. Let's take an example:

Suppose we have a simple paragraph:

I took a cat and went strolling, jogging and running

How do we match that paragraph to these four text segments, taken in order? They are all quite similar to the paragraph, so we need to pick one. But which one is relevant?

a dog went running
a cat went walking
a cat went jogging
a cat went running

Upon careful examination, we soon see that the last three text segments are actually a group of relevant things about a cat. But the grouping only becomes apparent through repetition.
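The trap can be reproduced numerically. This sketch scores the paragraph against each segment using a bag-of-words cosine similarity and then picks "the first similar segment" using a cutoff of 0.4; both the similarity definition and the cutoff are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def similarity(t1, t2):
    """Cosine similarity of lower-cased bag-of-words vectors."""
    v1 = Counter(re.findall(r"\w+", t1.lower()))
    v2 = Counter(re.findall(r"\w+", t2.lower()))
    dot = sum(count * v2[word] for word, count in v1.items())
    norm = math.sqrt(
        sum(c * c for c in v1.values()) * sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


paragraph = "I took a cat and went strolling, jogging and running"
segments = [
    "a dog went running",
    "a cat went walking",
    "a cat went jogging",
    "a cat went running",
]

scores = [similarity(paragraph, seg) for seg in segments]
for seg, score in zip(segments, scores):
    print(f"{score:.3f}  {seg}")

THRESHOLD = 0.4  # hypothetical cutoff for "similar enough"
first_similar = next(s for s, sc in zip(segments, scores) if sc >= THRESHOLD)
print(first_similar)
```

Under these assumptions the dog segment scores the same as "a cat went walking" and clears the cutoff first, even though it does not belong to the group of cat segments.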

In other words, similarity does not guarantee relevance. So how can we align each of those 67 paragraphs with the "first relevant segment"? Will repetition help?

Detecting repetition

To determine the relevance of a text segment to a paragraph we have to look at the structure of repetition used within the suttas. MN8 is a classic example, with its repetition of "Tassa evamassa:". Indeed, "Tassa evamassa:" marks the beginning of a block of related segments, and that use is quite important, since paragraphs serve the same grouping function. Since "Tassa evamassa:" is repeated in MN8, its presence is indeed relevant.

Fortunately, we can use our similarity function to detect relationships between segments. In particular, we can use repetition to detect groups of repeated text. In fact, here are some of the groups detected in MN8:

d8r.addCorpusSutta: [1]mn8:0.1#1  G1 1.000g  0.000p                                 
d8r.addCorpusSutta: [1]mn8:0.2#2  G2 1.000g  0.000p                           
d8r.addCorpusSutta: [1]mn8:1.1#3  G3 1.000g  0.000p                             
d8r.addCorpusSutta: [1]mn8:1.2#4  G4 1.000g  0.000p                                
d8r.addCorpusSutta: [1]mn8:2.1#5  G5 1.000g  0.062p 
d8r.addCorpusSutta: [1]mn8:1.2#4  G4 1.000g  0.000p                         
d8r.addCorpusSutta: [1]mn8:3.1#6  G6_9 0.875g  0.000p                                   
d8r.addCorpusSutta: [1]mn8:3.2#7  G7_10 1.000g  0.000p                                 
d8r.addCorpusSutta: [1]mn8:3.3#8  G8 1.000g  0.000p 
d8r.addCorpusSutta: [1]mn8:3.4#9  G6_9 0.875g  0.000p                          
d8r.addCorpusSutta: [1]mn8:3.5#10  G7_10 1.000g  0.000p                          
d8r.addCorpusSutta: [1]mn8:3.6#11  G11 1.000g  0.000p 
...

In the above we see that MN8 starts out with unrepeated segments mn8:0.1..mn8:2.1. An interesting pattern then emerges with segment mn8:3.1: it is similar (i.e., 0.875) to mn8:3.4, the only difference being who is addressed.

MN8:3.1: “yā imā, bhante, anekavihitā diṭṭhiyo loke uppajjanti—
MN8:3.4: “Yā imā, cunda, anekavihitā diṭṭhiyo loke uppajjanti—
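Grouping by repetition can be sketched as follows: each segment joins the first existing group whose representative it resembles closely enough, otherwise it starts a new group. This is an illustrative sketch only, assuming a bag-of-words cosine similarity and a hypothetical threshold of 0.8; it is not the implementation behind the log output above. With this plain word count, mn8:3.1 and mn8:3.4 score about 0.857 (6 of 7 words shared) rather than the logged 0.875, so the actual tool presumably tokenizes or weights words differently.

```python
import math
import re
import unicodedata
from collections import Counter


def similarity(t1, t2):
    """Cosine similarity of lower-cased bag-of-words vectors."""
    def vec(text):
        # NFC keeps accented Pali letters as single word characters.
        text = unicodedata.normalize("NFC", text).lower()
        return Counter(re.findall(r"\w+", text))

    v1, v2 = vec(t1), vec(t2)
    dot = sum(count * v2[word] for word, count in v1.items())
    norm = math.sqrt(
        sum(c * c for c in v1.values()) * sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


def group_segments(segments, threshold=0.8):
    """Greedy grouping: compare each segment with each group's first member."""
    groups, representatives = [], []
    for scid, text in segments:
        for group, rep_text in zip(groups, representatives):
            if similarity(text, rep_text) >= threshold:
                group.append(scid)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([scid])
            representatives.append(text)
    return groups


segments = [
    ("mn8:1.1", "Evaṁ me sutaṁ—"),
    ("mn8:3.1", "“yā imā, bhante, anekavihitā diṭṭhiyo loke uppajjanti—"),
    ("mn8:3.4", "“Yā imā, cunda, anekavihitā diṭṭhiyo loke uppajjanti—"),
    ("mn8:4.2", "Tassa evamassa:"),
    ("mn8:5.2", "Tassa evamassa:"),
]

groups = group_segments(segments)
print(groups)
```

Comparing only against the first member keeps each group anchored to one reference text; other linkage rules would be equally plausible.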

Understanding that repetition forms groups of meaning, we can also understand that paragraphs should not split groups. Indeed, we would not want to mix what the Buddha said with what Cunda said even if they used the same words. So repetition is absolutely crucial for correct alignment. One might even be inclined to think that repetition is a key feature of the Tipitaka that has ensured its error-free transmission through millennia. Repetition is critical for error-free information transfer today, and the Tipitaka's reliance on precise repetition certainly resonates with modern error correction and detection.

For alignment, then, we can use repetition to define alignment groups that will increase alignment precision.