Case Study: Repetition in the Tipitaka - sc-voice/ms-dpd GitHub Wiki

Similarity groups

The Tipitaka suttas rely heavily on repetition. Indeed, repetition is integral to the suttas and serves many purposes.

memorization: repetition reinforces memory
mindfulness: repetition with minor variations cultivates mindfulness in memorization and recitation
comprehension: repetition of complex patterns trains the mind to recognize and use memorized texts within the context of their variations
error detection and correction: repetition creates rhythms that can highlight inadvertent text omissions or mutations during recitation

To study repetition in the suttas, we need to detect repetition, and the key to detecting repetition is a similarity formula. A similarity formula is simply one that:

returns 0 if two texts have no words in common
returns 1 if two texts have all words in common
returns a value between 0 and 1 when only some words are in common
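As a minimal illustration of these three properties, the following sketch computes the Jaccard index over the sets of words in two texts. This is not the formula used by this library (that formula is TFIDF-based); the `words` and `jaccard` helpers and the tokenization rule are assumptions made only for this example.

```javascript
// Hypothetical helper: split text into a set of lowercase words, dropping punctuation.
function words(text) {
  return new Set(
    text.normalize("NFC").toLowerCase().split(/[^\p{L}\p{M}]+/u).filter(Boolean)
  );
}

// Jaccard index: number of shared words divided by number of distinct words in either text.
function jaccard(textA, textB) {
  const a = words(textA);
  const b = words(textB);
  const shared = [...a].filter((w) => b.has(w)).length;
  const total = a.size + b.size - shared;
  return total === 0 ? 0 : shared / total;
}

console.log(jaccard("Tassa evamassa:", "Tassa evamassa:")); // 1
console.log(jaccard("Tassa evamassa:", "‘sallekhena viharāmī’ti.")); // 0
```

Unlike the TFIDF approach, this stand-in treats every word as equally important, regardless of how rare the word is in the corpus.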

TFIDF

NOTE: This section is included for background and rigor since it deals briefly with math formulas. However, it can be skipped for a general understanding.

Similarity scores are tremendously useful. Indeed, we rely on similarity scores with every internet search. And because they are so useful, there are many ways to compute them. Artificial Intelligence (AI) can be used to compute similarity scores, but AI is complex and uncertain. Fortunately, there are much simpler and more reliable ways to formalize similarity.

In this library, we rely on a modified TFIDF (term-frequency/inverse-document-frequency) formula. The formula used in this work differs from the standard formulas described in tf-idf equations:

{idf(t,D)}=\log{\frac{N}{|\{d:d\in D{\text{ and }}t\in d\}|}}

The standard formula is neither normalized nor tunable. Normalization to the interval [0,1] provides a bounded scale for quick comparison. By tunable, we mean that the formula lets us change how the TFIDF similarity score relates to the rarity of a word in the corpus of documents. Empirically, an idfWeight of ~1.618 proved useful in finding documents that are similar because they share many infrequent words. This metric is critical for examining the Tipitaka corpus, which is characterized by repetitive text with subtle changes of a few words. Repetition was not a key motivator for the original IDF formula.

Following is the amended IDF formula used as the default for this library.

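    // wdc: number of documents containing the word (assumed meaning of the name)
    // nDocs: number of documents in the corpus; idfWeight: tuning constant (~1.618)
    idf = nDocs
      ? 1 - Math.exp(((wdc - nDocs) / wdc) * idfWeight)
      : 1;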
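To get a feel for how this formula behaves, the sketch below wraps the expression in a function and evaluates it for a few word frequencies. The function signature, the reading of `wdc` as the number of documents containing the word, and the corpus size of 100 are assumptions made for illustration.

```javascript
// Sketch only: wraps the expression above in a function.
// wdc is assumed to be the number of documents that contain the word.
function idf(wdc, nDocs, idfWeight = 1.618) {
  return nDocs ? 1 - Math.exp(((wdc - nDocs) / wdc) * idfWeight) : 1;
}

const nDocs = 100; // hypothetical corpus size
for (const wdc of [1, 10, 50, 100]) {
  console.log(wdc, idf(wdc, nDocs).toFixed(3));
}
```

A word that appears in every document (wdc equal to nDocs) gets an IDF of 0, and the value approaches 1 as the word becomes rarer. Raising idfWeight pushes the IDF of moderately common words closer to 1.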

TFIDF Similarity

With the normalized and tunable IDF formula, we can now generate TFIDF word vectors and compare two documents in a corpus by the cosine similarity of their vectors (see cosine similarity comparison). In general, cosine similarity produces values in the interval [-1,1]. However, since the components of TFIDF vectors are never negative, the actual range of similarity scores is the interval [0,1]. This interval is quite useful, since it can be interpreted intuitively as:

0: nothing in common
1: everything in common
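The calculation can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function names, the tokenization rule, and the use of raw word counts as term frequency are all hypothetical, and each MN8 segment is treated as one document in a tiny four-document corpus.

```javascript
// Illustrative sketch; names and tokenization are assumptions, not the library API.
function tokenize(text) {
  return text.normalize("NFC").toLowerCase().split(/[^\p{L}\p{M}]+/u).filter(Boolean);
}

// IDF expression from the section above; wdc = documents containing the word (assumed).
function idf(wdc, nDocs, idfWeight = 1.618) {
  return nDocs ? 1 - Math.exp(((wdc - nDocs) / wdc) * idfWeight) : 1;
}

// Count, for each word, how many documents contain it.
function documentCounts(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const w of new Set(tokenize(doc))) counts.set(w, (counts.get(w) || 0) + 1);
  }
  return counts;
}

// Build a sparse TFIDF vector (word -> weight), using raw counts as term frequency.
function tfidfVector(text, counts, nDocs) {
  const vec = new Map();
  for (const w of tokenize(text)) vec.set(w, (vec.get(w) || 0) + 1);
  for (const [w, tf] of vec) vec.set(w, tf * idf(counts.get(w) || 0, nDocs));
  return vec;
}

// Cosine similarity of two sparse vectors; returns 0 if either vector is all zeros.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [w, x] of a) {
    normA += x * x;
    dot += x * (b.get(w) || 0);
  }
  for (const y of b.values()) normB += y * y;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

const corpus = [
  "Tassa evamassa:",
  "‘sallekhena viharāmī’ti.",
  "yā imā, bhante, anekavihitā diṭṭhiyo loke uppajjanti",
  "Yā imā, cunda, anekavihitā diṭṭhiyo loke uppajjanti",
];
const counts = documentCounts(corpus);
const sim = (i, j) =>
  cosine(
    tfidfVector(corpus[i], counts, corpus.length),
    tfidfVector(corpus[j], counts, corpus.length)
  );

console.log(sim(0, 0), sim(0, 1), sim(2, 3));
```

Because the scores depend on the corpus, the tokenizer and the idfWeight, this sketch is not expected to reproduce the exact scores reported on this page.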

Some examples will help illustrate this use of TFIDF similarity.

Similarity Exact match

The similarity score for an exact match is 1. For example, comparing the following segment to itself:

MN8:4.2: Tassa evamassa:

Similarity Completely different

The TFIDF similarity score for a complete mismatch is 0. For example, the two following segments have no words in common:

MN8:4.2: Tassa evamassa:
MN8:4.3: ‘sallekhena viharāmī’ti.

Similarity Partial match

The similarity score for similar but slightly different text is between 0 and 1. This is where the idfWeight comes into play, because we want to identify texts that share a common template. In the following case, the TFIDF similarity score is 0.857 because only a single word differs:

MN8:3.1: “yā imā, bhante, anekavihitā diṭṭhiyo loke uppajjanti—
MN8:3.4: “Yā imā, cunda, anekavihitā diṭṭhiyo loke uppajjanti—

In the following case, the similarity score is 0.433, which is lower because more words are different:

MN8:4.1: Ṭhānaṁ kho panetaṁ, cunda, vijjati yaṁ idhekacco bhikkhu vivicceva kāmehi vivicca akusalehi dhammehi savitakkaṁ savicāraṁ vivekajaṁ pītisukhaṁ paṭhamaṁ jhānaṁ upasampajja vihareyya.
MN8:5.1: Ṭhānaṁ kho panetaṁ, cunda, vijjati yaṁ idhekacco bhikkhu vitakkavicārānaṁ vūpasamā ajjhattaṁ sampasādanaṁ cetaso ekodibhāvaṁ avitakkaṁ avicāraṁ samādhijaṁ pītisukhaṁ dutiyaṁ jhānaṁ upasampajja vihareyya.

Repetition groups

The new TFIDF similarity formula allows us to detect and group repeated text within a sutta. To do so, we treat the sutta as a corpus of documents where each document is an individual text segment. Then we group all similar segments into similarity groups.

We will use MN8 to illustrate how we detect and characterize repetition in the Tipitaka. MN8 uses extensive repetition to link concepts into a semantic web of memorable meaning. Individual segments are repeated, and groups of segments are also repeated. The resulting structure is complex.

Because the structure of repetition is complex, we first discuss non-repeating text. Even though such segments do not repeat, it is useful to assign a repetition group of 1 element to each of them. Since a singleton group has only 1 element, the group has a self-similarity of 1. Indeed, the first and second segments of MN8 are assigned singleton groups, which we can name G1 and G2.

1:G1 MN8:0.1: Majjhima Nikāya 8
...
2:G2 MN8:0.2: Sallekhasutta

Proceeding further through MN8, we encounter our first repetition with segments 6 and 9, which have a similarity score of 0.857:

6:G6.9 MN8:3.1: “yā imā, bhante, anekavihitā diṭṭhiyo loke uppajjanti—
...
9:G6.9 MN8:3.4: “Yā imā, cunda, anekavihitā diṭṭhiyo loke uppajjanti—

These two segments are similar, so we assign them a shared similarity group named G6.9 (i.e., the group of segment#6 and segment#9). The G6.9 group has a score of 0.857. This strategy looks promising.

At this point we encounter segment#7, which is exactly like segment#10. Fortunately, our grouping strategy also works well here. We can create a new group G7.10 with a similarity group score of 1 (the segments are identical) and add both segments to the new group.

7:G7.10 MN8:3.2: attavādapaṭisaṁyuttā vā lokavādapaṭisaṁyuttā vā—
...
10:G7.10 MN8:3.5: attavādapaṭisaṁyuttā vā lokavādapaṭisaṁyuttā vā—

Continuing on, we find that this grouping strategy also works for segment#8, which is similar to segment#11 with a score of 0.431. A casual reader may not see much similarity here, even though the two segments do share significant text:

8:G8.11 MN8:3.3: ādimeva nu kho, bhante, bhikkhuno manasikaroto evametāsaṁ diṭṭhīnaṁ pahānaṁ hoti, evametāsaṁ diṭṭhīnaṁ paṭinissaggo hotī”ti?
...
11:G8.11 MN8:3.6: yattha cetā diṭṭhiyo uppajjanti yattha ca anusenti yattha ca samudācaranti taṁ ‘netaṁ mama, nesohamasmi, na me so attā’ti—evametaṁ yathābhūtaṁ sammappaññā passato evametāsaṁ diṭṭhīnaṁ pahānaṁ hoti, evametāsaṁ diṭṭhīnaṁ paṭinissaggo hoti.

In summary, we have found that grouping segments by similarity easily reveals the repetition built into the Tipitaka. This algorithm can be adjusted as needed for studying any document in the Tipitaka.
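The grouping walk-through above can be sketched as a simple greedy pass. This is a simplified illustration, not the library's algorithm: `groupSegments`, the word-overlap stand-in for the TFIDF score, and the threshold of 0.4 are all assumptions made for this example.

```javascript
// Stand-in similarity for this sketch (word-set overlap); a TFIDF score would be used in practice.
function wordOverlap(textA, textB) {
  const split = (t) =>
    new Set(t.normalize("NFC").toLowerCase().split(/[^\p{L}\p{M}]+/u).filter(Boolean));
  const a = split(textA);
  const b = split(textB);
  const shared = [...a].filter((w) => b.has(w)).length;
  const total = a.size + b.size - shared;
  return total === 0 ? 0 : shared / total;
}

// Greedy grouping: each ungrouped segment seeds a group and collects every later
// ungrouped segment whose similarity to the seed meets the threshold.
function groupSegments(segments, similarity, threshold) {
  const groups = [];
  const grouped = new Set();
  for (let i = 0; i < segments.length; i++) {
    if (grouped.has(i)) continue;
    const members = [segments[i].n];
    grouped.add(i);
    for (let j = i + 1; j < segments.length; j++) {
      if (!grouped.has(j) && similarity(segments[i].text, segments[j].text) >= threshold) {
        members.push(segments[j].n);
        grouped.add(j);
      }
    }
    // Name the group after its member segment numbers, e.g. G6.9 or G1.
    groups.push({ name: "G" + members.join("."), members });
  }
  return groups;
}

const segments = [
  { n: 1, text: "Majjhima Nikāya 8" },
  { n: 6, text: "yā imā, bhante, anekavihitā diṭṭhiyo loke uppajjanti" },
  { n: 7, text: "attavādapaṭisaṁyuttā vā lokavādapaṭisaṁyuttā vā" },
  { n: 9, text: "Yā imā, cunda, anekavihitā diṭṭhiyo loke uppajjanti" },
  { n: 10, text: "attavādapaṭisaṁyuttā vā lokavādapaṭisaṁyuttā vā" },
];
const groups = groupSegments(segments, wordOverlap, 0.4); // 0.4 is a hypothetical threshold
console.log(groups.map((g) => g.name).join(", ")); // G1, G6.9, G7.10
```

Passing the similarity function as a parameter keeps the grouping logic independent of how the score is computed, so the threshold and scoring can be tuned separately.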

However, there is more to consider beyond simple repetition, because suttas such as MN8 use repetition hierarchically.

Stanzas and Overlapping groups

We previously discovered and named the following three groups:

G6.9 mn8:3.1… (0.857) head of overlapping groups
G7.10 mn8:3.2… (1.000) overlapping group
G8.11 mn8:3.3… (0.431) overlapping group

Looking at the three groups together, we notice that they overlap. At first it seems wrong for groups of repetitions to overlap. What is going on?

Overlapping groups are an indication of hierarchical repetition. Here we have two levels of repetition. The first level of repetition consists of groups of repeating segments. The second level of repetition consists of similar blocks of adjacent segments. For convenience, we refer to a block of adjacent segments as a "stanza".

segment	similarity group	stanza	extent
1 (mn8:1.1) Evaṁ me sutaṁ	1	1	[1,1]
6 (mn8:3.1) “yā imā, bhante,...	G6.9 (head)	6,7,8	[6,10]
7 (mn8:3.2) attavādapaṭisaṁyuttā...	G7.10	6,7,8	[6,10]
8 (mn8:3.3) ādimeva nu kho...	G8.11	6,7,8	[6,10]
9 (mn8:3.4) “Yā imā, cunda,...	G6.9 (head)	9,10,11	[9,13]
10 (mn8:3.5) attavādapaṭisaṁyuttā...	G7.10	9,10,11	[9,13]
11 (mn8:3.6) yattha cetā diṭṭhiyo...	G8.11	9,10,11	[9,13]

By detecting similarity, we can identify groups of repeating segments that follow a certain template. By identifying groups of repeating segments, we can infer the existence of stanzas, which can also repeat following their own template.

In summary, stanzas are defined by repetition.

An unrepeated segment is a one-segment stanza (e.g., "Evaṁ me sutaṁ")
Repeating segments define stanzas that start with the segment repetition (e.g., [mn8:3.1..mn8:3.3])
If groups of repeating segments overlap, stanzas are determined by the head group. The overlapped stanzas contribute to the extent of the head group, which has an overlap of +2.
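One possible reading of these rules is sketched below. It is a simplification under several assumptions: `inferStanzas` is a hypothetical helper, a group is taken to overlap the head group when its first member falls before the head group's second member, and the k-th stanza is formed from the k-th member of each overlapping group. The sketch does not compute the extent column of the table above.

```javascript
// groups: arrays of segment numbers, e.g. [[1], [6, 9], [7, 10], [8, 11]].
// Returns stanzas as arrays of segment numbers. Sketch only; not the library's algorithm.
function inferStanzas(groups) {
  const sorted = groups
    .map((g) => [...g].sort((a, b) => a - b))
    .sort((a, b) => a[0] - b[0]);
  const used = new Set();
  const stanzas = [];
  for (const head of sorted) {
    if (used.has(head)) continue;
    // Assumed overlap rule: a group belongs to this head if it starts
    // before the head group's second member (or is the head itself).
    const limit = head.length > 1 ? head[1] : head[0] + 1;
    const cluster = sorted.filter((g) => !used.has(g) && g[0] >= head[0] && g[0] < limit);
    cluster.forEach((g) => used.add(g));
    // The k-th stanza collects the k-th member of every group in the cluster.
    for (let k = 0; k < head.length; k++) {
      const stanza = cluster.map((g) => g[k]).filter((n) => n !== undefined);
      stanzas.push(stanza.sort((a, b) => a - b));
    }
  }
  return stanzas.sort((a, b) => a[0] - b[0]);
}

console.log(inferStanzas([[1], [6, 9], [7, 10], [8, 11]]));
```

For the groups discussed above, this yields the stanzas [1], [6,7,8] and [9,10,11], which matches the stanza column of the table.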

We now have enough terminology to discuss how analyzing similarity, repetition groups and repeating stanzas can be of use.

Sutta Alignment

The segment numbering system of SuttaCentral provides a wonderful structure for referencing and comparing the Tipitaka root texts with their contemporary translations. Unfortunately, many contemporary translations do not use the SuttaCentral numbering system, so it is difficult to directly compare these legacy translations with translations aligned to the SuttaCentral segment numbers.

Although it is certainly possible to manually align legacy translations, the similarity analysis described above can be quite helpful, especially in cases where a legacy translation does not have the same number of lines as the root text has segments. Legacy translations are often grouped by paragraphs instead of text segments. For example, Dr. Môhan Wijayaratna's translation of MN8 consists of 67 lines, while the SuttaCentral Pali text for MN8 consists of 170 text segments.

Consider the following paragraph:

« Vénéré, si toutes ces opinions diverses concernant la théorie du Soi ou concernant la théorie du monde se produisent chez les gens, sont-elles éliminées tout au début chez un bhikkhu lorsqu’il réfléchit correctement ? Ainsi, y a-t-il un abandon de ces opinions ? »

Using similarity analysis, we can see that the above paragraph is similar to:

MN8:3.1: “yā imā, bhante, anekavihitā diṭṭhiyo loke uppajjanti—
MN8:3.2: attavādapaṭisaṁyuttā vā lokavādapaṭisaṁyuttā vā—
MN8:3.3: ādimeva nu kho, bhante, bhikkhuno manasikaroto evametāsaṁ diṭṭhīnaṁ pahānaṁ hoti, evametāsaṁ diṭṭhīnaṁ paṭinissaggo hotī”ti?

Although this is promising, we also notice that because of repetition, the paragraph is also similar to the Buddha's response to Cunda's question:

MN8:3.4: “Yā imā, cunda, anekavihitā diṭṭhiyo loke uppajjanti—

To resolve the ambiguity introduced by repetition, we make use of stanzas detected as described above. Fortunately, paragraphs themselves often serve the same purpose as a stanza of text segments. Paragraphs are semantic units and the Buddha's repetitions are also semantic. And because the Buddha's repetitions are so consistent and rhythmic, they allow us to align legacy paragraphs to stanzas of the Pali Tipitaka of SuttaCentral.