Case Study: Word Vectors

Intuitively, a word vector is just a bunch of words assigned numeric values.

{the:2, brown:1, fox:1, jumped:1, over:1, lazy:1, dog:1}

The numbers assigned can serve different purposes. In the above example, the numbers are just the counts of the words in

the brown fox jumped over the lazy dog
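
In code, such a count vector takes only a few lines. Here is a minimal Python sketch:

```python
from collections import Counter

words = "the brown fox jumped over the lazy dog".split()
print(Counter(words))
# Counter({'the': 2, 'brown': 1, 'fox': 1, 'jumped': 1,
#          'over': 1, 'lazy': 1, 'dog': 1})
```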

But the numbers can also represent more complicated relationships. With the right formulas, word vectors become word embeddings that can measure the degree of similarity between two documents. Indeed, word embeddings are so useful that we use them daily to search the internet. Basically, an internet search is just a request to compare the word embedding for one document (the query we typed) with the word embeddings for all the other documents in the world, ranking them in descending order of similarity.
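
At the heart of such a comparison is a similarity measure between two word vectors. A common choice is cosine similarity; here is a minimal sketch (the dict-based vector representation is illustrative, not the actual MS-DPD API):

```python
import math

def cosine_similarity(a, b):
    """a, b: word vectors as {word: weight} dicts."""
    dot = sum(w * b.get(word, 0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A score of 1 means the two vectors point in the same direction (very similar documents); a score of 0 means they share no weighted words at all.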

Dictionary-based Word Embeddings

Dictionaries define the meaning of a word by using...more words. If we look up the word evaṁ in the Digital Pali Dictionary, we see more words that together form the meaning of evaṁ.

> dpd evam 
========== find:evam method:unaccented ==========
 # KEY LEMMA  PAT MEANING
 1 4iU evaṁ 1     thus; this; like this; similarly; in the same manner; just as; such
 2 4iV evaṁ 2     yes!; that is right!; correct!

Using the DPD we can form a word vector for evaṁ. For example, we could create a Bag of Words (BoW) word vector based on the number of occurrences of each word from the above:

{thus:1, this:2, like:1, similarly:1, in:1, the:1, same:1, manner:1, just:1, as:1, such:1, yes:1, that:1, is:1, right:1, correct:1}
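
A sketch of how such a vector could be built from the DPD meaning strings (the tokenization shown is illustrative; MS-DPD's actual pipeline may differ):

```python
import re
from collections import Counter

meanings = [
    "thus; this; like this; similarly; in the same manner; just as; such",
    "yes!; that is right!; correct!",
]
# lowercase and keep only alphabetic tokens
tokens = re.findall(r"[a-z]+", " ".join(meanings).lower())
print(dict(Counter(tokens)))
```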

Although counting words does work somewhat for document comparison, it has some glaring weaknesses. For example, notice that the word the has equal importance with right in the word count vector for evaṁ. Yet the is ubiquitous and therefore useless as a basis for comparison, while rarer words like right are more valuable precisely because they are less frequent. We want to look for the needle in a haystack, rather than looking at all the straws that are not needles. In other words, the weight of a word should depend on its context.

We need a better way to generate word vectors for given contexts. We could, for example, use AI neural nets to generate them, but neural nets introduce uncertain outcomes because of the randomness inherent in their training. We need something more deterministic: formulas that produce repeatable results with certainty.

Term-Frequency/Inverse-Document-Frequency

Before the current reliance on AI and neural nets, the world of search engines relied on formulas such as the tf-idf equations. These formulas take context into account, so a ubiquitous word like the has almost no effect on the word vector.
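
As a rough sketch, the classic formulation multiplies a word's frequency within a document (tf) by a penalty for how many documents contain it (idf). The exact variant MS-DPD uses may differ in detail:

```python
import math
from collections import Counter

def tf_idf(doc, corpus):
    """doc: list of tokens; corpus: list of token lists, doc included."""
    vector = {}
    for word, n in Counter(doc).items():
        tf = n / len(doc)                         # frequency within this document
        df = sum(1 for d in corpus if word in d)  # documents containing the word
        idf = math.log(len(corpus) / df)          # 0 when the word is in every document
        vector[word] = tf * idf
    return vector
```

A word like the appears in nearly every document, so its idf approaches zero, and its weight drops to the bottom of the vector. To see this in action, let's look at an actual MS-DPD word vector.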

Here is the MS-DPD tf-idf word vector for evaṁ me sutaṁ. You can see that the word le (EN: the) occurs at the end of the vector with a small, insignificant number:

entendu:0.075,moi:0.050,je:0.036,ma:0.025,mes:0.025,
même:0.025,c_est:0.025,comme:0.025,ceci:0.025,du:0.024,
la:0.024,lettre:0.021,(gram):0.020,à:0.015,(mesure):0.013,
√mā:0.013,1ère:0.013,31ème:0.013,apprentissage:0.013,
appris:0.013,connaissance:0.013,fille:0.013,fils:0.013,
lune:0.013,m:0.013,nasale:0.013,oui:0.013,part:0.013,
pluriel:0.013,quelque:0.013,terminaison:0.013,verbale:0.013,
vrai:0.012,depuis:0.012,consonne:0.012,tel:0.012,
(objet):0.012,correct:0.012,par:0.012,ainsi:0.012,
chose:0.012,quand:0.012,manière:0.012,présent:0.012,
pour:0.012,que:0.012,alphabet:0.012,moi-même:0.012,qui:0.011,
(gramme):0.011,ce:0.010,personne:0.010,de:0.007,est:0.005,
le:0.005

This is a rather large word vector, but remarkably, its most important element is entendu (EN: heard) and its second most important element is moi (EN: me). Overall, the vector corresponds well to donc j'ai entendu, which is one FR translation of so i have heard, which is in turn the EN translation of evaṁ me sutaṁ.

Aligning legacy suttas

Dictionary-based word embeddings (DBWE) allow us to compare the meanings of blocks of text reliably and predictably. Since the DPD is a Pali dictionary, we can use DBWEs to align legacy sutta translations in any MS-DPD language to the SuttaCentral segment numbering:

MN8:1.1: Evaṁ me sutaṁ—
MN8:1.1: So I have heard.

Using DBWEs, we can match up the following text from Môhan Wijayaratna's legacy FR translation:

Ainsi ai-je entendu: une fois le Bienheureux séjournait dans le parc d’Anāthapiṇḍika, au bois de Jeta, près de la ville de Sāvatthi.

(EN: Thus have I heard: once the Blessed One was staying in Anāthapiṇḍika's park, in the Jeta Grove, near the city of Sāvatthi.)

It matches MN8:1.1 because that segment has the highest DBWE matching score; the neighboring segments return lower scores:

  mn8:1.1: 0.16320343700431444, // this is the one!
  mn8:1.2: 0.13922716083775374,
  mn8:2.1: 0.12486307791576028,
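
A minimal sketch of this alignment step, assuming hypothetical embed() and similarity() helpers (the real MS-DPD API and scoring may differ, and a production aligner would also exploit the fact that segments are ordered):

```python
def align(legacy_lines, segments, embed, similarity):
    """Match each legacy line to the segment whose dictionary-based
    word embedding scores highest against it."""
    alignment = {}
    for line in legacy_lines:
        line_vec = embed(line)
        scores = {seg_id: similarity(line_vec, embed(seg_text))
                  for seg_id, seg_text in segments.items()}
        alignment[line] = max(scores, key=scores.get)  # e.g. 'mn8:1.1'
    return alignment
```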

This match is remarkable because the larger legacy translation is so different from the Pali segment. Furthermore, the legacy translation has 69 lines, which need to be aligned to 179 segments. But with DBWE matching, we can construct a consistent, practical, and effective alignment.