Latin Greek Search: Competing Methods
Posted on October 18, 2013 by Chris Forstall
Given the indebtedness of many Latin literary forms to earlier Greek originals, it has long been a goal of ours at Tesserae to one day implement a Latin-Greek search on our site. Currently, word-level n-grams form the foundation of the principal search algorithm. To apply this system where a Latin text alludes to Greek, Tesserae requires a translation dictionary linking Greek lemmata to associated Latin terms.
James Gawley and I are currently working on two different methods for producing such a dictionary. James is working on the “parallel texts” method. This method compares the Greek New Testament with Jerome’s Latin text to probabilistically assign a Latin translation (actually, several likely candidates) to each Greek word. James is writing an algorithm for machine text alignment based on Bayes’ theorem. This algorithm, similar to more complex models such as the IBM methods for machine alignment, looks at the frequency with which each Latin word appears in the same verses as each Greek word.
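To make the co-occurrence idea concrete, here is a minimal sketch in Python. It is not James's algorithm, only a simplified illustration of scoring Latin candidates by how often they share a verse with a Greek word, written out via Bayes' theorem. The function name `candidate_translations` and the three "verses" of transliterated lemmata are invented for the example and are not drawn from the New Testament data.

```python
from collections import Counter
from itertools import product


def candidate_translations(verses, top_n=2):
    """Rank Latin candidates for each Greek lemma by verse co-occurrence.

    verses: list of (greek_lemmata, latin_lemmata) pairs, one per aligned verse.
    Returns {greek_lemma: [(latin_lemma, score), ...]}, best candidates first.
    """
    n_verses = len(verses)
    greek_count, latin_count, pair_count = Counter(), Counter(), Counter()
    for greek, latin in verses:
        greek_set, latin_set = set(greek), set(latin)
        greek_count.update(greek_set)   # verses containing each Greek lemma
        latin_count.update(latin_set)   # verses containing each Latin lemma
        pair_count.update(product(greek_set, latin_set))  # verses with both

    scores = {}
    for (g, l), n_both in pair_count.items():
        # Bayes' theorem: P(l | g) = P(g | l) * P(l) / P(g)
        p_g_given_l = n_both / latin_count[l]
        p_l = latin_count[l] / n_verses
        p_g = greek_count[g] / n_verses
        scores.setdefault(g, {})[l] = p_g_given_l * p_l / p_g

    return {
        g: sorted(cands.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for g, cands in scores.items()
    }


# Toy data, invented for illustration only (not real verses).
toy_verses = [
    (["logos", "theos"], ["verbum", "deus"]),
    (["theos", "phos"], ["deus", "lux"]),
    (["logos", "phos"], ["verbum", "lux"]),
]
candidates = candidate_translations(toy_verses)
```

With verse-level counts like these the score reduces to the share of a Greek word's verses in which the Latin word also appears. That is why the method yields several likely candidates per Greek word rather than a single translation.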
My method, the “dictionary method,” uses English as a pivot language. Expanding on a method developed by Jeff Rydberg-Cox at Perseus, I compare entries in the Liddell-Scott Greek-English lexicon with entries in the Lewis and Short Latin-English lexicon using the Gensim topic modelling package. The similarity of a given Greek and Latin headword is determined based on the similarity of their English definitions in the two dictionaries.
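The core of the comparison can be sketched as follows. To keep the example runnable with only the standard library, it swaps Gensim's models for a plain bag-of-words cosine similarity, so treat it as an illustration of the idea rather than the actual pipeline. The glosses are short stand-ins written for this example, not quotations from Liddell-Scott or Lewis and Short, and the name `related_words` is hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def related_words(greek_glosses, latin_glosses, top_n=2):
    """For each Greek headword, rank Latin headwords by gloss similarity."""
    latin_vecs = {w: Counter(g.lower().split()) for w, g in latin_glosses.items()}
    result = {}
    for greek_word, gloss in greek_glosses.items():
        greek_vec = Counter(gloss.lower().split())
        scored = [(lw, cosine(greek_vec, lv)) for lw, lv in latin_vecs.items()]
        scored = [(lw, s) for lw, s in scored if s > 0]  # drop unrelated words
        scored.sort(key=lambda kv: (-kv[1], kv[0]))
        result[greek_word] = scored[:top_n]
    return result


# Invented English glosses, for illustration only.
greek_glosses = {
    "theos": "god deity divine being",
    "aitia": "cause reason charge blame",
}
latin_glosses = {
    "deus": "a god deity",
    "numen": "divine will god power",
    "causa": "cause reason motive",
}
related = related_words(greek_glosses, latin_glosses)
```

Here `related["theos"]` ranks `deus` ahead of `numen`, and `related["aitia"]` contains only `causa`. A real run would also need stop-word removal and some weighting of rare terms, which this sketch leaves out.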
Each method produces its own Greek-Latin translation set. These sets are used to “translate” Tesserae’s existing Greek lemma indices, which can then be searched against the Latin indices. The success of this approach depends largely on how many Greek lemmata we can link with Latin translations (a better term might be “related words”). While the search is still in the alpha stage, it shows a lot of promise.
For example, in the opening of Vergil’s Aeneid, the narrator asks his Muse about the causes of the Trojans’ trials as they wandered with Aeneas:
Musa, mihi causas memora, quo numine laeso (Aen. 1.8) |
---|
Muse, remind me of the causes, on account of which god’s anger… |
Compare the words of Priam to Helen, as, gazing from the wall at the warriors below, he reflects on the source of the Trojans’ suffering:
οὔ τί μοι αἰτίη ἐσσί, θεοί νύ μοι αἴτιοί εἰσιν (Il. 3.164) |
---|
To me, you are not the cause; to me, the gods are the causes… |
In this case, the dictionary method allows Tesserae to detect the parallel based on the correspondences _numine_ (“god”) ~ θεοί (“gods”) and _causas_ (“causes”) ~ αἰτίη/αἴτιοι (“cause”/“causes”).
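The matching step for this pair of lines might look something like the sketch below. Everything in it is an assumption made for illustration: the lemmata are hand-assigned, the small `GREEK_TO_LATIN` table stands in for real dictionary output, and the two-lemma threshold (`min_shared=2`) is a guess at a reasonable minimum rather than Tesserae's actual setting.

```python
THEOS = "θεός"
AITIOS = "αἴτιος"

# Hypothetical "related words" table; a real one would be generated by
# the dictionary method or the parallel-texts method.
GREEK_TO_LATIN = {
    THEOS: {"deus", "numen"},
    AITIOS: {"causa"},
}

# Hand-lemmatized lines, for illustration only.
AEN_1_8 = {"musa", "ego", "causa", "memoro", "qui", "numen", "laedo"}
IL_3_164 = ["οὐ", "τις", "ἐγώ", AITIOS, "εἰμί", THEOS, "νυ"]


def shared_lemmata(greek_lemmata, latin_lemmata, table):
    """Return sorted (greek, latin) pairs linked through the translation table."""
    pairs = set()
    for greek in greek_lemmata:
        for latin in table.get(greek, set()) & latin_lemmata:
            pairs.add((greek, latin))
    return sorted(pairs)


def is_candidate_parallel(pairs, min_shared=2):
    """Flag a line pair when enough distinct Latin lemmata are matched."""
    return len({latin for _, latin in pairs}) >= min_shared


pairs = shared_lemmata(IL_3_164, AEN_1_8, GREEK_TO_LATIN)
flagged = is_candidate_parallel(pairs)
```

Only the two correspondences named above survive the lookup, which is enough to flag the lines as a candidate parallel under the assumed threshold.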
We’re pitting the two methods against each other, head to head. They’ll be tested on their ability to detect a subset of Aeneid–Iliad parallels that Konnor Clark and Amy Miu collated from G. N. Knauer’s Die Aeneis und Homer, a set similar to our Lucan-Vergil benchmark. For now, you can test them on our development site here. (NB: once you’re at the development page, links lead to other development pages. To leave the develop branch, click on the blog link in the upper right.)
While each of the two methods on its own can identify significant Latin-Greek allusions, we ultimately aim to combine their output in a single feature set. We’re excited to be presenting this work at DHCS 2013 in Chicago this December 5–7.