Ranking Results: The Scoring System - carihaas/tesserae GitHub Wiki
Posted on February 15, 2013 by ncoffee
A Tesserae search begins by matching a minimum of two words in one text with two words in another. The words can be matched either by their exact forms or by their dictionary headwords. Headword matching permits, for instance, the Latin _tuli_ to match _latus_, since both are forms of the headword _fero_.
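The difference between the two matching modes can be sketched as follows. This is only an illustration, not the Tesserae code: the `LEMMATA` table and the function names are hypothetical, and a real system would draw its headwords from a morphological dictionary rather than a hand-written table.

```python
# Hypothetical toy table mapping an inflected form to its possible headwords.
# "latus" is ambiguous: it can be a participle of "fero" or a word in its own right.
LEMMATA = {
    "tuli": {"fero"},
    "latus": {"fero", "latus"},
    "arma": {"arma"},
}

def lemmata(form):
    """Return the possible headwords of a form (assumption: unknown forms map to themselves)."""
    return LEMMATA.get(form, {form})

def forms_match(a, b, by_lemma=True):
    """Match by shared headword when by_lemma is True, otherwise by exact form."""
    if by_lemma:
        return bool(lemmata(a) & lemmata(b))
    return a == b

lemma_hit = forms_match("tuli", "latus")                  # share the headword "fero"
exact_hit = forms_match("tuli", "latus", by_lemma=False)  # different exact forms
```

Here `lemma_hit` is true and `exact_hit` is false, which mirrors the _tuli_ / _latus_ example above.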
For comparisons of even moderate-sized texts, basic matching produces thousands of results. We have therefore created a scoring system to sort results by their likely interest.
Higher scores are given to parallels where the matched words in each text are closer together and where the matched words are rarer. Our testing has found that the top results produced by this method correspond well with the parallels identified by commentators. In other words, preliminary tests show that the current Tesserae identification and scoring processes help substantially to identify the most meaningful results.
Full testing of this system is still in progress, however, as are efforts to improve it further. In the meantime, the following description gives a somewhat more detailed account of its function.
First, the frequency of each matching term is calculated by dividing its count within its respective text by the total number of words in that text.
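As a minimal sketch of this first step, the snippet below computes relative frequencies for a toy token list. The function name and the sample tokens are illustrative assumptions, and the sketch counts exact forms only.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its count divided by the total number of tokens in the text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Toy token list for illustration only; a real text would be far longer.
tokens = "arma virumque cano troiae qui primus ab oris arma".split()
freqs = relative_frequencies(tokens)
# "arma" occurs twice among nine tokens, every other word once.
```

Because the divisor is the length of the text in which the word occurs, the same word will generally receive a different frequency in each of the two texts being compared.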
The frequency of a word will thus differ between the source and target texts. In a lemma-based search (the default), the count for a word includes every occurrence of an inflected form with which it shares one or more possible lemmata. These frequencies (very small fractions, even for the most common words) are then inverted, and the inverses are added together across both phrases. The result is a very large number. This sum is then divided by the distance covered by the matching words in the source and target phrases.
Distance in each phrase is calculated as the number of tokens spanned by (and including) the two least frequent matching words. The distances from the source and target phrases are added together to make the overall distance. Finally, the natural logarithm of the quotient is taken. This helps to bring the exponential differences in word frequencies that occur in natural language into a more linear and human-interpretable range. For a given parallel, the rarer the words are, and the closer they are together in their respective texts, the higher its score will be.
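The calculation described above can be sketched in a few lines. This is a reading of the prose description, not the Tesserae source code: the function names, the `(position, frequency)` representation of a matched word, and the sample numbers are all assumptions made for illustration, and ties in frequency are broken simply by input order.

```python
import math

def phrase_distance(matches):
    """Tokens spanned by, and including, the two least frequent matched words.

    matches: list of (position, frequency) pairs, one per matched word in a phrase.
    Assumes at least two matches, as a Tesserae parallel requires.
    """
    rarest = sorted(matches, key=lambda m: m[1])[:2]
    positions = [pos for pos, _ in rarest]
    return max(positions) - min(positions) + 1

def score(source_matches, target_matches):
    """Natural log of (sum of inverse frequencies / combined distance)."""
    inverse_sum = sum(1 / freq for _, freq in source_matches + target_matches)
    distance = phrase_distance(source_matches) + phrase_distance(target_matches)
    return math.log(inverse_sum / distance)

# Made-up positions and frequencies, purely for illustration.
source = [(0, 0.001), (3, 0.002)]    # matched words at tokens 0 and 3: distance 4
target = [(2, 0.0005), (4, 0.004)]   # matched words at tokens 2 and 4: distance 3
example_score = score(source, target)

# Moving the source words next to each other shortens the distance and raises the score.
closer_source = [(0, 0.001), (1, 0.002)]
closer_score = score(closer_source, target)
```

In this example the inverse frequencies sum to about 3750 and the combined distance is 7, so the score is the natural logarithm of roughly 3750 / 7. The `closer_score` value is higher than `example_score`, in line with the rule that closer and rarer words yield higher scores.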