Defensio vocabulary - nevenjovanovic/modruski-temrezah GitHub Wiki

Defensio ecclesiasticae libertatis - a vocabulary analysis

A list of unique word forms

Create a list of unique word forms with the XQuery script defensio-create-word-list.xq. At the same time, normalise for capital letters and v / u, j / i.

The result is in vocabulary/defensio-word-list.txt.
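The normalisation described above can be illustrated with a short Python sketch. This is not the code of defensio-create-word-list.xq (which is XQuery); the function name, the tokenising regular expression, and the direction of the normalisation (v to u, j to i) are assumptions made for the example.

```python
import re

def unique_word_forms(text):
    """Return the sorted list of normalised, unique word forms in `text`."""
    # Assumption: a token is a maximal run of letters; the real script may tokenise differently.
    tokens = re.findall(r"[^\W\d_]+", text)
    forms = set()
    for token in tokens:
        form = token.lower()           # normalise capital letters
        form = form.replace("v", "u")  # v / u (direction assumed)
        form = form.replace("j", "i")  # j / i (direction assumed)
        forms.add(form)
    return sorted(forms)

print(unique_word_forms("Vir iustus et Justus vir; uir bonus."))
```

With this normalisation "Vir", "vir" and "uir" collapse into one form, which keeps the list passed to LEMLAT short.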

LEMLAT results

Analyse with the XQuery script call-local-LEMLAT-defensio.xq.

The program analysed a list of unique forms (see above).

The analysis of 6,396 forms took 56 seconds.

No Wordforms: 6396
No Unknown: 359
No Analysed: 6037

Analyse the unknown forms

After correcting a couple of incorrectly tokenized words, there are 15,019 tokens in the Defensio. The change is statistically insignificant, but philologically important.

The unknowns turn out to be:

Names
Uncertain readings from the source
Incorrectly tokenized words
Orthographic and morphological variants of lemmata in the LEMLAT base
Forms belonging to lemmata missing from the LEMLAT base

Process the results into unambiguous and ambiguous analyses

Use the TEI scheme elements.

Process the LEMLAT XML results file (after correcting the double quote in the lemma attribute value) with the script LEMLAT-xml-into-TEI-fragment-defensio.xq.

Unambiguous vs. ambiguous parses

Of 6037 analysed word forms, there are 2291 unambiguously analysed; 1033 forms with only derivationally ambiguous lemmata; 2713 with multiple candidates.
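As a quick consistency check, the reported counts can be added up in a few lines of Python. The variable names are chosen for this example only; the numbers are those given in the text.

```python
# Counts copied from the text.
word_forms = 6396
unknown = 359
analysed = 6037
unambiguous = 2291
derivational_only = 1033
multiple_candidates = 2713

# Unknown and analysed forms should account for every word form.
assert unknown + analysed == word_forms
# The three classes should account for every analysed form.
assert unambiguous + derivational_only + multiple_candidates == analysed

# Forms that can be linked to a lemma without reading the context.
linkable = unambiguous + derivational_only
share = round(100 * linkable / analysed, 1)
print(linkable, share)
```

Both sums hold, so the three classes partition the analysed forms without overlap or remainder.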

The unambiguous analyses can be connected with occurrences immediately. To achieve that, we will use the @lemmaRef attribute and the LEMLAT id_lemma unique identifier.

The derivationally ambiguous lemmata point to the same LEMLAT id_lemma (good design!) and they can be connected immediately too.

The remaining ambiguous analyses (as well as the 359 unknown forms) have to be inspected in context, and probabilities will be assigned -- in view of reusing them elsewhere.

Connect the unambiguous LEMLAT analyses with their occurrences in the Defensio

Create the def-lemlat db holding the defensio-LEMLAT-TEI-manual.xml file
Use the modr-def-texts and the def-lemlat dbs
Retrieve all unambiguous (and only derivationally ambiguous) forms and their @n attribute values
For each form, find occurrences in the word-tokenized edition of the Defensio (what happens with -que? LEMLAT lemmatizes them dismissing the enclitic)
To each occurrence of the form, add the @lemmaRef attribute (holding the lemlat id from @n)
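The steps above can be sketched in Python with the standard library. This is an illustration under stated assumptions, not the repository's XQuery: the mapping from form to LEMLAT id is passed in as a plain dictionary, the TEI namespace is left out, the normalisation (lower case, v to u, j to i) is assumed, and the -que question raised above is not handled.

```python
import xml.etree.ElementTree as ET

def add_lemma_refs(edition_root, form_to_id):
    """Add a lemmaRef attribute to every <w> whose form has a known lemma id.

    `form_to_id` is a hypothetical dict: normalised form -> LEMLAT id.
    Returns the number of <w> elements that were annotated.
    """
    added = 0
    for w in edition_root.iter("w"):
        # Assumed normalisation, mirroring the one used for the word list.
        form = (w.text or "").lower().replace("v", "u").replace("j", "i")
        lemma_id = form_to_id.get(form)
        # Only annotate tokens that have an id and are not annotated yet.
        if lemma_id is not None and w.get("lemmaRef") is None:
            w.set("lemmaRef", lemma_id)
            added += 1
    return added

# Invented sample data; the ids are placeholders, not real LEMLAT ids.
edition = ET.fromstring("<p><w>Vir</w> <w>et</w> <w>bonus</w></p>")
print(add_lemma_refs(edition, {"uir": "id-1", "bonus": "id-2"}))
print(ET.tostring(edition, encoding="unicode"))
```

Tokens whose form is not in the dictionary are left untouched, so they remain available for the later, manual pass.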

This leaves 9,165 occurrences unannotated (out of 15,012 tokens); 5,847 occurrences were connected with their LEMLAT lemma reference.

Because the "derived" analyses point to the same lemma, about 1,400 further identifications were added, so that in the second pass 7,270 out of 15,012 tokens were annotated.

After some minor corrections, there are 15,019 tokens in the Defensio, of which 7,746 have no lemma annotations (yet).
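The token-level figures can be turned into coverage percentages with a small Python sketch. The helper name is an assumption; the counts are the ones reported in the text.

```python
def coverage(annotated, total):
    """Percentage of tokens carrying a lemma reference, to one decimal place."""
    return round(100 * annotated / total, 1)

# Figures copied from the text.
first_pass = coverage(5847, 15012)
second_pass = coverage(7270, 15012)

# After the corrections the text reports the unannotated tokens, so subtract.
annotated_after_corrections = 15019 - 7746
after_corrections = coverage(annotated_after_corrections, 15019)

print(first_pass, second_pass, after_corrections)
```

The automatic passes therefore cover a little under half of the running text, which is why the remaining forms are handled by frequency.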

Annotate the forms with ambiguous parses

Create a list of unannotated w elements
Count occurrences, add them to the list
Work first with forms with only one occurrence (the easier problem, less work)
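A minimal Python sketch of this listing step follows. It assumes that unannotated tokens are w elements without a lemmaRef attribute and ignores the TEI namespace; the function name is hypothetical and the sample is invented.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def unannotated_form_counts(edition_root):
    """List (form, count) pairs for <w> elements that lack lemmaRef."""
    counts = Counter(
        (w.text or "").lower()
        for w in edition_root.iter("w")
        if w.get("lemmaRef") is None
    )
    # Ascending frequency, then alphabetical: single occurrences come first.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))

sample = ET.fromstring(
    '<p><w>est</w><w lemmaRef="id-1">uir</w><w>est</w><w>cum</w></p>'
)
print(unannotated_form_counts(sample))
```

Sorting by ascending count puts the forms with a single occurrence at the top of the work list.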

Analyses annotated as multilem but really deriv

It turns out that a number of analyses lead to the same lemma through different interpGrp elements. After changing their @ana value to deriv, the number of unannotated w elements is down to 7,200.
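The check for such cases might look like the Python sketch below. The structure is an assumption made for the example: each analysed form is an entry element with an @ana value, containing interpGrp elements whose @n holds the LEMLAT id. The real TEI fragment may be organised differently.

```python
import xml.etree.ElementTree as ET

def relabel_same_lemma(root):
    """Change ana="multilem" to "deriv" where all interpGrp share one lemma id.

    Returns the number of entries that were relabelled.
    """
    changed = 0
    for entry in root.iter("entry"):  # "entry" is a hypothetical wrapper element
        if entry.get("ana") != "multilem":
            continue
        lemma_ids = {grp.get("n") for grp in entry.iter("interpGrp")}
        # Exactly one distinct, non-missing id means no real lemma ambiguity.
        if len(lemma_ids) == 1 and None not in lemma_ids:
            entry.set("ana", "deriv")
            changed += 1
    return changed

# Invented sample; ids are placeholders.
sample = ET.fromstring(
    '<list>'
    '<entry ana="multilem"><interpGrp n="a1"/><interpGrp n="a1"/></entry>'
    '<entry ana="multilem"><interpGrp n="b1"/><interpGrp n="b2"/></entry>'
    '</list>'
)
print(relabel_same_lemma(sample))
```

Entries that still have two or more distinct ids keep their multilem label and stay on the list for manual inspection.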

Forms with multiple analyses, but only one occurrence

We start by reading the occurrence in context and retrieving the LEMLAT set of analyses (in our TEI-transformed version). This is done with the XQuery script retrieve-LEMLAT-and-occurrence.xq.

The @cert attribute of the appropriate analysis is then updated to "100", and the others are set to zero. This is done with the XQuery script replace-lemlat-analysis-with-certainties.xq, which uses the XQuery Update "transform with" expression to change several values in a node at once. The updated node is then written back to the database, and the file is exported to the vocabulary directory.
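The update can be illustrated in Python as follows. This is a sketch, not the script itself: it assumes that each candidate analysis is an interpGrp element identified by @n and carrying @cert, and that the candidates sit inside a hypothetical entry element. Like the XQuery expression, the function works on a copy and returns it, leaving the input node unchanged.

```python
import copy
import xml.etree.ElementTree as ET

def set_certainties(entry, chosen_n):
    """Return a copy of `entry` with cert="100" on the chosen analysis, "0" on the rest."""
    updated = copy.deepcopy(entry)  # modify a copy, not the original node
    found = False
    for grp in updated.iter("interpGrp"):
        if grp.get("n") == chosen_n:
            grp.set("cert", "100")
            found = True
        else:
            grp.set("cert", "0")
    if not found:
        raise ValueError(f"no analysis with n={chosen_n!r}")
    return updated

# Invented sample; ids are placeholders.
entry = ET.fromstring('<entry><interpGrp n="a1"/><interpGrp n="a2"/></entry>')
result = set_certainties(entry, "a2")
print(ET.tostring(result, encoding="unicode"))
```

Raising an error when the chosen id is absent guards against zeroing every analysis because of a mistyped identifier.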