Defensio vocabulary - nevenjovanovic/modruski-temrezah GitHub Wiki
Defensio ecclesiasticae libertatis - a vocabulary analysis
A list of unique word forms
Create a list of unique word forms with the XQuery script defensio-create-word-list.xq. The same script normalises capitalisation and the v / u and j / i spellings.
The result is in vocabulary/defensio-word-list.txt.
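The normalisation step can be illustrated with a minimal Python sketch. It is not the XQuery script itself: the function names, the regular-expression tokenisation and the direction of the replacement (v to u, j to i) are assumptions made for illustration.

```python
import re

def normalise(form: str) -> str:
    """Lower-case a word form and merge the v/u and j/i spellings."""
    # Assumption: v is mapped to u and j to i (not the other way round).
    return form.lower().replace("v", "u").replace("j", "i")

def unique_forms(text: str) -> list[str]:
    """Return the sorted list of unique normalised word forms in `text`."""
    # Assumption: a token is any run of letters; the real tokenisation may differ.
    tokens = re.findall(r"[^\W\d_]+", text)
    return sorted({normalise(t) for t in tokens})

sample = "Venit Iulius et uenit Julius"
word_list = unique_forms(sample)
print(word_list)  # ['et', 'iulius', 'uenit']
```

Normalising before deduplication keeps spelling variants of the same form from being counted as separate entries.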
LEMLAT results
Analyse with the XQuery script call-local-LEMLAT-defensio.xq.
The program analysed a list of unique forms (see above).
The analysis of 6,396 forms took 56 seconds.
- No Wordforms: 6396
- No Unknown: 359
- No Analysed: 6037
Analyse the unknown forms
After correcting a couple of incorrectly tokenized words, there are 15,019 tokens in the Defensio. Statistically insignificant, but philologically important.
The unknowns turn out to be:
- Names
- Uncertain readings from the source
- Incorrectly tokenized words
- Orthographic and morphological variants of lemmata in the LEMLAT base
- Forms belonging to lemmata missing from the LEMLAT base
Process the results into unambiguous and ambiguous analyses
Use the TEI scheme elements.
Process the LEMLAT XML results file (after correcting the double quote in the lemma attribute value) with the script LEMLAT-xml-into-TEI-fragment-defensio.xq.
Unambiguous vs. ambiguous parses
Of 6037 analysed word forms, there are 2291 unambiguously analysed; 1033 forms with only derivationally ambiguous lemmata; 2713 with multiple candidates.
The unambiguous analyses can be connected with occurrences immediately. To achieve that we will use the @lemmaRef attribute and the LEMLAT id_lemma unique identifier.
The derivationally ambiguous lemmata point to the same LEMLAT id_lemma (good design!) and they can be connected immediately too.
The remaining ambiguous analyses (as well as the 359 unknown forms) have to be inspected in context, and probabilities will be assigned -- in view of reusing them elsewhere.
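The three-way split described above can be sketched in Python as follows. The labels deriv and multilem are the @ana values used on this page; the labels "unambiguous" and "unknown", the function name, and the sample ids are hypothetical. The input is assumed to be the list of id_lemma values, one per candidate analysis of a single form.

```python
def classify(lemma_ids: list[str]) -> str:
    """Classify one word form by the LEMLAT lemma ids of its analyses."""
    if not lemma_ids:
        return "unknown"        # LEMLAT returned no analysis
    if len(lemma_ids) == 1:
        return "unambiguous"    # exactly one candidate
    if len(set(lemma_ids)) == 1:
        return "deriv"          # several analyses, all pointing to one id_lemma
    return "multilem"           # genuinely different candidate lemmata

# Hypothetical ids, for illustration only.
print(classify(["a0001"]))            # unambiguous
print(classify(["a0001", "a0001"]))   # deriv
print(classify(["a0001", "b0002"]))   # multilem
print(classify([]))                   # unknown
```

Forms in the first two classes can be linked to their occurrences without reading the context; the other two need manual inspection.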
Connect the unambiguous LEMLAT analyses with their occurrences in the Defensio
- Create the def-lemlat db holding the defensio-LEMLAT-TEI-manual.xml file
- Use the modr-def-texts and the def-lemlat dbs
- Retrieve all unambiguous (and only derivationally ambiguous) forms and their @n attribute values
- For each form, find occurrences in the word-tokenized edition of the Defensio (open question: what happens with forms ending in -que? LEMLAT lemmatizes them while discarding the enclitic)
- To each occurrence of the form, add the @lemmaRef attribute (holding the LEMLAT id from @n)
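The linking steps above can be sketched in Python with the standard library. This is an illustration only, not the procedure run against the databases: the lookup table, the lemma ids and the three-word TEI fragment are hypothetical, and the sketch does nothing special for enclitic -que.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)

# Hypothetical lookup: normalised form -> LEMLAT id (the value held in @n).
form_to_lemma = {"ecclesia": "e0123", "libertatem": "l0456"}

# Hypothetical word-tokenized fragment standing in for the edition.
edition = ET.fromstring(
    f'<text xmlns="{TEI}"><p>'
    '<w>Ecclesia</w> <w>libertatem</w> <w>defendit</w>'
    '</p></text>'
)

def annotate(root: ET.Element, lookup: dict[str, str]) -> int:
    """Add @lemmaRef to every w element whose form is in `lookup`."""
    count = 0
    for w in root.iter(f"{{{TEI}}}w"):
        # Apply the same normalisation as the word list (assumed: v->u, j->i).
        form = (w.text or "").lower().replace("v", "u").replace("j", "i")
        if form in lookup and w.get("lemmaRef") is None:
            w.set("lemmaRef", lookup[form])
            count += 1
    return count

annotated = annotate(edition, form_to_lemma)
print(annotated, "occurrences annotated")
print(ET.tostring(edition, encoding="unicode"))
```

Occurrences whose form is missing from the lookup, here defendit, are left untouched so that they can be collected later as unannotated w elements.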
This leaves 9,165 occurrences unannotated (out of 15,012 tokens); 5,847 occurrences were connected with their LEMLAT lemma reference.
Because the "derived" analyses point to the same lemma, another thousand or so identifications were added, so that in the second pass 7,270 out of 15,012 tokens were annotated.
After some minor corrections, there are 15,019 tokens in the Defensio, of which 7,746 have no lemma annotations (yet).
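The figures quoted above can be cross-checked with a few lines of Python. The variable names are hypothetical; the numbers are those given in the text.

```python
# Figures quoted in the text above.
tokens_before = 15_012
first_pass_annotated = 5_847
first_pass_unannotated = 9_165
second_pass_annotated = 7_270
tokens_after = 15_019
unannotated_after = 7_746

# The first-pass figures add up to the token count.
assert first_pass_annotated + first_pass_unannotated == tokens_before

# Identifications gained by also linking the derivationally ambiguous forms.
gain = second_pass_annotated - first_pass_annotated

# Annotated tokens and coverage after the minor corrections.
annotated_after = tokens_after - unannotated_after
coverage = annotated_after / tokens_after

print(gain, annotated_after, f"{coverage:.1%}")  # 1423 7273 48.4%
```

So just under half of the tokens carry a lemma reference before any ambiguous form has been resolved by hand.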
Annotate the forms with ambiguous parses
- Create a list of unannotated w elements
- Count occurrences, add them to the list
- Work first with forms with only one occurrence (the easier problem, less work)
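The worklist described in these steps can be sketched in Python as follows. The TEI fragment and the lemma id are hypothetical; the sketch assumes that an unannotated w element is one without a @lemmaRef attribute.

```python
from collections import Counter
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: one annotated and four unannotated w elements.
edition = ET.fromstring(
    f'<text xmlns="{TEI}"><p>'
    '<w lemmaRef="e0001">et</w> <w>cum</w> <w>esse</w> <w>cum</w> <w>amor</w>'
    '</p></text>'
)

# Step 1: collect the forms of all w elements without @lemmaRef.
unannotated = [
    (w.text or "").lower()
    for w in edition.iter(f"{{{TEI}}}w")
    if w.get("lemmaRef") is None
]

# Step 2: count the occurrences of each form.
counts = Counter(unannotated)

# Step 3: order the worklist so that single-occurrence forms come first
# (ties are broken alphabetically).
worklist = sorted(counts.items(), key=lambda item: (item[1], item[0]))
single = [form for form, n in worklist if n == 1]

print(worklist)  # [('amor', 1), ('esse', 1), ('cum', 2)]
print(single)    # ['amor', 'esse']
```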
Analyses annotated as multilem but really deriv
It turns out that a number of analyses lead to the same lemma through different interpGrp elements. After changing their @ana value to deriv, the number of unannotated w elements is down to 7,200.
Forms with multiple analyses, but only one occurrence
We start by reading the occurrence in context, and retrieving the LEMLAT set of analyses (in our TEI-transformed version). Done with the XQuery script retrieve-LEMLAT-and-occurrence.xq.
The @cert attribute of the appropriate analysis is then updated to "100", and the others are set to zero. This is done with the XQuery script replace-lemlat-analysis-with-certainties.xq. The script uses the transform with expression to change several values in a node at once. The updated node is then written back to the database, and the file is exported to the vocabulary directory.
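The certainty update can be illustrated with a minimal Python sketch. It is not the XQuery script: the element layout (a wrapper holding one interpGrp per candidate analysis, each with @n and @cert) and all ids are assumptions, and namespaces are left out for brevity. Like transform with, the function works on a copy and returns the modified node, leaving the original untouched.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical fragment: one form with two candidate analyses.
entry = ET.fromstring(
    '<ab type="form" n="canis">'
    '<interpGrp n="c0001" cert="50"/>'
    '<interpGrp n="c0002" cert="50"/>'
    '</ab>'
)

def with_certainties(node: ET.Element, chosen_id: str) -> ET.Element:
    """Return a copy of `node` with @cert set to 100 for the chosen
    analysis and to 0 for every other analysis."""
    updated = copy.deepcopy(node)
    for analysis in updated.iter("interpGrp"):
        analysis.set("cert", "100" if analysis.get("n") == chosen_id else "0")
    return updated

resolved = with_certainties(entry, "c0002")
certs = {a.get("n"): a.get("cert") for a in resolved.iter("interpGrp")}
print(certs)  # {'c0001': '0', 'c0002': '100'}
```

Returning a modified copy means the stored node is replaced only in a separate, explicit write step.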