Morphology files - adamb924/mortal-engine GitHub Wiki
Morphology files specify the morphological structure of the language. Here is an example of a minimal morphology file (examples/00-Minimal.xml):
<?xml version="1.0" encoding="UTF-8"?>
<morphology
    xmlns="https://www.adambaker.org/mortal-engine"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://www.adambaker.org/mortal-engine morphology.xsd">
  <writing-systems src="writing-systems.xml"/>
  <model label="All Stems">
    <!-- stem-list is the only node in the model -->
    <!-- this model will only accept words that are bare stems -->
    <stem-list label="Stem">
      <filename>01-stems.xml</filename>
    </stem-list>
  </model>
</morphology>
In the <morphology> tag there are two child elements:
- <writing-systems> has information about the writing systems of the file, which in this case is a link to a separate file. Writing systems are discussed in more detail elsewhere in this wiki.
- <model> is the sole model in this morphology. This is where the magic happens. In this minimal model there is just a single <stem-list> element.
A variety of things can go inside the <morphology> tag. I've put them into three groups below.
A stem list is a list of stems from the language, i.e., the lexicon. There are two types of stem list. The one above is a simple <stem-list>; it loads a list of stems that is specified in another XML file (01-stems.xml above). The other type is <sqlite-stem-list>, which reads the stems from a SQLite database.
The <morpheme> is the superstar of any morphological model, as you'd expect. (Suggested hook for your Linguistics 101 term paper: “Morphemes are very important to morphology.”) Morphemes and stem lists are the only nodes that “eat” segments in a parsing (or “produce” them in a generation).
The <mutually-exclusive-morphemes> tag is used when the language offers a choice of exactly one morpheme from a set. For instance, you generally just get one possessive morpheme. So a <mutually-exclusive-morphemes> element would contain each possessive morpheme: first singular, first plural, second singular, etc.
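The possessive case might be sketched roughly as follows. This is an illustration under assumptions, not a copy of anything in the examples directory: the label values and the placeholder forms are made up, and I'm assuming that <mutually-exclusive-morphemes> takes a label attribute the way <morpheme> does. Check morphology.xsd before relying on it.

```xml
<!-- Sketch only: labels and forms are hypothetical placeholders -->
<mutually-exclusive-morphemes label="Possessive">
  <!-- at most one of these morphemes can appear in this slot -->
  <morpheme label="Poss1Sg">
    <allomorph>
      <form lang="wk-LA">Poss1Sg</form>
    </allomorph>
  </morpheme>
  <morpheme label="Poss1Pl">
    <allomorph>
      <form lang="wk-LA">Poss1Pl</form>
    </allomorph>
  </morpheme>
  <morpheme label="Poss2Sg">
    <allomorph>
      <form lang="wk-LA">Poss2Sg</form>
    </allomorph>
  </morpheme>
</mutually-exclusive-morphemes>
```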
A <fork> is a fork in the parsing. It contains one or more paths. This is useful when a derivation can take two different paths, for instance a verb stem that could end up as a simple infinitive, or as a finite form with several more affixes tacked on at the end.
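Here is a rough sketch of the infinitive-versus-finite situation. Treat the details as assumptions: I'm guessing that each path is written as a <path> child element and that <fork> and <path> accept a label attribute, and the morpheme labels and forms are placeholders. The schema (morphology.xsd) is the authority here.

```xml
<!-- Sketch only: the <path> element name, labels, and forms are assumptions -->
<fork label="Verb Endings">
  <!-- first path: the stem takes a single infinitive suffix -->
  <path label="Infinitive">
    <morpheme label="Infinitive">
      <allomorph>
        <form lang="wk-LA">Inf</form>
      </allomorph>
    </morpheme>
  </path>
  <!-- second path: the stem takes the finite affixes instead -->
  <path label="Finite">
    <morpheme label="Tense">
      <allomorph>
        <form lang="wk-LA">Tense</form>
      </allomorph>
    </morpheme>
    <morpheme label="Agreement">
      <allomorph>
        <form lang="wk-LA">Agr</form>
      </allomorph>
    </morpheme>
  </path>
</fork>
```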
A <sequence> is like a dumb fork: a sequence of morphemes (or other nodes) that can occur as a group, or not.
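As a sketch of the idea, a sequence that groups two suffixes might look something like this. The label on <sequence> and the placeholder morphemes are assumptions for illustration only, and I've left out any attributes that control whether the group is optional, since those should be taken from the schema rather than from me.

```xml
<!-- Sketch only: labels and forms are hypothetical placeholders -->
<sequence label="Plural and Case">
  <!-- these two morphemes occur together, in this order -->
  <morpheme label="Plural">
    <allomorph>
      <form lang="wk-LA">Plural</form>
    </allomorph>
  </morpheme>
  <morpheme label="Case">
    <allomorph>
      <form lang="wk-LA">Case</form>
    </allomorph>
  </morpheme>
</sequence>
```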
A <jump> jumps you to another node in the morphology: anywhere in the morphology, not just in the same model. So, within a verb model, you could have a parsing that hits a nominalizer, and then use a <jump> tag to jump over to a node in the noun model, so that you can append your nominal morphology.
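The nominalizer scenario could be sketched along these lines. The to and id attributes are my assumption about how a jump names its target, and the "noun-case" identifier, the Nominalizer morpheme, and its form are invented for the illustration; the stem lists reuse 01-stems.xml and the noun/verb tags from the examples directory. Verify the attribute names against morphology.xsd.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the 'to' and 'id' attributes and the Nominalizer morpheme are assumptions -->
<morphology xmlns="https://www.adambaker.org/mortal-engine">
  <writing-systems src="writing-systems.xml"/>
  <model label="Verbs">
    <stem-list label="Stem">
      <filename>01-stems.xml</filename>
      <matching-tag>verb</matching-tag>
    </stem-list>
    <morpheme label="Nominalizer">
      <allomorph>
        <form lang="wk-LA">Nmlz</form>
      </allomorph>
    </morpheme>
    <!-- after the nominalizer, continue parsing in the noun model -->
    <jump to="noun-case"/>
  </model>
  <model label="Nouns">
    <stem-list label="Stem">
      <filename>01-stems.xml</filename>
      <matching-tag>noun</matching-tag>
    </stem-list>
    <!-- hypothetical target of the jump above -->
    <morpheme label="Case" id="noun-case">
      <allomorph>
        <form lang="wk-LA">Case</form>
      </allomorph>
    </morpheme>
  </model>
</morphology>
```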
Most (all?) languages will require more than one morphological model. Below is an example (from examples/09-Multiple-Models.xml) of a morphology with two models, each of which selects different stems from the stem list and adds a different suffix to them.
<?xml version="1.0" encoding="UTF-8"?>
<morphology
    xmlns="https://www.adambaker.org/mortal-engine"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://www.adambaker.org/mortal-engine morphology.xsd">
  <writing-systems src="writing-systems.xml"/>
  <!-- this model is for nouns, which can take the 'Case' suffix -->
  <model label="Nouns">
    <stem-list label="Stem">
      <filename>01-stems.xml</filename>
      <matching-tag>noun</matching-tag>
    </stem-list>
    <morpheme label="Case">
      <allomorph>
        <form lang="wk-LA">Case</form>
      </allomorph>
    </morpheme>
  </model>
  <!-- this model is for verbs, which can take the 'Tense' suffix -->
  <model label="Verbs">
    <stem-list label="Stem">
      <filename>01-stems.xml</filename>
      <matching-tag>verb</matching-tag>
    </stem-list>
    <morpheme label="Tense">
      <allomorph>
        <form lang="wk-LA">Tense</form>
      </allomorph>
    </morpheme>
  </model>
</morphology>
The result is that the “Tense” suffix can occur only after verbs, and the “Case” suffix can occur only after nouns.
Success: The input bilTense (wk-LA) was accepted by the model, which is correct. [Stem][Tense]
Success: The input bilCase (wk-LA) was rejected by the model, which is correct.
Success: The input ataTense (wk-LA) was rejected by the model, which is correct.
Success: The input ataCase (wk-LA) was accepted by the model, which is correct. [Stem][Case]