Morphology files - adamb924/mortal-engine GitHub Wiki
Morphology files specify the morphological structure of the language. Here is an example of a minimal morphology file (examples/00-Minimal.xml):
<?xml version="1.0" encoding="UTF-8"?>
<morphology
    xmlns="https://www.adambaker.org/mortal-engine"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://www.adambaker.org/mortal-engine morphology.xsd">
    <writing-systems src="writing-systems.xml"/>
    <model label="All Stems">
        <!-- stem-list is the only node in the model -->
        <!-- this model will only accept words that are bare stems -->
        <stem-list label="Stem">
            <filename>01-stems.xml</filename>
        </stem-list>
    </model>
</morphology>

In the <morphology> tag there are two child elements:
- <writing-systems> has information about the writing systems of the file, which in this case is a link to a separate file. This is discussed in more detail elsewhere in this wiki.
- <model> is the sole model in this morphology. This is where the magic happens. This simple model contains just a single <stem-list> element.
A variety of things can go inside the <morphology> tag. I've put them into three groups below.
A stem list is a list of stems from the language, i.e., the lexicon. There are two types of stem list. The one above is a simple <stem-list>; it loads a list of stems that is specified in another XML file (01-stems.xml above). The other type is <sqlite-stem-list>, which reads the stems from a SQLite database.
The <morpheme> is the superstar of any morphological model, as you'd expect. (Suggested hook for your Linguistics 101 term paper: “Morphemes are very important to morphology.”) Morphemes and stem lists are the only nodes that “eat” segments in a parsing (or “produce” them in a generation).
The <mutually-exclusive-morphemes> tag is used when the language offers a choice of one of a set of morphemes. For instance, you generally just get one possessive morpheme. So a <mutually-exclusive-morphemes> would have each possessive morpheme: first singular, first plural, second singular, etc.
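As a rough sketch of how this might look, the fragment below groups three possessive suffixes so that a parsing picks one of them. The element names are the ones used on this page, but the label attribute on <mutually-exclusive-morphemes> is assumed by analogy with the other nodes, and the morpheme labels and forms are invented placeholders, not data from any example file.

```xml
<!-- Hypothetical sketch: one of these possessive suffixes is chosen. -->
<!-- ASSUMPTION: the label attribute on the wrapper; all forms are invented. -->
<mutually-exclusive-morphemes label="Possessive">
    <morpheme label="1SG.POSS">
        <allomorph>
            <form lang="wk-LA">Poss1sg</form>
        </allomorph>
    </morpheme>
    <morpheme label="1PL.POSS">
        <allomorph>
            <form lang="wk-LA">Poss1pl</form>
        </allomorph>
    </morpheme>
    <morpheme label="2SG.POSS">
        <allomorph>
            <form lang="wk-LA">Poss2sg</form>
        </allomorph>
    </morpheme>
</mutually-exclusive-morphemes>
```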
A <fork> is a fork in the parsing. It contains one or more paths. This is useful for when there are two different paths a derivation can take, for instance a verb stem that could end up as a simple infinitive, or as a finite form with several more affixes tacked on at the end.
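The infinitive-versus-finite case could be sketched roughly as follows. This is only an illustration: the <path> element name and the label attributes on <fork> and <path> are assumptions based on the description above, and the morpheme labels and forms are made up.

```xml
<!-- Hypothetical sketch: after the verb stem, the parsing follows one path. -->
<!-- ASSUMPTION: the path element name and the label attributes. -->
<fork label="Verb endings">
    <!-- Path 1: a simple infinitive -->
    <path label="Infinitive">
        <morpheme label="Infinitive">
            <allomorph>
                <form lang="wk-LA">Inf</form>
            </allomorph>
        </morpheme>
    </path>
    <!-- Path 2: a finite form with more than one affix -->
    <path label="Finite">
        <morpheme label="Tense">
            <allomorph>
                <form lang="wk-LA">Tense</form>
            </allomorph>
        </morpheme>
        <morpheme label="Person">
            <allomorph>
                <form lang="wk-LA">Person</form>
            </allomorph>
        </morpheme>
    </path>
</fork>
```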
A <sequence> is like a dumb fork: a sequence of morphemes (or other nodes) that can occur as a group, or not.
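A minimal sketch of a sequence, assuming it simply wraps the nodes that belong together, might look like this. The label attribute on <sequence> is an assumption and the two morphemes are invented; how the group is marked as optional is not covered on this page, so it is left out here.

```xml
<!-- Hypothetical sketch: two morphemes treated as a single group. -->
<!-- ASSUMPTION: the label attribute on sequence; forms are invented. -->
<sequence label="Diminutive plural">
    <morpheme label="Diminutive">
        <allomorph>
            <form lang="wk-LA">Dim</form>
        </allomorph>
    </morpheme>
    <morpheme label="Plural">
        <allomorph>
            <form lang="wk-LA">Pl</form>
        </allomorph>
    </morpheme>
</sequence>
```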
A <jump> jumps you to another node in the morphology—anywhere in the morphology, not just in the same model. So, within a verb model, you could have a parsing that hits a nominalizer, and then use a <jump> tag to jump over to a node in the noun model, so that you can append your nominal morphology.
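The nominalizer scenario might be sketched like this. The attribute names used to connect the jump to its target (to on <jump> and id on the target node) are assumptions made for illustration, so check them against morphology.xsd; the Nominalizer morpheme and its form are likewise invented.

```xml
<!-- Hypothetical sketch: a verb model that hands off to nominal morphology. -->
<model label="Verbs">
    <stem-list label="Stem">
        <filename>01-stems.xml</filename>
        <matching-tag>verb</matching-tag>
    </stem-list>
    <morpheme label="Nominalizer">
        <allomorph>
            <form lang="wk-LA">Nmlz</form>
        </allomorph>
    </morpheme>
    <!-- ASSUMPTION: the attribute name 'to' and targeting a node by id -->
    <jump to="noun-case"/>
</model>

<model label="Nouns">
    <stem-list label="Stem">
        <filename>01-stems.xml</filename>
        <matching-tag>noun</matching-tag>
    </stem-list>
    <!-- ASSUMPTION: the id attribute marks this node as a jump target -->
    <morpheme label="Case" id="noun-case">
        <allomorph>
            <form lang="wk-LA">Case</form>
        </allomorph>
    </morpheme>
</model>
```

Under these assumptions, a nominalized verb would continue parsing at the Case morpheme in the noun model instead of ending in the verb model.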
Most (all?) languages will require more than one morphological model. Below is an example (from examples/09-Multiple-Models.xml) with two models, each of which selects different stems from the same stem list and adds a different suffix to them.
<?xml version="1.0" encoding="UTF-8"?>
<morphology
    xmlns="https://www.adambaker.org/mortal-engine"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://www.adambaker.org/mortal-engine morphology.xsd">
    <writing-systems src="writing-systems.xml"/>
    <!-- this model is for nouns, which can take the 'Case' suffix -->
    <model label="Nouns">
        <stem-list label="Stem">
            <filename>01-stems.xml</filename>
            <matching-tag>noun</matching-tag>
        </stem-list>
        <morpheme label="Case">
            <allomorph>
                <form lang="wk-LA">Case</form>
            </allomorph>
        </morpheme>
    </model>
    <!-- this model is for verbs, which can take the 'Tense' suffix -->
    <model label="Verbs">
        <stem-list label="Stem">
            <filename>01-stems.xml</filename>
            <matching-tag>verb</matching-tag>
        </stem-list>
        <morpheme label="Tense">
            <allomorph>
                <form lang="wk-LA">Tense</form>
            </allomorph>
        </morpheme>
    </model>
</morphology>

The result is that the “Tense” suffix can occur only after verbs, and the “Case” suffix can occur only after nouns.
Success: The input bilTense (wk-LA) was accepted by the model, which is correct. [Stem][Tense]
Success: The input bilCase (wk-LA) was rejected by the model, which is correct. 
Success: The input ataTense (wk-LA) was rejected by the model, which is correct. 
Success: The input ataCase (wk-LA) was accepted by the model, which is correct. [Stem][Case]