Morphology files - adamb924/mortal-engine GitHub Wiki

Morphology files specify the morphological structure of the language. Here is an example of a minimal morphology file (examples/00-Minimal.xml):

<?xml version="1.0" encoding="UTF-8"?>
<morphology
    xmlns="https://www.adambaker.org/mortal-engine"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://www.adambaker.org/mortal-engine morphology.xsd">
    <writing-systems src="writing-systems.xml"/>
    <model label="All Stems">
        <!-- stem-list is the only node in the model -->
        <!-- this model will only accept words that are bare stems -->
        <stem-list label="Stem">
            <filename>01-stems.xml</filename>
        </stem-list>
    </model>
</morphology>

In the <morphology> tag there are two child elements:

  • <writing-systems> has information about the writing systems used in the file, which in this case is a reference to a separate file (writing-systems.xml). Writing systems are discussed in more detail elsewhere in the wiki.
  • <model> is the sole model in this morphology. This is where the magic happens. In this minimal model there is just a single <stem-list> element.

A variety of things can go inside the <morphology> tag. I've put them into three groups below.

Stem Lists

A stem list is a list of stems from the language, i.e., the lexicon. There are two types of stem lists. The one above is a simple <stem-list>; it loads a list of stems that is specified in another XML file (01-stems.xml above). The other type is <sqlite-stem-list>, which reads the stems from a SQLite database.

Morphemes

The <morpheme> is the superstar of any morphological model, as you'd expect. (Suggested hook for your Linguistics 101 term paper: “Morphemes are very important to morphology.”) Morphemes and stem lists are the only nodes that “eat” segments in a parsing (or “produce” them in a generation).

Control structures (forks, paths, etc.)

The <mutually-exclusive-morphemes> tag is used when the language offers a choice of one of a set of morphemes. For instance, you generally just get one possessive morpheme. So a <mutually-exclusive-morphemes> would have each possessive morpheme: first singular, first plural, second singular, etc.
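
As a rough sketch, a set of possessive suffixes might be grouped like this. The <morpheme>, <allomorph>, and <form> elements follow the pattern used in the examples on this page. The label attribute on <mutually-exclusive-morphemes>, the morpheme labels, and the placeholder forms are assumptions made for illustration, so check morphology.xsd for the exact syntax.

```xml
<!-- sketch only: labels and forms are hypothetical placeholders -->
<mutually-exclusive-morphemes label="Possessive">
    <morpheme label="1SgPoss">
        <allomorph>
            <form lang="wk-LA">1SgPoss</form>
        </allomorph>
    </morpheme>
    <morpheme label="1PlPoss">
        <allomorph>
            <form lang="wk-LA">1PlPoss</form>
        </allomorph>
    </morpheme>
    <morpheme label="2SgPoss">
        <allomorph>
            <form lang="wk-LA">2SgPoss</form>
        </allomorph>
    </morpheme>
</mutually-exclusive-morphemes>
```

The intent is that a parsing passes through at most one of the listed morphemes, never two in a row.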

A <fork> is a fork in the parsing. It contains one or more paths. This is useful for when there are two different paths a derivation can take, for instance a verb stem that could end up as a simple infinitive, or as a finite form with several more affixes tacked on at the end.
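
The infinitive-versus-finite case could be sketched roughly as follows. The text above says only that a fork contains paths, so the <path> element name, the label attributes on <fork> and <path>, and the morpheme labels and forms are all assumptions here, not confirmed syntax.

```xml
<!-- sketch only: the <path> element name and all labels are assumptions -->
<fork label="Verb Endings">
    <!-- first path: a bare infinitive suffix -->
    <path label="Infinitive">
        <morpheme label="Infinitive">
            <allomorph>
                <form lang="wk-LA">Inf</form>
            </allomorph>
        </morpheme>
    </path>
    <!-- second path: a finite form with more than one suffix -->
    <path label="Finite">
        <morpheme label="Tense">
            <allomorph>
                <form lang="wk-LA">Tense</form>
            </allomorph>
        </morpheme>
        <morpheme label="Agreement">
            <allomorph>
                <form lang="wk-LA">Agr</form>
            </allomorph>
        </morpheme>
    </path>
</fork>
```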

A <sequence> is like a dumb fork: a sequence of morphemes (or other nodes) that can occur as a group, or not.
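
A sequence might look something like the sketch below, which groups two suffixes so that they are treated as a unit. The label attribute on <sequence> and the morpheme labels and forms are hypothetical, and the sketch does not show how the group is marked as optional, since that syntax is not given on this page.

```xml
<!-- sketch only: labels and forms are hypothetical placeholders -->
<sequence label="Plural Plus Case">
    <morpheme label="Plural">
        <allomorph>
            <form lang="wk-LA">Plural</form>
        </allomorph>
    </morpheme>
    <morpheme label="Case">
        <allomorph>
            <form lang="wk-LA">Case</form>
        </allomorph>
    </morpheme>
</sequence>
```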

A <jump> jumps you to another node in the morphology—anywhere in the morphology, not just in the same model. So, within a verb model, you could have a parsing that hits a nominalizer, and then use a <jump> tag to jump over to a node in the noun model, so that you can append your nominal morphology.
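
The nominalizer scenario could be sketched as below. How a jump names its target is not stated on this page, so the to attribute on <jump> and the id attribute on the target node are assumptions made purely for illustration, as are the labels and forms. Consult morphology.xsd for the real attribute names.

```xml
<!-- sketch only: 'to' and 'id' are assumed attribute names -->
<model label="Verbs">
    <stem-list label="Stem">
        <filename>01-stems.xml</filename>
        <matching-tag>verb</matching-tag>
    </stem-list>
    <morpheme label="Nominalizer">
        <allomorph>
            <form lang="wk-LA">Nmlz</form>
        </allomorph>
    </morpheme>
    <!-- continue the parsing at a node in the noun model -->
    <jump to="noun-case"/>
</model>
<model label="Nouns">
    <stem-list label="Stem">
        <filename>01-stems.xml</filename>
        <matching-tag>noun</matching-tag>
    </stem-list>
    <morpheme label="Case" id="noun-case">
        <allomorph>
            <form lang="wk-LA">Case</form>
        </allomorph>
    </morpheme>
</model>
```

Under these assumptions, a verb stem followed by the nominalizer would go on to take the same Case suffix that ordinary noun stems take, without the noun morphology being duplicated inside the verb model.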

Multiple Models

Most (all?) languages will require more than one morphological model. Below is an example (from examples/09-Multiple-Models.xml) with two models. Each model selects different stems from the same stem list, using <matching-tag>, and adds a different suffix to them.

<?xml version="1.0" encoding="UTF-8"?>
<morphology
    xmlns="https://www.adambaker.org/mortal-engine"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://www.adambaker.org/mortal-engine morphology.xsd">
    <writing-systems src="writing-systems.xml"/>
    <!-- this model is for nouns, which can take the 'Case' suffix -->
    <model label="Nouns">
        <stem-list label="Stem">
            <filename>01-stems.xml</filename>
            <matching-tag>noun</matching-tag>
        </stem-list>
        <morpheme label="Case">
            <allomorph>
                <form lang="wk-LA">Case</form>
            </allomorph>
        </morpheme>
    </model>
    <!-- this model is for verbs, which can take the 'Tense' suffix -->
    <model label="Verbs">
        <stem-list label="Stem">
            <filename>01-stems.xml</filename>
            <matching-tag>verb</matching-tag>
        </stem-list>
        <morpheme label="Tense">
            <allomorph>
                <form lang="wk-LA">Tense</form>
            </allomorph>
        </morpheme>
    </model>
</morphology>

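The result is that the “Tense” suffix can occur only after verbs, and the “Case” suffix can occur only after nouns. The test output below shows this; judging from the results, bil is tagged as a verb and ata as a noun.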

Success: The input bilTense (wk-LA) was accepted by the model, which is correct. [Stem][Tense]
Success: The input bilCase (wk-LA) was rejected by the model, which is correct. 
Success: The input ataTense (wk-LA) was rejected by the model, which is correct. 
Success: The input ataCase (wk-LA) was accepted by the model, which is correct. [Stem][Case]