Affix rules: Lingala verb example - TrnsltLife/HunspellXML GitHub Wiki
This example will guide you through the process involved in creating HunspellXML affixation rules for a simplified analysis of regular Lingala verbs.
A sample HunspellXML file that uses these affix rules can be found here.
When designing your affixation rules, it can be very helpful to first create position class charts that describe the morphology of the different classes of words in your language (e.g. nouns, verbs, etc.).
The process of creating a position class chart is described in Paul Kroeger's book Analyzing Grammar: An Introduction.
Let's assume that a linguist has analyzed Lingala and presented us with the following position class chart. With this in hand, we're now ready to begin creating the HunspellXML that will allow the spell checker to create valid spelling rules based on this information.
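Subject (-3) | Tense (-2) | Reflexive (-1) | Verb Root (0) | Extension (+1) | Tense/Aspect (+2)
na-, o-, a-, e-, to-, bo-, ba- | Ø- | Ø-, mi- | (verb root) | Ø, -ol, -is, -el, -am, -an | -a, -i, -aka, -aki
na-, o-, a-, e-, to-, bo-, ba- | ko- (future) | Ø-, mi- | (verb root) | Ø, -ol, -is, -el, -am, -an | -a
Ø-, bo- (imperative) | Ø- | Ø-, mi- | (verb root) | Ø, -ol, -is, -el, -am, -an | -a
Ø- | ko- (infinit.) | Ø-, mi- | (verb root) | Ø, -ol, -is, -el, -am, -an | -a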
A more schematic version of this table might look like this:
Prefix 3 | | Prefix 2 | | Prefix 1 | | Root | | Suffix 1 | | Suffix 2
[Subject] | → | [Nothing] | → | (Reflexive) | → | [Verb Root] | → | (Extension) | → | [Tense/Aspect]
[Subject] | → | [Future] | → | (Reflexive) | → | [Verb Root] | → | (Extension) | → | [Final -a Vowel]
[Imperative Subject] | → | [Nothing] | → | (Reflexive) | → | [Verb Root] | → | (Extension) | → | [Final -a Vowel]
[Nothing] | → | [Infinitive] | → | (Reflexive) | → | [Verb Root] | → | (Extension) | → | [Final -a Vowel]
- Obligatory nodes are in [square brackets].
- Nodes in (parentheses) are optional.
- [Nothing] indicates a blank node (i.e. nothing goes there).
In working through this presentation of the data, you're meant to start in a cell at the left and work through on the same level all the way across the chart. The cell you end in should be in the same row as the cell you started in. Of course, the [Subject], (Reflexive), [Verb Root], and [Final -a Vowel] nodes occupy more than one row, but you should still be able to follow the flow of arrows at the level of a single row. That represents the valid spelling for all Lingala verbs in this model.
There is a problem with these charts though. We can't transfer these ideas directly into Hunspell, because Hunspell has a maximum limit of three affix slots that can be defined in addition to the root word in the dictionary. That's a problem because our chart above has 3 prefix slots and 2 suffix slots: a total of 5. (Hunspell also limits these 3 affixes to being a combination of either 2 prefixes and 1 suffix, or 1 prefix and 2 suffixes.)
Fortunately, we can cross-multiply the affixes from one slot with those of a neighboring slot, and so reduce the number of affix slots. This comes at the cost of having to specify more affix rules. For example, cross-multiplying the 7 "Subject" prefixes with the 2 affixes in the "Reflexive" slot (the null prefix and "mi-") results in 14 rules. Fortunately, HunspellXML makes it easier to specify these rules using the <multiply><group>...</group></multiply> syntax.
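To make the arithmetic concrete, the cross-multiplication can be sketched in a few lines of Python. This is only an illustration of the idea, not HunspellXML's actual implementation; the function name `cross_multiply` is made up for this example.

```python
from itertools import product

def cross_multiply(*groups):
    """Concatenate one morpheme from each group, in order, for every combination."""
    return ["".join(combo) for combo in product(*groups)]

# Subject and reflexive prefixes from the position class chart.
subjects = ["na", "o", "a", "e", "to", "bo", "ba"]
reflexives = ["", "mi"]  # "" stands for the null morpheme

prefixes = cross_multiply(subjects, reflexives)
print(len(prefixes))  # 7 subjects x 2 reflexives = 14
print(prefixes[:4])   # ['na', 'nami', 'o', 'omi']
```

Two short lists (9 entries in total) stand in for 14 combined prefixes, which is the saving the multiply syntax is meant to provide.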
In this chart, we combine neighboring affixes to further simplify the schematic chart above, leaving us with just a single prefix slot and a single suffix slot.
Prefix 1 | | Root | | Suffix 1
1. [Subject+Reflexive] (SU) | → | [Verb Root] | → | 5. [Extension+Tense/Aspect] (ET)
2. [Subject+Future+Reflexive] (SF) | → | [Verb Root] | → | 6. [Extension+Final -a Vowel] (EA)
3. [Imperative Subject+Reflexive] (IS) | → | [Verb Root] | → | 6. [Extension+Final -a Vowel] (EA)
4. [Infinitive+Reflexive] (IF) | → | [Verb Root] | → | 6. [Extension+Final -a Vowel] (EA)
As you can see, there are a total of 6 affixation rules that need to be defined.
A suffix is always required, while the prefix is null in some cases. We will work through the rules in the order they are numbered in the chart: the four prefix rules first, and then the two suffix rules that they must combine with.
Looking back at the Standard Position Chart, we can see that we need to create a rule that combines the subject prefix na/o/a/e/to/bo/ba with either a null or mi- prefix for the reflexive slot.
Since there are two prefix slots that we are combining together into one Hunspell affix slot, we need to use the HunspellXML <multiply><group>...</group></multiply> elements to describe these rules.
<prefix flag="SU" cross="true">
<multiply>
<group>
<!-- simple subject prefixes -->
<rule add="na" remove="0" where="." combineFlags="NA ET"/>
<rule add="o" remove="0" where="." combineFlags="NA ET"/>
<rule add="a" remove="0" where="." combineFlags="NA ET"/>
<rule add="e" remove="0" where="." combineFlags="NA ET"/>
<rule add="to" remove="0" where="." combineFlags="NA ET"/>
<rule add="bo" remove="0" where="." combineFlags="NA ET"/>
<rule add="ba" remove="0" where="." combineFlags="NA ET"/>
</group>
<group>
<!-- null morpheme or "mi-" reflexive morpheme -->
<rule add="" remove="0" where="." combineFlags="NA ET"/>
<rule add="mi" remove="0" where="." combineFlags="NA ET"/>
</group>
</multiply>
</prefix>
The basic usage of the <prefix> and <suffix> elements and of the <rule .../> element is described here.
In the code above:
- flag="SU" is the shorthand name for this rule, which can be referred to by a root word in the dictionary wordlist file and by other affix rules.
- cross="true" means that this prefix rule is allowed to combine with suffix rules, thus "crossing" over the root word in position slot 0.
- The <multiply> element signals that instead of a set of simple rules, this prefix will contain several groups of rules. Each group will be cross-multiplied with each other group to create a much longer list of affix rules.
- The <group> elements (there should be at least two) each contain a list of <rule .../> elements.
- In the <rule .../> element:
  - add="na" means that "na" should be added as a prefix to the word.
  - remove="0" means that nothing should be removed from the front of the word before adding the "na" prefix.
  - where="." is a simplified regular expression, indicating that the "na" can be added to any word no matter what character the word starts with.
  - combineFlags="NA ET" indicates which other prefix or suffix rules this rule may combine with.
    - In this case, the rule may combine with the "ET" rule. We'll define that rule below in Rule 5: [Extension+Tense/Aspect].
    - The "NA" presupposes that in the <settings> section of the affix file, we have specified a settings rule like this: <needAffix flag="NA"/>. The "needAffix" element creates a flag we can use to mean that the current word or affix rule cannot stand alone. It must be combined with another affix before it results in a valid word. You can see what this means when you look at the chart. It's not sufficient to have a position -3 slot (subject) and the root word. A tense suffix is also required. So the "combineFlags" attribute is saying that these rules must be combined with another affix rule to be valid, and that the (only) affix rule they can combine with in this case is "ET".
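As a rough mental model of the add, remove, and where attributes, here is a small Python sketch. The `PrefixRule` class and `apply_prefix` function are hypothetical names invented for this illustration, and treating where as an ordinary regular expression tested at the start of the word is an assumption; this is not how HunspellXML or Hunspell is actually implemented.

```python
import re
from dataclasses import dataclass

@dataclass
class PrefixRule:
    add: str     # text to put in front of the word
    remove: str  # text to strip from the front first; "0" means nothing
    where: str   # simplified regular expression tested at the start of the word

def apply_prefix(rule, word):
    """Return the prefixed word, or None if the rule does not apply."""
    if re.match(rule.where, word) is None:
        return None
    strip = "" if rule.remove == "0" else rule.remove
    if not word.startswith(strip):
        return None
    return rule.add + word[len(strip):]

rule = PrefixRule(add="na", remove="0", where=".")
print(apply_prefix(rule, "zal"))  # nazal
```

Note that "nazal" is only an intermediate result. Because the rule carries the "NA" flag, a suffix from the "ET" rule still has to be added before the form counts as a valid word.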
In this case, the prefix rule above is equivalent to writing out each individual rule like this:
<prefix flag="SU" cross="true">
<!-- simple subject prefixes -->
<rule add="na" remove="0" where="." combineFlags="NA ET"/>
<rule add="o" remove="0" where="." combineFlags="NA ET"/>
<rule add="a" remove="0" where="." combineFlags="NA ET"/>
<rule add="e" remove="0" where="." combineFlags="NA ET"/>
<rule add="to" remove="0" where="." combineFlags="NA ET"/>
<rule add="bo" remove="0" where="." combineFlags="NA ET"/>
<rule add="ba" remove="0" where="." combineFlags="NA ET"/>
<!-- subject prefixes plus the "mi-" reflexive morpheme -->
<rule add="nami" remove="0" where="." combineFlags="NA ET"/>
<rule add="omi" remove="0" where="." combineFlags="NA ET"/>
<rule add="ami" remove="0" where="." combineFlags="NA ET"/>
<rule add="emi" remove="0" where="." combineFlags="NA ET"/>
<rule add="tomi" remove="0" where="." combineFlags="NA ET"/>
<rule add="bomi" remove="0" where="." combineFlags="NA ET"/>
<rule add="bami" remove="0" where="." combineFlags="NA ET"/>
</prefix>
Now we create similar prefix multiply group rules for the subject + future morphemes. Note that the <prefix flag="SF" cross="true"> element is different from above. We're specifying a different rule, so it gets a different flag name: "SF" for "Subject + Future". (What you choose for the flag can be arbitrary as long as it is unique among your prefix and suffix definitions. But it's helpful to use a code that is a mnemonic for something, as in this case.)
Also note that while the "SU" prefix rules required ("NA") combination with the "ET" rule (Extension + Tense), the future rule requires ("NA") combination with the "EA" rule (Extension + A-vowel) which we'll define below.
<prefix flag="SF" cross="true">
<multiply>
<group>
<!-- Subject prefixes -->
<rule add="na" remove="0" where="." combineFlags="NA EA"/>
<rule add="o" remove="0" where="." combineFlags="NA EA"/>
<rule add="a" remove="0" where="." combineFlags="NA EA"/>
<rule add="e" remove="0" where="." combineFlags="NA EA"/>
<rule add="to" remove="0" where="." combineFlags="NA EA"/>
<rule add="bo" remove="0" where="." combineFlags="NA EA"/>
<rule add="ba" remove="0" where="." combineFlags="NA EA"/>
</group>
<group>
<!-- Future morpheme -->
<rule add="ko" remove="0" where="." combineFlags="NA EA"/>
</group>
<group>
<!-- null morpheme or "mi-" reflexive morpheme -->
<rule add="" remove="0" where="." combineFlags="NA EA"/>
<rule add="mi" remove="0" where="." combineFlags="NA EA"/>
</group>
</multiply>
</prefix>
The above prefix multiply group maintains a clear distinction between the three slots - subject, future, and reflexive. It would be fine though to combine the future "ko" slot with the reflexive "mi" slot to result in the following prefix multiply group.
<prefix flag="SF" cross="true">
<multiply>
<group>
<!-- Subject prefixes -->
<rule add="na" remove="0" where="." combineFlags="NA EA"/>
<rule add="o" remove="0" where="." combineFlags="NA EA"/>
<rule add="a" remove="0" where="." combineFlags="NA EA"/>
<rule add="e" remove="0" where="." combineFlags="NA EA"/>
<rule add="to" remove="0" where="." combineFlags="NA EA"/>
<rule add="bo" remove="0" where="." combineFlags="NA EA"/>
<rule add="ba" remove="0" where="." combineFlags="NA EA"/>
</group>
<group>
<!-- Future ko- morpheme + null morpheme or "mi-" reflexive morpheme -->
<rule add="ko" remove="0" where="." combineFlags="NA EA"/>
<rule add="komi" remove="0" where="." combineFlags="NA EA"/>
</group>
</multiply>
</prefix>
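That the two formulations are interchangeable can be checked with a short Python sketch. It models only the string concatenation performed by a multiply group; `cross_multiply` is a name invented for this illustration, not part of HunspellXML.

```python
from itertools import product

def cross_multiply(*groups):
    """Concatenate one morpheme from each group, in order, for every combination."""
    return ["".join(combo) for combo in product(*groups)]

subjects = ["na", "o", "a", "e", "to", "bo", "ba"]

# Three groups: subject, future "ko", reflexive (null or "mi").
three_groups = cross_multiply(subjects, ["ko"], ["", "mi"])
# Two groups: subject, then future and reflexive already merged.
two_groups = cross_multiply(subjects, ["ko", "komi"])

print(three_groups == two_groups)  # True
print(three_groups[:2])            # ['nako', 'nakomi']
```

Both versions yield the same 14 prefixes, so the choice between them is a matter of how closely you want the XML to mirror the position class chart.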
Here is one way to write the rules for the imperative+reflexive prefixes:
<prefix flag="IS" cross="true">
<multiply>
<group>
<rule add="" remove="0" where="." combineFlags="NA EA"/>
<rule add="bo" remove="0" where="." combineFlags="NA EA"/>
</group>
<group>
<rule add="" remove="0" where="." combineFlags="NA EA"/>
<rule add="mi" remove="0" where="." combineFlags="NA EA"/>
</group>
</multiply>
</prefix>
But since there are only two rules in each group, it might actually be easier to just specify the four necessary rules. This depends on individual preference however. Less typing, or greater similarity with the position class chart?
<prefix flag="IS" cross="true">
<rule add="" remove="0" where="." combineFlags="NA EA"/>
<rule add="bo" remove="0" where="." combineFlags="NA EA"/>
<rule add="mi" remove="0" where="." combineFlags="NA EA"/>
<rule add="bomi" remove="0" where="." combineFlags="NA EA"/>
</prefix>
The infinitive+reflexive rule can be described with just two rules: ko+null and ko+mi. So it doesn't make sense to encapsulate them in a prefix multiply group. Let's just use a set of simple prefix rules:
<prefix flag="IF" cross="true">
<rule add="ko" remove="0" where="." combineFlags="NA EA"/>
<rule add="komi" remove="0" where="." combineFlags="NA EA"/>
</prefix>
Now we need to write the suffix rules. For the extension+tense rule, which combines the extension slot and the tense/aspect slot, the power of affix multiply groups can really be seen. Our suffix multiply group should look like this:
<suffix flag="ET" cross="true">
<multiply>
<group>
<!-- null morpheme and verb extension morphemes -->
<rule add="" remove="0" where="."/>
<rule add="ol" remove="0" where="."/>
<rule add="is" remove="0" where="."/>
<rule add="el" remove="0" where="."/>
<rule add="am" remove="0" where="."/>
<rule add="an" remove="0" where="."/>
</group>
<group>
<rule add="a" remove="0" where="."/>
<rule add="i" remove="0" where="."/>
<rule add="aka" remove="0" where="."/>
<rule add="aki" remove="0" where="."/>
</group>
</multiply>
</suffix>
If we weren't using the affix multiply group, we would have to write the rules like this:
<suffix flag="ET" cross="true">
<rule add="a" remove="0" where="."/>
<rule add="ola" remove="0" where="."/>
<rule add="isa" remove="0" where="."/>
<rule add="ela" remove="0" where="."/>
<rule add="ama" remove="0" where="."/>
<rule add="ana" remove="0" where="."/>
<rule add="i" remove="0" where="."/>
<rule add="oli" remove="0" where="."/>
<rule add="isi" remove="0" where="."/>
<rule add="eli" remove="0" where="."/>
<rule add="ami" remove="0" where="."/>
<rule add="ani" remove="0" where="."/>
<rule add="aka" remove="0" where="."/>
<rule add="olaka" remove="0" where="."/>
<rule add="isaka" remove="0" where="."/>
<rule add="elaka" remove="0" where="."/>
<rule add="amaka" remove="0" where="."/>
<rule add="anaka" remove="0" where="."/>
<rule add="aki" remove="0" where="."/>
<rule add="olaki" remove="0" where="."/>
<rule add="isaki" remove="0" where="."/>
<rule add="elaki" remove="0" where="."/>
<rule add="amaki" remove="0" where="."/>
<rule add="anaki" remove="0" where="."/>
</suffix>
That's a lot more typing! It's a lot easier to make mistakes, and a lot harder to update and maintain. And you can imagine that in some languages with even more morpheme combinations, the rules could become completely unmanageable without the use of affix multiply groups.
The extension+a-vowel rule is simpler. It only has two position slots, and one of those slots only has one option (the -a vowel). So it's reasonable to use a simple suffix rule and skip the multiply group.
<suffix flag="EA" cross="true">
<rule add="a" remove="0" where="."/>
<rule add="ola" remove="0" where="."/>
<rule add="isa" remove="0" where="."/>
<rule add="ela" remove="0" where="."/>
<rule add="ama" remove="0" where="."/>
<rule add="ana" remove="0" where="."/>
</suffix>
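Whether a multiply group pays off depends on the sizes of the groups involved. The following Python sketch compares, for each of the six rules in this example, how many <rule .../> lines you would type inside multiply groups (the sum of the group sizes) with how many rules result (their product). The group sizes are read off the rules above; the tally itself is just an illustration.

```python
from math import prod

# Number of morphemes in each slot that a rule combines.
group_sizes = {
    "SU": [7, 2],     # subject x reflexive
    "SF": [7, 1, 2],  # subject x future x reflexive
    "IS": [2, 2],     # imperative subject x reflexive
    "IF": [1, 2],     # infinitive x reflexive
    "ET": [6, 4],     # extension x tense/aspect
    "EA": [6, 1],     # extension x final -a vowel
}

summary = {flag: (sum(sizes), prod(sizes)) for flag, sizes in group_sizes.items()}
for flag, (written, generated) in summary.items():
    print(f"{flag}: {written} rules written in groups, {generated} rules generated")
```

For "ET" the groups need 10 lines to produce 24 rules, while for "IF" and "EA" the groups would take more lines than the plain rules, which is why those two are written out directly.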
We always start our rule chain with a word in a dictionary. Each word in the dictionary specifies what (if any) affixation rules can apply to it. Without specifying any affixation rules, only the word as written in the dictionary list will be accepted as validly spelled.
In this toy Lingala dictionary, we are only concerning ourselves with verbs. In the separate dictionary file, or in the <dictionaryFile> section of the HunspellXML file, we need to give a list of Lingala verbs. For example:
<dictionaryFile>
<words>
zal
ling
lamb
luk
kom
li
</words>
</dictionaryFile>
These dictionary words don't specify what affixation rules they are allowed to combine with, so the spell checker would only recognize these words: "zal", "ling", etc. But that isn't right, because "zal", "ling", etc. are just the bare verb roots, and those aren't valid words in Lingala.
We need to specify the affixation rules for these words.
We want our verb roots to combine with one of the four prefix rules, and we want the resulting word fragment to combine with one of the two suffix rules.
Dictionary Word | → (NA) | SU | → (NA) | ET
Dictionary Word | → (NA) | SF, IS, IF | → (NA) | EA
One way to specify the affixation rules is to use the Hunspell method of specifying affixation rules: word/FlAg. That is,
- The dictionary word
- A slash /
- A list of affixation rule flags.
  - Two-letter flags are placed right next to each other, e.g. NA, SU, SF, IS, IF becomes NASUSFISIF
  - One-letter flags are placed right next to each other, e.g. N, U, F, S, I becomes NUFSI
  - Numeric flags are separated by commas, e.g. 93,12,9
- (Optionally: a tab followed by a list of morpheme rules. We're not going to deal with that here.)
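The three flag notations listed above can be illustrated with a small Python sketch that splits the text after the slash back into individual flags. The function `split_flags` and its style labels ("long", "char", "num") are names chosen for this example only; this is not code from Hunspell or HunspellXML.

```python
def split_flags(flag_string, style):
    """Split the text after the slash into individual flags.

    style: "long" for two-letter flags, "char" for one-letter flags,
    "num" for comma-separated numeric flags.
    """
    if style == "long":
        if len(flag_string) % 2 != 0:
            raise ValueError("two-letter flags need an even number of characters")
        return [flag_string[i:i + 2] for i in range(0, len(flag_string), 2)]
    if style == "char":
        return list(flag_string)
    if style == "num":
        return flag_string.split(",")
    raise ValueError(f"unknown flag style: {style}")

word, _, flags = "zal/NASUSFISIF".partition("/")
print(word, split_flags(flags, "long"))  # zal ['NA', 'SU', 'SF', 'IS', 'IF']
print(split_flags("NUFSI", "char"))      # ['N', 'U', 'F', 'S', 'I']
print(split_flags("93,12,9", "num"))     # ['93', '12', '9']
```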
We need to write the rules that connect the bare verb roots "zal", "ling", etc. to the affix rules that will lead to complete words. We can do that by specifying the affix rules for each word individually like this:
<dictionaryFile>
<words>
zal/NASUSFISIF
ling/NASUSFISIF
lamb/NASUSFISIF
luk/NASUSFISIF
kom/NASUSFISIF
li/NASUSFISIF
</words>
</dictionaryFile>
Or we can indicate that the same affix rules apply to every word inside the <words>...</words> block like this:
<dictionaryFile>
<!-- Verb roots -->
<words flags="NA SU SF IS IF">
zal
ling
lamb
luk
kom
li
</words>
<!-- Noun roots -->
<words flags="...">
...
</words>
</dictionaryFile>
This way you can specify the possible affixation rules just once per <words>...</words> block, and separate each class of words into its own <words>...</words> block.
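Conceptually, a flags attribute on a <words> block is shorthand for attaching the same flags to every word in the block. The Python sketch below shows that equivalence for two-letter flags; `to_dictionary_lines` is a hypothetical helper written for this illustration, not a HunspellXML function.

```python
def to_dictionary_lines(words, flags):
    """Attach the same space-separated flags to every word as word/FLAGS."""
    joined = "".join(flags.split())  # two-letter flags are written back to back
    return [f"{word}/{joined}" if joined else word for word in words]

verb_roots = ["zal", "ling", "lamb", "luk", "kom", "li"]
lines = to_dictionary_lines(verb_roots, "NA SU SF IS IF")
for line in lines:
    print(line)  # zal/NASUSFISIF, ling/NASUSFISIF, ...
```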
In Lingala, you might create one block for regular verbs, one block for nouns of class 1/2, one block for nouns of class 3/4, and so on. In Spanish, you might create one block for each of the -ar, -er, and -ir verbs, several blocks for different kinds of noun plurals, and so on.
By adding these flags to the word list, we specify which affixation rules may apply to each word in the dictionary. Only one of the rules may be chosen. Special flags like "NA", which we declared to stand for the "needAffix" setting, are not applied as rules themselves; they change how the other rules are applied. So in the case of our wordlist above, <words flags="NA SU SF IS IF"> means that one of the following affixation rules must be applied before the word can be valid:
- SU
- SF
- IS
- IF
Once the dictionary word connects to the first affixation rule, that affixation rule can specify that additional rules may apply, using the <rule .../> element's combineFlags attribute. These rules can also be optional or required, depending on whether a "needAffix" flag is set.
In the rules we already created above, we used the combineFlags attribute to connect the "SU" rule to the "ET" rule, and the "SF", "IS", and "IF" rules to the "EA" rule. We used the setting <needAffix flag="NA"/> and added the "NA" flag to the combineFlags attribute of the "SU", "SF", "IS", and "IF" rules. That means that the word can never stop at just the "SU", "SF", "IS", or "IF" rule. Additional rules must be processed. (That is, just as the bare verb root "zal" is not valid by itself, neither is "nazal" or "kozal". A suffix needs to be added, e.g. "nazala".)
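The whole chain (dictionary word, then one prefix rule, then the suffix rule that prefix requires) can be modeled with a toy Python sketch. It simply enumerates prefix + root + suffix strings from the morpheme lists used in this example. It is a simplified model of the flag chain under those assumptions, not a description of how Hunspell actually matches words, and all names in it are invented for the illustration.

```python
from itertools import product

def cross_multiply(*groups):
    """Concatenate one morpheme from each group, in order, for every combination."""
    return ["".join(combo) for combo in product(*groups)]

subjects = ["na", "o", "a", "e", "to", "bo", "ba"]
extensions = ["", "ol", "is", "el", "am", "an"]

# Each prefix rule, paired with the one suffix rule its combineFlags points to.
prefix_rules = {
    "SU": (cross_multiply(subjects, ["", "mi"]), "ET"),
    "SF": (cross_multiply(subjects, ["ko", "komi"]), "EA"),
    "IS": (cross_multiply(["", "bo"], ["", "mi"]), "EA"),
    "IF": (["ko", "komi"], "EA"),
}
suffix_rules = {
    "ET": cross_multiply(extensions, ["a", "i", "aka", "aki"]),
    "EA": cross_multiply(extensions, ["a"]),
}

def accepted_forms(root):
    """All spellings reachable as prefix rule + root + required suffix rule."""
    forms = set()
    for prefixes, suffix_flag in prefix_rules.values():
        for prefix, suffix in product(prefixes, suffix_rules[suffix_flag]):
            forms.add(prefix + root + suffix)
    return forms

forms = accepted_forms("zal")
for candidate in ["zal", "nazal", "kozal", "nazala", "kozala"]:
    print(candidate, candidate in forms)
```

In this model the bare root "zal" and the prefix-only forms "nazal" and "kozal" are rejected, while "nazala" and "kozala" are accepted, which mirrors the effect of the "NA" flag described above.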