HunspellXML Format (Affixes) - TrnsltLife/HunspellXML GitHub Wiki
The <affixes>...</affixes> element contains all the morphological rules for the dictionary, governing how prefixes and suffixes may attach to different classes of words. These rules are separated into two types: <prefix> rules and <suffix> rules.
<affixes>
<prefix flag="[flag]" cross="[boolean]">
<rule add="[text]" remove="[text]" where="[regex]" morph="[morph]" combineFlags="[list of flags]" />
<rule ... />
<rule ... />
</prefix>
<prefix flag="[flag]" cross="[boolean]">
<multiply>
<group>
<rule ... />
<rule ... />
<rule ... />
</group>
<group>
<rule ... />
<rule ... />
<rule ... />
<rule ... />
</group>
</multiply>
</prefix>
<suffix flag="[flag]" cross="[boolean]">
<rule ... />
<rule ... />
<rule ... />
<rule ... />
</suffix>
</affixes>
Attributes:
- flag [flag] - The flag that identifies this set of prefix or suffix rules
- cross [boolean] optional - Indicates whether this affix can combine with the opposite kind of affix (i.e. prefix with suffix or vice versa)
Each <prefix> or <suffix> element has a required flag attribute that identifies the set of rules grouped inside it. A word in the dictionary may reference this flag, which indicates that the word may combine with the prefixes or suffixes defined by the rules inside that <prefix> or <suffix> element.
For example, given these words in the dictionary:
<dictionaryFile>
<words flags="NS"><!--these words will combine with the NS suffix rules below-->
dog
cat
ostrich
fish
octopus
fox
</words>
</dictionaryFile>
and given the following affix rules:
<affixes>
<suffix flag="NS">
<rule add="s" where="[^hsx]" /> <!--add 's' to words not ending in 'h', 's', or 'x'-->
<rule add="es" where="[hsx]" /> <!--add 'es' to words ending in 'h', 's', or 'x'-->
</suffix>
</affixes>
The set of noun pluralization suffix rules identified by the flag NS allows the words in the dictionary marked with the NS flag to take the -s and -es suffixes, so that all of the following words would be recognized by Hunspell:
dog dogs
cat cats
ostrich ostriches
fish fishes
octopus octopuses
fox foxes
Note that since the word 'dog' does not end in 'h', 's', or 'x', it will never combine with the second NS rule to produce doges.
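The way the two NS rules select words can be sketched in a few lines of Python. This is only an illustration of the matching described above, not Hunspell's implementation; the names NS_RULES and pluralize are hypothetical, and Python's re module stands in for Hunspell's simplified condition syntax.

```python
import re

# Hypothetical in-memory copy of the NS suffix rules as (where, add) pairs.
NS_RULES = [("[^hsx]", "s"), ("[hsx]", "es")]

def pluralize(word):
    """Return every form produced by an NS rule whose condition matches."""
    forms = []
    for where, add in NS_RULES:
        # A suffix condition is tested against the end of the word.
        if re.search(where + "$", word):
            forms.append(word + add)
    return forms

for w in ["dog", "cat", "ostrich", "fish", "octopus", "fox"]:
    print(w, pluralize(w))
```

Because the two conditions are complementary, each word matches exactly one rule, which is why "dog" yields only "dogs".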
Attributes:
- add [text] optional
- remove [text] optional
- where [regex] optional
- combineFlags [list of flags] optional
- morph [text] optional
- where: This is what Hunspell refers to as the "condition": under what conditions does this rule match? If you omit the attribute or specify where=".", the rule will always match, and the remove and add values will be applied to produce another valid word-form. Otherwise, you can specify a simplified regular expression to decide what matches.
  - A dot (.) matches any character. A prefix rule with where=".n" could match any one character followed by an "n", e.g. "an", "en", "in", "on", "sn", "un", etc.
  - A set of characters between [square brackets] matches any single one of those characters. So a prefix rule with where="[iu]n" would match words starting with "in" or "un" but not "on" or anything else.
  - A set of characters between [^square brackets] where the first character is a caret (^) matches any one character except for the characters inside the brackets. So a suffix rule with where="[^hsx]" would match a word ending in any character besides "h", "s", and "x".
- remove: If a rule matches the where condition, the remove value is applied next, before the add value is applied. If you omit the attribute or put a value of remove="" or remove="0", nothing will be removed. Otherwise, the characters you specify will be removed from the beginning of the word (prefix rules) or from the end of the word (suffix rules). This does not use regular expressions.
- add: Specify a single affix to add onto the word-form. This text is added to the end of the word (for suffixes) or to the beginning of the word (for prefixes).
- combineFlags: If the rule matches, the combineFlags attribute indicates what other affixation rules may apply after this one. If you omit this attribute or leave its value blank, no other rules will apply, and the result of the current rule will be a final word-form with no additional prefixes or suffixes possible.
- morph: Specify the morphological information that should be attached to this word when the where condition matches.
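The order of operations described above (test where, then apply remove, then apply add) can be sketched as a small Python function. This is a minimal sketch under assumptions, not Hunspell's or HunspellXML's actual code: the function name apply_rule and its parameters are hypothetical, and Python's re module is used to approximate the simplified condition syntax.

```python
import re

def apply_rule(word, kind, add="", remove="", where="."):
    """Apply one affix rule; return the new form, or None if it does not apply.

    kind is "prefix" or "suffix". Mirrors the prose description only.
    """
    if remove == "0":  # "0" means the same as an empty remove value
        remove = ""
    if kind == "suffix":
        # Suffix conditions and removals are anchored at the end of the word.
        if not re.search(where + "$", word) or not word.endswith(remove):
            return None
        return word[: len(word) - len(remove)] + add
    # Prefix conditions and removals are anchored at the start of the word.
    if not re.match(where, word) or not word.startswith(remove):
        return None
    return add + word[len(remove):]

print(apply_rule("rally", "suffix", add="ied", remove="y", where="[^aeiou]y"))
print(apply_rule("invited", "prefix", add="un", where="[iu]n"))
print(apply_rule("online", "prefix", add="un", where="[iu]n"))
```

The last call returns None because "online" starts with "on", which the condition [iu]n does not accept.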
Here is a set of suffix rules that differentiate between different past tense spellings for regular English verbs:
<suffix flag="ED">
<rule where="e" add="d" morph="is:past"/>
<rule where="[^aeiou]y" remove="y" add="ied" morph="is:past"/>
<rule where="[^ey]" remove="0" add="ed" morph="is:past"/>
<rule where="[aeiou]y" remove="" add="ed" morph="is:past"/>
</suffix>
<rule where="e" add="d" morph="is:past"/>
In this example, the first rule matches only words that end in "e" (where="e"). Nothing gets removed (there is no remove attribute) and a "d" gets added to the end, along with the morphological tag "is:past". If the original word was "shade", this suffix rule would result in "shaded".
<rule where="[^aeiou]y" remove="y" add="ied" morph="is:past"/>
The next rule matches only words that end in "y" preceded by any letter but "aeiou". Put another way, it matches words that end in [consonant]+y, such as "rally" and "bully". Remember, [aeiou] is the regular expression for "any one letter in the list a,e,i,o,u". In this case, with the addition of a caret, [^aeiou] is the regular expression for "any one letter except for a,e,i,o,u". The removal rule remove="y" means "remove the last character if it is 'y'". That would transform "rally" and "bully" into "rall" and "bull". The rule add="ied" then adds "ied" onto the end, resulting in "rallied" and "bullied".
<rule where="[^ey]" remove="0" add="ed" morph="is:past"/>
The first two rules have covered cases where the word ends in "e" and in "y". This rule matches all words that don't end in "e" or "y", using the matching rule where="[^ey]". For words that match, nothing is removed (remove="0"), and "ed" is added. So a verb like "post" would match and be transformed into "posted".
<rule where="[aeiou]y" remove="" add="ed" morph="is:past"/>
The final rule, in contrast to the second rule, looks for matches where the word ends in a vowel (a,e,i,o,u) + y. Words that match include "play", "buoy", "prey", etc. The remove="" value means nothing is removed from the end of the word. The add value adds "ed", resulting in words like "played", "buoyed", "preyed", etc.
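To check that the four ED rules partition the verbs as described, the walkthrough above can be replayed in Python. This is an illustrative sketch only; ED_RULES and past_tense are hypothetical names, and Python's re module approximates the condition matching.

```python
import re

# The four ED rules from the example, as (where, remove, add) tuples.
ED_RULES = [
    ("e", "", "d"),
    ("[^aeiou]y", "y", "ied"),
    ("[^ey]", "0", "ed"),
    ("[aeiou]y", "", "ed"),
]

def past_tense(word):
    """Return every past-tense form generated by a matching ED rule."""
    forms = []
    for where, remove, add in ED_RULES:
        strip = "" if remove == "0" else remove  # "0" removes nothing
        # The condition is tested against the end of the word.
        if re.search(where + "$", word) and word.endswith(strip):
            forms.append(word[: len(word) - len(strip)] + add)
    return forms

for verb in ["shade", "rally", "bully", "post", "play", "buoy", "prey"]:
    print(verb, past_tense(verb))
```

Each sample verb matches exactly one rule, so each list holds a single form.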
Morphological description fields should consist of a two-letter code followed by a colon : followed by a text label.
- Multiple morphological description fields may be used. They are separated from each other by spaces.
- Morphological information is used for parsing and is not needed for spell checking.
- The morphological field codes that Hunspell defines are:
- ph: Phonetic
- st: Stem
- al: Allomorph(s)
- po: Part of speech
- ds: Derivational suffix(es)
- is: Inflectional suffix(es)
- ts: Terminal suffix(es)
- sp: Surface prefix
- pa: Parts of the compound words
- dp: Derivational prefix
- ip: Inflectional prefix
- tp: Terminal prefix
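As a quick illustration of the field layout described above (two-letter code, colon, label, with fields separated by spaces), a morph value could be split like this. The function parse_morph and the sample value "st:shade is:past" are hypothetical and only demonstrate the format; they are not part of HunspellXML.

```python
def parse_morph(morph):
    """Split a morph attribute value into (code, label) pairs."""
    fields = []
    for field in morph.split():
        code, sep, label = field.partition(":")
        # Each field must look like a two-letter code, a colon, and a label.
        if len(code) != 2 or not sep or not label:
            raise ValueError("malformed morphological field: " + field)
        fields.append((code, label))
    return fields

print(parse_morph("st:shade is:past"))
```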
TODO
For the time being, see the Lingala Verb Example for examples of how to use the <multiply><group>...</group></multiply> elements.