HunspellXML Format (Suggestions) - TrnsltLife/HunspellXML GitHub Wiki
The <suggestions> element contains settings that help Hunspell make better guesses about which word the user meant when a word is misspelled. It can contain the following elements, all of which are optional.
<suggestions>
<tryChars capitalize="[boolean]">[list of chars]</tryChars>
<keyboard>...</keyboard>
<phone>...</phone>
<replacements>...</replacements>
<mappings>...</mappings>
<noSuggestions flag="[flag]"/>
<warn flag="[flag]"/>
<forbidWarn/>
<maxCompoundSuggestions>[integer]</maxCompoundSuggestions>
<maxNGramSuggestions>[integer]</maxNGramSuggestions>
<maxDifference>[integer:0-10]</maxDifference>
<onlyMaxDifference/>
<noSplitSuggestions/>
<suggestionsWithDots/>
</suggestions>
Attributes of <tryChars>:
- capitalize [boolean] optional
The <tryChars> element lists the characters Hunspell tries when looking for correctly spelled words that differ from the misspelled word by a single letter. List them in order of frequency, if possible. Set capitalize="true" to list only the lowercase letters and have the uppercase forms added automatically. Each item in the list should be a single character, and items are separated by spaces.
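As a minimal sketch of the description above, a <tryChars> entry for English might look like the following. The letter order is illustrative only and is not taken from a real dictionary; substitute the frequency order of your own language.

```xml
<suggestions>
  <!-- Illustrative order only: most frequent letters first, single characters separated by spaces. -->
  <!-- capitalize="true" asks for the uppercase forms to be added as well. -->
  <tryChars capitalize="true">e s i a n r t o l c d u g m p h b y f v k w z x j q</tryChars>
</suggestions>
```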
Attributes of <keyboard>:
- layout [QWERTY|AZERTY|Dvorak] optional
Specify a list of neighboring keys on the keyboard. Non-neighbors are separated from each other by a vertical line.
Hunspell documentation:
Hunspell searches and suggests words with one different character replaced by a neighbor KEY character. Not neighbor characters in KEY string separated by vertical line characters. Suggested KEY parameters for QWERTY and Dvorak keyboard layouts:
KEY qwertyuiop|asdfghjkl|zxcvbnm
KEY pyfgcrl|aeouidhtns|qjkxbmwvz
Using the first QWERTY layout, Hunspell suggests "nude" and "node" for "*nide". A character may have more neighbors, too:
KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
Example:
<keyboard>qwertyuiop|asdfghjkl|zxcvbnm</keyboard>
You can also use one of the prebuilt keyboard layouts by specifying the layout attribute:
- QWERTY
- AZERTY
- Dvorak
<keyboard layout="QWERTY"/>
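The extended neighbor list quoted from the Hunspell documentation can presumably be supplied in the same way as the simple three-row layout. The sketch below assumes that HunspellXML passes the element text through to the KEY option unchanged, which this page does not state explicitly.

```xml
<suggestions>
  <!-- Assumption: the text is written to the KEY option as-is. -->
  <!-- Each group between vertical lines is a run of keys that neighbor one another. -->
  <keyboard>qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm</keyboard>
</suggestions>
```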
Hunspell documentation for <phone> (PHONE):
PHONE uses a table-driven phonetic transcription algorithm borrowed from Aspell. It is useful for languages with not pronunciation based orthography. You can add a full alphabet conversion and other rules for conversion of special letter sequences. For detailed documentation see http://aspell.net/man-html/Phonetic-Code.html or reproduced on this wiki. Note: Multibyte UTF-8 characters have not worked with bracket expression yet. Dash expression has signed bytes and not UTF-8 characters yet.
In HunspellXML, the <phone>...</phone> element should contain a set of <rule>[text]</rule> elements. Each rule should contain a phone rule formatted according to the Aspell rules.
Here are a couple of rules for English from the Hunspell tests/phone.aff file:
<phone>
<rule>ER(AEIOUY)-^ *R </rule>
<rule>GH(AEIOUY)- K </rule>
</phone>
In HunspellXML, the <replacements>...</replacements> element should contain a set of <replace .../> elements.
Attributes:
- from [text] - the character(s) to replace
- to [text] - what to replace them with
- reverse [boolean] optional - whether the reverse rule should be generated automatically
Hunspell documentation for <replacements> (REP):
With this table, Hunspell can suggest the right forms for the typical faults of spelling when the incorrect form differs by more than 1 letter from the right form. The search string supports the regex boundary signs (^ and $). For example, a possible English replacement table definition to handle misspelled consonants:
<replacements>
<replace from="f" to="ph" reverse="true"/>
<replace from="tion$" to="shun"/>
<replace from="^cooccurr" to="co-occur"/>
<replace from="^alot$" to="a_lot"/>
</replacements>
Notes:
- You can use underscore _ in the "to" attribute to suggest words separated by a space.
- The reverse="true" attribute is a shorthand that also generates the opposite rule, so the following two examples are equivalent:
<replacements>
<replace from="f" to="ph"/> <!-- element may appear multiple times -->
<replace from="ph" to="f"/> <!-- element may appear multiple times -->
</replacements>
<replacements>
<replace from="f" to="ph" reverse="true"/>
</replacements>
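The notes above can be combined in one table. The following is a minimal sketch; the misspellings are hypothetical examples chosen for illustration and do not come from a real dictionary.

```xml
<replacements>
  <!-- reverse="true": covers both "ei" to "ie" and "ie" to "ei". -->
  <replace from="ei" to="ie" reverse="true"/>
  <!-- ^ and $ restrict the match to the whole word; the underscore suggests two words separated by a space. -->
  <replace from="^incase$" to="in_case"/>
</replacements>
```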
Hunspell documentation for <mappings> (MAP):
We can define language-dependent information on characters and character sequences that should be considered related (i.e. nearer than other chars not in the set) in the affix file (.aff) by a map table. With this table, Hunspell can suggest the right forms for words, which incorrectly choose the wrong letter or letter groups from a related set more than once in a word.
See also <replacements>.
In HunspellXML, the <mappings>...</mappings> element should contain a set of <map>[text]</map> elements. Each <map>[text]</map> should contain a list of characters or character sequences, each separated from the next by a space. Here is an example for the English "f" and "ee" sounds.
<mappings>
<map>f ph gh</map> <!-- element may appear multiple times -->
<map>e ea ee ey y i</map> <!-- element may appear multiple times -->
</mappings>
The <mappings> element can be especially useful for languages that use a lot of diacritics on vowels (like tone marks, nasal marks, accent marks, etc.).
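For the diacritics case, a sketch might group each base vowel with its accented variants, using the space-separated format described above. The character sets are illustrative and not tied to any particular language.

```xml
<mappings>
  <!-- Each map groups a base vowel with accented forms that are easily confused with it. -->
  <map>a á à â ä</map>
  <map>e é è ê ë</map>
  <map>o ó ò ô ö</map>
</mappings>
```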
Words marked with the flag defined in <noSuggestions.../> will never be suggested to the user as a proper spelling for a misspelled word. You can apply this flag to vulgar and obscene words to prevent them from being suggested to the user. See also the <subStandard.../> option under <settings>.
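A minimal sketch of declaring the flag follows. The flag value "X" is hypothetical; use a value that is valid for the flag type your dictionary uses. Words carrying the flag are still accepted when typed correctly; they are only left out of the suggestion list.

```xml
<suggestions>
  <!-- "X" is a placeholder flag chosen for this example. -->
  <noSuggestions flag="X"/>
</suggestions>
```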
Hunspell documentation for <warn> (WARN):
This flag is for rare words which are also often spelling mistakes.
Hunspell documentation for <forbidWarn/> (FORBIDWARN):
Words with flag WARN aren't accepted by the spell checker using this parameter.
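The two options work together: <warn> names the flag, and <forbidWarn/> changes how flagged words are treated. In the sketch below, the flag value "W" is hypothetical.

```xml
<suggestions>
  <!-- "W" is a placeholder flag for rare words that are often spelling mistakes. -->
  <warn flag="W"/>
  <!-- With this element present, words carrying the WARN flag are not accepted by the spell checker. -->
  <forbidWarn/>
</suggestions>
```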
Hunspell documentation for <maxCompoundSuggestions> (MAXCPDSUGS):
Set max. number of suggested compound words (generated by compound rules). (The number of the suggested compound words may be greater from the same 1-character distance type.)
Hunspell documentation for <maxNGramSuggestions> (MAXNGRAMSUGS):
Set max. number of n-gram suggestions. Value 0 switches off the n-gram suggestions (see also MAXDIFF).
Hunspell documentation for <maxDifference> (MAXDIFF):
Set the similarity factor for the n-gram suggestions (5 = default value, 0 = few, but min. 1, 10 = MAXNGRAMSUGS n-gram suggestions).
Hunspell documentation for <onlyMaxDifference/> (ONLYMAXDIFF):
Removing all bad ngram suggestions is allowed (default mode keeps one, see MAXDIFF).
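The three n-gram settings are usually tuned together. The following sketch uses arbitrary illustrative values; suitable numbers depend on the dictionary.

```xml
<suggestions>
  <!-- At most 4 n-gram suggestions; 0 would switch n-gram suggestions off. -->
  <maxNGramSuggestions>4</maxNGramSuggestions>
  <!-- Similarity factor in the range 0-10; 5 is the documented default. -->
  <maxDifference>3</maxDifference>
  <!-- Allow all bad n-gram suggestions to be removed instead of always keeping one. -->
  <onlyMaxDifference/>
</suggestions>
```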
Hunspell documentation for <noSplitSuggestions/> (NOSPLITSUGS):
Disable split-word suggestions.
Hunspell documentation for <suggestionsWithDots/> (SUGSWITHDOTS):
Add dot(s) to suggestions, if input word terminates in dot(s). (Not for OpenOffice.org dictionaries, because OpenOffice.org has an automatic dot expansion mechanism.)