HunspellXML Format (Compounds) - TrnsltLife/HunspellXML GitHub Wiki


<compounds>...</compounds>

The <compounds>...</compounds> element and its contents give information about acceptable word compound formation in the language. See the Hunspell documentation for more information. All of these elements are optional.

<compounds>
	<breakChars>[list of text]</breakChars>
	<compoundRules>...</compoundRules>
	<compoundPatterns>...</compoundPatterns>
	<compoundMin>[integer]</compoundMin>
	<compoundWordMax>[integer]</compoundWordMax>
	<compoundSyllable max="[integer]" vowels="aeiou"/>
	<syllableNum flags="[list of flags]"/>
	<compound flag="[flag]"/>
	<compoundBegin flag="[flag]"/>
	<compoundMiddle flag="[flag]"/>
	<compoundLast flag="[flag]"/>
	<onlyInCompound flag="[flag]"/>
	<compoundPermit flag="[flag]"/>
	<compoundForbid flag="[flag]"/>
	<compoundRoot flag="[flag]"/>
	<checkCompoundDuplicates/>
	<checkCompoundReplacements/>
	<checkCompoundCase/>
	<checkCompoundTriple/>
	<simplifiedTriple/>
	<forceUpperCase/>
</compounds>
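
As an illustration, a filled-in block might look like the sketch below. It assumes that each element maps to the Hunspell option of the same name. The flag letters (C, B, E, O) and the numeric value are made up for the example; real flags must match the ones used in the dictionary file.

```xml
<compounds>
	<!-- shortest word part allowed inside a compound -->
	<compoundMin>3</compoundMin>
	<!-- words carrying flag C may appear anywhere in a compound -->
	<compound flag="C"/>
	<!-- flag B: first part only; flag E: last part only -->
	<compoundBegin flag="B"/>
	<compoundLast flag="E"/>
	<!-- words carrying flag O are valid only inside compounds -->
	<onlyInCompound flag="O"/>
	<!-- reject compounds that repeat the same word, e.g. "foofoo" -->
	<checkCompoundDuplicates/>
</compounds>
```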

<breakChars>[list of text]</breakChars>

Attributes:

  • off [boolean] optional

Hunspell documentation:

Define new break points for breaking words and checking word parts separately. Use ^ and $ to delete characters at the start and end of the word. Rationale: useful for compounding with joining characters or strings (for example, hyphen in English and German, or hyphen and n-dash in Hungarian). Dashes are often bad break points for tokenization, because compounds with dashes may also contain invalid parts. With BREAK, Hunspell can check both sides of these compounds, breaking the words at dashes and n-dashes.

Each character or character sequence goes in a child element called <chars>...</chars>.

<breakChars>
	<chars>-</chars>
	<chars>--</chars>
</breakChars>

or

<breakChars>
	<chars>-</chars>
	<chars>^-</chars>
	<chars>-$</chars>
</breakChars>

If you specify <breakChars off="true"/> instead of a list of characters, break detection will be turned off (like Hunspell's BREAK 0 command).

<breakChars off="true"/>

<compoundRules>...</compoundRules>

Hunspell documentation:

Define custom compound patterns with a regex-like syntax. The first COMPOUNDRULE is a header with the number of the following COMPOUNDRULE definitions. Compound patterns consist of compound flags, parentheses, and the star and question mark metacharacters. A flag followed by a ‘*’ matches a word sequence of 0 or more words marked with this compound flag. A flag followed by a ‘?’ matches a word sequence of 0 or 1 words marked with this compound flag. See the tests/compound.* examples. Note: the en_US dictionary of OpenOffice.org uses COMPOUNDRULE for ordinal number recognition (1st, 2nd, 11th, 12th, 22nd, 112th, 1000122nd etc.). Note II: In the case of long and numerical flag types, use only parenthesized flags: (1500)*(2000)? Note III: COMPOUNDRULE flags are not yet compatible with the COMPOUNDFLAG, COMPOUNDBEGIN, etc. compound flags (use these flags on different words).

<compoundRules>
	<rule>(VB)?(NS)*</rule>
	<rule>(AX)*</rule>
</compoundRules>
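
To make the star and question mark operators concrete, here is a small sketch with single-character flags. The flags n, t, a and b are hypothetical, and the sketch assumes the rule text is passed through to COMPOUNDRULE unchanged; the first rule is loosely modeled on the ordinal-number use mentioned above.

```xml
<compoundRules>
	<!-- zero or more words flagged n, followed by one word flagged t -->
	<rule>n*t</rule>
	<!-- an optional word flagged a, followed by one word flagged b -->
	<rule>a?b</rule>
</compoundRules>
```

With single-character flags the parentheses can be left out; with long or numerical flags each flag has to be parenthesized, as in the (VB)?(NS)* example.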

<compoundPatterns>...</compoundPatterns>

The <compoundPatterns>...</compoundPatterns> element contains a set of <pattern .../> elements that limit which words are allowed to form compounds.

Hunspell documentation:

Forbid compounding if the first word in the compound ends with endchars, the next word begins with startchars, and (optionally) they have the requested flags. The optional replacement parameter allows a simplified compound form. The special "endchars" pattern 0 (zero) limits the rule to unmodified stems (stems and stems with zero affixes). Note: COMPOUNDMIN doesn’t work correctly with the compound word alternation, so you may need to set COMPOUNDMIN to a lower value.

<pattern .../>

Attributes:

  • endChars [text] required
  • endFlags [list of flags] optional
  • startChars [text] required
  • startFlags [list of flags] optional
  • replacement [text] optional

<compoundPatterns>
	<pattern endChars="0" endFlags="[list of flags]" startChars="abc" startFlags="[list of flags]" replacement="efg"/>
	<pattern endChars="xyz" startChars="pdq"/>
</compoundPatterns>
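
For a more concrete picture, the sketch below forbids two specific junctions. The character sequences and the flag X are made up for illustration, and a single flag is used in endFlags because the exact list syntax is not described here.

```xml
<compoundPatterns>
	<!-- forbid a compound where the first word ends in "oo" and the next begins with "e" -->
	<pattern endChars="oo" startChars="e"/>
	<!-- same idea, but only when the first word also carries flag X -->
	<pattern endChars="nny" endFlags="X" startChars="ny"/>
</compoundPatterns>
```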