HunspellXML Format (ThesaurusFile) - TrnsltLife/HunspellXML GitHub Wiki
UNDER CONSTRUCTION
The <thesaurusFile>...</thesaurusFile> section of the HunspellXML file can be used to build a MyThes thesaurus file, which OpenOffice and LibreOffice use to provide synonym suggestions. HunspellXML gives you three ways to add synonyms to the MyThes file.
- <include .../> - Put the list of synonyms in a separate file in MyThes format.
- <entries>...</entries> - Put a list of synonyms in MyThes format directly inside the <entries>...</entries> block(s).
- <entry><synonyms>[list of synonyms]</synonyms></entry> - Create the list of synonyms in an XML format instead of the MyThes format.
<thesaurusFile>
    <include file="my-thesaurus-file.txt"/>
    <include file="my-thesaurus-file2.txt"/>
    <entries>
        [MyThes data...]
    </entries>
    <entry word="...">
        <synonyms info="...">
            word
            word
            ...
        </synonyms>
        <synonyms info="...">
            <s>word</s>
            <s>word</s>
            ...
        </synonyms>
    </entry>
    <entry word="...">
        ...
    </entry>
</thesaurusFile>
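If your synonym data already lives in a program, the XML form of an entry can be generated rather than typed by hand. The sketch below uses Python's standard xml.etree.ElementTree to emit one <entry> in the <s>word</s> style shown above. It rests on two assumptions that this page does not confirm: that the word attribute holds the headword, and that the info attribute holds the part-of-speech label (for example "(adj)"). The build_entry helper is a hypothetical name and is not part of HunspellXML.

```python
import xml.etree.ElementTree as ET


def build_entry(word, meanings):
    """Build an <entry> element from a list of (info, [synonyms]) pairs.

    Assumption: `word` is the headword and `info` is a part-of-speech
    label. The wiki page shows only "..." for both attributes.
    """
    entry = ET.Element("entry", {"word": word})
    for info, words in meanings:
        synonyms = ET.SubElement(entry, "synonyms", {"info": info})
        for w in words:
            # One <s> element per synonym, as in the second <synonyms> style.
            ET.SubElement(synonyms, "s").text = w
    return entry


entry = build_entry(
    "simple",
    [
        ("(adj)", ["bare", "mere", "plain"]),
        ("(noun)", ["herb", "herbaceous plant"]),
    ],
)

# ET.indent only exists on Python 3.9 and later; skip pretty-printing otherwise.
if hasattr(ET, "indent"):
    ET.indent(entry)
print(ET.tostring(entry, encoding="unicode"))
```

The printed element can be pasted inside a <thesaurusFile> block. Building the XML with a library rather than string concatenation means characters such as & or < in a synonym are escaped automatically.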
Before we look more in depth at the three options for listing synonyms, we need to understand the MyThes data format. Here's what MyThes's data_layout.txt file has to say about the MyThes data format:
- All of the remaining lines of the file follow this structure
entry|num_mean
pos|syn1_mean|syn2|...
.
.
.
pos|mean_syn1|syn2|...
where:
entry - all lowercase version of the word or phrase being described
num_mean - number of meanings for this entry
There is one meaning per line and each meaning is comprised of
pos - part of speech or other meaning specific description
syn1_mean - synonym 1 also used to describe the meaning itself
syn2 - synonym 2 for that meaning etc.
To make this even more clear, here is actual data for the
entry "simple".
simple|9
(adj)|simple |elemental|ultimate|oversimplified|simplistic|simplex|simplified|unanalyzable|undecomposable|uncomplicated|unsophisticated|easy|plain|unsubdivided
(adj)|elementary|uncomplicated|unproblematic|easy
(adj)|bare|mere|plain
(adj)|childlike|wide-eyed|dewy-eyed|naive |naif
(adj)|dim-witted|half-witted|simple-minded|retarded
(adj)|simple |unsubdivided|unlobed|smooth
(adj)|plain
(noun)|herb|herbaceous plant
(noun)|simpleton|person|individual|someone|somebody|mortal|human|soul
It says that "simple" has 9 different meanings and each
meaning will have its part of speech and at least 1 synonym
with other if present following on the same line.
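To make the layout concrete, here is a minimal Python sketch that reads entry blocks in the structure described above and turns them into a dictionary. It is only an illustration of the format, not how MyThes or HunspellXML read the data. The function name parse_mythes_entries is hypothetical, the sample is a shortened three-meaning version of the "simple" entry, and the sketch handles only the entry lines (the quoted text covers "the remaining lines" of the file, so anything that precedes them is out of scope here).

```python
def parse_mythes_entries(text):
    """Parse MyThes entry blocks into {entry: [(pos, [synonyms]), ...]}."""
    # Ignore blank lines so the input can be pasted loosely.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    result = {}
    i = 0
    while i < len(lines):
        # Header line: entry|num_mean
        entry, num_mean = lines[i].rsplit("|", 1)
        count = int(num_mean)
        meanings = []
        # The next `count` lines each describe one meaning: pos|syn1|syn2|...
        for line in lines[i + 1 : i + 1 + count]:
            pos, *synonyms = line.split("|")
            # Design choice of this sketch: trim stray spaces such as "simple ".
            meanings.append((pos, [s.strip() for s in synonyms]))
        result[entry] = meanings
        i += 1 + count
    return result


# Shortened sample: three of the meanings of "simple" from the data above.
sample = """simple|3
(adj)|elementary|uncomplicated|unproblematic|easy
(adj)|bare|mere|plain
(noun)|herb|herbaceous plant
"""

thesaurus = parse_mythes_entries(sample)
for pos, synonyms in thesaurus["simple"]:
    print(pos, synonyms)
```

The header count drives the parse: the sketch trusts num_mean to say how many meaning lines belong to the entry, which is why a wrong count would shift every entry that follows.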
The <entries>...</entries> tags should contain multiple lines of text, formatted according to the MyThes format.
<entries>
...
simple|9
(adj)|simple |elemental|ultimate|oversimplified|simplistic|simplex|simplified|unanalyzable|undecomposable|uncomplicated|unsophisticated|easy|plain|unsubdivided
(adj)|elementary|uncomplicated|unproblematic|easy
(adj)|bare|mere|plain
(adj)|childlike|wide-eyed|dewy-eyed|naive |naif
(adj)|dim-witted|half-witted|simple-minded|retarded
(adj)|simple |unsubdivided|unlobed|smooth
(adj)|plain
(noun)|herb|herbaceous plant
(noun)|simpleton|person|individual|someone|somebody|mortal|human|soul
...
</entries>
There can be multiple <entries>...</entries> sections, and they will all be stitched together to form the final thesaurus data.
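Because the blocks are combined into one data set, a wrong num_mean in any block can throw off the entries that come after it. The following Python sketch is one way to check the counts before pasting data into <entries> blocks. It is a standalone helper under stated assumptions, not part of HunspellXML: the names looks_like_header and check_meaning_counts are hypothetical, and header detection is a heuristic (a line with exactly two fields where the second is a number), so a meaning line such as "(noun)|7" would be misread as a header.

```python
def looks_like_header(line):
    """Heuristic: a header is 'entry|num_mean' with a numeric second field."""
    parts = line.split("|")
    return len(parts) == 2 and parts[1].strip().isdigit()


def check_meaning_counts(blocks):
    """Join several blocks of MyThes text and report count mismatches."""
    # Stitch the blocks together in order, dropping blank lines.
    lines = [ln for block in blocks for ln in block.splitlines() if ln.strip()]
    problems = []
    i = 0
    while i < len(lines):
        if not looks_like_header(lines[i]):
            problems.append(f"expected 'entry|num_mean', found: {lines[i]!r}")
            i += 1
            continue
        entry, num_mean = lines[i].split("|")
        declared = int(num_mean)
        # Count the meaning lines that follow, up to the next header-like line.
        j = i + 1
        while j < len(lines) and not looks_like_header(lines[j]):
            j += 1
        found = j - (i + 1)
        if found != declared:
            problems.append(f"{entry!r} declares {declared} meanings, found {found}")
        i = j
    return problems


blocks = [
    "simple|2\n(adj)|bare|mere|plain\n(noun)|herb|herbaceous plant\n",
    "plain|2\n(adj)|simple\n",  # deliberately wrong: only one meaning line
]
for problem in check_meaning_counts(blocks):
    print(problem)
```

An empty result means every header's count matched the number of meaning lines found after it.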
<entry>...</entry>
The <include .../> element instructs HunspellXML to open an external file and load all of its MyThes rules (in the MyThes format described above) into the synonym list that will be used to create the MyThes .dat file. Anything that can go in an <entries>...</entries> block can go in the external file.
<include file="lin_synonyms.txt"/>