Discussion on (FreeDict) TEI

Notes:

- Points 1-9 are taken directly from a mailing list (ML) thread, where they are numbered A1-A9.
- Points 10-X also stem from around that thread.

1) TEI Lex-0.

Questions
- What about the TEI Lex-0 standard?
- Should it be followed?
Examples
- a) <gram type="gender"/> instead of <gen/>.
- b) <usg> with @type (and possibly @norm)
Potential advantages
- good, fixed list of usg types (see this comparison table)
  - The useful @types textType and attribute have no equivalents in the TEI Guidelines' suggested values.
    - textType examples: bibl., poet., admin., journalese
    - attribute examples: derog., euph.
- Requirement to fully annotate with @xml:id and @xml:lang
Further questions:
- Should textType and attribute just be borrowed from TEI Lex-0?
- Where to annotate with @xml:id and @xml:lang?
Answers
- The FreeDict conversion style sheets do not support TEI Lex-0. (FreeDict TEI is partly incompatible with TEI Lex-0.)
- "It all boils down to somebody reading the document, defining our specific requirements and potentially modification and implementing it." / @shumenda
- The TEI Lex-0 guidelines may be used in addition wherever they do not contradict the FreeDict or TEI guidelines.
- TEI Lex-0 is meant to encode retrodigitized dictionaries including presentational information, while FreeDict TEI is not concerned with such.
- Consider switching someday to another (related) standard: ISO LMF-4
  - No public information yet.
  - ISO standard is not available for free
  - There is a skeletal example document
See also: this thread on the mailing list.
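
The two example differences a) and b) can be sketched side by side. This is only an illustration, not taken from the thread: the values `n`, `m` and the label `poet.` (one of the textType examples above) are assumptions chosen for the sketch.

```xml
<!-- FreeDict TEI / TEI Guidelines style: dedicated elements -->
<gramGrp>
  <pos>n</pos>
  <gen>m</gen>
</gramGrp>

<!-- TEI Lex-0 style: generic <gram> with @type -->
<gramGrp>
  <gram type="pos">n</gram>
  <gram type="gender">m</gram>
</gramGrp>

<!-- TEI Lex-0 style: usage label with one of the fixed @type values -->
<usg type="textType">poet.</usg>
```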

2) [answered] Verb & Transitivity annotation.

Status quo
- In a HowTo, it is suggested to use v, vt, vi, vti, i.e., to merge all such information into a single token.
- In an example, there is "vtr", which would also adhere to TEI Lex-0, in contrast to the former.
Question: How to annotate transitivity information?
Answer: The use of subc is strongly recommended.
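
A minimal sketch of the recommended encoding, assuming the values `v` and `trans` (they follow the "v/trans" notation used in point 23 and are not prescribed by the answer itself):

```xml
<gramGrp>
  <pos>v</pos>        <!-- part of speech: verb -->
  <subc>trans</subc>  <!-- subcategorization: transitive -->
</gramGrp>
```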

3) [answered] IPA Pronunciation.

Question: How can I enrich my dictionary with pronunciation, as annotated in <pron> tags?
Answer: Unless pronunciation is already present, the standard build process (using make) adds phonetic information with the teiaddphonetics script, which internally uses espeak[-ng].
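
For orientation, an entry with a pronunciation might look roughly as follows. The headword, transcription and translation are illustrative assumptions; the exact form of the transcription that the script generates is not specified here.

```xml
<entry>
  <form>
    <orth>house</orth>
    <pron>haʊs</pron> <!-- IPA transcription, written by hand or added by the build -->
  </form>
  <sense>
    <cit type="trans"><quote>Haus</quote></cit>
  </sense>
</entry>
```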

4) Normalization of usage annotations

Question: Should usage annotations (the content of <usg> tags) be normalized?
- different languages (e.g. "[Sprw.]" ~ "[prov.]")
- same language (e.g. "[coll.]" ~ "[slang]")
Notes:
- Recommended by TEI Lex-0.
- Using @norm in <usg> might make this less of an issue.
Sub-questions
- Should they be normalized to a single label?
- Should they be normalized to some standard labels?
  - ISO 12620 (cf. Wikipedia:Registers) (full standard only commercially available)
Answers
- An ontology should be defined.
  - Questions:
    - Similar to / linked to shared/FreeDict_ontology.xml?
      - This seems to only allow linking equivalent annotations in different languages, however not "coll." and "slang" (if these should even be considered equivalent).
    - Where to find documentation on writing such an ontology?
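
One way @norm could keep the original label while still linking equivalent annotations is sketched below. The normalized value `colloquial` is a hypothetical label, not an agreed standard; `reg` (register) is one of the @type values suggested by the TEI Guidelines.

```xml
<!-- German source label, kept literally -->
<usg type="reg" norm="colloquial">ugs.</usg>

<!-- English source label, mapped to the same normalized value -->
<usg type="reg" norm="colloquial">coll.</usg>
```

Whether "slang" should map to the same value remains the open question noted above.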

5) Quantified (or similar) usage annotations

Examples
- "mainly Am."
- "bes. Süddt.", "especially Am."
Question
- How to represent the determiner ("mainly", "bes.", ...)?
Notes
- TEI Lex-0 suggests a separate attribute, but not which (there is a TODO in the doc).
  - None of the <usg> attributes really fits; maybe @subtype?
Answer
- Likely the easiest: <usg type="hint">mainly Am.</usg>

6) Regional / dialect / language annotations.

Classes of such annotations
- a) dialect
  - Ex.: "[Br.]", "[Am.]", "[Ös.]", "[Sächs.]"
  - distinction from b) partially unclear (e.g., "Am.")
- b) Region or country
  - Ex.: "[South Africa]", "[Hessen]", "[Berlin]", "[Wien]"
- c) language
  - Ex.: "[French]", "[Lat.]"
Questions
- How to annotate/distinguish the above classes?
Notes
- TEI Lex-0: usg[@type="geographic"]: "marker which identifies the place or region where a lexical unit is mainly used"
  - Matches b), potentially partly a).
Answers
- a), b): usg[@type="geo"]
- c): usg[@type="lang"]
  - See the TEI Guidelines' corresponding section.
- Alternatively: Craft a new type and document it in the header (usg types may be chosen freely according to the TEI Guidelines).
  - Also consider adopting such a new type in the FreeDict guidelines.
  - Use plain text but name the tag and attribute name explicitly.
- Consider using a list of languages (e.g., this).
Notes example (<p> and <list> are both fine):

<notesStmt>
  <note type="status">small</note> <!-- mandatory for our DB -->
  <note xml:lang="de"> <!-- can be freely chosen -->
    <list><item>blah</item></list>
  </note>
</notesStmt>
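
Returning to the answers for classes a) to c), the suggested @type values could be applied like this. The labels are taken from the examples above; the grouping into one fragment is only for illustration.

```xml
<!-- a) dialect and b) region or country -->
<usg type="geo">Am.</usg>
<usg type="geo">Wien</usg>

<!-- c) language -->
<usg type="lang">Lat.</usg>
```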

7) [answered] Abbreviations.

Cases
- a) Headwords, which are annotations.
  - rare
- b) Annotated on headwords.
Question: How to represent in TEI?
Notes
- The TEI Guidelines contain an example with both <form type="abbrev"> and <form type="full">, in the same <entry>.
- The TEI Guidelines also offer <abbr> and <expan>, possibly grouped in <choice>.
  - These seem to be rather intended for encoding inside of prose.
Answers
- An entry should only contain a single form tag.
- An entry/form may contain a nested form[@type="abbrev"] element.
- In the case of a standalone abbreviation, the corresponding form element right below entry should be annotated with @type="abbrev".
  - potential issue: Shouldn't the topmost form elements have @type="lemma"?
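
The two answered cases might be encoded as sketched here. Headwords and translations are illustrative assumptions, and the open question about @type="lemma" is not addressed.

```xml
<!-- b) abbreviation annotated on a headword: nested form -->
<entry>
  <form>
    <orth>United Nations</orth>
    <form type="abbrev">
      <orth>UN</orth>
    </form>
  </form>
  <sense>
    <cit type="trans"><quote>Vereinte Nationen</quote></cit>
  </sense>
</entry>

<!-- a) standalone abbreviation as headword: @type on the top-level form -->
<entry>
  <form type="abbrev">
    <orth>UN</orth>
  </form>
  <sense>
    <cit type="trans"><quote>UNO</quote></cit>
  </sense>
</entry>
```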

8) [answered] entry/sense/gramGrp vs entry/gramGrp

Answer: Both are fine (also in parallel).
- Consider putting gramGrp inside form when one also occurs in sense.

9) Header

9.1) fileDesc/publicationStmt/license

Question: Currently <availability> is suggested and used exclusively (for licensing information). Why not <license>?
Answer: The style sheets do not permit <license>, so validation would fail.
- Consider changing this in a future style sheet update.

9.2) [imported dictionaries] Date of imported source

Q: Where to annotate a date specific to the source from which the final TEI was imported?
A: Annotate within sourceDesc.
- Q: As plain text?

9.3) [answered] fileDesc/publicationStmt/pubPlace

HowTo: <ref>https://freedict.org/</ref>
(example) TEI: <ref target="http://freedict.org/">http://freedict.org/</ref>
A: The HowTo is right.

9.6) [answered] [imported dictionaries] fileDesc/editionStmt/edition (version)

Question: What to use when the TEI output is both influenced by a source's version and an importer's version?
Answers
- Whatever works or seems logical.
- Options: srcver.importerver | date | srcver | srcver.date

9.7) [answered] [imported dictionaries] fileDesc/titleStmt/editor

Q: Set author of importer as editor?
- TEI Guidelines: "[...] acting as editor, compiler, translator, etc."
A: Permitted.

10) [answered] Q: Should `xr/ref` have content, or does a `@target` suffice?

A: Content!
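
A sketch of a reference that carries both a target and readable content. The id `autumn` and the @type value `syn` are assumptions for illustration.

```xml
<xr type="syn">
  <!-- the text content is what a reader sees; @target alone is not enough -->
  <ref target="#autumn">autumn</ref>
</xr>
```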

11) Grouping of homographs

Options:
- superEntry/entry
- entry/sense
- entry/hom
- entry/entry
  - illegal in (FreeDict) TEI, suggested in TEI Lex-0.
Q: Is superEntry ok?
- A: No. "It doesn't seem necessary at all and is on its way out, in general." / @bansp
- A: Not handled by the style sheets. Also, hom is ignored.
Q: [imported dictionaries] What if it is not clear from the source whether two homographs qualify as senses of the same word?
- Note: The "Ding" dictionary contains many words repeatedly, usually with (close to) identical meaning.
- Q: If grouping, what to do with potentially differing annotations, including abbreviations, gramGrp, inflected forms?
  - Q: Only keep what applies to all on the top level?
  - Q: Are all the tags valid e.g. on the sense level?

12) [answered] Presentational information

Examples: "{v}" - the braces, ";", "~" - for references
A: Drop

13) [answered] "to" prefix for verbs

A: Drop.

14) [answered] Multiple genders

Ex.: "Avis {m,n}" (German)
A: Two <gen> in a single gramGrp.
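
For the example above, this would give roughly the following gramGrp. The <pos> value is an assumption; only the two <gen> elements follow from the answer.

```xml
<gramGrp>
  <pos>n</pos> <!-- noun (assumed value) -->
  <gen>m</gen> <!-- masculine -->
  <gen>n</gen> <!-- neuter -->
</gramGrp>
```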

16) [answered] Encoding of plain text annotations on headwords (and translations)

Examples:
- "bread (baked in an oven)"
- "bread (wheat product)"
Options:
- <note>
- <usg> -- @type="hint"?
  - Usually used for more specific usages, e.g. "Am.", "med.".
- <def>
Answers:
- [imported dictionary] When indistinguishable, use <note>
- When writing by hand, try to distinguish (def, usg with specific @type).
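
The two answers can be contrasted in a small sketch. Placing the annotations inside <sense> and the translation "Brot" are assumptions made for illustration; "med." is the usage example from the options above, and `dom` (domain) is one of the @type values suggested by the TEI Guidelines.

```xml
<!-- Imported data: the kind of annotation cannot be determined -->
<sense>
  <note>baked in an oven</note>
  <cit type="trans"><quote>Brot</quote></cit>
</sense>

<!-- Written by hand: the annotation is known to be a domain label -->
<usg type="dom">med.</usg>
```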

17) Collocates

Cases:
- a) case information: "wegen {+Gen.}"
  - see 17.2)
- b) auxiliary words representing an object
  - b.1) suffixing: "eat sth."
  - b.2) prefixing: "etw. essen"
  - b.3) alternatives: "notify sth./sb."
    - b.3.1) switchable words: "to file away <> sth." (indicating the alternatives "to file away sth." and "to file sth. away")
  - b.4) several: "give sth. to sb."
    - potentially both prefixing and suffixing
- c) specific word(s)
  - c.1) suffixing: "dismounting (of a machine)"
  - c.2) prefixing
  - c.3) combinations
- d) combinations of a), b), c)
Available tags
- <colloc> (occurs in <gramGrp>)
  - attribute @type="left"?
    - possible conflict with @type as suggested in 17.2).
- <usg type="colloc">
  - attribute @subtype="left"?
- <cit type="colloc">
  - Nested inside <cit type="trans">, seen in eng-pol.
    - See also 25.3)
Answers
- For a), see 17.2).
- b) colloc
- c) usg[@type="colloc"]
Proposed answers:
- b.i): <colloc>. This is grammar information.
- b.ii): @type or @subtype with value obj (or similar).
- c): <usg type="colloc">/<cit type="colloc">. This is not grammar information.
- location: @subtype="left" resp. "right".
- order: keep both <colloc> and <usg type="colloc"> (resp. cit) in the original order.
  - Keeping the order of the union of both is impossible with the given suggestion, but things like "(of a machine)" are supposed to be optional anyways.
- b.3) (alternatives)
  - i) group in <choice> or similar.
  - ii-iv) see below
    - iii) conflicts with several subsequent <colloc>s
  - v) Use @n to define an order. Interchangeable collocates get the same `@n`.

<form><!-- ii) -->
  <orth>notify</orth>
  <gramGrp><colloc>sth.</colloc></gramGrp>
  <form type="alternate">
     <orth>notify</orth>
     <gramGrp><colloc>sb.</colloc></gramGrp>
  </form>
</form>
<!-- OR iii) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth.</colloc>
    <colloc>sb.</colloc>
  </gramGrp>
</form>
<!-- OR iv) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth./sb.</colloc>
  </gramGrp>
</form>

17.2) collocates' case/pos // was 15)

How to encode "{+Gen.}", indicating that an object in the genitive case should follow?
- Special case: "{wo?, wann? +Dat.}" -- further enriched with corresponding interrogative pronoun(s)
- Similarly for POS: "{+conj}"
Option:
- <colloc>[+ Gen.]</colloc> (where "Gen." might be changed to something else)
  - Derived from TEI Lex-0
  - [] is not very nice.
  - Likely use a non-language-specific case-abbreviation (i.e., "gen")
- <colloc type="case">
  - Would require a corresponding type for regular collocates, such as in the TEI Guidelines' example "médire de".
    - Option: @type="plain".
See also: 17), in particular @type="left".

18) Grouping of annotations

Consider "[formal/Am.]" vs. "[formal] [Am.]".
- The former indicates a disjunction, the latter a conjunction of the two annotations.
- Also possible with grammar annotations.
Q: How to differentiate?
Options:
- a) Don't.
- b) For grammar annotations: Several gramGrps.
- c) Literal retaining of the slash (or similar separator).
  - May prevent setting a common @type (as in the example above).
- d) Something like <choice> for disjunctions.

19) Q: Which content should grammar elements have?

Options
- Short English forms from shared/FreeDict_ontology.xml
- Anything, but link to that ontology, as done in eng-pol.tei.

20) Alternatives in a headword or translation (</>)

Example: "biological breakdown/degradation"
Q: How to encode?
Options:
- literally
- derive two distinct headwords/translations
  - headwords:
    - link with xr/ref
    - sub-form with @type="alternate" or similar.
  - translations: separate cit elements
- Something else (e.g. something like choice)
  - likely only an option for translations.

22) Q: Are entries without translations permitted?

A: Only if they contain some information within a sense, such as a reference (<ref>).
- gramGrp or inflected forms alone are insufficient.
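
A sketch of an entry that has no translation but would still qualify under this answer. Headword, target id and the @type value are illustrative assumptions.

```xml
<entry>
  <form><orth>colour</orth></form>
  <sense>
    <!-- no <cit type="trans">, but the sense carries a reference -->
    <xr type="cf"><ref target="#color">color</ref></xr>
  </sense>
</entry>
```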

23) Q: What about several `subc`?

Cases
- a) same main part: "v/trans" + "v/intr"
  - Example: "essen {vt;vi}"
- b) different main part (awkward): "v/trans" + "pron/rel"
Options
- a.1) One pos followed by several subc.
- *.2) Two pairs of pos, subc
- *.3) two gramGrp
- *.4) only (two) pos, content e.g. "vt".
- a.5) `trans/intr`
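
For case a), options a.1) and *.3) would look roughly as follows. Neither is endorsed here, since the question is still open; the values `v`, `trans` and `intr` are assumed from the notation above.

```xml
<!-- a.1) one pos followed by several subc -->
<gramGrp>
  <pos>v</pos>
  <subc>trans</subc>
  <subc>intr</subc>
</gramGrp>

<!-- *.3) two gramGrp, one per pos/subc pair -->
<gramGrp>
  <pos>v</pos>
  <subc>trans</subc>
</gramGrp>
<gramGrp>
  <pos>v</pos>
  <subc>intr</subc>
</gramGrp>
```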

24) [answered] `form` `@type`: `infl` vs. `inflected`

Status quo
- ML, Wiki, lg1-lg2.tei: infl
- TEI Guidelines, TEI Lex-0: inflected
A: infl
- FreeDict-TEI specific
- Consider to change someday.

25.1) [answered] `usg` inside `form[@type="inflected"]`?

A: OK.

25.2) `usg` inside `colloc`?

Not permitted by the TEI Guidelines.
- Neither is usg inside usg (where the latter might have @type="colloc").
Example (from Ding): "{prp; +Gen.; +Dat. [ugs.]}"
- See also 17.2) on why "+Dat." becomes a colloc element.

25.3) [partially answered] Which information to annotate to translations (`<cit type="trans" />`)?

Possible annotations
- [answered] usg
  - Depending on @type?
  - Q: Use nested cit instead?
    - eng-pol has e.g.: <cit type="colloc">
- gramGrp
  - Q: Exclude information that can be safely derived from the corresponding source language's gramGrp?
- colloc -- probably yes
- note
  - Example: "Kleinbären {pl} (Procyonidae) (zoologische Familie) [zool.] :: procyonids (zoological family)"
    - Suggestion: first two () become <note>s inside entry/sense, the last one a <note> inside <cit type="trans">.
- [answered] Abbreviations
  - How?
- inflected forms // was 26)
  - Likely yes.
  - How?
- [answered] examples // was 21)
  - (It's common to have an example for a headword, together with a translation.)
  - (Question is, what about examples particular to the translation.)
  - Likely realisation: <cit type="trans"><quote /><cit type="example" /></cit>
Answers
- Anything that is valid TEI is OK.
- abbreviations: cit[@type="abbrev"]
- examples: options:
  - a) Even if particular to the translation, keep on the <sense> level.
    - a.1) <cit type="example"><quote xml:lang="SRCLANG" /><quote xml:lang="TGTLANG" /></cit>
    - a.2) <cit type="example"><quote xml:lang="SRCLANG" /><cit type="trans" xml:lang="TGTLANG"><quote xml:lang="TGTLANG" /></cit></cit>
    - (There may be several more quote elements.)
  - b) Add inside <cit type="trans">, next to the <quote> element.
    - Translation in the source language may be added, within a nested <cit type="trans">, like in a.2).

27) Singulare/plurale tantum

A singulare tantum is a noun that occurs only in the singular; a plurale tantum occurs only in the plural.
Q: How to encode?
Likely: <num>pl</num><subc>no sg</subc> (plurale tantum)

discussion TEI - freedict/fd-dictionaries GitHub Wiki
