discussion TEI - freedict/fd-dictionaries GitHub Wiki

Discussion on (FreeDict) TEI

Notes:

  • Points 1-9 are directly taken from an ML thread, numbered A1-A9 there.
    • Points 10-X also stem from around that thread.

1) TEI Lex-0.

  • Questions

    • What about the TEI Lex-0 standard?
    • Should it be followed?
  • Examples

    • a) <gram type="gender"/> instead of <gen/>.
    • b) <usg> with @type (and possibly @norm)
  • Potential advantages

    • good, fixed list of usg types (see this comparison table)
      • The useful @types textType and attitude have no equivalents in the TEI Guidelines' suggested values.
        • textType examples: bibl., poet., admin., journalese
        • attitude examples: derog., euph.
    • Requirement to fully annotate with @xml:id and @xml:lang
  • Further questions:

    • Should textType and attitude just be borrowed from TEI Lex-0?
    • Where to annotate with @xml:id and @xml:lang?
  • Answers

    • The FreeDict conversion style sheets do not support TEI Lex-0. (FreeDict TEI is in parts incompatible with TEI Lex-0)
    • "It all boils down to somebody reading the document, defining our specific requirements and potentially modification and implementing it." / @shumenda
    • The TEI Lex-0 guidelines may be used in addition wherever they do not contradict the FreeDict or TEI guidelines.
    • TEI Lex-0 is meant to encode retrodigitized dictionaries including presentational information, while FreeDict TEI is not concerned with such.
    • Consider switching someday to another (related) standard: ISO LMF-4.
  • See also: this thread on the mailing list.
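A minimal sketch of the two example differences listed above. The element content (m, poet.) is illustrative only and not taken from any FreeDict dictionary:

```xml
<!-- a) gender: FreeDict TEI uses <gen>, TEI Lex-0 uses <gram type="gender"> -->
<gramGrp><gen>m</gen></gramGrp>
<gramGrp><gram type="gender">m</gram></gramGrp>

<!-- b) TEI Lex-0 style usage label with a @type from its fixed list -->
<usg type="textType">poet.</usg>
```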

2) [answered] Verb & Transitivity annotation.

  • Status quo

    • In a HowTo, it is suggested to use v,vt,vi,vti, i.e., merge all such information into a single token.
    • In an example, there is "vtr", which would also adhere to TEI Lex-0, in contrast to the former.
  • Question: How to annotate transitivity information?

  • Answer: The use of subc is strongly recommended.
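As a rough illustration of the answer, the merged token "vt" could be split into pos plus subc as follows. The values "v" and "trans" are assumptions, modelled on the "v/trans" notation used in point 23):

```xml
<!-- hypothetical values: "v" for verb, "trans" for transitive -->
<gramGrp>
  <pos>v</pos>
  <subc>trans</subc>
</gramGrp>
```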

3) [answered] IPA Pronunciation.

  • Question: How can I enrich my dictionary with pronunciation, as annotated in <pron> tags?

  • Answer: Unless already present, the standard build process (using make) adds phonetic information with the teiaddphonetics script, which internally uses espeak[-ng].
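A sketch of what this presumably means for a single form, assuming the script places the <pron> element next to <orth>. The headword and the transcription are illustrative and were not produced by the actual script:

```xml
<!-- as written by the author: no <pron> -->
<form><orth>house</orth></form>

<!-- after the build (illustrative transcription) -->
<form><orth>house</orth><pron>haʊs</pron></form>
```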

4) Normalization of usage annotations

  • Question: Should usage annotations (the content of <usg> tags) be normalized?

    • different languages (e.g. "[Sprw.]" ~ "[prov.]")
    • same language (e.g. "[coll.]" ~ "[slang]")
  • Notes:

    • Recommended by TEI Lex-0.
    • Using @norm in <usg> might make this less of an issue.
  • Sub-questions

  • Answers

    • An ontology should be defined.
      • Questions:
        • Similar to / linked to shared/FreeDict_ontology.xml?
          • This seems to only allow linking equivalent annotations in different languages, however not "coll." and "slang" (if these should even be considered equivalent).
        • Where to find documentation on writing such an ontology?
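One possible way @norm could help, sketched under assumptions: the original label stays as element content, while @norm carries a shared normalized value. The @type value "reg" and the choice of "coll." as the normalized form are hypothetical:

```xml
<!-- German source label, normalized to a shared value -->
<usg type="reg" norm="coll.">ugs.</usg>

<!-- English source label, same normalized value -->
<usg type="reg" norm="coll.">coll.</usg>
```

Whether "coll." and "slang" should share a normalized value remains the open question noted above.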

5) Quantified (or similar) usage annotations

  • Examples

    • "mainly Am."
    • "bes. Süddt.", "especially Am."
  • Question

    • How to represent the determiner ("mainly", "bes.", ...)?
  • Notes

    • TEI Lex-0 suggests a separate attribute, but does not say which (there is a TODO in the doc).
      • None of the <usg> annotations really fit, maybe @subtype?
  • Answer

    • Likely the easiest: <usg type="hint">mainly Am.</usg>

6) Regional / dialect / language annotations.

  • classes of such annotation

    • a) dialect
      • Ex.: "[Br.]", "[Am.]", "[Ös.]", "[Sächs.]"
      • distinction from b) partially unclear (e.g., "Am.")
    • b) Region or country
      • Ex.: "[South Africa]", "[Hessen]", "[Berlin]", "[Wien]"
    • c) language
      • Ex.: "[French]", "[Lat.]"
  • Questions

    • How to annotate/distinguish the above classes?
  • Notes

    • TEI Lex-0: usg[@type="geographic"]: "marker which identifies the place or region where a lexical unit is mainly used"
      • Matches b), potentially partly a).
  • Answers

    • a), b): usg[@type="geo"]
    • c): usg[@type="lang"]
    • Alternatively: craft a new type and document it in the header (usg types may be freely chosen according to the TEI Guidelines).
      • Also consider adopting such a new type in the FreeDict guidelines.
      • Use plain text but name the tag and attribute name explicitly.
    • Consider using a list of languages (e.g., this).
  • Notes example (<p> elements and lists are both fine):

<notesStmt>
  <note type="status">small</note> <!-- mandatory for our DB -->
  <note xml:lang="de"> <!-- can be freely chosen -->
    <list><item>blah</item></list>
  </note>
</notesStmt>
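The answers for classes a) to c) above can be sketched as follows, reusing labels from the examples in this point:

```xml
<usg type="geo">Br.</usg>    <!-- a) dialect -->
<usg type="geo">Wien</usg>   <!-- b) region or country -->
<usg type="lang">Lat.</usg>  <!-- c) language -->
```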

7) [answered] Abbreviations.

  • Cases

    • a) Headwords, which are annotations.
      • rare
    • b) Annotated on headwords.
  • Question: How to represent in TEI?

  • Notes

    • The TEI Guidelines contain an example with both <form type="abbrev"> and <form type="full">, in the same <entry>.
    • The TEI Guidelines also offer <abbr> and <expan>, possibly grouped in <choice>.
      • These seem to be rather intended for encoding inside of prose.
  • Answers

    • An entry should only contain a single form tag.
    • An entry/form may contain a nested form[@type="abbrev"] element.
    • In the case of a standalone abbreviation, the corresponding form element right below entry should be annotated with @type="abbrev".
      • potential issue: Shouldn't the topmost form elements have @type="lemma"?
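A minimal sketch of both answers. The headword and abbreviation are illustrative, and senses are omitted for brevity:

```xml
<!-- b) abbreviation annotated on a headword: nested form -->
<entry>
  <form>
    <orth>United Nations</orth>
    <form type="abbrev"><orth>UN</orth></form>
  </form>
  <!-- senses omitted -->
</entry>

<!-- a) standalone abbreviation as headword: @type on the top-level form -->
<entry>
  <form type="abbrev"><orth>UN</orth></form>
  <!-- senses omitted -->
</entry>
```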

8) [answered] entry/sense/gramGrp vs entry/gramGrp

  • Answer: Both are fine (also in parallel).
    • Consider putting gramGrp inside form when there is also one in sense.

9) Header

9.1) fileDesc/publicationStmt/license

  • Question: Currently <availability> is suggested and used exclusively (for licensing information). Why not <license>?

  • Answer: The style sheets do not permit <license>, the validation would hence fail.

    • Consider changing this in a future style sheet update.
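For orientation, a sketch of licensing information expressed with <availability>. The publisher name, the @status value and the placeholder text are assumptions; the actual licence wording depends on the dictionary:

```xml
<publicationStmt>
  <publisher>FreeDict</publisher> <!-- placeholder -->
  <availability status="free">
    <p>LICENCE NAME OR TEXT GOES HERE</p>
  </availability>
</publicationStmt>
```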

9.2) [imported dictionaries] Date of imported source

  • Q: Where to annotate a date specific to the source from which the final TEI was imported?

  • A: Annotate within sourceDesc.

    • Q: As plain text?

9.3) [answered] fileDesc/publicationStmt/pubPlace

  • HowTo: <ref>https://freedict.org/</ref>

  • (example) TEI: <ref target="http://freedict.org/">http://freedict.org/</ref>

  • A: The HowTo is right.

9.6) [answered] [imported dictionaries] fileDesc/editionStmt/edition (version)

  • Question: What to use when the TEI output is both influenced by a source's version and an importer's version?

  • Answers

    • Whatever works or seems logical.
    • Options: srcver.importerver | date | srcver | srcver.date

9.7) [answered] [imported dictionaries] fileDesc/titleStmt/editor

  • Q: Set author of importer as editor?
    • TEI Guidelines: "[...] acting as editor, compiler, translator, etc."
  • A: Permitted.

10) [answered] Q: Should xr/ref have content, or can a @target suffice?

A: Content!
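In other words, a cross-reference would carry visible text in addition to its @target. The @type value and the target id below are hypothetical:

```xml
<xr type="syn">
  <ref target="#house">house</ref>
</xr>
```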

11) Grouping of homographs

  • Options:

    • superEntry/entry
    • entry/sense
    • entry/hom
    • entry/entry
      • illegal in (FreeDict) TEI, suggested in TEI Lex-0.
  • Q: Is superEntry ok?

    • A: No. "It doesn't seem necessary at all and is on its way out, in general." / @bansp
    • A: Not handled by the style sheets. Also, hom is ignored.
  • Q: [imported dictionaries] What if it is not clear from the source whether two homographs qualify as senses of the same word?

    • Note: The "Ding" dictionary contains many words repeatedly, usually with (close to) identical meaning.
    • Q: If grouping, what to do with potentially differing annotations, including abbreviations, gramGrp, inflected forms?
      • Q: Only keep what applies to all on the top level?
      • Q: Are all the tags valid e.g. on the sense level?

12) [answered] Presentational information

  • Examples: "{v}" - the braces, ";", "~" - for references
  • A: Drop

13) [answered] "to" prefix for verbs

  • A: Drop.

14) [answered] Multiple genders

  • Ex.: "Avis {m,n}" (German)
  • A: Two <gen> in a single gramGrp.
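Applied to the example above, this could look as follows (the gender values are copied from the source notation; other grammar information is omitted):

```xml
<entry>
  <form><orth>Avis</orth></form>
  <gramGrp>
    <gen>m</gen>
    <gen>n</gen>
  </gramGrp>
  <!-- senses omitted -->
</entry>
```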

16) [answered] Encoding of plain text annotations on headwords (and translations)

  • Examples:

    • "bread (baked in an oven)"
    • "bread (wheat product)"
  • Options:

    • <note>
    • <usg> -- @type="hint"?
      • Usually used for more specific usages, e.g. "Am.", "med.".
    • <def>
  • Answers:

    • [imported dictionary] When indistinguishable, use <note>
    • When writing by hand, try to distinguish (def, usg with specific @type).
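A sketch of the <note> fallback for the first example. Placing the note inside <sense> follows the suggestion in 25.3) and is an assumption here; the translation is illustrative:

```xml
<entry>
  <form><orth>bread</orth></form>
  <sense>
    <note>baked in an oven</note>
    <cit type="trans"><quote>Brot</quote></cit>
  </sense>
</entry>
```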

17) Collocates

  • Cases:

    • a) case information: "wegen {+Gen.}"
      • see 17.2)
    • b) auxiliary words representing an object
      • b.1) suffixing: "eat sth."
      • b.2) prefixing: "etw. essen"
      • b.3) alternatives: "notify sth./sb."
        • b.3.1) switchable words: "to file away <> sth." (indicating the alternatives "to file away sth." and "to file sth. away")
      • b.4) several: "give sth. to sb."
        • potentially both prefixing and suffixing
    • c) specific word(s)
      • c.1) suffixing: "dismounting (of a machine)"
      • c.2) prefixing
      • c.3) combinations
    • d) combinations of a), b), c)
  • Available tags

    • <colloc> (occurs in <gramGrp>)
      • attribute @type="left"?
        • possible conflict with @type as suggested in 17.2).
    • <usg type="colloc">
      • attribute @subtype="left"?
    • <cit type="colloc">
      • Nested inside <cit type="trans">, seen in eng-pol.
        • See also 25.3)
  • Answers

    • For a), see 17.2).
    • b) colloc
    • c) usg[@type="colloc"]
  • Proposed answers:

    • b.i): <colloc>. This is grammar information.
    • b.ii): @type or @subtype with value obj (or similar).
    • c): <usg type="colloc">/<cit type="colloc">. This is not grammar information.
    • location: @subtype="left" resp. "right".
    • order: keep both <colloc> and <usg type="colloc"> (resp. cit) in the original order.
      • Keeping the order of the union of both is impossible with the given suggestion, but things like "(of a machine)" are supposed to be optional anyways.
    • b.3) (alternatives)
      • i) group in <choice> or similar.
      • ii-iv) see below
      • v) Use @n to define an order. Interchangeable collocates get the same @n.
      • iii) conflicts with several subsequent <colloc>s
<form><!-- ii) -->
  <orth>notify</orth>
  <gramGrp><colloc>sth.</colloc></gramGrp>
  <form type="alternate">
     <orth>notify</orth>
     <gramGrp><colloc>sb.</colloc></gramGrp>
  </form>
</form>
<!-- OR iii) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth.</colloc>
    <colloc>sb.</colloc>
  </gramGrp>
</form>
<!-- OR iv) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth./sb.</colloc>
  </gramGrp>
</form>
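Option v) has no example above. A sketch under the proposal's own terms, where equal @n values mark interchangeable collocates, followed by the proposed encoding for case c.1). Both are proposals from this discussion, not established FreeDict practice:

```xml
<!-- v) "notify sth./sb.": both collocates share n="1" -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc n="1">sth.</colloc>
    <colloc n="1">sb.</colloc>
  </gramGrp>
</form>

<!-- c.1) "dismounting (of a machine)": not grammar information -->
<usg type="colloc" subtype="right">of a machine</usg>
```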

17.2) collocates' case/pos // was 15)

  • How to encode "{+Gen.}", indicating that an object in the genitive case should follow?

    • Special case: "{wo?, wann? +Dat.}" -- further enriched with corresponding interrogative pronoun(s)
    • Similarly for POS: "{+conj}"
  • Option:

    • <colloc>[+ Gen.]</colloc> (where "Gen." might be changed to something else)
      • Derived from TEI Lex-0
      • [] is not very nice.
      • Likely use a non-language-specific case-abbreviation (i.e., "gen")
    • <colloc type="case">
      • Would require a corresponding type for regular collocates, such as in the TEI Guidelines' example "médire de".
        • Option: @type="plain".
  • See also: 17), in particular @type="left".
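The second option, sketched for "wegen {+Gen.}" with a non-language-specific case abbreviation. The value "gen" and the @type value "case" are the proposals from above, not settled conventions:

```xml
<entry>
  <form><orth>wegen</orth></form>
  <gramGrp>
    <colloc type="case">gen</colloc>
  </gramGrp>
  <!-- senses omitted -->
</entry>
```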

18) Grouping of annotations

  • Consider "[formal/Am.]" vs. "[formal] [Am.]".

    • The former indicates a disjunction, the latter a conjunction of the two annotations.
    • Also possible with grammar annotations.
  • Q: How to differentiate?

  • Options:

    • a) Don't.
    • b) For grammar annotations: Several gramGrps.
    • c) Literal retaining of the slash (or similar separator).
      • May prevent setting a common @type (as in the example above).
    • d) Something like <choice> for disjunctions.

19) Q: Which content should grammar elements have?

  • Options
    • Short English forms from shared/FreeDict_ontology.xml
    • Anything, but link to that ontology, as done in eng-pol.tei.

20) Alternatives in a headword or translation (</>)

  • Example: "biological breakdown/degradation"

  • Q: How to encode

  • Options:

    • literally
    • derive two distinct headwords/translations
      • headwords:
        • link with xr/ref
        • sub-form with @type="alternate" or similar.
      • translations: separate cit elements
    • Something else (e.g. something like choice)
      • likely only an option for translations.
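If the example is a translation, the "two distinct translations" option could be sketched like this, with the alternatives expanded by hand:

```xml
<sense>
  <cit type="trans"><quote>biological breakdown</quote></cit>
  <cit type="trans"><quote>biological degradation</quote></cit>
</sense>
```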

22) Q: Are entries without translations permitted?

  • A: Only if they contain any information within a sense, such as a reference (<ref>).
    • only gramGrp or inflected forms are insufficient.
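A sketch of an entry that would qualify under this answer: it has no translation, but its sense carries a reference. The headwords, the @type value and the target id are hypothetical:

```xml
<entry>
  <form><orth>colour</orth></form>
  <sense>
    <xr type="see"><ref target="#color">color</ref></xr>
  </sense>
</entry>
```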

23) Q: What about several subc?

  • Cases

    • a) same main part: "v/trans" + "v/intr"
      • Example: "essen {vt;vi}"
    • b) different main part (awkward): "v/trans" + "pron/rel"
  • Options

    • a.1) One pos followed by several subc.
    • *.2) Two pairs of pos, subc
    • *.3) two gramGrp
    • *.4) only (two) pos, content e.g. "vt".
    • a.5) trans/intr
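Two of the options for case a), sketched for "essen {vt;vi}". The values "v", "trans" and "intr" are assumptions derived from the notation above:

```xml
<!-- a.1) one pos followed by several subc -->
<gramGrp><pos>v</pos><subc>trans</subc><subc>intr</subc></gramGrp>

<!-- *.3) two gramGrp -->
<gramGrp><pos>v</pos><subc>trans</subc></gramGrp>
<gramGrp><pos>v</pos><subc>intr</subc></gramGrp>
```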

24) [answered] form @type: infl vs. inflected

  • Status quo

    • ML, Wiki, lg1-lg2.tei: infl
    • TEI Guidelines, TEI Lex-0: inflected
  • A: infl

    • FreeDict-TEI specific
    • Consider changing this someday.
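With the FreeDict-specific value, an inflected form nested under the headword might look like this (the word pair is illustrative):

```xml
<form>
  <orth>go</orth>
  <form type="infl"><orth>went</orth></form>
</form>
```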

25.1) [answered] usg inside form[@type="inflected"]?

  • A: OK.

25.2) usg inside colloc?

  • Not permitted by the TEI Guidelines.
    • Neither is usg inside usg (where the latter might have @type="colloc").
  • Example (from Ding): "{prp; +Gen.; +Dat. [ugs.]}"
    • See also 17.2) on why "+Dat." becomes a colloc element.

25.3) [partially answered] Which information to annotate to translations (<cit type="trans" />)?

  • Possible annotations

    • [answered] usg
      • Depending on @type?
      • Q: Use nested cit instead?
        • eng-pol has e.g.: <cit type="colloc">
    • gramGrp
      • Q: Exclude information that can be safely derived from the corresponding source language's gramGrp?
    • colloc -- probably yes
    • note
      • Example: "Kleinbären {pl} (Procyonidae) (zoologische Familie) [zool.] :: procyonids (zoological family)"
        • Suggestion: first two () become <note>s inside entry/sense, the last one a <note> inside <cit type="trans">.
    • [answered] Abbreviations
      • How?
    • inflected forms // was 26)
      • Likely yes.
      • How?
    • [answered] examples // was 21)
      • (It's common to have an example for a headword, together with a translation.)
      • (Question is, what about examples particular to the translation.)
      • Likely realisation: <cit type="trans"><quote /><cit type="example" /></cit>
  • Answers

    • Anything that is valid TEI is OK.
    • abbreviations: cit[@type="abbrev"]
    • examples: options:
      • a) Even if particular to the translation, keep on the <sense> level.
        • a.1) <cit type="example"><quote xml:lang="SRCLANG" /><quote xml:lang="TGTLANG" /></cit>
        • a.2) <cit type="example"><quote xml:lang="SRCLANG" /><cit type="trans" xml:lang="TGTLANG"><quote xml:lang="TGTLANG" /></cit></cit>
        • (There may be several more quote elements.)
      • b) Add inside <cit type="trans">, next to the <quote> element.
        • Translation in the source language may be added, within a nested <cit type="trans">, like in a.2).
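Option a.1) with concrete content, as a sketch. The language codes and phrases are illustrative stand-ins for SRCLANG and TGTLANG:

```xml
<sense>
  <cit type="trans"><quote>Brot</quote></cit>
  <!-- a.1) example kept on the sense level, one quote per language -->
  <cit type="example">
    <quote xml:lang="en">a loaf of bread</quote>
    <quote xml:lang="de">ein Laib Brot</quote>
  </cit>
</sense>
```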

27) Singulare/plurale tantum

  • A noun that occurs only in the singular or only in the plural form, respectively.
  • Q: How to encode?
  • Likely: <num>pl</num><subc>no sg</subc> (plurale tantum)