discussion TEI - freedict/fd-dictionaries GitHub Wiki
- Points 1-9 are taken directly from a mailing-list (ML) thread, where they are numbered A1-A9.
- Points 10-X also stem from around that thread.
-
Questions
- What about the TEI Lex-0 standard?
- Should it be followed?
-
Examples
- a) `<gram type="gender"/>` instead of `<gen/>`.
- b) `<usg>` with `@type` (and possibly `@norm`)
-
Potential advantages
- A good, fixed list of `usg` types (see this comparison table)
  - The useful `@type`s `textType` and `attribute` have no equivalents in the TEI Guidelines' suggested values.
    - `textType` examples: bibl., poet., admin., journalese
    - `attribute` examples: derog., euph.
- Requirement to fully annotate with `@xml:id` and `@xml:lang`
-
Further questions:
- Should `textType` and `attribute` just be borrowed from TEI Lex-0?
- Where to annotate with `@xml:id` and `@xml:lang`?
-
Answers
- The FreeDict conversion style sheets do not support TEI Lex-0. (FreeDict TEI is partly incompatible with TEI Lex-0.)
- "It all boils down to somebody reading the document, defining our specific requirements and potentially modification and implementing it." / @shumenda
- The TEI Lex-0 guidelines may be used in addition wherever they do not contradict the FreeDict or TEI guidelines.
- TEI Lex-0 is meant to encode retrodigitized dictionaries including presentational information, while FreeDict TEI is not concerned with such.
- Consider switching someday to another (related) standard: ISO LMF-4
  - No public information yet.
  - The ISO standard is not available for free.
  - There is a skeletal example document.
-
See also: this thread on the mailing list.
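As a rough illustration of examples a) and b) above, the fragments below contrast the two encoding styles. This is only a sketch: the gender values `neut` and `neuter`, the label `poet.` and the `@norm` value are assumed for illustration and are not quoted from either set of guidelines.

```xml
<!-- FreeDict TEI / TEI Guidelines style: dedicated <gen> element -->
<gramGrp>
  <pos>n</pos>
  <gen>neut</gen>
</gramGrp>

<!-- TEI Lex-0 style: generic <gram> element distinguished by @type -->
<gramGrp>
  <gram type="pos">noun</gram>
  <gram type="gender">neuter</gram>
</gramGrp>

<!-- TEI Lex-0 style usage label with @type (and optionally @norm) -->
<usg type="textType" norm="poetic">poet.</usg>
```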
-
Status quo
-
Question: How to annotate transitivity information?
-
Answer: The use of `subc` is strongly recommended.
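A minimal sketch of what this could look like in an entry. The headword, the translation and the values `v` and `trans` are assumptions chosen for illustration.

```xml
<entry>
  <form><orth>essen</orth></form>
  <gramGrp>
    <pos>v</pos>
    <!-- transitivity goes into <subc>, next to the part of speech -->
    <subc>trans</subc>
  </gramGrp>
  <sense>
    <cit type="trans"><quote>eat</quote></cit>
  </sense>
</entry>
```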
-
Question: How can I enrich my dictionary with pronunciation, as annotated in `<pron>` tags?
-
Answer: Unless already present, the standard build process, using `make`, adds phonetics information using the teiaddphonetics script (which internally uses espeak[-ng]).
-
Question: Should usage annotations (the content of `<usg>` tags) be normalized?
- different languages (e.g. "[Sprw.]" ~ "[prov.]")
- same language (e.g. "[coll.]" ~ "[slang]")
-
Notes:
- Recommended by TEI Lex-0.
- The usage of `@norm` in `<usg>` might render this less of an issue.
-
Sub-questions
- Should they be normalised to a single label?
- Should they be normalised to some standard labels?
- ISO 12620 (cf. Wikipedia:Registers) (full standard only commercially available)
-
Answers
- An ontology should be defined.
  - Questions:
    - Similar to / linked to `shared/FreeDict_ontology.xml`?
      - This seems to only allow linking equivalent annotations in different languages, but not "coll." and "slang" (if these should even be considered equivalent).
    - Where to find documentation on writing such an ontology?
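The `@norm` approach mentioned in the notes could look roughly as follows. The `@type` values `reg` and `style` and the normalised forms are assumptions for this sketch; which labels count as equivalent is exactly the open question.

```xml
<!-- German source label, normalised to an assumed English short form -->
<usg type="reg" norm="coll.">ugs.</usg>

<!-- Proverb markers from two languages mapped to the same assumed value -->
<usg type="style" norm="prov.">Sprw.</usg>
<usg type="style" norm="prov.">prov.</usg>
```

The original label stays in the element content, so nothing from the source is lost if the normalised value later changes.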
-
Examples
- "mainly Am."
- "bes. Süddt.", "especially Am."
-
Question
- How to represent the determiner ("mainly", "bes.", ...)?
-
Notes
- TEI Lex-0 suggests a separate attribute, but not which one (there is a TODO in the doc).
  - None of the `<usg>` annotations really fit, maybe `@subtype`?
-
Answer
- Likely the easiest: `<usg type="hint">mainly Am.</usg>`
-
Classes of such annotations
- a) Dialect
  - Ex.: "[Br.]", "[Am.]", "[Ös.]", "[Sächs.]"
  - Distinction from b) partially unclear (e.g., "Am.")
- b) Region or country
  - Ex.: "[South Africa]", "[Hessen]", "[Berlin]", "[Wien]"
- c) Language
  - Ex.: "[French]", "[Lat.]"
-
Questions
- How to annotate/distinguish the above classes?
-
Notes
- TEI Lex-0: `usg[@type="geographic"]`: "marker which identifies the place or region where a lexical unit is mainly used"
  - Matches b), potentially partly a).
-
Answers
- a), b): `usg[@type="geo"]`
- c): `usg[@type="lang"]`
  - See the corresponding section of the TEI Guidelines.
  - Alternatively: Craft a new type and document it in the header (`usg` types may be freely chosen according to the TEI Guidelines).
    - Also consider adopting such a new type in the FreeDict guidelines.
    - Use plain text but name the tag and attribute name explicitly.
  - Consider using a list of languages (e.g., this).
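Applied to the examples from the three classes, the answers above would give something like the following sketch. The labels are taken from the examples above; their placement inside an entry is left out.

```xml
<!-- a) dialect and b) region or country: both typed "geo" -->
<usg type="geo">Am.</usg>
<usg type="geo">Berlin</usg>

<!-- c) language: typed "lang" -->
<usg type="lang">Lat.</usg>
```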
-
Notes example (`p`s and `list`s are both fine):
<notesStmt>
  <note type="status">small</note> <!-- mandatory for our DB -->
  <note xml:lang="de"> <!-- can be freely chosen -->
    <list><item>blah</item></list>
  </note>
</notesStmt>
-
Cases
- a) Headwords, which are annotations.
  - rare
- b) Annotated on headwords.
-
Question: How to represent in TEI?
-
Notes
-
Answers
- An `entry` should only contain a single `form` tag.
- An `entry/form` may contain a nested `form[@type="abbrev"]` element.
- In the case of a standalone abbreviation, the corresponding `form` element right below `entry` should be annotated with `@type="abbrev"`.
  - Potential issue: Shouldn't the topmost `form` elements have `@type="lemma"`?
- Answer: Both are fine (also in parallel).
  - Consider putting `gramGrp` inside `form`, when also in `sense`.
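The two abbreviation encodings described in the answers (nested `form[@type="abbrev"]` and a standalone abbreviation) might be sketched like this. The headwords and translations are made up for illustration.

```xml
<!-- Abbreviation annotated on a headword: nested form inside the single form -->
<entry>
  <form>
    <orth>for example</orth>
    <form type="abbrev">
      <orth>e.g.</orth>
    </form>
  </form>
  <sense>
    <cit type="trans"><quote>zum Beispiel</quote></cit>
  </sense>
</entry>

<!-- Standalone abbreviation: the form right below entry carries the type -->
<entry>
  <form type="abbrev">
    <orth>e.g.</orth>
  </form>
  <sense>
    <cit type="trans"><quote>z. B.</quote></cit>
  </sense>
</entry>
```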
-
Question: Currently `<availability>` is suggested and used exclusively (for licensing information). Why not `<license>`?
-
Answer: The style sheets do not permit `<license>`; validation would hence fail.
  - Consider changing this in a future style sheet update.
-
Q: Where to annotate a date specific to the source that the final TEI was imported from?
-
A: Annotate within `sourceDesc`.
  - Q: As plain text?
-
HowTo: `<ref>https://freedict.org/</ref>`
-
(example) TEI: `<ref target="http://freedict.org/">http://freedict.org/</ref>`
-
A: The HowTo is right.
-
Question: What to use when the TEI output is both influenced by a source's version and an importer's version?
-
Answers
- Whatever works or seems logical.
- Options: srcver.importerver | date | srcver | srcver.date
- Q: Set author of importer as editor?
- TEI Guidelines: "[...] acting as editor, compiler, translator, etc."
- A: Permitted.
A: Content!
-
Options:
- `superEntry/entry`
- `entry/sense`
- `entry/hom`
- `entry/entry`
  - illegal in (FreeDict) TEI, suggested in TEI Lex-0.
-
Q: Is `superEntry` ok?
- A: No. "It doesn't seem necessary at all and is on its way out, in general." / @bansp
- A: Not handled by the style sheets. Also, `hom` is ignored.
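Of the options listed, `entry/sense` is the one that none of the answers above rule out. A minimal sketch, assuming the German headword "Bank" with two unrelated meanings; the `@n` numbering is also an assumption.

```xml
<entry>
  <form><orth>Bank</orth></form>
  <!-- each homograph reading becomes its own sense -->
  <sense n="1">
    <cit type="trans"><quote>bench</quote></cit>
  </sense>
  <sense n="2">
    <cit type="trans"><quote>bank</quote></cit>
  </sense>
</entry>
```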
-
Q: [imported dictionaries] What if it is not clear from the source whether two homographs qualify as senses of the same word?
- Note: The "Ding" dictionary contains many words repeatedly, usually with (close to) identical meaning.
- Q: If grouping, what to do with potentially differing annotations, including abbreviations, `gramGrp`, inflected forms?
  - Q: Only keep what applies to all on the top level?
  - Q: Are all the tags valid e.g. on the `sense` level?
- Examples: "{v}" - the braces, ";", "~" - for references
  - A: Drop.
- Ex.: "Avis {m,n}" (German)
  - A: Two `<gen>` in a single `gramGrp`.
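For the "Avis {m,n}" example, this answer could be written out as below. The gender values `masc` and `neut`, the `pos` value and the translation are assumptions for this sketch.

```xml
<entry>
  <form><orth>Avis</orth></form>
  <gramGrp>
    <pos>n</pos>
    <!-- {m,n}: both genders side by side in one gramGrp -->
    <gen>masc</gen>
    <gen>neut</gen>
  </gramGrp>
  <sense>
    <cit type="trans"><quote>advice note</quote></cit>
  </sense>
</entry>
```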
-
Examples:
- "bread (baked in an oven)"
- "bread (wheat product)"
-
Options:
- `<note>`
- `<usg>` -- `@type="hint"`?
  - Usually used for more specific usages, e.g. "Am.", "med.".
- `<def>`
-
Answers:
- [imported dictionary] When indistinguishable, use `<note>`.
- When writing by hand, try to distinguish (`def`, `usg` with a specific `@type`).
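The two answers might translate into markup like the following sketch. Whether "bread" is the headword or the translation is not stated above, so its placement inside `<cit type="trans">` is an assumption, as is reading "wheat product" as a definition.

```xml
<!-- Imported data: the parenthesis cannot be classified, so it becomes a note -->
<sense>
  <cit type="trans">
    <quote>bread</quote>
    <note>baked in an oven</note>
  </cit>
</sense>

<!-- Hand-written data: the parenthesis is recognised as a definition -->
<sense>
  <def>wheat product</def>
  <cit type="trans"><quote>bread</quote></cit>
</sense>
```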
-
Cases:
- a) case information: "wegen {+Gen.}"
  - see 17.2)
- b) auxiliary words representing an object
  - b.1) suffixing: "eat sth."
  - b.2) prefixing: "etw. essen"
  - b.3) alternatives: "notify sth./sb."
    - b.3.1) switchable words: "to file away <> sth." (indicating the alternatives "to file away sth." and "to file sth. away")
  - b.4) several: "give sth. to sb."
    - potentially both prefixing and suffixing
- c) specific word(s)
  - c.1) suffixing: "dismounting (of a machine)"
  - c.2) prefixing
  - c.3) combinations
- d) combinations of a), b), c)
-
Available tags
- `<colloc>` (occurs in `<gramGrp>`)
  - attribute `@type="left"`?
    - possible conflict with `@type` as suggested in 17.2).
- `<usg type="colloc">`
  - attribute `@subtype="left"`?
- `<cit type="colloc">`
  - Nested inside `<cit type="trans">`, seen in `eng-pol`.
    - See also 25.3)
-
Answers
- For a), see 17.2).
- b) `colloc`
- c) `usg[@type="colloc"]`
-
Proposed answers:
- b.i): `<colloc>`. This is grammar information.
- b.ii): `@type` or `@subtype` with value `obj` (or similar).
- c): `<usg type="colloc">` / `<cit type="colloc">`. This is not grammar information.
- location: `@subtype="left"` or `"right"`, respectively.
- order: keep both `<colloc>` and `<usg type="colloc">` (or `cit`, respectively) in the original order.
  - Keeping the order of the union of both is impossible with the given suggestion, but things like "(of a machine)" are supposed to be optional anyway.
- b.3) (alternatives)
  - i) group in `<choice>` or similar.
  - ii-iv) see below
  - v) Use `@n` to define an order. Interchangeable collocates get the same `@n`.
  - iii) conflicts with several subsequent `<colloc>`s
<!-- ii) -->
<form>
  <orth>notify</orth>
  <gramGrp><colloc>sth.</colloc></gramGrp>
  <form type="alternate">
    <orth>notify</orth>
    <gramGrp><colloc>sb.</colloc></gramGrp>
  </form>
</form>
<!-- OR iii) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth.</colloc>
    <colloc>sb.</colloc>
  </gramGrp>
</form>
<!-- OR iv) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth./sb.</colloc>
  </gramGrp>
</form>
-
How to encode "{+Gen.}", indicating that an object in the genitive case should follow?
- Special case: "{wo?, wann? +Dat.}" -- further enriched with corresponding interrogative pronoun(s)
- Similarly for POS: "{+conj}"
-
Option:
- `<colloc>[+ Gen.]</colloc>` (where "Gen." might be changed to something else)
  - Derived from TEI Lex-0
  - `[]` is not very nice.
  - Likely use a non-language-specific case abbreviation (i.e., "gen")
- `<colloc type="case">`
  - Would require a corresponding type for regular collocates, such as in the TEI Guidelines' example "médire de".
    - Option: `@type="plain"`.
-
See also: 17), in particular `@type="left"`.
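Side by side, the two options for "wegen {+Gen.}" might look as follows. This is a sketch under assumptions: the `pos` value `prep` and the abbreviation `gen` are illustrative, and senses are omitted.

```xml
<!-- Option 1: literal marker inside <colloc> -->
<entry>
  <form><orth>wegen</orth></form>
  <gramGrp>
    <pos>prep</pos>
    <colloc>[+ gen]</colloc>
  </gramGrp>
</entry>

<!-- Option 2: typed <colloc>; regular collocates would then need their own type -->
<entry>
  <form><orth>wegen</orth></form>
  <gramGrp>
    <pos>prep</pos>
    <colloc type="case">gen</colloc>
  </gramGrp>
</entry>
```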
-
Consider "[formal/Am.]" vs. "[formal] [Am.]".
- The former indicates a disjunction, the latter a conjunction of the two annotations.
- Also possible with grammar annotations.
-
Q: How to differentiate?
-
Options:
- a) Don't.
- b) For grammar annotations: Several `gramGrp`s.
- c) Retain the slash (or a similar separator) literally.
  - May make it impossible to set a common `@type` (such as in the example above).
- d) Something like `<choice>` for disjunctions.
- Options
  - Short English forms from `shared/FreeDict_ontology.xml`
  - Anything, but link to that ontology, as done in `eng-pol.tei`.
-
Example: "biological breakdown/degradation"
-
Q: How to encode?
-
Options:
- literally
- derive two distinct headwords/translations
  - headwords:
    - link with `xr`/`ref`
    - sub-`form` with `@type="alternate"` or similar.
  - translations: separate `cit` elements
- Something else (e.g. something like `choice`)
  - likely only an option for translations.
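For the translation side, the "separate `cit` elements" option could be sketched as below. The German headword is hypothetical; the source only gives the English "biological breakdown/degradation".

```xml
<entry>
  <!-- hypothetical headword, not taken from the source -->
  <form><orth>biologischer Abbau</orth></form>
  <sense>
    <!-- "breakdown/degradation" expanded into two full translations -->
    <cit type="trans"><quote>biological breakdown</quote></cit>
    <cit type="trans"><quote>biological degradation</quote></cit>
  </sense>
</entry>
```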
- A: Only if they contain any information within a sense, such as a reference (`<ref>`).
  - Only `gramGrp` or inflected forms are insufficient.
-
Cases
- a) same main part: "v/trans" + "v/intr"
  - Example: "essen {vt;vi}"
- b) different main part (awkward): "v/trans" + "pron/rel"
-
Options
- a.1) One `pos` followed by several `subc`.
- *.2) Two pairs of `pos`, `subc`
- *.3) two `gramGrp`
- *.4) only (two) `pos`, content e.g. "vt".
- a.5) `trans/intr`
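Two of these options written out as a sketch, using the values from the cases above ("v", "trans", "intr", "pron", "rel"); the exact value spellings are assumptions.

```xml
<!-- a.1) one pos followed by several subc, e.g. "essen {vt;vi}" -->
<gramGrp>
  <pos>v</pos>
  <subc>trans</subc>
  <subc>intr</subc>
</gramGrp>

<!-- *.3) two gramGrp, which also covers case b) with different main parts -->
<gramGrp>
  <pos>v</pos>
  <subc>trans</subc>
</gramGrp>
<gramGrp>
  <pos>pron</pos>
  <subc>rel</subc>
</gramGrp>
```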
-
Status quo
- ML, Wiki, `lg1-lg2.tei`: `infl`
- TEI Guidelines, TEI Lex-0: `inflected`
-
A: `infl`
  - FreeDict-TEI specific
  - Consider changing this someday.
- A: OK.
  - Not permitted by the TEI Guidelines.
    - Neither is `usg` inside `usg` (where the latter might have `@type="colloc"`).
  - Example (from Ding): "{prp; +Gen.; +Dat. [ugs.]}"
    - See also 17.2) on why "+Dat." becomes a `colloc` element.
-
Possible annotations
- [answered] `usg`
  - Depending on `@type`?
  - Q: Use nested `cit` instead?
    - `eng-pol` has e.g.: `<cit type="colloc">`
- `gramGrp`
  - Q: Exclude information that can be safely derived from the corresponding source language's `gramGrp`?
- `colloc` -- probably yes
- `note`
  - Example: "Kleinbären {pl} (Procyonidae) (zoologische Familie) [zool.] :: procyonids (zoological family)"
    - Suggestion: first two () become `<note>`s inside `entry/sense`, the last one a `<note>` inside `<cit type="trans">`.
- [answered] Abbreviations
  - How?
- inflected forms // was 26)
  - Likely yes.
- [answered] examples // was 21)
  - (It's common to have an example for a headword, together with a translation.)
  - (The question is what to do about examples particular to the translation.)
  - Likely realisation: `<cit type="trans"><quote /><cit type="example" /></cit>`
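The suggestion for the "Kleinbären" example could be written out roughly as follows. The `@type="dom"` for "[zool.]" and the `pos` value are assumptions, and the "{pl}" marker is left out of this sketch.

```xml
<entry>
  <form><orth>Kleinbären</orth></form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <sense>
    <usg type="dom">zool.</usg>
    <!-- first two parentheses: notes on the sense level -->
    <note>Procyonidae</note>
    <note>zoologische Familie</note>
    <cit type="trans">
      <quote>procyonids</quote>
      <!-- last parenthesis: note attached to the translation -->
      <note>zoological family</note>
    </cit>
  </sense>
</entry>
```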
-
Answers
- Anything that is valid TEI is OK.
- abbreviations: `cit[@type="abbrev"]`
- examples: options:
  - a) Even if particular to the translation, keep on the `<sense>` level.
    - a.1) `<cit type="example"><quote xml:lang="SRCLANG" /><quote xml:lang="TGTLANG" /></cit>`
    - a.2) `<cit type="example"><quote xml:lang="SRCLANG" /><cit type="trans" xml:lang="TGTLANG"><quote xml:lang="TGTLANG" /></cit></cit>`
    - (There may be several more `quote` elements.)
  - b) Add inside `<cit type="trans">`, next to the `<quote>` element.
    - A translation into the source language may be added within a nested `<cit type="trans">`, as in a.2).
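Option a.2) with concrete content might look like the sketch below. The language pair (English to German) and the example sentences are assumed for illustration.

```xml
<sense>
  <cit type="trans"><quote>essen</quote></cit>
  <!-- a.2) example kept on the sense level, with its translation nested -->
  <cit type="example">
    <quote xml:lang="en">We eat at noon.</quote>
    <cit type="trans" xml:lang="de">
      <quote xml:lang="de">Wir essen mittags.</quote>
    </cit>
  </cit>
</sense>
```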
- A singulare tantum or plurale tantum is a noun that only occurs in singular or plural form, respectively.
  - Q: How to encode?
    - Likely: `<num>pl</num><subc>no sg</subc>` (plurale tantum)