ELAN tiers - langdoc/FRechdoc GitHub Wiki

This page documents the ELAN tier structures used by our projects. Adherence to these structures is necessary in order to use the annotation engine scripts, which automatically tokenize and add morphosyntactic annotations.

ELAN Linguistic Types

Mandatory types:

| Name | Stereotype | Purpose |
|---|---|---|
| refT | n.a. | for ref-tiers; no stereotype, independent and time-alignable root nodes |
| orthT | symbolic association | for orth-tiers, exact time-aligned copy of superordinate ref-tier |
| ft-(<…>)T | symbolic association | for ft-tiers into multiple languages (defined with <…>), overall time-aligned copy of the orth-tier (and thus ref-tier) |
| wordT | symbolic subdivision | for word-tiers, overall time-aligned copy of orth-tier (and thus ref-tier), but able to be divided into multiple equal parts |
| lemmaT | symbolic subdivision | for lemma-tiers, overall equally-spaced, time-aligned copy of the word-tier, but able to be divided into multiple equal parts |
| posT | symbolic subdivision | for pos-tiers, overall equally-spaced, time-aligned copy of the word-tier, but able to be divided into multiple equal parts |
| morphT | symbolic subdivision | for morph-tiers, overall equally-spaced, time-aligned copy of the word-tier, but able to be divided into multiple equal parts |
| noteT | symbolic association | directly under reference tier, contains notes which are related to the specific segment under the current reference |
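
As a rough sketch, the mandatory types could be checked against an ELAN .eaf file as follows. This assumes that the file stores each type as a LINGUISTIC_TYPE element with LINGUISTIC_TYPE_ID and CONSTRAINTS attributes, and that the stereotypes are spelled Symbolic_Association and Symbolic_Subdivision there. The function name check_linguistic_types is hypothetical and not part of the annotation engine. ft-(<…>)T is left out because its name depends on the language.

```python
import xml.etree.ElementTree as ET

# Expected stereotype per mandatory type, taken from the table above.
# None means "no stereotype" (an independent, time-alignable root type).
# Assumption: stereotype spellings follow the EAF CONSTRAINTS attribute values.
EXPECTED = {
    "refT": None,
    "orthT": "Symbolic_Association",
    "wordT": "Symbolic_Subdivision",
    "lemmaT": "Symbolic_Subdivision",
    "posT": "Symbolic_Subdivision",
    "morphT": "Symbolic_Subdivision",
    "noteT": "Symbolic_Association",
}


def check_linguistic_types(eaf_xml):
    """Return a list of problems found; the list is empty if all types match."""
    root = ET.fromstring(eaf_xml)
    # Map each type name found in the file to its stereotype (or None).
    found = {
        lt.get("LINGUISTIC_TYPE_ID"): lt.get("CONSTRAINTS")
        for lt in root.iter("LINGUISTIC_TYPE")
    }
    problems = []
    for name, stereotype in EXPECTED.items():
        if name not in found:
            problems.append(f"missing type {name}")
        elif found[name] != stereotype:
            problems.append(
                f"{name}: expected {stereotype}, found {found[name]}"
            )
    return problems
```

The function takes the file content as a string, so it can be tried out on a small hand-written fragment before being pointed at real session files.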

Additional types can be defined for project-specific purposes. Examples of additional types are:

| Name | Stereotype | Purpose |
|---|---|---|
| orth-origT | symbolic association | |
| ft-(<…>)-origT | symbolic association | |
| word-goldT | symbolic subdivision | |
| lemma-goldT | symbolic subdivision | |
| pos-goldT | symbolic subdivision | |
| morph-goldT | symbolic subdivision | |

Definitions and specifications are still unresolved for the following types:

| Name | Stereotype | Purpose | Note |
|---|---|---|---|
| langT | symbolic ?? | for lang-tier; this indicates the language(s) being used in the corresponding syntactic unit | Ideally, this could trigger which set of lang-tech tools to use, but the reality, especially for spoken, endangered languages, is that it is really messy because switching can and often does take place at a variety of syntactic levels (from morpheme-level such as nonce-borrowings with L1-morphology, to phrases, to utterances, to entire stretches of discourse). On the other hand, lang could also probably be populated automatically. Ultimately it will remain somewhat rough, as real language tagging in our data depends on our approach to code-switching and code-mixing annotation models |
| synthT | symbolic ?? | for synth-tiers, overall equally-spaced, time-aligned copy of the word-tier, but able to be divided into multiple equal parts | |

There are some types which are present in some files, but should be removed once a better solution is found:

| Name | Stereotype | Purpose | Note |
|---|---|---|---|
| note(part)T | symbolic association | to contain notes for parts | as parts are probably automatically detected to some degree, there is little reason to have manual notes under them this way |
| note(word)T | symbolic association | to contain word level notes | as tokenization is done automatically, it is not possible to store manual notes directly under a word |

ELAN Tiers and Tier Hierarchy

Minimally required for each speaker:

| Level | Name | Parent Tier | Linguistic Type | Purpose |
|---|---|---|---|---|
| 0 | ref | n.a. | refT | root node, time-aligned annotation units, each provided with a unique number here |
| -1 | orth | ref | orthT | orthographic transcription; this provides the input for the annotation engine |

Optional for each speaker:

| Level | Name | Parent Tier | Linguistic Type | Purpose |
|---|---|---|---|---|
| -2 | ft-(<…>) | orth | ftT | free translation of the annotated text; <…> is replaced with a language code (e.g. eng, rus); can occur multiple times for multiple lingua francas |
| -2 | lang | orth | langT | indicates the language being used in an annotated utterance or part of an annotated utterance; the language name is in English; adheres to the 'languages' list of the controlled vocabulary; this has serious practical problems, described in the notes on the corresponding type above (langT) |

Optional for any tier or as a root node with its own time-alignment:

| Level | Name | Parent Tier | Linguistic Type | Purpose |
|---|---|---|---|---|
| * | note-(<…>) | (<…>) | noteT | provide unstructured text-based notes for any given parent tier specified with <…> |

The following tiers are created automatically by the annotation engine, and their existing annotations are overwritten each time the engine runs. They therefore cannot have other dependent tiers.

| Level | Name | Parent Tier | Linguistic Type | Purpose |
|---|---|---|---|---|
| -2 | word | orth | wordT | preprocessed-tokenized version of the orth-tier; automatically created by the annotation engine |
| -3 | lemma | word | lemmaT | lemma (or lemmata in case of ambiguities) for word form listed in parent tier; automatically created by the annotation engine |
| -3 | morph | word | morphT | morphological category (or categories in case of ambiguities) for word form listed in parent tier; automatically created by the annotation engine |
| -3 | pos | word | posT | part of speech (or parts of speech in case of ambiguities) for word form listed in parent tier; automatically created by the annotation engine |
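
The hierarchy described in this section can be written down as a small lookup table and used to check one speaker's tiers. The following is only a sketch: the names HIERARCHY, REQUIRED, level and check_speaker are hypothetical, tier names are given without the speaker suffix, and the ft-(<…>) and note-(<…>) tiers are omitted because their names are parametrised.

```python
# Expected (parent tier, linguistic type) per tier name, copied from the
# tables in this section. None marks the root node.
HIERARCHY = {
    "ref": (None, "refT"),
    "orth": ("ref", "orthT"),
    "lang": ("orth", "langT"),
    "word": ("orth", "wordT"),
    "lemma": ("word", "lemmaT"),
    "morph": ("word", "morphT"),
    "pos": ("word", "posT"),
}

# Tiers that are minimally required for each speaker.
REQUIRED = ("ref", "orth")


def level(name):
    """Depth below the root, written as in the tables: 0, -1, -2, ..."""
    depth = 0
    parent = HIERARCHY[name][0]
    while parent is not None:
        depth -= 1
        parent = HIERARCHY[parent][0]
    return depth


def check_speaker(tiers):
    """Check one speaker's tiers.

    `tiers` maps a tier name to (parent name or None, linguistic type).
    Returns a list of problems; tiers not listed in HIERARCHY are ignored.
    """
    problems = [f"missing required tier {t}" for t in REQUIRED if t not in tiers]
    for name, actual in tiers.items():
        expected = HIERARCHY.get(name)
        if expected is not None and actual != expected:
            problems.append(f"{name}: expected {expected}, found {actual}")
    return problems
```

Deriving the level from the parent chain, rather than storing it, keeps the table and the level numbers from drifting apart if a tier is moved.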

General Rules

  • All tiers for a given speaker are named using the tier name, the @ symbol and a short form referring to the relevant speaker, such as ref@JKW, lemma@JKW
  • Each ref annotation should have content, and that content should be unique (typically a number)
  • Each tier must have a participant, uniquely identified after the @ in the tier name; the standard for naming participants can be project-specific
  • These tiers must have a language specified in the tier attributes: orth, word, lemma, ft-(<…>), note-(<…>)
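
The naming rules above could be checked along these lines. This is a sketch under assumptions: the helper names are hypothetical, and the pattern simply requires exactly one @ with non-empty, whitespace-free text on both sides, since the speaker short form is project-specific.

```python
import re
from collections import Counter

# Tier name, "@", speaker short form; e.g. "ref@JKW" or "ft-eng@JKW".
TIER_NAME = re.compile(r"(?P<tier>[^@\s]+)@(?P<speaker>[^@\s]+)")


def split_tier_name(name):
    """Split 'lemma@JKW' into ('lemma', 'JKW'); raise ValueError otherwise."""
    m = TIER_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"not a valid tier name: {name!r}")
    return m.group("tier"), m.group("speaker")


def needs_language(name):
    """True for tiers that must have a language attribute (last rule above)."""
    tier, _ = split_tier_name(name)
    return tier in ("orth", "word", "lemma") or tier.startswith(("ft-", "note-"))


def ref_problems(ref_values):
    """Report empty and duplicated values on a ref tier."""
    counts = Counter(ref_values)
    problems = [f"duplicate ref value {v!r}" for v, n in counts.items() if v and n > 1]
    if any(not v for v in ref_values):
        problems.append("empty ref annotation")
    return problems
```

For example, split_tier_name("ft-eng@JKW") separates the tier from the speaker, and needs_language reports that this tier requires a language attribute.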

Screenshot to illustrate this visually

  • this screenshot shows what the tier hierarchy can look like for a single speaker (as implemented in the Pite Saami projects)

hierarchy screenshot (example)

ELAN Tier Template Files for Download

Template files (in ELAN .etf format):

  • For spoken corpus data
    • [Master template|ELAN/ELAN_spoken_template.etf] (including all possible annotation tiers and linguistic types)
    • [PSDP template|ELAN/ELAN_spoken_template_PSDP.etf] (including only annotation tiers and linguistic types relevant for PSDP)
    • KSDP template (including only annotation tiers and linguistic types relevant for KSDP)
    • IKDP template (including only annotation tiers and linguistic types relevant for IKDP)
  • For written corpus data
    • Master template (including all possible annotation tiers and linguistic types)
    • PSDP template (including only annotation tiers and linguistic types relevant for PSDP)
    • KSDP template (including only annotation tiers and linguistic types relevant for KSDP)
    • IKDP template (including only annotation tiers and linguistic types relevant for IKDP)

Example for session naming convention, as used for Pite Saami data

  • all sessions begin with iso-code (sje)
  • this is followed by the full date of the original recording (or transcription if no recording is available) in YYYYMMDD format
    • 0000 is used to represent the day and month if these are not known
  • if multiple sessions exist for the same date, then they are given an additional identifier directly after the date, usually a single letter starting with a and increasing alphabetically
  • to further distinguish a session, or help simplify identification of the content/source/etc, additional descriptors can be appended using a dash (-) after the above
  • only ASCII letters, numbers, dash (-) and underscore (_) are used
    • no white space characters allowed
  • Examples:
    • sje19140000a-ruong1982a
    • sje19290000a-qvigstad1929c-15-2
    • sje19440830a
    • sje20140919
    • sje20150329a
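
A session name following this convention could be validated with a regular expression. The sketch below makes several assumptions that go beyond the rules above: the language code is taken to be three lowercase letters, the same-date identifier is at most one lowercase letter, and a date is accepted only if month and day are 0000 or together form a real calendar date. The function name parse_session_name is hypothetical.

```python
import re
from datetime import date

# iso-code, YYYYMMDD, optional same-date letter, optional "-descriptor".
SESSION = re.compile(
    r"(?P<iso>[a-z]{3})"
    r"(?P<year>[0-9]{4})(?P<monthday>[0-9]{4})"
    r"(?P<suffix>[a-z]?)"
    r"(?:-(?P<descr>[A-Za-z0-9_-]+))?"
)


def parse_session_name(name):
    """Return the parts of a session name as a dict, or None if invalid."""
    m = SESSION.fullmatch(name)
    if m is None:
        return None
    parts = m.groupdict()
    monthday = parts["monthday"]
    if monthday != "0000":
        # Assumption: anything other than 0000 must be a real month and day.
        try:
            date(int(parts["year"]), int(monthday[:2]), int(monthday[2:]))
        except ValueError:
            return None
    return parts
```

Applied to the examples above, sje19290000a-qvigstad1929c-15-2 yields the descriptor qvigstad1929c-15-2, while sje20140919 has an empty same-date identifier and no descriptor.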

Unsolved questions and other things to do

  • To add orig-convention to places where it is needed
  • To come up with some solution to resolve note-tiers
  • Ref tier pattern: .0024?
  • Translation-tiers where ‘lang’ is replaced with a specific language (eng=English, deu=Deutsch; use ISO codes?); can have multiple derivations. Should we shift to Glottocodes at some point?
  • If we have a situation that for some reason a file contains a tier that just makes no sense, but we have to live with it, maybe there should be some convention to tag a linguistic type as non-standard? I added into one file now a type called: words-ipaTnst, just because the file comes from an old project and has IPA-transcriptions which are not under the ref tier. They could be put there, but I remember there were some reasons that made it very difficult to move nicely.