ELAN tiers - langdoc/FRechdoc GitHub Wiki

This page documents the ELAN tier structures used by our projects. Adherence to these structures is necessary in order to use the annotation engine scripts, which automatically tokenize the text and add morphosyntactic annotations.

ELAN Linguistic Types

Mandatory types:

Name	Stereotype	Purpose
refT	n.a.	for ref-tiers; no stereotype, independent and time-alignable root nodes
orthT	symbolic association	for orth-tiers, exact time-aligned copy of superordinate ref-tier
ft-(<…>)T	symbolic association	for ft-tiers into multiple languages (defined with <…>), overall time-aligned copy of the orth-tier (and thus ref-tier)
wordT	symbolic subdivision	for word-tiers, overall time-aligned copy of orth-tier (and thus ref-tier), but able to be divided into multiple equal parts
lemmaT	symbolic subdivision	for lemma-tiers, overall equally-spaced, time-aligned copy of the word-tier, but able to be divided into multiple equal parts
posT	symbolic subdivision	for pos-tiers, overall equally-spaced, time-aligned copy of the word-tier, but able to be divided into multiple equal parts
morphT	symbolic subdivision	for morph-tiers, overall equally-spaced, time-aligned copy of the word-tier, but able to be divided into multiple equal parts
noteT	symbolic association	directly under reference tier, contains notes which are related to the specific segment under the current reference
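
The mandatory types above can be checked mechanically. The following is only a minimal sketch, not part of the annotation engine: it assumes that linguistic types appear in the .eaf XML as LINGUISTIC_TYPE elements with a LINGUISTIC_TYPE_ID attribute and an optional CONSTRAINTS attribute holding the stereotype, which is how ELAN normally writes them. The function name check_linguistic_types and the sample document are made up for illustration, and the ft-(<…>)T types are left out because their names depend on the languages used.

```python
import xml.etree.ElementTree as ET

# Expected stereotype for each mandatory type, taken from the table above.
# None stands for "no stereotype" (independent, time-alignable root type).
MANDATORY_TYPES = {
    "refT": None,
    "orthT": "Symbolic_Association",
    "wordT": "Symbolic_Subdivision",
    "lemmaT": "Symbolic_Subdivision",
    "posT": "Symbolic_Subdivision",
    "morphT": "Symbolic_Subdivision",
    "noteT": "Symbolic_Association",
}


def check_linguistic_types(eaf_root):
    """Return a list of problems; an empty list means all mandatory types match."""
    found = {
        lt.get("LINGUISTIC_TYPE_ID"): lt.get("CONSTRAINTS")
        for lt in eaf_root.iter("LINGUISTIC_TYPE")
    }
    problems = []
    for type_id, expected in MANDATORY_TYPES.items():
        if type_id not in found:
            problems.append(f"{type_id}: missing")
        elif found[type_id] != expected:
            problems.append(
                f"{type_id}: expected stereotype {expected}, found {found[type_id]}"
            )
    return problems


# Hypothetical, heavily shortened sample document (not a complete .eaf file).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="refT" TIME_ALIGNABLE="true"/>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="orthT" TIME_ALIGNABLE="false"
                   CONSTRAINTS="Symbolic_Association"/>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="wordT" TIME_ALIGNABLE="false"
                   CONSTRAINTS="Symbolic_Association"/>
</ANNOTATION_DOCUMENT>"""

type_problems = check_linguistic_types(ET.fromstring(SAMPLE_EAF))
for problem in type_problems:
    print(problem)
```

For a real file, ET.parse(path).getroot() can be passed in place of the parsed sample string.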

Additional types can be used for project specific uses. Examples for additional types are:

Name	Stereotype	Purpose
orth-origT	symbolic association
ft-(<…>)-origT	symbolic association
word-goldT	symbolic subdivision
lemma-goldT	symbolic subdivision
pos-goldT	symbolic subdivision
morph-goldT	symbolic subdivision

Definitions and specifications are still unresolved for the following types:

Name	Stereotype	Purpose	Note
langT	symbolic ??	for lang-tier; this indicates the language(s) being used in the corresponding syntactic unit	Ideally, this could trigger which set of lang-tech tools to use, but the reality, especially for spoken, endangered languages, is that it is really messy because switching can and often does take place at a variety of syntactic levels (from morpheme-level such as nonce-borrowings with L1-morphology, to phrases, to utterances, to entire stretches of discourse). On the other hand, lang could also probably be populated automatically. Ultimately it will remain somewhat rough, as real language tagging in our data depends on our approach to code-switching and code-mixing annotation models
synthT	symbolic ??	for synth-tiers, overall equally-spaced, time-aligned copy of the word-tier, but able to be divided into multiple equal parts

There are some types which are present in some files, but should be removed once a better solution is found:

Name	Stereotype	Purpose	Note
note(part)T	symbolic association	to contain notes for parts	as parts are probably automatically detected to some degree, there is little reason to have manual notes under them this way
note(word)T	symbolic association	to contain word level notes	as tokenization is done automatically, it is not possible to store manual notes directly under a word

ELAN Tiers and Tier Hierarchy

Minimally required for each speaker:

Level	Name	Parent Tier	Linguistic Type	Purpose
0	ref	n.a.	refT	root node, time-aligned annotation units, each provided with a unique number here
-1	orth	ref	orthT	orthographic transcription; this provides the input for the annotation engine

Optional for each speaker:

Level	Name	Parent Tier	Linguistic Type	Purpose
-2	ft-(<…>)	orth	ftT	free translation of the annotated text; <…> is replaced with a language code (e.g. eng, rus); can occur multiple times, once for each lingua franca
-2	lang	orth	langT	indicates the language being used in an annotated utterance or part of an annotated utterance; the language name is in English; adheres to 'languages'-list of controlled vocabulary; this has serious practical problems described in the notes on the corresponding type above (langT)

Optional for any tier or as a root node with its own time-alignment:

Level	Name	Parent Tier	Linguistic Type	Purpose
*	note-(<…>)	(<…>)	noteT	provides unstructured text-based notes for any given parent tier specified with <…>

The following tiers are created automatically by the annotation engine, and any existing annotations on them are overwritten each time the engine runs. Therefore, these tiers cannot have other dependent tiers.

Level	Name	Parent Tier	Linguistic Type	Purpose
-2	word	orth	wordT	preprocessed-tokenized version of the orth-tier; automatically created by the annotation engine
-3	lemma	word	lemmaT	lemma (or lemmata in case of ambiguities) for word form listed in parent tier; automatically created by the annotation engine
-3	morph	word	morphT	morphological category (or categories in case of ambiguities) for word form listed in parent tier; automatically created by the annotation engine
-3	pos	word	posT	part of speech (or parts of speech in case of ambiguities) for word form listed in parent tier; automatically created by the annotation engine
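
The parent relations in the tables above, together with the rule that engine-generated tiers cannot carry other dependent tiers, could be verified along these lines. This is a rough sketch under the assumption that each TIER element in the .eaf XML carries TIER_ID and (for dependent tiers) PARENT_REF attributes. The helper names and the sample tier note-word@JKW are invented for the example.

```python
import xml.etree.ElementTree as ET

# Expected parent for each core tier (speaker suffix stripped), per the tables above.
EXPECTED_PARENT = {
    "ref": None,
    "orth": "ref",
    "word": "orth",
    "lemma": "word",
    "morph": "word",
    "pos": "word",
}
# Tiers that the annotation engine recreates on every run.
ENGINE_TIERS = {"word", "lemma", "morph", "pos"}


def base_name(tier_id):
    """Strip the @speaker suffix; returns None if there is no tier id."""
    return tier_id.split("@", 1)[0] if tier_id else None


def check_hierarchy(eaf_root):
    """Return a list of problems found in the tier hierarchy."""
    problems = []
    for tier in eaf_root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        name = base_name(tier_id)
        parent = base_name(tier.get("PARENT_REF"))
        if name in EXPECTED_PARENT and parent != EXPECTED_PARENT[name]:
            problems.append(
                f"{tier_id}: parent should be {EXPECTED_PARENT[name]}, found {parent}"
            )
        # Manually created tiers under engine tiers would be lost on the next run.
        if parent in ENGINE_TIERS and name not in ENGINE_TIERS:
            problems.append(f"{tier_id}: depends on engine-generated tier {parent}")
    return problems


# Hypothetical sample; annotations are omitted for brevity.
SAMPLE_TIERS = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="ref@JKW" LINGUISTIC_TYPE_REF="refT"/>
  <TIER TIER_ID="orth@JKW" LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@JKW"/>
  <TIER TIER_ID="word@JKW" LINGUISTIC_TYPE_REF="wordT" PARENT_REF="orth@JKW"/>
  <TIER TIER_ID="pos@JKW" LINGUISTIC_TYPE_REF="posT" PARENT_REF="word@JKW"/>
  <TIER TIER_ID="note-word@JKW" LINGUISTIC_TYPE_REF="noteT" PARENT_REF="word@JKW"/>
</ANNOTATION_DOCUMENT>"""

hierarchy_problems = check_hierarchy(ET.fromstring(SAMPLE_TIERS))
for problem in hierarchy_problems:
    print(problem)
```

The sketch does not check that parent and child belong to the same speaker; that could be added by comparing the parts after the @.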

General Rules

All tiers for a given speaker are named using the tier name plus the @ symbol plus a short form referring to the relevant speaker, such as ref@JKW, lemma@JKW
Each ref annotation should have content, and this content must be unique (typically a number)
Each tier must have a participant, uniquely identified after the @ in the tier name; the standard for naming participants can be project-specific
These tiers must have a language specified in their tier attributes: orth, word, lemma, ft-(<…>), note-(<…>)
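
Some of these rules lend themselves to an automatic check. The sketch below is one possible reading of them, not an existing project script: it assumes the speaker short form after the @ should equal the tier's PARTICIPANT attribute, and that ref annotation values sit in ANNOTATION_VALUE elements inside the tier. The language rule is not covered, and the function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_general_rules(eaf_root):
    """Return a list of violations of the naming, participant and ref rules."""
    problems = []
    for tier in eaf_root.iter("TIER"):
        tier_id = tier.get("TIER_ID", "")
        name, sep, speaker = tier_id.rpartition("@")
        if not sep or not name or not speaker:
            problems.append(f"{tier_id!r}: name is not of the form tier@speaker")
            continue
        if tier.get("PARTICIPANT") != speaker:
            problems.append(f"{tier_id!r}: participant does not match {speaker!r}")
        if name == "ref":
            values = [(v.text or "").strip() for v in tier.iter("ANNOTATION_VALUE")]
            if any(not v for v in values):
                problems.append(f"{tier_id!r}: empty ref annotation")
            dupes = sorted(v for v, n in Counter(values).items() if v and n > 1)
            if dupes:
                problems.append(f"{tier_id!r}: duplicate ref values {dupes}")
    return problems


# Hypothetical sample with a duplicated ref value and a tier lacking @speaker.
SAMPLE_RULES = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="ref@JKW" PARTICIPANT="JKW" LINGUISTIC_TYPE_REF="refT">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>0001</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>0001</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="orth" PARTICIPANT="JKW" LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@JKW"/>
</ANNOTATION_DOCUMENT>"""

rule_problems = check_general_rules(ET.fromstring(SAMPLE_RULES))
for problem in rule_problems:
    print(problem)
```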

Screenshot to illustrate this visually

This screenshot shows what the tier hierarchy can look like for a single speaker (as implemented in the Pite Saami projects).

hierarchy screenshot (example)

ELAN Tier Template Files for Download

Template files (in ELAN .etf format):

For spoken corpus data
- [Master template|ELAN/ELAN_spoken_template.etf] (including all possible annotation tiers and linguistic types)
- [PSDP template|ELAN/ELAN_spoken_template_PSDP.etf] (including only annotation tiers and linguistic types relevant for PSDP)
- KSDP template (including only annotation tiers and linguistic types relevant for KSDP)
- IKDP template (including only annotation tiers and linguistic types relevant for IKDP)
For written corpus data
- Master template (including all possible annotation tiers and linguistic types)
- PSDP template (including only annotation tiers and linguistic types relevant for PSDP)
- KSDP template (including only annotation tiers and linguistic types relevant for KSDP)
- IKDP template (including only annotation tiers and linguistic types relevant for IKDP)

Example for session naming convention, as used for Pite Saami data

all sessions begin with iso-code (sje)
this is followed by the full date of the original recording (or transcription if no recording is available) in YYYYMMDD format
- 0000 is used to represent the day and month if these are not known
if multiple sessions exist for the same date, they are given an additional identifier directly after the date, usually a single letter starting with a and increasing alphabetically
to further distinguish a session, or help simplify identification of the content/source/etc, additional descriptors can be appended using a dash (-) after the above
only ASCII letters, numbers, dash (-) and underscore (_) are used
- no white space characters allowed
Examples:
- sje19140000a-ruong1982a
- sje19290000a-qvigstad1929c-15-2
- sje19440830a
- sje20140919
- sje20150329a
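
The convention above can be expressed as a regular expression. This is a minimal sketch rather than an official validator. It assumes that the iso-code is three lowercase letters, that the per-date identifier is a single lowercase letter, and that a date with 0000 for month and day is accepted without further checking. The names SESSION_RE and parse_session_name are made up here.

```python
import re
from datetime import datetime

SESSION_RE = re.compile(
    r"(?P<iso>[a-z]{3})"                      # language code, e.g. sje
    r"(?P<year>[0-9]{4})"                     # YYYY
    r"(?P<monthday>[0-9]{4})"                 # MMDD, or 0000 if unknown
    r"(?P<letter>[a-z]?)"                     # optional same-date identifier
    r"(?:-(?P<descriptors>[A-Za-z0-9_-]+))?"  # optional descriptors after a dash
)


def parse_session_name(name, iso="sje"):
    """Return the parts of a session name as a dict, or None if it is invalid."""
    match = SESSION_RE.fullmatch(name)
    if match is None or match["iso"] != iso:
        return None
    parts = match.groupdict()
    if parts["monthday"] != "0000":
        try:
            # Reject impossible dates such as 20140231.
            datetime.strptime(parts["year"] + parts["monthday"], "%Y%m%d")
        except ValueError:
            return None
    return parts


EXAMPLES = [
    "sje19140000a-ruong1982a",
    "sje19290000a-qvigstad1929c-15-2",
    "sje19440830a",
    "sje20140919",
    "sje20150329a",
]
for example in EXAMPLES:
    print(example, parse_session_name(example) is not None)
```

Using fullmatch rather than match means that white space or any other disallowed character anywhere in the name causes the whole name to be rejected.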

Unsolved questions and other things to do

To add orig-convention to places where it is needed
To come up with some solution to resolve note-tiers
Ref tier pattern: .0024?
Translation tiers where ‘lang’ is replaced with a specific language (eng = English, deu = Deutsch; use ISO codes?); can have multiple derivations. Should we shift to Glottocodes at some point?
If a file contains a tier that does not fit these conventions but that we have to live with, maybe there should be a convention for tagging its linguistic type as non-standard? I have now added a type called words-ipaTnst to one file, simply because the file comes from an old project and has IPA transcriptions which are not under the ref tier. They could be moved there, but I remember there being reasons that made this very difficult to do cleanly.