Metadata - langdoc/FRechdoc GitHub Wiki

This page documents the metadata categories and subcategories, and the labels we use for them, in the Freiburg-Tromsø Speech Corpora.

Within the project we collect different kinds of metadata. Not all of them can be made public, for ethical and legal reasons. Here we document only the metadata categories relevant for the corpora published through Korp. The main metadata categories describe:

  • Actors (e.g. a recorded speaker, author, translator or annotator)
  • Sessions (e.g. an annotated recording or an annotated written text)
  • Texts (e.g. modality or genre)

All publicly available metadata is stored in IMDI format, in files separate from the [ELAN|ELAN.html] annotations, on the Session node in the TLA. A script (which does not yet exist) will convert IMDI into a structure that can be read into the Korp interface.

Actors

  • Speakers (e.g. informants/consultants recorded and transcribed or authors/translators of written text included in the corpora)
  • Annotators (e.g. PIs or assistants transcribing, translating or otherwise annotating recordings or written text included in the corpora)
    • Annotator activity could also be tracked through Git, which records who has actually changed what

Sessions

  • Actors
  • Date
  • Equipment
  • Media
  • Place
  • Project
  • Languages

Texts

  • Actors
  • Date
  • Language(s)

Modality

As a label for this category we use Modality, by which we mean the way in which signs are transmitted by a sender. This category has two values:

  • oral (e.g. speech that we have recorded on audio or audio+video and transcribed, or speech that is transcribed but has no audio available, either because the recording is lost or because the speech was transcribed without being recorded)
  • written (e.g. handwritten or printed texts, texts published online)

Other potential values (not relevant for our projects) are:

  • gestured
  • signed

Note that the kind of perception by a receiver is not relevant for our metadata categories (a written text can be received orally if we use text-to-speech, etc.). Nor does Modality in our sense refer to the actual medium (paper, video, etc.).

Language

Three-letter code in accordance with ISO 639-3. Question: could we shift to Glottolog at some point? The main problem at the moment seems to be that the Glottocodes are very hard to remember.

Genre

  • poetry
  • fiction
  • ritual
  • advertisement
  • biography
  • fairy tale
  • facta
  • idiom
  • narrative
  • teaching
  • story

Register

  • formal
  • informal
  • neutral

Medium

Other conventions

Note that the file names we use already include some metadata. For instance:

  • sms19610000lagercrantz318
  • sjd20150609aaa-sport

where the first three letters (sms or sjd), in accordance with ISO 639-3, always mark the language (or main language) of a given session, and the following eight digits (19610000 or 20150609) always mark the date of the session in the format YYYYMMDD. If the exact date is unknown or cannot be specified (e.g. in a book publication where only the year is given), we fill the unknown positions with the digit 0.
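The naming convention above can be read mechanically. The following is a minimal sketch in Python, not an existing project script: the names `parse_session_name` and `SessionName` are hypothetical, and two assumptions go beyond what this page states, namely that a zero-filled date part should be reported as unknown (`None`), and that everything after the eight digits is a free-form suffix whose meaning is not documented here.

```python
import re
from typing import NamedTuple, Optional

# Three lowercase letters (ISO 639-3 code), eight digits (YYYYMMDD),
# then an arbitrary suffix whose structure this page does not document.
SESSION_NAME_RE = re.compile(
    r"^(?P<lang>[a-z]{3})"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"(?P<rest>.*)$"
)


class SessionName(NamedTuple):
    """Metadata recovered from a session file name (hypothetical type)."""
    language: str
    year: Optional[int]
    month: Optional[int]
    day: Optional[int]
    rest: str


def parse_session_name(name: str) -> SessionName:
    """Split a session file name into language code, date parts and suffix."""
    match = SESSION_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid session file name: {name!r}")

    def date_part(key: str) -> Optional[int]:
        # Assumption: a part consisting only of zeros means "unknown".
        value = int(match.group(key))
        return value if value != 0 else None

    return SessionName(
        language=match.group("lang"),
        year=date_part("year"),
        month=date_part("month"),
        day=date_part("day"),
        rest=match.group("rest"),
    )


if __name__ == "__main__":
    for example in ("sms19610000lagercrantz318", "sjd20150609aaa-sport"):
        print(parse_session_name(example))
```

The sketch only checks the shape of the name. It does not verify that the three letters are a registered ISO 639-3 code or that the digits form a valid calendar date.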

See also