Column based text format - redewiedergabe/corpus GitHub Wiki

General remarks

Corpus metadata is available in the file rw_corpus_metadata (TSV or Excel format).

The corpus consists of separate UTF-8 encoded files with the file extension "tsv" (tab-separated values). Each file contains one sample in a column-based format. The columns are separated by tab characters, and each line corresponds to one token of the sample. (Tokenization was performed with CAB, available via Deutsches Textarchiv.)

In addition to the annotations added by the Redewiedergabe project, the files also contain morpho-syntactic annotation produced by automatic tools that were not developed by the Redewiedergabe project.

For a more detailed explanation of the annotation structure see Annotation structure.

References:

CAB ("Cascaded Analysis Broker" for error-tolerant linguistic analysis): Jurish, B. Finite-state Canonicalization Techniques for Historical German. PhD thesis, Universität Potsdam, 2012 (defended 2011). URN urn:nbn:de:kobv:517-opus-55789. Documentation
RF-Tagger: Helmut Schmid and Florian Laws: Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging, COLING 2008, Manchester, Great Britain. Documentation

Columns

Column	Description	Type
tok	token (tokenization by CAB)	surface
normtok	orthographically normalized token (provided by CAB)	NLP information
lemma	lemma (provided by CAB)	NLP information
pos	morphological information (provided by CAB)	NLP information
rfpos	morphological information (provided by RF-Tagger)	NLP information
sentstart	whether the token appears at the beginning of a sentence; values: yes/no (sentence splitting by CAB)	NLP information
stwr	main STWR annotation	STWR annotation
frame	frame of the STWR	STWR annotation
speaker	speaker/source of the STWR	STWR annotation
intexpr	word/s introducing the STWR	STWR annotation
note	information whether the words are part of a footnote; values: note/-	structural annotation
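
The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the corpus tooling: it assumes that each file starts with a header line carrying the column names listed in the table, and the function name read_tokens, the two sample rows and the use of "-" for empty cells outside the note column are made up for illustration.

```python
import csv
import io

COLUMNS = ["tok", "normtok", "lemma", "pos", "rfpos", "sentstart",
           "stwr", "frame", "speaker", "intexpr", "note"]


def read_tokens(handle):
    """Yield one dict per token line from an open corpus file."""
    # QUOTE_NONE: quotation marks are ordinary tokens, not CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)  # assumption: first line holds the column names
    for row in reader:
        if row:  # skip empty lines
            yield dict(zip(header, row))


# Made-up sample for illustration only (not taken from the corpus).
rows = [
    COLUMNS,
    ["Er", "Er", "er", "-", "-", "yes", "-", "-", "speaker.3", "-", "-"],
    ['"', '"', '"', "-", "-", "no", "direct.speech.3", "-", "-", "-", "-"],
]
sample = "\n".join("\t".join(r) for r in rows) + "\n"

tokens = list(read_tokens(io.StringIO(sample)))
print(tokens[0]["tok"], tokens[1]["stwr"])
```

For a real file, the same function can be called on open(path, encoding="utf-8", newline=""). If the files turn out to have no header line, COLUMNS can be used as the key list instead of the first row.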

Structure of the main STWR annotation

The STWR annotation can have many attributes and can be nested. Because of this, the column stwr contains complex values.

The character | (vertical line) is used as a separator if a token has several annotations at different levels (nested STWR). The first annotation can be interpreted as level=1, the second annotation as level=2, etc. (The maximum nesting depth is 5 in the beta release.)
The character . (dot) is used to separate the attributes of a single STWR annotation. The following order is used: Type, Medium, ID, Nonfact, Border, Prag, Metaph. The first three attributes (Type, Medium, ID) are obligatory; the other four may be omitted if they are not relevant.
The attribute border is further specified by a value after the character = (equals sign) (e.g. border=state)
Alternative values are separated by the character _ (underscore). This can happen for the attributes type (e.g. indirect_freeIndirect) and medium (e.g. speech_thought).

Example:

direct.speech_writing.3|reported.thought.6.nonfact.border=state

This token has two overlapping annotations:

At level 1: direct STWR with ID=3; the medium is ambiguous between speech and writing
At level 2: reported thought with ID=6 and attributes "nonfact" and "border" with the value "state".
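
The rules above can be turned into a small parser. The following is a sketch under assumptions, not project code: the names Stwr and parse_stwr are hypothetical, IDs are kept as strings, and an empty cell or "-" is assumed to mark a token without STWR annotation (the source documents "-" only for the note column).

```python
from dataclasses import dataclass, field


@dataclass
class Stwr:
    level: int    # 1 = first annotation in the cell, 2 = second, ...
    types: list   # more than one entry means the type is ambiguous
    media: list   # more than one entry means the medium is ambiguous
    id: str
    attrs: dict = field(default_factory=dict)


def parse_stwr(value):
    """Parse one stwr cell into a list of annotations, level 1 first."""
    if value in ("", "-"):  # assumption: marker for "no annotation"
        return []
    result = []
    for level, part in enumerate(value.split("|"), start=1):
        # The first three fields are obligatory; fewer raises ValueError.
        typ, medium, ident, *rest = part.split(".")
        attrs = {}
        for item in rest:
            # "border=state" -> {"border": "state"}, "nonfact" -> {"nonfact": True}
            name, _, val = item.partition("=")
            attrs[name] = val or True
        result.append(
            Stwr(level, typ.split("_"), medium.split("_"), ident, attrs)
        )
    return result


example = "direct.speech_writing.3|reported.thought.6.nonfact.border=state"
for annotation in parse_stwr(example):
    print(annotation)
```

Splitting first on | and only then on . mirrors the nesting described above, so each level is parsed independently of the others.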

Structure of the annotations frame, speaker and intexpr

These annotations each consist of a name (frame, speaker, intexpr), the character . (dot), and the corresponding ID.

Speaker can have multiple IDs, which are separated by the character _ (underscore).

Example:

speaker.12_19

This token is annotated as speaker with two IDs: 12 and 19 (i.e. this speaker is associated with the STWR annotation with ID=12 as well as the STWR annotation with ID=19).
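
Values of this shape can be split in the same way. This is a minimal sketch: the function name parse_ref is hypothetical, IDs are kept as strings, and treating an empty cell or "-" as "no annotation" is an assumption, as is the cell holding at most one such value.

```python
def parse_ref(value):
    """Split a frame/speaker/intexpr cell into (name, list of IDs)."""
    if value in ("", "-"):  # assumption: marker for "no annotation"
        return None
    name, _, ids = value.partition(".")
    return name, ids.split("_") if ids else []


print(parse_ref("speaker.12_19"))
```

The returned IDs can then be matched against the ID field of the STWR annotations to link a speaker, frame or introductory expression to its STWR instance.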