The "Redewiedergabe" corpus

Project

The "Redewiedergabe" corpus is being created by the DFG-funded project "Redewiedergabe. Eine literatur- und sprachwissenschaftliche Korpusanalyse", a cooperation between the Leibniz-Institut für Deutsche Sprache, Mannheim (Abteilung Lexik) and the Universität Würzburg (Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte).

Project members: Annelen Brunner (IDS Mannheim), Stefan Engelberg (IDS Mannheim), Fotis Jannidis (Universität Würzburg), Ngoc Duyen Tanja Tu (IDS Mannheim), Lukas Weimer (Universität Würzburg).

In addition, the following people participated in the annotation: Sarah Gorke, Anna Hartmann, Janne Lorenzen, Christoph Peterek, Laura Schäfer, Lisa Sergel and Theresa Valta.

Project homepage: www.redewiedergabe.de

License

The "Redewiedergabe" corpus is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Please credit the project "Redewiedergabe" for the annotation, and the project TextGrid, Deutsches Textarchiv, Leibniz-Institut für Deutsche Sprache and Universitätsbibliothek Bremen for the texts.

Text sources

The "Redewiedergabe" corpus is a historical corpus of fictional and non-fictional texts. The texts were published between 1840 and 1920 and were compiled from the following three sources:

  • Narrative texts from the "Digitale Bibliothek", converted to TEI format by the project TextGrid
  • Texts from the magazine "Die Grenzboten", digitized by Universitätsbibliothek Bremen (Source: Die Grenzboten: Zeitschrift für Politik, Literatur und Kunst. Berlin: Dt. Verl, 1841-1922. Staats- und Universitätsbibliothek Bremen, Ac 7155 Public Domain Mark 1.0), TEI structuring by Deutsches Textarchiv and OCR correction by project "Redewiedergabe".
  • Texts from the "Mannheimer Korpus Historischer Zeitungen und Zeitschriften" (MKHZ, Mannheim corpus of historical newspapers and magazines), collected by the Leibniz-Institut für Deutsche Sprache and converted by Deutsches Textarchiv.

The corpus consists of text samples rather than complete texts. Each sample is at least 500 tokens long for texts from the Digitale Bibliothek and at least 200 tokens long for newspaper/magazine texts. The samples were drawn randomly from the available material, subject to two additional rules: for the texts from the Digitale Bibliothek, material by each author was considered evenly within a decade; similarly, for the texts from the MKHZ, the different newspapers/magazines were considered evenly. This prevented authors or newspapers with little material from dropping out entirely during the sampling process.
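The project's actual sampling code is not described here, so the following is only a minimal sketch of one way to draw samples evenly across groups. The function name `sample_evenly`, the `(group, tokens)` input layout and the round-robin strategy are assumptions made for illustration. A group stands for an author within one decade or for a newspaper/magazine title. For simplicity the sketch cuts excerpts of exactly `min_tokens` tokens, whereas the corpus samples are at least that long, and it does not prevent two excerpts from the same text from overlapping.

```python
import random
from collections import defaultdict


def sample_evenly(texts, n_samples, min_tokens, seed=0):
    """Draw excerpts so that every group is visited in turn.

    texts: list of (group, tokens) pairs (hypothetical layout), where
           group is e.g. an author within a decade or a newspaper title.
    Returns a list of (group, excerpt) pairs.
    """
    rng = random.Random(seed)

    # Keep only texts long enough to yield a full-length excerpt.
    by_group = defaultdict(list)
    for group, tokens in texts:
        if len(tokens) >= min_tokens:
            by_group[group].append(tokens)

    # Visit groups round-robin in random order, so a group with little
    # material is drawn from as often as a group with a lot of material.
    groups = sorted(by_group)
    rng.shuffle(groups)

    samples = []
    i = 0
    while len(samples) < n_samples and groups:
        group = groups[i % len(groups)]
        tokens = rng.choice(by_group[group])
        start = rng.randrange(len(tokens) - min_tokens + 1)
        samples.append((group, tokens[start:start + min_tokens]))
        i += 1
    return samples
```

Calling `sample_evenly(texts, n_samples=4, min_tokens=500)` on material from two sufficiently long groups returns two excerpts per group, regardless of how much text each group contributes.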

Each sample contains metadata with information about the publication date, text type, fictionality status and, if available, author and title (more information: Metadata).

Annotation

The corpus contains detailed annotation of instances of speech, thought and writing representation (STWR). We distinguish four main types: direct STWR (Er sagte: "Ich bin hungrig." [He said: "I am hungry."]), indirect STWR (Er sagte, er sei hungrig. [He said he was hungry.]), free indirect STWR (Wo sollte er jetzt etwas zu Essen herbekommen? [Where was he supposed to get something to eat now?]) and reported STWR (Er sprach über Restaurants. [He talked about restaurants.]), as well as the three main media: speech, thought and writing. In addition, we annotate attributes such as embedding level, non-factual STWR, borderline cases, and pragmatic and metaphoric use, as well as frames, introductory expressions and speakers.

Each corpus sample was annotated independently by two different people. The final annotation was created by a third person on the basis of those annotations.

The detailed annotation guidelines are available at redewiedergabe.de/richtlinien/richtlinien.html (in German).

An overview of the structure of the annotations is available at Annotation structure.

Size

At the moment, the beta release of the corpus is available. At the end of the project (spring 2020), the corpus will be extended and additional annotated material will be made available.

The beta release includes 619 samples and 360,974 tokens. 9,451 STWR instances have been annotated, as well as additional information like frames, introductory expressions and speakers.

Detailed statistical data

Format

The corpus is available in two different formats:
