Beta release statistics - redewiedergabe/corpus GitHub Wiki

The following statistics describe the beta release of the "Redewiedergabe" corpus. Tokenization was performed with CAB, which is available via the Deutsches Textarchiv (see also Column-based text format).

Samples and tokens

In total, the corpus contains 619 samples and 360,974 tokens.

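| Decade | fictional (samples) | fictional (tokens) | non-fictional (samples) | non-fictional (tokens) | all samples | all tokens |
|--------|---------------------|--------------------|-------------------------|------------------------|-------------|------------|
| 1840   | 36                  | 22,506             | 41                      | 22,647                 | 77          | 45,153     |
| 1850   | 37                  | 22,530             | 42                      | 22,831                 | 79          | 45,361     |
| 1860   | 37                  | 23,115             | 38                      | 22,239                 | 75          | 45,354     |
| 1870   | 37                  | 22,359             | 39                      | 22,431                 | 76          | 44,790     |
| 1880   | 35                  | 22,159             | 42                      | 22,389                 | 77          | 44,548     |
| 1890   | 36                  | 22,633             | 42                      | 22,755                 | 78          | 45,388     |
| 1900   | 37                  | 22,528             | 41                      | 22,820                 | 78          | 45,348     |
| 1910   | 37                  | 23,064             | 42                      | 21,968                 | 79          | 45,032     |
| total  | 292                 | 180,894            | 327                     | 180,080                | 619         | 360,974    |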
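
As a quick consistency check, the totals can be recomputed from the per-decade rows. The following is a minimal sketch in Python; the figures are copied from the table above, while the variable names (`rows`, `fict_samples`, and so on) are illustrative and not part of the corpus tooling.

```python
# Per-decade figures copied from the table above.
# decade -> (fictional samples, fictional tokens,
#            non-fictional samples, non-fictional tokens)
rows = {
    1840: (36, 22506, 41, 22647),
    1850: (37, 22530, 42, 22831),
    1860: (37, 23115, 38, 22239),
    1870: (37, 22359, 39, 22431),
    1880: (35, 22159, 42, 22389),
    1890: (36, 22633, 42, 22755),
    1900: (37, 22528, 41, 22820),
    1910: (37, 23064, 42, 21968),
}

# Sum each column over all decades.
fict_samples = sum(r[0] for r in rows.values())
fict_tokens = sum(r[1] for r in rows.values())
nonfict_samples = sum(r[2] for r in rows.values())
nonfict_tokens = sum(r[3] for r in rows.values())

# Corpus-wide totals combine the fictional and non-fictional parts.
total_samples = fict_samples + nonfict_samples
total_tokens = fict_tokens + nonfict_tokens

print(f"samples: {total_samples}, tokens: {total_tokens:,}")
```

Run as written, the sums reproduce the "total" row, including the 619 samples and 360,974 tokens stated at the top of this section.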

STWR instances

The following tables list the number of annotated STWR (speech, thought and writing representation) instances in the corpus. Instances vary greatly in length, from a single token (possible for the STWR types direct and reported) to several sentences (possible for the STWR types direct, freeIndirect and indirect/freeIndirect).

STWR types

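| Type                   | Number | Percent |
|------------------------|--------|---------|
| direct                 | 2929   | 31.1%   |
| indirect               | 2077   | 22.1%   |
| free indirect          | 97     | 1.0%    |
| reported               | 4204   | 44.6%   |
| indirect/free indirect | 112    | 1.2%    |
| total                  | 9419   |         |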

STWR medium

| Medium  | Number | Percent |
|---------|--------|---------|
| speech  | 6003   | 63.7%   |
| thought | 2241   | 23.8%   |
| writing | 873    | 9.3%    |
| ambig   | 302    | 3.2%    |
| total   | 9419   |         |
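
The percentages in the two STWR tables are each count's share of the 9419 annotated instances, rounded to one decimal place. A minimal Python sketch of that calculation follows; the counts are taken from the tables above, and the helper name `shares` is hypothetical rather than part of any corpus script.

```python
def shares(counts):
    """Return each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Counts copied from the "STWR types" table.
stwr_types = {
    "direct": 2929,
    "indirect": 2077,
    "free indirect": 97,
    "reported": 4204,
    "indirect/free indirect": 112,
}

# Counts copied from the "STWR medium" table.
stwr_medium = {
    "speech": 6003,
    "thought": 2241,
    "writing": 873,
    "ambig": 302,
}

type_shares = shares(stwr_types)
medium_shares = shares(stwr_medium)

for label, pct in {**type_shares, **medium_shares}.items():
    print(f"{label}: {pct}%")
```

Both dictionaries sum to the same total of 9419, since every instance is annotated with one type and one medium.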