Beta release statistics
The following statistics describe the beta release of the "Redewiedergabe" corpus. Tokenization was performed with CAB, which is available via the Deutsches Textarchiv (see also Column-based text format).
Samples and tokens
In total, the corpus contains 619 samples and 360,974 tokens.
Decade | fictional (samples) | fictional (tokens) | non-fictional (samples) | non-fictional (tokens) | all samples | all tokens |
---|---|---|---|---|---|---|
1840 | 36 | 22,506 | 41 | 22,647 | 77 | 45,153 |
1850 | 37 | 22,530 | 42 | 22,831 | 79 | 45,361 |
1860 | 37 | 23,115 | 38 | 22,239 | 75 | 45,354 |
1870 | 37 | 22,359 | 39 | 22,431 | 76 | 44,790 |
1880 | 35 | 22,159 | 42 | 22,389 | 77 | 44,548 |
1890 | 36 | 22,633 | 42 | 22,755 | 78 | 45,388 |
1900 | 37 | 22,528 | 41 | 22,820 | 78 | 45,348 |
1910 | 37 | 23,064 | 42 | 21,968 | 79 | 45,032 |
total | 292 | 180,894 | 327 | 180,080 | 619 | 360,974 |
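The totals row can be reproduced from the per-decade rows. The following is a minimal sketch, assuming the counts are copied by hand from the table above; the names `rows` and `column_totals` are illustrative and not part of the corpus tooling.

```python
# Per-decade counts copied from the table above.
# decade -> (fictional samples, fictional tokens,
#            non-fictional samples, non-fictional tokens)
rows = {
    1840: (36, 22506, 41, 22647),
    1850: (37, 22530, 42, 22831),
    1860: (37, 23115, 38, 22239),
    1870: (37, 22359, 39, 22431),
    1880: (35, 22159, 42, 22389),
    1890: (36, 22633, 42, 22755),
    1900: (37, 22528, 41, 22820),
    1910: (37, 23064, 42, 21968),
}


def column_totals(rows):
    """Sum each of the four columns over all decades."""
    return tuple(sum(col) for col in zip(*rows.values()))


fic_samples, fic_tokens, nonfic_samples, nonfic_tokens = column_totals(rows)
all_samples = fic_samples + nonfic_samples
all_tokens = fic_tokens + nonfic_tokens

print(fic_samples, fic_tokens, nonfic_samples, nonfic_tokens)  # 292 180894 327 180080
print(all_samples, all_tokens)  # 619 360974
```

The sums match the totals row, so the fictional and non-fictional parts are nearly balanced in tokens (180,894 versus 180,080), although the non-fictional part has more samples.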
STWR instances
The following tables list the number of annotated STWR (speech, thought and writing representation) instances in the corpus. These instances vary greatly in length, from a single token (possible for the STWR types direct and reported) to several sentences (possible for the STWR types direct, freeIndirect and indirect/freeIndirect).
STWR types
Type | Number | Percent |
---|---|---|
direct | 2929 | 31.1% |
indirect | 2077 | 22.1% |
free indirect | 97 | 1.0% |
reported | 4204 | 44.6% |
indirect/free indirect | 112 | 1.2% |
total | 9419 | 100.0% |
STWR medium
Medium | Number | Percent |
---|---|---|
speech | 6003 | 63.7% |
thought | 2241 | 23.8% |
writing | 873 | 9.3% |
ambig | 302 | 3.2% |
total | 9419 | 100.0% |
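The Percent columns in the two tables above appear to be each category's share of the 9,419 annotated instances, rounded to one decimal place. The following is a minimal sketch of that calculation, assuming the counts are copied by hand from the tables; the helper `shares` is a hypothetical name, not part of the corpus tooling.

```python
# Instance counts copied from the "STWR types" and "STWR medium" tables.
type_counts = {
    "direct": 2929,
    "indirect": 2077,
    "free indirect": 97,
    "reported": 4204,
    "indirect/free indirect": 112,
}
medium_counts = {
    "speech": 6003,
    "thought": 2241,
    "writing": 873,
    "ambig": 302,
}


def shares(counts):
    """Return each category's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


for name, counts in (("type", type_counts), ("medium", medium_counts)):
    print(name, sum(counts.values()))
    for label, pct in shares(counts).items():
        print(f"  {label}: {pct}%")
```

Both dictionaries sum to 9,419, and the computed shares agree with the Percent columns as printed.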