Beta release statistics - redewiedergabe/corpus GitHub Wiki

The following statistics describe the beta release of the "Redewiedergabe" corpus. Tokenization was performed with CAB, which is available via the Deutsches Textarchiv (see also Column-based text format).

Samples and tokens

In total, the corpus contains 619 samples and 360,974 tokens.

Decade	fictional (samples)	fictional (tokens)	non-fictional (samples)	non-fictional (tokens)	all samples	all tokens
1840	36	22,506	41	22,647	77	45,153
1850	37	22,530	42	22,831	79	45,361
1860	37	23,115	38	22,239	75	45,354
1870	37	22,359	39	22,431	76	44,790
1880	35	22,159	42	22,389	77	44,548
1890	36	22,633	42	22,755	78	45,388
1900	37	22,528	41	22,820	78	45,348
1910	37	23,064	42	21,968	79	45,032
total	292	180,894	327	180,080	619	360,974
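The totals row can be rechecked from the per-decade figures. The following is a minimal Python sketch, not part of the corpus tooling: the names `DECADES`, `column_totals` and `combined` are made up for illustration, and the numbers are copied by hand from the table above.

```python
# Per-decade counts copied from the table above. Each value is
# (fictional samples, fictional tokens, non-fictional samples, non-fictional tokens).
# The dictionary and function names are illustrative assumptions.
DECADES = {
    1840: (36, 22506, 41, 22647),
    1850: (37, 22530, 42, 22831),
    1860: (37, 23115, 38, 22239),
    1870: (37, 22359, 39, 22431),
    1880: (35, 22159, 42, 22389),
    1890: (36, 22633, 42, 22755),
    1900: (37, 22528, 41, 22820),
    1910: (37, 23064, 42, 21968),
}


def column_totals(rows):
    """Sum each of the four columns across all decades."""
    return tuple(sum(column) for column in zip(*rows.values()))


def combined(row):
    """Return (all samples, all tokens) for a single decade row."""
    fic_samples, fic_tokens, non_samples, non_tokens = row
    return fic_samples + non_samples, fic_tokens + non_tokens


fic_samples, fic_tokens, non_samples, non_tokens = column_totals(DECADES)
print(fic_samples + non_samples, fic_tokens + non_tokens)  # 619 360974
```

Summing the columns this way reproduces the totals row (292 and 327 samples, 180,894 and 180,080 tokens), and `combined` reproduces the "all samples" and "all tokens" columns for any single decade.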

STWR instances

The following tables list the number of annotated STWR (speech, thought and writing representation) instances in the corpus. These instances vary greatly in length, from a single token (possible for the STWR types direct and reported) to several sentences (possible for the STWR types direct, freeIndirect and indirect/freeIndirect).

STWR types

Type	Number	Percent
direct	2929	31.1%
indirect	2077	22.1%
free indirect	97	1.0%
reported	4204	44.6%
indirect/free indirect	112	1.2%
total	9419

STWR medium

Medium	Number	Percent
speech	6003	63.7%
thought	2241	23.8%
writing	873	9.3%
ambig	302	3.2%
total	9419
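The Percent columns in the two STWR tables appear to be each count divided by the total of 9419 instances, rounded to one decimal place. The sketch below recomputes them under that assumption; the helper `percentages` and the dictionary names are hypothetical, and the counts are copied by hand from the tables above.

```python
# Counts copied from the "STWR types" and "STWR medium" tables above.
# Names are illustrative assumptions, not part of the corpus tooling.
STWR_TYPES = {
    "direct": 2929,
    "indirect": 2077,
    "free indirect": 97,
    "reported": 4204,
    "indirect/free indirect": 112,
}

STWR_MEDIUM = {
    "speech": 6003,
    "thought": 2241,
    "writing": 873,
    "ambig": 302,
}


def percentages(counts):
    """Return each count as a share of the total, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


for name, counts in (("types", STWR_TYPES), ("medium", STWR_MEDIUM)):
    print(name, sum(counts.values()))
    for label, share in percentages(counts).items():
        print(f"  {label}\t{counts[label]}\t{share}%")
```

Both dictionaries sum to the same total, because every annotated instance carries one type and one medium.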