Text specific information for transcription - DCMLab/ddd GitHub Wiki

On this page, I collect helpful information for the transcription of each text. This is a continuously updated work in progress, and anyone working on transcriptions should feel encouraged to contribute to it! One goal is to summarize the most common OCR errors and the idiosyncrasies of layout and spelling in each text, so that errors can be detected more reliably.

[So far, this is just a very rough outline]

Moritz Hauptmann - Die Natur der Harmonik und der Metrik

OCR Quality and most common errors

Generally, the OCR transcriptions are very accurate.
Connecting dashes are almost never recognized as "angled dash", but as normal dashes.
Umlauts are sometimes confused with their plain counterparts (ü = u, ä = a, ö = o).
Some lines are duplicated in the wrong paragraphs and need to be removed.
The OCR sometimes merges words that are printed close to one another.
"Digitized by" [Google] needs to be removed on most pages.

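Two of the clean-up steps listed above, removing the "Digitized by" watermark and spotting doubled lines, lend themselves to a quick script. The following is a minimal Python sketch, not part of any project tooling. It assumes the transcription is available as a list of plain-text lines and that the watermark sits on a line of its own; the names `strip_watermark` and `find_repeated_lines` are made up for illustration.

```python
import re

# Assumed form of the watermark line; adjust if the export looks different.
WATERMARK = re.compile(r"^\s*Digitized by\s*(Google)?\s*$", re.IGNORECASE)


def strip_watermark(lines):
    """Drop lines that consist only of the 'Digitized by [Google]' watermark."""
    return [line for line in lines if not WATERMARK.match(line)]


def find_repeated_lines(lines, min_length=20):
    """Return (first_index, repeat_index) pairs for identical, non-trivial lines.

    Short lines are skipped because they repeat legitimately
    (the threshold of 20 characters is an arbitrary assumption).
    """
    seen = {}
    repeats = []
    for i, line in enumerate(lines):
        key = line.strip()
        if len(key) < min_length:
            continue
        if key in seen:
            repeats.append((seen[key], i))
        else:
            seen[key] = i
    return repeats
```

The second function only reports candidates. Whether a repeated line really is a doubled OCR line still has to be checked against the page image.
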
Unusual spelling in Text

Avoiding capital Ä? -> "Aesserlich-äusserste" (p. 25, 4-3)
The text sometimes capitalizes "Das" in the middle of a line (see pp. 5, 31, 336) when it is not used as an article before a noun (e.g. "[...] über Das nachzudenken, was ihnen durch das natürliche Gefühl gesichert scheint", pp. 5f.).

Special layout choices

The text sometimes uses a long dash at the end of a paragraph to signal a stronger break (see p. 15).

Hugo Riemann - Ideen zu einer Lehre von den Tonvorstellungen

OCR Quality and most common errors

OCR is generally accurate for most normal text here.
Umlauts: small problems, particularly capital Ö = O.
Paragraphs are numbered as "- 2 -"; these numbers often need to be adjusted manually.
Single quotation marks are often missing or wrongly replaced.
Occasionally, the line recognition is broken for sections of a paragraph and needs to be redone manually.
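
To find the paragraph numbers that need checking, a small search over the exported lines can help. This is only a sketch under assumptions: the number stands alone on its line, the OCR may have swapped the hyphens for other dash characters, and the numbering is meant to be consecutive (which the notes above do not confirm). `find_paragraph_numbers` and `missing_numbers` are hypothetical helper names.

```python
import re

# Assumed shape: a line holding only a number between two dashes, e.g. "- 2 -".
# The pattern tolerates en/em dashes and missing spaces, which OCR may produce.
PARA_NUMBER = re.compile(r"^\s*[-\u2013\u2014]\s*(\d+)\s*[-\u2013\u2014]\s*$")


def find_paragraph_numbers(lines):
    """Return (line_index, number) for every line that looks like '- n -'."""
    found = []
    for i, line in enumerate(lines):
        match = PARA_NUMBER.match(line)
        if match:
            found.append((i, int(match.group(1))))
    return found


def missing_numbers(found):
    """List numbers absent between the smallest and largest number found."""
    numbers = {n for _, n in found}
    if not numbers:
        return []
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in numbers]
```

A gap in the result of `missing_numbers` points to a paragraph number that was either misread or dropped, so that spot is worth a manual look.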

Spelling and layout

Sometimes the text itself, which is a reproduction of the original, may contain spelling errors that are likely not present in the original (c = e, rm = m, in = m - typical OCR errors on their part?)
The entire text is a PDF output of a modern reproduction of the source. The pages of this reproduction do not align neatly with the images of the document.
Original page numbers are given within the text as /^n.
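
Since the reproduction's pages do not line up with the scans, the inline page markers are the most reliable way to locate a passage. The sketch below indexes them in Python. It assumes that a marker appears literally as a slash, a caret, and the page number in digits (for example `/^12`); if the markers look different in the actual file, the pattern needs adjusting. `index_page_markers` is a hypothetical name.

```python
import re

# Assumed literal form of a page marker: "/^" followed by digits, e.g. "/^12".
PAGE_MARKER = re.compile(r"/\^(\d+)")


def index_page_markers(text):
    """Return (page_number, character_offset) for each marker found in text."""
    return [(int(m.group(1)), m.start()) for m in PAGE_MARKER.finditer(text)]
```

The offsets can then be used to jump to the start of an original page when comparing the transcription with the corresponding image.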

Riemann 1880

OCR Quality and most common errors

OCR seems to be accurate for most normal words.
Dashes at the end of the line are transcribed correctly about half of the time.
Sometimes letters are missing in the transcription due to being a little faded in the scan.
Lots of unique symbols, which need to be added manually.

Weitzmann 1860

OCR Quality and most common errors

The fraktur double dash "=" needs to be replaced with "-" throughout
Angled dashes generally well recognized at the end of lines
Page numbers sometimes not recognized properly (?)
Letters at the beginning of lines sometimes missing (?)
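
The replacement of the fraktur double dash can be done in one pass. The following Python sketch shows two variants: a blanket replacement, matching the "throughout" above, and a more cautious one that only touches "=" directly after a letter and before a letter or a line end. The cautious variant and the function name `replace_double_dash` are assumptions added for illustration, in case the text also contains genuine equals signs.

```python
import re

# "=" directly after a letter, followed by a letter or the end of the line.
# [^\W\d_] matches any Unicode letter, including umlauts.
DOUBLE_DASH = re.compile(r"(?<=[^\W\d_])=(?=[^\W\d_]|$)", re.MULTILINE)


def replace_double_dash(text, only_between_letters=False):
    """Replace the fraktur double dash '=' with a normal hyphen '-'."""
    if only_between_letters:
        return DOUBLE_DASH.sub("-", text)
    return text.replace("=", "-")
```

For example, `replace_double_dash("Dur=Accord")` gives `"Dur-Accord"`, and with `only_between_letters=True` an expression such as `2 = 2` is left unchanged.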

Weitzmann 1861

OCR Quality and most common errors

The automated line detection is completely faulty for most of the document.

Kunkel 1863

Scan and OCR Quality

The scanned pages have a lot of visual noise, leading Transkribus to detect many spurious text regions within it. Drawing baselines manually is almost worth it time-wise, I feel.
Merging lines also takes some time, but is manageable.
The text is reasonably well recognized, but does need careful correction, partly due to the noise interfering with the text.
Some letters on the left side are cut off in the scan and need to be inferred from context.

Unusual spellings

Uses a "2c." looking shorthand for "etc."