Text-specific information for transcription - DCMLab/ddd GitHub Wiki
On this page, I will collect helpful information for the transcription of each text. It is a continuously updated work in progress, and anyone working on transcriptions should feel encouraged to contribute to it! One goal is to summarize the most common OCR errors and the idiosyncrasies of layout and spelling in each text, in order to help detect errors more reliably.
[So far, this is just a very rough outline]
Moritz Hauptmann - Die Natur der Harmonik und der Metrik
OCR Quality and most common errors
- Generally, the OCR transcriptions are very accurate
- Connecting dashes are almost never recognized as "angled dashes", but as normal dashes.
- Umlauts are sometimes confused with their plain counterparts (ü = u, ä = a, ö = o).
- Some lines are duplicated in the wrong paragraphs and need to be removed.
- The OCR sometimes merges words that are printed close to one another.
- "Digitized by" [Google] needs to be removed on most pages.
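Two of the points above (the "Digitized by" watermark and doubled lines) lend themselves to a quick automated check on exported plain text. The following is only a minimal sketch: it assumes the watermark sits on a line of its own, the function names and the `min_length` threshold are made up for illustration, and the sample lines are invented, not taken from the book.

```python
import re
from collections import Counter

# Assumption: the watermark occupies a whole line, e.g. "Digitized by Google".
WATERMARK = re.compile(r"^\s*Digitized by\b.*$", re.IGNORECASE)


def strip_watermark(lines):
    """Return the lines without watermark-only lines."""
    return [line for line in lines if not WATERMARK.match(line)]


def repeated_lines(lines, min_length=20):
    """Return lines that occur more than once (candidates for doubled lines).

    Lines shorter than min_length are ignored, since short lines
    can repeat legitimately.
    """
    counts = Counter(
        line.strip() for line in lines if len(line.strip()) >= min_length
    )
    return [text for text, count in counts.items() if count > 1]


# Invented sample page, for illustration only.
page = [
    "erste erfundene Beispielzeile mit genug Text",
    "zweite erfundene Beispielzeile",
    "erste erfundene Beispielzeile mit genug Text",
    "Digitized by Google",
]
cleaned = strip_watermark(page)
print(cleaned)
print(repeated_lines(cleaned))
```

The second function only reports candidates; whether a repeated line is really an OCR duplicate still has to be checked against the page image.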
Unusual spelling in Text
- Avoiding capital Ä? -> "Aesserlich-äusserste" (p. 25, 4-3)
- Sometimes capitalizes "Das" in the middle of a line (see pp. 5, 31, 336) when it is not used as the article of a noun (e.g. "[...] über Das nachzudenken, was ihnen durch das natürliche Gefühl gesichert scheint", pp. 5f.).
Special layout choices
- Sometimes uses a long dash at the end of a paragraph to signal a stronger break (see p. 15).
Hugo Riemann - Ideen zu einer Lehre von den Tonvorstellungen
OCR Quality and most common errors
- OCR generally accurate for most normal text here.
- Umlauts: small problems, particularly capital Ö = O.
- Paragraphs are numbered as "- 2 -"; these numbers often need to be adjusted manually.
- Single quotation marks are often missing or wrongly replaced.
- Occasionally, the line recognition is broken for sections of a paragraph and needs to be redone manually.
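To locate the "- 2 -" style numbers before adjusting them by hand, a small script can list them and point out gaps in the sequence. This is a hedged sketch, assuming each marker stands alone on its line in a plain-text export; the function names are hypothetical.

```python
import re

# Assumption: a marker sits alone on its line and looks like "- 2 -".
MARKER = re.compile(r"^\s*-\s*(\d+)\s*-\s*$")


def find_markers(lines):
    """Return (line_index, number) for every line that consists only of a marker."""
    found = []
    for index, line in enumerate(lines):
        match = MARKER.match(line)
        if match:
            found.append((index, int(match.group(1))))
    return found


def out_of_sequence(markers):
    """Return markers whose number is not one more than the previous marker's."""
    return [
        current
        for previous, current in zip(markers, markers[1:])
        if current[1] != previous[1] + 1
    ]


# Invented sample, for illustration only.
sample = ["- 1 -", "Text", "- 2 -", "mehr Text", "- 4 -"]
markers = find_markers(sample)
print(markers)
print(out_of_sequence(markers))
```

A gap in the sequence usually means that a marker was misread or dropped, so the reported positions are a reasonable place to start checking.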
Spelling and layout
- Sometimes the text itself, which is a reproduction of the original, may contain spelling errors that are likely not present in the original (c = e, rm = m, in = m - typical OCR errors on their part?)
- The entire text is a PDF output of a modern reproduction of the source. The pages of this reproduction do not align neatly with the images of the document.
- Original page numbers are given within the text as /^n.
Riemann 1880
OCR Quality and most common errors
- OCR seems to be accurate for most normal words.
- Dashes at the end of the line are transcribed correctly about half of the time.
- Sometimes letters are missing in the transcription due to being a little faded in the scan.
- Lots of unique symbols that need to be added manually.
Weitzmann 1860
OCR Quality and most common errors
- The fraktur double dash "=" needs to be replaced with "-" throughout
- Angled dashes generally well recognized at the end of lines
- Page numbers sometimes not recognized properly (?)
- Letters at the beginning of lines sometimes missing (?)
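The "=" to "-" replacement mentioned above can be done in one pass over a plain-text export. The sketch below is deliberately conservative: it only replaces "=" when it directly follows a letter and is followed by a letter or the end of the line, so that a free-standing "=" is left alone. That restriction is an assumption of this sketch, not a rule from the transcription guidelines, and the sample words are invented.

```python
import re

# "=" directly after a letter, followed by a letter or by the end of the line.
# [^\W\d_] matches any Unicode letter (word characters minus digits and "_").
DOUBLE_HYPHEN = re.compile(r"(?<=[^\W\d_])=(?=[^\W\d_]|[ \t]*$)", re.MULTILINE)


def replace_double_hyphen(text):
    """Replace the Fraktur double hyphen "=" with "-" inside and at the end of words."""
    return DOUBLE_HYPHEN.sub("-", text)


# Invented samples, for illustration only.
print(replace_double_hyphen("Dur=Accord"))
print(replace_double_hyphen("Ton=\nart"))
print(replace_double_hyphen("2 = 2"))
```

If every "=" in the text is known to be a hyphen, a plain `text.replace("=", "-")` does the same job with less machinery.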
Weitzmann 1861
OCR Quality and most common errors
- The automated line detection is completely faulty for most of the document.
Kunkel 1863
Scan and OCR Quality
- The scanned pages have a lot of visual noise, leading Transkribus to detect lots of wrong text regions within it. Drawing baselines manually is almost worth it timewise, I feel.
- Merging lines also takes some time, but is manageable.
- The text is reasonably well recognized, but does need careful correction, partly due to the noise interfering with the text.
- Some letters on the left side are cut off in the scan and need to be inferred from context.
Unusual spellings
- Uses a shorthand that looks like "2c." for "etc."
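To review every place where this shorthand shows up, the lines containing it can be listed rather than replaced automatically, since how it should be transcribed is a decision for the transcriber. A minimal sketch, assuming the OCR output really contains the literal characters "2c." and that a plain-text export is available; the function name is hypothetical and the sample lines are invented.

```python
import re

# "2c." not preceded by another digit or letter, so "12c." is not reported.
SHORTHAND = re.compile(r"(?<!\w)2c\.")


def find_shorthand(lines):
    """Return (line_number, line) for every line containing the "2c." shorthand."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if SHORTHAND.search(line)
    ]


# Invented sample, for illustration only.
sample = ["Terzen, Quinten 2c.", "Takt 12c. bleibt", "nichts"]
for number, line in find_shorthand(sample):
    print(number, line)
```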