Datasets - Giovanni1085/KB_OCR_impact GitHub Wiki
Brief documentation on datasets
Please see the URLs below or ask the KB to get access to the datasets.
Meertens 17th century newspapers
- Contents: 6,425 newspapers from the 17th century.
- URL: None, articles are added to https://www.delpher.nl/nl/kranten. News item: https://www.meertens.knaw.nl/cms/nl/nieuws-agenda/nieuws-overzicht/278-2020/146210-crowdsourcing-maakt-zeventiende-eeuwse-kranten-op-delpher-beter-doorzoekbaar
- Gist of it: it contains OCRed texts and ground truth (both as plain text)
DBNL OCR
- Contents: 220 books.
- URL: https://lab.kb.nl/dataset/dbnl-ocr-data-set.
- Gist of it: it contains OCRed texts (.txt) and ground truth (.tei).
IMPACT
- Contents: ~4.5k pages from varied sources
- URL: https://lab.kb.nl/dataset/ground-truth-impact-project.
- Gist of it: it contains tiff images and ground truth in XML Page format.
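As an illustration of working with the XML Page ground truth mentioned above, here is a minimal sketch that collects the region-level text from one file. It assumes the usual PAGE layout (`TextRegion` elements holding a `TextEquiv`/`Unicode` child); the sample document, its namespace version, and the helper names `local_name` and `page_region_texts` are made up for this example, so check them against the actual files before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; real files may use another namespace version.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="example.tif">
    <TextRegion id="r1">
      <TextEquiv><Unicode>Eerste regel</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextEquiv><Unicode>Tweede regel</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works for any schema version."""
    return tag.rsplit("}", 1)[-1]


def page_region_texts(xml_string):
    """Return the text of each TextRegion, in document order."""
    root = ET.fromstring(xml_string)
    texts = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextRegion":
            continue
        # Only look at the region's own TextEquiv, not those of nested lines or words.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for uni in child:
                if local_name(uni.tag) == "Unicode":
                    texts.append(uni.text or "")
    return texts


if __name__ == "__main__":
    for text in page_region_texts(SAMPLE_PAGE):
        print(text)
```

Matching on local tag names rather than a fixed namespace is a deliberate choice here, since the schema version of the files is not stated on this page.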
IMPACT Books
- Contents: 2055 book pages dating from 1630 to 1796, from Early Dutch Books Online and Digitale Topstukken.
- URL: https://lab.kb.nl/dataset/ground-truth-impact-project.
- Gist of it: it contains tiff images and ground truth in XML Page format.
IMPACT ANP
- Contents: 205 typewritten radio bulletins from 1937, from Delpher
- URL: https://lab.kb.nl/dataset/ground-truth-impact-project.
- Gist of it: it contains tiff images and ground truth in XML Page format.
IMPACT Surf
- Contents: 2k pages from newspapers.
- URL: None (in Surf), but used here https://lab.kb.nl/about-us/blog/%E2%80%8Bnewspaper-ocr-quality-–-what-do-we-have-and-how-can-we-improve-it
- Gist of it: it contains two OCR versions and two ground truth versions, all in ALTO.
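For the ALTO files, a minimal sketch of turning one document into plain text might look like the following. It relies on the standard ALTO structure, in which each `TextLine` holds `String` elements with the word in a `CONTENT` attribute. The sample document and the helper names `local_name` and `alto_text` are invented for illustration, and hyphenation (`HYP`) elements are ignored, which the real data may require handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; real files may use another ALTO version.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Nieuws"/><SP/><String CONTENT="uit"/></TextLine>
    <TextLine><String CONTENT="Amsterdam"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works for any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_text(xml_string):
    """Join the words of each TextLine with spaces, one output line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    print(alto_text(SAMPLE_ALTO))
```

Running the same function over an OCR version and a ground truth version of a page gives two plain-text strings that can then be compared.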