ReleaseNotes - UB-Mannheim/AustrianNewspapers GitHub Wiki

2.0.0

What's new?

Release 2.0.0 revises the ground truth to conform to Level 2 of the OCR-D GT Guidelines. The revision covers the textual content, baselines, polygonal features, region tags and IDs of the PAGE-XML files, as well as the README and the repository folder structure.

Changes to the project structure

  • Change folder structure and README to follow the OCRD-GT-Repo-Template
    • Keep the validation and training sets
  • Delete GT line pairs

Enhancements to PAGE-XML

  • Standardisation of glyphs
    • Double oblique hyphen (⸗)
    • Em dash (—) instead of en dash (–)
    • Asterisk variants unified to the plain asterisk (*)
  • Enhancements and standardisation according to OCR-D Ground Truth Guidelines Level 2
    • Long s (ſ)
    • R rotunda (ꝛ)
    • Fractions (¼ ½ ¾ ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞)
    • Fraction slash (⁄, U+2044), used if
      • the fraction cannot be transcribed with a precomposed Unicode fraction character
      • numerator and denominator are not at the same baseline height
  • Labeling of text regions
    • header
    • headings
    • paragraphs
    • footer
    • reference
  • Correcting reading order
  • Unique IDs based on the new reading order
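
The glyph standardisation and fraction rules listed above can be illustrated with a minimal Python sketch. It is not the tooling used for this release. The list of asterisk variants is an assumption chosen for illustration, and the names `normalize_glyphs` and `transcribe_fraction` are hypothetical. Whether numerator and denominator share a baseline is a judgment made on the page image, so the sketch does not model that condition.

```python
# Illustrative sketch only; not the project's actual normalisation tooling.

# Assumed examples of asterisk variants (the real set found in the data is not documented here).
ASTERISK_VARIANTS = "\u2217\u204E\u2731\uFF0A"

# Translation table: every assumed asterisk variant -> "*", en dash -> em dash.
GLYPH_MAP = {ord(ch): "*" for ch in ASTERISK_VARIANTS}
GLYPH_MAP[ord("\u2013")] = "\u2014"

# Precomposed Unicode fractions named in the release notes, keyed by (numerator, denominator).
UNICODE_FRACTIONS = {
    (1, 4): "¼", (1, 2): "½", (3, 4): "¾", (1, 7): "⅐", (1, 9): "⅑", (1, 10): "⅒",
    (1, 3): "⅓", (2, 3): "⅔", (1, 5): "⅕", (2, 5): "⅖", (3, 5): "⅗", (4, 5): "⅘",
    (1, 6): "⅙", (5, 6): "⅚", (1, 8): "⅛", (3, 8): "⅜", (5, 8): "⅝", (7, 8): "⅞",
}

FRACTION_SLASH = "\u2044"


def normalize_glyphs(text: str) -> str:
    """Replace the assumed asterisk variants and en dashes in a transcription line."""
    return text.translate(GLYPH_MAP)


def transcribe_fraction(numerator: int, denominator: int) -> str:
    """Use a precomposed fraction if one exists, otherwise fall back to the fraction slash."""
    precomposed = UNICODE_FRACTIONS.get((numerator, denominator))
    if precomposed is not None:
        return precomposed
    return f"{numerator}{FRACTION_SLASH}{denominator}"
```

For example, `transcribe_fraction(1, 4)` returns the single character ¼, while `transcribe_fraction(3, 16)` has no precomposed form and falls back to 3⁄16 written with U+2044.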

Changes compared to release 1.1.0

Austrian Newspapers 2.0.0 provides revised transcriptions according to the OCR-D GT Guidelines Level 2.

Among other things, these revisions and enhancements include:

| | Austrian Newspapers 1.1.0 | Austrian Newspapers 2.0.0 |
|---|---|---|
| Number of "long s" [ſ] | 57,599 | 58,629 |
| Increase in % | - | 1.8 |
| Number of "double oblique hyphen" [⸗] | 11,745 | 11,857 |
| Increase in % | - | 0.9 |
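
How the counts above were produced is not stated. As a rough sketch, glyph occurrences could be counted in the line-level transcriptions of a PAGE-XML file and the relative change computed from the two totals. The function names and the `SAMPLE` document below are hypothetical. Restricting the count to `TextLine` elements is an assumption made to avoid counting the same text again at region level.

```python
import xml.etree.ElementTree as ET

LONG_S = "\u017F"                 # ſ
DOUBLE_OBLIQUE_HYPHEN = "\u2E17"  # ⸗


def count_glyph(page_xml: str, glyph: str) -> int:
    """Count a glyph in line-level TextEquiv/Unicode elements only.

    The "{*}" wildcard (Python 3.8+) matches any PAGE-XML namespace version.
    """
    root = ET.fromstring(page_xml)
    lines = root.iterfind(".//{*}TextLine/{*}TextEquiv/{*}Unicode")
    return sum((el.text or "").count(glyph) for el in lines)


def percent_increase(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# Hypothetical miniature PAGE-XML document used only to exercise the sketch.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Waſſer</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>Waſſer</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

long_s_in_sample = count_glyph(SAMPLE, LONG_S)      # region-level copy is ignored
long_s_change = percent_increase(57599, 58629)      # totals taken from the table
hyphen_change = percent_increase(11745, 11857)
```

With the totals from the table, `long_s_change` rounds to 1.8, matching the reported figure. `hyphen_change` comes out at roughly 0.95, which the table reports as 0.9.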