Home - hcts-hra/ecpo-fulltext-experiments GitHub Wiki

This wiki documents the process of extracting full text from the 1919–1940 issues of 晶報 (Jīngbào), a Republican-era Chinese entertainment newspaper.


Page Segmentation:

  1. Rule-based Approaches

    1.1 Morphological Opening to Connect Text Blocks

    1.2 Finding and Connecting Separators

  2. ML-driven Approaches

    2.1 Fine-tuning eynollah

Character Segmentation Using HRCenterNet:

  1. The MTHv2 Dataset

Building an OCR Classifier:

  1. Ground Truth Stats

  2. First Experiments

  3. Extracting Character Images

  4. Synthesizing Artificial Character Images

  5. OCR Correction Using Language Models