4. HTR OCR - SunoikisisDC/SunoikisisDC-2023-2024 GitHub Wiki

HTR and OCR from papyrus to codex

SunoikisisDC Digital Classics and Byzantine Studies: Session 4

Date: Monday April 29, 2024. 16:00-17:30 BST = 17:00-18:30 CEST.

Convenors: Maxime Guénette (Université de Montréal), Isabelle Marthot-Santaniello (Universität Basel), John Pavlopoulos (Athens University of Economics and Business), Paraskevi Platanou (National and Kapodistrian University of Athens)

YouTube link: youtu.be/wlYFIiFXNDg

Slides: Combined slides (PDF)

Outline

The session will be divided into three parts.

The first part will focus on the digital paleography of ancient Greek manuscripts, specifically papyri, whose study is renowned in Classics for pioneering digital approaches. Papyri nevertheless pose particular challenges (small size, degraded preservation, and scripts that vary widely over time) that require specialized digital methodologies. This segment will discuss the significance of, and ongoing research in, the digital paleography of papyri.

The second part will address the challenges of Handwritten Text Recognition (HTR) for Greek manuscripts from the 10th to the 16th century, along with strategies for mitigating them. The project presented involves transcribing images of Greek manuscripts from the Bodleian Library to create a new dataset for Handwritten Paleographic Greek Text Recognition. The data will be classified by century, and experiments will be conducted on the Transkribus platform.

The third part will introduce the traditional pipeline of an HTR project on the eScriptorium platform. Using Codex Palatinus graecus 23, a 10th-century Byzantine manuscript, as a case study, this segment will cover state-of-the-art HTR technology, the advantages and limitations of eScriptorium, its functionalities, and practical tips for project implementation. Often overlooked aspects, such as transcription volume, Unicode standardization, and ontology usage, will be emphasized.

Required readings

  • Clérice, Thibault, Malamatenia Vlachou-Efstathiou, and Alix Chagué. ‘CREMMA Medii Aevi: Literary Manuscript Text Recognition in Latin’. Journal of Open Humanities Data 9, no. 1 (2023). https://doi.org/10.5334/johd.97.
  • Marthot-Santaniello, I. "D-scribes Project and Beyond: Building a Virtual Research Environment for the Digital Palaeography of Ancient Greek and Coptic Papyri", ed. Claire Clivaz and Garrick V. Allen, special issue, Classics@ 18, 2021 Online version.
  • Marthot-Santaniello, I., Manh Tu Vu, Olga Serbaeva, Marie Beurton-Aimar, “Stylistic Similarities in Greek Papyri Based on Letter Shapes: A Deep Learning Approach” in M. Coustaty and A. Fornés (eds), Document Analysis and Recognition – ICDAR 2023 Workshops. Lecture Notes in Computer Science, vol 14193. Springer, Cham, 2023, p. 307–323. https://doi.org/10.1007/978-3-031-41498-5_22
  • Pavlopoulos, J., Kougia, V., Platanou, P., Shabalin, S., Liagkou, K., Papadatos, E., Essler, H., Camps, J., & Fischer, F. (2023). Error Correcting HTR’ed Byzantine Text. https://doi.org/10.21203/rs.3.rs-2921088/v1.
  • Pavlopoulos, J., Kougia, V., Platanou, P., & Essler, H. (2023). Detecting Erroneously Recognized Handwritten Byzantine Text. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 7818–7828). Singapore: Association for Computational Linguistics.
  • Perdiki, E., & Konstantinidou, M. (2021). Handling Big Manuscript Data. In C. Clivaz & G. V. Allen (Eds.), Classics@ 18 (Ancient Manuscripts and Virtual Research Environments, special issue). Retrieved from https://classics-at.chs.harvard.edu/classics18-perdiki-and-konstantinidou/
  • Perdiki, Elpida. (2022). “Review of ‘Transkribus: Reviewing HTR training on (Greek) manuscripts’.” RIDE 15. https://doi.org/10.18716/ride.a.15.6.
  • Pinche, Ariane, and Peter Stokes. ‘Historical Documents and Automatic Text Recognition: Introduction’. Journal of Data Mining & Digital Humanities Documents historiques et reconnaissance automatique de textes (2024): 13247. https://doi.org/10.46298/jdmdh.13247.
  • Platanou, P., Pavlopoulos, J., & Papaioannou, G. (2022). Handwritten Paleographic Greek Text Recognition: A Century-Based Approach. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 6585–6589). Marseille, France: European Language Resources Association.
  • Tsochatzidis, Lazaros, Symeon Symeonidis, Alexandros Papazoglou, and Ioannis Pratikakis. ‘HTR for Greek Historical Handwritten Documents’. Journal of Imaging 7, no. 12 (2021): 260. https://doi.org/10.3390/jimaging7120260.

Further readings

  • Calvelli, Lorenzo, Federico Boschetti, and Tatiana Tommasi. ‘EpiSearch. Identifying Ancient Inscriptions in Epigraphic Manuscripts’. Journal of Data Mining & Digital Humanities Documents historiques et reconnaissance automatique de textes (2023): 10417. https://doi.org/10.46298/jdmdh.10417.
  • Chagué, Alix, and Thibault Clérice. ‘“I’m Here to Fight for Ground Truth”: HTR-United, a Solution towards a Common for HTR Training Data’. In Digital Humanities 2023: Collaboration as Opportunity, 2023. https://inria.hal.science/hal-04094233.
  • Chagué, Alix, Thibault Clérice, and Laurent Romary. ‘HTR-United : un écosystème pour une approche mutualisée de la transcription automatique des écritures manuscrites’, 2022. https://inria.hal.science/hal-04124743.
  • Chauhan, Rohan. ‘Train Your Own OCR/HTR Models with Kraken, Part 1’. The Digital Orientalist (blog), 26 September 2023. https://digitalorientalist.com/2023/09/26/train-your-own-ocr-htr-models-with-kraken-part-1/.
  • Clérice, Thibault. ‘You Actually Look Twice At It (YALTAi): Using an Object Detection Approach Instead of Region Segmentation within the Kraken Engine’. Journal of Data Mining & Digital Humanities Documents historiques et reconnaissance automatique de textes (2023): 9806. https://doi.org/10.46298/jdmdh.9806.
  • Gabay, Simon, Ariane Pinche, Kelly Christensen, and Jean-Baptiste Camps. ‘SegmOnto: A Controlled Vocabulary to Describe and Process Digital Facsimiles’, 2023. https://hal.science/hal-04343404.
  • Kaddas, Panagiotis, Konstantinos Palaiologos, Basilis Gatos, Vassilis Katsouros, and Katerina Christopoulou. ‘A System for Processing and Recognition of Greek Byzantine and Post-Byzantine Documents’. In Document Analysis and Recognition - ICDAR 2023, edited by Gernot A. Fink, Rajiv Jain, Koichi Kise, and Richard Zanibbi, 366–76. Lecture Notes in Computer Science. Cham: Springer Nature Switzerland, 2023. https://doi.org/10.1007/978-3-031-41685-9_23.
  • Kindt, Bastien, Chahan Vidal-Gorène, and Saulo Delle Donne. ‘Analyse Automatique Du Grec Ancien Par Réseau de Neurones. Évaluation Sur Le Corpus De Thessalonica Capta’. Bulletin de l’Académie Belge Pour l’Étude Des Langues Anciennes et Orientales 1011 (2022): 537–62. https://doi.org/10.14428/babelao.vol1011.2022.65073.
  • Markou, K., L. Tsochatzidis, K. Zagoris, A. Papazoglou, X. Karagiannis, S. Symeonidis, and I. Pratikakis. ‘A Convolutional Recurrent Neural Network for the Handwritten Text Recognition of Historical Greek Manuscripts’. In Pattern Recognition. ICPR International Workshops and Challenges, edited by Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, 249–62. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2021. https://doi.org/10.1007/978-3-030-68787-8_18.
  • Perdiki, Elpida. ‘Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training’. Journal of Data Mining & Digital Humanities Documents historiques et reconnaissance automatique de textes (2023): 10419. https://doi.org/10.46298/jdmdh.10419.
  • Pinche, Ariane, Thibault Clérice, Alix Chagué, Jean-Baptiste Camps, Malamatenia Vlachou-Efstathiou, Matthias Gille Levenson, Olivier Brisville-Fertin, et al. ‘CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts’, 2024. https://inria.hal.science/hal-04346939.
  • Pinche, Ariane. ‘Generic HTR Models for Medieval Manuscripts. The CREMMALab Project’. Journal of Data Mining & Digital Humanities Documents historiques et reconnaissance automatique de textes (2023): 10252. https://doi.org/10.46298/jdmdh.10252.
  • Reggiani, N. 2017. Digital Papyrology I: Methods, Tools and Trends. Berlin.
  • Romein, C. Annemieke, Tobias Hodel, Femke Gordijn, Joris J. Van Zundert, Alix Chagué, Milan Van Lange, Helle Strandgaard Jensen, et al. ‘Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done’. Journal of Data Mining & Digital Humanities Documents historiques et reconnaissance automatique de textes (2024): 10403. https://doi.org/10.46298/jdmdh.10403.
  • Stokes, Peter A., Benjamin Kiessling, Daniel Stökl Ben Ezra, Robin Tissot, and El Hassane Gargel. ‘The eScriptorium VRE for Manuscript Cultures’. Edited by Claire Clivaz and Garrick V. Allen. Classics@ Journal: Ancient Manuscripts and Virtual Research Environments 18 (2021). https://classics-at.chs.harvard.edu/classics18-stokes-kiessling-stokl-ben-ezra-tissot-gargem/.
  • Vesuvius Challenge on Herculaneum papyri: https://scrollprize.org/
  • Videos from the conference Perceptions of Writing in Papyri. Crossing Close and Distant Readings https://d-scribes.philhist.unibas.ch/en/events-1/papyri-conference/

Resources

Exercise

Choose at least two options

Option 1

This reading (Marthot-Santaniello, I. "D-scribes Project and Beyond: Building a Virtual Research Environment for the Digital Palaeography of Ancient Greek and Coptic Papyri", ed. Claire Clivaz and Garrick V. Allen, special issue, Classics@ 18, 2021 Online version) presents online tools for papyrology and gives background on the D-scribes project. The experiment described in another reading (Isabelle Marthot-Santaniello, Manh Tu Vu, Olga Serbaeva, Marie Beurton-Aimar, “Stylistic Similarities in Greek Papyri Based on Letter Shapes: A Deep Learning Approach” in M. Coustaty and A. Fornés (eds), Document Analysis and Recognition – ICDAR 2023 Workshops. Lecture Notes in Computer Science, vol 14193. Springer, Cham, 2023, p. 307–323. https://doi.org/10.1007/978-3-031-41498-5_22) showed a strong similarity between the mus (instances of the letter μ) in two papyri, TM 60589 and TM 60333. We now have to evaluate how robust the similarity proposed by the AI is.

  1. First, what are these papyri? Go on https://www.trismegistos.org/ and collect the basic metadata: current location (collection, inventory number), content, date, and provenance. Where should we look if we want further bibliography on the papyri?
  2. Then look at the papyri yourself. Open the following links in two separate windows.

Which features are useful for comparing the handwritings?

  3. Interpret the results. How similar are the handwritings? You can select other images (in the drop-down menu on the left) to get a sense of what the other papyri look like. How do you interpret these similarities?

Option 2

Requirements

Check the compatibility of the Transkribus tool with your operating system (Windows, macOS, Linux) and ensure that your device meets the minimum system requirements specified by Transkribus.

Steps

In the How-To guides provided by Transkribus you can find information on the steps you need to follow:

  1. Registration/Login
  2. User interface overview
  3. Uploading files
  4. Layout recognition
  5. Transcribing manually
  6. Repeat steps 3, 4 and 5
  7. Training a model
  8. Text recognition

For this exercise you will use images (transcriptions are provided) of Greek manuscripts from the Bodleian Library which can be found in this repository. Suggestion: since the main scope of this exercise is to familiarize students with the challenges of HTR, you may use 4 images for training, 2 images for validation and 1 image for testing.
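The suggested 4/2/1 split can be prepared before uploading anything to Transkribus. The following is only a minimal sketch in Python: the file names are hypothetical placeholders for the images in the repository, and `split_pages` is an illustrative helper, not part of Transkribus. It shuffles the pages with a fixed seed so that the same split can be reproduced later.

```python
import random


def split_pages(pages, n_train=4, n_val=2, n_test=1, seed=0):
    """Shuffle page names and split them into train/validation/test lists."""
    needed = n_train + n_val + n_test
    if len(pages) < needed:
        raise ValueError(f"need at least {needed} pages, got {len(pages)}")
    shuffled = list(pages)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:needed]
    return train, val, test


# Hypothetical file names; replace them with the images you downloaded.
pages = [f"page_{i:02d}.jpg" for i in range(1, 8)]
train, val, test = split_pages(pages)
print("train:", train)
print("validation:", val)
print("test:", test)
```

Keeping the test page out of both training and validation is what makes the final recognition result a fair indication of how the model handles unseen handwriting.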

After completing the exercise, reflect on the difficulties you encountered at each step. What challenges does an HTR system face when recognizing Greek manuscripts of the Byzantine period?

Option 3

Requirements

Google Colab runs entirely in the cloud, so you will need a stable internet connection to access it.

Steps

Open the exercise and follow the steps included: exercise_TranskribusDataAnalysis

Option 4

Requirements

You will need to either install eScriptorium locally on your computer (Linux or Mac) or ask for access to a server already hosting a version of eScriptorium. The second option is more convenient for most people because installing eScriptorium locally requires some coding knowledge. For the sake of this exercise, we will provide an account for the CREMMA server.

Steps

  1. Choose any manuscript in Ancient Greek that can be downloaded as a PDF, JPG or imported by IIIF. We suggest looking at the digital collection of the Bibliotheca Palatina in Heidelberg which contains hundreds of Ancient Greek manuscripts, for example the Codex Palatinus graecus 23 (The Palatine Anthology). Download 2-3 pages as JPG files on your computer.
  2. Go to eScriptorium CREMMA server and log in with username “formation_maxime” and password “formation2024”. Go to My Project on the top right of your screen and choose the project “Workshop_Sunoikisis”.
  3. You can now create a new document for your project. You will have to fill in some information: the name, the writing system (script) of your document (Greek in our case), the reading direction (left to right for Ancient Greek/Latin) and the position of the line in the polygon mask for the segmentation (the default Baseline is fine). You can also provide additional metadata such as the location of the manuscript, its DOI/URN, etc.
  4. Now, you need to upload the segmentation and the HTR models specialized in Ancient Greek. Go to this URL and download the file meleagre-NFD-finetuned.mlmodel. Return to eScriptorium and go to My Models, and then Upload a model. You can now upload one model at a time.
  5. You are now ready to upload your images. You can go back to your document and then click to access the Images section. You can either drag and drop them or click on the box at the top of your screen to pick them in your computer’s folders.
  6. After uploading your images, you can begin the segmentation process. Select the images you want to segment and click on the blue Segment button. Choose the Ancient Greek Segmenter model, leave everything else on default and then click on Segment. The process will usually take a couple of seconds.
  7. To see the result of the segmentation, click on the Edit button. You can now add or delete lines and regions, adjust polygon masks, change the types of lines and regions, etc. Did the segmentation model do a good job?
  8. After correcting your segmentation, you are ready to transcribe your images automatically. Go back to the Images section, select the images you want to transcribe and press the blue Transcribe button on the right of your screen. Then select the Ancient Greek HTR model you uploaded earlier and press Transcribe. This usually takes some time, depending on how many images are being transcribed.
  9. You can now see the result of your automatic transcription by pressing the Edit button. Near the gear button at the top right of your screen, you will need to switch from the manual transcription to the one you have just produced (usually named after the model). You can then activate the transcription panel by pressing the Transcription button, also at the top right of your screen. Did the model make a lot of mistakes?
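To put a number on "a lot of mistakes", one common measure is the character error rate (CER): the edit distance between the automatic transcription and a manually corrected reference, divided by the length of the reference. The sketch below is a minimal illustration in Python and is not part of eScriptorium; the function names are hypothetical. It normalizes both strings to Unicode NFD before comparing them, on the assumption (suggested only by the model name meleagre-NFD-finetuned) that the model outputs decomposed characters. Without a shared normalization form, a precomposed and a decomposed accent would be counted as errors even when the text looks identical.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis, form="NFD"):
    """Character error rate after normalizing both strings to the same form."""
    ref = unicodedata.normalize(form, reference)
    hyp = unicodedata.normalize(form, hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Illustrative strings: "θεά" as a reference, and an output missing the accent.
reference = "\u03b8\u03b5\u03ac"
hypothesis = "\u03b8\u03b5\u03b1"
print(f"CER: {cer(reference, hypothesis):.2f}")
```

Note that in NFD each accent or breathing mark counts as a separate character, so a missing diacritic is scored as one error; whether that is the right granularity depends on the transcription guidelines your project follows.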