1. Open Texts - SunoikisisDC/SunoikisisDC-2023-2024 GitHub Wiki
Free and open primary texts
SunoikisisDC Digital Classics and Byzantine Studies: Session 1
Date: Monday April 8, 2024. 16:00-17:30 BST = 17:00-18:30 CEST.
Convenors: Monica Berti (Universität Leipzig), Gabriel Bodard (University of London), Martina Filosa (Universität zu Köln)
YouTube link: youtu.be/TXl8Ap5KwW8
Slides: Combined slides (PDF)
Outline
The goal of this introductory session is to present free and open resources that collect data and metadata about primary texts in ancient Greek and Latin.
The session will focus on:
- Sources of texts (Wikisource, Gutenberg, Google Books, Lace, Loebulus, Scaife Viewer, Open Greek & Latin (OGL), Latin Library, etc.)
- Brief intro to Regular Expressions for cleaning up texts (e.g. stripping XML tags and other editorial artefacts).
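As a taste of the second point, here is a minimal sketch in Python of stripping XML tags with Regular Expressions. The sample string and the function name `strip_tags` are invented for illustration, and the approach assumes simple, well-behaved markup; a proper XML parser is the safer tool for anything complex.

```python
import re

# Hypothetical sample: one verse line with markup and an embedded editorial note.
sample = '<l n="1">Arma virumque cano, <note>editorial note</note>Troiae qui primus ab oris</l>'


def strip_tags(text: str) -> str:
    """Drop <note> elements with their content, then remove all remaining tags."""
    # Step 1: remove notes entirely, since their content is not part of the text.
    without_notes = re.sub(r"<note\b[^>]*>.*?</note>", "", text, flags=re.DOTALL)
    # Step 2: remove every other tag but keep the text between the tags.
    without_tags = re.sub(r"<[^>]+>", "", without_notes)
    # Step 3: collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", without_tags).strip()


print(strip_tags(sample))
```

The order matters: if all tags were stripped first, the words of the note would be left behind in the running text.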
Required readings
- Alison Babeu. 2019. "The Perseus Catalog: of FRBR, Finding Aids, Linked Data, and Open Greek and Latin". In M. Berti (ed.), Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution. De Gruyter Saur. Pp. 53–72. DOI: https://doi.org/10.1515/9783110599572-005
- Samuel J. Huskey. 2019. "The Digital Latin Library: Cataloging and Publishing Critical Editions of Latin Texts." In M. Berti (ed.), Digital Classical Philology. De Gruyter Saur. Pp. 19–33. DOI: https://doi.org/10.1515/9783110599572-003
Further readings
- Andrew Hardie. 2014. "Modest XML for Corpora: Not a standard, but a suggestion." ICAME Journal 38. DOI: https://doi.org/10.2478/icame-2014-0004
- B. McGillivray & A. Vatri. 2018. "The Diorisis Ancient Greek Corpus." Research Data Journal for the Humanities and Social Sciences 3.1, 55–65. DOI: https://doi.org/10.1163/24523666-01000013
- Leonard Muellner. 2019. "The Free First Thousand Years of Greek". In M. Berti (ed.), Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution. De Gruyter Saur. Pp. 7–18. DOI: https://doi.org/10.1515/9783110599572-002
- Bruce Robertson. 2019. "Optical Character Recognition for Classical Philology." In M. Berti (ed.), Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution. De Gruyter Saur. Pp. 117–136. DOI: https://doi.org/10.1515/9783110599572-008
Resources
- PerseusDL
- The Perseus Catalog
- Scaife Viewer Library
- Digital Latin Library
- Vicifons (Latin Wikisource); Βικιθήκη (Ancient Greek texts in Wikisource)
- Gutenberg (Latin) and Gutenberg (Ancient Greek)
- Latin texts in HathiTrust
- Latin e-books in Internet Archive
- Diorisis Ancient Greek Corpus
- LACE (OCRed Greek texts)
- Regex tutorial from RegexOne
- Understanding Regular Expressions (at Programming Historian)
- Regex cheatsheet or Quick start
Exercise
For this exercise, you will need a text editor that handles Regular Expressions, for example the free Atom or Visual Studio Code editors, or Sublime Text with a free trial period.
- Work through a Regular Expressions tutorial (e.g. RegexOne; Programming Historian) until you understand the basic syntax.
- Find a Greek or Latin text that interests you in Vicifons, Βικιθήκη, Gutenberg (Latin), or Gutenberg (Ancient Greek). Copy the plain text into a blank window in your text editor.
- Use Regex to remove any non-text artefacts (line or chapter numbers, notes, annotations, etc.) from the text.
- At each step, make a careful note of what you have done, and think about what features you may inadvertently lose through this process. (E.g. numbers that are part of the text; bracketed passages that are not annotations.)
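The same clean-up can be scripted instead of done in an editor's search-and-replace box. The following is a minimal sketch in Python, assuming a made-up sample text and made-up artefact patterns (leading line numbers, a `[Footnote: ...]` line, and `[1]`-style note markers). The patterns in a real text will differ and need to be checked by eye. Each step records what it removed, which serves as the "careful note" asked for above.

```python
import re

# Hypothetical sample of plain text copied from an online edition.
raw = """1 Gallia est omnis divisa in partes tres, [1]
2 quarum unam incolunt Belgae,
[Footnote: an editorial remark]
3 aliam Aquitani."""

# Each step is a (label, pattern) pair; (?m) makes ^ match at every line start.
STEPS = [
    ("leading line numbers", r"(?m)^\d+[ \t]+"),
    ("footnote lines", r"(?m)^\[Footnote:.*\]\n?"),
    ("note markers like [1]", r"[ \t]*\[\d+\]"),
]


def clean(text):
    """Apply each step in order and log exactly what every step removed."""
    log = []
    for label, pattern in STEPS:
        removed = re.findall(pattern, text)  # record matches before deleting them
        text = re.sub(pattern, "", text)
        log.append((label, removed))
    return text, log


cleaned, log = clean(raw)
print(cleaned)
for label, removed in log:
    print(f"{label}: removed {len(removed)} match(es): {removed}")
```

Note that the first pattern would also delete a genuine number at the start of a line, and the third would delete any bracketed number that belongs to the text. Reading through the log is one way to catch such losses.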
Optional exercise
- Download a linguistically annotated text, either from the Diorisis corpus, or from the Perseus Ancient Greek and Latin Treebank (e.g. Aesop 1–8 (direct link; right-click to download)).
- Using the advanced Regex methods you have learnt, (a) create a list of only the lemmas in this text, and save as a new file; (b) create a list of only the original forms in this text, and save as a new file.
- Is there any other information you might have lost from these files? Can you think of a way to retain punctuation, for example? Could you get rid of those multiple line breaks between each word? (Or only some of them, as otherwise your text has become one long paragraph!)
- Keep a careful note of all processes, and be prepared to report your results back to class.
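One possible approach to the optional exercise, sketched in Python: the fragment below is invented for illustration, and it assumes that each token is a `<word .../>` element carrying `form` and `lemma` attributes. Check the file you actually downloaded, since the Diorisis and treebank files may be structured differently. The helper names `attr` and `WORD_TAG` are likewise hypothetical.

```python
import re

# Hypothetical fragment; the element and attribute names are assumptions
# to be checked against the real file.
sample = '''<sentence id="1">
  <word id="1" form="ἀετὸς" lemma="ἀετός" postag="n-s---mn-"/>
  <word id="2" form="καὶ" lemma="καί" postag="c--------"/>
  <word id="3" form="ἀλώπηξ" lemma="ἀλώπηξ" postag="n-s---fn-"/>
  <word id="4" form="." lemma="." postag="u--------"/>
</sentence>'''

WORD_TAG = re.compile(r"<word\b[^>]*>")


def attr(tag, name):
    """Return the value of one attribute in a tag, or None if it is absent."""
    match = re.search(rf'\b{name}="([^"]*)"', tag)
    return match.group(1) if match else None


tags = WORD_TAG.findall(sample)
forms = [attr(tag, "form") for tag in tags]    # (b) the original forms
lemmas = [attr(tag, "lemma") for tag in tags]  # (a) the lemmas

# Rebuild running text: join forms with spaces, then pull punctuation
# back onto the preceding word instead of leaving it as a separate token.
text = re.sub(r"\s+([.,;:·])", r"\1", " ".join(forms))

print("\n".join(lemmas))
print(text)
```

Looking up each attribute separately means the result does not depend on the order in which attributes appear inside the tag. Joining with `"\n"` gives one item per line and joining with `" "` gives running text, which addresses the question about line breaks. Each list can then be saved to its own new file.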