2 Preparing Texts - SunoikisisDC/SunoikisisDC-2024-2025 GitHub Wiki

Preparing texts and data cleaning

SunoikisisDC Digital Classics: Session 2

Date: Thursday January 23, 2025. 16:00-17:30 GMT.

Convenors: Jonathan Blaney (Cambridge University), Gabriel Bodard (University of London), Katharine Shields (King's College London)

YouTube link: https://youtu.be/Or-SaNznWz0

Slides: tba

Outline

This session follows on from the preceding one on sources of open texts, with a discussion of the importance of text cleaning and data preparation to any digital analysis or other process. We consider different processes that are more or less tolerant of "messy" texts, including poor OCR and similar artefacts, and highlight the importance of including a realistic level of text preparation in your project planning and budget. We look at a few options for cleaning and repairing large quantities of text, before offering a simple tutorial on regular expressions, which can be used to remove repetitive and predictable unwanted features from one text or many, including at massive scale.

Required readings

  • In lieu of required readings this week only, please work through the following tutorial in advance of the class. It will stand you in good stead for the exercise and discussion.

Further readings

Resources

Exercise

For this exercise, you will need a text editor that handles Regular Expressions, for example Visual Studio Code, Sublime Text, Notepad++, etc.

  1. Work through a Regular Expressions tutorial (e.g. RegexOne; Programming Historian) until you understand the basic syntax.
  2. Find a Greek or Latin text that interests you in Vicifons, Βικιθήκη, Oxford Text Archive (Latin), Gutenberg (Latin) or Gutenberg (Ancient Greek). Copy the plain text into a blank window in your text editor.
  3. Use Regex to remove any non-text artefacts (line or chapter numbers, notes, annotations, etc.) from the text.
  4. At each step, make a careful note of what you have done, and think about what features you may inadvertently lose through this process. (E.g. numbers that are part of the text; bracketed passages that are not annotations.)
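The exercise is meant to be done in a text editor, but the same steps can be sketched in Python as a way of seeing what each pattern does and keeping the "careful note" automatically. This is a minimal sketch, not a recommended workflow: the three rules, the `RULES` list and the `clean` function are illustrative assumptions, and the sample is the opening of the Aeneid with line numbers and a footnote marker added by hand. Real texts will need different patterns.

```python
import re

# Each rule is (description, pattern, replacement). All three are illustrative
# assumptions; adapt them to the artefacts you actually see in your text.
RULES = [
    ("leading line numbers", r"^\d+[ \t]+", ""),
    ("numeric footnote markers like [1]", r"[ \t]*\[\d+\]", ""),
    ("trailing whitespace", r"[ \t]+$", ""),
]


def clean(text):
    """Apply each rule in order; return the cleaned text and a log of counts."""
    log = []
    for description, pattern, replacement in RULES:
        # re.subn returns the new string and the number of substitutions made,
        # which gives a simple record of what each step removed.
        text, count = re.subn(pattern, replacement, text, flags=re.MULTILINE)
        log.append((description, count))
    return text, log


# Hand-made sample: Aeneid 1.1-3 with added line numbers and a footnote marker.
sample = (
    "1 Arma virumque cano, Troiae qui primus ab oris [1]\n"
    "2 Italiam fato profugus Laviniaque venit\n"
    "3 litora, multum ille et terris iactatus et alto\n"
)

cleaned, log = clean(sample)
print(cleaned)
for description, count in log:
    print(f"{description}: {count} removed")
```

Note that the first rule cannot tell a line number from a numeral that belongs to the text: a line such as `300 naves` would lose its `300`. Checking the counts in the log against what you expected to remove is one way to catch this kind of inadvertent loss.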

Optional exercise

  1. Download a linguistically annotated text, either from the Diorisis corpus, or from the Perseus Ancient Greek and Latin Treebank (e.g. Aesop 1–8 (direct link; right-click to download)).
  2. Using the advanced Regex methods you have learnt, (a) create a list of only the lemmas in this text, and save as a new file; (b) create a list of only the original forms in this text, and save as a new file.
  3. Is there any other information you might have lost from these files? Can you think of a way to retain punctuation, for example? Could you get rid of those multiple line breaks between each word? (Or only some of them, as otherwise your text has become one long paragraph!)
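As a rough illustration of the optional exercise, the sketch below uses capture groups to pull out attribute values. It assumes each token is a `<word>` element carrying `form` and `lemma` attributes, in the general style of treebank XML; the sample is hand-made and simplified, so check the attribute names and layout in the file you actually download (the Diorisis files may be structured differently). The helper name `collapse_blank_lines` and the punctuation set are likewise assumptions.

```python
import re

# Hand-made, simplified sample; real files carry more attributes per word.
sample = '''<sentence id="1">
<word id="1" form="ἀετὸς" lemma="ἀετός"/>
<word id="2" form="καὶ" lemma="καί"/>
<word id="3" form="ἀλώπηξ" lemma="ἀλώπηξ"/>
<word id="4" form="." lemma="."/>
</sentence>'''

# The parentheses capture only the attribute value; [^"]* stops at the closing quote.
# The leading \s prevents a match inside a longer attribute name.
FORM = re.compile(r'\sform="([^"]*)"')
LEMMA = re.compile(r'\slemma="([^"]*)"')

forms = FORM.findall(sample)    # original forms, in document order
lemmas = LEMMA.findall(sample)  # lemmas, in document order

# Punctuation survives as its own token, so a running text can be rebuilt by
# joining with spaces and then removing the space before each punctuation mark.
text = re.sub(r"\s+([.,;·])", r"\1", " ".join(forms))


def collapse_blank_lines(s):
    """Reduce any run of two or more line breaks to a single line break."""
    return re.sub(r"\n{2,}", "\n", s)


print(forms)
print(lemmas)
print(text)
```

Joining `lemmas` or `forms` with `"\n"` and saving the result as UTF-8 would give the one-item-per-line files the exercise asks for. What is lost here includes the sentence boundaries and every other attribute on each word, which is worth recording in your notes.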

Keep a careful note of all processes, and be prepared to report your results back to class.