2 Preparing Texts - SunoikisisDC/SunoikisisDC-2024-2025 GitHub Wiki

Preparing texts and data cleaning

SunoikisisDC Digital Classics: Session 2

Date: Thursday January 23, 2025. 16:00-17:30 GMT.

Convenors: Jonathan Blaney (Cambridge University), Gabriel Bodard (University of London), Katharine Shields (King's College London)

YouTube link: https://youtu.be/Or-SaNznWz0

Slides: Combined slides (PDF)

Outline

This session follows on from the preceding one on sources of open texts, with a discussion of the importance of text cleaning and data preparation to any digital analysis or other process. We consider different processes that are more or less tolerant of "messy" texts, including poor OCR and similar artefacts, and highlight the importance of including a realistic level of text preparation in your project planning and budget. We look at a few options for cleaning and repairing large quantities of text, before offering a simple tutorial on regular expressions, which can be used to remove repetitive and predictable unwanted features across one or multiple texts, including at massive scale.

Required readings

In lieu of required readings this week only, please work through the following tutorial in advance of the class. It will stand you in good stead for the exercise and discussion.
- Understanding Regular Expressions (at Programming Historian)

Resources

Brief intro to Regular Expressions at Wikipedia
Regex tutorial from RegexOne
Regex cheatsheet or Quick start
Regex 101 Tester
Regexr Regex Tester

Exercise

For this exercise, you will need a text editor that handles Regular Expressions, for example Visual Studio Code, Sublime Text, Notepad++, etc.

Work through a Regular Expressions tutorial (e.g. RegexOne; Programming Historian) until you understand the basic syntax.
Find a Greek or Latin text that interests you in Vicifons, Βικιθήκη, Oxford Text Archive (Latin), Gutenberg (Latin) or Gutenberg (Ancient Greek). Copy the plain text into a blank window in your text editor.
Use Regex to remove any non-text artefacts (line or chapter numbers, notes, annotations, etc.) from the text.
At each step, make a careful note of what you have done, and think about what features you may inadvertently lose through this process. (E.g. numbers that are part of the text; bracketed passages that are not annotations.)
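For those who prefer to script the clean-up rather than use an editor's find-and-replace, the following is a minimal Python sketch of the same workflow. The sample passage, the artefact conventions (chapter numbers in square brackets, line numbers at the start of a line, notes in curly braces) and the names `raw`, `steps` and `clean` are all assumptions made for illustration; a real text from any of the sources above will need its own patterns.

```python
import re

# Hypothetical sample text with three kinds of artefact: a chapter number in
# square brackets, a line number at the start of a line, and a note in braces.
raw = """[1] Gallia est omnis divisa in partes tres,
5 quarum unam incolunt Belgae, {note: cf. Strabo} aliam Aquitani,
tertia qui ipsorum lingua Celtae, nostra Galli appellantur."""

# Each step is (description, pattern, replacement), so the list doubles as a
# record of what was done and in which order.
steps = [
    ("chapter numbers in square brackets", r"\[\d+\]\s*", ""),
    ("line numbers at the start of a line", r"^\d+[ \t]+", ""),
    ("notes in curly braces", r"\s*\{[^}]*\}", ""),
]


def clean(text):
    """Apply every step in order and log how many matches each one removed."""
    log = []
    for label, pattern, replacement in steps:
        # re.MULTILINE makes ^ match at the start of every line, not just the text.
        text, count = re.subn(pattern, replacement, text, flags=re.MULTILINE)
        log.append((label, pattern, count))
    return text, log


cleaned, log = clean(raw)
print(cleaned)
for label, pattern, count in log:
    print(f"{count} x {label}: {pattern}")
```

Logging the number of matches per step is one way of keeping the careful note the exercise asks for. It also helps to spot over-matching: the second pattern, for instance, would delete a numeral that genuinely belongs to the text if it happened to begin a line.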

Optional exercise #1

This exercise follows on from the demonstration on cleaning bibliographic records. The files used in the session are in this folder.

Using VS Code, a text editor of your choice, or Python, edit the file sample-bibliography.bib to:

Delete lines containing fields not always present in all the records (such as lccn and edition).
Clean up the ISSNs to remove any hyphens (note that you don't necessarily have to do this in one pass).
Reorder the author fields so they are firstname(s) surname rather than surname, firstname(s), e.g. Plotnick, Rachel becomes Rachel Plotnick.
Now try to recreate the original order using regular expressions. Are there any pitfalls with this?
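If you take the Python route, the first three steps might look something like the sketch below. The record is hypothetical: the layout of the real sample-bibliography.bib may differ (spacing, quoting with `"` instead of braces, several authors per field), and the title, ISSN and lccn values here are invented placeholders, so treat the patterns as a starting point rather than a solution.

```python
import re

# Hypothetical BibTeX record; field values other than the author are placeholders.
record = """@article{plotnick2012,
  author = {Plotnick, Rachel},
  title = {An Example Title},
  edition = {1},
  issn = {1234-5678},
  lccn = {00000000},
}"""

# 1. Delete whole lines holding the lccn or edition fields.
cleaned = re.sub(r"^[ \t]*(?:lccn|edition)[ \t]*=.*\n", "", record, flags=re.MULTILINE)

# 2. Remove the hyphen from the ISSN, touching only the issn field.
cleaned = re.sub(r"(issn\s*=\s*\{\d{4})-(\d{3}[\dXx]\})", r"\1\2", cleaned)

# 3. Swap "Surname, Firstname(s)" to "Firstname(s) Surname".
#    This assumes exactly one author per field.
cleaned = re.sub(
    r"(author\s*=\s*\{)([^,{}]+),\s*([^{}]+)(\})",
    r"\1\3 \2\4",
    cleaned,
)

print(cleaned)
```

The single-author assumption in step 3 is exactly the kind of pitfall the last question points at: once the comma is gone, nothing in "Rachel Plotnick" marks where the forenames end and the surname begins, so compound surnames and multi-author fields cannot be reliably restored.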

If you want to use Python but don't have it installed, there are online, Python-specific regex testers, such as https://pythex.org/. You can also run Python in various online environments, such as Google Colab (requires a Google account).

If you want to install Python on your laptop, you need to install Python 3. There are two main options, the first being a bit easier:

Install Anaconda for your operating system. The advantage here is that you get Jupyter Notebook as part of Anaconda. Note that Anaconda becomes your main way of installing Python libraries.
Install Python directly from the official Python website. You will then need to install Jupyter Notebook separately. With this version of Python you install libraries using the pip command, which in this case would be pip3 install notebook, run from the command line. If you're not familiar with the command line, and don't want to learn it, the Anaconda option above allows you to avoid it. There are many pip tutorials; if it seems complicated... it is.

Optional exercise #2

Download a linguistically annotated text, either from the Diorisis corpus, or from the Perseus Ancient Greek and Latin Treebank (e.g. Aesop 1–8 (direct link; right-click to download)).
Using the advanced Regex methods you have learnt, (a) create a list of only the lemmas in this text, and save as a new file; (b) create a list of only the original forms in this text, and save as a new file.
Is there any other information you might have lost from these files? Can you think of a way to retain punctuation, for example? Could you get rid of those multiple line breaks between each word? (Or only some of them, as otherwise your text has become one long paragraph!)
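As a rough illustration of parts (a) and (b), here is a short Python sketch. It assumes that each token in the treebank file is a `<word>` element carrying `form` and `lemma` attributes; check this against the file you actually download, since the sample below is an invented fragment with most attributes omitted, and the Diorisis files are structured differently.

```python
import re

# Invented fragment for illustration; real files carry further attributes
# (morphology, syntactic relations) that are omitted here.
sample = """<sentence id="1">
  <word id="1" form="homines" lemma="homo"/>
  <word id="2" form="dicunt" lemma="dico"/>
  <word id="3" form="." lemma="."/>
</sentence>"""

# Capture the value of one attribute from every <word ...> tag.
forms = re.findall(r'<word\b[^>]*\bform="([^"]*)"', sample)
lemmas = re.findall(r'<word\b[^>]*\blemma="([^"]*)"', sample)

# Rebuild running text: join tokens with spaces, then close up the space
# that the join leaves before punctuation marks.
text = re.sub(r"\s+([.,;:·])", r"\1", " ".join(forms))

print(forms)
print(lemmas)
print(text)
```

Joining the captured forms with spaces instead of newlines addresses the line-break question, and because punctuation is stored as tokens of its own it survives the extraction. Sentence boundaries are lost unless you also capture the `<sentence>` elements.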

Keep a careful note of all processes, and be prepared to report your results back to class.