6 Analysing Texts - SunoikisisDC/SunoikisisDC-2024-2025 GitHub Wiki
Analysing and visualising texts
SunoikisisDC Digital Classics: Session 6
Date: Thursday February 27, 2025. 16:00-17:30 GMT.
Convenors: Kaspar Beelen (University of London), Gabriel Bodard (University of London), Megan Bushnell (Oxford Text Archive)
Youtube link: https://youtu.be/xNihExxxOy0
Slides: Combined slides (PDF)
Outline
This session introduces the theory and practice of various digital methods for the exploration, analysis and visualisation of historical texts. We begin with a theoretical discussion of quantitative, stylistic and computational linguistic approaches to text analysis, defining terms, sketching some history of the discipline, and surveying the tools and codebases available. The second half of the session is a practical demonstration of the Voyant Tools reading and analysis environment, showing examples in English, Latin and Greek and some of the visualisation modules in Voyant. We end with a suggested exercise for students to take away and try in their own time, and a general discussion.
Required readings
- Hawkins, Laura F. 2018. “Computational Models for Analyzing Data Collected from Reconstructed Cuneiform Syllabaries”, Digital Humanities Quarterly 12.1. Available: http://digitalhumanities.org:8081/dhq/vol/12/1/000368/000368.html.
- Rodda, Martina Astrid, and Barbara McGillivray. 2024. “Computational Valency Lexica and Homeric Formularity.” Journal of Greek Linguistics 24.2. Pre-print: https://arxiv.org/abs/2208.10795.
Further readings
- Broadwell, Peter, Jack W. Chen, and David Shepard. 2019. "Reading the Quan Tang shi: Literary History, Topic Modeling, Divergence Measures." Digital Humanities Quarterly 13.4. Available: https://digitalhumanities.org/dhq/vol/13/4/000434/000434.html.
- Fendel, Victoria Beatrix, and Matthew T. Ireland. 2023. "Discourse cohesion in Xenophon’s On Horsemanship through Sketch Engine." Digital Humanities Quarterly 17.3. Available: https://digitalhumanities.org/dhq/vol/17/3/000683/000683.html.
- Field, Anjalie. 2016. "An Automated Approach to Syntax-based Analysis of Classical Latin." Digital Classics Online 2.3. Available: https://doi.org/10.11588/dco.2016.0.32315.
- Gorman, Robert. 2022. "Universal Dependencies and Author Attribution of Short Texts with Syntax Alone." Digital Humanities Quarterly 16.2. Available: https://digitalhumanities.org/dhq/vol/16/2/000606/000606.html.
- McGillivray, Barbara, Thierry Poibeau, and Pablo Ruiz Fabo. 2020. "Digital Humanities and Natural Language Processing: “Je t’aime... Moi non plus”." Digital Humanities Quarterly 14.2. Available: https://digitalhumanities.org/dhq/vol/14/2/000454/000454.html.
- Rockwell, Geoffrey, and Stéfan Sinclair. 2016. Hermeneutica: Computer-Assisted Interpretation in the Humanities. Cambridge, MA: MIT Press. Companion website: http://hermeneuti.ca/.
- Stover, Justin A., and Mike Kestemont. 2016. "The Authorship of the Historia Augusta: Two New Computational Studies." BICS 59.2, pp. 140–157. Available: https://doi.org/10.1111/j.2041-5370.2016.12043.
- Mahlberg, Michaela, and Catherine Smith. 2012. "Dickens, the suspended quotation and the corpus." Language and Literature: International Journal of Stylistics 21.1, pp. 51–65. Available: https://journals.sagepub.com/doi/pdf/10.1177/0963947011432058.
- Ribary, Marton, and Barbara McGillivray. 2020. "A Corpus Approach to Roman Law Based on Justinian's Digest." Informatics 7.4. Available: https://www.mdpi.com/2227-9709/7/4/44.
- Rodda, Martina Astrid. 2024. "Reconsidering the computer’s role in literary studies through Levison 1964." Bulletin of the Institute of Classical Studies 67.1 (June 2024), pp. 3–8. Available: https://doi.org/10.1093/bics/qbae013.
- Wynne, Martin (ed.). c. 2005. Developing Linguistic Corpora: A Guide to Good Practice. AHDS Guides to Good Practice. Oxford: Oxbow Books. Available: https://icar.cnrs.fr/ecole_thematique/contaci/documents/Baude/wynne.pdf.
- Brezina, Vaclav. 2018. Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press.
Resources
Exercise
Try it out for yourself now in Voyant Tools!
- First pick a text or group of texts to work with. Your text(s) must be digital and either in plain text, HTML, XML, PDF, RTF, or Word format. You might consider using the Diorisis Ancient Greek Corpus, or texts from the Oxford Text Archive (not from the OTA Legacy Collection!), or texts of your own. Voyant also has some texts available to load in. Keep in mind that distant reading tools like Voyant work best with corpora made up of many texts, so ideally pick several texts or an especially long, segmented one. Try to choose texts by the same author, or texts written in the same language in the same period, or texts of a similar type (poetry, prose, sermons, legal texts, etc.).
- Once you have selected your texts, download them, and then load them into Voyant. Try experimenting with the various analysis tools to answer the questions below:
- What are the most common words in your text(s)? Are they what you expected? If you provide stop words, do your results change?
- If you uploaded several texts (or a segmented text), do you notice any patterns in the distinctive words discovered for each text or segment?
- What are the differences between the Collocates, Links, and TermsBerry tools?
- Are there any repeated segments across your text(s)? Why do you think they might be repeated? If there are none, why might that be?
- Which is the most readable text (or segment)? How is this readability metric calculated?
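Voyant does its counting for you, but the first few questions are easier to reason about if you see the logic spelled out. The short Python sketch below (illustrative only; these function names are not Voyant's API, and Voyant's actual tokenisation and tools are more sophisticated) shows raw word frequencies, the effect of a stop-word list, and a simple collocate count, i.e. words occurring within a fixed window of a target term, roughly what the Terms and Collocates tools report:

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def top_words(tokens, n=5, stopwords=frozenset()):
    """Return the n most frequent tokens, skipping any stop words."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

def collocates(tokens, target, window=2):
    """Count words appearing within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

toks = tokenize("The cat sat on the mat and the cat saw the dog.")
print(top_words(toks, 3))                                   # function words dominate
print(top_words(toks, 3, stopwords={"the", "on", "and"}))   # content words surface
print(collocates(toks, "cat").most_common(3))
```

Even on this toy sentence, "the" tops the raw list; with a stop-word list applied, "cat" does, which is the pattern behind the first exercise question.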
- Now reflect on this process. Consider the questions below:
- What did distant reading tell you that close reading could not?
- Are there any types of texts that you think Voyant could not handle? Are there any research questions where Voyant might not be useful?
- Which tool or visualization did you personally find the most useful and why?
- Did you have any difficulties using any of the tools? Were you able to determine why?
- Do you feel equipped to understand the metrics in Voyant Tools?