MT Archive Ingestion Checklist - acl-org/acl-anthology GitHub Wiki

So you want to help ingest the many conferences in the MT Archive. Wonderful!

Background

The MT Archive is a large collection of papers in machine translation reaching back to the beginning of the field. The site was created and maintained by John Hutchins until around 2015 as a wonderful service to the community. The one downside is that he built all of the resources by hand, using Microsoft Word to make documents and then exporting them to HTML.

Through a collaboration with the EAMT, the Anthology has undertaken the two-step task of (a) digitizing the resources in the Archive and (b) ingesting them into the Anthology. We have mostly completed step (a), and are now hoping to crowdsource step (b), ingesting conferences one by one. This document will tell you how to do that.

Important resources

We have converted the MT Archive into TSV files, roughly organized by venue and year:

https://github.com/mardub1635/mt-archive

The master list of conferences is kept in a Google spreadsheet here:

https://docs.google.com/spreadsheets/d/1fpxmdV_BPwR6BQHyU9VJQxXeSOmy4__5nQCHBEviyAw/edit?usp=sharing

Basic task

Note that many volumes are special cases that require editorial judgment, partly because the MT Archive was created and managed by hand rather than generated from a database. Other volumes are simple and straightforward, and for those the following ingestion process should work.

  1. Pick one of the venues in the mt-archive repo, under data/. For example, let's do 1994.amta.

  2. Find the corresponding page in the MT Archive. You can do this by (a) following the link in the Conference List spreadsheet or (b) poking around the MT Archive site. For 1994.amta, that brings us here.

  3. Verify that there is a bijection between the titles on the webpage and in the TSV file: every title on the webpage should appear exactly once in the TSV file, and vice versa.

  4. Check name spellings and so on. Names with two parts (e.g., "Matt Post") can stay as they are. Names with more parts should be manually split into last and first names (e.g., "Van Durme, Benjamin"). There are many typos, so please take some time to find them. Please submit any corrections as a PR against the mt-archive repo.

  5. Run the ingestion script (you need a copy of the acl-anthology repo).

    acl-anthology/bin/ingest_tsv.py mt-archive/data/amta/1994.amta.tsv mt-archive/conference-list.tsv
    

    (The conference-list.tsv file is the Google spreadsheet above, exported to TSV.)

This will do two things: (a) create a file acl-anthology/data/xml/1994.amta.xml and (b) download and copy the PDFs to ~/anthology-files/pdf/amta/*.

  6. Add the XML file to the Anthology repo, and create a PR against our master branch. Tarball up the PDF files, and add them to the PR.
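
The manual checks in steps 3 and 4 can be partly mechanized. The following is a minimal sketch, not part of the Anthology tooling: the function names are hypothetical, and it assumes you have already pulled the titles and author names out of the TSV file and the webpage into plain Python lists (the TSV column layout is not described here, so no parsing is attempted). It reports titles that appear on only one side and flags names with more than two parts that still need to be split by hand.

```python
import re
from collections import Counter


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace runs to single spaces."""
    return re.sub(r"[\W_]+", " ", title.lower()).strip()


def title_mismatches(tsv_titles, web_titles):
    """Compare two title lists as multisets of normalized titles.

    Returns (only_in_tsv, only_on_web). Both lists are empty exactly when
    the normalized titles correspond one-to-one, duplicates included.
    """
    tsv_counts = Counter(normalize_title(t) for t in tsv_titles)
    web_counts = Counter(normalize_title(t) for t in web_titles)
    only_in_tsv = sorted((tsv_counts - web_counts).elements())
    only_on_web = sorted((web_counts - tsv_counts).elements())
    return only_in_tsv, only_on_web


def needs_manual_split(name):
    """True if a name has more than two parts and is not yet in "Last, First" form."""
    if "," in name:
        return False
    return len(name.split()) > 2


if __name__ == "__main__":
    # Illustrative data only; real titles and names come from the TSV file and the webpage.
    tsv = ["A Sample Title", "Another Paper"]
    web = ["A sample title.", "A Third Paper"]
    print(title_mismatches(tsv, web))
    for name in ["Matt Post", "Benjamin Van Durme", "Van Durme, Benjamin"]:
        print(name, needs_manual_split(name))
```

A check like this only narrows down where to look. Typos in titles show up as mismatches on both sides, and the decision about how to split a multi-part name still has to be made by a person.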

Complications

There are many complications. If you encounter one of these, it will likely have to be handled by the Anthology director, so please move on to a simpler conference.

  • A conference with attached workshops
  • Conferences with multiple volumes

Priority

Our first priority is the EAMT, AMTA, and MT Summit conferences. We prefer to ingest them chronologically.