Ingestion Checklist - acl-org/acl-anthology GitHub Wiki

Instructions

The following checklist should be used when ingesting new volumes. If you have been ingesting for some time, you may be tempted to skip some of these steps. Don't! I suggest copying this file to one named "checklist.md" and placing it in the ingestion directory. This way, you can show for posterity that you have gone through these steps.

Checklist (for every volume)

  1. Make sure the branch is merged with the latest master branch
  2. Ensure that there are editors listed in the <meta> block
  3. If it's a workshop, add a <venue>ws</venue> tag
  4. Add events to their relevant SIGs
  5. Look at the venue listing for prior years, and ensure that the new volume titles are consistent. You can do this by clicking on the venue name from a paper page, which will take you to the venue listing.
  6. Navigate to the event page preview (e.g., https://preview.aclanthology.org/icnlsp-ingestion/events/icnlsp-2021/) and page through it to see if there are any glaring mistakes
  7. Skim through the complete listing, looking for mis-parsed author names.
  8. Download the frontmatter and verify that the table of contents matches at least three randomly-selected papers
  9. Download 3–5 PDFs (including the first and last one) and make sure they are correct (title, authors, page numbers).
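
Items 2 and 3 can be partly automated. The following is a minimal sketch, not part of the Anthology repo: it assumes the generated XML nests each `<meta>` block directly under a `<volume>` element, and the function name `find_meta_problems` and the sample collection are hypothetical. It only flags candidates for a human to review.

```python
import xml.etree.ElementTree as ET


def find_meta_problems(xml_text):
    """Return human-readable notes about each volume's <meta> block.

    Assumes (hypothetically) a layout of <collection>/<volume>/<meta>.
    """
    root = ET.fromstring(xml_text)
    problems = []
    for volume in root.iter("volume"):
        vol_id = volume.get("id", "?")
        meta = volume.find("meta")
        if meta is None:
            problems.append(f"volume {vol_id}: no <meta> block")
            continue
        # Checklist item 2: editors should be listed in <meta>.
        if meta.find("editor") is None:
            problems.append(f"volume {vol_id}: no <editor> in <meta>")
        # Checklist item 3: workshops need <venue>ws</venue>.
        if meta.find("venue") is None:
            problems.append(
                f"volume {vol_id}: no <venue> in <meta> (add ws if a workshop)"
            )
    return problems


# Made-up sample data for illustration only.
SAMPLE = """<collection id="2021.example">
  <volume id="1">
    <meta>
      <booktitle>Proceedings of the Example Workshop</booktitle>
      <venue>ws</venue>
    </meta>
  </volume>
</collection>"""

if __name__ == "__main__":
    for problem in find_meta_problems(SAMPLE):
        print(problem)
```

A clean run prints nothing; anything it prints still needs to be checked by hand against the volume.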

Process

This section contains technical details for the ingestion process. The checklist above should be used after this is done.

  1. Clone and build the ACL Anthology GitHub repo, and set up a data directory (e.g., $DATA)
  2. Download the ingestion data zip file DATA.zip to $DATA
  3. Unpack DATA.zip in $DATA, create a date-venue folder in the Dropbox ingest dir, and upload the files
  4. In the ACL Anthology repo, check out a new branch with git checkout -b YOUR_BRANCH_NAME
  5. Run the ingestion command python bin/ingest.py --ingest-date 2020-04-19 PATH/TO/DATA/data/*/proceedings
  6. Run python bin/write_bibkeys_to_xml.py -c to write the bibkeys back into the newly generated XML file
  7. Run git diff data/yaml/venues.yaml to check the changes to venues.yaml. Specifically, remove any ordinal numbers from venue names (e.g., "The First", "32nd")
  8. Update data/yaml/joint.yaml when needed
    • This information can be found in the newly generated .xml files. Normally, tutorials, SRW, etc. are included automatically because they share the same collection ID (e.g., 2021-eacl). Workshops that have different collection IDs are not included automatically
    • Make sure to update the collection-volume IDs, not just the collection IDs
  9. Check meta files in $DATA and modify data/yaml/sigs/sig files when needed
  10. Check all newly generated .xml files
    • Check that editor names are split correctly, and spot-check a few authors
    • Volume name should usually be "1" if there is just a single volume. That's the convention. If there are other volumes, then they can use names
    • The volume name is determined by data in the file proceedings/meta so you could also look at that ahead of time
    • Make sure location, year, etc are reasonable
  11. Run command make check and make sure all tests pass
  12. Run command git add ABOVE_NEWLY_GENERATED_FILES
  13. Run command git commit -m "YOUR_MESSAGE" to commit your changes
  14. Run command git push origin YOUR_BRANCH_NAME to push your changes
  15. Go to GitHub, open a new pull request, assign acl-org/anthology as reviewers, and choose ingestion under labels
  16. From ~/anthology-files, upload all generated attachments and PDFs by running, e.g., rsync -ave ssh pdf anth:anthology-files
  17. Clean out dir ~/anthology-files
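
The ordinal check in step 7 is easy to miss in a long diff. Below is a minimal sketch of a helper that flags likely ordinals in a venue name; it is not part of the repo, the name `find_ordinals` and the word list are assumptions, and it works on plain strings rather than parsing venues.yaml. Expect false positives (for example, "Second" in "Second Language Acquisition"), so treat hits as prompts for review, not automatic edits.

```python
import re

# Spelled-out ordinals to look for; extend as needed (hypothetical list).
ORDINAL_WORDS = "first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth"

# Matches digit ordinals such as "32nd" (or the malformed "32th")
# and the spelled-out words above, case-insensitively.
ORDINAL_RE = re.compile(
    r"\b(?:\d+(?:st|nd|rd|th)|" + ORDINAL_WORDS + r")\b",
    re.IGNORECASE,
)


def find_ordinals(name):
    """Return every ordinal-looking token found in a venue name."""
    return [match.group(0) for match in ORDINAL_RE.finditer(name)]


if __name__ == "__main__":
    # Made-up venue names for illustration only.
    for name in [
        "The First Workshop on Examples",
        "32nd Example Conference",
        "Workshop on Examples",
    ]:
        print(name, "->", find_ordinals(name))
```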

CL and TACL ingestion

CL and TACL ingestion follows a different set of steps:

  1. Connect to MIT Press
  2. Download all new files
  3. Ingest with ingest_mitpress.py