Ingestion Checklist - acl-org/acl-anthology GitHub Wiki
The following is a checklist that should be used when ingesting new volumes. If you have been ingesting for some time, you may be tempted to skip some of these. Don't! I suggest copying this file to one named "checklist.md" and placing it in the ingestion directory. This way, you have a record for posterity that you have gone through these steps.
- Make sure the branch is merged with the latest `master` branch
- Ensure that there are editors listed in the `<meta>` block
- If it's a workshop, add a `<venue>ws</venue>` tag
- Add events to their relevant SIGs
- Look at the venue listing for prior years, and ensure that the new volume titles are consistent. You can do this by clicking on the venue name from a paper page, which will take you to the venue listing.
- Navigate to the event page preview (e.g., https://preview.aclanthology.org/icnlsp-ingestion/events/icnlsp-2021/) and page through it to see if there are any glaring mistakes
- Skim through the complete listing, looking for mis-parsed author names.
- Download the frontmatter and verify that the table of contents matches at least three randomly-selected papers
- Download 3–5 PDFs (including the first and last one) and make sure they are correct (title, authors, page numbers).
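A couple of the checks above (editors present in `<meta>`, the `ws` venue tag for workshops, and picking which papers to spot-check) can be partly automated. The following is only a minimal sketch: the XML layout in `SAMPLE_XML` is a simplified assumption rather than the authoritative Anthology schema, and the helper names `check_volume` and `papers_to_spot_check` are hypothetical, not part of the repository.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical, simplified volume XML; the real Anthology files may differ.
SAMPLE_XML = """
<collection id="2021.example">
  <volume id="1">
    <meta>
      <booktitle>Proceedings of the Example Workshop</booktitle>
      <editor><first>Ada</first><last>Lovelace</last></editor>
      <venue>ws</venue>
    </meta>
    <paper id="1"><title>First paper</title></paper>
    <paper id="2"><title>Second paper</title></paper>
    <paper id="3"><title>Third paper</title></paper>
    <paper id="4"><title>Fourth paper</title></paper>
    <paper id="5"><title>Fifth paper</title></paper>
  </volume>
</collection>
"""


def check_volume(volume, is_workshop=False):
    """Return a list of human-readable problems found in one <volume>."""
    meta = volume.find("meta")
    if meta is None:
        return ["missing <meta> block"]
    problems = []
    if not meta.findall("editor"):
        problems.append("no editors listed in <meta>")
    venues = [v.text for v in meta.findall("venue")]
    if is_workshop and "ws" not in venues:
        problems.append("workshop without <venue>ws</venue>")
    return problems


def papers_to_spot_check(volume, extra=2, seed=None):
    """Pick the first and last paper plus up to `extra` random ones in between."""
    ids = [p.get("id") for p in volume.findall("paper")]
    if len(ids) <= 2:
        return ids
    rng = random.Random(seed)
    middle = rng.sample(ids[1:-1], min(extra, len(ids) - 2))
    chosen = {ids[0], ids[-1], *middle}
    # Preserve the original paper order in the result.
    return [i for i in ids if i in chosen]


root = ET.fromstring(SAMPLE_XML)
for volume in root.findall("volume"):
    print("volume", volume.get("id"), "problems:", check_volume(volume, is_workshop=True))
    print("spot-check papers:", papers_to_spot_check(volume, seed=0))
```

This does not replace looking at the preview site and the PDFs yourself; it only helps you avoid skipping the mechanical checks.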
This section contains technical details for the ingestion process. The checklist above should be used after this is done.
- Clone and build the ACL Anthology GitHub repo, and set up a data directory, e.g. $DATA
- Download the ingestion data zip file DATA.zip to $DATA
- Unpack DATA.zip in $DATA, create a date-venue folder in the Dropbox ingest directory, and upload the files there
- Check out a new branch with `git checkout -b YOUR_BRANCH_NAME` in the ACL Anthology repo
- Run the ingestion command `python bin/ingest.py --ingest-date 2020-04-19 PATH/TO/DATA/data/*/proceedings`
- Run `python bin/write_bibkeys_to_xml.py -c` to back-ingest the bibkeys for the newly generated XML file
- Run `git diff data/yaml/venues.yaml` to check the file `venues.yaml`. Specifically, remove all ordinal numbering from venue names, e.g. "The First", "32nd"
- Update `data/yaml/joint.yaml` when needed
  - Such information can be found in the newly generated .xml files. Normally, tutorials, SRW, etc. are included automatically because they share the same collection ID (e.g. 2021-eacl); what is not included are the workshops that have different collection IDs
  - Make sure to update the collection-volume IDs, not just the collection IDs
- Check the meta files in $DATA and modify the SIG files under `data/yaml/sigs/` when needed
- Check all newly generated .xml files
  - Check that editor names are split correctly, and spot-check a few authors
  - The volume name should usually be "1" if there is just a single volume; that is the convention. If there are other volumes, they can use names
  - The volume name is determined by the data in the file proceedings/meta, so you can also look at that ahead of time
  - Make sure the location, year, etc. are reasonable
- Run `make check` and make sure all tests pass
- Run `git add ABOVE_NEWLY_GENERATED_FILES`
- Run `git commit -m "YOUR_MESSAGE"` to commit your changes
- Run `git push origin YOUR_BRANCH_NAME` to push your changes
- On GitHub, open a new pull request, assign reviewers to `acl-org/anthology`, and choose `ingestion` under `labels`
- Under dir `~/anthology-files`, upload all generated attachments and PDFs by running e.g. `rsync -ave ssh pdf anth:anthology-files`
- Clean out dir `~/anthology-files`
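For the `venues.yaml` step, removing the numbering from venue names by hand is easy to get wrong. Here is a rough sketch of how you might flag and strip a leading ordinal. The function names and the word list are my own assumptions for illustration; nothing like this is claimed to exist in `bin/`, and it only handles an ordinal at the very start of the name, so always review the result in `git diff`.

```python
import re

# Spelled-out ordinals to recognise; extend as needed (assumed list).
ORDINAL_WORDS = "first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth"

# Optional leading "The", then either a numeric ordinal (including malformed
# ones such as "32th") or a spelled-out ordinal, followed by whitespace.
ORDINAL_RE = re.compile(
    rf"^(?:the\s+)?(?:\d+(?:st|nd|rd|th)|{ORDINAL_WORDS})\s+",
    re.IGNORECASE,
)


def strip_leading_ordinal(name):
    """Return the venue name without a leading ordinal such as 'The First'."""
    return ORDINAL_RE.sub("", name.strip(), count=1)


def flag_numbered(names):
    """Return (original, cleaned) pairs for names that would change."""
    return [(n, strip_leading_ordinal(n)) for n in names if strip_leading_ordinal(n) != n.strip()]


# Made-up venue names, for illustration only.
examples = [
    "The First Workshop on Example Topics",
    "32th Conference on Examples",
    "Workshop on Example Topics",
]
for original, cleaned in flag_numbered(examples):
    print(f"{original!r} -> {cleaned!r}")
```

Names that merely start with a word like "First-Order" are left alone, because the pattern requires whitespace right after the ordinal.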
There are several different steps for CL and TACL ingestion:
- Connect to MIT press
- Download all new files
- Ingest with `ingest_mitpress.py`