Processing attachments and corrections - acl-org/acl-anthology GitHub Wiki

This page describes the process of handling attachments, corrections, and errata. This is a time-consuming task that is ideally done at least quarterly.

Attachments

  1. Create an ingestion folder in the shared Dropbox folder, e.g., Dropbox/Anthology/ingests/2019/2019-12-18-attachments.

  2. Download the Microsoft form data used to gather the attachments. Use Microsoft Excel or a compatible program to export it to a UTF-8-encoded CSV file. Save this file as attachments.csv.

  3. Process the attachments using the script:

    acl-anthology/bin/add_attachments.py attachments.csv
    

    This goes through each attachment, downloads it, does some minimal verification, and logs everything to add_attachments.log. For each successful attachment, it edits the XML in the Anthology repo and puts the file in a local mirror of the Anthology attachments, under ~/anthology-files/attachments/. For each failed attachment, it creates a file add_attachments.log.$ANTH_ID.txt containing an email that you can send manually to the submitter (the first line is the email address, the second is the subject, and the rest is the body).

  4. Manually check the repo changes, commit them, and create a PR.

  5. Sync the locally mirrored files to the Anthology:

    cd ~/anthology-files
    rsync -azve ssh attachments/ aclweb:anthology-files/attachments/
    

    Here, aclweb is an SSH alias for the Anthology host.
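
The failure files described in step 3 have a simple layout: email address on the first line, subject on the second, body after that. The following is a minimal Python sketch of reading one, assuming the file is UTF-8 text laid out exactly as described. The names `FailureEmail` and `parse_failure_email` are hypothetical and are not part of the Anthology scripts.

```python
from typing import NamedTuple


class FailureEmail(NamedTuple):
    """Hypothetical container for one failure notice."""
    to: str
    subject: str
    body: str


def parse_failure_email(text: str) -> FailureEmail:
    """Split a failure file into recipient, subject, and body.

    Assumes the layout described above: line 1 is the email address,
    line 2 is the subject, and everything after that is the body.
    """
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least an email line and a subject line")
    to = lines[0].strip()
    subject = lines[1].strip()
    # Drop blank lines that may separate the subject from the body.
    body = "\n".join(lines[2:]).strip()
    return FailureEmail(to, subject, body)
```

Reading a failure file with `pathlib.Path(...).read_text(encoding="utf-8")` and passing the result to this function yields the three fields ready to paste into a mail client.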

Corrections

  1. Start with the Excel spreadsheet converted to CSV, as above.

  2. Run the script bin/extract_corrections_for_processing.py CSV_FILE. This creates a corrections directory containing one file for each correction that was submitted. Ideally, each file holds a single line with the three arguments to pass to the revision script.

  3. Manually inspect each file. Rewrite the explanation as a short, neutral, third-person, scientific account of the changes. Ensure that the file downloads correctly via wget.

  4. Run the script bin/add_revision.py ANTH_ID "DOWNLOAD_PATH" "EXPLANATION". Both DOWNLOAD_PATH and EXPLANATION may contain shell metacharacters, so quote them. The script attempts to validate that downloaded files are PDFs, but the check may not be perfect.

  5. Files are again written to ~/anthology-files/pdf/.... The script downloads the original and saves a copy as v1. It then saves the revision under its own revision name and also writes it over the original, so the revision is served by default.

  6. Double-check and then commit the XML changes that were made. Create a PR. Once it is cleared, rsync the files as above. Then clean out ~/anthology-files/pdf.
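
Step 4 notes that DOWNLOAD_PATH and EXPLANATION must be quoted and that the script's PDF validation may miss problems. The Python sketch below shows one way to pre-check both by hand. It is not what add_revision.py does internally, and the helper names `revision_command` and `looks_like_pdf` are hypothetical. The header test only confirms that the data starts with the `%PDF-` marker, which is necessary but not sufficient for a valid PDF.

```python
import shlex


def revision_command(anth_id: str, download_path: str, explanation: str) -> str:
    """Return a shell-safe command line for bin/add_revision.py.

    shlex.join (Python 3.8+) quotes each argument, so characters such
    as &, ?, spaces, or quotes are not interpreted by the shell.
    """
    return shlex.join(["bin/add_revision.py", anth_id, download_path, explanation])


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files begin with the bytes %PDF-."""
    return data.startswith(b"%PDF-")
```

Printing the result of `revision_command` gives a line that can be pasted into a terminal as-is, and `looks_like_pdf` can be applied to the first few bytes of a file fetched with wget to catch HTML error pages saved under a .pdf name.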

Reporting

When you're finished, use the script bin/summarize_additions.py to produce a list of changes. This script takes the git diff with the corrections on STDIN, and writes a formatted list of changes to STDOUT. Assuming you are on a branch and have committed, I suggest:

    git diff master | ./bin/summarize_additions.py | pbcopy

(pbcopy is available only on macOS.)

The changes should then be announced in the newsgroup.
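
Because pbcopy exists only on macOS, the pipeline above needs a different final command on other systems. The following Python sketch is one possible stand-in, assuming `xclip` is installed on Linux. The function names `clipboard_command` and `copy_to_clipboard` are hypothetical, and nothing here is part of the Anthology scripts.

```python
import subprocess
import sys
from typing import List


def clipboard_command(platform: str = sys.platform) -> List[str]:
    """Pick a command that copies STDIN to the clipboard."""
    if platform == "darwin":
        return ["pbcopy"]
    if platform.startswith("linux"):
        # Assumption: xclip is installed; it is not present by default everywhere.
        return ["xclip", "-selection", "clipboard"]
    raise NotImplementedError(f"no clipboard command known for {platform!r}")


def copy_to_clipboard(text: str) -> None:
    """Send text to the clipboard via the platform's command."""
    subprocess.run(clipboard_command(), input=text.encode("utf-8"), check=True)
```

Saved as a small script that calls `copy_to_clipboard(sys.stdin.read())`, this could take the place of pbcopy at the end of the pipeline.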