Workload - acl-org/acl-anthology GitHub Wiki

ACL Anthology Editor Workload

Running the Anthology as the editor requires a significant amount of time, estimated at about 4 to 6 hours of work a week, not counting development. Currently, many of the processes are manual and could be further automated. The job commits the editor to constant oversight of the server and its upkeep (akin to system administration), and the workload is also peaky, as new venues and conferences are delivered for ingestion at particular times of the year.

Job Scope (from Min)

We currently ingest about 4K articles per year, across approximately 75 different venues (15 conferences, 60 workshops). Ingestion time varies by venue but averages about 2-3 hours. This is why I approximated that it takes roughly 4 hours a week to maintain the Anthology.
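The weekly figure can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the venue count and per-venue time quoted above; spreading the work evenly over 52 weeks is a simplifying assumption, since the real workload is peaky.

```python
# Rough weekly workload implied by the ingestion figures quoted above.
VENUES_PER_YEAR = 75        # 15 conferences + 60 workshops (from the text)
HOURS_PER_VENUE = (2, 3)    # average ingestion time range (from the text)
WEEKS_PER_YEAR = 52         # assumption: work spread evenly over the year


def weekly_hours(venues, hours_per_venue, weeks=WEEKS_PER_YEAR):
    """Average hours per week for a given per-venue ingestion time."""
    return venues * hours_per_venue / weeks


low, high = (weekly_hours(VENUES_PER_YEAR, h) for h in HOURS_PER_VENUE)
print(f"{low:.1f} to {high:.1f} hours per week")  # 2.9 to 4.3 hours per week
```

Ingestion alone therefore accounts for roughly 3 to 4 hours a week on average, before any of the other tasks listed on this page.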

  • Ingestion of materials

  • Overview: Assigning IDs. Differences between START and non-START.

    We first assign IDs to the Anthology materials. These are kept on a public spreadsheet that is accessible from the Anthology’s footer and also has a shortlink. This step is done pre-production so that IDs are unique to a venue. (per ingestion: 10 minutes)

  • Material Types

    • Conferences
    • Workshops
    • Journals

    Actual ingestion consists of receiving the tarball, uploading it to the aclweb.org website, and ingesting the metadata into the frontend tech stack (Ruby on Rails, Nginx, Solr). Contributors often do not name sources correctly, and their XML often has validation problems. Troubleshooting these can take several hours per venue. We also need to generate the reference files (BibTeX), update the single BibTeX file, and re-index the database to facilitate search. (per ingestion: 3 hours)
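Some of the XML troubleshooting could in principle be caught before ingestion with an automated pre-check. The following is a minimal sketch using Python's standard library; it only tests whether a file is well-formed XML, not whether it satisfies the Anthology's actual schema, and the element names in the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise a short error description."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # location reported by the parser
        return f"line {line}, column {column}: {err}"
    return None


# Hypothetical sample metadata; real Anthology XML will look different.
good = "<volume><paper id='1'><title>Example</title></paper></volume>"
bad = "<volume><paper id='1'><title>Example</paper></volume>"  # unclosed <title>

print(check_well_formed(good))  # None
print(check_well_formed(bad))   # an error message pointing at the mismatch
```

A check like this could be run by contributors before they submit a tarball, so that malformed files are rejected early rather than during ingestion.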

  • Copyright Notices

    We archive copyright notices to a backed-up local drive as part of the ingestion log for the Anthology. (per ingestion: 10 minutes)

  • Supplemental Links

    • Naming Convention

    • Attachments: Datasets, Software, Notes and (general) Attachments

    • Videos

    • Post-Publication

      • Posters
      • Versions and Errata
      • Retractions

      Post-publication work is usually batched but can be one-off for urgent matters. I service about 50 of these per year.
      (per edit/fix: 30 minutes; batch edits, inclusive of original ingestion of supplemental materials: 1 hour)

  • DOI assignment

    Assign DOIs based on the ACL Anthology ID. The IDs need to be compiled into a CrossRef-uploadable XML file and uploaded to CrossRef. The resulting DOIs then need to be imported back into the ACL Anthology so that they show up on the Anthology website. We don’t provide this service for non-ACL venues. There are scripts that mostly work well for this. (per venue ingestion: 30 minutes)
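To illustrate the ID-to-DOI step, here is a minimal sketch and not the Anthology's actual script. The DOI prefix, the landing-page base URL, and the sample IDs are hypothetical placeholders. The `doi_data`, `doi`, and `resource` element names reflect our understanding of CrossRef's deposit format and should be verified against CrossRef's documentation; a real deposit file also needs surrounding batch and conference metadata that is omitted here.

```python
import xml.etree.ElementTree as ET

DOI_PREFIX = "10.99999/example"             # hypothetical placeholder prefix
BASE_URL = "https://example.org/anthology"  # hypothetical landing-page base


def doi_for(anthology_id, prefix=DOI_PREFIX):
    """Derive a DOI string from an Anthology ID."""
    return f"{prefix}/{anthology_id}"


def doi_data_element(anthology_id):
    """Build a DOI/URL fragment for one paper (not a complete deposit)."""
    doi_data = ET.Element("doi_data")
    ET.SubElement(doi_data, "doi").text = doi_for(anthology_id)
    ET.SubElement(doi_data, "resource").text = f"{BASE_URL}/{anthology_id}"
    return doi_data


# Hypothetical Anthology IDs, used only for illustration.
for paper_id in ["X00-1001", "X00-1002"]:
    print(ET.tostring(doi_data_element(paper_id), encoding="unicode"))
```

Because the DOI is derived deterministically from the Anthology ID, the same function can be reused when importing the DOIs back into the Anthology.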

  • Anthology Group

    Broadcasts Anthology ingestions and occasional service announcements. (per venue ingestion: 15 minutes)

  • Volunteer Coordination

    • Software Development

      • Presence and Organization on GitHub
      • Docker Image
      • Git Issues
      • Volunteers’ Meetings

      We opportunistically try to clear issues from our GitHub issues queue, but we depend on volunteers. Without the ACL Exec’s help in recruiting volunteers and assigning them to the Anthology, we are helpless. Each issue can take 1-5 hours of development time, assuming the developer is already familiar with the codebase, which itself can take a few days of dedicated time to absorb.

    • ARC Development

      This is work we’d love to do, but it depends on available development time. Creating a new ARC is distributed work that takes a few months plus coordination time.

  • Reporting / Liaising

    • Requests for copyright clearance

      We are CC BY 4.0, so we just need to reply that it is OK and log the record. (per request: 15 minutes)

    • Indexing (Scopus, Google Scholar)

    • ACL Exec

      We have to file reports to the ACL Exec in the AdminWiki (per report: 2 hours; twice a year if requested)

    • Interface with ACL Information Office

      • Server Hosting