3 29 2021 Tech Team Report - QualitativeDataRepository/TechnicalTeam GitHub Wiki

3-29-2021

Logged Tasks

Date  Task  Hours (Main)  Hours (EOLS)  Hours (PII)
22-Mar-2021 Report, meeting, coordination re: aws sync settings for backup 2
26-Mar-2021 AnnoRep: test capturing overlapping comments 4
27-Mar-2021 Debug/fix Bag generation issue, investigate/report Jenkins issue, generate annotations, send file 2 2
28-Mar-2021 Deploy bag fix to dev and stage, add delete aux file method and check for existing file on add aux 1

Summary

Dataverse

  • Debugged and fixed an issue in Bag generation caused by a lack of thread safety in the classes used to manage HTTP call context. Deployed the fix for testing to dev (in the dev branch) and to stage (in a 5.3-qdr5 version of the 5.3 branch).
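
For readers unfamiliar with this class of bug, here is a minimal sketch of the general pattern: giving each worker thread its own call context instead of sharing one instance. It is a Python illustration only, not the Dataverse (Java) fix; `CallContext`, `get_context`, `fetch`, and `fetch_all` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CallContext:
    """Hypothetical stand-in for per-request HTTP state (cookies, auth, etc.)."""

    def __init__(self):
        self.state = {}


# Thread-local storage: each thread that touches _local sees its own attributes.
_local = threading.local()


def get_context():
    """Return the calling thread's own context, creating it on first use."""
    ctx = getattr(_local, "ctx", None)
    if ctx is None:
        ctx = CallContext()
        _local.ctx = ctx
    return ctx


def fetch(file_id):
    """Pretend to fetch a file: record its id in the context, then read it back."""
    ctx = get_context()
    ctx.state["current_file"] = file_id
    # With one shared context, another thread could overwrite this value between
    # the write above and the read below. With a per-thread context it cannot.
    return file_id, ctx.state["current_file"]


def fetch_all(file_ids, workers=4):
    """Run fetch() for every id on a small thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, file_ids))
```

Calling `fetch_all(range(50))` returns pairs in which the value read back always matches the id that the same thread wrote.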

Operations:

  • Investigated the use of aws sync in backups; suggested changes to avoid unnecessary updates and to retain, in the backup, files that have been deleted from Dataverse.
  • Investigated and reported that Jenkins was broken; confirmed the fix by deploying new Dataverse versions to dev/stage.
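
The specific settings suggested are not recorded in this report. As a rough sketch of the kind of choice involved, the helper below assembles an `aws s3 sync` command line. `--size-only` and `--delete` are existing options of that command; treating `--size-only` as the way to avoid unnecessary updates is an assumption, and the function name, directory, and bucket URI are hypothetical. The function only builds the argument list and does not run anything.

```python
def build_sync_command(source_dir, bucket_uri, size_only=True, mirror_deletes=False):
    """Build an argument list for `aws s3 sync` without executing it.

    size_only: compare objects by size alone, so files whose timestamps
        changed but whose content size did not are not uploaded again.
    mirror_deletes: pass --delete, which removes objects from the destination
        when they no longer exist in the source. Left off by default so that
        files deleted at the source stay in the backup.
    """
    cmd = ["aws", "s3", "sync", source_dir, bucket_uri]
    if size_only:
        cmd.append("--size-only")
    if mirror_deletes:
        cmd.append("--delete")
    return cmd
```

The resulting list could be handed to `subprocess.run` by whatever script drives the backup.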

AnnoRep:

  • Debugged and extended the classes that capture the 32 chars before/after a comment, and added tests for them.
  • Updated the annotation creator code to use them and verified that multiple overlapping comments are captured correctly.
  • Investigated how paragraph boundaries are represented. Not sure this always holds, but the code now adds '\n\n' between paragraphs in the pre/post context and in the commented text used in TextQuoteSelectors.
  • Implemented a basic json annotation structure to store generated annotations as an aux file.
  • Started updating the aux file code in Dataverse to disallow multiple files with the same ~name ('tag' plus 'version', in the way they're implemented) and to support deleting existing aux files.
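
The selector handling described above can be sketched roughly as follows. This is a Python illustration, not the AnnoRep code: the function names are hypothetical, the `prefix`/`exact`/`suffix` keys follow the W3C TextQuoteSelector, and the `uri`/`text`/`target` layout of the partial annotation is an assumption about what would be submitted to Hypothesis.

```python
CONTEXT_CHARS = 32          # characters of context kept before/after a comment
PARAGRAPH_SEPARATOR = "\n\n"


def join_paragraphs(paragraphs):
    """Join paragraph texts the way the report describes: '\\n\\n' between them."""
    return PARAGRAPH_SEPARATOR.join(paragraphs)


def text_quote_selector(text, start, end, context=CONTEXT_CHARS):
    """Build a TextQuoteSelector for the commented range text[start:end]."""
    if not 0 <= start < end <= len(text):
        raise ValueError("comment range must lie inside the text")
    return {
        "type": "TextQuoteSelector",
        "exact": text[start:end],
        # Context is clipped at the document edges rather than padded.
        "prefix": text[max(0, start - context):start],
        "suffix": text[end:end + context],
    }


def partial_annotation(uri, comment, selector):
    """Wrap a selector in a partial annotation (no date, no account)."""
    return {
        "uri": uri,
        "text": comment,
        "target": [{"source": uri, "selector": [selector]}],
    }
```

Because each selector is computed independently from character offsets, overlapping comments need no special handling in this sketch: two ranges that share text simply yield two selectors whose `exact` values overlap.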

Discussion:

AnnoRep is reaching the point where some additional coordination/discussion will soon be useful:

  • Test .docx files to verify pdf conversion and extraction of annotations
  • Evaluation of selectors against the pdf. Several potential issues: since I'm generating the selectors from the docx, are they robust when applied to the pdf? That is, do they work (with fuzzy eval, that means they're ~close), and are they exactly correct? Fuzzy eval was meant to handle later changes, so it would be nice if the initial extraction isn't already off in some way.
  • Discussion of docx/pdf/annotation lifecycle - are there multiple docx versions? Are edits of annotations managed by AnnoRep? Are they applied to the docx as updates?
  • Currently the back-end stores a json file with ~partial annotations ready for submission to Hypothesis (i.e. they don't have a date or an association with a Hypothesis account, both of which would be assigned if/when the Hypothesis API is used). Should the back-end submit them, capture the complete versions (as we do when retrieving annotations for archival purposes), and store those? Should the front-end manage interactions with Hypothesis once the initial annotations are extracted? (This may depend on whether the back-end adds changes back to the docx.)
  • Could/should the back-end do things like inspect DOIs in comments and replace those with hyperlinks to any matching datafiles before storing/submitting?
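
To make the last question concrete, here is a minimal sketch of what DOI replacement could look like. Everything in it is an assumption for discussion: the loose DOI pattern, the `link_dois` name, the lookup table from DOI to datafile URL, and the choice of Markdown link syntax for the output.

```python
import re

# Loose pattern: "10.", a 4-9 digit registrant code, "/", then a suffix that
# runs until whitespace, a quote, or an angle bracket. Not a full DOI grammar.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def link_dois(comment, datafile_urls):
    """Replace DOIs that match a known datafile with Markdown links.

    datafile_urls maps lower-cased DOIs to URLs (hypothetical lookup table).
    DOIs with no matching datafile are left untouched.
    """

    def replace(match):
        doi = match.group(0)
        # Sentence punctuation directly after a DOI is not part of it.
        trimmed = doi.rstrip(".,;:)")
        trailing = doi[len(trimmed):]
        url = datafile_urls.get(trimmed.lower())
        if url is None:
            return doi
        return f"[{trimmed}]({url}){trailing}"

    return DOI_PATTERN.sub(replace, comment)
```

Leaving unmatched DOIs unchanged keeps the step safe to run on every comment, whichever side ends up being responsible for submission.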

Plans

  • Anno-Rep work:
      • Add a call to list aux files. Make a PR to update aux files (in Dataverse 5.4, they are only allowed on tabular files, multiple copies can be uploaded, which then breaks download, and there is no delete and no way to list them).
      • Start deploying the service on dev once the basics are in place.
      • Clean up the code and post it to the annorep repo (I started from a Spring Boot example project and haven't removed irrelevant code; am considering keeping an in-memory db to cache entries).
      • Investigate additional selectors (e.g. TextPosition).
      • Support use of the Dataverse API as needed.
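
The intended aux file behavior (no duplicates for the same tag and version, plus delete and list) can be sketched with a small in-memory model. This is a Python illustration, not the Dataverse implementation or its API; the class and method names are hypothetical.

```python
class AuxFileStore:
    """In-memory model of aux files keyed by (datafile id, tag, version)."""

    def __init__(self):
        self._files = {}

    def add(self, datafile_id, tag, version, content):
        """Store an aux file, refusing a second copy with the same tag/version."""
        key = (datafile_id, tag, version)
        if key in self._files:
            raise ValueError(f"aux file {tag}/{version} already exists")
        self._files[key] = content

    def delete(self, datafile_id, tag, version):
        """Remove an aux file; return True if one was actually removed."""
        return self._files.pop((datafile_id, tag, version), None) is not None

    def list_for(self, datafile_id):
        """Return the sorted (tag, version) pairs stored for one datafile."""
        return sorted(
            (tag, version)
            for (fid, tag, version) in self._files
            if fid == datafile_id
        )
```

Rejecting the duplicate on add, rather than silently overwriting, means a caller that wants to replace an aux file must delete the existing one first.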

Still TBD:

  • Drupal 9/composer 2/3