6 28 2021 Tech Team Report - QualitativeDataRepository/TechnicalTeam GitHub Wiki

6-28-2021

Logged Tasks

Date Task Hours (Main) Hours (EOLS) Hours (PII)
21-Jun-2021 Report, meeting, fix anonymization PR/make strings i18n-able 1 1
22-Jun-2021 Investigate FRDR secure app, initial impl of status labels 4
23-Jun-2021 Create branch for Status labels, coord with Philipp 1
24-Jun-2021 D8 Webforms update to dev/stage, anonymized access demo, updates to UI/logic/tests/docs, update PR 1 2
25-Jun-2021 Updates to anonymized PR per review 1

Summary

Operations:

  • Deployed webforms update on dev, stage

Dataverse:

  • Demoed the anonymizedAccess PR to IQSS and updated it - flyway script, i18n of error strings, and minor changes per review
  • Created an initial curation tags/labels implementation (on dev) - a simple API to set a tag, and display of that tag alongside the other dataset labels at the top of the page.
  • Investigated the FRDR encrypted storage mechanism

Discussion

  • Which grant supports curation tags?
  • Curation labels: Given issue #6886 and Philipp Conzett's note there about potential funding this fall, I wrote to him with several questions about how they expect labels to work. My questions, which should be valid for QDR discussion as well, and his answers are below. As noted, my plan at this point is to add a setting that limits the allowed values and then to change the permission required to set a label to the one required to publish a dataset, which would allow changes by curators (using the API). If we want a UI or other changes after discussion, I can look into those as well.

In our DataverseNO Plus grant proposal we have a WP called "Deposit and Curation", where we describe some desirables, but not on detailed level:

Curation: Enhanced support for i) feedback on deposited datasets within the system, not through email as in the current version; ii) keeping track of the progress of curation of datasets; iii) more granular notice of datasets submitted for review based on the subject / research domain of the dataset and the skills background of available curators.

I haven't had the chance to discuss your questions about the status tags with any of the curator teams in DataverseNO as people are about to leave for vacation (July and the beginning of August are vacation time at schools and universities in Norway). But I have tried to add some comments I could think of myself; see text in green below. I hope this can serve as input for further discussion. I'd be happy to also ask our curator teams about this after the summer break and/or test any new functionality.

If I were to try asking specific questions (will probably ask them of QDR folks as well):

- Is a UI useful? Or will places that formalize curation usually/always have a tool that can run the API? >> For our use cases in DataverseNO, a UI would be very useful. I guess only very few (maybe 1-2) of all the approx. 40-50 curators involved in DataverseNO would feel comfortable using an API. Currently, we are not planning to use a curation tool integrated with Dataverse, but hopefully this will change if we get the grant to upgrade DataverseNO. In that case, developing a curation management tool will be part of the WP mentioned above.

- Are tags only needed on draft datasets? (The current design removes them at publication) >> For current DataverseNO use cases, tags on draft datasets would be most needed. The one use case for published datasets I can think of right now is an Embargoed tag; see below.

(with embargo handled as a separate PR, this won't be a use case for the custom curation tags)

- Are multiple tags needed? (Could be noisy but if there are multiple curation processes with states then we might need that) >> I think multiple tags would be useful. Four tags that I can think of right now, are

Returned to Author, Approved by Curator, Under Double-Blind Review, Embargoed

and several of them could apply simultaneously.

- Assuming I make the change above to add a setting, does allowing anyone with publish permission use the API (and UI if it exists) make sense? That would allow installations to turn this off (no setting), for self-publish sites to let people add tags (from the list) on their own (which get deleted at publish), and for curated sites like QDR to limit use to curators. >> For DataverseNO, this would definitely make sense, as only curators can publish, not depositors.

- If there’s a UI, where does it go? >> I can think of two places: 1) As a pop-up window after the curator has pressed either the Return to Author button or the Publish button. The curator would then be asked to enter (preferably choose from a list) an appropriate tag. Also, would it be possible for some of the tags to be added and removed automatically? For example: The Returned to Author tag should be added when the dataset is returned to the author, and removed when the author has resubmitted it for review. The Approved by Curator tag should be removed when the dataset has been published. 2) As an option "Add or Edit Status Tag" in the dataset Edit button.

- I haven’t yet tested, but I think all labels are faceted so the new ones would be searchable like the existing ones. Is that OK (without change I think that means those facets will be visible to people who can’t see draft datasets)? Is it sufficient? (Do we need a specific API/UI to find datasets in each state or does a search query do well enough?) >> Faceting would be very useful. Currently, a search query would do well enough. I guess the search query can be applied within / restricted to sub-dataverses?

(not sure this was understood - the tags are faceted by default - my question is whether that's enough)

- Implementing as a label doesn’t really track the state, i.e. there is no direct record of who changed the state at what time. It probably makes sense to add the changes to the actionlog where those details could be found if/when needed. Conversely, I could add a new table in the db and really track all of this/add an api to let you retrieve the whole lifecycle, etc. (Or one could go wild and consider whether the processing in Dataverse ought to be getting added to a provenance record (similar to the ones you can manually upload for files). I suspect this would be out of scope for now, but if that’s what makes sense longer term, it might argue for keeping the initial implementation simple / just adding to the action log for now.) >> I think as a first step, keeping it simple sounds like a good idea. Maybe at a later stage we could elaborate along the provenance route.
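
A minimal sketch of the "keep it simple" option discussed above: labels are added and removed automatically on workflow events, all labels are dropped at publication, and every change is appended to an action log with the user and time. This is only an illustration in Python, not Dataverse code; `TRANSITIONS`, `DraftDataset`, `apply_event`, and the event names are hypothetical, and the log is an in-memory list standing in for the real actionlog.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical mapping: workflow event -> (labels to add, labels to remove).
# A remove value of None means "remove every label" (as at publication).
TRANSITIONS = {
    "return_to_author": ({"Returned to Author"}, set()),
    "submit_for_review": (set(), {"Returned to Author"}),
    "publish": (set(), None),
}


@dataclass
class DraftDataset:
    labels: set = field(default_factory=set)
    action_log: list = field(default_factory=list)

    def apply_event(self, event, user):
        """Apply the label changes for one workflow event and log them."""
        add, remove = TRANSITIONS[event]
        before = set(self.labels)
        if remove is None:
            self.labels.clear()
        else:
            self.labels -= remove
        self.labels |= add
        # Only write a log entry when the labels actually changed.
        if self.labels != before:
            self.action_log.append({
                "user": user,
                "event": event,
                "added": sorted(self.labels - before),
                "removed": sorted(before - self.labels),
                "time": datetime.now(timezone.utc).isoformat(),
            })


ds = DraftDataset()
ds.apply_event("return_to_author", "curator1")
ds.apply_event("submit_for_review", "author1")
for entry in ds.action_log:
    print(entry["time"], entry["user"], entry["event"], entry["added"], entry["removed"])
```

Keeping the who/when details in log entries rather than in a dedicated table matches the simple first step above; a full lifecycle API or provenance record could later be built from the same entries.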

  • FRDR's secure storage mechanism: From what I could see, this is fairly separate from Dataverse - a tool that encrypts files locally before they are sent to Dataverse, with Dataverse not understanding that the files are encrypted (aside from mimetype perhaps). It looks like it does use a key management mechanism (HashiCorp) to allow decryption keys to be shared with downloaders as desired (again independent of Dataverse's access control mechanism).
    • I would expect IQSS's DataTags work (status unknown) to include the idea of Dataverse managing encryption prior to upload and/or some mechanism to indicate that a file is encrypted and how to get the key. I could imagine that the work to allow external request authorization mechanisms (ADA) and/or the TRSA remote store mechanism could be leveraged as well, e.g. to redirect the user to some external mechanism (perhaps à la the FRDR work) that manages access requests/decryption keys.
  • Are there next steps w.r.t. this? A discussion to see how/where FRDR might see new interactions with Dataverse helping to simplify the workflow to store/get encrypted files?

Plans

  • Anno-Rep work
    • Compare w/ Hypothesis selectors
  • Dataverse
    • still want to investigate the guestbook responses re version info not being included.
    • add curation label setting (allowed values) and change permissions to curator (canPublish), make sure uses are action logged.
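
A rough sketch of the check planned in the last item, assuming the setting is a comma-separated list of allowed labels and that an unset setting turns the feature off. The function names, the setting format, and the `"PublishDataset"` permission string are assumptions made for illustration, not the actual Dataverse implementation.

```python
def parse_allowed_labels(setting_value):
    """Turn a comma-separated setting into a set of labels (empty if unset)."""
    if not setting_value:
        return set()
    return {part.strip() for part in setting_value.split(",") if part.strip()}


def can_set_label(label, setting_value, user_permissions):
    """Return True only if the feature is on, the user can publish,
    and the label is one of the allowed values."""
    allowed = parse_allowed_labels(setting_value)
    if not allowed:
        return False  # no setting: the installation has the feature turned off
    if "PublishDataset" not in user_permissions:
        return False  # assumed stand-in for the publish (curator) permission
    return label in allowed


setting = "Returned to Author, Approved by Curator"
print(can_set_label("Approved by Curator", setting, {"PublishDataset"}))
print(can_set_label("Approved by Curator", setting, {"EditDataset"}))
```

With this shape, self-publish sites and curated sites differ only in who holds the publish permission, and no separate curator role is needed.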

Still TBD:

  • Drupal 9, Composer 2-->3