2 14 2022 Tech Team Report - QualitativeDataRepository/TechnicalTeam GitHub Wiki

2-14-2022

Logged Tasks

                            Date             Task Hours (Main) Hours (EOLS) Hours (PII) Hours (QDAS)
7-Feb-2022 Report, Drupal 9.3.5 update (Sun.), meeting 2
8-Feb-2022 Fix merge issues, update CC0/add QDR license on dev, investigate manage template delete failure 2
9-Feb-2022 Review security notices, deploy annorep service on stage/coord re testing config 1
10-Feb-2022 Add refiqda mimetype/deploy, fix merge issue re isIngestable 1
11-Feb-2022 Add Sets table, graph, investigate zip file mgmt options, poc of zip download, explore s3 efficiency 7
12-Feb-2022 Debug/fix/test zip download, fix mimetype typo 2

Operations

  • Reviewed ~weekly Drupal security notices - none applicable

Dataverse

  • Fixed merge issues, updated the CC0 1.0 license per #8413
  • Investigated the manage template delete failure (appears to pre-date ~v5.10; no fix yet; only affects the display update)

AnnoRep

  • Rebuilt/deployed the AnnoRep server on stage and coordinated with To re: the networking plan, as far as I understand it. (FWIW: I haven't yet made the hostname a parameter in the server, which still needs to happen before this is 'done'.)

QDAS

  • Continued proof-of-concept development
    • Added the qdc, qde, and qdpx mimetypes to be recognized by Dataverse (the last as a zip file for now)
    • Added a table for Sets and used Cytoscape to draw the Graph(s) in a project file
    • Considered options for displaying but protecting the contents of a qdpx file and decided to try a proof-of-concept that allows access to the zipped files while storing the qdpx as is. (See discussion).
    • Implemented a proof-of-concept allowing sources within a qdpx to be downloaded.

Discussion

  • Is there a QDAS mimetype? Either for the zip file or the xml one? (not a blocker - we can create unregistered ones but better to use what's already there, easy to change later)
  • Do we have a full QDAS Zip example I can develop with?
  • FWIW: In thinking through how to keep the qdpx contents 'managed' (i.e. visible, but without allowing other files to be added to the sources dir, people changing names or other file metadata in Dataverse, etc.), I thought it might make more sense to just not unpack the zip in the first place. There are essentially two issues with this: the files don't appear in the Dataset file listings, and one can't download individual files. From prior work (on SEAD in particular) I know it is possible to efficiently get a file list and to download individual files from a zip. Given that a solution to these issues could also have a big impact on Dataverse's ability to handle large numbers of files (e.g. by not creating 1000 data file entries for a 1000-file zip), I went ahead and created a proof-of-concept that allows downloads from within a zip. It is usable but not as efficient as possible (see Plans). For qdpx files, I'll update the previewer to have live links to the individual source files. For general zip files, it might make sense to do something like create an aux file with the list of files and have a previewer that just exposes that list. It is TBD whether something like this would be acceptable/of interest in the community, but my guess is that this could be a quick win: having large numbers of unzipped files appear in the basic file list isn't that useful, and the tree view is not full-featured. Regardless, if this looks reasonable for qdpx, I'll continue down this path, as I think it will be less work than trying to implement some form of 'protected' files and paths.
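
As a rough illustration of the "don't unpack" idea discussed above, here is a minimal Python sketch (not the Dataverse proof-of-concept, which is Java) that lists a zip's members from its directory, serializes that listing as a stand-in for the proposed aux file, and pulls out a single member while leaving the archive stored as is. The function names and the archive layout (`project.qde`, `sources/interview1.txt`) are hypothetical and chosen only for the example.

```python
import io
import json
import zipfile


def build_zip(members):
    """Build an in-memory zip from a {name: bytes} mapping (demo fixture only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def list_members(zip_bytes):
    """Return name/size entries from the zip's directory without extracting anything."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            {"name": info.filename, "size": info.file_size}
            for info in zf.infolist()
            if not info.is_dir()
        ]


def manifest_json(zip_bytes):
    """Serialize the listing; a stand-in for an 'aux file with the list of files'."""
    return json.dumps(list_members(zip_bytes), indent=2)


def read_member(zip_bytes, name):
    """Return the bytes of one member; the archive itself is never unpacked."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name)


# Hypothetical layout: one project file plus a sources dir.
demo = build_zip({
    "project.qde": b"<Project/>",
    "sources/interview1.txt": b"hello",
})
print(manifest_json(demo))
print(read_member(demo, "sources/interview1.txt"))
```

A previewer along the lines described above would only need the manifest to render links, with each link resolving to something like `read_member` on the server side.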

Plans

  • Dataverse
    • Deploy multi-license ~v5.10 for testing on stage if/when it looks OK on dev (i.e. if it is OK for AnnoRep testers working on stage)
    • Popup info accessibility - IQSS likes the recommendations from the source I linked to, so this can be implemented along those lines.
    • QDAS planning/design/prototyping
      • Add file links
      • Add error handling for format variations
      • Update zip access proof-of-concept to use a seekable channel. (FWIW: The trick to efficiently reading from archives is to avoid scanning through all the bytes to find a given file. Zip files include a directory, so one can find the directory and then the byte range for a given file. But if one has to start from a 'normal' stream (i.e. what you get from an S3 store, and the common denominator Dataverse uses for files as well), you still end up having to scan through the zip to reach the bytes you need and, since I believe zip can require you to go backward in the file, you have to keep the bytes you've read in memory or a temp location as well. When using a file system directly, seekable channels allow you to request any range of bytes you want, and Java, the file system, etc. know how to skip forward and backward to the places you select and read only the bytes you want (although at lower levels the read is usually larger, with caching so that the typical sequential-read case is still efficient). Since S3 also allows one to request byte ranges, one can nominally do the same sort of thing there, but Java does not by default provide a seekable channel abstraction above that. There are, however, file-system-over-S3 projects that do, and it appears that pulling out the seekable channel implementation from one of those and adapting it should work for Dataverse. With that in place, and a tweak to the general StorageIO mechanism in Dataverse to let you get a seekable channel from files or S3, extracting from a zip would be much more efficient (i.e. you'd be retrieving roughly the number of bytes in the requested file rather than the size of the zip itself, which could be 10 to 10,000 times larger). The main problem with this approach is that one can't redirect (as with direct downloads), so Dataverse itself would be doing the processing. However, future work could involve a separate service that just does this extraction, and Dataverse could redirect to it.)
    • Still want to investigate the guestbook responses re version info not being included.
  • AnnoRep work
    • Help with deployment to dev
  • TBD: FRDR Security
  • Other tasks as discussed in strategic planning
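
The seekable-access point in the zip proof-of-concept item above can be illustrated with a small Python sketch. It is only an analogy, under stated assumptions: `CountingBytesIO` is a hypothetical in-memory stand-in for a seekable channel over a file or an S3 object (where each seek-plus-read would become a ranged request), and Python's `zipfile` plays the role of the zip reader. It is not the Dataverse StorageIO code. The sketch counts how many bytes are actually read when extracting one member from a zip with many members.

```python
import io
import zipfile


class CountingBytesIO(io.BytesIO):
    """In-memory seekable 'store' that tallies the bytes handed back by read()."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_zip(n_members=50, member_size=20_000):
    """Build an uncompressed demo zip with many equally sized members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(n_members):
            zf.writestr(f"sources/file{i:03d}.bin", bytes([i % 256]) * member_size)
    return buf.getvalue()


def extract_one(zip_bytes, name):
    """Extract one member via seeks; return its bytes and the total bytes read."""
    store = CountingBytesIO(zip_bytes)
    with zipfile.ZipFile(store) as zf:
        # zipfile seeks to the end for the directory, then to the member's offset.
        data = zf.read(name)
    return data, store.bytes_read


archive = build_zip()
payload, touched = extract_one(archive, "sources/file007.bin")
print(f"zip size: {len(archive)} bytes, read to extract one member: {touched} bytes")
```

With seeking available, the bytes read are roughly the member itself plus the zip directory, rather than the whole archive, which is the saving described above for the file and S3 cases.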