2023.09.13 Community Meeting - OCFL/spec GitHub Wiki

Call-in Details

Zoom Link: https://emory.zoom.us/j/7074635164?pwd=SExsZ1NwYjVlNy9ZWHJHZ09BYXVxQT09

Attendees

  1. John Weise (U Michigan)
  2. Jürgen Enge (Universität Basel)
  3. Jared Whilko (University of Manitoba)
  4. Seth Erickson (UCSB)
  5. Simeon Warner (Cornell)
  6. Doron Shalvi (US NLM)
  7. Laurie Arp (Lyrasis)
  8. Tommy Keswick (Caltech)
  9. Robert Doiel (Caltech Library)
  10. Rosalyn Metz (Emory)
  11. Nicole Scalessa (Vassar College)
  12. David Novak

Agenda

1. Welcome

1.1 Community updates (introductions, updates, implementations, plans, etc)

John - Have not yet implemented OCFL but very interested

Jared - Moving from Fedora 3 to Fedora 6; have also started discussing whether Archivematica will implement OCFL

Seth - Have an implementation

Laurie - Looking from Lyrasis perspective

Juergen - Built a library in Go. Now working to move to an OCFL archive in 2-3 years

Doron - Doing a Fedora 3 to Fedora 6 migration

Tommy - Have implemented some Invenio RDM repos. Working on a new preservation system and interested in aligning with the community.

Simeon - OCFL editor. Looking to move Cornell preservation from a homegrown system to an all-cloud OCFL implementation.

Robert - Working with Doron; have connected and had discussions with Neil Jeffries about work on a JSON data store with attachments.

Nicole - Using an Islandora repository over Fedora 6 and using that as the preservation system

Rosalyn - OCFL editor.

2. Community Listening Sessions

2.1 How are things going with implementing and conforming to version 1? Are there specific use cases that you feel need to be addressed or clarified?

Juergen - Have problems using OCFL Objects that contain very many files. Would like to use one zip file per version (https://github.com/OCFL/Use-Cases/issues/33). Currently doing a zip per object, up to 300GB.

Juergen - Extension issues: how do we get rid of extension warnings? Need either a better process for getting extensions registered, or for the spec not to require that warnings be generated.

2.2 How do you imagine your storage needs evolving over the next decade; what are you concerned about? How might OCFL help address these issues?

NLM has been testing with a small repository. They are also looking at design issues for a large repository and are interested in pointers on how to deal with large systems. Some video master files are very large (maybe 100s of GB, perhaps even a few TB). In Fedora 3 these large files are managed externally to Fedora in their own human-readable file structure -- seems like a good map to OCFL. How do we navigate the creation of multiple files and directories in OCFL?

NLM's decision to manage files externally for Fedora 3 was made to avoid sending files through the Fedora API. This feels like a good decision, and they have been happy to have Fedora aware of these files but not directly as Fedora objects. Felt that this was a better preservation posture. Like the idea of managing the master copies separate from the master copies.

NLM's goal is better preservation with transparency and good management. The organization has moved to AWS as well as on-prem. Plan to use a multiple-copies strategy. Look to OCFL as the future organization for masters. A secondary goal might be a connection to the Fedora 3 to Fedora 6 migration. Wondering whether having 2 OCFL Storage roots makes sense.

2.3 What issues are you concerned about when it comes to versioning your data? How might OCFL help address these issues?

Current NLM structure with the small repo is one XML blob per citation. Found that there were many files and directories in OCFL and had to work with that. Not using versioning in the small system. With larger systems they imagine more issues, want to use versioning, and may have to think about optimizations.

Rosalyn notes that there have been past requests for the ability to fork versions.

Doron notes that mutable head covers many use cases. They have situations with lots of updates in a very short space of time, where they would not want to create a separate version for each update and are instead thinking of mutable head.

Juergen wonders about file deltas for new versions, instead of having to copy whole file again. This isn't an issue for him directly (modest text files) but imagines uses that might benefit significantly from this (e.g. large research data).

2.4 Multiple copies in different places

Should this be beyond the scope of OCFL?

Indiana also thinking about this?

2.5 External files

Does it even make sense to have OCFL "manage" external files in the way that Fedora 3 does?

Past work: https://github.com/OCFL/Use-Cases/issues/35

Doron/NLM - Current size is about 150TB but imagine might have more. Interested in the possibility of not having to move content, or what about a reference to something that exists on a tape system? Not sure whet

Jared - Curious about expectations and preservation needs for an OCFL object that references external content.

Juergen - OCFL could do something like put a virtual filesystem inside an OCFL object. Then there could be any type of link.

John - I can imagine the vague edge cases too, but I’m struggling to think of a situation where at U Michigan Library we would want to have external references.

Seth - It might make sense to have a storage root-level “content” directory that is shared by objects in the root, but it seems counterproductive to link to files outside the storage root.

2.6 Object stores

Jared - From Peter Winckles's work on ocfl-java, there are things that have to be dealt with to work with both filesystems and S3 stores. Is there a way to use the preservation tools of cloud systems and improve how files are uploaded and retrieved?

Tommy - When storing in AWS S3, there is an ETag for each thing you store. Checksums in past work with bags of bags got really messy if you don't want to transfer the files again. Would one also want to store AWS ETags somewhere in OCFL?
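For context on the ETag question, here is a minimal sketch of how an S3-style ETag is commonly reproduced locally. For a plain single-part upload the ETag is typically the hex MD5 of the content; for multipart uploads the widely observed (but not guaranteed) pattern is the MD5 of the concatenated binary MD5s of the parts, followed by a dash and the part count. The helper name `s3_style_etag` and the 8 MiB default part size are assumptions for illustration, and encrypted objects may behave differently.

```python
import hashlib


def s3_style_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Estimate the ETag S3 would report for `data` (hypothetical helper).

    Assumes the client used a single PUT when the data fits in one part,
    and a multipart upload with fixed-size parts otherwise.
    """
    if len(data) <= part_size:
        # Single-part upload: the ETag is the hex MD5 of the content.
        return hashlib.md5(data).hexdigest()
    # Multipart upload: MD5 over the concatenated binary MD5 digests of
    # the parts, then "-" and the number of parts.
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Because the multipart value depends on the part size chosen at upload time, it is not a content digest in the OCFL sense; if ETags were recorded at all, they would sit alongside the manifest digests rather than replace them.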

Seth - Would like to see better support for the graceful creation of new versions. The process typically means staging content somewhere, writing the new structure, and moving/copying files in. It would be nice to have better spec support for this, with a non-error condition for a partially complete version. One solution would be to allow content to be assembled in a new version directory but not be considered part of the object until the new inventory is written. Part of this is about which operations are more or less atomic on object stores vs filesystems.

The mutable head extension is somewhat different, described at: https://ocfl.github.io/extensions/0005-mutable-head.html
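As a rough illustration of the staging approach Seth describes (not the mutable head extension), the sketch below copies content into the new version directory first and only then swaps in the root inventory with an atomic rename. The function name `add_version` and its parameters are hypothetical, and the sketch deliberately omits the inventory digest sidecar and the copy of the inventory inside the version directory that a real OCFL object needs.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def add_version(object_root: Path, version: str,
                files: dict[str, Path], inventory: dict) -> None:
    """Assemble `version` in place; publish it by replacing the inventory.

    Hypothetical helper: `files` maps paths under the content directory
    to source files, and `inventory` is the already-built new inventory.
    """
    content_dir = object_root / version / "content"
    content_dir.mkdir(parents=True)
    for content_path, source in files.items():
        target = content_dir / content_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)

    # Until this point the old root inventory is untouched, so readers
    # that trust the inventory still see the previous head version.
    fd, tmp_name = tempfile.mkstemp(
        dir=object_root, prefix="inventory.", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(inventory, tmp, indent=2)
    # os.replace is an atomic rename on POSIX filesystems; object stores
    # generally have no equivalent, which is part of the concern raised.
    os.replace(tmp_name, object_root / "inventory.json")
```

If the process is interrupted before the final rename, the object is left with a version directory that the inventory does not mention, which is the partially complete state the discussion suggests treating as a non-error.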

2.7 Version of the object vs version in the object

The spec is currently unclear in distinguishing the specification version of the object from the specification version of a version within the object. There could, for example, be a NAMASTE file in the version directory.
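To make the NAMASTE idea concrete: an OCFL object root carries a declaration file named like `0=ocfl_object_1.0`, and the suggestion above would place a similar file in a version directory. The sketch below reads the declared version from whichever directory it is given; the function name `declared_spec_version` is hypothetical, and applying it to a version directory reflects the idea floated in the meeting rather than anything the spec currently defines.

```python
from pathlib import Path
from typing import Optional

# Filename prefix of the OCFL object conformance declaration.
PREFIX = "0=ocfl_object_"


def declared_spec_version(directory: Path) -> Optional[str]:
    """Return the spec version named by a NAMASTE file in `directory`.

    Returns None when there is no declaration file or more than one,
    since either case leaves the version ambiguous.
    """
    matches = sorted(
        p.name for p in directory.glob(PREFIX + "*") if p.is_file())
    if len(matches) != 1:
        return None
    return matches[0][len(PREFIX):]
```

Calling this on an object root and on each version directory would let a tool report the two kinds of version separately, which is the distinction the notes say is unclear.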

2.8 Too many MAYs

Seth - Is it necessary to allow both upper and lowercase in a digest? Could each have a normalized form?
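A minimal sketch of what a normalized form could look like in practice, assuming lowercase hex is chosen as the canonical form (that choice, and the function name `normalize_manifest`, are assumptions for illustration). It rewrites a manifest-style mapping of digest to paths and rejects input where the same digest appears in two different case forms.

```python
def normalize_manifest(manifest: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return a copy of `manifest` with every digest key lowercased.

    Raises ValueError if two keys differ only by case, because they
    would collapse into a single entry after normalization.
    """
    normalized: dict[str, list[str]] = {}
    for digest, paths in manifest.items():
        key = digest.lower()
        if key in normalized:
            raise ValueError(
                f"digest appears in more than one case form: {key}")
        normalized[key] = list(paths)
    return normalized
```

With a single required form, implementations could compare digests as plain strings instead of each having to handle case-insensitive matching.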

3. Next community meeting:

Recording

New Action Items


Previous Action Items