Overview - uchicago-library/ldr_documentation GitHub Wiki

Digital collections at UChicago

Intuitively, a collection is a set of objects of interest that have been curated for use by library patrons, or for potential use in exhibitions or in courses offered at the library. For example, we maintain a collection of magazines, bulletins, newsletters, circulars, etc. that have come out of UChicago dating from the 1890s to the 1960s.

A digital collection is a collection a significant portion of whose objects are in the form of computer-readable data, rather than physical objects. The collection could consist entirely of digital files—such as a webcomic—or it could consist of a physical collection plus digital scans of the physical items—like a print comic that the library wanted to make available both in its original form and digitally through the web.

Digital collections websites

Historically, our digital collections websites have been done as one-offs at different times over a span of 30+ years. Some of them are static websites, some of them are dynamic PHP applications run under Apache, some of them are Flask applications, and some of them use older/more exotic technologies.

Eventually, every one of our long-running websites reaches the point where software it crucially relies on is end-of-lifed, and the website needs to be redone if we want to continue hosting it. This is why digital collections work at the DLDC currently involves both redoing old websites and building new ones.

Our long-term goal is for all of our digital collections websites to be part of our main library website, which is a web application that uses the Wagtail CMS. The advantage of folding all of our digital collections pages into our centralized library website is that when we keep our Wagtail application up to date—a maintenance task we periodically need to perform anyway—we automatically keep most of the collections pages up to date as well, avoiding the costly work of separately maintaining hundreds of completely different web applications. Having one centralized library website also allows us to easily reuse features between different collections.

Division of labor between the back-end and the front-end

Our digital collections websites follow the model of server-side web application frameworks like Ruby on Rails, Django, and Flask, where the website is in fact an application running on the server and creating all the HTML to be viewed in the browser on the fly, as opposed to serving up static HTML files.

A large web application of that kind will typically have two components: a back-end and a front-end.

The back-end

The back-end of a digital collections website is the part of the application that is responsible for retrieving all the data that will appear in the browser when a user is clicking and scrolling through the site. Here are some examples of features that might go into a back-end interface:

  • the app will provide a listing of all the items in a collection by a particular author, along with links to the full item record for each of these
  • the app will provide a listing of all the authors whose works are present in the collection, each one containing a link to a listing of all the items in a collection by that author
  • the app will search for an item record containing a particular search string
  • etc.

Those are just a few of the features that might go on such a list, but designing the full set of back-end features means coming up with an exhaustive list of all the informational queries a user will be able to perform while using a site, whether they're the result of clicking on a browse link or typing a search term into a search box. If the information needs to come back in some kind of structured way, that will also be part of the design process. For instance, is the user going to be looking at one flat list of links, or are they going to be looking at something with hierarchical structure, like a list of lists?

In the context of the digital collections work being done at the DLDC, we use the term action to refer to the informational queries a particular back-end performs.

Once the back-end interface for a collection has been designed, or mostly designed, it falls to our digital collections back-end developers to implement it. The details of what that task entails are somewhat complicated and will be spelled out in our knowledge graphs documentation. But the end result is that the front end of the application is supplied with code it can run for every type of informational query it needs to do.
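Concretely, each such action can be thought of as a function the front end calls to get structured data back. Here is a minimal sketch; the function names, record fields, and in-memory sample data are all hypothetical stand-ins (a real back-end would run these queries against the knowledge graph described below).

```python
# A minimal sketch of back-end "actions" as plain Python functions.
# The names, fields, and in-memory data are hypothetical; in the real
# application these queries would run against the knowledge graph.

ITEMS = [
    {"id": "item-1", "title": "The Cap and Gown", "author": "Smith, Jane"},
    {"id": "item-2", "title": "Phoenix Weekly", "author": "Doe, John"},
    {"id": "item-3", "title": "Campus Circular", "author": "Smith, Jane"},
]

def items_by_author(author):
    """Action: list every item in the collection by a given author."""
    return [item for item in ITEMS if item["author"] == author]

def list_authors():
    """Action: list every author represented in the collection."""
    return sorted({item["author"] for item in ITEMS})

def search_items(term):
    """Action: find item records whose title contains a search string."""
    term = term.lower()
    return [item for item in ITEMS if term in item["title"].lower()]
```

The front end never needs to know how these functions find their answers; it only needs to know which actions exist and what shape of data each one returns.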

The front-end

The front end of one of our digital collections websites is the part of the application that is in charge of creating a user interface in the browser for the user to click through. The back-end needs to be complete and working in order for development on the front-end to begin, though it is possible for front-end developers to prepare somewhat in advance by working with dummy content.

The front-end code implements the point-and-click user interface that patrons see in the browser. It is responsible for building the graphical elements in the browser, retrieving the data it is going to display by running the back-end code, and arranging the data it gets back for display to the user. The front-end part of the software is responsible for answering questions like these:

  • when the user goes to a particular URL in the browser, what will appear on the page?
  • when the user clicks on a given link, what will happen?
  • when the user clicks on a given button, what will happen?
  • etc.
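To make the first of those questions concrete, here is a minimal Flask sketch of a front-end route: it decides what appears at a particular URL by running a back-end action and arranging the result as HTML. The route path, the stand-in action, and the inline template are all hypothetical, not code from our actual applications.

```python
# A minimal sketch of a front-end route in Flask. When the user visits
# /authors/<name>, the route runs a back-end action and renders its
# results as HTML. All names here are hypothetical stand-ins.
from flask import Flask, render_template_string

app = Flask(__name__)

def items_by_author(author):
    # Stand-in for a real back-end action, which would query the
    # knowledge graph; here we return fixed sample data.
    return [{"title": "The Cap and Gown"}, {"title": "Campus Circular"}]

@app.route("/authors/<author>")
def author_page(author):
    # Answer "what appears at this URL?": fetch the data, then lay it out.
    items = items_by_author(author)
    return render_template_string(
        "<h1>Items by {{ author }}</h1>"
        "<ul>{% for item in items %}<li>{{ item.title }}</li>{% endfor %}</ul>",
        author=author, items=items,
    )
```

The other questions on the list (what happens on a click of a link or button) are answered the same way: each link or form on the page points at another route, which runs another action and renders another page.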

Knowledge Graphs

When we are tasked with bringing either a new or an old collection into our framework, we need to migrate the metadata for that collection from its format of origin into our data format. All of the metadata for all of our digital collections are stored in a knowledge graph. We will provide a more thorough introduction to knowledge graphs in a different document, but here is a quick description of what the format accomplishes and why we are using it.

Ordinarily, when building a database whose data are formatted in the shape of a table (like a spreadsheet, or a dataframe, or a relational database), we need to make top-down decisions about how the data are going to be structured. What kinds of entities are going to exist, according to our data model? Authors, books, and publishers? Anything else? We also need to be opinionated about what kinds of properties each of these entities can have. Authors have names and lists of books that they wrote, but will they have birth and death dates? Will they have lists of languages they wrote in? Will they have lists of pen names they wrote under? Etc.

In some situations, when staff at a library reach for ready-made software to store their metadata, these questions have already been answered in advance by the team that created the software. But when a library creates its own custom database to house all its metadata, it answers these questions via deliberation about what it has or doesn't have in its collection, and what it wants patrons to be able to see, access, and navigate.

At the DLDC, we deal with large numbers of collections that originate in all sorts of digital formats, with no guarantee of any consistency between them. Because of this, we are often in the position of needing to quickly query and browse through the data we have, just to see what is there, before forming any opinions about how it should be structured. Putting our digital collections metadata in the form of a knowledge graph is what allows us to do that. Knowledge graphs are particularly permissive about the structure of the data they contain, and are designed to work even when they contain data from different sources that make inconsistent assumptions about what kinds of entities there are and what kinds of features they can have. So, effectively, storing all of our digital collections metadata in a knowledge graph lets us have it all in one format, in one place, without having to do the intensive work of normalizing the data within each incoming collection.
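The permissiveness of the graph model can be sketched with plain subject–predicate–object triples. A real deployment uses an RDF triple store rather than Python lists, and the entity names and predicates below are hypothetical; the point is that two sources with inconsistent schemas can share one structure without either having to change.

```python
# A minimal sketch of knowledge-graph-style storage as plain
# (subject, predicate, object) triples. All names are hypothetical.
triples = [
    # Collection A records a "creator" for each item...
    ("item/1", "title",   "The Cap and Gown"),
    ("item/1", "creator", "Smith, Jane"),
    # ...while collection B records an "author" and a pen name,
    # properties collection A never uses. No shared table schema
    # has to be designed before both can be stored together.
    ("item/2", "title",   "Phoenix Weekly"),
    ("item/2", "author",  "Doe, John"),
    ("item/2", "penName", "J. D."),
]

def query(subject=None, predicate=None):
    """Browse the data as-is: match triples on any combination of fields."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
    ]
```

With data in this shape we can immediately ask "what titles do we have?" or "what do we know about item/2?" across every source at once, which is exactly the kind of exploratory querying we need to do before committing to a structure.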

Data Migration

Moving the metadata from an incoming digital collection into our centralized knowledge graph is a complex process, and the most labor-intensive stage of our workflow. Every incoming collection has its metadata stored in a different format, and could require different software or computer platforms to read. Nearly always, the origin format requires complex bespoke processing before our data migration developers can look at it.

Determining exactly how to map metadata fields from the incoming collection to the rich and extensive metadata vocabulary of our knowledge graph is a long process of the developer trying one approach out, showing it to our metadata experts, finding anomalies in the output, going back to try something else, then iterating that process many times. The code is not excessively complicated to write, but what does take a long time is looking carefully over every last detail of both the source and destination formats and making sure everything is correct. There are often unforeseen challenges, even at the level of deciphering the incoming digital collection's data format, which may or may not be properly documented. To take one example, one of the collections we spent a long time working with originated as a FileMaker Pro database with primary keys that appeared to be duplicated, rendering them useless as primary keys. Figuring out how those were intended to be used required a nontrivial amount of detective work and experimentation, and it was a basic prerequisite for using the data.
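One small piece of that iterative process can be sketched as a field-mapping pass that flags anomalies for review rather than silently dropping them. The field names on both sides are hypothetical, and a real migration involves far more than renaming fields; this only illustrates the review loop described above.

```python
# A minimal sketch of one migration step: mapping fields from an
# incoming collection's export onto a target vocabulary, and surfacing
# anything unrecognized so metadata experts can review it.
# All field names here are hypothetical.
FIELD_MAP = {
    "AuthorName": "creator",
    "Headline":   "title",
    "PubDate":    "date_issued",
}

def migrate_record(source_record):
    """Map one source record's fields; collect anything unmapped."""
    migrated, anomalies = {}, []
    for field, value in source_record.items():
        if field in FIELD_MAP:
            migrated[FIELD_MAP[field]] = value
        else:
            # Unrecognized field: flag it for the experts rather
            # than silently dropping the data.
            anomalies.append(field)
    return migrated, anomalies
```

Each run of a pass like this produces a list of anomalies, which drives the next round of conversation with the metadata experts and the next revision of the mapping.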

Viewing Media

Different digital collections differ in what types of media they contain, and we use various external software packages to display any given digital item to the user in the browser. As of December 2023, we are using the following:

  • the [Universal Viewer](https://universalviewer.io/) to display high-resolution graphical scans

  • Panopto for embedding audio clips that are not open to the public
  • the embed functions of YouTube, Vimeo, and the Internet Archive for embedding videos that are available to the public

Wagtail and Flask

Our main library site is a large Wagtail web application. Wagtail is an open-source content management system that provides similar functionality to e.g. WordPress or Drupal. That is, it provides a web application that library staff can log into and create pages in using a GUI, without having to manually write HTML. Unlike its competitors, Wagtail allows for extensive customization that leverages the full power of the Python software ecosystem. Since our staff at the UChicago library have a long tradition of authoring their own content, and our software developers have a long history of customizing complex web applications, this makes Wagtail an excellent fit for our needs.

At the same time, we have hundreds of digital collections websites that we are eventually planning to incorporate into our Wagtail site, and each of these websites is highly feature-rich, which makes it challenging to update a site and incorporate it into Wagtail at the same time. For this reason, as of August 2023, we have adopted a two-stage workflow whereby we prototype each collection page as a standalone site using the Flask web framework, put it into production in that form, and then incorporate it into our Wagtail site as a separate step.

A Two-Stage Workflow

Building a Flask prototype allows us to work with the complete data associated with each collection and do agile development on the back-end and front-end together at a fairly brisk pace, determining exactly what type of user interface works for each collection. These prototypes are complicated web applications in their own right, but they are still simpler than annexing a digital collections page onto Wagtail directly, because when the development team puts a standalone website together, there is no larger entity to keep the digital collections page consistent with, which means that a large amount of the information about each collection can simply be hardcoded.

For instance, if I am building a new Flask prototype for a website featuring digital scans of a comic book, and I would like each page that displays pages from one of the books to include a breadcrumb trail, I can simply hardcode that trail with a small amount of code. In Wagtail, there would be one centralized piece of code whose job it is to build a breadcrumb trail for thousands of different pages, and if I want a new page's breadcrumb trail to do something new, I will need to extend that general-purpose code carefully, making sure not to introduce bugs into the breadcrumb trails Wagtail is already building for other pages. Developing intermediate prototypes in Flask thus allows us to postpone the fussy process of generalizing every piece of code that can be generalized to a later stage.
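The hardcoding shortcut looks roughly like this. The collection name, URL paths, and helper function are hypothetical; the point is that a standalone prototype can bake in everything it knows about its one collection, where a centralized Wagtail routine would have to derive the same trail generically for every page on the site.

```python
# A minimal sketch of the hardcoded-breadcrumb shortcut a standalone
# Flask prototype allows. Collection name and paths are hypothetical.
def breadcrumb_trail(book_title, page_number):
    """Build the trail for a page-viewing page in one specific collection."""
    # The first two crumbs can simply be hardcoded, because this
    # prototype only ever serves this one comic-book collection.
    return [
        ("Home", "/"),
        ("The Example Comic Collection", "/books/"),
        (book_title, f"/books/{book_title}/"),
        (f"Page {page_number}", None),  # current page: no link
    ]
```

A template then just loops over the returned pairs and renders a link for each crumb that has a URL. Generalizing this so it works for every collection on a large site is exactly the work the two-stage workflow defers.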

Grant Deadlines

The main thing to note about project timelines is that only the Flask prototype sites are on the grant-deadline clock. Incorporating each Flask site into our Wagtail application is a long-term project of ours that is not tied to the timeline of any of our grants.