EDTF (Extended Date Time Format) - UVicLibrary/Vault GitHub Wiki

Cheatsheet (Jump to)

How Solr Stores EDTF Dates
- Interval Dates ( / )
- Other Notation (#, X, %, ~)
Notes for Developers

EDTF, or Extended Date/Time Format, is a standardized way of expressing dates and times as a text string. From the Library of Congress website:

The Extended Date/Time Format (EDTF) was created by the Library of Congress with the participation and support of the bibliographic community as well as communities with related interests. It defines features to be supported in a date/time string, features considered useful for a wide variety of applications.

At the time of writing, Hyku (out of the box) does not support EDTF. We have integrated EDTF into Vault using the edtf Ruby gem, created by Sylvester Keil.

For example, Vault saves dates in EDTF notation (e.g. ["1913/1921"]) but displays them in humanized form to users (see image below). This conversion is handled by the edtf gem.

Human-readable label on a work's display page

We've also added the ability to sort works by title and date created on the following pages:

public collection pages - example
search results (sort results in ascending/descending order by title/date) - example
search results within a collection - example

How Solr Stores EDTF Dates

Dates are saved in Solr according to the ISO 8601 specification (see "Working with Dates") and can specify year, month, day, and time (UTC) for sorting and faceting purposes. Solr requires a month, day, and time in its notation even if the person entering the metadata specified no month or day precision.

General Principle: we sort according to the earliest possible specified date. For example, 1981 would save into Solr as 1981-01-01T00:00:00Z, or midnight on Jan. 1, 1981.
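The principle above can be sketched in plain Ruby. This is only an illustration that uses string handling instead of the edtf gem; the method name `earliest_solr_datetime` is hypothetical and not part of Vault, and it assumes the input is a plain `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` string.

```ruby
# Hypothetical helper (not part of Vault): pads a partial date such as
# "1981" or "1981-04" to its earliest possible day and appends midnight UTC.
def earliest_solr_datetime(value)
  year, month, day = value.split('-').map(&:to_i)
  # Missing month or day falls back to 1, i.e. the earliest possible value.
  format('%04d-%02d-%02dT00:00:00Z', year, month || 1, day || 1)
end

p earliest_solr_datetime('1981')       # "1981-01-01T00:00:00Z"
p earliest_solr_datetime('1981-04')    # "1981-04-01T00:00:00Z"
p earliest_solr_datetime('1981-04-15') # "1981-04-15T00:00:00Z"
```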

For single dates, as opposed to date ranges or intervals, we convert the first value in date_created_tesim to a Solr datetime string and save it to both the year_sort_dtsi and year_sort_dtsim fields.

If the date is an interval, we want to sort on the earliest date in the interval by ascending/descending order. This earliest date is saved into year_sort_dtsi. However, when we filter search results (using facets), we want to save every year in the interval into an array (see example in table below).
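As a rough illustration of how the two fields relate, the sketch below expands an interval into one Solr timestamp per year it touches and takes the first entry as the sort value. It uses plain string handling rather than the edtf gem and keeps only year precision, so Vault's own indexer may produce different endpoint values for intervals with month or day precision. The name `years_in_interval` is hypothetical.

```ruby
# Hypothetical helper (not Vault's solrize): returns midnight on Jan. 1 for
# every year an EDTF interval touches. Month and day precision is dropped.
def years_in_interval(interval)
  first_year, last_year = interval.split('/').map { |part| part[0, 4].to_i }
  (first_year..last_year).map { |year| format('%04d-01-01T00:00:00Z', year) }
end

facet_values = years_in_interval('1981/1982') # candidate for year_sort_dtsim
sort_value   = facet_values.first             # candidate for year_sort_dtsi

p facet_values # ["1981-01-01T00:00:00Z", "1982-01-01T00:00:00Z"]
p sort_value   # "1981-01-01T00:00:00Z"
```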

Note: Single quotes mean that these values are strings.

Interval Dates ( / )

date_created_tesim is entered by a user. Based on that, Vault generates year_sort_dtsi and year_sort_dtsim when it (re)indexes a work.

Description	date_created_tesim	year_sort_dtsi	year_sort_dtsim
1. Two days within the same year	[ '1981-04-01/1981-05-02' ]	'1981-04-01T00:00:00Z'	[ '1981-04-01T00:00:00Z' ]
2. Interval that straddles 2 years; no month or day precision	[ '1981/1982' ]	'1981-01-01T00:00:00Z'	[ '1981-01-01T00:00:00Z', '1982-01-01T00:00:00Z' ]
3. Interval that straddles 2 years with day precision	[ '1995-12-01/1996-03-30' ]	'1995-12-01T00:00:00Z'	[ '1995-12-01T00:00:00Z', '1996-03-30T00:00:00Z' ]
4. Interval that straddles multiple years with day precision	[ '1901-02-04/1910-12-09' ]	'1901-01-01T00:00:00Z'	[ '1901-01-01T00:00:00Z', '1902-01-01T00:00:00Z', '1903-01-01T00:00:00Z'... '1910-12-09T00:00:00Z' ]

Sorted in ascending order (oldest to newest): 4, 2, 1, 3
Sorted in descending order (newest to oldest): 3, 1, 2, 4
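Because these timestamps share one fixed-width format, comparing them as strings gives the same result as comparing them chronologically. The sketch below orders the four table rows by their year_sort_dtsi values; the hash literal is illustrative only, and Solr itself compares the values as dates rather than strings.

```ruby
# year_sort_dtsi values keyed by row number, copied from the table above.
year_sort = {
  1 => '1981-04-01T00:00:00Z',
  2 => '1981-01-01T00:00:00Z',
  3 => '1995-12-01T00:00:00Z',
  4 => '1901-01-01T00:00:00Z'
}

# Sort the rows by timestamp, then keep only the row numbers.
ascending  = year_sort.sort_by { |_row, value| value }.map(&:first)
descending = ascending.reverse

p ascending  # [4, 2, 1, 3]
p descending # [3, 1, 2, 4]
```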

Other Notation (#, X, %, ~)

X's are replaced with 0, so 19XX sorts as midnight on Jan. 1, 1900 and 195X sorts as midnight on Jan. 1, 1950.

All other notation (#, ~, ?, %) is ignored, so 1950% sorts as midnight on Jan. 1, 1950.
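A minimal sketch of this normalization, written without the edtf gem. The helper name `normalize_edtf` is hypothetical, and stripping all four qualifiers with one regular expression is an assumption made for illustration; the indexer code later on this page removes only ~ and # itself before handing the string to the edtf gem.

```ruby
# Hypothetical helper: reduces an annotated EDTF year to the value used for sorting.
def normalize_edtf(value)
  # Drop the qualifier characters, which are ignored for sorting.
  without_qualifiers = value.gsub(/[#~?%]/, '')
  # Unspecified digits (X) sort as 0.
  without_qualifiers.gsub('X', '0')
end

p normalize_edtf('19XX')  # "1900"
p normalize_edtf('195X')  # "1950"
p normalize_edtf('1950%') # "1950"
```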

Notes for Developers

Hyku saves almost every metadata field as a multiple, i.e. an array with one or more string values (see below). However, Solr can't sort on a multiple field such as a tesim field. To get around this, we've modified Vault to create three new indexing fields called title_sort_ssi, year_sort_dtsi, and year_sort_dtsim.

Note that these extra data fields will show up only in the Solr interface and not in the Rails console.

Related Files

controllers/catalog_controller - Blacklight configuration for sort fields
indexers/work_indexer - where Vault actually creates and indexes those fields

Sorting by Title

Solr sorts alphabetically on title_sort_ssi, which is a duplicate of the first value in title_tesim. title_sort_ssi is a dynamic field with a single string value.

In work_indexer#generate_solr_document:

solr_doc['title_sort_ssi'] = object.title.first unless object.title.first.nil?

Code Walkthrough

First we handle the special characters and check whether date_created_tesim is a single date or an interval.

      if solr_doc['date_created_tesim']
        # Strip ~ and # and replace X with 0 before parsing; see https://github.com/UVicLibrary/Vault/issues/36
        raw_date = solr_doc['date_created_tesim'].first
        date = Date.edtf(raw_date.gsub(/~|#/, '').gsub('X', '0'))
        if date.is_a?(EDTF::Interval)
          # Interval: index the values returned by solrize and sort on the first one
          solr_doc['year_sort_dtsim'] = solrize(date)
          solr_doc['year_sort_dtsi'] = solrize(date).first
        else # single Date
          solr_doc['year_sort_dtsim'] = solr_string(date)
          solr_doc['year_sort_dtsi'] = solr_string(date)
        end
      end

Two methods in work_indexer do the heavy lifting:

solr_string converts an EDTF date to a Solr-formatted datetime string.
solrize converts an EDTF::Interval into an array of Solr-formatted datetime strings; the first element is the earliest date in the interval.

  # Returns formatted string with time set to midnight; e.g. Wed, 01 Jan 1913 => "1913-01-01T00:00:00Z"
  # https://lucene.apache.org/solr/guide/7_7/working-with-dates.html
  def solr_string(edtf_date)
    date_time = edtf_date.beginning_of_day.to_s.split(" ") - ["UTC"] # => ["1913-01-01", "00:00:00"]
    "#{date_time[0]}T#{date_time[1]}Z"
  end
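For comparison, the same string can be built with `Date#strftime` from Ruby's standard library, which does not depend on how the time zone is rendered by `to_s`. This is an alternative sketch, not the code Vault uses, and `solr_string_strftime` is a hypothetical name.

```ruby
require 'date'

# Hypothetical alternative to solr_string: formats a Date as midnight UTC.
# The time portion is a literal, so no time zone conversion takes place.
def solr_string_strftime(date)
  date.strftime('%Y-%m-%dT00:00:00Z')
end

p solr_string_strftime(Date.new(1913, 1, 1)) # "1913-01-01T00:00:00Z"
```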