Transform Guidance - uwlib-cams/MARC2RDA GitHub Wiki

Transform Guidance

:construction::construction: This documentation is not up to date. :construction::construction:

The transform code is designed to take MARC/XML and output RDA in RDF/XML. It operates through field-by-field templates and custom functions, using modes to construct rdf:Description elements for each entity.

Organization and files

The transform code is divided into three main layers.

m2r.xsl contains templates which process the file that contains the MARC records and then each individual record.
Templates for processing individual fields are split by field number into files 0xx.xsl, 1xx.xsl, 2xx.xsl etc.
Each of these files has an associated file that ends in -named.xsl. These contain named templates that process the field or its subfield under specific circumstances.

m2r.xsl

<xsl:template match="/">
- matches root of input document
- applies templates: marc:collection
<xsl:template match="marc:collection">
- matches marc:collection
- creates output root rdf:RDF with namespaces
- applies templates: marc:record
<xsl:template match="marc:record">
- matches marc:record
- creates rdf:Description nodes for each RDA entity (Work, Expression, and Manifestation)
  - assigns IRIs for each entity
  - creates relationships between these entities
- applies field-specific templates using modes (see the Modes section)
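The outline above can be sketched as a minimal stylesheet. This is a simplified illustration rather than the actual contents of m2r.xsl: it omits the rdf:type triples and the relationships between entities, and the empty fallback template at the end is an assumption added so that the sketch produces no stray text when no field templates are present.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    exclude-result-prefixes="marc">

    <!-- Testing base IRI, as given in the "Generating IRIs" section -->
    <xsl:variable name="base" select="'http://fakeIRI2.edu/'"/>

    <!-- Root of the input document: hand off to the collection template -->
    <xsl:template match="/">
        <xsl:apply-templates select="marc:collection"/>
    </xsl:template>

    <!-- Create the output root, then process each record -->
    <xsl:template match="marc:collection">
        <rdf:RDF>
            <xsl:apply-templates select="marc:record"/>
        </rdf:RDF>
    </xsl:template>

    <!-- One rdf:Description per entity; field templates fill in the properties -->
    <xsl:template match="marc:record">
        <xsl:variable name="baseIRI" select="concat($base, marc:controlfield[@tag = '001'])"/>
        <rdf:Description rdf:about="{concat($baseIRI, 'wor')}">
            <xsl:apply-templates select="*" mode="wor"/>
        </rdf:Description>
        <rdf:Description rdf:about="{concat($baseIRI, 'exp')}">
            <xsl:apply-templates select="*" mode="exp"/>
        </rdf:Description>
        <rdf:Description rdf:about="{concat($baseIRI, 'man')}">
            <xsl:apply-templates select="*" mode="man"/>
        </rdf:Description>
    </xsl:template>

    <!-- Assumption of this sketch: fields with no template in a mode produce no output -->
    <xsl:template match="node()" mode="wor exp man"/>
</xsl:stylesheet>
```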

0xx.xsl, 1xx.xsl, 2xx.xsl, etc.

These files contain field-by-field templates. Each template within these files matches on a specific field and in a specific mode. Within each file, templates are organized numerically by field number.

Templates may appear similar to each other, but fields are generally kept separate to make the code more easily understandable.

Modes

A template's mode determines where the resulting RDA properties appear within the output.

<xsl:apply-templates select="*" mode="wor"/>

The "wor", "exp", and "man" modes mean that those properties will appear within the rdf:Description for that entity.
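As a rough illustration of how a mode places a property, the sketch below shows a field template that fires only when templates are applied with mode="man", so its output lands inside the Manifestation's rdf:Description. The choice of field 245 $a and the ex: property element are hypothetical placeholders, not the project's actual mapping.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:ex="http://example.org/hypothetical-properties/"
    exclude-result-prefixes="marc">

    <!-- Runs only in mode="man"; the same field could have other templates in "wor" or "exp" -->
    <xsl:template match="marc:datafield[@tag = '245']" mode="man">
        <xsl:for-each select="marc:subfield[@code = 'a']">
            <!-- ex:hypotheticalTitleProperty stands in for a real RDA property -->
            <ex:hypotheticalTitleProperty>
                <xsl:value-of select="normalize-space(.)"/>
            </ex:hypotheticalTitleProperty>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```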

Additional RDA entities are minted only as necessary. The templates for them are called outside the main WEM rdf:Description elements and include the modes "ite", "nom", "metaWor", and "age" for Items, Nomens, Metadata Works, and Agents, respectively.

<!-- *****NOMENS***** -->
<xsl:apply-templates select="*" mode="nom">
    <xsl:with-param name="baseIRI" select="$baseIRI"/>
</xsl:apply-templates>

xxx-named.xsl files

The -named files contain templates that are called from within the field-by-field templates.

Field templates call named templates for more complex handling or when there are multiple scenarios.

Template names

These templates are named following a specific pattern:

F{field number}-{ind1 value}{ind2 value}-{relevant subfields}

Each starts with F + field number e.g. F336
The next characters are for the indicators. If the template is only called for a specific indicator value, that value is given here; otherwise x is used.
- e.g. F561-0x is called when the 1st indicator is 0.
The last characters indicate specific subfields.
- F526-xx-iabcdz5 indicates that this template handles the listed subfields and that they are concatenated into one RDA property
- F264-x3-a_b_c indicates that this template handles the listed subfields but each subfield maps to a separate RDA property
Some template names may end in '-iri' or '-string', indicating how the data is being handled.

These naming conventions do not account for all possible cases that may be encountered, but are intended to act as guidelines so that the purpose of a template can be more easily understood by anyone looking at the code.
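To show how the naming pattern reads in practice, here is a minimal sketch of a field template calling a named template. The name F561-0x comes from the example above; the template body, the mode, and the ex: property element are hypothetical and only illustrate the calling pattern.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:ex="http://example.org/hypothetical-properties/"
    exclude-result-prefixes="marc">

    <!-- Field template: decides which named template applies -->
    <xsl:template match="marc:datafield[@tag = '561']" mode="ite">
        <xsl:if test="@ind1 = '0'">
            <xsl:call-template name="F561-0x"/>
        </xsl:if>
    </xsl:template>

    <!-- Named template: F + field 561, 1st indicator 0, any 2nd indicator -->
    <!-- xsl:call-template keeps the context node, so this still sees the 561 field -->
    <xsl:template name="F561-0x">
        <ex:hypotheticalProperty>
            <xsl:value-of select="marc:subfield[@code = 'a']" separator=" "/>
        </ex:hypotheticalProperty>
    </xsl:template>
</xsl:stylesheet>
```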

Functions and special templates

m2r-functions.xsl

m2r-functions.xsl contains custom functions that are repeatedly used across the transform code.

These include functions for handling $2 and $5 subfields that retrieve IRIs based on MARC codes.

Detailed documentation is available here.

m2r-relators.xsl

m2r-relators.xsl contains templates and functions specific to handling relator relationships.

Currently this covers the 100, 110, 111, 700, 710, 711, and 720 fields, which are similar enough to be handled with more general templates and functions.

Relator relationships are determined based on the relator table, which is currently located in lookup/relatorTable-2024-05-15.xml

Detailed documentation is available here.

Test files

Test MARC/XML files and the resulting output are located in the following folders:

input - general MARC/XML test files
lookup - lookup files used by m2r-functions.xsl and m2r-relators.xsl
marcDatasets - larger MARC/XML files
output - general test output
outputDataForReview - output to be reviewed by the team
test_input - field-by-field tests
test_output - field-by-field test output

getmarc.xsl and appendLabels.xsl

These two files are used to make testing and viewing transform output easier by adding comments to the output which show the MARC input and the RDA property labels.

getmarc.xsl is called within each field template, while appendLabels.xsl can be run on the initial transform output to produce a second output that includes RDA property labels.

Note: lexicalAliases.xsl is not currently in use.

m2r-$5.xsl

This file is used to produce Lookups/$5-preprocessedRDA.xml, a pre-generated list of IRIs for LoC's MARC Code List for Organizations. The list is used to perform lookups based on $5. See the decision on $5 in the Decisions Index.

Minting Entities

RDA Classes

rdf:Description elements describing RDA Entities should include an rdf:type triple with the IRI for that class as object.

<rdf:type rdf:resource="http://rdaregistry.info/Elements/c/C10001"/>

RDA Classes are available here.

Generating IRIs

Works, Expressions, Manifestations

At this stage, while we are doing testing, the base IRI we are using is "http://fakeIRI2.edu/". For each MARC record, an IRI for the associated Work, Expression, and Manifestation is generated in m2r.xsl by concatenating this base IRI, the record's control number, and 'wor', 'exp', or 'man'.

<xsl:variable name="baseIRI" select="concat($base, marc:controlfield[@tag = '001'])"/>

<rdf:Description rdf:about="{concat($baseIRI,'wor')}">

Items

To ensure a unique IRI is generated for each item, the IRI is built from the base + control number + 'ite' + an ID produced by XSLT's generate-id() function.

generate-id() returns a unique ID for the node currently being processed. The value stays the same for that node throughout the transformation, but is not guaranteed to be the same when the transformation is run again.

<xsl:variable name="genID" select="generate-id()"/>

<rdf:Description rdf:about="{concat($baseIRI,'ite',$genID)}">

The item IRI is generated within the template for the field that identifies this item, so generate-id() is generating an ID for that specific instance of the field within the input MARC record.

Nomens

Nomen IRIs also use generate-id() and begin with 'http://marc2rda.edu/fake/nom/'.

<rdf:Description rdf:about="{'http://marc2rda.edu/fake/nom/'||generate-id()}">

Agents

Still in draft format - see detailed documentation on m2r-relators.xsl

Concepts

skos:Concepts are minted for vocabularies, subject headings, and classification numbers. IRIs are generated based on the source provided in the field's subfield $2 and the term used as the skos:prefLabel.

<rdf:Description rdf:about="{'http://marc2rda.edu/fake/concept/'||encode-for-uri(lower-case($scheme))||'/'||encode-for-uri(translate(lower-case($value), ' ', ''))}">

This is done using the function uwf:conceptIRI() located in m2r-functions.xsl.
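For readers who want to see the expression packaged as a function, the following is a sketch of what such a function could look like. It is not copied from m2r-functions.xsl: the parameter names and types are assumptions, and the namespace URI bound to the uwf prefix here is a placeholder.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:uwf="http://example.org/placeholder/uwf"
    exclude-result-prefixes="xs uwf">

    <!-- Placeholder namespace URI above; parameter names and types are assumed -->
    <xsl:function name="uwf:conceptIRI" as="xs:string">
        <xsl:param name="scheme" as="xs:string"/>
        <xsl:param name="value" as="xs:string"/>
        <!-- Lower-case both parts, strip spaces from the term, then percent-encode -->
        <xsl:sequence select="'http://marc2rda.edu/fake/concept/'
            || encode-for-uri(lower-case($scheme)) || '/'
            || encode-for-uri(translate(lower-case($value), ' ', ''))"/>
    </xsl:function>
</xsl:stylesheet>
```

With this sketch, uwf:conceptIRI('rdacontent', 'Spoken Word') would return http://marc2rda.edu/fake/concept/rdacontent/spokenword. Because the IRI depends only on the scheme and the term, the same term from the same source yields the same concept IRI wherever it appears.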

Metadata Works - see 583, 526

Currently, Metadata Works are created for private fields and for nonpublic notes. If more cases are added, this may need to be revisited.

For Items: Both the Item's rdf:Description and the Metadata Work's rdf:Description are generated within the field match template and named templates. See 583 for an example.

For Works, Expressions, and Manifestations: Because the field template is called within the rdf:Description for the entity, the field match template needs to be called twice, once with the mode for that entity and once with mode="metaWor" to create the rdf:Description for the Metadata Work. See 526 for an example.

Metadata Work IRIs

Metadata works also need to have unique IRIs.

The same IRI needs to be generated from multiple templates, because the metadata work is described both within the associated WEMI rdf:Description and within its own rdf:Description.

We currently use "http://marc2rda.edu/fake/MetaWor/" + generate-id(). As stated above, generate-id() returns a unique ID for the node currently being processed. The value stays the same for that node throughout the transformation, but is not guaranteed to be the same when the transformation is run again.

This works because there will always be a unique node associated with each metadata work. For a private field, the metadata work is associated with that field's node. For a nonpublic note, the metadata work is associated with the nonpublic note subfield's node.

To get the same generated ID in multiple templates, the context (that is, the node being processed) must be the same in each of them. For a nonpublic note, for example, we need to use a for-each to enter the subfield's context instead of accessing the subfield from the field's context.
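The context requirement can be sketched with two templates for the same field. This is an illustration under assumptions, not the project's 526 code: it treats 526 $x as the nonpublic note, and both ex: property elements are hypothetical placeholders. The point is that each template enters the subfield with xsl:for-each before calling generate-id(), so both produce the same IRI.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/hypothetical-properties/"
    exclude-result-prefixes="marc">

    <!-- Called inside the Manifestation description: point to the metadata work -->
    <xsl:template match="marc:datafield[@tag = '526']" mode="man">
        <xsl:for-each select="marc:subfield[@code = 'x']">
            <!-- Context node is now the $x subfield -->
            <ex:hypotheticalLinkToMetadataWork
                rdf:resource="{'http://marc2rda.edu/fake/MetaWor/' || generate-id()}"/>
        </xsl:for-each>
    </xsl:template>

    <!-- Called outside the WEM descriptions: describe the metadata work itself -->
    <xsl:template match="marc:datafield[@tag = '526']" mode="metaWor">
        <xsl:for-each select="marc:subfield[@code = 'x']">
            <!-- Same context node as above, so generate-id() returns the same value -->
            <rdf:Description rdf:about="{'http://marc2rda.edu/fake/MetaWor/' || generate-id()}">
                <ex:hypotheticalNoteProperty>
                    <xsl:value-of select="."/>
                </ex:hypotheticalNoteProperty>
            </rdf:Description>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Had the first template called generate-id() from the field's context and the second from the subfield's context, the two IRIs would differ and the link would point at nothing.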

$6, 880, and item templates - see 561

For each field template, there are three possible matches:

The original field
A linked 880 field with a matching occurrence number
An unlinked 880 field with a 00 occurrence number

Currently, we mint a new item for each field that indicates the existence of an item in order to prevent wrongly assuming two items are the same (Decision I.C.1).

For fields that indicate the existence of an item that also have a $6 subfield, we know that the associated 880 field is referring to the same object as the original field.

Therefore, each item template for a field should match on the field, as well as any 880s with a $6 value of [field]-00. Then, if the field has a $6, match to the associated 880 field within the template using the occurrence number, and perform the same mapping within the template so that this field is linked to the same item. See 561 for an example.

For Works, Expressions, and Manifestations, these extra steps are not needed, because the field templates are called from within the rdf:Description for that entity. Each field template should simply match on that field tag and 880s with $6 values that begin with that field tag.
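The item matching logic described above might be sketched as follows. This is an illustration under assumptions rather than the project's 561 template: the named template and the ex: property are hypothetical, and the occurrence number is read as characters 5 and 6 of $6 (for example "01" in "880-01").

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/hypothetical-properties/"
    exclude-result-prefixes="marc">

    <!-- Matches the 561 itself, plus unlinked 880s whose $6 starts with 561-00 -->
    <xsl:template mode="ite"
        match="marc:datafield[@tag = '561']
             | marc:datafield[@tag = '880'][starts-with(marc:subfield[@code = '6'][1], '561-00')]">
        <xsl:param name="baseIRI"/>
        <!-- Occurrence number of this field's $6; empty string when there is no $6 -->
        <xsl:variable name="occNum" select="substring(marc:subfield[@code = '6'][1], 5, 2)"/>
        <rdf:Description rdf:about="{concat($baseIRI, 'ite', generate-id())}">
            <xsl:call-template name="hypothetical-561-mapping"/>
            <!-- A 561 linked to an 880: map the 880 into the same item description -->
            <xsl:if test="@tag = '561' and $occNum != '' and $occNum != '00'">
                <xsl:for-each select="../marc:datafield[@tag = '880']
                        [starts-with(marc:subfield[@code = '6'][1], concat('561-', $occNum))]">
                    <xsl:call-template name="hypothetical-561-mapping"/>
                </xsl:for-each>
            </xsl:if>
        </rdf:Description>
    </xsl:template>

    <!-- Hypothetical stand-in for the field's real mapping -->
    <xsl:template name="hypothetical-561-mapping">
        <ex:hypotheticalProperty>
            <xsl:value-of select="marc:subfield[@code = 'a']" separator=" "/>
        </ex:hypotheticalProperty>
    </xsl:template>
</xsl:stylesheet>
```

Linked 880s are deliberately left out of the match pattern, so they are only reached through their original field and never mint a second item.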

Handling 3XX fields - see 337

Lookups with RDA vocabularies

When vocabulary terms are from RDA vocabularies or LC's cloned RDA vocabularies, the term's IRI can be looked up and retrieved to be used as the object of the attribute property. The functions uwf:rdaTermLookup and uwf:rdaCodeLookup located in m2r-functions.xsl are used to do this.

These functions take the scheme code from the subfield $2 in the field and perform a lookup in lookups/rdaVocabularies.xml, matching the scheme code with the document that will contain the IRI for that vocabulary term or code.

A second lookup is then done in that document for the IRI associated with that term or code, and if it is found, the IRI is returned.
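The two-stage lookup can be sketched with inline stand-in data. Everything here is hypothetical: the function name ex:termLookup, the element and attribute names, and the example.org IRIs are invented for illustration, and the real functions read from lookup files whose structure is not shown on this page.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/hypothetical/"
    exclude-result-prefixes="xs ex">

    <!-- Stand-in for the first lookup: $2 scheme code to vocabulary document -->
    <xsl:variable name="schemes">
        <scheme code="examplescheme" doc="exampleVocabulary"/>
    </xsl:variable>

    <!-- Stand-in for one vocabulary document: term label to IRI -->
    <xsl:variable name="vocabularies">
        <vocabulary id="exampleVocabulary">
            <term label="example term" iri="http://example.org/vocab/1001"/>
        </vocabulary>
    </xsl:variable>

    <!-- Returns the term's IRI, or the empty sequence when either lookup fails -->
    <xsl:function name="ex:termLookup" as="xs:string?">
        <xsl:param name="scheme" as="xs:string"/>
        <xsl:param name="term" as="xs:string"/>
        <!-- First lookup: which vocabulary does this scheme code point to? -->
        <xsl:variable name="docId" select="$schemes/scheme[@code = lower-case($scheme)]/@doc"/>
        <!-- Second lookup: the IRI for the term within that vocabulary -->
        <xsl:sequence select="($vocabularies/vocabulary[@id = $docId]
            /term[@label = lower-case($term)]/@iri)[1]/string()"/>
    </xsl:function>
</xsl:stylesheet>
```

Returning the empty sequence on a miss lets the caller test for a result and fall back to another approach, such as minting a concept, when no IRI is found.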

Minting concepts

When a $2 is present that is not an RDA vocabulary, we mint a skos:Concept for the term.

The function uwf:fillConcept() is used inside the <rdf:Description> for the concept, and has params for the prefLabel, scheme, notation, and field number. If any of these values is not present, an empty string can be passed for that param.

uwf:fillConcept() will look up the scheme code provided in id.loc.gov's various scheme and code lists and attempt to match it with an IRI that can be used as the value of skos:inScheme.

When an 880 is linked with the field

When the field has a linked 880 and results in a concept being minted, uwf:fillConcept() can be called again within the same rdf:Description, this time with the 880's skos:prefLabel and skos:notation values. The fieldNum param should be '880' to prevent an additional rdf:type triple from being added to the rdf:Description.