Transform Guidance - uwlib-cams/MARC2RDA GitHub Wiki
Transform Guidance
The transform code is designed to take MARC/XML and output RDA in RDF/XML. It operates through field-by-field templates and custom functions, using modes to construct rdf:Description elements for each entity.
Organization and files
The transform code is divided into three main layers.
- m2r.xsl contains templates which process the file that contains the MARC records and then each individual record.
- Templates for processing individual fields are split by field number into files 0xx.xsl, 1xx.xsl, 2xx.xsl etc.
- Each of these files has an associated file that ends in -named.xsl. These contain named templates that process the field or its subfields under specific circumstances.
m2r.xsl
- <xsl:template match="/">
  - matches root of input document
  - applies templates: marc:collection
- <xsl:template match="marc:collection">
  - matches marc:collection
  - creates output root rdf:RDF with namespaces
  - applies templates: marc:record
- <xsl:template match="marc:record">
  - matches marc:record
  - creates rdf:Description nodes for each RDA entity (Work, Expression, and Manifestation)
  - assigns IRIs for each entity
  - creates relationships between these entities
  - applies field-specific templates using [modes](link here)
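As a rough sketch of how these three templates fit together (not the actual contents of m2r.xsl), the stylesheet below follows the outline above. The RDA namespaces, the relationship triples between entities, and all field templates are left out. The final empty template is an assumption added here so that fields with no matching template produce no output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    exclude-result-prefixes="marc">

    <xsl:output method="xml" indent="yes"/>

    <!-- Test base IRI quoted under "Generating IRIs"; $base is the name used in that snippet -->
    <xsl:param name="base" select="'http://fakeIRI2.edu/'"/>

    <!-- Root of the input document: hand off to the collection -->
    <xsl:template match="/">
        <xsl:apply-templates select="marc:collection"/>
    </xsl:template>

    <!-- The collection becomes the rdf:RDF output root -->
    <xsl:template match="marc:collection">
        <rdf:RDF>
            <xsl:apply-templates select="marc:record"/>
        </rdf:RDF>
    </xsl:template>

    <!-- One rdf:Description per WEM entity; each applies the field templates in its own mode -->
    <xsl:template match="marc:record">
        <xsl:variable name="baseIRI" select="concat($base, marc:controlfield[@tag = '001'])"/>
        <rdf:Description rdf:about="{concat($baseIRI, 'wor')}">
            <xsl:apply-templates select="*" mode="wor"/>
        </rdf:Description>
        <rdf:Description rdf:about="{concat($baseIRI, 'exp')}">
            <xsl:apply-templates select="*" mode="exp"/>
        </rdf:Description>
        <rdf:Description rdf:about="{concat($baseIRI, 'man')}">
            <xsl:apply-templates select="*" mode="man"/>
        </rdf:Description>
    </xsl:template>

    <!-- Assumption: fields with no template in a given mode are silently skipped -->
    <xsl:template match="*" mode="wor exp man"/>
</xsl:stylesheet>
```

For a record whose 001 control field is 12345, this sketch emits three rdf:Description elements with the IRIs http://fakeIRI2.edu/12345wor, http://fakeIRI2.edu/12345exp, and http://fakeIRI2.edu/12345man.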
0xx.xsl, 1xx.xsl, 2xx.xsl, etc.
These files contain field-by-field templates. Each template within these files matches on a specific field and in a specific mode. Within each file, templates are organized numerically by field number.
Templates may appear similar to each other, but fields are generally kept separate to make the code more easily understandable.
Modes
A template's mode determines where the resulting RDA properties appear within the output.
<xsl:apply-templates select="*" mode="wor"/>
The "wor", "exp", and "man" modes mean that those properties will appear within the rdf:Description for that entity.
Additional RDA entities are minted only as necessary. Their templates are applied outside the main WEM rdf:Description elements and use the modes "ite", "nom", "metaWor", and "age" for Items, Nomens, Metadata Works, and Agents respectively.
<!-- *****NOMENS***** -->
<xsl:apply-templates select="*" mode="nom">
<xsl:with-param name="baseIRI" select="$baseIRI"/>
</xsl:apply-templates>
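The difference between the WEM modes and the additional-entity modes can be sketched as follows. This is an illustration only: the choice of field 245, the `ex:` namespace, and both property names are hypothetical placeholders, and the driver templates are a simplified stand-in for m2r.xsl. The template in mode "man" emits only a property, while the template in mode "nom" emits a complete rdf:Description of its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/hypothetical/"
    exclude-result-prefixes="marc">

    <xsl:output method="xml" indent="yes"/>

    <!-- Simplified driver standing in for m2r.xsl -->
    <xsl:template match="/">
        <rdf:RDF>
            <xsl:apply-templates select="//marc:record"/>
        </rdf:RDF>
    </xsl:template>

    <xsl:template match="marc:record">
        <rdf:Description rdf:about="{concat('http://fakeIRI2.edu/', marc:controlfield[@tag = '001'], 'man')}">
            <!-- Properties produced in mode "man" land inside this element -->
            <xsl:apply-templates select="*" mode="man"/>
        </rdf:Description>
        <!-- Mode "nom" is applied outside the WEM elements -->
        <xsl:apply-templates select="*" mode="nom"/>
    </xsl:template>

    <!-- Mode "man": emits a single property pointing at the nomen (hypothetical property) -->
    <xsl:template match="marc:datafield[@tag = '245']" mode="man">
        <ex:hypotheticalNomenLink rdf:resource="{'http://marc2rda.edu/fake/nom/' || generate-id()}"/>
    </xsl:template>

    <!-- Mode "nom": emits a whole rdf:Description for the nomen (hypothetical property) -->
    <xsl:template match="marc:datafield[@tag = '245']" mode="nom">
        <rdf:Description rdf:about="{'http://marc2rda.edu/fake/nom/' || generate-id()}">
            <ex:hypotheticalNomenString>
                <xsl:value-of select="marc:subfield[@code = 'a']"/>
            </ex:hypotheticalNomenString>
        </rdf:Description>
    </xsl:template>

    <!-- Assumption: fields with no template in a given mode are silently skipped -->
    <xsl:template match="*" mode="man nom"/>
</xsl:stylesheet>
```

Both templates run with the same 245 node as context, so generate-id() returns the same value in each and the two IRIs agree.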
xxx-named.xsl files
The -named files contain templates that are called from within the field-by-field templates.
Field templates call named templates for more complex handling or when there are multiple scenarios.
Template names
These templates are named following a specific pattern:
F{field number}-{ind1 value}{ind2 value}-{relevant subfields}
- Each starts with F + field number, e.g. F336
- The next characters are for the indicators. If the template is only called for a specific indicator value, specify that here.
  - e.g. F561-0x is called when the 1st indicator is 0.
- The last characters indicate specific subfields.
  - F526-xx-iabcdz5 indicates that this template handles the listed subfields and that they are concatenated into one RDA property
  - F264-x3-a_b_c indicates that this template handles the listed subfields but each subfield maps to a separate RDA property
- Some template names may end in '-iri' or '-string', indicating how the data is being handled.
These naming conventions do not account for all possible cases that may be encountered, but are intended to act as guidelines so that the purpose of a template can be more easily understood by anyone looking at the code.
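A minimal sketch of how field templates might call named templates that follow this pattern is shown below. The template names F264-x3-a_b_c and F526-xx-iabcdz5 come from the examples above, reading "x" as "no specific indicator value". Everything else is an assumption made for illustration: the template bodies, the mode, the `ex:` property names, the space used as a separator, and the placeholder rdf:about value.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/hypothetical/"
    exclude-result-prefixes="marc">

    <xsl:output method="xml" indent="yes"/>

    <!-- Simplified driver: one placeholder rdf:Description for all fields in the input -->
    <xsl:template match="/">
        <rdf:RDF>
            <rdf:Description rdf:about="http://fakeIRI2.edu/placeholder-man">
                <xsl:apply-templates select="//marc:datafield" mode="man"/>
            </rdf:Description>
        </rdf:RDF>
    </xsl:template>

    <!-- Field template: calls the named template only when the 2nd indicator is 3 -->
    <xsl:template match="marc:datafield[@tag = '264']" mode="man">
        <xsl:if test="@ind2 = '3'">
            <xsl:call-template name="F264-x3-a_b_c"/>
        </xsl:if>
    </xsl:template>

    <!-- Field template: no indicator condition, so the name uses "xx" -->
    <xsl:template match="marc:datafield[@tag = '526']" mode="man">
        <xsl:call-template name="F526-xx-iabcdz5"/>
    </xsl:template>

    <!-- Underscores in the name: each subfield maps to its own property -->
    <xsl:template name="F264-x3-a_b_c">
        <xsl:for-each select="marc:subfield[@code = 'a']">
            <ex:hypotheticalPlace><xsl:value-of select="."/></ex:hypotheticalPlace>
        </xsl:for-each>
        <xsl:for-each select="marc:subfield[@code = 'b']">
            <ex:hypotheticalName><xsl:value-of select="."/></ex:hypotheticalName>
        </xsl:for-each>
        <xsl:for-each select="marc:subfield[@code = 'c']">
            <ex:hypotheticalDate><xsl:value-of select="."/></ex:hypotheticalDate>
        </xsl:for-each>
    </xsl:template>

    <!-- No underscores in the name: the listed subfields are joined into one property value -->
    <xsl:template name="F526-xx-iabcdz5">
        <ex:hypotheticalNote>
            <xsl:value-of select="marc:subfield[@code = ('i', 'a', 'b', 'c', 'd', 'z', '5')]" separator=" "/>
        </ex:hypotheticalNote>
    </xsl:template>

    <!-- Assumption: fields with no template in this mode are silently skipped -->
    <xsl:template match="*" mode="man"/>
</xsl:stylesheet>
```

Because xsl:call-template does not change the context node, each named template still sees the datafield that the calling field template matched.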
Functions and special templates
m2r-functions.xsl
m2r-functions.xsl contains custom functions that are repeatedly used across the transform code.
These include functions for handling $2 and $5 subfields that retrieve IRIs based on MARC codes.
Detailed documentation is available here.
m2r-relators.xsl
m2r-relators.xsl contains templates and functions specific to handling relator relationships.
Currently this covers the 100, 110, 111, 700, 710, 711, and 720 fields, which are similar enough to be handled with more general templates and functions.
Relator relationships are determined based on the relator table, which is currently located in lookup/relatorTable-2024-05-15.xml.
Detailed documentation is available here.
Test files
Test MARC/XML files and the resulting output are located in the following folders:
- input - general MARC/XML test files
- lookup - lookup files used by m2r-functions.xsl and m2r-relators.xsl
- marcDatasets - larger MARC/XML files
- output - general test output
- outputDataForReview - output to be reviewed by the team
- test_input - field-by-field tests
- test_output - field-by-field test output
getmarc.xsl and appendLabels.xsl
These two files are used to make testing and viewing transform output easier by adding comments to the output which show the MARC input and the RDA property labels.
getmarc.xsl is called within each field template, while appendLabels.xsl can be run on the initial transform output to produce a second output that includes RDA property labels.
lexicalAliases.xsl is not currently in use.
m2r-$5.xsl
This file is used to produce Lookups/$5-preprocessedRDA.xml, a pre-generated list of IRIs for LoC's MARC Code List for Organizations that is used for lookups based on $5. See the decision on $5 in the Decisions Index.
Minting Entities
RDA Classes
rdf:Description elements describing RDA Entities should include an rdf:type triple with the IRI for that class as the object.
<rdf:type rdf:resource="http://rdaregistry.info/Elements/c/C10001"/>
RDA Classes are available here.
Generating IRIs
Works, Expressions, Manifestations
At this stage, while we are doing testing, the base IRI we are using is "http://fakeIRI2.edu/". For each MARC record, an IRI for the associated Work, Expression, and Manifestation is generated in m2r.xsl by concatenating this base IRI, the record's control number, and 'wor', 'exp', or 'man'.
<xsl:variable name="baseIRI" select="concat($base, marc:controlfield[@tag = '001'])"/>
<rdf:Description rdf:about="{concat($baseIRI,'wor')}">
Items
To ensure that a unique IRI is generated for each item, the IRI is built from the base + control number + 'ite' + an ID produced by XSLT's generate-id().
This is a unique ID for the node currently being processed. It stays the same for that node throughout a single transformation, but is not guaranteed to be the same when the transformation is run again.
<xsl:variable name="genID" select="generate-id()"/>
<rdf:Description rdf:about="{concat($baseIRI,'ite',$genID)}">
The item IRI is generated within the template for the field that identifies this item, so generate-id() produces an ID for that specific instance of the field within the input MARC record.
Nomens
Nomen IRIs also use generate-id() and begin with 'http://marc2rda.edu/fake/nom/'.
<rdf:Description rdf:about="{'http://marc2rda.edu/fake/nom/'||generate-id()}">
Agents
Still in draft format - see detailed documentation on m2r-relators.xsl
Concepts
skos:Concepts are minted for vocabularies, subject headings, and classification numbers. IRIs are generated from the source given in the field's subfield $2 and the term used as the skos:prefLabel.
<rdf:Description rdf:about="{'http://marc2rda.edu/fake/concept/'||encode-for-uri(lower-case($scheme))||'/'||encode-for-uri(translate(lower-case($value), ' ', ''))}">
This is done using the function uwf:conceptIRI(), located in m2r-functions.xsl.
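To make the IRI expression above easier to try out, here is a minimal stand-alone sketch. The function ex:conceptIRI is a hypothetical stand-in written for this example. It is not the real uwf:conceptIRI(), whose signature is not shown on this page, and the parameter names and sample values are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/hypothetical/"
    exclude-result-prefixes="xs ex">

    <xsl:output method="text"/>

    <!-- Hypothetical stand-in: builds a concept IRI from a scheme code and a term -->
    <xsl:function name="ex:conceptIRI" as="xs:string">
        <xsl:param name="scheme" as="xs:string"/>
        <xsl:param name="value" as="xs:string"/>
        <!-- Lower-case both parts, strip spaces from the term, then percent-encode -->
        <xsl:sequence select="
            'http://marc2rda.edu/fake/concept/'
            || encode-for-uri(lower-case($scheme))
            || '/'
            || encode-for-uri(translate(lower-case($value), ' ', ''))"/>
    </xsl:function>

    <!-- Runs against any XML input and prints one sample IRI -->
    <xsl:template match="/">
        <xsl:value-of select="ex:conceptIRI('lcsh', 'Map Cataloging')"/>
    </xsl:template>
</xsl:stylesheet>
```

With the sample values, the output is http://marc2rda.edu/fake/concept/lcsh/mapcataloging. One consequence of lower-casing and removing spaces is that terms differing only in case or spacing within the same scheme resolve to the same concept IRI.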
Metadata Works - see 583, 526
Currently, Metadata Works are created for private fields and for nonpublic notes. If more cases are added, this may need to be revisited.
For Items: Both the Item's rdf:Description and the Metadata Work's rdf:Description are generated within the field match template and named templates. See 583 for an example.
For Works, Expressions, and Manifestations:
Because the field template is called within the rdf:Description for the entity, the field match template needs to be called twice, once with the mode for that entity and once with mode="metaWor" to create the rdf:Description for the Metadata Work. See 526 for an example.
Metadata Work IRIs
Metadata works also need to have unique IRIs.
The same IRI needs to be generated from multiple templates, because the metadata work is described both within the associated WEMI rdf:Description and within its own rdf:Description.
We currently use "http://marc2rda.edu/fake/MetaWor/" + generate-id(). As stated above, this is a unique ID for the node currently being processed. It stays the same for that node throughout a single transformation, but is not guaranteed to be the same when the transformation is run again.
This works because there will always be a unique node associated with each metadata work. For a private field, the metadata work is associated with that field's node. For a nonpublic note, the metadata work is associated with the nonpublic note subfield's node.
To ensure we are getting the same generated id in multiple templates, we need to ensure that the context, that is the node we are processing, is the same in both templates (i.e. for a nonpublic note, we need to use a for-each to enter the subfield's context instead of accessing the subfield from the field's context.)
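The point about matching contexts can be sketched as follows for a nonpublic note. This is an illustration under assumptions: the `ex:` property names are hypothetical, the driver templates are a simplified stand-in for m2r.xsl, and treating 526 $x as belonging to the Manifestation is chosen only for the example. Both templates enter the subfield's context with for-each before calling generate-id(), so the IRI in the link and the IRI in rdf:about agree.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/hypothetical/"
    exclude-result-prefixes="marc">

    <xsl:output method="xml" indent="yes"/>

    <!-- Simplified driver standing in for m2r.xsl -->
    <xsl:template match="/">
        <rdf:RDF>
            <xsl:apply-templates select="//marc:record"/>
        </rdf:RDF>
    </xsl:template>

    <xsl:template match="marc:record">
        <rdf:Description rdf:about="{concat('http://fakeIRI2.edu/', marc:controlfield[@tag = '001'], 'man')}">
            <xsl:apply-templates select="*" mode="man"/>
        </rdf:Description>
        <!-- Second pass over the same fields mints the Metadata Works -->
        <xsl:apply-templates select="*" mode="metaWor"/>
    </xsl:template>

    <!-- First call: link from the entity to the Metadata Work -->
    <xsl:template match="marc:datafield[@tag = '526']" mode="man">
        <!-- for-each makes the $x subfield the context node for generate-id() -->
        <xsl:for-each select="marc:subfield[@code = 'x']">
            <ex:hypotheticalMetadataWorkLink
                rdf:resource="{'http://marc2rda.edu/fake/MetaWor/' || generate-id()}"/>
        </xsl:for-each>
    </xsl:template>

    <!-- Second call: the Metadata Work's own rdf:Description, same context, same ID -->
    <xsl:template match="marc:datafield[@tag = '526']" mode="metaWor">
        <xsl:for-each select="marc:subfield[@code = 'x']">
            <rdf:Description rdf:about="{'http://marc2rda.edu/fake/MetaWor/' || generate-id()}">
                <ex:hypotheticalNoteText><xsl:value-of select="."/></ex:hypotheticalNoteText>
            </rdf:Description>
        </xsl:for-each>
    </xsl:template>

    <!-- Assumption: fields with no template in a given mode are silently skipped -->
    <xsl:template match="*" mode="man metaWor"/>
</xsl:stylesheet>
```

If either template called generate-id() from the field's context instead, it would return the ID of the datafield rather than the subfield, and the two IRIs would no longer match.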
$6, 880, and item templates - see 561
For each field template, there are three possible matches:
- The original field
- A linked 880 field with a matching occurrence number
- An unlinked 880 field with a 00 occurrence number
Currently, we mint a new item for each field that indicates the existence of an item in order to prevent wrongly assuming two items are the same (Decision I.C.1).
For fields that indicate the existence of an item that also have a $6 subfield, we know that the associated 880 field is referring to the same object as the original field.
Therefore, each item template for a field should match on the field, as well as any 880s with a $6 value of [field]-00. Then, if the field has a $6, match to the associated 880 field within the template using the occurrence number, and perform the same mapping within the template so that this field is linked to the same item. See 561 for an example.
For Works, Expressions, and Manifestations, these extra steps are not needed, because the field templates are called from within the rdf:Description for that entity. Each field template should simply match on that field tag and 880s with $6 values that begin with that field tag.
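The item matching described above can be sketched roughly as follows. It assumes the standard MARC $6 layout, in which the original field carries a value such as 880-01 and the linked 880 carries a value beginning 561-01. The `ex:` property name and the driver templates are hypothetical and do not reproduce the project's 561 template.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/hypothetical/"
    exclude-result-prefixes="marc">

    <xsl:output method="xml" indent="yes"/>

    <!-- Simplified driver standing in for m2r.xsl -->
    <xsl:template match="/">
        <rdf:RDF>
            <xsl:apply-templates select="//marc:record"/>
        </rdf:RDF>
    </xsl:template>

    <xsl:template match="marc:record">
        <xsl:apply-templates select="*" mode="ite">
            <xsl:with-param name="baseIRI"
                select="concat('http://fakeIRI2.edu/', marc:controlfield[@tag = '001'])"/>
        </xsl:apply-templates>
    </xsl:template>

    <!-- Matches the original field and unlinked 880s (occurrence number 00) -->
    <xsl:template mode="ite"
        match="marc:datafield[@tag = '561']
             | marc:datafield[@tag = '880'][starts-with(marc:subfield[@code = '6'][1], '561-00')]">
        <xsl:param name="baseIRI"/>
        <!-- In a value such as 880-01, characters 5 and 6 hold the occurrence number -->
        <xsl:variable name="occNum" select="substring(marc:subfield[@code = '6'][1], 5, 2)"/>
        <!-- Only an original 561 with a non-zero occurrence number has a linked 880 -->
        <xsl:variable name="linked880" select="
            if (@tag = '561' and $occNum != '' and $occNum != '00')
            then ../marc:datafield[@tag = '880']
                    [starts-with(marc:subfield[@code = '6'][1], concat('561-', $occNum))]
            else ()"/>
        <!-- One item per matched field; the linked 880 is mapped into the same item -->
        <rdf:Description rdf:about="{concat($baseIRI, 'ite', generate-id())}">
            <xsl:for-each select=". | $linked880">
                <ex:hypotheticalCustodialHistory>
                    <xsl:value-of select="marc:subfield[@code = 'a']"/>
                </ex:hypotheticalCustodialHistory>
            </xsl:for-each>
        </rdf:Description>
    </xsl:template>

    <!-- Linked 880s and all other fields fall through here, so no second item is minted for them -->
    <xsl:template match="*" mode="ite"/>
</xsl:stylesheet>
```

A linked 880 (for example, one whose $6 begins with 561-01) is never matched by the item template itself, so it only contributes properties to the item minted for its original 561.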
Handling 3XX fields - see 337
Lookups with RDA vocabularies
When vocabulary terms are from RDA vocabularies or LC's cloned RDA vocabularies, the term's IRI can be looked up and retrieved to be used as the object of the attribute property. The functions uwf:rdaTermLookup and uwf:rdaCodeLookup, located in m2r-functions.xsl, are used to do this.
These functions take the scheme code from the subfield $2 in the field and perform a lookup in lookups/rdaVocabularies.xml, matching the scheme code with the document that will contain the IRI for that vocabulary term or code.
A second lookup is then done in that document for the IRI associated with that term or code, and if it is found, the IRI is returned.
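The two-stage lookup can be illustrated with the self-contained sketch below. Everything in it is a stand-in: the inline variables replace the real lookup files, whose structure is not documented on this page, and the element names, attribute names, example.org IRIs, and the function ex:termLookup are all hypothetical. Only the scheme codes rdacontent and rdamedia and their terms are real MARC values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/hypothetical/"
    exclude-result-prefixes="xs ex">

    <xsl:output method="text"/>

    <!-- Stage 1 data: scheme code from $2 mapped to the vocabulary document that covers it -->
    <xsl:variable name="index">
        <vocabulary scheme="rdacontent" ref="contentTypes"/>
        <vocabulary scheme="rdamedia" ref="mediaTypes"/>
    </xsl:variable>

    <!-- Stage 2 data: inline stand-ins for the separate vocabulary documents -->
    <xsl:variable name="documents">
        <document name="contentTypes">
            <term label="text" code="txt" iri="http://example.org/fake/content/text"/>
        </document>
        <document name="mediaTypes">
            <term label="unmediated" code="n" iri="http://example.org/fake/media/unmediated"/>
        </document>
    </xsl:variable>

    <!-- Returns the IRI for a term, or the empty sequence when either stage finds nothing -->
    <xsl:function name="ex:termLookup" as="xs:string?">
        <xsl:param name="scheme" as="xs:string"/>
        <xsl:param name="term" as="xs:string"/>
        <!-- Stage 1: which document holds this scheme's terms? -->
        <xsl:variable name="ref" select="$index/vocabulary[@scheme = $scheme]/@ref"/>
        <!-- Stage 2: find the term inside that document -->
        <xsl:sequence select="($documents/document[@name = $ref]/term[@label = $term]/@iri)[1]/string()"/>
    </xsl:function>

    <!-- Runs against any XML input and prints one hit and one miss -->
    <xsl:template match="/">
        <xsl:value-of select="ex:termLookup('rdacontent', 'text')"/>
        <xsl:text>&#10;</xsl:text>
        <xsl:value-of select="(ex:termLookup('rdacontent', 'no such term'), 'not found')[1]"/>
    </xsl:template>
</xsl:stylesheet>
```

The first call prints http://example.org/fake/content/text and the second prints "not found". A code-based lookup would look the same with @code in place of @label in the second stage.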
Minting concepts
When a $2 is present that is not an RDA vocabulary, we mint a skos:Concept for the term.
The function uwf:fillConcept() is used inside the <rdf:Description> for the concept, and has params for the prefLabel, scheme, notation, and field number. If any of these values is not present, an empty string can be passed as the param.
uwf:fillConcept() will look up the scheme code provided in id.loc.gov's various scheme and code lists and attempt to match it with an IRI that can be used as the value of skos:inScheme.
When an 880 is linked with the field and the field results in a concept being minted, uwf:fillConcept() can be called again with the 880 skos:prefLabel and skos:notation values within the same rdf:Description. The fieldNum param should be '880' to prevent an additional rdf:type triple being added to the rdf:Description.
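The effect of passing '880' as the field number can be sketched with a hypothetical stand-in function. ex:fillConcept below is not the real uwf:fillConcept(): the parameter order, the use of skos:Concept as the rdf:type value, and the sample values are assumptions, and the id.loc.gov scheme lookup for skos:inScheme is left out entirely.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:ex="http://example.org/hypothetical/"
    exclude-result-prefixes="xs ex">

    <xsl:output method="xml" indent="yes"/>

    <!-- Hypothetical stand-in: returns the child elements for a concept's rdf:Description -->
    <xsl:function name="ex:fillConcept" as="element()*">
        <xsl:param name="prefLabel" as="xs:string"/>
        <!-- $scheme is accepted but unused here; the skos:inScheme lookup is omitted -->
        <xsl:param name="scheme" as="xs:string"/>
        <xsl:param name="notation" as="xs:string"/>
        <xsl:param name="fieldNum" as="xs:string"/>
        <!-- The type triple is emitted once, by the call for the original field -->
        <xsl:if test="$fieldNum != '880'">
            <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
        </xsl:if>
        <!-- Empty strings mean "value not present" and produce no element -->
        <xsl:if test="$prefLabel != ''">
            <skos:prefLabel><xsl:value-of select="$prefLabel"/></skos:prefLabel>
        </xsl:if>
        <xsl:if test="$notation != ''">
            <skos:notation><xsl:value-of select="$notation"/></skos:notation>
        </xsl:if>
    </xsl:function>

    <!-- Runs against any XML input; both calls fill the same rdf:Description -->
    <xsl:template match="/">
        <rdf:RDF>
            <rdf:Description rdf:about="http://marc2rda.edu/fake/concept/example/term">
                <xsl:sequence select="ex:fillConcept('term', 'example', '', '337')"/>
                <xsl:sequence select="ex:fillConcept('term in original script', 'example', '', '880')"/>
            </rdf:Description>
        </rdf:RDF>
    </xsl:template>
</xsl:stylesheet>
```

The resulting rdf:Description contains one rdf:type element and two skos:prefLabel elements, one from each call.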