METS profiles - pulibrary/BlueMountain GitHub Wiki
There are two kinds of METS records in Blue Mountain:
- Title-Level METS – A METS document encapsulating information about the magazine title as a whole.
- Issue-Level METS – A METS document encapsulating information about an individual issue of a magazine.
These are described in greater detail below.
( Greater detail to come. )
The metadata for the title will be encapsulated in a title-level METS record: the title-level descriptive metadata (either as an embedded MODS record or pointed to), a pointer to the bibliographic history, and (possibly) pointers to issue-level metadata.
Thinking in terms of FRBR’s Group 1 entities (Work, Expression, Manifestation, Item):
- The work is the intellectual notion of a particular issue: “The third issue of Dada appeared in December, 1918.”
- The expression is the abstract realization of the work in some form: the French version of Dada 4-5, for example, contains different articles from the X version.
- The manifestation is the physical embodiment of an expression: the French issuance of Dada 4-5.
- The item is the physical copy. The files are likewise items.
The metadata for each issue shall be encapsulated in a METS record. A skeleton sample of such a record is the following:
<?xml version="1.0" encoding="UTF-8"?>
<mets xmlns="http://www.loc.gov/METS/"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd"
TYPE="Magazine"
OBJID="urn:PUL:bluemountain:bmtnaad_1925-06-03_01"
LABEL="bmtnaad_1925-06-03_01">
<metsHdr>
<agent ROLE="CREATOR" TYPE="ORGANIZATION">
<name>Princeton University Library, Digital Initiatives</name>
</agent>
<metsDocumentID TYPE="URN">urn:PUL:bluemountain:td:bmtnaad_1925-06-03_01</metsDocumentID>
</metsHdr>
<dmdSec ID="dmd1">
<mdWrap MDTYPE="MODS">
<xmlData>
<!-- MODS record goes here -->
</xmlData>
</mdWrap>
</dmdSec>
<!--Use a single administrative section (<amdSec>) as a
wrapper for the technical metadata for all the images in a group-->
<amdSec ID="amdSec1">
<techMD ID="techmd1">
<!-- technical metadata (MIX) for first image -->
<mdWrap MDTYPE="NISOIMG">
<!-- The technical metadata docWorks provides goes here -->
</mdWrap>
</techMD>
<techMD ID="techmd2">
<!-- technical metadata for the second image -->
<mdWrap MDTYPE="NISOIMG"/>
</techMD>
<!-- <techMD> elements for remaining image files in this group -->
</amdSec>
<amdSec ID="amdSec2">
<!-- <techMD> elements for generative image files -->
</amdSec>
<amdSec ID="amdSec3">
<!-- <techMD> elements for preservation image files -->
</amdSec>
<amdSec ID="amdSec4">
<!-- <techMD> elements for delivery PDF files -->
</amdSec>
<amdSec ID="amdSec5">
<!-- <techMD> elements for high-resolution PDF files -->
</amdSec>
<fileSec>
<fileGrp ID="IMGGRP1" USE="Delivery Images">
<!-- Note that the AMDID attribute contains the ID of the
<techMD> element corresponding to the file. Note, too,
the use of the GROUPID attribute, which groups together
the image file, other resolutions, and its corresponding ALTO file. -->
<file ID="IMG001" GROUPID="page1" AMDID="techmd1" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/delivery/bmtnaad_1925-06-03_01_001.jp2"/>
</file>
<file ID="IMG002" GROUPID="page2" AMDID="techmd2" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/delivery/bmtnaad_1925-06-03_01_002.jp2"/>
</file>
</fileGrp>
<fileGrp ID="IMGGRP2" USE="Generative Images">
<file ID="IMG003" GROUPID="page1" AMDID="techmd1" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/generative/bmtnaad_1925-06-03_01_001.jp2"/>
</file>
<file ID="IMG004" GROUPID="page2" AMDID="techmd2" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/generative/bmtnaad_1925-06-03_01_002.jp2"/>
</file>
</fileGrp>
<fileGrp ID="IMGGRP3" USE="Preservation Images">
<file ID="IMG005" GROUPID="page1" AMDID="techmd1" MIMETYPE="image/tiff" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/pstore/periodicals/bmtnaad/issues/1925/06/03_01/bmtnaad_1925-06-03_01_001.tif"/>
</file>
<file ID="IMG006" GROUPID="page2" AMDID="techmd2" MIMETYPE="image/tiff" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/pstore/periodicals/bmtnaad/issues/1925/06/03_01/bmtnaad_1925-06-03_01_002.tif"/>
</file>
</fileGrp>
<fileGrp ID="PDFGRP1" USE="low-resolution PDF">
<file ID="PDF01" MIMETYPE="application/pdf" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/bmtnaad_1925-06-03_01.pdf"/>
</file>
</fileGrp>
<fileGrp ID="PDFGRP2" USE="high-resolution PDF">
<file ID="PDF02" MIMETYPE="application/pdf" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/pstore/periodicals/bmtnaad/issues/1925/06/03_01/bmtnaad_1925-06-03_01.pdf"/>
</file>
</fileGrp>
<fileGrp ID="ALTOGRP" USE="OCR">
<file ID="ALTO001" GROUPID="page1" MIMETYPE="text/xml" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file://./bmtnaad_1925-06-03_01_001.alto.xml"/>
</file>
<file ID="ALTO002" GROUPID="page2" MIMETYPE="text/xml" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
<FLocat LOCTYPE="URL" xlink:href="file://./bmtnaad_1925-06-03_01_002.alto.xml"/>
</file>
</fileGrp>
</fileSec>
<structMap TYPE="PHYSICAL">
<div/>
</structMap>
<structMap TYPE="LOGICAL">
<div/>
</structMap>
</mets>
The root element <mets> contains these attributes:
- TYPE
- the fixed value Magazine
- OBJID
- the URN for the issue
- LABEL
- the issueid
The <metsHdr> element shall contain two elements: <agent> and <metsDocumentID>.
The <agent> element has a constant value for all records:
<agent ROLE="CREATOR" TYPE="ORGANIZATION">
<name>Princeton University Library, Digital Initiatives</name>
</agent>
The <metsDocumentID> element contains a string composed as follows:
PREFIX:ISSUEID
Where PREFIX is the following fixed value:
urn:PUL:bluemountain:td
And ISSUEID is the issue identifier, computed using the rules above. For the sample record, this yields urn:PUL:bluemountain:td:bmtnaad_1925-06-03_01.
The record contains a single <dmdSec> element with an ID attribute of “dmd1”; it contains an embedded MODS record for the issue (described below).
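As a rough sketch of how the placeholder in the skeleton might be filled in, the fragment below embeds a very small MODS record. The choice of MODS elements, the title string, and the date (read off the sample issue identifier) are illustrative assumptions, not the project's actual MODS profile.

```xml
<dmdSec ID="dmd1" xmlns="http://www.loc.gov/METS/">
  <mdWrap MDTYPE="MODS">
    <xmlData>
      <!-- Illustrative only: element selection and values are assumptions -->
      <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
        <mods:titleInfo>
          <mods:title>Example Magazine Title</mods:title>
        </mods:titleInfo>
        <mods:originInfo>
          <mods:dateIssued encoding="w3cdtf">1925-06-03</mods:dateIssued>
        </mods:originInfo>
      </mods:mods>
    </xmlData>
  </mdWrap>
</dmdSec>
```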
Each <amdSec> contains a <techMD> element for each image file (a <mix> record).
There are five <amdSec>s in an issue-level METS file, one for the files in each of the following groups:
- Delivery-level jp2s
- Generative jp2s
- Preservation-level tifs
- low-resolution issue-level PDF
- high-resolution issue-level PDF
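To make the <techMD> wrapper concrete, here is a hedged sketch of one entry holding a small MIX fragment. It assumes MIX 2.0 (namespace http://www.loc.gov/mix/v20); the MIX version docWorks actually emits, the elements it populates, and the pixel dimensions shown are all assumptions.

```xml
<techMD ID="techmd1" xmlns="http://www.loc.gov/METS/">
  <mdWrap MDTYPE="NISOIMG">
    <xmlData>
      <!-- Illustrative MIX 2.0 fragment; dimensions are placeholder values -->
      <mix:mix xmlns:mix="http://www.loc.gov/mix/v20">
        <mix:BasicImageInformation>
          <mix:BasicImageCharacteristics>
            <mix:imageWidth>2400</mix:imageWidth>
            <mix:imageHeight>3600</mix:imageHeight>
          </mix:BasicImageCharacteristics>
        </mix:BasicImageInformation>
      </mix:mix>
    </xmlData>
  </mdWrap>
</techMD>
```

The ID value is what a <file> element's AMDID attribute would reference.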
The <fileSec> comprises six <fileGrp> elements: one for each of the five file groups above and one for the ALTO records.
The IMGGRP1 (Delivery Images) file group contains <file> elements that indicate the location of each image file, with attributes linking the file to the corresponding technical metadata and to the corresponding ALTO file. The attributes of <file> and of its child <FLocat> are:
- ID
- a unique XML id
- AMDID
- the ID of the <techMD> element corresponding to the image file
- GROUPID
- an ID that links an image file to an ALTO file.
The image file for a page and the ALTO file containing the OCR output for that page share an id (conventionally named pageN, where N is a sequence number).
- MIMETYPE
- the constant “image/jp2” for JPEG 2000 images
- CHECKSUM
- the checksum of the file, according to the algorithm specified in CHECKSUMTYPE
- CHECKSUMTYPE
- the algorithm used to compute the checksum; usually SHA-1.
- LOCTYPE
- the constant URL
- xlink:href
- the path to the file. For this project, it will be a local path. For example:
file://./bmtnaad_1925-06-03_01_001.jp2
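The GROUPID pairing described above can be illustrated with a reduced fragment: one delivery image and one ALTO file for the same page, both carrying GROUPID="page1". The identifiers are taken from the skeleton record; the relative paths and the "xxxx" checksums are placeholders.

```xml
<fileSec xmlns="http://www.loc.gov/METS/"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileGrp ID="IMGGRP1" USE="Delivery Images">
    <!-- AMDID points at the techMD; GROUPID ties the image to its ALTO file -->
    <file ID="IMG001" GROUPID="page1" AMDID="techmd1"
          MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
      <FLocat LOCTYPE="URL" xlink:href="file://./bmtnaad_1925-06-03_01_001.jp2"/>
    </file>
  </fileGrp>
  <fileGrp ID="ALTOGRP" USE="OCR">
    <!-- Same GROUPID as IMG001; no AMDID because ALTO has no technical metadata -->
    <file ID="ALTO001" GROUPID="page1"
          MIMETYPE="text/xml" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
      <FLocat LOCTYPE="URL" xlink:href="file://./bmtnaad_1925-06-03_01_001.alto.xml"/>
    </file>
  </fileGrp>
</fileSec>
```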
IMGGRP2 is like IMGGRP1 (Delivery Images) but corresponds to the Generative JP2 images.
IMGGRP3 is like IMGGRP1 (Delivery Images) but corresponds to the Preservation TIFF images.
PDFGRP1 contains a single <file> element corresponding to the low-resolution PDF.
PDFGRP2 contains a single <file> element corresponding to the high-resolution PDF.
ALTOGRP is like the <fileGrp>s for images, but corresponds to the ALTO files. (The ALTO files do not have technical metadata, so there is no AMDID attribute.)
The <structMap> element describes a hierarchical arrangement of the parts (<div>s) making up the digital object described by the METS. For this project, there are two kinds: a physical structMap, which delineates the pages of the magazine issue in reading order, and a logical structMap, which functions as an outline of the magazine’s contents. Both of these are assembled by docWorks, using configuration rules.
The resource structMap (TYPE="Resource") is a map of the entire object. Each of its <div>s corresponds to one of the file groups.
<structMap TYPE="Resource">
<div LABEL="delivery formats">
<div LABEL="low-resolution PDF"></div>
<div LABEL="high-resolution PDF"></div>
<div LABEL="preservation TIFF"></div>
<div LABEL="delivery JP2"></div>
<div LABEL="generative JP2"></div>
<div LABEL="ALTO"></div>
<div LABEL="TEI-encoded"></div>
</div>
</structMap>
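By way of comparison, a physical structMap, which lists pages in reading order, might be sketched as follows. The TYPE="Page" value, the use of ORDER, and the pairing of image and ALTO <fptr> elements are assumptions about what docWorks produces, not a documented part of this profile; the file IDs come from the skeleton record.

```xml
<structMap TYPE="PHYSICAL" xmlns="http://www.loc.gov/METS/">
  <div TYPE="Magazine">
    <!-- One div per page, in reading order; TYPE="Page" is an assumed value -->
    <div TYPE="Page" ORDER="1">
      <fptr FILEID="IMG001"/>
      <fptr FILEID="ALTO001"/>
    </div>
    <div TYPE="Page" ORDER="2">
      <fptr FILEID="IMG002"/>
      <fptr FILEID="ALTO002"/>
    </div>
  </div>
</structMap>
```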
The outlines below show the hierarchical relationship among the <div> elements in the logical structMap. Each div is described more fully below.
- Magazine
- Volume+
- Issue+
- Contents
- { Article* | Illustration* | Section* }
- Advertisements
- { SponsoredAd+ | Section* }
- Article
- Header*
- Contents
- Head+
- Byline*
- Body
- { Paragraph* | Section* }
- Illustration
- Graphic+
- Caption?
- Illustration
- Paragraph+
- SponsoredAd
- { Graphic* | Paragraph* }
- Section
- Header?
- Body
- SponsoredAd
- { Article* | Illustration* | SponsoredAd* | Section* }
- Paragraph
- TextBlock+
- Paragraph
The root <div> of the logical structMap is <div TYPE="Magazine">. It must contain one or more <div TYPE="Volume"> elements (in practice it will contain only one).
Attributes:
- TYPE
- must be “Magazine”
- LABEL
- The name of the magazine, equivalent to the top-level mods:titleInfo element.
A <div> representing a (possibly) bound volume of issues. In most cases, we are representing each issue of a magazine as a separate digital object, so the <div TYPE="Volume"> element will in practice contain only one <div TYPE="Issue">.
Attributes:
- TYPE
- must be “Volume”
- LABEL
- The volume caption, if present
A <div> representing the actual issue. It contains the “contents” of the magazine: the editorial content and the advertisements.
Attributes:
- TYPE
- must be “Issue”
- LABEL
- The issue number and the date of publication
- DMDID
- the ID of the <dmdSec> for the object (in practice, always “dmd1”)
The Issue <div> contains, in most cases, three sub-<div>s: <div TYPE="PublicationInfo">, <div TYPE="EditorialContent"> and <div TYPE="SponsoredAdvertisements">, described below.
The PublicationInfo <div> contains <div>s corresponding to the metadata about the magazine printed in the issue itself: mastheads, nameplates, folio lines, page numbers, etc.
The EditorialContent <div> contains <div>s corresponding to the TextContent and Illustration elements, in publication order. These elements have DMDID attributes whose values link them to the corresponding <relatedItem> elements in the <mods> record.
The SponsoredAdvertisements <div> contains <div>s corresponding to the SponsoredAdvertisement elements, in publication order. These elements have DMDID attributes whose values link them to the corresponding <relatedItem> elements in the <mods> record.
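Putting the upper levels together, the top of a logical structMap might look like the sketch below. The LABEL strings and the constituent IDs (c001, c002) are hypothetical placeholders; in a complete record each DMDID would have to match an ID declared elsewhere in the same METS document.

```xml
<structMap TYPE="LOGICAL" xmlns="http://www.loc.gov/METS/">
  <div TYPE="Magazine" LABEL="Example Magazine Title">
    <div TYPE="Volume">
      <!-- LABEL is a placeholder for the issue number and publication date -->
      <div TYPE="Issue" LABEL="Example issue label" DMDID="dmd1">
        <div TYPE="PublicationInfo">
          <!-- mastheads, nameplates, folio lines, page numbers -->
        </div>
        <div TYPE="EditorialContent">
          <!-- c001 and c002 are hypothetical relatedItem IDs -->
          <div TYPE="TextContent" DMDID="c001"/>
          <div TYPE="Illustration" DMDID="c002"/>
        </div>
        <div TYPE="SponsoredAdvertisements">
          <!-- advertisement divs, in publication order -->
        </div>
      </div>
    </div>
  </div>
</structMap>
```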
A TextContent <div> represents a piece of editorial content: an article, a review, a letter, a poem, etc.
Editorial content takes a number of forms: it may or may not have a headline; it may or may not have a byline; it may have subsections, each with its own headline (subhead).
A TextContent <div> MAY contain a <div TYPE="Header">; it will always have a <div TYPE="Body">.
Attributes:
- TYPE
- must be “TextContent”
- DMDID
- the ID of the <mods:relatedItem type="constituent"> element corresponding to this piece in the magazine.
- LABEL
- SHOULD be equivalent to the contents of the mods:relatedItem/mods:titleInfo/mods:title element
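The link between a TextContent <div> and its MODS constituent can be sketched as below. The record is heavily abbreviated (intermediate structMap levels and all other METS sections are omitted), and the ID c001 and the title string are hypothetical values chosen for illustration.

```xml
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:mods="http://www.loc.gov/mods/v3">
  <dmdSec ID="dmd1">
    <mdWrap MDTYPE="MODS">
      <xmlData>
        <mods:mods>
          <!-- The ID here is the value the structMap div refers to -->
          <mods:relatedItem type="constituent" ID="c001">
            <mods:titleInfo>
              <mods:title>Example Article Title</mods:title>
            </mods:titleInfo>
          </mods:relatedItem>
        </mods:mods>
      </xmlData>
    </mdWrap>
  </dmdSec>
  <structMap TYPE="LOGICAL">
    <!-- DMDID matches the relatedItem ID; LABEL repeats its title -->
    <div TYPE="TextContent" DMDID="c001" LABEL="Example Article Title"/>
  </structMap>
</mets>
```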
A Header <div> contains the heading information of its component (a TextContent, SponsoredAd, or Section): a combination of headline and byline. The Header may contain one or more Head elements (encompassing, for example, a headline and a subhead); it may also contain one or more Byline elements, which are not necessarily contiguous in the physical layout of the page.
Attributes:
- TYPE
- must be “Header”
A <div> designating the region associated with a head of some kind: a headline, a subhead, etc.
Attributes:
- TYPE
- must be “Head”
A <div> designating one or more regions associated with the writer of an article: usually the writer’s name, but sometimes also the writer’s position or other biographical information.
Attributes:
- TYPE
- must be “Byline”
A container <div> for the body of an article or section. A Body may contain paragraphs, illustrations, or sections, in any order.
A <div> that contains one or more text blocks representing the contents of a logical paragraph. Paragraphs have a sequential order within their containing article, caption, or sponsored ad.
Attributes:
- TYPE
- must be “Paragraph”
- ORDER
- the index of the paragraph in its containing div (1, 2, etc.).
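The nesting of Header, Head, Byline, Body, and Paragraph described in the preceding entries could be assembled as in this sketch. The DMDID and LABEL values are hypothetical, and the empty <div>s stand in for the text blocks that a real record would contain.

```xml
<div xmlns="http://www.loc.gov/METS/"
     TYPE="TextContent" DMDID="c001" LABEL="Example Article Title">
  <!-- Optional Header: one or more Heads, zero or more Bylines -->
  <div TYPE="Header">
    <div TYPE="Head"/>
    <div TYPE="Byline"/>
  </div>
  <!-- Required Body: paragraphs numbered by ORDER within their container -->
  <div TYPE="Body">
    <div TYPE="Paragraph" ORDER="1"/>
    <div TYPE="Paragraph" ORDER="2"/>
  </div>
</div>
```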
A section is a container <div> of other <div>s. It may or may not have a Header; it will contain some combination of articles, illustrations, SponsoredAds, and other sections.
A Graphic <div> designates the location of a graphic on the page.
A TextBlock <div> designates the region of a block of text on a page.
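One way such regions can be expressed in METS is with <fptr> and <area> elements, as in the sketch below. Whether docWorks emits this exact pattern is an assumption; the coordinates and the ALTO block identifier (P1_TB00001) are hypothetical placeholders, while the file IDs come from the skeleton record.

```xml
<div xmlns="http://www.loc.gov/METS/" TYPE="Illustration">
  <div TYPE="Graphic">
    <fptr>
      <!-- A rectangle on the page image; coordinates are placeholders -->
      <area FILEID="IMG001" SHAPE="RECT" COORDS="100,200,900,1400"/>
    </fptr>
  </div>
  <div TYPE="Caption">
    <div TYPE="Paragraph" ORDER="1">
      <div TYPE="TextBlock">
        <fptr>
          <!-- Points at an element ID inside the ALTO file (assumed ID) -->
          <area FILEID="ALTO001" BETYPE="IDREF" BEGIN="P1_TB00001"/>
        </fptr>
      </div>
    </div>
  </div>
</div>
```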