METS profiles - pulibrary/BlueMountain GitHub Wiki

METS Profile

There are two kinds of METS records in Blue Mountain:

  1. Title-Level METS – A METS document encapsulating information about the magazine title as a whole.
  2. Issue-Level METS – A METS document encapsulating information about an individual issue of a magazine.

These are described in greater detail below.

Title-Level METS

( Greater detail to come. )

The metadata for the title will be encapsulated in a title-level METS record: the title-level descriptive metadata (either as an embedded MODS record or pointed to), a pointer to the bibliographic history, and (possibly) pointers to issue-level metadata.

Issue-Level METS

Thinking in terms of FRBR’s Group 1 entities (Work, Expression, Manifestation, Item):

  • The work is the intellectual notion of a particular issue: “The third issue of Dada appeared in December, 1918.”
  • The expression is the abstract realization of the work in some form: the French version of Dada 4-5, for example, contains different articles from the X version.
  • The manifestation is the physical embodiment of an expression: the French issuance of Dada 4-5.
  • The item is the physical copy. The files are likewise items.

The metadata for each issue shall be encapsulated in a METS record. A skeleton sample of such a record is the following:

<?xml version="1.0" encoding="UTF-8"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd" 
      TYPE="Magazine"
      OBJID="urn:PUL:bluemountain:bmtnaad_1925-06-03_01"
      LABEL="bmtnaad_1925-06-03_01">
 	<metsHdr>
    <agent ROLE="CREATOR" TYPE="ORGANIZATION">
      <name>Princeton University Library, Digital Initiatives</name>
    </agent>
    <metsDocumentID TYPE="URN">urn:PUL:bluemountain:td:bmtnaad_1925-06-03_01</metsDocumentID>
 	</metsHdr>
 	<dmdSec ID="dmd1">
   <mdWrap MDTYPE="MODS">
    <xmlData>
     <!-- MODS record goes here -->
    </xmlData>
   </mdWrap>
 	</dmdSec>
 	
    <!--Use a single administrative section (<amdSec>) as a 
        wrapper for the technical metadata for all the images in a group-->
 	<amdSec ID="amdSec1">
    <techMD ID="techmd1">
      <!-- technical metadata (MIX) for first image -->
      <mdWrap MDTYPE="NISOIMG">
        <!-- The technical metadata docWorks provides goes here -->
      </mdWrap>
    </techMD>
    <techMD ID="techmd2">
      <!-- technical metadata for the second image -->
      <mdWrap MDTYPE="NISOIMG"/>
    </techMD>
      
    <!-- <techMD> elements for remaining image files in this group -->
 	</amdSec>
 	
 	<amdSec ID="amdSec2">
    <!-- <techMD> elements for generative image files -->
 	</amdSec>
 	
 	<amdSec ID="amdSec3">
    <!-- <techMD> elements for preservation image files -->
 	</amdSec>
 	
 	<amdSec ID="amdSec4">
    <!-- <techMD> elements for delivery PDF files -->
 	</amdSec>
 	
 	<amdSec ID="amdSec5">
    <!-- <techMD> elements for high-resolution PDF files -->
 	</amdSec>
 	
 	<fileSec>
    <fileGrp ID="IMGGRP1" USE="Delivery Images">
 	
      <!-- Note that the ADMID attribute contains the ID of the
      <techMD> element corresponding to the file. Note, too,
      the use of the GROUPID attribute, which groups together
      the image file, other resolutions, and its corresponding ALTO file. -->
 	
      <file ID="IMG001" GROUPID="page1" ADMID="techmd1" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/delivery/bmtnaad_1925-06-03_01_001.jp2"/>
      </file>
      <file ID="IMG002" GROUPID="page2" ADMID="techmd2" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/delivery/bmtnaad_1925-06-03_01_002.jp2"/>
      </file>
    </fileGrp>
 	
    <fileGrp ID="IMGGRP2" USE="Generative Images">
 	
      <file ID="IMG003" GROUPID="page1" ADMID="techmd1" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/generative/bmtnaad_1925-06-03_01_001.jp2"/>
      </file>
      <file ID="IMG004" GROUPID="page2" ADMID="techmd2" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/generative/bmtnaad_1925-06-03_01_002.jp2"/>
      </file>
    </fileGrp>
 	
    <fileGrp ID="IMGGRP3" USE="Preservation Images">
 	
      <file ID="IMG005" GROUPID="page1" ADMID="techmd1" MIMETYPE="image/tiff" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/pstore/periodicals/bmtnaad/issues/1925/06/03_01/bmtnaad_1925-06-03_01_001.tif"/>
      </file>
      <file ID="IMG006" GROUPID="page2" ADMID="techmd2" MIMETYPE="image/tiff" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/pstore/periodicals/bmtnaad/issues/1925/06/03_01/bmtnaad_1925-06-03_01_002.tif"/>
      </file>
    </fileGrp>
 	
    <fileGrp ID="PDFGRP1" USE="low-resolution PDF"> 
      <file ID="PDF01"  MIMETYPE="application/pdf" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/astore/periodicals/bmtnaad/issues/1925/06/03_01/bmtnaad_1925-06-03_01.pdf"/>
      </file>
    </fileGrp>
 	
    <fileGrp ID="PDFGRP2" USE="high-resolution PDF"> 
      <file ID="PDF02"  MIMETYPE="application/pdf" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file:///usr/share/BlueMountain/pstore/periodicals/bmtnaad/issues/1925/06/03_01/bmtnaad_1925-06-03_01.pdf"/>
      </file>
    </fileGrp>
 	
    <fileGrp ID="ALTOGRP" USE="OCR">
      <file ID="ALTO001" GROUPID="page1" MIMETYPE="text/xml" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file://./bmtnaad_1925-06-03_01_001.alto.xml"/>
      </file>
      <file ID="ALTO002" GROUPID="page2" MIMETYPE="text/xml" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
        <FLocat LOCTYPE="URL" xlink:href="file://./bmtnaad_1925-06-03_01_002.alto.xml"/>
      </file>
    </fileGrp>
 	</fileSec>
 	<structMap TYPE="PHYSICAL">
    <div/>
 	</structMap>
 	<structMap TYPE="LOGICAL">
    <div/>
 	</structMap>
</mets>

The root element <mets> contains these attributes:

TYPE
the fixed value Magazine
OBJID
the URN for the issue
LABEL
the issueid

<metsHdr>

The <metsHdr> element shall contain two elements:

<agent>

A constant value for all records:

<agent ROLE="CREATOR" TYPE="ORGANIZATION">
 <name>Princeton University Library, Digital Initiatives</name>
</agent>

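<metsDocumentID TYPE="URN">

Contains a string composed as follows:

PREFIX + ISSUEID

Where PREFIX is the following fixed value (the two parts are simply concatenated, because the prefix already ends with a colon):

urn:PUL:bluemountain:td:

And ISSUEID is the issue identifier (the same value used for LABEL on the root element), computed using the issue-identifier rules.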
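
As an illustration, assuming the issue identifier bmtnaad_1925-06-03_01 used in the skeleton record above, the composed value would look like this:

```xml
<!-- Illustrative fragment: the fixed prefix "urn:PUL:bluemountain:td:"
     followed by the issue identifier "bmtnaad_1925-06-03_01" -->
<metsHdr xmlns="http://www.loc.gov/METS/">
  <agent ROLE="CREATOR" TYPE="ORGANIZATION">
    <name>Princeton University Library, Digital Initiatives</name>
  </agent>
  <metsDocumentID TYPE="URN">urn:PUL:bluemountain:td:bmtnaad_1925-06-03_01</metsDocumentID>
</metsHdr>
```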

<dmdSec>

The record contains a single <dmdSec> element with an ID attribute of "dmd1"; it contains an embedded MODS record for the issue (described below).
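
A minimal sketch of how such an embedded record might be wrapped is shown here. The titles and the constituent ID "c001" are placeholders invented for illustration; the actual MODS content follows the project's MODS profile, which this section does not define.

```xml
<dmdSec ID="dmd1" xmlns="http://www.loc.gov/METS/">
  <mdWrap MDTYPE="MODS">
    <xmlData>
      <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
        <!-- Top-level title of the magazine (placeholder value) -->
        <mods:titleInfo>
          <mods:title>Example Magazine</mods:title>
        </mods:titleInfo>
        <!-- One constituent per piece in the issue; the ID is hypothetical
             and is what a DMDID in the logical structMap would point to -->
        <mods:relatedItem type="constituent" ID="c001">
          <mods:titleInfo>
            <mods:title>Example Article</mods:title>
          </mods:titleInfo>
        </mods:relatedItem>
      </mods:mods>
    </xmlData>
  </mdWrap>
</dmdSec>
```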

<amdSec ID="amdSecN"> elements

The <amdSec> contains a <techMD> element for each image file (a <mix> record).

There are five <amdSec>s in an issue-level METS file, one for the files in each of the following groups:

  1. Delivery-level jp2s
  2. Generative jp2s
  3. Preservation-level tifs
  4. low-resolution issue-level PDF
  5. high-resolution issue-level PDF
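
The following fragment sketches what one <techMD> entry with a wrapped MIX record could look like. It assumes MIX 2.0 and uses made-up pixel dimensions; the technical metadata actually supplied by docWorks may use a different MIX version and will contain many more elements.

```xml
<amdSec ID="amdSec1" xmlns="http://www.loc.gov/METS/">
  <techMD ID="techmd1">
    <mdWrap MDTYPE="NISOIMG">
      <xmlData>
        <!-- Assumed MIX 2.0 namespace; dimensions are placeholder values -->
        <mix:mix xmlns:mix="http://www.loc.gov/mix/v20">
          <mix:BasicImageInformation>
            <mix:BasicImageCharacteristics>
              <mix:imageWidth>2400</mix:imageWidth>
              <mix:imageHeight>3600</mix:imageHeight>
            </mix:BasicImageCharacteristics>
          </mix:BasicImageInformation>
        </mix:mix>
      </xmlData>
    </mdWrap>
  </techMD>
</amdSec>
```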

<fileSec>

The fileSec comprises six <fileGrp> elements: one for each of the five file groups above and one for the ALTO records.

<fileGrp ID="IMGGRP1" USE="Delivery Images">

The IMGGRP1 file group contains <file> elements that indicate the location of each image file, with attributes linking the file to the corresponding technical metadata and to the corresponding ALTO file.

<file>
ID
a unique XML id
ADMID
the ID of the <techMD> element corresponding to the image file (ADMID is the attribute name defined by the METS schema)
GROUPID
an ID that links an image file to an ALTO file. The image file for a page and the ALTO file containing the OCR output for that page share an ID (conventionally named pageN, where N is a sequence number).
MIMETYPE
the constant "image/jp2" for JPEG 2000 images
CHECKSUM
the checksum of the file, computed with the algorithm specified in CHECKSUMTYPE
CHECKSUMTYPE
the algorithm used to compute the checksum; usually SHA-1

<FLocat>
The METS element indicating the actual file location.
LOCTYPE
the constant URL
xlink:href
the path to the file. For this project, it will be a local path. For example:
file://./bmtnaad_1925-06-03_01_001.jp2
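
To make the GROUPID pairing concrete, the fragment below shows a delivery image and its ALTO file sharing the value "page1". It reuses the identifiers from the skeleton record; the relative file:// paths are illustrative. The METS schema names the technical-metadata reference attribute ADMID, and that spelling is used here.

```xml
<fileSec xmlns="http://www.loc.gov/METS/"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileGrp ID="IMGGRP1" USE="Delivery Images">
    <!-- Image for page 1: ADMID points to its <techMD> element -->
    <file ID="IMG001" GROUPID="page1" ADMID="techmd1" MIMETYPE="image/jp2" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
      <FLocat LOCTYPE="URL" xlink:href="file://./bmtnaad_1925-06-03_01_001.jp2"/>
    </file>
  </fileGrp>
  <fileGrp ID="ALTOGRP" USE="OCR">
    <!-- OCR for page 1: same GROUPID, no ADMID -->
    <file ID="ALTO001" GROUPID="page1" MIMETYPE="text/xml" CHECKSUM="xxxx" CHECKSUMTYPE="SHA-1">
      <FLocat LOCTYPE="URL" xlink:href="file://./bmtnaad_1925-06-03_01_001.alto.xml"/>
    </file>
  </fileGrp>
</fileSec>
```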
      

<fileGrp ID="IMGGRP2" USE="Generative Images">

Like IMGGRP1 (Delivery Images) but corresponding to the Generative JP2 images.

<fileGrp ID="IMGGRP3" USE="Preservation Images">

Like IMGGRP1 (Delivery Images) but corresponding to the Preservation TIFF images.

<fileGrp ID="PDFGRP1" USE="low-resolution PDF">

Contains a single <file> element corresponding to the low-resolution PDF.

<fileGrp ID="PDFGRP2" USE="high-resolution PDF">

Contains a single <file> element corresponding to the high-resolution PDF.

<fileGrp ID="ALTOGRP" USE="OCR">

Like the <fileGrp> for images, but corresponding to the ALTO files. (The ALTO files do not have technical metadata, so there is no ADMID attribute.)

<structMap>

The <structMap> element describes a hierarchical arrangement of the parts (<div>s) making up the digital object described by the METS. For this project, there are two kinds: a physical structMap, which delineates the pages of the magazine issue in reading order, and a logical structMap, which functions as an outline of the magazine's contents. Both of these are assembled by docWorks, using configuration rules.
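
As a rough sketch only, a physical structMap for a two-page issue might be arranged as follows. The TYPE="Page" value, the use of ORDER, and the choice of <fptr> targets are assumptions made for illustration; the structure docWorks actually generates may differ.

```xml
<structMap TYPE="PHYSICAL" xmlns="http://www.loc.gov/METS/">
  <div TYPE="Magazine">
    <!-- One <div> per page, in reading order (hypothetical TYPE value) -->
    <div TYPE="Page" ORDER="1">
      <fptr FILEID="IMG001"/>
      <fptr FILEID="ALTO001"/>
    </div>
    <div TYPE="Page" ORDER="2">
      <fptr FILEID="IMG002"/>
      <fptr FILEID="ALTO002"/>
    </div>
  </div>
</structMap>
```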

<structMap TYPE="RESOURCE">

This structMap is a map of the entire object. Each of the <div>s corresponds to one of the file groups.

<structMap TYPE="RESOURCE">
 <div LABEL="delivery formats">
  <div LABEL="low-resolution PDF"></div>
  <div LABEL="high-resolution PDF"></div>
  <div LABEL="preservation TIFF"></div>
  <div LABEL="delivery JP2"></div>
  <div LABEL="generative JP2"></div>
  <div LABEL="ALTO"></div>
  <div LABEL="TEI-encoded"></div>
 </div>
</structMap>

<structMap TYPE="PHYSICAL">

<structMap TYPE="LOGICAL">

The <div> hierarchy

The outline below shows the hierarchical relationship among the <div> elements in the logical structMap. Each div is described more fully below.

  • Magazine
    • Volume+
      • Issue+
        • Contents
          • { Article* | Illustration* | Section* }
        • Advertisements
          • { SponsoredAd+ | Section* }
  • Article
    • Header*
      • Head+
      • Byline*
    • Body
      • { Paragraph* | Section* }
  • Illustration
    • Graphic+
    • Caption?
      • Paragraph+
  • SponsoredAd
    • { Graphic* | Paragraph* }
  • Section
    • Header?
    • Body
      • { Article* | Illustration* | SponsoredAd* | Section* }
  • Paragraph
    • TextBlock+

<div TYPE="Magazine">

The root <div> of the logical structMap is <div TYPE="Magazine">. It must contain one or more <div TYPE="Volume"> elements (in practice it will contain only one).

Attributes:

TYPE
must be "Magazine"
LABEL
the name of the magazine, equivalent to the top-level mods:titleInfo element

<div TYPE="Volume">?

A <div> representing a (possibly) bound volume of issues. In most cases, we are representing each issue of a magazine as a separate digital object, so the <div TYPE="Volume"> element will in practice contain only one <div TYPE="Issue">.

Attributes:

TYPE
must be "Volume"
LABEL
the volume caption, if present

<div TYPE="Issue">

A <div> representing the actual issue. It contains the "contents" of the magazine: the editorial content and the advertisements.

Attributes:

TYPE
must be "Issue"
LABEL
the issue number and the date of publication
DMDID
the ID of the <dmdSec> for the object (in practice, always "dmd1")

The Issue <div> contains, in most cases, three sub-<div>s: <div TYPE="PublicationInfo">, <div TYPE="EditorialContent"> and <div TYPE="SponsoredAdvertisements">, described below.
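
Putting the top of the hierarchy together, a logical structMap could be nested as in the sketch below. The LABEL values are invented placeholders, and the TYPE values follow the capitalization used in the section headings of this page; DMDID="dmd1" refers to the <dmdSec> elsewhere in the record.

```xml
<structMap TYPE="LOGICAL" xmlns="http://www.loc.gov/METS/">
  <div TYPE="Magazine" LABEL="Example Magazine">
    <div TYPE="Volume" LABEL="Vol. 1">
      <div TYPE="Issue" LABEL="No. 1, 3 June 1925" DMDID="dmd1">
        <!-- The three usual children of an Issue, left empty here -->
        <div TYPE="PublicationInfo"/>
        <div TYPE="EditorialContent" LABEL="Contents"/>
        <div TYPE="SponsoredAdvertisements" LABEL="Advertisements"/>
      </div>
    </div>
  </div>
</structMap>
```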

<div TYPE="PublicationInfo">

Contains <div>s corresponding to the metadata about the magazine printed in the issue itself: mastheads, nameplates, folio lines, page numbers, etc.

<div TYPE="EditorialContent" LABEL="Contents">

Contains <div>s corresponding to the TextContent and Illustration elements, in publication order. These elements have DMDID attributes whose values link them to the corresponding <relatedItem> elements in the <mods> record.

<div TYPE="SponsoredAdvertisements" LABEL="Advertisements">

Contains <div>s corresponding to the SponsoredAdvertisement elements, in publication order. These elements have DMDID attributes whose values link them to the corresponding <relatedItem> elements in the <mods> record.

<div TYPE="TextContent">

A <div> representing a piece of editorial content: an article, a review, a letter, a poem, etc.

Editorial content takes a number of forms: it may or may not have a headline; it may or may not have a byline; it may have subsections, each with its own headline (subhead).

A TextContent <div> MAY contain a <div TYPE="Header">; it will always have a <div TYPE="Body">.

Attributes:

TYPE
must be "TextContent"
DMDID
the ID of the <mods:relatedItem type="constituent"> element corresponding to this piece in the magazine
LABEL
SHOULD be equivalent to the contents of the mods:relatedItem/mods:titleInfo/mods:title element
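
For example, a TextContent <div> with an optional Header and the mandatory Body might be laid out roughly as follows. The DMDID value "c001" and the LABEL are hypothetical; they stand for the ID and title of a matching <mods:relatedItem type="constituent"> in the embedded MODS record.

```xml
<div xmlns="http://www.loc.gov/METS/"
     TYPE="TextContent" DMDID="c001" LABEL="Example Article">
  <!-- Optional heading information -->
  <div TYPE="Header">
    <div TYPE="Head"/>
    <div TYPE="Byline"/>
  </div>
  <!-- Always present -->
  <div TYPE="Body">
    <div TYPE="Paragraph" ORDER="1"/>
    <div TYPE="Paragraph" ORDER="2"/>
  </div>
</div>
```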

<div TYPE="Header">

A <div> containing the heading information of its component (a TextContent, SponsoredAd, or Section): a combination of headline and byline. The Header may contain one or more Head elements (encompassing, for example, a headline and a subhead); it may also contain one or more Byline elements, which need not be contiguous in the physical layout of the page.

Attributes:

TYPE
must be "Header"

<div TYPE="Head">

A <div> designating the region associated with a head of some kind: a headline, a subhead, etc.

Attributes:

TYPE
must be "Head"

<div TYPE="Byline">

A <div> designating one or more regions associated with the writer of an article: usually the writer’s name, but sometimes also the writer’s position or other biographical information.

Attributes:

TYPE
must be "Byline"

<div TYPE="Body">

A container <div> for the body of an article or section. A Body may contain paragraphs, illustrations, or sections, in any order.

<div TYPE="Paragraph">

A <div> that contains one or more text blocks representing the contents of a logical paragraph. Paragraphs have a sequential order within their containing article, caption, or sponsored ad.

Attributes:

TYPE
must be "Paragraph"
ORDER
the index of the paragraph in its containing div (1, 2, etc.)

<div TYPE="Section">

A section is a container <div> of other <div>s. It may or may not have a Header; it will contain some combination of articles, illustrations, SponsoredAds, and other sections.

<div TYPE="Illustration">

<div TYPE="Graphic">

A div designating the location of a graphic on the page.

<div TYPE="Caption">

<div TYPE="SponsoredAd">

<div TYPE="TextBlock">

A div designating the region of a block of text on a page.
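
One way a TextBlock could designate its region is with a METS <area> element that points at a block inside the page's ALTO file, as sketched below. This is an assumption about the linking mechanism, not a description of docWorks output, and the ALTO block IDs ("P1_TB00001", "P1_TB00002") are invented.

```xml
<div xmlns="http://www.loc.gov/METS/" TYPE="Paragraph" ORDER="1">
  <!-- A paragraph made up of two text blocks on the same page -->
  <div TYPE="TextBlock">
    <fptr>
      <area FILEID="ALTO001" BETYPE="IDREF" BEGIN="P1_TB00001"/>
    </fptr>
  </div>
  <div TYPE="TextBlock">
    <fptr>
      <area FILEID="ALTO001" BETYPE="IDREF" BEGIN="P1_TB00002"/>
    </fptr>
  </div>
</div>
```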
