HEB EPUB Generation - mlibrary/winterberry GitHub Wiki

This document provides the details for the generation of HEB EPUB books, including both fixed and flowable layout books.

The root directory location for the HEB EPUB books is the following:

ROOTDIR = tang.umdl.umich.edu:/quod-prep/prep/a/acls/hebepub

This location contains the following directories:

  • fixepub - directory contains the fixed layout book specifics. See Layouts for more information.
  • flowepub - directory contains the flowable layout book specifics. See Layouts for more information.
  • resources - directory contains the source for resources used for generating an EPUB. See section Resources for more information.
  • target - directory contains the files used to automate the generation process. See section Implementation for more information.

Table of Contents

  1. Table of Contents
  2. Layouts
  3. Resources
  4. HEB Book Source Structure
  5. Process Invocation
    1. Process 1 - Convert Page Scans (fixepub only)
    2. Process 2 - Generate EPUB Archive
      1. Step 2.1 - Copy DLXS XML Source
      2. Step 2.2 - Copy Resource Files
      3. Step 2.3 - Copy Font Files (flowepub only)
      4. Step 2.4 - Copy Stylesheet Files
      5. Step 2.5 - Determine Book Assets
      6. Step 2.6 - Generate TEI XML File
      7. Step 2.7 - Copy Book Cover Image(s)
      8. Step 2.8 - Determine Image Dimensions (fixepub only)
      9. Step 2.9 - Generate EPUB Structure
      10. Step 2.10 - Copy Image Files / Determine Image Dimensions (flowepub only)
      11. Step 2.11 - Zip EPUB Structure
    3. Process 3 - EPUB Structure Validation
    4. Process 4 - Generate Fulcrum Monograph Bundle Archive
      1. Step 4.1 - Generate Monograph Metadata CSV
      2. Step 4.2 - Determine Media Assets
      3. Step 4.3 - Zip Monograph Bundle
    5. Process 5 - Add Asset Links
    6. Process 6 - Clobber Source Files
  6. Implementation

Layouts

The process for generating a HEB EPUB archive differs based on the layout of the book. Currently, each layout has its own directory within the ROOTDIR location, which holds the book source and any layout-specific resources needed to generate the book. Below are the contents found within a layout directory:

  • books - this directory contains the books for the specified layout. Within this directory, each book has a directory with the book HEB ID as the name. See section HEB Book Source Structure for more information.

  • dlxs - this directory contains the original source DLXS XML files. For flowepub, the contents are a list of XML files, each named with the book HEB ID.

For fixepub, this is a link to the directory tang.umdl.umich.edu:/n1/obj/h/e/b, and the contents consist of a list of directories, one for each HEB book, each named with the book HEB ID. Each book directory contains the source DLXS XML file and all original page scans.

Resources

The resources directory contains the source for the following information used for generating a HEB EPUB:

  • aclsdb - directory containing information provided by ACLS for HEB books. This information includes:

    • copyholder - if a HEB book has copyright information, then this directory contains an HTML file named with the book HEB ID that consists of a table listing the copyright holder organizations and a contact URL for each.
    • related_title - if a HEB book has related titles, then this directory contains an HTML file named with the book HEB ID that consists of a table listing the related titles.
    • reviews - if a HEB book has reviews, then this directory contains an HTML file named with the book HEB ID that consists of a table listing the reviews.
    • series - if a HEB book has series designations, then this directory contains an HTML file named with the book HEB ID that consists of a table listing the designations.
    • subject - if a HEB book has subject designations, then this directory contains an HTML file named with the book HEB ID that consists of a table listing the subjects.
  • assets - directory containing all assets for all HEB books, including cover image, audio, video, and PDF files.

    NOTE: this directory is a link to the following directory: tang.umdl.umich.edu:/n1/web/a/acls/images

  • asset_links - if a HEB book currently has a Fulcrum monograph page with assets listed on the Media tab, then this directory contains a CSV file with the name of the book HEB ID that contains the information for each book asset.

  • marc - if a HEB book has a MARC record, then this directory contains an XML file representation of the record. The file name is the first 5 digits of the HEB ID.

HEB Book Source Structure

The EPUB source for a HEB book can be found in a directory at the following path:

ROOTDIR/layout/books/hebxxxxx.xxxx.xxx

where layout is either fixepub or flowepub and hebxxxxx.xxxx.xxx is the HEB ID assigned to the book. For example, the source for the fixed layout book with ID heb04015.0001.001 can be found in the following directory:

ROOTDIR/fixepub/books/heb04015.0001.001
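The path rule can be sketched in Python as follows. This is only an illustration, not part of the hebepub tooling: the function name book_source_dir is hypothetical, the all-digit HEB ID pattern is an assumption based on the example ID, and the books subdirectory is the one described under Layouts.

```python
import re
from pathlib import PurePosixPath

# Assumed HEB ID shape, inferred from the example heb04015.0001.001.
HEB_ID = re.compile(r"heb\d{5}\.\d{4}\.\d{3}")

def book_source_dir(rootdir, layout, heb_id):
    """Return the source directory for a book (hypothetical helper)."""
    if layout not in ("fixepub", "flowepub"):
        raise ValueError(f"unknown layout: {layout}")
    if not HEB_ID.fullmatch(heb_id):
        raise ValueError(f"not a HEB ID: {heb_id}")
    # Books live in the books directory of their layout (see Layouts).
    return PurePosixPath(rootdir) / layout / "books" / heb_id
```

For example, book_source_dir("/quod-prep/prep/a/acls/hebepub", "fixepub", "heb04015.0001.001") yields the fixed layout directory shown above.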

The source directory for the HEB book with the ID hebxxxxx.xxxx.xxx has the following layout:

  • epub - directory containing the unzipped EPUB book structure, including:
    • mimetype
    • META-INF/{container,metadata}.xml
    • META-INF/src - this directory contains the source XML files used to generate this EPUB, including the original DLXS and TEI XML. This allows the book source to be contained within the EPUB archive itself. Below is a list of the expected contents:
      • hebxxxxx.xxxx.xxx_dlxs.xml - DLXS XML source for the book.
      • hebxxxxx.xxxx.xxx_dlxs_org.xml - original DLXS XML source for the book. This is present only if it required modification before being processed.
      • hebxxxxx.xxxx.xxx_tei.xml - TEI XML source generated by transformation of the DLXS XML source. This file is used to generate the EPUB archive.
      • assets.html - HTML table referenced by the TEI XML file that lists all assets associated with this book, including cover images, book images, audio, video, PDFs, etc. For each asset, the table has a column for each of the following:
        • full path.
        • mime-type.
        • whether it should be included in the EPUB archive.
        • whether it should be listed in the monograph Media tab.
        • whether it is a cover image.
        • whether it is a high-resolution version of the image (determined by the -lg suffix on the file name).
        • width in pixels, if asset is an image.
        • height in pixels, if asset is an image.
        • title, if asset has previously been uploaded to Fulcrum. Generated from the contents of asset_links.
        • NOID, if asset has previously been uploaded to Fulcrum. Generated from the contents of asset_links.
        • URL link to asset page, if asset has previously been uploaded to Fulcrum. Generated from the contents of asset_links.
        • embed code markup, if asset has previously been uploaded to Fulcrum.

        NOTE: for fixed layout books, the page scans are not included in this list.

      • copyholder.html - copy of copyholder.
      • fonts.html - HTML table referenced by the TEI XML file that lists all fonts to be included within the EPUB archive.
      • images.html - HTML table referenced by the TEI XML file that lists all images referenced by the book. For each image, a column exists indicating the format, width, and height. Useful for setting the viewport metadata within a fixed layout page scan HTML file.
      • marc.xml - copy of marc.
      • related.html - copy of related_title. If not empty, then this file is uploaded to Fulcrum and used as the contents of the monograph page Related Titles tab.
      • reviews.html - copy of reviews. If not empty, then this file is uploaded to Fulcrum and used as the contents of the monograph page Reviews tab.
      • series.html - copy of series.
      • stylesheets.html - HTML table referenced by the TEI XML file that lists all CSS stylesheets to be included within the EPUB archive.
      • subject.html - copy of subject.
    • OEBPS - contains the book content.
      • fonts - directory containing the book font files listed in the fonts.html file. This directory is present only for flowepub.
      • images - directory containing the images listed in the images.html file. The book cover images reside here. For the fixed layout books, the page scan images reside here. For flowepub, all images referenced by the hebxxxxx.xxxx.xxx_tei.xml file reside here.
      • styles - directory containing the CSS stylesheet files listed in the stylesheets.html file.
      • xhtml - directory containing the HTML/XHTML files for this book.
      • content_{fixed_ocr,fixed_scan,flow}.opf - EPUB package file. For fixepub, there are two files, one for the page scan rendition (content_fixed_scan.opf) and one for the text OCR rendition (content_fixed_ocr.opf). For flowepub, there is one file (content_flow.opf).
      • toc_{fixed_ocr,fixed_scan,flow}.xhtml - EPUB TOC files. For fixepub, there are two files, one for the page scan rendition (toc_fixed_scan.xhtml) and one for the text OCR rendition (toc_fixed_ocr.xhtml). For flowepub, there is one file (toc_flow.xhtml).
      • page_list_{fixed_ocr,fixed_scan,flow}.xhtml - EPUB page list files. For fixepub, there are two files, one for the page scan rendition (pagelist_fixed_scan.xhtml) and one for the text OCR rendition (pagelist_fixed_ocr.xhtml). For flowepub, there is one file (pagelist_flow.xhtml).
      • chapter_list_{fixed_ocr,fixed_scan,flow}.xhtml - EPUB chapter list files. For fixepub, there are two files, one for the page scan rendition (chapterlist_fixed_scan.xhtml) and one for the text OCR rendition (chapterlist_fixed_ocr.xhtml). For flowepub, there is one file (chapterlist_flow.xhtml).
  • hebxxxxx.xxxx.xxx.epub - archive of the epub directory described above.
  • hebxxxxx.xxxx.xxx_metadata.csv - CSV file that contains the metadata for this book, including the monograph information, asset information, reviews.html, related.html, and the cover image.
  • hebxxxxx.xxxx.xxx.zip - zip file that can be used as a Fulcrum bundle for upload to create a new monograph. The following may be included:
  • epubcheck.xml - may exist as a result of invoking the check production. Contains the output from epubcheck validation.

Process Invocation

The command for invoking any process is the Ruby script hebepub, found within the ROOTDIR/winterberry/script directory. The syntax for this command is:

hebepub production [hebDir…]
  • production - name of the Rake task to invoke. The productions used in this document are convert, epub, check, bundle, and clobber.
  • hebDir - path to one or more HEB EPUB source directories. If no directory is specified, then the current directory is assumed. If a specified directory does not exist, then it is created.

This script sets required environment variables and traverses the list of specified HEB source directories, invoking a Rake task specified by the production parameter on each. See rakefile for the details.
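The directory handling described for hebDir can be sketched as follows. This is a minimal Python illustration of the stated rules only, not the hebepub script itself (which is Ruby); the function name resolve_heb_dirs is hypothetical, and the Rake invocation is deliberately left out because its details are not given here.

```python
import os

def resolve_heb_dirs(args, cwd):
    """Return the HEB source directories to process (hypothetical helper)."""
    # No directory given: the current directory is assumed.
    dirs = list(args) if args else [cwd]
    resolved = []
    for d in dirs:
        path = d if os.path.isabs(d) else os.path.join(cwd, d)
        # A directory that does not exist is created, not treated as an error.
        os.makedirs(path, exist_ok=True)
        resolved.append(path)
    return resolved
```

Each returned directory would then be handed to the Rake task named by the production parameter.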

Below are example invocations:

  1. For the XML title hebxxxxx.xxxx.xxx, the following commands will generate the book EPUB archive file:

    cd /quod-prep/prep/a/acls/hebepub
    target/bin/hebepub epub flowepub/books/hebxxxxx.xxxx.xxx  
    

    The resulting EPUB archive file is stored at flowepub/books/hebxxxxx.xxxx.xxx/hebxxxxx.xxxx.xxx.epub.

  2. For the backlist title hebxxxxx.xxxx.xxx, the following commands will generate the book Fulcrum bundle file:

    cd /quod-prep/prep/a/acls/hebepub
    target/bin/hebepub bundle  fixepub/books/hebxxxxx.xxxx.xxx  
    

    The resulting bundle file is stored at
    fixepub/books/hebxxxxx.xxxx.xxx/hebxxxxx.xxxx.xxx.zip.

  3. The following commands will rebuild the bundle file for backlist title hebxxxxx.xxxx.xxx:

    cd /quod-prep/prep/a/acls/hebepub
    target/bin/hebepub clobber fixepub/books/hebxxxxx.xxxx.xxx  
    target/bin/hebepub bundle  fixepub/books/hebxxxxx.xxxx.xxx  
    
  4. The following commands will invoke epubcheck on the specified EPUB source directory:

    cd /quod-prep/prep/a/acls/hebepub  
    target/bin/hebepub check fixepub/books/hebxxxxx.xxxx.xxx  
    

    The output from epubcheck is stored in the file
    fixepub/books/hebxxxxx.xxxx.xxx/epubcheck.xml

  5. The following commands will convert page scan TIF images to PNG for backlist title hebxxxxx.xxxx.xxx:

    cd /quod-prep/prep/a/acls/hebepub
    target/bin/hebepub convert fixepub/books/hebxxxxx.xxxx.xxx  
    

    The new PNG files are stored in the directory:
    fixepub/books/hebxxxxx.xxxx.xxx/epub/OEBPS/images

The following sections describe the steps necessary to perform the HEB EPUB processes.

Process 1 - Convert Page Scans (fixepub only)

For fixepub books, original page scan files for the book are expected to reside in the following directory:

ROOTDIR/fixepub/dlxs/hebxxxxx.xxxx.xxx

where hebxxxxx.xxxx.xxx is the HEB ID assigned to the book. The file format is expected to be either TIF or JP2. The EPUB specification does not support TIF, so it is necessary to convert the original scans to PNG.

The conversion can be done by invoking the following commands:

cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub convert fixepub/books/hebxxxxx.xxxx.xxx

The new PNG files are stored in the images directory within the hebxxxxx.xxxx.xxx source directory.
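As a rough sketch of the bookkeeping involved, the Python snippet below maps TIF scan names to the PNG paths they would be written to. The function name plan_scan_conversion and the sample file names are hypothetical, and the pixel conversion itself is left to whatever image tool the convert production uses. JP2 scans are skipped because their handling is not stated here.

```python
from pathlib import PurePosixPath

def plan_scan_conversion(scan_names, images_dir):
    """Map each TIF page scan to its target PNG path (hypothetical helper)."""
    plan = {}
    for name in sorted(scan_names):
        scan = PurePosixPath(name)
        # Only TIF scans are planned; other files (XML, JP2) are ignored.
        if scan.suffix.lower() in (".tif", ".tiff"):
            plan[name] = str(PurePosixPath(images_dir) / (scan.stem + ".png"))
    return plan
```

With images_dir set to epub/OEBPS/images, a scan named 00000001.tif maps to epub/OEBPS/images/00000001.png.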

Process 2 - Generate EPUB Archive

Below are the process steps necessary to generate the HEB book EPUB archive. The following commands can be used to invoke this process and generate a HEB EPUB for the book with the ID hebxxxxx.xxxx.xxx:

cd /quod-prep/prep/a/acls/hebepub  
target/bin/hebepub epub layout/books/hebxxxxx.xxxx.xxx

Step 2.1 - Copy DLXS XML Source

The book DLXS XML source is expected to be found at the path listed in the table below and is copied to hebxxxxx.xxxx.xxx_dlxs.xml in the META-INF/src directory.

Layout      Source Path
fixepub     ROOTDIR/fixepub/dlxs/hebxxxxx.xxxx.xxx/hebxxxxx.xxxx.xxx.xml
flowepub    ROOTDIR/flowepub/dlxs/hebxxxxx.xxxx.xxx.xml
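The layout-dependent lookup can be expressed as a short Python sketch. The function name dlxs_source_path is hypothetical and the code only restates the two source paths listed for this step.

```python
from pathlib import PurePosixPath

def dlxs_source_path(rootdir, layout, heb_id):
    """Return the expected DLXS XML source path (hypothetical helper)."""
    root = PurePosixPath(rootdir)
    if layout == "fixepub":
        # fixepub: one directory per book, holding the XML and page scans.
        return root / "fixepub" / "dlxs" / heb_id / f"{heb_id}.xml"
    if layout == "flowepub":
        # flowepub: a flat list of XML files named by HEB ID.
        return root / "flowepub" / "dlxs" / f"{heb_id}.xml"
    raise ValueError(f"unknown layout: {layout}")
```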

Step 2.2 - Copy Resource Files

The following resources are copied from their source paths into the META-INF/src (METAINFSRC) directory and are referenced by the hebxxxxx.xxxx.xxx_tei.xml file:

Resource           Source Path
marc.xml           ROOTDIR/resources/marc/hebxxxxx.xml
copyholder.html    ROOTDIR/resources/aclsdb/copyholder/hebxxxxx.xxxx.xxx.xml
related.html       ROOTDIR/resources/aclsdb/related_title/hebxxxxx.xxxx.xxx.xml
reviews.html       ROOTDIR/resources/aclsdb/reviews/hebxxxxx.xxxx.xxx.xml
series.html        ROOTDIR/resources/aclsdb/series/hebxxxxx.xxxx.xxx.xml
subject.html       ROOTDIR/resources/aclsdb/subject/hebxxxxx.xxxx.xxx.xml
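A Python sketch of this resource mapping is shown below. The names ACLSDB_RESOURCES and resource_source_paths are hypothetical, and the file extensions are taken as listed for this step. Note that the MARC file uses only the leading hebxxxxx part of the HEB ID, while the ACLS files use the full ID.

```python
from pathlib import PurePosixPath

# Target file name in META-INF/src -> aclsdb subdirectory it is copied from.
ACLSDB_RESOURCES = {
    "copyholder.html": "copyholder",
    "related.html": "related_title",
    "reviews.html": "reviews",
    "series.html": "series",
    "subject.html": "subject",
}

def resource_source_paths(rootdir, heb_id):
    """Map each resource target name to its source path (hypothetical helper)."""
    resources = PurePosixPath(rootdir) / "resources"
    # The MARC file name uses only the leading part, e.g. heb04015.
    title_id = heb_id.split(".")[0]
    paths = {"marc.xml": str(resources / "marc" / f"{title_id}.xml")}
    for target, subdir in ACLSDB_RESOURCES.items():
        # Extension kept as listed for this step.
        paths[target] = str(resources / "aclsdb" / subdir / f"{heb_id}.xml")
    return paths
```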

Step 2.3 - Copy Font Files (flowepub only)

The font files are found in the directory

ROOTDIR/flowepub/fonts

and are copied into the fonts directory. The fonts.html file is generated during this step. For fixepub, there are currently no fonts to include, so this file is empty.

Step 2.4 - Copy Stylesheet Files

The CSS stylesheet file(s) are found in the directory

ROOTDIR/layout/styles

where layout is either fixepub or flowepub and are copied into the styles directory. The stylesheets.html file is generated during this step.

Step 2.5 - Determine Book Assets

Assets include the book cover image, audio, video, PDF, and other files. These files use the book HEB ID as the filename prefix and are expected to reside in the directory:

ROOTDIR/resources/assets

No files are copied during this step. The assets.html file is generated.
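The selection by HEB ID prefix can be sketched in Python as follows. The function name classify_assets and the sample file names are hypothetical. The cover and high-resolution rules follow the hebxxxxx.xxxx.xxx.jpg and -lg naming described in this document, and the real assets.html carries more columns than this sketch produces.

```python
def classify_assets(filenames, heb_id):
    """Pick the assets of one book and flag covers and hi-res images."""
    rows = []
    for name in sorted(filenames):
        # Assets of a book use its HEB ID as the file name prefix.
        if not name.startswith(heb_id):
            continue
        stem, _, ext = name.rpartition(".")
        # The -lg suffix marks the high-resolution version of an image.
        is_hires = stem.endswith("-lg")
        base = stem[:-3] if is_hires else stem
        # Cover images are <HEB ID>.jpg and <HEB ID>-lg.jpg.
        is_cover = base == heb_id and ext.lower() == "jpg"
        rows.append({"name": name, "cover": is_cover, "hires": is_hires})
    return rows
```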

Step 2.6 - Generate TEI XML File

Transform the DLXS XML file hebxxxxx.xxxx.xxx_dlxs.xml to the TEI XML file hebxxxxx.xxxx.xxx_tei.xml using the XSLT stylesheet hebdlxs2tei.xsl.

Step 2.7 - Copy Book Cover Image(s)

The cover image(s) are located at the paths:

ROOTDIR/resources/assets/hebxxxxx.xxxx.xxx.jpg
ROOTDIR/resources/assets/hebxxxxx.xxxx.xxx-lg.jpg (higher resolution, may not be present)

and are copied into the images directory.

Step 2.8 - Determine Image Dimensions (fixepub only)

For fixepub only, all images found in the images directory, including cover images and page scan image files (see Process 1), are used as input for generating the images.html file.
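One way to obtain the dimensions of a PNG page scan is to read the IHDR chunk at the start of the file, as in the Python sketch below. The function names png_dimensions and viewport_meta are hypothetical, and the actual process may rely on an image tool instead. The viewport string follows the usual EPUB fixed layout form.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_dimensions(data):
    """Return (width, height) read from the leading bytes of a PNG file."""
    # Bytes 0-7: signature; 8-11: chunk length; 12-15: chunk type IHDR.
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # IHDR starts with width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height

def viewport_meta(width, height):
    """Build the viewport element for a fixed layout page."""
    return f'<meta name="viewport" content="width={width}, height={height}"/>'
```

Only the first 24 bytes of each file are needed, so the whole image never has to be loaded.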

Step 2.9 - Generate EPUB Structure

The TEI XML file hebxxxxx.xxxx.xxx_tei.xml contains references to the resources (marc.xml, copyholder.html, related.html, reviews.html, series.html, subject.html), fonts (fonts.html), CSS stylesheets (stylesheets.html), and book assets (assets.html). It is used as the input to an XSLT transformation to produce the EPUB book content found in the OEBPS directory. The XSLT stylesheets hebtei2fixepub.xsl (fixepub) and hebtei2flowepub.xsl (flowepub) are used for the transformation.

Step 2.10 - Copy Image Files / Determine Image Dimensions (flowepub only)

For flowepub only, all images referenced by the TEI XML file hebxxxxx.xxxx.xxx_tei.xml are copied to the images directory. Then all images found in the images directory, including cover images, are used as input for generating the images.html file.

Step 2.11 - Zip EPUB Structure

The hebxxxxx.xxxx.xxx.epub file is generated by zipping the contents of the epub directory.
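The EPUB container format requires the mimetype file to be the first entry in the archive and to be stored uncompressed. The Python sketch below shows one way to zip an epub directory under that rule. The function name zip_epub is hypothetical and this is not necessarily how the rakefile performs the step.

```python
import os
import zipfile

def zip_epub(epub_dir, archive_path):
    """Zip an unpacked EPUB directory into archive_path (hypothetical helper)."""
    with zipfile.ZipFile(archive_path, "w") as zf:
        # mimetype goes first and is stored without compression.
        zf.write(os.path.join(epub_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for folder, subdirs, files in os.walk(epub_dir):
            subdirs.sort()  # deterministic traversal order
            for name in sorted(files):
                full = os.path.join(folder, name)
                rel = os.path.relpath(full, epub_dir).replace(os.sep, "/")
                if rel == "mimetype":
                    continue  # already written above
                zf.write(full, rel, compress_type=zipfile.ZIP_DEFLATED)
```

The archive path should lie outside the epub directory, as it does for hebxxxxx.xxxx.xxx.epub in the book source directory, so that the archive does not try to include itself.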

Process 3 - EPUB Structure Validation

Any HEB EPUB directory can be validated by invoking the following commands:

cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub check layout/books/hebxxxxx.xxxx.xxx

where layout is either fixepub or flowepub. The results can be found in the file

ROOTDIR/layout/books/hebxxxxx.xxxx.xxx/epubcheck.xml

Process 4 - Generate Fulcrum Monograph Bundle Archive

To create a new monograph for a HEB EPUB, the hebxxxxx.xxxx.xxx.zip may be generated by invoking the following commands:

cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub bundle layout/books/hebxxxxx.xxxx.xxx

where layout is either fixepub or flowepub. Below are the process steps for generating the bundle. This process may invoke Process 2 if the hebxxxxx.xxxx.xxx.epub has not been previously generated.

Step 4.1 - Generate Monograph Metadata CSV

The TEI XML file hebxxxxx.xxxx.xxx_tei.xml is used as the input to an XSLT transformation to produce the monograph metadata file hebxxxxx.xxxx.xxx_metadata.csv. The XSLT stylesheet hebtei2meta.xsl is used for the transformation.

Step 4.2 - Determine Media Assets

The assets.html file is used to determine the assets that are to be uploaded to Fulcrum and listed on the monograph Media tab.

Step 4.3 - Zip Monograph Bundle

Once the files from the previous steps have been generated, they are zipped to produce the bundle hebxxxxx.xxxx.xxx.zip.

Process 5 - Add Asset Links

Links to existing Fulcrum Media asset pages may be added to an EPUB by performing the following steps:

  1. Download the asset CSV file from the Fulcrum monograph page.
  2. Rename the CSV using the HEB ID, hebxxxxx.xxxx.xxx.csv, and store it in the ROOTDIR/resources/asset_links directory.
  3. Invoke the commands in the section Process 2 - Generate EPUB Archive to regenerate the EPUB archive.
  4. Replace or re-version the EPUB archive file asset on the Fulcrum monograph page.

Process 6 - Clobber Source Files

To re-generate either a HEB EPUB or Fulcrum bundle, invoke the following commands:

cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub clobber layout/books/hebxxxxx.xxxx.xxx

This removes the necessary files to allow the HEB book source directory to be rebuilt from scratch. This removes all files except the following:

For fixepub, the above files are kept, and the following is also kept:

  • images directory. Because converting the original TIF files to PNG can be time consuming, the page scans are not removed.

Implementation

Below is a description of the ROOTDIR/target directory and the files contained within:
