HEB EPUB Generation - mlibrary/winterberry GitHub Wiki
This document provides the details for the generation of HEB EPUB books, including both fixed and flowable layout books.
The root directory location for the HEB EPUB books is the following:
ROOTDIR = tang.umdl.umich.edu:/quod-prep/prep/a/acls/hebepub
This location contains the following directories:
- fixepub - directory contains the fixed layout book specifics. See Layouts for more information.
- flowepub - directory contains the flowable layout book specifics. See Layouts for more information.
- resources - directory contains the source for resources used for generating an EPUB. See section Resources for more information.
- target - directory contains the files used to automate the generation process. See section Implementation for more information.
Table of Contents
- Layouts
- Resources
- HEB Book Source Structure
- Process Invocation
  - Process 1 - Convert Page Scans (fixepub only)
  - Process 2 - Generate EPUB Archive
    - Step 2.1 - Copy DLXS XML Source
    - Step 2.2 - Copy Resource Files
    - Step 2.3 - Copy Font Files (flowepub only)
    - Step 2.4 - Copy Stylesheet Files
    - Step 2.5 - Determine Book Assets
    - Step 2.6 - Generate TEI XML File
    - Step 2.7 - Copy Book Cover Image(s)
    - Step 2.8 - Determine Image Dimensions (fixepub only)
    - Step 2.9 - Generate EPUB Structure
    - Step 2.10 - Copy Image Files / Determine Image Dimensions (flowepub only)
    - Step 2.11 - Zip EPUB Structure
  - Process 3 - EPUB Structure Validation
  - Process 4 - Generate Fulcrum Monograph Bundle Archive
  - Process 5 - Add Asset Links
  - Process 6 - Clobber Source Files
- Implementation
Layouts
The process for generating a HEB EPUB archive differs based on the layout of the book. Each layout currently has its own directory within ROOTDIR that holds the book sources and any layout-specific resources needed to generate the books. A layout directory contains the following:
- books - this directory contains the books for the specified layout. Within this directory, each book has a directory with the book HEB ID as the name. See section HEB Book Source Structure for more information.
- dlxs - this directory contains the original source DLXS XML files. For flowepub, the contents are XML files, each named with the book HEB ID. For fixepub, this is a link to the directory tang.umdl.umich.edu:/n1/obj/h/e/b, which contains one directory per HEB book, named with the book HEB ID. Each book directory contains the source DLXS XML file and all original page scans.
Resources
The resources directory contains the source for the following information used for generating a HEB EPUB:
- aclsdb - directory containing information provided by ACLS for HEB books. This information includes:
  - copyholder - if a HEB book has copyright information, then this directory contains an HTML file, named with the book HEB ID, that consists of a table listing the copyright holder organizations and a contact URL for each.
  - related_title - if a HEB book has related titles, then this directory contains an HTML file, named with the book HEB ID, that consists of a table listing the related titles.
  - reviews - if a HEB book has reviews, then this directory contains an HTML file, named with the book HEB ID, that consists of a table listing the reviews.
  - series - if a HEB book has series designations, then this directory contains an HTML file, named with the book HEB ID, that consists of a table listing the designations.
  - subject - if a HEB book has subject designations, then this directory contains an HTML file, named with the book HEB ID, that consists of a table listing the subjects.
- assets - directory containing all assets for all HEB books, including cover image, audio, video, and PDF files.
  NOTE: this directory is a link to the following directory: tang.umdl.umich.edu:/n1/web/a/acls/images
- asset_links - if a HEB book currently has a Fulcrum monograph page with assets listed on the Media tab, then this directory contains a CSV file, named with the book HEB ID, that contains the information for each book asset.
- marc - if a HEB book has a MARC record, then this directory contains an XML file representation of the record. The file name consists of the heb prefix and the first 5 digits of the HEB ID (hebxxxxx.xml).
HEB Book Source Structure
The EPUB source for a HEB book can be found in a directory at the following path:
ROOTDIR/layout/books/hebxxxxx.xxxx.xxx
where layout is either fixepub or flowepub and hebxxxxx.xxxx.xxx is the HEB ID assigned to the book. For example, the source for the fixed layout book with ID heb04015.0001.001 can be found in the following directory:
ROOTDIR/fixepub/books/heb04015.0001.001
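The path rule can be sketched in Ruby as follows. This is a minimal illustration, not code from the winterberry scripts: the helper name and the ID pattern (inferred from the example heb04015.0001.001) are assumptions, and the books subdirectory follows the description in the Layouts section.

```ruby
# Hypothetical helper. The ID pattern is inferred from the example
# heb04015.0001.001 and is an assumption, not a documented rule.
HEB_ID_PATTERN = /\Aheb\d{5}\.\d{4}\.\d{3}\z/
HEB_LAYOUTS = %w[fixepub flowepub].freeze

# Build the source directory path for one book under a layout's books directory.
def heb_source_dir(root_dir, layout, heb_id)
  raise ArgumentError, "unknown layout: #{layout}" unless HEB_LAYOUTS.include?(layout)
  raise ArgumentError, "malformed HEB ID: #{heb_id}" unless HEB_ID_PATTERN.match?(heb_id)

  File.join(root_dir, layout, "books", heb_id)
end

puts heb_source_dir("/quod-prep/prep/a/acls/hebepub", "fixepub", "heb04015.0001.001")
# prints /quod-prep/prep/a/acls/hebepub/fixepub/books/heb04015.0001.001
```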
The source directory for the HEB book with the ID hebxxxxx.xxxx.xxx has the following layout:
- epub - directory containing the unzipped EPUB book structure, including:
  - mimetype
  - META-INF/{container,metadata}.xml
  - META-INF/src - this directory contains the source XML files used to generate this EPUB, including the original DLXS and TEI XML. This allows the book source to be contained within the EPUB archive itself. Below is a list of the expected contents:
    - hebxxxxx.xxxx.xxx_dlxs.xml - DLXS XML source for the book.
    - hebxxxxx.xxxx.xxx_dlxs_org.xml - original DLXS XML source for the book. This is present only if the source required modification before being processed.
    - hebxxxxx.xxxx.xxx_tei.xml - TEI XML source generated by transformation of the DLXS XML source. This file is used to generate the EPUB archive.
    - assets.html - HTML table referenced by the TEI XML file that lists all assets associated with this book, including cover images, book images, audio, video, PDFs, etc. For each asset, the table has a column for each of the following:
      - full path.
      - mime-type.
      - whether it should be included in the EPUB archive.
      - whether it should be listed in the monograph Media tab.
      - whether it is a cover image.
      - whether it is a hi-resolution version of the image (determined by the -lg suffix on the file name).
      - width in pixels, if the asset is an image.
      - height in pixels, if the asset is an image.
      - title, if the asset has previously been uploaded to Fulcrum. Generated from the contents of asset_links.
      - NOID, if the asset has previously been uploaded to Fulcrum. Generated from the contents of asset_links.
      - URL link to the asset page, if the asset has previously been uploaded to Fulcrum. Generated from the contents of asset_links.
      - embed code markup, if the asset has previously been uploaded to Fulcrum.
      NOTE: for fixed layout books, the page scans are not included in this list.
    - copyholder.html - copy of the copyholder resource file.
    - fonts.html - HTML table referenced by the TEI XML file that lists all fonts to be included within the EPUB archive.
    - images.html - HTML table referenced by the TEI XML file that lists all images referenced by the book. For each image, columns indicate the format, width, and height. This is useful for setting the viewport metadata within a fixed layout page scan HTML file.
    - marc.xml - copy of the marc resource file.
    - related.html - copy of the related_title resource file. If not empty, then this file is uploaded to Fulcrum and used as the contents of the monograph page Related Titles tab.
    - reviews.html - copy of the reviews resource file. If not empty, then this file is uploaded to Fulcrum and used as the contents of the monograph page Reviews tab.
    - series.html - copy of the series resource file.
    - stylesheets.html - HTML table referenced by the TEI XML file that lists all CSS stylesheets to be included within the EPUB archive.
    - subject.html - copy of the subject resource file.
  - OEBPS - contains the book content.
    - fonts - directory containing the book font files listed in the fonts.html file. This directory is present only for flowepub.
    - images - directory containing the images listed in the images.html file. The book cover images reside here. For fixepub, the page scan images reside here. For flowepub, all images referenced by the hebxxxxx.xxxx.xxx_tei.xml file reside here.
    - styles - directory containing the CSS stylesheet files listed in the stylesheets.html file.
    - xhtml - directory containing the HTML/XHTML files for this book.
    - content_{fixed_ocr,fixed_scan,flow}.opf - EPUB package file. For fixepub, there are two files, one for the page scan rendition (content_fixed_scan.opf) and one for the text OCR rendition (content_fixed_ocr.opf). For flowepub, there is one file (content_flow.opf).
    - toc_{fixed_ocr,fixed_scan,flow}.xhtml - EPUB TOC files. For fixepub, there are two files, one for the page scan rendition (toc_fixed_scan.xhtml) and one for the text OCR rendition (toc_fixed_ocr.xhtml). For flowepub, there is one file (toc_flow.xhtml).
    - page_list_{fixed_ocr,fixed_scan,flow}.xhtml - EPUB page list files. For fixepub, there are two files, one for the page scan rendition (pagelist_fixed_scan.xhtml) and one for the text OCR rendition (pagelist_fixed_ocr.xhtml). For flowepub, there is one file (pagelist_flow.xhtml).
    - chapter_list_{fixed_ocr,fixed_scan,flow}.xhtml - EPUB chapter list files. For fixepub, there are two files, one for the page scan rendition (chapterlist_fixed_scan.xhtml) and one for the text OCR rendition (chapterlist_fixed_ocr.xhtml). For flowepub, there is one file (chapterlist_flow.xhtml).
- hebxxxxx.xxxx.xxx.epub - archive of the epub directory described above.
- hebxxxxx.xxxx.xxx_metadata.csv - CSV file that contains the metadata for this book, including the monograph information, asset information, reviews.html, related.html, and the cover image.
- hebxxxxx.xxxx.xxx.zip - zip file that can be used as a Fulcrum bundle for upload to create a new monograph. The following may be included:
  - hebxxxxx.xxxx.xxx_metadata.csv
  - hebxxxxx.xxxx.xxx-lg.jpg - book cover image.
  - hebxxxxx.xxxx.xxx.epub
  - related.html - if not empty.
  - reviews.html - if not empty.
  - Media assets - the book media assets listed in the assets.html file.
- epubcheck.xml - may exist as a result of an invocation of the check production. Contains the output from epubcheck validation.
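As a quick sanity check of a book source directory, the layout above can be compared against what is on disk. The sketch below is hypothetical (it is not part of the hebepub tooling) and only checks the entries the structure lists for both layouts; layout-specific files such as fonts or the individual .opf renditions are deliberately left out.

```ruby
# Entries the structure above lists under epub/ for both layouts.
# The constant and method names are illustrative assumptions.
EXPECTED_EPUB_ENTRIES = [
  "mimetype",
  "META-INF/container.xml",
  "META-INF/metadata.xml",
  "META-INF/src",
  "OEBPS"
].freeze

# Return the expected entries that are missing under <book_dir>/epub.
def missing_epub_entries(book_dir)
  epub_dir = File.join(book_dir, "epub")
  EXPECTED_EPUB_ENTRIES.reject { |entry| File.exist?(File.join(epub_dir, entry)) }
end
```

An empty result means every listed entry is present; it says nothing about whether the EPUB itself is valid, which is what the check production is for.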
Process Invocation
The command for invoking any process is the Ruby script hebepub found within the ROOTDIR/winterberry/script directory. The syntax for this command is:
hebepub production [hebDir…]
- production - one of the following:
  - bundle - generates a Fulcrum Monograph Bundle Archive.
  - check - performs EPUB Structure Validation.
  - clobber - removes the files in a HEB source directory so that it can be re-generated. See Clobber Source Files.
  - convert - for fixepub, converts original HEB source page scan files from TIF/JP2 to PNG.
  - epub - generates the HEB EPUB Archive.
- hebDir - path to 1 or more HEB EPUB source directories. If no directory is specified, then the current directory is assumed. If a specified directory does not exist, then it is created.
This script sets required environment variables and traverses the list of specified HEB source directories, invoking a Rake task specified by the production parameter on each. See rakefile for the details.
Below are example invocations:
- For the XML title hebxxxxx.xxxx.xxx, the following commands will generate the book EPUB archive file:
  cd /quod-prep/prep/a/acls/hebepub
  target/bin/hebepub epub flowepub/books/hebxxxxx.xxxx.xxx
  The resulting EPUB archive file is stored at flowepub/books/hebxxxxx.xxxx.xxx/hebxxxxx.xxxx.xxx.epub.
- For the backlist title hebxxxxx.xxxx.xxx, the following commands will generate the book Fulcrum bundle file:
  cd /quod-prep/prep/a/acls/hebepub
  target/bin/hebepub bundle fixepub/books/hebxxxxx.xxxx.xxx
  The resulting bundle file is stored at fixepub/books/hebxxxxx.xxxx.xxx/hebxxxxx.xxxx.xxx.zip.
- The following commands will rebuild the bundle file for backlist title hebxxxxx.xxxx.xxx:
  cd /quod-prep/prep/a/acls/hebepub
  target/bin/hebepub clobber fixepub/books/hebxxxxx.xxxx.xxx
  target/bin/hebepub bundle fixepub/books/hebxxxxx.xxxx.xxx
- The following commands will invoke epubcheck on the specified EPUB source directory:
  cd /quod-prep/prep/a/acls/hebepub
  target/bin/hebepub check fixepub/books/hebxxxxx.xxxx.xxx
  The output from epubcheck is stored in the file fixepub/books/hebxxxxx.xxxx.xxx/epubcheck.xml.
- The following commands will convert page scan TIF images to PNG for backlist title hebxxxxx.xxxx.xxx:
  cd /quod-prep/prep/a/acls/hebepub
  target/bin/hebepub convert fixepub/books/hebxxxxx.xxxx.xxx
  The new PNG files are stored in the directory fixepub/books/hebxxxxx.xxxx.xxx/epub/OEBPS/images.
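The driver behaviour described above (accept a production, default to the current directory, visit each HEB directory in turn) might look roughly like the following. This is a sketch under assumptions rather than the actual hebepub script: the method names are invented, the rake command shape is a guess, and directory creation and environment setup are omitted. The commands are only planned, not executed.

```ruby
# Productions named in this section.
HEBEPUB_PRODUCTIONS = %w[bundle check clobber convert epub].freeze

# Resolve the hebDir arguments: fall back to the current directory
# when none are given, and make every path absolute.
def resolve_heb_dirs(args, cwd: Dir.pwd)
  dirs = args.empty? ? [cwd] : args
  dirs.map { |dir| File.expand_path(dir, cwd) }
end

# Build (but do not run) one rake invocation per HEB source directory.
# The ["rake", production] command shape is an assumption.
def plan_invocations(production, args, cwd: Dir.pwd)
  unless HEBEPUB_PRODUCTIONS.include?(production)
    raise ArgumentError, "unknown production: #{production}"
  end

  resolve_heb_dirs(args, cwd: cwd).map do |dir|
    { dir: dir, command: ["rake", production] }
  end
end
```

Keeping the planning step free of side effects makes it easy to inspect what would run before handing each command to something like Kernel#system with the chdir option.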
The following sections describe the steps necessary to perform the HEB EPUB processes.
Process 1 - Convert Page Scans (fixepub only)
For fixepub books, original page scan files for the book are expected to reside in the following directory:
ROOTDIR/fixepub/dlxs/hebxxxxx.xxxx.xxx
where hebxxxxx.xxxx.xxx is the HEB ID assigned to the book. The file format is expected to be either TIF or JP2. Neither format is among the image formats the EPUB specification supports as core media types, so the original scans must be converted to PNG.
The conversion can be done by invoking the following commands:
cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub convert fixepub/books/hebxxxxx.xxxx.xxx
The new PNG files are stored in the epub/OEBPS/images directory within the hebxxxxx.xxxx.xxx source directory.
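The file selection part of this process can be sketched as below. The actual pixel conversion is done by the Java jars described under Implementation and is not reproduced here; the method name is hypothetical and the sketch only maps each TIF/JP2 scan to the PNG path it would be written to.

```ruby
# Extensions of the original page scan formats named above.
SCAN_EXTENSIONS = %w[.tif .jp2].freeze

# Map each TIF/JP2 page scan to its PNG target in the images directory.
# Files with other extensions (such as the DLXS XML file) are ignored.
def png_targets(scan_paths, images_dir)
  scans = scan_paths.select { |path| SCAN_EXTENSIONS.include?(File.extname(path).downcase) }
  scans.to_h { |path| [path, File.join(images_dir, File.basename(path, ".*") + ".png")] }
end
```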
Process 2 - Generate EPUB Archive
Below are the process steps necessary to generate the HEB book EPUB archive. The following commands invoke this process and generate a HEB EPUB for the book with the ID hebxxxxx.xxxx.xxx:
cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub epub layout/books/hebxxxxx.xxxx.xxx
where layout is either fixepub or flowepub.
Step 2.1 - Copy DLXS XML Source
The book DLXS XML source is expected to be found at the path listed in the table below and is copied to the file hebxxxxx.xxxx.xxx_dlxs.xml in the META-INF/src directory.
| Layout | Source Path |
|---|---|
| fixepub | ROOTDIR/fixepub/dlxs/hebxxxxx.xxxx.xxx/hebxxxxx.xxxx.xxx.xml |
| flowepub | ROOTDIR/flowepub/dlxs/hebxxxxx.xxxx.xxx.xml |
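The two source locations differ only in whether the XML file sits inside a per-book directory. A small Ruby sketch of that lookup, with a hypothetical method name:

```ruby
# Return the expected DLXS XML source path for a book.
# fixepub keeps the XML inside a per-book directory alongside the page scans;
# flowepub keeps one XML file per book directly in the dlxs directory.
def dlxs_source_path(root_dir, layout, heb_id)
  case layout
  when "fixepub"
    File.join(root_dir, "fixepub", "dlxs", heb_id, "#{heb_id}.xml")
  when "flowepub"
    File.join(root_dir, "flowepub", "dlxs", "#{heb_id}.xml")
  else
    raise ArgumentError, "unknown layout: #{layout}"
  end
end
```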
Step 2.2 - Copy Resource Files
The following resources are copied from their source paths into the META-INF/src directory and referenced by the hebxxxxx.xxxx.xxx_tei.xml file:
| Resource | Source Paths |
|---|---|
| marc.xml | ROOTDIR/resources/marc/hebxxxxx.xml |
| copyholder.html | ROOTDIR/resources/aclsdb/copyholder/hebxxxxx.xxxx.xxx.xml |
| related.html | ROOTDIR/resources/aclsdb/related_title/hebxxxxx.xxxx.xxx.xml |
| reviews.html | ROOTDIR/resources/aclsdb/reviews/hebxxxxx.xxxx.xxx.xml |
| series.html | ROOTDIR/resources/aclsdb/series/hebxxxxx.xxxx.xxx.xml |
| subject.html | ROOTDIR/resources/aclsdb/subject/hebxxxxx.xxxx.xxx.xml |
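The mapping from a HEB ID to these source paths can be written down compactly. The sketch below is illustrative only: the method and constant names are made up, the target file names are taken from the META-INF/src listing in HEB Book Source Structure, and the .xml extensions follow the paths in this step as written.

```ruby
# Target file name in META-INF/src => aclsdb subdirectory it is copied from.
ACLS_RESOURCE_DIRS = {
  "copyholder.html" => "copyholder",
  "related.html"    => "related_title",
  "reviews.html"    => "reviews",
  "series.html"     => "series",
  "subject.html"    => "subject"
}.freeze

# Return a hash of target file name => expected source path.
def resource_sources(root_dir, heb_id)
  # The MARC file uses only the leading part of the ID:
  # heb04015.0001.001 -> heb04015.xml
  marc_name = "#{heb_id.split('.').first}.xml"
  sources = { "marc.xml" => File.join(root_dir, "resources", "marc", marc_name) }

  ACLS_RESOURCE_DIRS.each do |target, subdir|
    sources[target] = File.join(root_dir, "resources", "aclsdb", subdir, "#{heb_id}.xml")
  end
  sources
end
```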
Step 2.3 - Copy Font Files (flowepub only)
The font files are found in the directory
ROOTDIR/flowepub/fonts
and copied into the fonts directory. The fonts.html file is generated during this step. For fixepub, there are currently no fonts to include, so this file is empty.
Step 2.4 - Copy Stylesheet Files
The CSS stylesheet file(s) are found in the directory
ROOTDIR/layout/styles
where layout is either fixepub or flowepub, and are copied into the styles directory. The stylesheets.html file is generated during this step.
Step 2.5 - Determine Book Assets
Assets include the book cover image, audio, video, PDF, and other files. These files use the book HEB ID as the prefix for the filename and are expected to reside in the directory:
ROOTDIR/resources/assets
No files are copied during this step; only the assets.html file is generated.
Step 2.6 - Generate TEI XML File
The DLXS XML file hebxxxxx.xxxx.xxx_dlxs.xml is transformed to the TEI XML file hebxxxxx.xxxx.xxx_tei.xml using the XSLT stylesheet hebdlxs2tei.xsl.
Step 2.7 - Copy Book Cover Image(s)
The cover image(s) are located at the paths:
ROOTDIR/resources/assets/hebxxxxx.xxxx.xxx.jpg
ROOTDIR/resources/assets/hebxxxxx.xxxx.xxx-lg.jpg (higher resolution, may not be present)
and are copied into the images directory.
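Because the higher resolution cover may be absent, a copy step has to check which of the two files actually exist. A minimal sketch, with a hypothetical method name:

```ruby
# Return the cover image files that exist for a book: the standard cover
# and, when present, the higher resolution -lg variant.
def cover_image_sources(assets_dir, heb_id)
  standard = File.join(assets_dir, "#{heb_id}.jpg")
  large    = File.join(assets_dir, "#{heb_id}-lg.jpg")
  [standard, large].select { |path| File.file?(path) }
end
```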
Step 2.8 - Determine Image Dimensions (fixepub only)
For fixepub only, all images found in the images directory, including cover images and page scan image files (see Process 1), are used as input for generating the images.html file.
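For PNG page scans, the width and height can be read directly from the file header without decoding the image, since the PNG format stores both as big-endian 32-bit integers in the IHDR chunk that follows the 8-byte signature. The sketch below illustrates that idea and how the result feeds the fixed layout viewport metadata; it is not the hebepub implementation, the method names are hypothetical, and JPEG cover images would need separate handling.

```ruby
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Read [width, height] from a PNG file's IHDR chunk.
# Layout: 8-byte signature, 4-byte chunk length, "IHDR", width, height.
def png_dimensions(path)
  header = File.binread(path, 24)
  valid = header && header.bytesize == 24 &&
          header.start_with?(PNG_SIGNATURE) && header[12, 4] == "IHDR"
  raise ArgumentError, "not a PNG file: #{path}" unless valid

  header[16, 8].unpack("N2")
end

# Viewport declaration for a fixed layout XHTML page of the given size.
def viewport_meta(width, height)
  %(<meta name="viewport" content="width=#{width}, height=#{height}"/>)
end
```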
Step 2.9 - Generate EPUB Structure
The TEI XML file hebxxxxx.xxxx.xxx_tei.xml contains references to the resources (marc.xml, copyholder.html, related.html, reviews.html, series.html, subject.html), fonts (fonts.html), CSS stylesheets (stylesheets.html), and book assets (assets.html). It is used as the input to an XSLT transformation to produce the EPUB book content found in the OEBPS directory. The XSLT stylesheets hebtei2fixepub.xsl (fixepub) and hebtei2flowepub.xsl (flowepub) are used for the transformation.
Step 2.10 - Copy Image Files / Determine Image Dimensions (flowepub only)
For flowepub only, all images referenced by the TEI XML file hebxxxxx.xxxx.xxx_tei.xml are copied to the images directory. Then all images found in the images directory, including cover images, are used as input for generating the images.html file.
Step 2.11 - Zip EPUB Structure
The hebxxxxx.xxxx.xxx.epub file is generated by zipping the contents of the epub directory.
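One detail worth keeping in mind when zipping: the EPUB container format requires the mimetype file to be the first entry in the archive and to be stored uncompressed. How the hebepub tooling handles this is not described here, so the following is only a sketch of the ordering step, with a hypothetical method name; writing the archive itself is left to whatever zip tool is used.

```ruby
# List the files under an unzipped EPUB directory in the order they should
# be added to the archive: "mimetype" first, everything else sorted after it.
def epub_entry_order(epub_dir)
  entries = Dir.glob("**/*", base: epub_dir).select do |rel|
    File.file?(File.join(epub_dir, rel))
  end
  raise ArgumentError, "mimetype file missing in #{epub_dir}" unless entries.include?("mimetype")

  ["mimetype"] + (entries - ["mimetype"]).sort
end
```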
Process 3 - EPUB Structure Validation
Any HEB EPUB directory can be validated by invoking the following commands:
cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub check layout/books/hebxxxxx.xxxx.xxx
where layout is either fixepub or flowepub. The results can be found in the file
ROOTDIR/layout/books/hebxxxxx.xxxx.xxx/epubcheck.xml
Process 4 - Generate Fulcrum Monograph Bundle Archive
To create a new monograph for a HEB EPUB, the hebxxxxx.xxxx.xxx.zip may be generated by invoking the following commands:
cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub bundle layout/books/hebxxxxx.xxxx.xxx
where layout is either fixepub or flowepub. Below are the process steps for generating the bundle. This process may invoke Process 2 if the hebxxxxx.xxxx.xxx.epub has not been previously generated.
The TEI XML file hebxxxxx.xxxx.xxx_tei.xml is used as the input to an XSLT transformation to produce the monograph metadata file hebxxxxx.xxxx.xxx_metadata.csv. The XSLT stylesheet hebtei2meta.xsl is used for the transformation.
The assets.html file is used to determine the assets that are to be uploaded to Fulcrum and listed on the monograph Media tab.
Once these files have been generated, the hebxxxxx.xxxx.xxx.zip bundle can be created.
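The bundle membership rules (optional cover, related.html and reviews.html only when not empty, plus the media assets) can be summarised in a short Ruby sketch. The method and its keyword arguments are hypothetical; in particular, how emptiness is decided is left to the caller here.

```ruby
# Return the file names that go into the Fulcrum bundle zip for a book.
# related_empty / reviews_empty are supplied by the caller; media_assets is
# the list of asset file names marked for the Media tab in assets.html.
def bundle_members(heb_id, related_empty:, reviews_empty:, has_large_cover: true, media_assets: [])
  members = ["#{heb_id}_metadata.csv", "#{heb_id}.epub"]
  members << "#{heb_id}-lg.jpg" if has_large_cover
  members << "related.html" unless related_empty
  members << "reviews.html" unless reviews_empty
  members + media_assets
end
```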
Process 5 - Add Asset Links
Links to existing Fulcrum Media asset pages may be added to an EPUB by performing the following steps:
- Download the asset CSV file from the Fulcrum monograph page.
- Rename the CSV file using the HEB ID, hebxxxxx.xxxx.xxx.csv, and store it in the ROOTDIR/resources/asset_links directory.
- Invoke the commands in the section Generate EPUB Archive to regenerate the EPUB archive.
- Replace or re-version the EPUB archive file asset on the Fulcrum monograph page.
Process 6 - Clobber Source Files
To re-generate either a HEB EPUB or Fulcrum bundle, invoke the following commands:
cd /quod-prep/prep/a/acls/hebepub
target/bin/hebepub clobber layout/books/hebxxxxx.xxxx.xxx
This removes the generated files so that the HEB book source directory can be rebuilt from scratch. All files are removed except the following:
- hebxxxxx.xxxx.xxx_dlxs.xml - this file may have been modified by hand to generate this book, so it is not removed.
- hebxxxxx.xxxx.xxx_dlxs_org.xml - if present, to preserve the original DLXS XML source.
For fixepub, the following is also preserved:
- images directory. Converting the original TIF files to PNG can be time consuming, so the page scans are not removed.
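The preservation rules can be expressed as a single predicate. The sketch below is an illustration with a hypothetical method name; it assumes paths are given relative to the book source directory and that the images directory is epub/OEBPS/images, as described in HEB Book Source Structure.

```ruby
# Decide whether a file survives the clobber production.
# relative_path is relative to the book source directory.
def preserved_on_clobber?(relative_path, heb_id, layout)
  name = File.basename(relative_path)
  # The DLXS source (possibly hand-edited) and its original are always kept.
  return true if ["#{heb_id}_dlxs.xml", "#{heb_id}_dlxs_org.xml"].include?(name)

  # For fixepub, the converted page scans are kept as well.
  layout == "fixepub" && relative_path.start_with?("epub/OEBPS/images/")
end
```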
Implementation
Below is a description of the ROOTDIR/target directory and the files contained within:
- bin - contains the bash shell script for invoking the generation process.
  - hebepub - script for invoking the epub/bundle generation process. The script iterates over the list of specified HEB directories and invokes a series of rake tasks on each. It also sets a few environment-specific variables (development or production) used by the rake tasks. See the script for more information.
- layouts - contains files specific to a layout.
  - fixepub - contains files specific to the fixed page layout.
    - styles - directory containing the CSS stylesheets to be included in the styles directory of a HEB EPUB archive.
  - flowepub - contains files specific to the flowable page layout.
- lib - contains files to support the above described scripts.
  - jars - directory contains the following Java jar files:
    - epubcheck-jar-with-dependencies.jar - epubcheck jar file from w3c/epubcheck GitHub. Provides support for the hebepub check production.
    - hebimg-jar-with-dependencies.jar - provides support for the hebepub convert production. Supports image formats other than JP2.
    - hebimgjp2-jar-with-dependencies.jar - provides support for the hebepub convert production. Supports image formats other than TIF.
    - hebxslt-jar-with-dependencies.jar - provides support for the hebepub epub and bundle productions. Invokes an XSLT 3.0 processor.
  - rake - directory contains the following Ruby and Rake task files:
    - rakefile - top level Rake task file that implements the process productions described in section Process Invocation.
    - acls.rake - Rake task file imported by rakefile that implements the dependencies for the ACLS provided files, including copyholder.html, related.html, reviews.html, series.html, and subject.html.
    - assets.rake - Rake task file imported by rakefile that implements the dependencies for the assets.html file.
    - fonts.rake - Rake task file imported by rakefile that implements the dependencies for the fonts.html file.
    - Gemfile - Gem file listing the required gems.
    - styles.rake - Rake task file imported by rakefile that implements the dependencies for the stylesheets.html file.
    - common.rb - sets the value of path variables that are shared by the rest of the Ruby and Rake task files, such as paths to the EPUB source files and the target directory files.
    - AssetListener.rb - Ruby class file implementing a listener used during parsing of the assets.html and links.html files.
    - EmptyListener.rb - Ruby class file implementing a listener used for parsing related.html and reviews.html to determine if they are empty. If so, these files are not included in the Fulcrum bundle hebxxxxx.xxxx.xxx.zip.
  - xsl - directory contains the following XSLT files:
    - hebdlxs2tei.xsl - transforms DLXS XML file hebxxxxx.xxxx.xxx_dlxs.xml to TEI XML file hebxxxxx.xxxx.xxx_tei.xml. See process Step 2.6 - Generate TEI XML File.
    - hebtei2fixepub.xsl - transforms TEI XML file hebxxxxx.xxxx.xxx_tei.xml to fixed layout HEB Book Source Structure. See process Step 2.9 - Generate EPUB Structure.
    - hebtei2flowepub.xsl - transforms TEI XML file hebxxxxx.xxxx.xxx_tei.xml to flowable layout HEB Book Source Structure. See process Step 2.9 - Generate EPUB Structure.
    - hebtei2meta.xsl - transforms TEI XML file hebxxxxx.xxxx.xxx_tei.xml to the Fulcrum monograph metadata CSV file hebxxxxx.xxxx.xxx_metadata.csv.
    - heblib.xsl - shared XSL file that sets constants used by the above XSLT files.
    - heblibtei.xsl - shared XSL file that sets variables used by the XSLT files hebtei2fixepub.xsl, hebtei2flowepub.xsl, and hebtei2meta.xsl.
    - hebtei2epub.xsl - shared XSL file that sets variables and defines templates used by the XSLT files hebtei2fixepub.xsl and hebtei2flowepub.xsl.