Scrubbing HTML from epub files - Paperight/website GitHub Wiki

To use the automated HTML-PDF converter on paperight.com/server, you first need to extract the HTML from epubs provided by publishers or sourced from Gutenberg.

Extracting HTML from epub:

  1. Open the .epub file in Sigil. Delete the wrapper file. Merge the separate HTML files with the “Merge with previous” function, which you can find by right-clicking the second document in the HTML list. Save the changes and close Sigil.
  2. Right-click the .epub file to rename it, and change the file extension to .zip.
  3. Use “Extract all” on the zipped folder.
  4. In the extracted folder, find the HTML file in the text folder. Rename it according to the file naming convention (isbn_title_date), and move it from the extracted folder to your working folder.
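
Steps 2 to 4 can also be scripted, since an epub is a zip archive. The following is a minimal Python sketch, not part of the Paperight tooling: the function name `extract_html`, the accepted file suffixes, and the example file name are assumptions for illustration. It also assumes the files were already merged in Sigil, so that one content file remains.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath

HTML_SUFFIXES = (".html", ".xhtml", ".htm")  # assumed set of content suffixes


def extract_html(epub_path, work_dir, new_name):
    """Copy the HTML file out of an epub into work_dir, saved as new_name."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # An epub is a zip archive, so it can be opened without renaming it to .zip.
    with zipfile.ZipFile(epub_path) as archive:
        html_members = [
            name for name in archive.namelist()
            if name.lower().endswith(HTML_SUFFIXES)
        ]
        # Prefer files that sit in a folder called "Text" (any letter case).
        in_text = [
            name for name in html_members
            if "text" in (part.lower() for part in PurePosixPath(name).parts[:-1])
        ]
        candidates = sorted(in_text or html_members)
        if not candidates:
            raise ValueError(f"no HTML file found in {epub_path}")

        # Hypothetical simplification: take the first candidate only.
        target = work_dir / new_name
        with archive.open(candidates[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

A call might look like `extract_html("book.epub", "working", "0000000000000_title_20140107.html")`, where the ISBN, title, and date are placeholders to be replaced with real values following the isbn_title_date convention.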

Basic scrub:

  1. Remove licensing information from the beginning and end of the HTML.
  2. Remove <meta content> tags at the beginning of the text that refer to Gutenberg or to the Gutenberg epub CSS.
  3. Remove <style> tags that may conflict with the Paperight CSS file (these should be above the <body> tag), unless they provide semantic meaning.
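
Parts of the basic scrub can be sketched in Python. This is a rough illustration under stated assumptions, not an official tool: `basic_scrub` is a hypothetical name, the regular expressions assume reasonably tidy markup, and the sketch removes every <style> block, so any styles that carry semantic meaning must be reviewed and restored by hand. Licensing text still has to be removed manually.

```python
import re

META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)


def basic_scrub(html):
    """Drop <meta> tags that mention Gutenberg, and all <style> blocks."""

    def drop_gutenberg_meta(match):
        tag = match.group(0)
        # Keep any <meta> tag that does not mention Gutenberg.
        return "" if "gutenberg" in tag.lower() else tag

    html = META_RE.sub(drop_gutenberg_meta, html)
    # Assumption: no <style> block needs to be kept; check this by hand.
    html = STYLE_RE.sub("", html)
    return html
```

Regular expressions are used here only to keep the sketch short; a proper HTML parser would be more robust on irregular markup.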

Body scrub:

  1. Make sure that you've uploaded the cover image to the jacket_images folder on the media.paperight.com FTP (the public FTP).

If your HTML includes a <front> element between </head> and <body>, you must convert it to a <div class="front"> and move it inside the <body> element.

As the first element in <body>, inside <div class="front">, link to the cover image: place the <img> in a <div> and give that <div> the class="cover".

e.g.

<div class="front">
  <div class="cover">
    <img alt="The Tragedy of Mariam cover" src="http://media.paperight.com/content/jacket_images/the-tragedy-of-mariam_cary_cover_20140107.jpg"/>
  </div>
</div><!--.front-->

  1. In <body>, directly after the opening <body> tag and the <div class="front"> element, wrap all the remaining content in a <div> with the class paperight-ed-content (the CSS uses this for page numbering). NB: Don't forget to close the <div> at the end of the document (see below).

e.g.

<body>
<div class="paperight-ed-content">
..........
</div><!--.paperight-ed-content-->
</body>
</html>
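
The cover block and the paperight-ed-content wrapper can be added in one pass. The Python sketch below is illustrative only: `wrap_body` and its parameters are hypothetical names, and it assumes the document has a single <body> element that does not yet contain a <div class="front">.

```python
import re
from html import escape

BODY_RE = re.compile(r"(<body\b[^>]*>)(.*)(</body\s*>)", re.IGNORECASE | re.DOTALL)


def wrap_body(html, cover_url, cover_alt):
    """Insert the front/cover block and wrap the rest of the body content."""
    front = (
        '<div class="front">\n'
        '<div class="cover">\n'
        f'<img alt="{escape(cover_alt)}" src="{escape(cover_url)}"/>\n'
        '</div>\n'
        '</div><!--.front-->\n'
    )

    def rebuild(match):
        open_tag, content, close_tag = match.groups()
        # The front block comes first; everything else goes inside the wrapper.
        return (
            f"{open_tag}\n{front}"
            f'<div class="paperight-ed-content">\n{content.strip()}\n'
            f"</div><!--.paperight-ed-content-->\n{close_tag}"
        )

    result, count = BODY_RE.subn(rebuild, html, count=1)
    if count == 0:
        raise ValueError("no <body> element found")
    return result
```

The replacement is built in a function rather than a replacement string so that backslashes or group references in the book text are not misread by `re`.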

  1. Edit tags:

Title: <h1 class="title">Jacob's Room</h1>

Author: <p class="author">Virginia Woolf</p>

Chapter: <h2 class="chapter-number">Chapter 1</h2>

Chapter title: <p class="chapter-title"></p>

Bodytext-first: <p class="bodytext-first"></p>

Bodytext: <p></p>

  1. Remove <hr /> elements.

  2. Run the HTML through the W3C validator: http://validator.w3.org/check
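
Before sending the file to the validator, the last mechanical steps can be checked locally. This Python sketch is an assumption-laden convenience, not a replacement for the W3C validator: `remove_hr`, `missing_classes`, and the `EXPECTED_CLASSES` tuple are hypothetical names, and the class list is simply taken from the tags and wrapper described above.

```python
import re

HR_RE = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)
CLASS_RE = re.compile(r'class="([^"]*)"')

# Class names taken from the steps above; adjust per book (assumption).
EXPECTED_CLASSES = (
    "title",
    "author",
    "chapter-number",
    "bodytext-first",
    "paperight-ed-content",
)


def remove_hr(html):
    """Remove <hr>, <hr/> and <hr ... /> elements."""
    return HR_RE.sub("", html)


def missing_classes(html):
    """Return the expected class names that never occur in a class attribute."""
    found = set()
    for value in CLASS_RE.findall(html):
        found.update(value.split())
    return [name for name in EXPECTED_CLASSES if name not in found]
```

An empty list from `missing_classes` only shows that each class appears at least once; it does not confirm that every chapter or paragraph is tagged correctly.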

Footnotes:

To create a footnote, use a <span> with the class "fn". Place the footnote text inline, directly after the point in the body text that the footnote refers to.

e.g.

<p> Footnotes<span class="fn">A footnote is a note placed at the bottom of a page of a book or manuscript that comments on or cites a reference for a designated part of the text.</span> are essential in printed documents and Prince knows how to generate them. Most readers will read the footnotes before they read the text from where the footnotes are anchored<span class="fn">Often, the most interesting information is found in the footnotes.</span>. </p>
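
When footnotes are added by script, a small helper keeps the markup consistent. The Python sketch below is hypothetical (the name `footnote` is not from the Paperight tooling) and assumes the footnote is plain text, which it escapes so that characters such as & and < stay valid in the HTML.

```python
from html import escape


def footnote(text):
    """Return an inline footnote span for plain-text footnote content."""
    # quote=False leaves quotation marks readable; they are safe in element text.
    return f'<span class="fn">{escape(text, quote=False)}</span>'


sentence = (
    "<p>Footnotes"
    + footnote("A note placed at the bottom of a page.")
    + " are essential in printed documents.</p>"
)
```

The span is placed directly after the word it annotates, with no space before it, matching the example above.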
