20130122 converting a mess of html docs into a pdf - plembo/onemoretech GitHub Wiki

title: Converting a mess of html docs into a pdf
link: https://onemoretech.wordpress.com/2013/01/22/converting-a-mess-of-html-docs-into-a-pdf/
author: lembobro
description:
post_id: 4124
created: 2013/01/22 14:00:59
created_gmt: 2013/01/22 18:00:59
comment_status: closed
post_name: converting-a-mess-of-html-docs-into-a-pdf
status: publish
post_type: post

Converting a mess of html docs into a pdf

I've got a mess of html page dumps from an Openfire XMPP server admin console that I want to combine into a single .pdf document for posterity. Here's what I did.

1. Captured the pages using the File... Save Page As... function in Firefox. After saving every page under every tab I had 60 .html pages in all (actually ".jsp.html"; Openfire is a Java application).

2. Made a list of those files, one file name per line, sorted by modification time from earliest to latest (for freshly saved pages that is the order in which they were captured):

ls -tr1 *.html >filelist.txt
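To see what that sort order does, here is a minimal sketch run in a scratch directory. The three file names and their timestamps are made up for illustration (chosen so that time order differs from alphabetical order), and `demo_dir` is just a placeholder variable.

```shell
#!/bin/bash
# Sketch only: shows the ordering `ls -tr1` produces.
# File names and timestamps below are invented for the demo.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# touch -t sets the modification time (format CCYYMMDDhhmm).
touch -t 201301220902 audit-policy.jsp.html
touch -t 201301220900 index.jsp.html
touch -t 201301220901 server-properties.jsp.html

# -t sorts by modification time (newest first), -r reverses that to
# oldest first, -1 prints one name per line.
ls -tr1 *.html > filelist.txt
cat filelist.txt
```

The list comes out in the order the files were touched in time, not in alphabetical order, which is what keeps the pages of the final document in capture order.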

3. Wrote a little shell script to convert the files to pdf using the htmldoc utility, calling it "htmldoc-pdf.sh":

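#!/bin/bash
# htmldoc-pdf.sh Converts saved web pages to pdf
# Created 1/22/2013 by P Lembo
# Each line ends with a backslash so the shell reads this as one command.
# The last backslash is deliberate: the file names go on the line after it.
/usr/bin/htmldoc \
--outfile output.pdf \
--webpage \
--landscape \
--format pdf \
--embedfonts \
--no-links \
--size letter \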

5. Removed the newlines from the file list and replaced them with spaces, so that all the names sit on a single line:

perl -pi -e 's/\n/ /g' filelist.txt
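The same join can be tried out without touching the real list. This is a rough sketch that swaps in `tr` for the perl one-liner and works on a throwaway copy; the variable names `list` and `joined` are placeholders.

```shell
#!/bin/bash
# Sketch only: joins a one-name-per-line list into a single line with tr
# instead of perl. Works on a temporary stand-in for filelist.txt.
list=$(mktemp)
printf '%s\n' index.jsp.html server-properties.jsp.html > "$list"

# Turn every newline into a space; the result keeps one trailing space.
joined=$(tr '\n' ' ' < "$list")

# Brackets make the trailing space visible.
printf '[%s]\n' "$joined"
```

Either way, the result only works as a command line if none of the file names contain spaces, since the shell will split the joined line on whitespace.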

6. Appended the list of file names to the script file:

cat filelist.txt >>htmldoc-pdf.sh

7. Edited the script so it looked something like this:

#!/bin/bash
/usr/bin/htmldoc \
--outfile output.pdf \
--webpage \
--landscape \
--format pdf \
--embedfonts \
--no-links \
--size letter \
index.jsp.html server-properties.jsp.html ...

8. Ran the script in the directory where all the .html files were located.

9. Opened the resulting file in my favorite pdf reader.

To be honest, the output I got was serviceable but contained some pages where text was cut off along the right-hand margin (one of the reasons I chose the "--landscape" option was to try to avoid that), so this is still a very experimental procedure.
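A possible shortcut for steps 5 through 7 is to have the script read filelist.txt itself instead of pasting the names into it. The sketch below is untested against the real page dumps and rests on a few assumptions: htmldoc is on the PATH, the list has one name per line, and the scratch directory, sample names and the `file_count` variable exist only for the demonstration. The htmldoc options are the same ones used above.

```shell
#!/bin/bash
# Sketch only: build the argument list from filelist.txt at run time.
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# Stand-in list; the second name is invented to show that spaces survive.
printf '%s\n' 'index.jsp.html' 'server properties.jsp.html' > filelist.txt

set --                          # start with an empty argument list
while IFS= read -r name; do
    set -- "$@" "$name"         # append one file name per line
done < filelist.txt

file_count=$#
echo "would convert $file_count files"

# Run the conversion only if htmldoc is installed and the first input exists
# (in this scratch directory it does not, so nothing is converted).
if command -v htmldoc >/dev/null 2>&1 && [ -e "$1" ]; then
    htmldoc --outfile output.pdf --webpage --landscape --format pdf \
            --embedfonts --no-links --size letter "$@"
fi
```

Reading the names into "$@" keeps each one as a single argument, so the list never has to be flattened or appended to the script, and names with spaces are not split.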

Copyright 2004-2019 Phil Lembo