PDF analysis tools - fredzannarbor/pagekicker-community GitHub Wiki

PageKicker includes several tools for extended analysis of PDF documents and collections of documents.

montageur.sh

montageur is a preliminary effort at creating a visual "gist" of a PDF document. The program extracts all the images from a PDF and discards those smaller than 1000 bytes (these are usually typographic elements such as bullets and dingbats). It then creates a) a poster-size montage of all remaining images and b) a smaller graphic containing the top "n" images in descending order of image size, which is a crude (but actually pretty effective) proxy for relevance.
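The size filter and top-"n" selection described above can be sketched in a few lines of bash. This is only an illustration under stated assumptions, not the code in montageur.sh: the function name `top_n_images` and its argument order are hypothetical, and it assumes the extracted images already sit in one directory with file names that contain no tabs or newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from montageur.sh.
# Usage: top_n_images DIR N [MIN_BYTES]
# Prints the paths of the N largest regular files in DIR, largest first,
# skipping any file smaller than MIN_BYTES (default 1000).
top_n_images() {
  local dir="$1" n="$2" min_bytes="${3:-1000}"
  local f size
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    # Arithmetic expansion strips any padding that wc may emit.
    size=$(( $(wc -c < "$f") ))
    # Discard very small files (bullets, dingbats and similar).
    [ "$size" -ge "$min_bytes" ] || continue
    printf '%d\t%s\n' "$size" "$f"
  done | sort -rn -k1,1 | head -n "$n" | cut -f2-
}
```

For example, `top_n_images ./images 5` would list the five largest images of at least 1000 bytes in a hypothetical `./images` directory. That list could then be handed to a montage step.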

Location: scripts/bin/montageur.sh

Requires: pdfimages, fdupes, pdftk, imagemagick

Syntax:

montageur.sh --pdfinfile --maximages --outfile --montageurdir --tmpdir --environment --passuuid

pdfinfile must be supplied explicitly; all other options have reasonable defaults.

Outputs:

  • all unique JPEGs from the PDF, plus a zip archive of them, in montageurdir
  • montage.jpg
  • montagetopn.jpg

decimator.sh

Decimator is a script that runs against any PDF (a lengthy one is assumed) and creates a 10-slide presentation-style deck with a mix of images and text. The goal is to reduce any PDF, however lengthy and technical, to a simple deck that can be paged through in 5 minutes or less. The output is very rough, primarily because I have not solved the issues involved in generating pretty slides. I tried using LaTeX, but its defaults are heavily tilted towards portrait layout with wide margins, which is pretty much the opposite of what's needed here. We could render the text in ImageMagick, but that requires a lot of time-consuming tweaking that I have not yet been able to prioritize.

Decimator accepts either a local PDF file or a URL to a PDF file as input.

Location: scripts/bin/decimator.sh

Requires: imagemagick, pdftk, pdftotext, ebook-convert, Cmdflesh.jar

Syntax:

decimator.sh --pdfinfile --pdfurl --reporttitle --tldr --outdir --passuuid

Either pdfinfile or pdfurl is required.

tldr is a user-provided nutshell summary of the PDF that appears on a slide by itself.
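The rule that either pdfinfile or pdfurl must be supplied can be sketched as a small bash check. This is an assumption-laden illustration rather than the logic in decimator.sh: the function name `resolve_pdf_source` is hypothetical, and preferring the local file when both values are given is a choice made here, not something the script is known to do.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from decimator.sh.
# Usage: resolve_pdf_source PDFINFILE PDFURL
# Prints "file" or "url" and returns 0 when a usable input is given;
# prints an error to stderr and returns 1 otherwise.
resolve_pdf_source() {
  local pdfinfile="$1" pdfurl="$2"
  # Assumption: a local file takes precedence when both are supplied.
  if [ -n "$pdfinfile" ]; then
    if [ -f "$pdfinfile" ]; then
      echo "file"
      return 0
    fi
    echo "error: no such file: $pdfinfile" >&2
    return 1
  fi
  if [ -n "$pdfurl" ]; then
    echo "url"
    return 0
  fi
  echo "error: either pdfinfile or pdfurl is required" >&2
  return 1
}
```

A caller could branch on the printed value, for instance downloading the PDF first when the result is `url` and reading it directly when the result is `file`.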