CurrentStatus - smilingthax/unpdf GitHub Wiki
The library can currently read most in-the-wild PDFs.
- it lacks support for Xref-streams and object-streams
- sel/pdfselect is a simple example that extracts pages from PDF files
  (might produce invalid PDFs when used with AES-encrypted files...)
- rletest/rletest comes in very handy for testing the various filters
- pdfget/pdfget is a command-line tool to extract single objects from a PDF
Library dependencies:
- openssl
  for AES, RC4, MD5 and RAND. Can be easily replaced [...license], see pdfcrypt.h.
- zlib
- jpeg
- g4coder/lzwcoder (included)
Documentation:
- this page :-)
- source code (sorry, only few comments there; but quite readable)
  have a look at CodingConventions; you might also want to use vim with foldmethod=marker
- Adobe's PDF-Reference
The goals of the library are:
- low-level mapping of pdf-Objects to C++-Objects, but not necessarily one-to-one
  It is more important to have a well thought out interface than functionality.
- reading and writing of pdf-files
- speed
  If you just want a PDF-Output for your Application, this is not the right choice. "printf()-based" PDF-generation will be a lot faster for you, and there are already numerous libraries doing exactly that.
- No unexpected errors, especially no Segmentation faults
  Better one assert() slightly too strict (and then thinking again when it triggers) than assuming that code will do the right thing
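The last goal can be illustrated with a small sketch. The helper `object_offset` and its offset table are hypothetical and not taken from the unpdf sources; they only show what "one assert() slightly too strict" might look like in practice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of unpdf): look up the file offset of
// object number `num` in an already-parsed offset table.
long object_offset(const std::vector<long> &offsets, std::size_t num)
{
  // Without this check an out-of-range `num` would read past the vector,
  // which is exactly the kind of segfault candidate the goal wants to rule out.
  assert(num < offsets.size());

  // Possibly slightly too strict: no object can start at offset 0, because
  // a PDF file begins with its "%PDF-" header. If this ever triggers, that
  // is the moment to think again about what the caller passed in.
  assert(offsets[num] > 0);

  return offsets[num];
}
```

Note that assert() is compiled out when NDEBUG is defined, so checks like these only guard debug builds unless they are replaced by a custom macro.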
Future:
- The library does currently work for my purposes, so I probably won't spend much time coding on it in the next months
- Some parts are quite usable, while others are virtually non-existent
So if this library does make sense to you, you are welcome to join the project.
I will help with serious questions (meaning: you have read the source code) where documentation is lacking.
I'm also interested in ideas, discussions and contributions leading to good API design.
But actually implementing the stuff is up to you.
My Email: smilingthax googlemail.com
Possible next steps:
- come up with some general principles on how to treat the read path (PDF) and the write path (OutPDF) -- meaning:
  what should be unified, and what should be kept separate.
- pdfput that complements pdfget
  (probably done together with the preceding point)
- implement the remaining /Filter
  - JBIG2 (using jbig2dec / jbig2enc)
    (if you have too much time: reimplement jbig2enc in vanilla C and submit the code for inclusion in leptonlib)
  - JPX
  - Crypt
- class Image
  - some initial fragments are in rletest/main.c
  - you might first think about some use-cases and a sensible interface before you design the internal class structure
- rethink the error-handling: all these throw UsrError("xy is not right")
- fix throws through C code
  some Filters use output callbacks (e.g. writefunc_Output in pdffilter.cpp); when an exception is thrown there, it will not correctly pass through the C library code, so this has to be fixed, probably by catching it and rethrowing it when re-entering C++ code (if we want to preserve the error text)
- Documentation, e.g. doxygen
- Fonts
  one should probably make a stand-alone (no freetype/t1lib dependency, MIT license?) C library, possibly including a decent C++ wrapper, that can be used in any Application that needs Font-embedding and -subsetting when generating PS or PDF. This could be of great benefit for other Open Source Software. UPDATE: I've started one as part of my Google Summer of Code 2008 Project: It's in the opfc svn: fontembed
- "the other stuff":
  - autoconf (?)
  - automated Unit- and Conformance-testing
  - Homepage, "marketing", attract developers, packaging for distros
  - make handy tools: pdfselect, pdf... see ps...
  - build a pdf viewer or pdf editor based on this library :-)
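
For the "fix throws through C code" item above, one possible approach is sketched below: catch everything inside the callback, park the exception in a std::exception_ptr, signal failure to the C code through the return value, and rethrow once control is back in C++. This is only a sketch under assumptions. `fake_c_encode`, `Sink`, `CallbackState`, `write_trampoline` and `run_filter` are made-up names for illustration, not the actual writefunc_Output code in pdffilter.cpp, and the C encoder is a stand-in for a real C library.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

extern "C" {
typedef int (*c_write_cb)(void *user, const unsigned char *buf, std::size_t len);

// Stand-in for a C library routine (hypothetical): passes the data to the
// callback in 2-byte chunks and aborts on the first non-zero return value.
int fake_c_encode(const unsigned char *data, std::size_t len,
                  c_write_cb cb, void *user)
{
  for (std::size_t pos = 0; pos < len; pos += 2) {
    const std::size_t n = (len - pos < 2) ? (len - pos) : 2;
    const int rc = cb(user, data + pos, n);
    if (rc != 0) {
      return rc;
    }
  }
  return 0;
}
}

// C++ side: the real output function, which is allowed to throw.
using Sink = std::function<void(const unsigned char *, std::size_t)>;

struct CallbackState {
  const Sink *sink;
  std::exception_ptr error;  // holds the exception while C code unwinds
};

// The only function the C code ever calls; no exception may leave it.
extern "C" int write_trampoline(void *user, const unsigned char *buf,
                                std::size_t len)
{
  assert(user != nullptr);
  CallbackState *st = static_cast<CallbackState *>(user);
  try {
    (*st->sink)(buf, len);
    return 0;
  } catch (...) {
    st->error = std::current_exception();  // preserve type and error text
    return -1;                             // tell the C code to stop
  }
}

void run_filter(const std::string &in, const Sink &sink)
{
  CallbackState st{&sink, nullptr};
  const int rc = fake_c_encode(
      reinterpret_cast<const unsigned char *>(in.data()), in.size(),
      write_trampoline, &st);
  if (st.error) {
    std::rethrow_exception(st.error);  // back in C++: rethrow the original
  }
  if (rc != 0) {
    throw std::runtime_error("C encoder reported failure");
  }
}
```

A caller passes any throwing sink to `run_filter` and catches the original exception, with its message intact, on the C++ side. This relies on the C library stopping cleanly when the callback returns an error code, which has to be checked for each real library.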