PdfDocument - UglyToad/PdfPig GitHub Wiki

Namespace - UglyToad.PdfPig

The PdfDocument class provides all root functionality for consuming document content.

To create an instance of a PdfDocument you must first call PdfDocument.Open. There are 3 overloads for opening a document:

PdfDocument Open(byte[] fileBytes, ParsingOptions options = null);

This opens a document from an array of bytes representing a PDF document.

PdfDocument Open(string filePath, ParsingOptions options = null);

This opens a document from the filesystem at the provided path. This will load the entire file into memory at once. The alternative is to use the 3rd overload:

PdfDocument Open(Stream stream, ParsingOptions options = null);

This opens a document from a stream of any kind, such as a MemoryStream or FileStream. Note that if the stream is not buffered (e.g. a network stream), parsing will be much slower. One workaround is to wrap the stream in a BufferedStream, a framework class that adds a buffering layer over another stream.
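
As a minimal sketch of the buffering workaround, the source stream can be wrapped before it is passed to Open. The file path is a placeholder, and File.OpenRead only stands in for whatever stream your application actually holds:

```csharp
using System;
using System.IO;
using UglyToad.PdfPig;

public static class BufferedOpenExample
{
    public static void Main()
    {
        // Placeholder path; substitute your own source stream here.
        using (Stream source = File.OpenRead(@"C:\docs\test.pdf"))
        // BufferedStream adds an in-memory buffer in front of the source stream.
        using (var buffered = new BufferedStream(source))
        using (PdfDocument document = PdfDocument.Open(buffered))
        {
            Console.WriteLine(document.NumberOfPages);
        }
    }
}
```

The nested using statements dispose the document first and the underlying streams afterwards.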

Any call to Open should be wrapped in a using statement since PdfDocument implements IDisposable:

using (PdfDocument document = PdfDocument.Open(@"C:\docs\test.pdf"))
{
}

Parsing Options

Parsing options control aspects of how the document is opened and allow the consumer to provide their own logger. The defaults should be sufficient unless the document is password protected, in which case the password must be provided in the ParsingOptions.Password property.

UseLenientParsing controls how strictly the library interprets the PDF specification and how much error recovery it attempts where the document format is invalid or corrupt. The default is to attempt lenient parsing but a stricter parsing mode can be enabled by passing the static ParsingOptions.LenientParsingOff instance.
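
A sketch of passing custom options to Open might look like the following. The path and password are hypothetical, and the example assumes that Password and UseLenientParsing can be set through an object initializer:

```csharp
using System;
using UglyToad.PdfPig;

public static class ParsingOptionsExample
{
    public static void Main()
    {
        var options = new ParsingOptions
        {
            // Hypothetical password; only needed for protected documents.
            Password = "example-password",
            // Request strict parsing instead of the lenient default.
            UseLenientParsing = false
        };

        // Hypothetical path to a password-protected document.
        using (PdfDocument document = PdfDocument.Open(@"C:\docs\protected.pdf", options))
        {
            Console.WriteLine(document.NumberOfPages);
        }
    }
}
```

If only strict parsing is needed and no password, passing ParsingOptions.LenientParsingOff avoids constructing a new instance.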

Pages

Once a PdfDocument has been obtained by calling Open the main use case is to inspect the pages that the document contains.

The total number of pages in the document is provided by:

int numberOfPages = document.NumberOfPages;

Individual pages may then be opened using GetPage. This takes a 1-indexed page number as an argument:

using UglyToad.PdfPig.Content;
// ...
Page page1 = document.GetPage(1);
Page page2 = document.GetPage(2);
// etc.

Calling GetPage(i) with a value of i <= 0, or with a value greater than NumberOfPages, is invalid.
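
Because page numbers start at 1, a loop over every page by index runs from 1 up to and including NumberOfPages. A short sketch, using a placeholder path:

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class PageLoopExample
{
    public static void Main()
    {
        // Placeholder path for illustration.
        using (PdfDocument document = PdfDocument.Open(@"C:\docs\test.pdf"))
        {
            // Start at 1, not 0: GetPage takes a 1-indexed page number.
            for (int i = 1; i <= document.NumberOfPages; i++)
            {
                Page page = document.GetPage(i);
                Console.WriteLine($"Loaded page {page.Number}");
            }
        }
    }
}
```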

You can also enumerate all pages in a document in order using:

using System.Collections.Generic;
using UglyToad.PdfPig.Content;
// ...
IEnumerable<Page> pages = document.GetPages();
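
As an illustration, the enumeration can be consumed with a foreach loop. This sketch prints the length of each page's text via the Text property of Page; the path is a placeholder:

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class EnumeratePagesExample
{
    public static void Main()
    {
        // Placeholder path for illustration.
        using (PdfDocument document = PdfDocument.Open(@"C:\docs\test.pdf"))
        {
            // Pages are yielded in document order.
            foreach (Page page in document.GetPages())
            {
                Console.WriteLine($"Page {page.Number}: {page.Text.Length} characters");
            }
        }
    }
}
```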

XMP Metadata

A PDF document can include general information about the document at the top level in the XML format defined by the Extensible Metadata Platform (XMP).

If this optional XML data is present it may be obtained using the TryGetXmpMetadata method:

using System.Xml.Linq;
using UglyToad.PdfPig.Content;
// ...
if (document.TryGetXmpMetadata(out XmpMetadata metadata))
{
	XDocument xmpDocument = metadata.GetXDocument();
}
else
{
	// No XMP metadata was present.
}
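
Once the XDocument is available, it can be queried with the standard LINQ to XML API. The sketch below lists the distinct element names found in the metadata. Which elements appear depends entirely on the document, so nothing here assumes a particular XMP schema, and the path is a placeholder:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class XmpExample
{
    public static void Main()
    {
        // Placeholder path for illustration.
        using (PdfDocument document = PdfDocument.Open(@"C:\docs\test.pdf"))
        {
            if (!document.TryGetXmpMetadata(out XmpMetadata metadata))
            {
                Console.WriteLine("No XMP metadata was present.");
                return;
            }

            XDocument xmpDocument = metadata.GetXDocument();

            // Print each distinct element name (without namespace) once.
            foreach (string name in xmpDocument.Descendants()
                         .Select(e => e.Name.LocalName)
                         .Distinct())
            {
                Console.WriteLine(name);
            }
        }
    }
}
```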

Document Information

In addition to XMP metadata, which allows for an extensible range of metadata, a PDF document may optionally contain an information dictionary. This defines a fixed set of fields such as author and title.

This can be accessed through the Information property:

using UglyToad.PdfPig.Content;
// ...
DocumentInformation information = document.Information;
string title = information.Title;
string author = information.Author;
// etc.

Since all fields on the information dictionary are optional, they can be null and should be checked prior to access, e.g.:

DocumentInformation information = document.Information;
if (information.Author != null) 
{
	string upperAuthor = information.Author.ToUpper();
}
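
An alternative to explicit null checks is to use the C# null-conditional and null-coalescing operators to supply fallbacks. A minimal sketch, where the fallback strings and the path are arbitrary placeholders:

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class InformationExample
{
    public static void Main()
    {
        // Placeholder path for illustration.
        using (PdfDocument document = PdfDocument.Open(@"C:\docs\test.pdf"))
        {
            DocumentInformation information = document.Information;

            // ?? substitutes a fallback when the field is null.
            string title = information.Title ?? "(no title)";

            // ?. skips the method call when Author is null.
            string upperAuthor = information.Author?.ToUpperInvariant() ?? "(no author)";

            Console.WriteLine($"{title} by {upperAuthor}");
        }
    }
}
```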

Version

There are multiple versions of the PDF specification, numbered 1.1, 1.2, 1.3, etc. The version number of the current document can be retrieved with the Version property:

decimal version = document.Version;

IsEncrypted

Documents can be encrypted using a number of different algorithms defined by the PDF specification. The IsEncrypted flag indicates whether a document is encrypted.

Structure

The Structure property of a document provides access to the underlying PDF tokens that are used to construct the document.

This is intended for advanced users and requires familiarity with the PDF specification.