Use - smalot/pdfparser GitHub Wiki

This sample parses the whole PDF file and extracts the text of all its pages at once.

<?php
 
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('document.pdf');
 
$text = $pdf->getText();
echo $text;
 
?>

You can also extract the text page by page, or for one specific page only.

<?php
  
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('document.pdf');
 
// Retrieve all pages from the pdf file.
$pages  = $pdf->getPages();
 
// Loop over each page to extract text.
foreach ($pages as $page) {
    echo $page->getText();
}
 
?>
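
The loop above prints every page. To read a single page, you can index into the array returned by getPages(). The following is a minimal sketch: getPageText() is a hypothetical helper (not part of the library), and it assumes that getPages() returns a sequentially indexed array starting at 0.

```php
<?php

/**
 * Return the text of one page, or null if that page does not exist.
 *
 * Hypothetical helper, not part of smalot/pdfparser.
 * $pages is the array returned by $pdf->getPages();
 * $pageNumber is 1-based (a convention of this helper only).
 */
function getPageText(array $pages, int $pageNumber): ?string
{
    // Assumption: the pages array is zero-indexed.
    $index = $pageNumber - 1;

    return isset($pages[$index]) ? $pages[$index]->getText() : null;
}

// Usage, assuming Composer's autoloader is loaded and document.pdf exists:
// $parser = new \Smalot\PdfParser\Parser();
// $pdf    = $parser->parseFile('document.pdf');
// echo getPageText($pdf->getPages(), 2) ?? 'No such page';
```

Returning null for a missing page lets the caller decide how to handle documents that are shorter than expected.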

Here is sample code to extract metadata from the document (Author, Creator, CreationDate, ...).

<?php
 
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('document.pdf');
 
// Retrieve all details from the pdf file.
$details  = $pdf->getDetails();
 
// Loop over each property to extract values (string or array).
foreach ($details as $property => $value) {
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo $property . ' => ' . $value . "\n";
}
 
?>
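
If you only need one property, you can read it directly from the details array instead of looping. The sketch below uses a hypothetical helper, getDetail(), and a hand-written sample array standing in for the result of $pdf->getDetails(); which keys are actually present depends on the PDF file.

```php
<?php

/**
 * Return one metadata property as a string, or a default if it is missing.
 *
 * Hypothetical helper, not part of smalot/pdfparser.
 * Array values are joined with ', ' as in the loop above.
 */
function getDetail(array $details, string $property, string $default = ''): string
{
    if (!array_key_exists($property, $details)) {
        return $default;
    }

    $value = $details[$property];

    return is_array($value) ? implode(', ', $value) : (string) $value;
}

// Sample data standing in for $pdf->getDetails().
$details = [
    'Author'   => 'Jane Doe',
    'Keywords' => ['pdf', 'parser'],
];

echo getDetail($details, 'Author') . "\n";
echo getDetail($details, 'Keywords') . "\n";
echo getDetail($details, 'Title', '(no title)') . "\n";
```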

Note: The demo also passes the extracted text through PHP's nl2br function. When the text is displayed as HTML, this keeps the line layout close to that of the PDF file.
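
As an illustration of the note above, here is a minimal sketch of preparing extracted text for an HTML page. The helper name textToHtml() is hypothetical, and the sample string stands in for the result of $pdf->getText(). Escaping with htmlspecialchars before calling nl2br is an addition of this sketch, not something the demo is known to do.

```php
<?php

/**
 * Prepare extracted text for display in an HTML page.
 *
 * Hypothetical helper: escapes HTML special characters first,
 * then inserts <br /> tags before each line break.
 */
function textToHtml(string $text): string
{
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

// Sample string standing in for $pdf->getText().
echo textToHtml("First line\nSecond line & more");
```

Escaping first matters because text taken from a PDF file may contain characters such as < or & that a browser would otherwise interpret as markup.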

Config

To manipulate the behavior of the parser, you can pass a Config object to the constructor of the parser.


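<?php

// Create a configuration object and adjust the options you need.
$config = new \Smalot\PdfParser\Config();
$config->setFontSpaceLimit(-60);
$config->setRetainImageContent(false);
$config->setHorizontalOffset('');

// Pass the configuration as the second constructor argument.
$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf    = $parser->parseFile('document.pdf');

$text = $pdf->getText();

?>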

| Option | Type | Default | Description |
|---|---|---|---|
| setFontSpaceLimit | Integer | -50 | Changing the font space limit can be helpful when getText() returns text with too many spaces. |
| setHorizontalOffset | String | | Can be adjusted when words are broken up or when the structure of a table is not preserved. |
| setPdfWhitespaces | String | \0\t\n\f\r | |
| setPdfWhitespacesRegex | String | [\0\t\n\f\r ] | |
| setRetainImageContent | Boolean | true | |
| setDecodeMemoryLimit | Integer | 0 | Useful when parsing fails because of memory exhaustion. |
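
The memory-related options listed above can be combined when a large document exhausts PHP's memory. This is a sketch under stated assumptions rather than a recommended setup: it assumes a Composer installation, that setDecodeMemoryLimit expects a value in bytes, and that disabling setRetainImageContent reduces memory use by not keeping image data. The limit of 1000000 is an arbitrary illustrative value.

```php
<?php

// Assumption: the library was installed with Composer.
require 'vendor/autoload.php';

$config = new \Smalot\PdfParser\Config();

// Assumption: image data is not kept when this is false, which saves memory.
$config->setRetainImageContent(false);

// Assumption: the limit is given in bytes; 1000000 is only an example value.
$config->setDecodeMemoryLimit(1000000);

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf    = $parser->parseFile('document.pdf');

echo $pdf->getText();
```

Try these options one at a time first, so you can tell which one actually resolves the failure for your document.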