Use - smalot/pdfparser GitHub Wiki
This sample parses the entire PDF file and extracts the text of all pages at once.
<?php
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();
echo $text;
?>
You can also extract the text page by page, or from one specific page.
<?php
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
// Retrieve all pages from the pdf file.
$pages = $pdf->getPages();
// Loop over each page to extract text.
foreach ($pages as $page) {
    echo $page->getText();
}
?>
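The loop above visits every page. To read one specific page, a minimal sketch could look like the following. It assumes that getPages() returns a plain list of page objects in document order; getPageText() and its 1-based page number are hypothetical names introduced here for illustration and are not part of the library.

```php
<?php
/**
 * Return the text of a single page, or null if that page does not exist.
 * Hypothetical helper: $pageNumber is 1-based, as a reader would count pages.
 *
 * @param array $pages Page objects, e.g. the result of $pdf->getPages()
 */
function getPageText(array $pages, int $pageNumber): ?string
{
    // Re-index defensively so that position 0 is always the first page.
    $pages = array_values($pages);
    $index = $pageNumber - 1;

    if (!isset($pages[$index])) {
        return null;
    }

    return $pages[$index]->getText();
}

// Usage, assuming $pdf was built with $parser->parseFile('document.pdf'):
// echo getPageText($pdf->getPages(), 2) ?? 'No such page';
```

Returning null for a missing page keeps the caller in control of the fallback, instead of triggering an undefined-index warning.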
Here is sample code to extract metadata from the document (Author, Creator, CreationDate, ...).
<?php
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
// Retrieve all details from the pdf file.
$details = $pdf->getDetails();
// Loop over each property to extract values (string or array).
foreach ($details as $property => $value) {
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo $property . ' => ' . $value . "\n";
}
?>
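When only one property is needed, the same string-or-array handling can be wrapped in a small helper. This is a sketch under the assumption that getDetails() returns an associative array keyed by property name, as the loop above suggests; detailAsString() is a hypothetical name, not a library function.

```php
<?php
/**
 * Read one metadata property as a string.
 * Hypothetical helper: array values are joined with ', ' and a default
 * is returned when the property is missing.
 */
function detailAsString(array $details, string $property, string $default = ''): string
{
    if (!array_key_exists($property, $details)) {
        return $default;
    }

    $value = $details[$property];

    // Same rule as the loop above: flatten arrays, cast everything else.
    return is_array($value) ? implode(', ', $value) : (string) $value;
}

// Usage, assuming $pdf was built with $parser->parseFile('document.pdf'):
// echo detailAsString($pdf->getDetails(), 'Author', 'unknown');
```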
Note: The demo also uses the nl2br() function. It converts line breaks in the extracted text to HTML line breaks, so the output keeps a line layout similar to the one in the PDF file.
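As an illustration of the note above, here is one possible way to prepare extracted text for an HTML page. The helper name textToHtml() is hypothetical, and escaping with htmlspecialchars() is an addition made here as a precaution; the source only mentions nl2br().

```php
<?php
/**
 * Prepare plain text for display in an HTML page.
 * Hypothetical helper: escapes HTML special characters first, then
 * inserts <br /> before each newline so the line layout is kept.
 */
function textToHtml(string $text): string
{
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

// Usage, assuming $text holds the result of $pdf->getText():
// echo textToHtml($text);
```

Escaping before calling nl2br() matters: done the other way round, the inserted line-break tags would be escaped as well.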
Config
To change the behavior of the parser, you can pass a Config object as the second argument of the Parser constructor.
$config = new \Smalot\PdfParser\Config();
$config->setFontSpaceLimit(-60);
$config->setRetainImageContent(false);
$config->setHorizontalOffset('');
$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();
Option | Type | Default | Description |
---|---|---|---|
setFontSpaceLimit | Integer | -50 | Changing the font space limit can help when getText() returns text with too many spaces. |
setHorizontalOffset | String | | When words are broken up or the structure of a table is not preserved, you can adjust setHorizontalOffset. |
setPdfWhitespaces | String | \0\t\n\f\r | |
setPdfWhitespacesRegex | String | [\0\t\n\f\r ] | |
setRetainImageContent | Boolean | true | |
setDecodeMemoryLimit | Integer | 0 | If parsing fails because of memory exhaustion, you can use this option. |
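For the memory exhaustion case, a configuration along these lines may be a reasonable starting point. It is a sketch, not a recommendation from the library: the Composer autoloader path is an assumption, the limit is assumed to be expressed in bytes, and the 100 MiB value is an arbitrary example.

```php
<?php
// Assumption: the library was installed with Composer.
require 'vendor/autoload.php';

$config = new \Smalot\PdfParser\Config();

// Discard image content instead of keeping it (the default is true).
$config->setRetainImageContent(false);

// Cap the memory used while decoding. Assumed to be in bytes;
// 100 MiB is an arbitrary example value, and the default is 0.
$config->setDecodeMemoryLimit(100 * 1024 * 1024);

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');

echo $pdf->getText();
```

Both settings trade completeness for memory, so it is worth checking that the extracted text still contains everything you need.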