Pdf Parsing Benchmark - trankhoidang/RAG-wiki GitHub Wiki
✏️️ Page Contributors: Khoi Tran Dang
🕛 Creation date: 24/06/2024
📥 Last Update: 01/07/2024
Context: there is a lack of comprehensive comparisons of up-to-date PDF parsing tools in terms of text extraction, table extraction, and image extraction.
TLDR: For text and image extraction, several tools are compared quantitatively on a small annotated dataset of 5-8 PDFs. For table extraction, several tools are compared qualitatively over 3 PDFs without annotations. The tools are either rule-based parsers or RAG-oriented parsers.
Pre-words:
- limited setup
- might contain errors
- to read and interpret with caution
- code from 3 May 2024
Specifically, the method used may not be robust: it gives a view on the problem but is not exact. This is because many of the tools investigated return output in different formats, which are not perfectly comparable to the prepared ground truth.
Also, post-processing is applied for some tools, but it does not guarantee a perfect match between the measured performance and the ground truth.
The dataset for text extraction consisted of 8 PDFs, primarily scientific articles, and their corresponding ground-truth text files. These PDFs were selected to provide a range of formats and complexities typical in scientific literature. The ground-truth texts were accurately transcribed versions of the PDFs to ensure a reliable benchmark. The dataset is taken from [1].
Four metrics (inspired by [73]) were used to evaluate the performance of the text extraction tools (a code sketch of the similarity computations follows this list):
- Execution time: The time taken to process each PDF and extract the text.
- Levenshtein ratio: Measures the similarity between two strings by calculating the minimum number of single-character edits required to change one string into the other.
- Cosine similarity (Word Count): Calculates the cosine of the angle between two vectors of word counts, providing a measure of similarity between the extracted text and the ground truth.
- TF-IDF similarity: Evaluates the similarity between the extracted text and the ground truth based on term frequency-inverse document frequency (TF-IDF) values.
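To make the metrics concrete, here is a minimal sketch of the three similarity scores, assuming the `Levenshtein` (python-Levenshtein) and `scikit-learn` packages; it is an illustration, not the exact benchmark code.

```python
# A minimal sketch of the three similarity metrics, assuming the `Levenshtein`
# (python-Levenshtein) and `scikit-learn` packages; not the exact benchmark code.
import Levenshtein
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def levenshtein_ratio(extracted: str, ground_truth: str) -> float:
    # Normalized similarity derived from the edit distance (1.0 = identical strings).
    return Levenshtein.ratio(extracted, ground_truth)


def word_count_cosine(extracted: str, ground_truth: str) -> float:
    # Cosine of the angle between the two raw word-count vectors.
    counts = CountVectorizer().fit_transform([extracted, ground_truth])
    return float(cosine_similarity(counts[0], counts[1])[0, 0])


def tfidf_similarity(extracted: str, ground_truth: str) -> float:
    # Cosine similarity between the two TF-IDF vectors.
    tfidf = TfidfVectorizer().fit_transform([extracted, ground_truth])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```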
The evaluation procedure was as follows (a simplified loop is sketched after this list):
- Extraction: Each tool was used to extract text from the 8 PDFs.
- Post-processing: Post-processing functions were applied to bring the extracted text into a format comparable to the ground truth. Some processing functions were taken from [64] and [71].
- Comparison: The extracted text was compared to the ground-truth text using the four metrics mentioned above.
- Averaging: The scores for each metric were averaged across all 8 PDFs to give an overall evaluation of each tool's performance.
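A simplified version of this loop for a single tool (pymupdf here) might look as follows; the folder layout and the post-processing step are illustrative assumptions, not the exact benchmark code.

```python
# A minimal sketch of the benchmark loop for one tool (here pymupdf); the paths
# and the post-processing step are illustrative, not the exact benchmark code.
import time
from pathlib import Path

import fitz  # pymupdf
import Levenshtein


def postprocess(text: str) -> str:
    # Placeholder normalization; the benchmark applies tool-specific cleanups.
    return " ".join(text.split())


times, lev_scores = [], []
for pdf_path in sorted(Path("pdfs").glob("*.pdf")):
    ground_truth = postprocess(Path("ground_truth", pdf_path.stem + ".txt").read_text())

    start = time.perf_counter()
    with fitz.open(pdf_path) as doc:
        extracted = "\n".join(page.get_text() for page in doc)
    times.append(time.perf_counter() - start)

    extracted = postprocess(extracted)
    lev_scores.append(Levenshtein.ratio(extracted, ground_truth))
    # The TF-IDF and word-count cosine similarities (see the previous sketch)
    # are computed on the same pair of strings.

print("avg time per PDF:", sum(times) / len(times))
print("avg Levenshtein ratio:", sum(lev_scores) / len(lev_scores))
```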
The results of the text extraction comparison are shown in the table below. Tools are compared based on execution time, Levenshtein ratio, cosine similarity, and TF-IDF similarity.
Text extraction comparison of different parsers (execution time in seconds per PDF); the two best-performing tools per metric are highlighted in green, while the two worst-performing are highlighted in red.
Extraction Tool | Execution Time (s) | Levenshtein Sim Score | TF-IDF Sim Score | Cosine Sim Score |
---|---|---|---|---|
llamaparse* | 65.40 | 0.84 | 0.93 | 0.99 |
llmsherpa* | 9.14 | 0.93 | 0.94 | 1.00 |
pdfminer.six | 1.89 | 0.92 | 0.86 | 0.98 |
pdfplumber | 3.09 | 0.80 | 0.76 | 0.95 |
pymupdf | 0.05 | 0.97 | 0.92 | 0.99 |
pypdf | 1.05 | 0.97 | 0.96 | 1.00 |
pypdfium | 0.04 | 0.98 | 0.97 | 1.00 |
unstructured* | 2.82 | 0.89 | 0.86 | 0.98 |
Tools marked with * should be interpreted with caution because their performance is under-estimated due to format mismatches between their output and the ground truth. LlamaParse's execution time depends heavily on the server side and may vary from 4 seconds to 100 seconds.
The code and the benchmark method are inspired by:
- [1] py-pdf/benchmarks: Benchmarking PDF libraries (github.com)
- [2] Comparing 4 methods for pdf text extraction in python | by Jeanna Schoonmaker | Social Impact Analytics | Medium
- [3] LinkTime-Corp/llm-in-containers: Run LLM-related tools in containers. (github.com)
As for helper code:
- post-processing functions for text extraction (pdftotext and pdfium) are from [1].
- functions to compute the TF-IDF similarity and cosine similarity of two strings are from [2].
- functions to combine the different blocks returned by llmsherpa into a .txt file are from [3].
The dataset for image extraction consisted of 5 PDFs (also from [64]), primarily scientific articles. Because some libraries only extract images without providing their positions within the PDF, the initial approach was not to annotate bounding boxes around the images but to capture the images themselves. These images may appear alone (without captions), with captions, or as combinations of multiple side-by-side subfigures.
Thus, the images within the PDFs are annotated in 3 ways:
- Type 1: Close capture (subfigure level / figure level)
- Type 2: Capture including figure caption (figure level)
- Type 3: Capture of combinations of subfigures (subfigure level / figure level)
along with:
- A .txt file giving the number of all 'unit' (type 1) images to extract
- A .csv file detailing the weight (how many unit images it contains) associated with each image of type 2 or type 3.
On the left, an example of the original PDF; on the right, the three types of annotations in .png format.
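For illustration, here is a minimal sketch of reading such annotation files; the file names and the CSV column names (`image`, `weight`) are hypothetical, not the actual annotation format.

```python
# A minimal sketch of reading the annotation files; the file names and the CSV
# column names ("image", "weight") are hypothetical, not the actual format.
import csv
from pathlib import Path

# Total number of "unit" (type 1) images to extract, stored as a single integer.
n_units = int(Path("annotations/units.txt").read_text().strip())

# Weight of each type 2 / type 3 capture: how many unit images it contains.
weights = {}
with open("annotations/weights.csv", newline="") as f:
    for row in csv.DictReader(f):
        weights[row["image"]] = int(row["weight"])

print(n_units, weights)
```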
The performance of the image extraction tools was evaluated using four metrics:
- Execution time: Time taken to process each PDF and save the extracted images.
- Accuracy: Proportion of correctly extracted images compared to the total number of images to extract.
- Precision: Proportion of correct images out of all extracted images.
- Critical error: Number of times the tool failed due to execution errors or returned 0 images.
- Each tool is run to extract images and save the extracted images locally
- The list of extracted images is compared to the list of ground-truth images
- Each extracted image is compared to each ground-truth image (see the sketch after this list):
  - If similar, it counts as a correct extraction
  - If not, it is skipped
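A minimal sketch of the extraction and matching steps for one tool (pymupdf here) is shown below; the paths are illustrative, the type 2/3 weights are ignored for simplicity, and the `similar` callable is the hash-based test sketched further down.

```python
# A minimal sketch of extraction + matching for one tool (here pymupdf).
# Paths are illustrative and the type 2/3 weights are ignored for simplicity.
from pathlib import Path
from typing import Callable

import fitz  # pymupdf
from PIL import Image


def extract_images_pymupdf(pdf_path: Path, out_dir: Path) -> list[Path]:
    # Save every embedded image of the PDF to out_dir and return the file paths.
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img_info in page.get_images(full=True):
                xref = img_info[0]
                img = doc.extract_image(xref)
                path = out_dir / f"p{page.number}_x{xref}.{img['ext']}"
                path.write_bytes(img["image"])
                saved.append(path)
    return saved


def score_tool(extracted: list[Path], ground_truth_dir: Path,
               similar: Callable[[Image.Image, Image.Image], bool]) -> tuple[float, float]:
    # Greedy one-to-one matching of extracted images against ground-truth images.
    truths = [Image.open(p) for p in sorted(ground_truth_dir.glob("*.png"))]
    matched, correct = set(), 0
    for path in extracted:
        img = Image.open(path)
        for i, truth in enumerate(truths):
            if i not in matched and similar(img, truth):
                matched.add(i)
                correct += 1
                break
    accuracy = correct / len(truths) if truths else 0.0          # vs. images to extract
    precision = correct / len(extracted) if extracted else 0.0   # vs. images extracted
    return accuracy, precision
```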
Image similarity is based on the Hamming distance between the average hashes of 8x8 monochrome thumbnails of the extracted image and the ground-truth image (a sketch is given below):
- If the difference is < 8 bits, the images are considered similar
- Otherwise, they are considered different
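A minimal sketch of this similarity test, assuming the `imagehash` package:

```python
# Average hash of an 8x8 grayscale thumbnail, compared by Hamming distance.
import imagehash
from PIL import Image


def is_similar(img_a: Image.Image, img_b: Image.Image, threshold: int = 8) -> bool:
    # average_hash converts to grayscale, resizes to hash_size x hash_size and
    # thresholds on the mean; subtracting two hashes gives the Hamming distance in bits.
    return imagehash.average_hash(img_a, hash_size=8) - imagehash.average_hash(img_b, hash_size=8) < threshold
```

The alternatives listed below are available in the same package (`imagehash.phash`, `imagehash.dhash`, `imagehash.whash`, `imagehash.crop_resistant_hash`).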
Alternative hashing methods already tested include:
- pHash, which uses a DCT to evaluate frequency patterns
- dHash, which evaluates gradients
- wHash, which uses a DWT
- crop-resistant hash (with dHash)
- Preprocessing of images before the similarity comparison (a sketch is given after this list):
  - If an extracted image contains large white space, it is cropped to the region of interest only
    - using OpenCV transformations + contour detection
  - If an image is irrelevant (too small, a logo, …), it is not included in the comparison:
    - too small in terms of absolute file size (< 10 kB)
    - too small in terms of height and width
    - too large a ratio between height and width
    - logo images detected by zero-shot CLIP classification
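A minimal sketch of the cropping and size-based filtering, assuming OpenCV and PIL; the thresholds are illustrative, and the CLIP-based logo detection is not shown.

```python
# Crop an extracted image to its non-white content via OpenCV contour detection,
# and drop images that are too small or too elongated. Thresholds are illustrative.
import cv2
import numpy as np
from PIL import Image


def crop_to_content(img: Image.Image) -> Image.Image:
    # Mask out the near-white background, then keep the bounding box of all contours.
    gray = cv2.cvtColor(np.array(img.convert("RGB")), cv2.COLOR_RGB2GRAY)
    _, mask = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return img
    x, y, w, h = cv2.boundingRect(np.vstack(contours))
    return img.crop((x, y, x + w, y + h))


def is_relevant(img: Image.Image, min_side: int = 32, max_ratio: float = 5.0) -> bool:
    # Drop tiny images and extreme aspect ratios (likely icons, rules or decorations).
    w, h = img.size
    return min(w, h) >= min_side and max(w, h) / max(min(w, h), 1) <= max_ratio
```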
LlamaParse's execution time depends heavily on the server side and may vary from 4 seconds to 100 seconds. Unstructured.io performs text, table, and image extraction at the same time.
The dataset for table extraction included 3 PDFs of varying lengths and content types:
- A scientific article with schemas, tables, and figures spanning 12 pages.
- A technical document of 46 pages containing numerous schemas and few tables.
- A project report document of 81 pages, featuring both schemas and tables.
The tools compared in this study were: table-transformers [tableTransformer], table-structure-recognition [table_structure_reg], tabula-py [tabulaPy], camelot [camelot], img2table [img2table] (with tesseract-ocr), LlamaParse [llamaparse], and Unstructured [Unstructured]. It is noteworthy that Unstructured (open-source version) and LlamaParse (API service) are emerging tools in the RAG community for parsing unstructured data such as PDFs.
The tools were evaluated based on their support for table detection/localization (TD), table structure recognition (TSR), and table-to-HTML/data frame conversion. Each tool was used to process all pages in the PDFs, and the detected table areas were visualized using bounding boxes or table joints. If a tool supported further functionalities like table-to-HTML or data frame conversion, these were also displayed.
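As an illustration of this kind of tool-level support, here is a minimal sketch of table detection and DataFrame conversion with camelot; the file name and options are illustrative, not the benchmark configuration.

```python
# A minimal sketch of table detection and DataFrame conversion with camelot;
# the file name and options are illustrative, not the benchmark configuration.
import camelot

# "lattice" targets tables with ruling lines; "stream" would be used otherwise.
tables = camelot.read_pdf("report.pdf", pages="all", flavor="lattice")

for table in tables:
    print(table.parsing_report)   # page, order, accuracy, whitespace diagnostics
    print(table.df.head())        # recognized table structure as a pandas DataFrame

# Visualize the detected table area of the first table (requires matplotlib).
camelot.plot(tables[0], kind="contour").show()
```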
The comparison was conducted manually by visualizing each page of each PDF. No quantitative metrics were used for the comparison of the results. However, due to the small dataset size, it was manageable to capture and compare errors in detected tables across all tools.
Subjective analyses based on the visual comparison of each tool's results on the three PDFs studied are given in the table below. For implementation, the basic guidelines of each tool were followed whenever possible. No specific table format was assumed for detection (techniques that exploit prior knowledge of the table format were not applied).
Tool | Performance Analysis |
---|---|
table-transformers | Exhibited poor detection of column headers and sometimes identified white space as tables. Misidentified the table of contents, which is not a desired table to detect. Overall performance was not satisfactory. |
table-structure-recognition | Performed very poorly in the evaluation. Detection and extraction of tables were not accurate, resulting in generally subpar performance. |
tabula-py | Showed very poor precision, especially with lists, and poor recall (for tables in the project report). Misidentified lists as tables, leading to incorrect column-header extraction and large tables with empty values. Text extraction was also poor, and it did not work well with borderless tables. |
camelot | Demonstrated good precision and recall, except for one instance of misidentification. Column header and text extraction were good. Performance decreased without the lattice option for implicit rows and did not support borderless tables. Provided position and page information but had a long inference time (15 seconds to 2 minutes per PDF). |
image2table with tesseract-ocr | Showed good precision and recall. Sometimes failed to properly capture or separate column headers. Text extraction was good. Supported implicit rows and borderless tables, though this could reduce precision. Provided position and page information with a normal inference time of about 6 seconds per PDF. |
LlamaParse | Without parsing instructions, it misidentified schemas, histograms, or complicated graphs as tables. With custom parsing instructions, it showed good precision and recall, failing only where a table spanned multiple pages or had unusual formatting. |
Unstructured | Showed quite good performance overall. |