
PDF Parsing Benchmarking (Simple) - May 2024

✏️️ Page Contributors: Khoi Tran Dang

🕛 Creation date: 24/06/2024

📥 Last Update: 01/07/2024

Context: there is a lack of comprehensive comparisons of up-to-date PDF parsing tools in terms of text extraction, table extraction, and image extraction.

TLDR: For text and image extraction, several tools are compared quantitatively on small annotated datasets of 5-8 PDFs. For table extraction, several tools are compared qualitatively over 3 PDFs without annotations. The tools are either rule-based parsers or RAG-oriented parsers.

Preliminary notes:

  • limited setup
  • might contain errors
  • results should be read and interpreted with caution
  • code as of 3 May 2024

Specifically, the method used may not be robust: it gives a view of the problem but is not exact. Many of the tools investigated return output in formats that are not perfectly comparable to the prepared ground truth.

Post-processing is applied for some tools, but it does not guarantee a perfect match between the tool output and the ground truth, so the measured scores may underestimate true performance.

Text Extraction Comparison

Dataset Used

The dataset for text extraction consisted of 8 PDFs, primarily scientific articles, and their corresponding ground-truth text files. These PDFs were selected to provide a range of formats and complexities typical in scientific literature. The ground-truth texts were accurately transcribed versions of the PDFs to ensure a reliable benchmark. The dataset is taken from [1].

Metrics Used

Four metrics (inspired by [73]) were used to evaluate the performance of the text extraction tools:

  1. Execution time: The time taken to process each PDF and extract the text.
  2. Levenshtein ratio: Measures the similarity between two strings by calculating the minimum number of single-character edits required to change one string into the other.
  3. Cosine similarity (Word Count): Calculates the cosine of the angle between two vectors of word counts, providing a measure of similarity between the extracted text and the ground truth.
  4. TF-IDF similarity: Evaluates the similarity between the extracted text and the ground truth based on term frequency-inverse document frequency (TF-IDF) values.
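
To make these metrics concrete, here is a minimal sketch of how the three similarity scores can be computed between an extracted text and its ground truth. The library choices (python-Levenshtein and scikit-learn) are assumptions for illustration; the original benchmark may rely on other helpers.

```python
# Minimal sketch of the three similarity metrics (library choices are assumptions:
# python-Levenshtein and scikit-learn; the original benchmark may use other helpers).
import Levenshtein
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def levenshtein_ratio(extracted: str, ground_truth: str) -> float:
    # Normalised Levenshtein similarity in [0, 1] (1.0 = identical strings).
    return Levenshtein.ratio(extracted, ground_truth)


def word_count_cosine(extracted: str, ground_truth: str) -> float:
    # Cosine similarity between raw word-count vectors of the two texts.
    counts = CountVectorizer().fit_transform([extracted, ground_truth])
    return float(cosine_similarity(counts[0], counts[1])[0, 0])


def tfidf_similarity(extracted: str, ground_truth: str) -> float:
    # Cosine similarity between TF-IDF vectors of the two texts.
    tfidf = TfidfVectorizer().fit_transform([extracted, ground_truth])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```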

Methodology

  1. Extraction: Each tool was used to extract text from the 8 PDFs.
  2. Post-processing: Post-processing functions were applied to ensure the extracted text was in a comparable format to the ground truth. Some processing functions were taken from [64] and [71].
  3. Comparison: The extracted text was compared to the ground-truth text using the four metrics mentioned above.
  4. Averaging: The scores for each metric were averaged across all 8 PDFs to provide a comprehensive evaluation of each tool’s performance.
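
As an illustration of steps 1-4, the condensed sketch below runs two of the tools (pymupdf and pypdf) over a folder of PDFs, times the extraction, and averages the scores. The directory layout and the simple whitespace normalisation are assumptions; the actual benchmark applies tool-specific post-processing. The similarity helpers from the previous sketch are reused.

```python
# Condensed sketch of the benchmark loop (directory layout and normalisation are
# assumptions; the real benchmark uses tool-specific post-processing functions).
import time
from pathlib import Path

import fitz  # pymupdf
from pypdf import PdfReader


def extract_pymupdf(pdf_path: Path) -> str:
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_pypdf(pdf_path: Path) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)


def normalise(text: str) -> str:
    # Simplified post-processing: collapse all whitespace.
    return " ".join(text.split())


def benchmark(pdf_dir: Path, gt_dir: Path) -> dict:
    tools = {"pymupdf": extract_pymupdf, "pypdf": extract_pypdf}
    scores = {name: [] for name in tools}
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # One ground-truth .txt file per PDF is assumed.
        ground_truth = normalise((gt_dir / f"{pdf_path.stem}.txt").read_text())
        for name, extract in tools.items():
            start = time.perf_counter()
            raw = extract(pdf_path)
            elapsed = time.perf_counter() - start
            text = normalise(raw)
            scores[name].append(
                {
                    "time": elapsed,
                    "levenshtein": levenshtein_ratio(text, ground_truth),
                    "tfidf": tfidf_similarity(text, ground_truth),
                    "cosine": word_count_cosine(text, ground_truth),
                }
            )
    # Average each metric across all PDFs.
    return {
        name: {k: sum(run[k] for run in runs) / len(runs) for k in runs[0]}
        for name, runs in scores.items()
    }
```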

Results

The results of the text extraction comparison are shown in the table below. Tools are compared based on execution time, Levenshtein ratio, cosine similarity, and TF-IDF similarity.

Comparison of different tools for the text extraction task (execution time in seconds per PDF). The two fastest tools are pypdfium and pymupdf, while the two slowest are llamaparse and llmsherpa.

| Extraction Tool | Execution Time (s) | Levenshtein Sim Score | TF-IDF Sim Score | Cosine Sim Score |
| --- | --- | --- | --- | --- |
| llamaparse* | 65.40 | 0.84 | 0.93 | 0.99 |
| llmsherpa* | 9.14 | 0.93 | 0.94 | 1.00 |
| pdfminer.six | 1.89 | 0.92 | 0.86 | 0.98 |
| pdfplumber | 3.09 | 0.80 | 0.76 | 0.95 |
| pymupdf | 0.05 | 0.97 | 0.92 | 0.99 |
| pypdf | 1.05 | 0.97 | 0.96 | 1.00 |
| pypdfium | 0.04 | 0.98 | 0.97 | 1.00 |
| unstructured* | 2.82 | 0.89 | 0.86 | 0.98 |

Tools marked with * should be interpreted with caution because their performance is underestimated due to the format mismatch between their output and the ground truth. LlamaParse's execution time depends heavily on the server side and may vary from 4 seconds to 100 seconds.

The code and the benchmark method are inspired by the sources cited above. As for helper code:

  • post-processing functions for text extraction (pdftotext and pdfium) are taken from [1],
  • functions to compute TF-IDF similarity and cosine similarity of two strings are taken from [2],
  • functions to combine different blocks into a .txt file for llmsherpa are taken from [3].

Image Extraction Comparison

Dataset Used

The dataset for image extraction consisted of 5 PDFs (also from [64]), primarily scientific articles.

Annotations

Because some libraries only extract images without providing their positions within the PDF, the initial approach was not to annotate bounding boxes around the images but to capture the images themselves. These images may appear alone (without captions), with captions, or as combinations of multiple side-by-side subfigures.

Thus, the annotations of images within the PDFs are labelled in 3 ways:

  • Type 1: Close capture (subfigure level / figure level)
  • Type 2: Capture including figure caption (figure level)
  • Type 3: Capture of combinations of subfigures (subfigure level / figure level)

together with:

  • a .txt file giving the number of 'unit' (type 1) images to extract, and
  • a .csv file detailing the weight (how many unit images it contains) associated with each image of type 2 or type 3.

example-benchmark-annotation-image

On the left, an example of the original PDF; on the right, the three types of annotations in .png format.

Metrics Used

The performance of the image extraction tools was evaluated using four metrics:

  1. Execution time: Time taken to process each PDF and save the extracted images.
  2. Accuracy: Proportion of correctly extracted images compared to the total number of images to extract.
  3. Precision: Proportion of correct images out of all extracted images.
  4. Critical error: Number of times the tool failed due to execution errors or returned 0 images.
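
As a toy illustration, the scores for a single PDF can be derived from the matching counts as follows; the function and variable names are illustrative, not taken from the benchmark code.

```python
# Toy scoring sketch (names are illustrative, not from the benchmark code).
def score_image_extraction(n_correct: int, n_ground_truth: int, n_extracted: int) -> dict:
    return {
        # Accuracy: correct extractions over the number of images to extract.
        "accuracy": n_correct / n_ground_truth if n_ground_truth else 0.0,
        # Precision: correct extractions over everything the tool extracted.
        "precision": n_correct / n_extracted if n_extracted else 0.0,
        # Critical error: the tool failed earlier or returned zero images.
        "critical_error": n_extracted == 0,
    }
```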

Workflow

  • Each tool is run to extract images and saves the extracted images locally.
  • The list of extracted images is compared to the list of ground-truth images.
    • Each extracted image is compared to each ground-truth image.
      • If they are similar, it counts as a correct extraction.
      • If not, it is skipped.
        Similarity is based on the Hamming distance between the average hashes of the 8x8 monochrome thumbnails of the extracted image and the ground-truth image (see the sketch after this list).
        If the difference is < 8 bits, the images are considered similar.
        If not, they are considered different.
        Alternative hashing methods already tested include:
        • pHash, which uses the DCT to evaluate frequency patterns
        • dHash, which evaluates gradients
        • wHash, which uses the DWT
        • crop-resistant hash (with dHash)
    • Images are preprocessed before the similarity comparison.
      • If an extracted image contains large white space, it is cropped to the region of interest (ROI) only
        • using OpenCV transformations + contour detection
      • Irrelevant images (too small, logos, …) are not included in the comparison:
        • too small in terms of absolute file size (< 10 kB)
        • too small in terms of height and width
        • too large a ratio between height and width
        • logo images detected by zero-shot CLIP classification
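
A minimal sketch of the matching step, assuming the imagehash and Pillow packages, is given below. imagehash.average_hash produces the 8x8 monochrome-thumbnail hash described above, and subtracting two hashes yields their Hamming distance in bits. The ROI cropping and irrelevant-image filtering steps are omitted for brevity.

```python
# Minimal sketch of the image matching step (imagehash + Pillow are assumptions;
# ROI cropping and irrelevant-image filtering are omitted for brevity).
from pathlib import Path

import imagehash
from PIL import Image

HAMMING_THRESHOLD = 8  # images differing by fewer than 8 bits count as similar


def is_similar(extracted: Path, ground_truth: Path) -> bool:
    # average_hash builds an 8x8 monochrome thumbnail and hashes it;
    # subtracting two hashes returns their Hamming distance in bits.
    return (
        imagehash.average_hash(Image.open(extracted))
        - imagehash.average_hash(Image.open(ground_truth))
    ) < HAMMING_THRESHOLD


def count_correct(extracted_dir: Path, ground_truth_dir: Path) -> int:
    # An extracted image counts as correct if it matches any ground-truth image.
    ground_truths = list(ground_truth_dir.glob("*.png"))
    return sum(
        any(is_similar(img, gt) for gt in ground_truths)
        for img in extracted_dir.glob("*.png")
    )
```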

Results

Comparison of different tools for image extraction task, in terms of execution time

LlamaParse's execution time depends heavily on the server side and may vary from 4 seconds to 100 seconds. Unstructured.io performs text, table, and image extraction in the same run.

Comparison of different tools for image extraction task, in terms of accuracy and precision

Table Extraction Comparison

Dataset Used

The dataset for table extraction included 3 PDFs of varying lengths and content types:

  • A scientific article with schemas, tables, and figures spanning 12 pages.
  • A technical document of 46 pages containing numerous schemas and few tables.
  • A project report document of 81 pages, featuring both schemas and tables.

List of Tools Compared

The tools compared in this study were: table-transformers [tableTransformer], table-structure-recognition [table_structure_reg], tabula-py [tabulaPy], camelot [camelot], img2table [img2table] (with tesseract-ocr), LlamaParse [llamaparse], and Unstructured [Unstructured]. It is noteworthy that Unstructured (open-source version) and LlamaParse (API service) are emerging tools in the RAG community for parsing unstructured data such as PDFs.

Methodology

The tools were evaluated based on their support for table detection/localization (TD), table structure recognition (TSR), and table-to-HTML/data frame conversion. Each tool was used to process all pages in the PDFs, and the detected table areas were visualized using bounding boxes or table joints. If a tool supported further functionalities like table-to-HTML or data frame conversion, these were also displayed.

The comparison was conducted manually by visualizing each page of each PDF. No quantitative metrics were used for the comparison of the results. However, due to the small dataset size, it was manageable to capture and compare errors in detected tables across all tools.
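
As an example of how a single tool was inspected, the sketch below runs camelot over one document, prints its parsing reports, and visualises a detected table area. The file name is a placeholder, and the other tools were run following their own quick-start guides; this is not the exact inspection script used in the benchmark.

```python
# Illustrative sketch for one tool (camelot); "report.pdf" is a placeholder.
import camelot

tables = camelot.read_pdf("report.pdf", pages="all", flavor="lattice")
for table in tables:
    print(table.page, table.parsing_report)  # page number and parsing report
    print(table.df.head())                   # detected table as a pandas DataFrame

# Visual check of a detected table area on its page (requires matplotlib).
if tables.n > 0:
    camelot.plot(tables[0], kind="contour").show()
```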

Results Analysis

Subjective analyses, based on a visual comparison of each tool's results on the three PDFs studied, are given in the table below. For implementation details, the basic guidelines of each tool were followed whenever possible. No specific table format was assumed for detection (techniques that exploit a known table format were not applied).

| Tool | Performance Analysis |
| --- | --- |
| table-transformers | Exhibited poor detection of column headers and sometimes identified white space as tables. Misidentified the table of contents, which is not a desired table to detect. Overall performance was not satisfactory. |
| table-structure-recognition | Performed very poorly in the evaluation. Detection and extraction of tables were not accurate, resulting in generally subpar performance. |
| tabula-py | Showed very poor precision, especially with lists, and poor recall (for tables in the project report). Misidentified lists as tables, leading to incorrect column header extraction and large tables with empty values. Text extraction was also poor, and it did not work well with borderless tables. |
| camelot | Demonstrated good precision and recall, except for one instance of misidentification. Column header and text extraction were good. Performance decreased without the lattice option for implicit rows, and borderless tables were not supported. Provided position and page information but had a long inference time (15 seconds to 2 minutes per PDF). |
| img2table (with tesseract-ocr) | Showed good precision and recall. Sometimes failed to properly capture or separate column headers. Text extraction was good. Supported implicit rows and borderless tables, though this could reduce precision. Provided position and page information with a normal inference time of about 6 seconds per PDF. |
| LlamaParse | Without parsing instructions, misidentified schemas, histograms, or complicated graphs as tables. With custom parsing instructions, showed good precision and recall. Only failed in cases where the table spans multiple pages or has unusual formatting. |
| Unstructured | Showed quite good performance overall. |

← Previous: Pdf-Parsing

Next: S3_Data-Collection →
