Pdf Parsing - trankhoidang/RAG-wiki GitHub Wiki
✏️️ Page Contributors: Khoi Tran Dang
🕛 Creation date: 24/06/2024
📥 Last Update: 01/07/2024
PDFs can contain many kinds of content, such as text, images, tables, and metadata. All of it must be extracted accurately and deliberately (for example, deciding whether or not to preserve the layout) to build a robust knowledge base.
However, PDF parsing is challenging because of the format's visual-based layout and unordered storage, and because a parser must detect and extract many components: paragraphs, titles, headings, tables, images, captions, and metadata. The difficulty is compounded by the diversity of PDFs, including scanned documents, multi-column layouts, messy formatting, and varied font encoding systems.
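
To make the "unordered storage" and multi-column problems concrete, here is a minimal sketch of restoring reading order from bounding boxes. It assumes a parser has already produced text blocks with coordinates (origin at the top-left of the page), that columns have equal width, and that no block spans several columns. The `Block` class and `reading_order` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its bounding box in page coordinates (origin top-left)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks, page_width, n_columns=2):
    """Sort blocks column by column, then top to bottom within each column.

    Assumption: equal-width columns and no block spanning several columns.
    """
    column_width = page_width / n_columns

    def sort_key(block):
        center_x = (block.x0 + block.x1) / 2
        # Clamp so a block touching the right edge stays in the last column.
        column = min(int(center_x // column_width), n_columns - 1)
        return (column, block.y0, block.x0)

    return sorted(blocks, key=sort_key)


# Blocks in the arbitrary order they might be stored in the file.
stored = [
    Block(310, 100, 560, 200, "right column, first paragraph"),
    Block(40, 220, 290, 320, "left column, second paragraph"),
    Block(40, 100, 290, 200, "left column, first paragraph"),
    Block(310, 220, 560, 320, "right column, second paragraph"),
]

ordered_text = [b.text for b in reading_order(stored, page_width=600)]
```

Real documents break these assumptions quickly (full-width titles, figures, uneven columns), which is one reason dedicated layout analysis exists.
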
Method | Description | Pros and Cons | Example tools |
---|---|---|---|
Rule-based | Utilizes hard-coded rules or templates to parse PDFs. | Pros: Fast. Cons: Struggles with varied PDF layouts. | Camelot, Pdfplumber, pdfminer.six, pdftotext, pikepdf, PyMuPDF, PyPDF, pypdfium2, Tabula, textract |
OCR-free small model-based | Uses transformer-based architectures instead of Optical Character Recognition (OCR). | Pros: Avoids OCR inaccuracies. Cons: Can be computationally intensive. | Donut, Nougat, Dessurt |
Multimodal LLM | Utilizes Multimodal Large Language Models (MLLMs) with prompt engineering and fine-tuning. | Pros: High accuracy with fine-tuning. Cons: Requires extensive training data. | TextMonkey, Llavar, GPT-4V |
Pipeline-based | Uses different approaches for each sub-task, including preprocessing, layout analysis, and structure recognition. | Pros: More flexible and can handle complex documents. Cons: Requires more resources and time. | Unstructured, Marker, LayoutParser |
Third-party APIs | Examples include APIs that provide parsing services. | Pros: Easy to use and often provide high accuracy. Cons: Can be expensive and dependent on third-party services, with potential privacy and security concerns. | Adobe Extract API, LlamaParse, Amazon Textract |
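
As an illustration of what "hard-coded rules" means in the rule-based row, the sketch below labels text spans from their font size alone. It is not taken from any of the listed tools: the `Span` class, the `classify` function, and both thresholds are hypothetical values chosen for one imagined document template.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text with the font size reported by a PDF library."""
    text: str
    font_size: float


# Hard-coded thresholds (hypothetical values tuned for one document template).
TITLE_MIN_SIZE = 20.0
HEADING_MIN_SIZE = 14.0


def classify(span):
    """Label a span as 'title', 'heading', or 'paragraph' from its font size."""
    if span.font_size >= TITLE_MIN_SIZE:
        return "title"
    if span.font_size >= HEADING_MIN_SIZE:
        return "heading"
    return "paragraph"


spans = [
    Span("PDF Parsing", 24.0),
    Span("Overview of methods", 16.0),
    Span("Rule-based parsers rely on fixed rules.", 11.0),
]
labels = [classify(s) for s in spans]

# A document that uses 13 pt headings defeats the same rules.
other_layout_label = classify(Span("Overview of methods", 13.0))
```

The rules are fast and transparent, but the last line shows the weakness noted in the table: a different layout is silently misclassified as a paragraph.
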
- Advanced RAG 02: Unveiling PDF Parsing | by Florian June | Towards AI
- Demystifying PDF Parsing 01: Overview | by Florian June | May, 2024 | Generative AI
- RAG Pipeline Pitfalls: The Untold Challenges of Embedding Table | by Ryan Nguyen | Towards AI
- Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition | by ChatDOC | Medium
- run-llama/llama_parse: Parse files for optimal RAG
- Unstructured 0.13.0 documentation
Below is a study of various PDF parsing tools based on multiple criteria: the last maintenance date, GitHub stars, license, open-source status, availability of code, underlying technology, supported PDF types (all, scientific, or image-based), supported input and output formats, and capabilities in extracting text, tables, images, equations, and metadata.
Tools that did not offer Python usage/binding or were not recently maintained (older than 2 years) were excluded from the study. Additionally, direct OCR tools already included in some pipeline-based parsers and document layout analysis tools with pre-trained models were excluded due to their large inference times.
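
The maintenance criterion can be expressed as a small filter. The following is only a sketch: it assumes age labels in the style used in the comparison table ("7 months", "1 day"), converts them with rough day counts, and does not model the Python usage/binding criterion. The function names are hypothetical.

```python
import re

# Approximate length of each unit in days (rough conversion, for filtering only).
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}
MAX_AGE_DAYS = 2 * 365  # tools "older than 2 years" are excluded


def age_in_days(label):
    """Convert a label such as '7 months' or '1 day' to an approximate day count."""
    match = re.match(r"\s*(\d+)\s+(day|week|month|year)s?\b", label)
    if match is None:
        raise ValueError(f"unrecognized age label: {label!r}")
    count, unit = int(match.group(1)), match.group(2)
    return count * DAYS_PER_UNIT[unit]


def recently_maintained(tools):
    """Keep the names of tools whose last maintenance is at most two years old."""
    return [name for name, label in tools if age_in_days(label) <= MAX_AGE_DAYS]


# A few (tool, last maintained) pairs taken from the comparison table.
sample = [
    ("camelot", "7 months"),
    ("CERMINE", "6 years"),
    ("PyMuPDF", "1 day"),
    ("CascadeTabNet", "3 years"),
    ("tabula-py", "1 month"),
]
kept = recently_maintained(sample)
```
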
Last maintained | Tool | Link | GitHub stars | License | Open-source | Technology | PDF types | Input | Output | Text extraction | Table extraction | Image extraction |
---|---|---|---|---|---|---|---|---|---|---|---|---|
- | Adobe Extract API | Link here | - | - | No | Adobe Sensei AI Framework | | | JSON, XLSX | Yes | Yes | Yes |
7 months | Apache PDFBox | Link here | 2400 | Apache-2.0 | Yes | - | - | - | - | - | - | |
2 days | Apache Tika | Link here | 2100 | Apache-2.0 | Yes | Apache PDFBox | | | | Yes | Limited | Limited |
1 month | borb | Link here | 3300 | | Yes | Unclear (could not find the related parser code) | | | | Yes | No | Support but No |
7 months | camelot | Link here | 2600 | MIT | Yes | PDFMiner (Stream), OpenCV (Lattice) | No scanned | | CSV, DataFrame, JSON, MD, HTML, SQLite | Limited | Yes++ | No |
3 years | CascadeTabNet | Link here | 1400 | MIT | Yes | Cascade mask R-CNN HRNet | image-based | - | - | - | | |
2 years | CDeCNet | Link here | 131 | MIT | Yes | Mask R-CNN, cascade | - | - | - | - | | |
6 years | CERMINE | Link here | 479 | AGPL-3.0 | Yes | - | - | - | - | | | |
1 day | DiT | Link here | - | MIT | Yes | Mask R-CNN, cascade, Vision Transformer, detectron2 | image-based | PDF, IMAGE | bounding box | No | Yes (detection) | Yes (detection) |
1 day | docTR | Link here | 3000 | Apache-2.0 | Yes | an OCR tool | image-based | - | - | - | - | |
7 months | DocumentLayoutAnalysis | Link here | 518 | - | Yes | - | - | - | - | | | |
7 months | EasyOCR | Link here | 21900 | Apache-2.0 | Yes | an OCR tool | - | - | - | - | | |
1 day | GROBID | Link here | 3100 | Apache-2.0 | Yes | CRF, RNN, Transformers, pdfalto | Academic | | TEI XML | Yes | Yes | Yes |
2 months | img2table | Link here | 368 | MIT | Yes | OpenCV, OCR | | IMG, PDF | DataFrame | No | Yes | No |
7 months | LlamaParse | Link here | 762 | MIT | Yes (API) | intended for RAG | | | JSON, MD, TXT, PNG | Yes | Yes | Yes |
1 month | llmsherpa | Link here | 916 | MIT | Yes (API) | intended for RAG | | DOCX, PPTX, HTML, TXT, XML | JSON | Yes | Yes | No? |
1 day | LLMware | Link here | 3100 | Apache-2.0 | Yes | Unclear (could not find the related parser code) | | | | can't find | can't find | can't find |
2 years | Layout-Parser | Link here | 4400 | Apache-2.0 | Yes | - | - | - | - | | | |
3 months | marker | Link here | 8000 | GPL-3.0 | Yes | Vision Transformer, OCR | | PDF, EPUB, MOBI | MD, LaTeX | Yes | Yes | No |
6 months | Nougat | Link here | 8000 | MIT | Yes | Vision Transformer, OCR | Academic | | TXT, MD, PNG, LaTeX | Yes | Yes | Yes |
7 months | Parsr | Link here | 5600 | Apache-2.0 | Yes | uses many third-party tools such as pdfminer, camelot, pymupdf, tesseract, PDF.js | - | - | - | - | - | - |
1 day | PyMuPDF | Link here | 4000 | AGPL-3.0 | Yes | OCR, tesseract | | PDF, TXT, SVG, EPUB, XPS, MOBI, FB2, CBZ | TXT, MD, DataFrame, PNG | Yes | Yes | Yes |
1 week | PyPDF | Link here | 7400 | BSD 3-clause? | Yes | | | | TXT | Yes | No | Limited |
2 weeks | pypdfium2 | Link here | 265 | Apache-2.0, BSD-3-Clause | Yes | uses PDFium | | | | Yes | No? | Limited |
5 days | PDFPlumber | Link here | 5500 | MIT | Yes | pdfminer | No scanned | | | Yes | Yes | No? |
3 years | PdfAct | Link here | 66 | Apache-2.0 | Yes | Rule-based, pdftotext | | | | | | |
1 week | PDFPig | Link here | 1500 | Apache-2.0 | Yes | - | - | - | - | | | |
2 weeks | PDFparser | Link here | 2300 | LGPL-3.0 | Yes | PHP, rule-based | - | - | - | - | | |
2 days | PaddleOCR | Link here | 38400 | Apache-2.0 | Yes | an OCR tool | | | | Yes | | |
9 months | TableBank | Link here | 966 | Apache-2.0 | Yes | detectron2, dataset | image-based | | | No | Yes | No |
7 months | table-transformers | Link here | 1800 | MIT | Yes | | | PDF, PNG | HTML, CSV | No | Yes | No |
1 month | tabula-py | Link here | 2100 | MIT | Yes | Rule-based, PDFBox (Stream), OpenCV (Lattice) | No scanned | | CSV, TSV, DataFrame, JSON | Limited | Yes | No |
3 years pip maintained | textract | Link here | 3800 | MIT | Yes | multiple tools including pdfminer and pdftotext | | any document? | text | | | |
- A simple benchmark of text, image, and table extraction (quantitative or qualitative) is established and presented in A simple benchmark of PDF parsing tools.
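
For the quantitative side of text extraction, one possible measure is a character-level similarity between a tool's output and a hand-checked reference. The sketch below uses Python's standard `difflib`. It is an assumption for illustration and not necessarily the metric used in the linked benchmark; `normalize` and `text_similarity` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse whitespace so layout differences do not dominate the score."""
    return " ".join(text.split())


def text_similarity(extracted, reference):
    """Return a similarity ratio in [0, 1] between extracted and reference text."""
    return SequenceMatcher(None, normalize(extracted), normalize(reference)).ratio()


reference = "PDF parsing is challenging."
# Same words, different whitespace: treated as a perfect extraction.
perfect = text_similarity("PDF   parsing is\nchallenging.", reference)
# OCR-style character confusions lower the score.
degraded = text_similarity("PDF pars1ng is cha11enging", reference)
```

Whether whitespace should be normalized depends on the use case: a benchmark that cares about preserved layout would compare the raw strings instead.
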