Pdf Parsing - trankhoidang/RAG-wiki GitHub Wiki

PDF Parsing for RAG

✏️️ Page Contributors: Khoi Tran Dang

🕛 Creation date: 24/06/2024

📥 Last Update: 01/07/2024

Challenges in PDF Parsing

PDFs can contain many kinds of content, such as text, images, tables, and metadata. Building a robust knowledge base requires extracting all of it accurately and deliberately, for example by deciding whether or not to preserve the layout.

However, PDF parsing is challenging because the format is visually oriented and stores content in no guaranteed reading order, so a parser must detect and extract components such as paragraphs, titles, headings, tables, images, captions, and metadata. The difficulty is compounded by the diversity of PDFs, including scanned documents, multi-column layouts, messy formatting, and varied font encodings.
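
To make the unordered-storage problem concrete, here is a minimal sketch, using only the Python standard library, of restoring reading order for text blocks on a two-column page. The `Block` type, the coordinates, and the midpoint split are illustrative assumptions rather than the behavior of any specific parser; real documents usually need proper layout analysis.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with the top-left corner of its bounding box (hypothetical type)."""
    text: str
    x0: float  # left edge, in points
    y0: float  # top edge, in points (0 = top of the page)


def reading_order(blocks: list[Block], page_width: float) -> list[Block]:
    """Return blocks left column first, then right column, top to bottom in each.

    Assumes exactly two columns split at the page midpoint, which is a
    simplification made for this illustration.
    """
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x0 < mid), key=lambda b: b.y0)
    right = sorted((b for b in blocks if b.x0 >= mid), key=lambda b: b.y0)
    return left + right


# Blocks in an arbitrary storage order, as they might be found in the file.
stored = [
    Block("right column, first paragraph", x0=320, y0=80),
    Block("left column, second paragraph", x0=50, y0=300),
    Block("left column, first paragraph", x0=50, y0=80),
    Block("right column, second paragraph", x0=320, y0=300),
]

ordered = [b.text for b in reading_order(stored, page_width=612)]
```

Sorting the same blocks by vertical position alone would interleave the two columns, which is one reason multi-column layouts are hard for naive extraction.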

Approaches to PDF Parsing

| Method | Description | Pros and cons | Example tools |
| --- | --- | --- | --- |
| Rule-based | Uses hard-coded rules or templates to parse PDFs. | Pros: Fast. Cons: Struggles with varied PDF layouts. | Camelot, pdfplumber, pdfminer.six, pdftotext, pikepdf, PyMuPDF, PyPDF, pypdfium2, Tabula, textract |
| OCR-free small model-based | Uses transformer-based architectures instead of Optical Character Recognition (OCR). | Pros: Avoids OCR inaccuracies. Cons: Can be computationally intensive. | Donut, Nougat, Dessurt |
| Multimodal LLM | Uses Multimodal Large Language Models (MLLMs) with prompt engineering and fine-tuning. | Pros: High accuracy with fine-tuning. Cons: Requires extensive training data. | TextMonkey, LLaVAR, GPT-4V |
| Pipeline-based | Uses a different approach for each sub-task, including preprocessing, layout analysis, and structure recognition. | Pros: More flexible and can handle complex documents. Cons: Requires more resources and time. | Unstructured, Marker, LayoutParser |
| Third-party APIs | Uses external APIs that provide parsing as a service. | Pros: Easy to use and often highly accurate. Cons: Can be expensive and creates a dependency on third-party services, with potential privacy and security concerns. | Adobe Extract API, LlamaParse, Amazon Textract |
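
As an illustration of the rule-based approach and its trade-off, the sketch below labels text lines as headings or body text with two hard-coded rules. The `Line` type, the font-size threshold, and the sample lines are assumptions made for this example and are not taken from any of the listed tools.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """A line of extracted text with its font size (hypothetical type)."""
    text: str
    font_size: float


def classify(line: Line, body_size: float = 10.0) -> str:
    """Rule: a noticeably larger font and no trailing period means a heading."""
    is_large = line.font_size >= body_size * 1.2
    ends_like_sentence = line.text.rstrip().endswith(".")
    return "heading" if is_large and not ends_like_sentence else "body"


lines = [
    Line("Challenges in PDF Parsing", 14.0),
    Line("PDFs can contain text, images, tables and metadata.", 10.0),
    # A heading typeset in bold at body size: the font-size rule cannot see it.
    Line("Approaches to PDF Parsing", 10.0),
]

labels = [classify(line) for line in lines]
```

The rules run quickly and need no model, but the third line is labelled as body text because its layout does not match the template, which is the weakness noted for this approach.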

Some recommended blogs

Current text/table/image extraction tools - May 2024

The table below summarizes a study of various PDF parsing tools against multiple criteria: last maintenance date, GitHub stars, license, open-source status, availability of code, underlying technology, supported PDF types (all, scientific, or image-based), supported input and output formats, and capabilities in extracting text, tables, images, equations, and metadata.

Tools that offer no Python usage or bindings, or that had not been maintained within the last 2 years, were excluded from the study. Direct OCR tools already included in some pipeline-based parsers, and document layout analysis tools with pre-trained models, were also excluded because of their long inference times.
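
The two exclusion rules can be written as a simple filter. This is a minimal sketch: the `Tool` type is hypothetical, the ages are rounded from the "last maintained" values in the comparison table, and the `has_python_api` flags are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """One candidate parser (hypothetical record type)."""
    name: str
    months_since_update: float
    has_python_api: bool  # assumed values, for illustration only


MAX_AGE_MONTHS = 24  # "older than 2 years" means excluded


def is_eligible(tool: Tool) -> bool:
    """Keep tools that are usable from Python and maintained within 2 years."""
    return tool.has_python_api and tool.months_since_update <= MAX_AGE_MONTHS


candidates = [
    Tool("camelot", 7, True),
    Tool("CERMINE", 72, False),       # 6 years; Java tool
    Tool("PDFparser", 0.5, False),    # 2 weeks; PHP library
    Tool("CascadeTabNet", 36, True),  # 3 years
]

shortlist = [t.name for t in candidates if is_eligible(t)]
```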

| Last maintained | Tool | GitHub stars | License | Open-source | Technology | PDF types | Input | Output | Text extraction | Table extraction | Image extraction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | Adobe Extract API | - | - | No | Adobe Sensei AI Framework | - | - | JSON, XLSX | Yes | Yes | Yes |
| 7 months | Apache PDFBox | 2400 | Apache-2.0 | Yes | - | - | - | - | - | - | - |
| 2 days | Apache Tika | 2100 | Apache-2.0 | Yes | Apache PDFBox | - | - | - | Yes | Limited | Limited |
| 1 month | borb | 3300 | - | Yes | Unclear (could not find the related parser code) | - | - | - | Yes | No | Support but No |
| 7 months | camelot | 2600 | MIT | Yes | PDFMiner (Stream), OpenCV (Lattice) | No scanned PDF | - | CSV, DataFrame, JSON, MD, HTML, SQLite | Limited | Yes++ | No |
| 3 years | CascadeTabNet | 1400 | MIT | Yes | Cascade Mask R-CNN HRNet | image-based | - | - | - | - | - |
| 2 years | CDeCNet | 131 | MIT | Yes | Mask R-CNN, cascade | - | - | - | - | - | - |
| 6 years | CERMINE | 479 | AGPL-3.0 | Yes | - | - | - | - | - | - | - |
| 1 day | DiT | - | MIT | Yes | Mask R-CNN, cascade, Vision Transformer, detectron2 | image-based | PDF, IMAGE | bounding box | No | Yes (detection) | Yes (detection) |
| 1 day | docTR | 3000 | Apache-2.0 | Yes | an OCR tool | image-based | - | - | - | - | - |
| 7 months | DocumentLayoutAnalysis | 518 | - | Yes | - | - | - | - | - | - | - |
| 7 months | EasyOCR | 21900 | Apache-2.0 | Yes | an OCR tool | - | - | - | - | - | - |
| 1 day | GROBID | 3100 | Apache-2.0 | Yes | CRF, RNN, Transformers, pdfalto | Academic PDF | - | TEI XML | Yes | Yes | Yes |
| 2 months | img2table | 368 | MIT | Yes | OpenCV, OCR | - | IMG, PDF | DataFrame | No | Yes | No |
| 7 months | LlamaParse | 762 | MIT | Yes (API) | intended for RAG | - | PDF | JSON, MD, TXT, PNG | Yes | Yes | Yes |
| 1 month | llmsherpa | 916 | MIT | Yes (API) | intended for RAG | - | DOCX, PPTX, HTML, TXT, XML | JSON | Yes | Yes | No? |
| 1 day | LLMware | 3100 | Apache-2.0 | Yes | Unclear (could not find the related parser code) | - | - | - | Not found | Not found | Not found |
| 2 years | Layout-Parser | 4400 | Apache-2.0 | Yes | - | - | - | - | - | - | - |
| 3 months | marker | 8000 | GPL-3.0 | Yes | Vision Transformer, OCR | - | PDF, EPUB, MOBI | MD, LATEX | Yes | Yes | No |
| 6 months | Nougat | 8000 | MIT | Yes | Vision Transformer, OCR | Academic PDF | - | TXT, MD, PNG, LATEX | Yes | Yes | Yes |
| 7 months | Parsr | 5600 | Apache-2.0 | Yes | Uses many third-party tools such as pdfminer, camelot, pymupdf, tesseract, PDF.js | - | - | - | - | - | - |
| 1 day | PyMuPDF | 4000 | AGPL-3.0 | Yes | OCR, tesseract | - | PDF, TXT, SVG, EPUB, XPS, MOBI, FB2, CBZ | TXT, MD, DataFrame, PNG | Yes | Yes | Yes |
| 1 week | PyPDF | 7400 | BSD 3-clause? | Yes | - | - | PDF | TXT | Yes | No | Limited |
| 2 weeks | pypdfium2 | 265 | Apache-2.0, BSD-3-Clause | Yes | Uses PDFium | - | - | - | Yes | No? | Limited |
| 5 days | PDFPlumber | 5500 | MIT | Yes | pdfminer | No scanned PDF | - | - | Yes | Yes | No? |
| 3 years | PdfAct | 66 | Apache-2.0 | Yes | Rule-based, pdftotext | - | - | - | - | - | - |
| 1 week | PDFPig | 1500 | Apache-2.0 | Yes | - | - | - | - | - | - | - |
| 2 weeks | PDFparser | 2300 | LGPL-3.0 | Yes | PHP, rule-based | - | - | - | - | - | - |
| 2 days | PaddleOCR | 38400 | Apache-2.0 | Yes | an OCR tool | - | - | - | Yes | - | - |
| 9 months | TableBank | 966 | Apache-2.0 | Yes | detectron2, dataset | image-based | - | - | No | Yes | No |
| 7 months | table-transformers | 1800 | MIT | Yes | - | - | PDF, PNG | HTML, CSV | No | Yes | No |
| 1 month | tabula-py | 2100 | MIT | Yes | Rule-based, PDFBox (Stream), OpenCV (Lattice) | No scanned PDF | - | CSV, TSV, DataFrame, JSON | Limited | Yes | No |
| 3 years (pip maintained) | textract | 3800 | MIT | Yes | Multiple tools including pdfminer and pdftotext | - | any document? | text | - | - | - |

Further reading

← Previous: S02_Data-Parsing

Next: Pdf-Parsing-Benchmark →
