PDF Parsing for RAG

✏️️ Page Contributors: Khoi Tran Dang

🕛 Creation date: 24/06/2024

📥 Last Update: 01/07/2024

Challenges in PDF Parsing

PDFs can contain a variety of content to extract, such as text, images, tables, and metadata. All of it must be extracted accurately and with deliberate choices (e.g. whether or not to preserve the layout) to build a robust knowledge base.

However, PDF parsing is challenging because the format is visually oriented and stores content in no guaranteed order, and because components such as paragraphs, titles, headings, tables, images, captions, and metadata must each be detected and extracted. The complexity is further compounded by diverse PDF types, including scanned PDFs, multi-column layouts, messy formatting, and varied font encoding systems.
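To make the "unordered storage" problem concrete, here is a minimal sketch of rebuilding reading order from positioned text fragments. The `Fragment` type, the coordinate convention (y measured from the top of the page), and the `line_tolerance` value are all assumptions made for illustration; real parsers expose similar position data under their own names.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment with its position on the page."""
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points


def reading_order(fragments, line_tolerance=3.0):
    """Group fragments into lines by y, then order each line left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # A fragment joins the current line if it sits close to that line's first fragment.
        if lines and abs(frag.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Fragments listed in an arbitrary order, as they might be stored in the file.
fragments = [
    Fragment("Parsing", 60, 100.5),
    Fragment("for RAG", 120, 100),
    Fragment("Challenges", 10, 130),
    Fragment("PDF", 10, 100),
]
print(reading_order(fragments))  # ['PDF Parsing for RAG', 'Challenges']
```

This naive top-to-bottom, left-to-right rule also shows why multi-column layouts are hard: fragments from two columns that share the same height would be merged into one line.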

Approaches to PDF Parsing

Method	Description	Pros and Cons	Example tools
Rule-based	Utilizes hard-coded rules or templates to parse PDFs.	Pros: Fast. Cons: Struggles with varied PDF layouts.	Camelot, pdfplumber, pdfminer.six, pdftotext, pikepdf, PyMuPDF, PyPDF, pypdfium2, Tabula, textract
OCR-free small model-based	Uses transformer-based architectures instead of Optical Character Recognition (OCR).	Pros: Avoids OCR inaccuracies. Cons: Can be computationally intensive.	Donut, Nougat, Dessurt
Multimodal LLM	Utilizes Multimodal Large Language Models (MLLMs) with prompt engineering and fine-tuning.	Pros: High accuracy with fine-tuning. Cons: Requires extensive training data.	TextMonkey, LLaVAR, GPT-4V
Pipeline-based	Uses different approaches for each sub-task, including preprocessing, layout analysis, and structure recognition.	Pros: More flexible and can handle complex documents. Cons: Requires more resources and time.	Unstructured, Marker, LayoutParser
Third-party APIs	Delegates parsing to an external service through its API.	Pros: Easy to use and often provide high accuracy. Cons: Can be expensive and dependent on third-party services, with potential privacy and security concerns.	Adobe Extract API, LlamaParse, Amazon Textract
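As an illustration of the rule-based row in the table above, the following toy sketch applies a single hard-coded rule (two or more consecutive spaces separate columns) to already-extracted text. It is not how any of the listed tools works internally; the function names and the sample text are made up for the example.

```python
import re


def split_row(line):
    """Hard-coded rule: two or more whitespace characters separate columns."""
    return [cell.strip() for cell in re.split(r"\s{2,}", line.strip())]


def parse_table(text):
    """Return (rows matching the header width, rows that break the rule)."""
    rows = [split_row(line) for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    good = [dict(zip(header, row)) for row in body if len(row) == len(header)]
    bad = [row for row in body if len(row) != len(header)]
    return good, bad


# Sample text (invented); the last row has only one space between its first two cells.
page_text = """
Tool        License      Tables
camelot     MIT          Yes
tabula-py   MIT          Yes
GROBID Apache-2.0        Yes
"""

good, bad = parse_table(page_text)
print(good)  # two well-formed rows as dictionaries
print(bad)   # [['GROBID Apache-2.0', 'Yes']]
```

The rule is fast and easy to reason about, but a single spacing variation is enough to break it, which matches the "struggles with varied PDF layouts" drawback listed for this approach.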

Some recommended blogs

Current text/table/image extraction tools - May 2024

Below is a study of various PDF parsing tools based on multiple criteria. These criteria include the last maintenance date, GitHub stars, license, open-source status, availability of code, underlying technology, supported PDF types (all, scientific, or image-based), supported input and output formats, and capabilities in extracting text, tables, images, equations, and metadata.

Tools that did not offer Python usage or bindings, or that were not recently maintained (last maintained more than 2 years ago), were excluded from the study. Additionally, direct OCR tools already included in some pipeline-based parsers, and document layout analysis tools with pre-trained models, were excluded due to their long inference times.

Last maintained	Tools	Link	GitHub stars	Licence	Open-source	Technology	PDF types	Input	Output	Support TEXT extraction	Support TABLE extraction	Support IMAGE extraction
-	Adobe Extract API	Link here	-	-	No	Adobe Sensei AI Framework			JSON, XLSX	Yes	Yes	Yes
7 months	Apache PDFbox	Link here	2400	Apache-2.0	Yes		-	-	-	-	-	-
2 days	Apache Tika	Link here	2100	Apache-2.0	Yes	Apache PDFBox				Yes	Limited	Limited
1 month	borb	Link here	3300		Yes	Parser code not found				Yes	No	Support but No
7 months	camelot	Link here	2600	MIT	Yes	PDFMiner (Stream), OpenCV (Lattice)	No scanned	PDF	CSV, Dataframe, JSON, MD, HTML, SQLITE	Limited	Yes++	No
3 years	CascadeTabNet	Link here	1400	MIT	Yes	Cascade mask R-CNN HRNet	image-based			-	-	-
2 years	CDeCNet	Link here	131	MIT	Yes	Mask R-CNN, cascade			-	-	-	-
6 years	CERMINE	Link here	479	AGPL-3.0	Yes				-	-	-	-
1 day	DiT	Link here	-	MIT	Yes	Mask R-CNN, cascade, Vision Transformer, detectron2	image-based	PDF, IMAGE	bounding box	No	Yes (detection)	Yes (detection)
1 day	docTR	Link here	3000	Apache-2.0	Yes	an OCR tool	image-based		-	-	-	-
7 months	DocumentLayoutAnalysis	Link here	518	-	Yes				-	-	-	-
7 months	EasyOCR	Link here	21900	Apache-2.0	Yes	an OCR tool			-	-	-	-
1 day	GROBID	Link here	3100	Apache-2.0	Yes	CRF, RNN, Transformers, pdfalto	Academic	PDF	TEI XML	Yes	Yes	Yes
2 months	img2table	Link here	368	MIT	Yes	OpenCV, OCR		IMG, PDF	DATAFRAME	No	Yes	No
7 months	LlamaParse	Link here	762	MIT	Yes (API)	intended for RAG		PDF	JSON, MD, TXT, PNG	Yes	Yes	Yes
1 month	llmsherpa	Link here	916	MIT	Yes (API)	intended for RAG		DOCX, PPTX, HTML, TXT, XML	JSON	Yes	Yes	No?
1 day	LLMware	Link here	3100	Apache-2.0	Yes	Parser code not found				Not found	Not found	Not found
2 years	Layout-Parser	Link here	4400	Apache-2.0	Yes				-	-	-	-
3 months	marker	Link here	8000	GPL-3.0	Yes	Vision Transformer, OCR		PDF, EPUB, MOBI	MD, LATEX	Yes	Yes	No
6 months	Nougat	Link here	8000	MIT	Yes	Vision Transformer, OCR	Academic	PDF	TXT, MD, PNG, LATEX	Yes	Yes	Yes
7 months	Parsr	Link here	5600	Apache-2.0	Yes	Uses many third-party tools such as pdfminer, camelot, pymupdf, tesseract, PDF.js	-	-	-	-	-	-
1 day	PyMuPDF	Link here	4000	AGPL-3.0	Yes	OCR, tesseract		PDF, TXT, SVG, EPUB, XPS, MOBI, FB2, CBZ	TXT, MD, DATAFRAME, PNG	Yes	Yes	Yes
1 week	PyPDF	Link here	7400	BSD 3-clause?	Yes			PDF	TXT	Yes	No	Limited
2 weeks	pypdfium2	Link here	265	Apache-2.0, BSD-3-Clause	Yes	Uses PDFium				Yes	No?	Limited
5 days	PDFPlumber	Link here	5500	MIT	Yes	pdfminer	No scanned	PDF		Yes	Yes	No?
3 years	PdfAct	Link here	66	Apache-2.0	Yes	Rule-based, pdftotext
1 week	PDFPig	Link here	1500	Apache-2.0	Yes				-	-	-	-
2 weeks	PDFparser	Link here	2300	LGPL-3.0	Yes	PHP, rule-based			-	-	-	-
2 days	PaddleOCR	Link here	38400	Apache-2.0	Yes	an OCR tool						Yes
9 months	TableBank	Link here	966	Apache-2.0	Yes	detectron2, dataset	image-based			No	Yes	No
7 months	table-transformers	Link here	1800	MIT	Yes			PDF, PNG	HTML, CSV	No	Yes	No
1 month	tabula-py	Link here	2100	MIT	Yes	Rule-based, PDFBox (Stream), OpenCV (Lattice)	No scanned	PDF	CSV, TSV, Dataframe, JSON	Limited	Yes	No
3 years pip maintained	textract	Link here	3800	MIT	Yes	multiple tools including pdfminer and pdftotext		any document?	text
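The maintenance criterion described before the table can be sketched as a small filter over the "last maintained" labels. This is an illustration only: the helper names are hypothetical, months and years are approximated as 30 and 365 days, and treating a label of exactly "2 years" as still acceptable is an assumption, since the labels are rounded.

```python
import re

# Approximate length of each unit in days (an assumption for this sketch).
UNIT_IN_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}


def age_in_days(label):
    """Convert labels such as '7 months' or '3 years pip maintained' to days.

    Returns None when the label carries no age (for example '-').
    """
    match = re.match(r"(\d+)\s+(day|week|month|year)s?\b", label.strip())
    if match is None:
        return None
    return int(match.group(1)) * UNIT_IN_DAYS[match.group(2)]


def recently_maintained(label, max_age_days=2 * 365):
    """True if the tool was maintained within the cutoff; unknown ages count as False."""
    age = age_in_days(label)
    return age is not None and age <= max_age_days


# A few (tool, last maintained) pairs copied from the table above.
tools = [
    ("camelot", "7 months"),
    ("CascadeTabNet", "3 years"),
    ("CERMINE", "6 years"),
    ("Layout-Parser", "2 years"),
    ("PyMuPDF", "1 day"),
    ("Adobe Extract API", "-"),
]

kept = [name for name, label in tools if recently_maintained(label)]
print(kept)  # ['camelot', 'Layout-Parser', 'PyMuPDF']
```

Entries with no maintenance date, such as closed-source APIs, would need a separate decision; here they are simply not counted as recently maintained.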
