Home - Vermont-Complex-Systems/pdf-zoo GitHub Wiki

📄 PDF Zoo Wiki

Welcome to the PDF Zoo, a comprehensive collection of tools, libraries, and models for PDF processing, OCR, and document understanding. It spans everything from traditional OCR engines to LLM-powered solutions, so you can find the right tool for your document processing needs.

🧩 The PDF Parsing Challenge

PDF parsing has evolved from simple OCR to sophisticated layout understanding and vision-language models. Learn about the history and ongoing challenges of document processing and why different tools exist for different needs.

⭐ My Faves

olmocr - Currently one of the most accurate OCR tools available, built on vision-language models
PyMuPDF - Excellent documentation, cost-effective to run, and reliable for most PDF processing tasks
surya - Performs well at layout-aware extraction, returning bounding boxes and preserving reading order
s2orc-doc2json - Excellent for parsing and structuring academic papers down to the section level, with robust citation extraction
MinerU - More complex, but impressive at detecting and separating main content from boilerplate text

🚀 Quick Start

Explore our curated collection organized by type and capability in the sidebar, or jump straight to these key categories:

📚 OCR Solutions

Traditional OCR - Time-tested engines like tesseract
Model-Based OCR - Deep learning tools: easyOCR, PaddleOCR, textra
LLM-Based OCR - Vision-language models: olmocr, GOT-ocr2, florence
Layout Analysis - Structure-aware: surya, grobid, publaynet

🔧 Processing Tools

OCR Toolkits - PyMuPDF, docling, OCRmyPDF
PDF to Markdown - marker, s2orc-doc2json, nougat
Text Extraction - pdfplumber, pdfminer.six, pypdf2
Text Structuring - NuExtract, langextract

☁️ Cloud Services

Enterprise APIs - Azure Document Intelligence, AWS Textract, Google Document AI

📋 Entry Format

Each tool entry follows this standardized format:

### [Tool Name](repository-url)
tags: `#category`, `#feature1`, `#feature2`  
deps: [dependency1](link), [dependency2](link)  
inst: `organization/company`  
paper: https://arxiv.org/abs/paper-id  
date: Month Year  
live: https://demo-url.com  
models: [model_name](link)  
limitations: Brief description of constraints or downsides   
> Brief description of capabilities and key strengths

<img src="workflow-diagram.png" alt="Tool workflow" width="600">
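As a rough illustration of how entries in this format could be read by a script, the sketch below parses a single entry into a small Python object. The names `Entry`, `parse_entry`, and `FIELDS` are hypothetical and introduced only for this example; they are not part of the wiki or of any listed tool. The parser assumes one `key: value` pair per line, as in the template above, and uses the template itself as sample input.

```python
import re
from dataclasses import dataclass, field

# Field labels copied from the entry template; anything else is ignored.
FIELDS = ("tags", "deps", "inst", "paper", "date", "live", "models", "limitations")

# Matches the heading line: ### [Tool Name](repository-url)
HEADING = re.compile(r"^###\s*\[(?P<name>[^\]]+)\]\((?P<url>[^)]*)\)\s*$")


@dataclass
class Entry:
    """Hypothetical container for one parsed wiki entry."""
    name: str
    url: str
    description: str = ""
    fields: dict = field(default_factory=dict)

    @property
    def tags(self):
        # Tags are written as `#tag`, separated by commas.
        return re.findall(r"#[\w-]+", self.fields.get("tags", ""))


def parse_entry(text):
    """Parse one entry written in the template format into an Entry."""
    entry = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING.match(line)
        if match:
            entry = Entry(match["name"], match["url"])
            continue
        if entry is None or not line:
            continue
        if line.startswith(">"):
            # The blockquote line holds the free-text description.
            entry.description = line.lstrip("> ").strip()
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            entry.fields[key] = value.strip()
    if entry is None:
        raise ValueError("no '### [Tool Name](url)' heading found")
    return entry


# The template itself serves as sample input.
SAMPLE = """\
### [Tool Name](repository-url)
tags: `#category`, `#feature1`, `#feature2`
deps: [dependency1](link), [dependency2](link)
inst: `organization/company`
paper: https://arxiv.org/abs/paper-id
date: Month Year
live: https://demo-url.com
models: [model_name](link)
limitations: Brief description of constraints or downsides
> Brief description of capabilities and key strengths

<img src="workflow-diagram.png" alt="Tool workflow" width="600">
"""

entry = parse_entry(SAMPLE)
print(entry.name, entry.tags)
```

Splitting on the first colon only is a deliberate choice here: it keeps `paper:` and `live:` values, which are URLs containing further colons, in one piece.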

🏷️ Tag System (TAGxonomy)

Our classification system helps you find tools by type and functionality:

Primary Categories

| Tag | Description |
| --- | --- |
| `#trad` | Traditional/classic OCR approaches |
| `#model-based` | Modern neural network-based OCR |
| `#llm` | Large language model-based OCR |
| `#toolkit` | Multi-purpose libraries with various features |
| `#ocr-free` | Transformer models that bypass traditional OCR |
| `#cloud` | Enterprise cloud-based document processing services |

Functionality Tags

| Tag | Description |
| --- | --- |
| `#readingOrder` | Text reading sequence determination |
| `#layoutAnalysis` | Document layout understanding |
| `#pdf2markdown` | PDF to markdown conversion |
| `#layout` | General layout processing |
| `#multiModal` | Multiple input types support |
| `#PDF-wrangling` | PDF manipulation and processing |
| `#tableExtraction` | Table detection and extraction |
| `#formulaOCR` | Mathematical formula recognition |
| `#handwriting` | Handwritten text recognition |
| `#multilingual` | Multi-language support |
| `#structuring` | Unstructured to structured text |
| `#gpu` | GPU acceleration support |
| `#addTextLayer` | Adds searchable text layers to PDFs |
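
To show how the tag vocabulary might be used programmatically, here is a minimal sketch that checks tags against the two lists above and filters a catalog by required tags. The functions `unknown_tags` and `find_tools` and the catalog contents are hypothetical: the tool names are placeholders and the tag assignments are invented for illustration, not taken from the wiki's real entries.

```python
# Tag vocabulary copied from the Primary Categories and Functionality Tags lists.
PRIMARY_TAGS = {"#trad", "#model-based", "#llm", "#toolkit", "#ocr-free", "#cloud"}
FUNCTIONALITY_TAGS = {
    "#readingOrder", "#layoutAnalysis", "#pdf2markdown", "#layout",
    "#multiModal", "#PDF-wrangling", "#tableExtraction", "#formulaOCR",
    "#handwriting", "#multilingual", "#structuring", "#gpu", "#addTextLayer",
}


def unknown_tags(tags):
    """Return the tags that are not part of the vocabulary, sorted."""
    known = PRIMARY_TAGS | FUNCTIONALITY_TAGS
    return sorted(tag for tag in tags if tag not in known)


def find_tools(catalog, *required):
    """Return the names of tools that carry every required tag, sorted."""
    wanted = set(required)
    return sorted(name for name, tags in catalog.items() if wanted <= set(tags))


# Placeholder catalog: names and tag assignments are made up for this example.
catalog = {
    "example-tool-a": ["#llm", "#pdf2markdown", "#gpu"],
    "example-tool-b": ["#trad", "#multilingual"],
    "example-tool-c": ["#llm", "#tableExtraction", "#gpu"],
}

gpu_llm_tools = find_tools(catalog, "#llm", "#gpu")
typos = unknown_tags(["#llm", "#tables"])
print(gpu_llm_tools, typos)
```

A check like `unknown_tags` could be run on a new entry before it is added, to catch misspelled or non-standard tags early.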

🤝 Contributing

Found a tool we're missing? Want to suggest improvements? This wiki is a living document that grows with the community's contributions.