Home - Vermont-Complex-Systems/pdf-zoo GitHub Wiki
📄 PDF Zoo Wiki
Welcome to the PDF Zoo - a comprehensive collection of tools, libraries, and models for PDF processing, OCR, and document understanding. From traditional OCR engines to cutting-edge LLM-powered solutions, find the right tool for your document processing needs.
The PDF Parsing Challenge
🧩 PDF parsing has evolved from simple OCR to sophisticated layout understanding and vision-language models. Learn about the history and ongoing challenges of document processing and why different tools exist for different needs.
⭐ My Faves
- olmocr - Currently one of the most accurate OCR tools available, leveraging cutting-edge vision-language models
- PyMuPDF - Excellent documentation, cost-effective to run, and reliable for most PDF processing tasks
- surya - Great performance with bounding boxes and preserved reading order for layout-aware extraction
- s2orc-doc2json - Excellent for parsing and structuring academic papers down to the section level, with robust citation extraction
- MinerU - More complex but impressive at intelligently detecting and separating content from boilerplate text
🚀 Quick Start
Explore our curated collection organized by type and capability in the sidebar, or jump straight to these key categories:
📚 OCR Solutions
- Traditional OCR - Time-tested engines like tesseract
- Model-Based OCR - Deep learning tools: easyOCR, PaddleOCR, textra
- LLM-Based OCR - Vision-language models: olmocr, GOT-ocr2, florence
- Layout Analysis - Structure-aware: surya, grobid, publaynet
🔧 Processing Tools
- OCR Toolkits - PyMuPDF, docling, OCRmyPDF
- PDF to Markdown - marker, s2orc-doc2json, nougat
- Text Extraction - pdfplumber, pdfminer.six, pypdf2
- Text Structuring - NuExtract, langextract
☁️ Cloud Services
- Enterprise APIs - Azure Document Intelligence, AWS Textract, Google Document AI
📋 Entry Format
Each tool entry follows this standardized format:
### [Tool Name](repository-url)
tags: `#category`, `#feature1`, `#feature2`
deps: [dependency1](link), [dependency2](link)
inst: `organization/company`
paper: https://arxiv.org/abs/paper-id
date: Month Year
live: https://demo-url.com
models: [model_name](link)
limitations: Brief description of constraints or downsides
> Brief description of capabilities and key strengths
<img src="workflow-diagram.png" alt="Tool workflow" width="600">
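Because every entry shares the same fields, the format is easy to read programmatically. The following is a minimal sketch, not part of the wiki's tooling: `parse_entry`, `HEADING_RE`, and `FIELD_RE` are hypothetical names, and the sketch assumes each field sits on its own line exactly as in the template above.

```python
import re

# Hypothetical helpers: the wiki does not ship a parser; this only mirrors the template.
HEADING_RE = re.compile(r"^###\s*\[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\)\s*$")
FIELD_RE = re.compile(r"^(tags|deps|inst|paper|date|live|models|limitations):\s*(.*)$")


def parse_entry(text: str) -> dict:
    """Turn one tool entry written in the standardized format into a dict."""
    entry = {}
    description = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = HEADING_RE.match(line)
        if heading:
            entry["name"] = heading["name"]
            entry["url"] = heading["url"]
            continue
        field = FIELD_RE.match(line)
        if field:
            key, value = field.groups()
            # Tags become a list; every other field is kept as raw text.
            entry[key] = re.findall(r"#[\w-]+", value) if key == "tags" else value
            continue
        if line.startswith(">"):
            # Blockquote lines carry the free-text description.
            description.append(line.lstrip("> ").strip())
        # Anything else (for example the <img> line) is ignored.
    entry["description"] = " ".join(description)
    return entry


SAMPLE = """### [Tool Name](repository-url)
tags: `#category`, `#feature1`, `#feature2`
inst: `organization/company`
date: Month Year
> Brief description of capabilities and key strengths
"""

if __name__ == "__main__":
    print(parse_entry(SAMPLE))
```

Fields missing from an entry are simply absent from the resulting dict, so callers should use `entry.get("paper")` rather than indexing directly.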
🏷️ Tag System (TAGxonomy)
Our classification system helps you find tools by type and functionality:
Primary Categories
Tag | Description |
---|---|
#trad | Traditional/classic OCR approaches |
#model-based | Modern neural network-based OCR |
#llm | Large language model-based OCR |
#toolkit | Multi-purpose libraries with various features |
#ocr-free | Transformer models that bypass traditional OCR |
#cloud | Enterprise cloud-based document processing services |
Functionality Tags
Tag | Description |
---|---|
#readingOrder | Text reading sequence determination |
#layoutAnalysis | Document layout understanding |
#pdf2markdown | PDF to markdown conversion |
#layout | General layout processing |
#multiModal | Multiple input types support |
#PDF-wrangling | PDF manipulation and processing |
#tableExtraction | Table detection and extraction |
#formulaOCR | Mathematical formula recognition |
#handwriting | Handwritten text recognition |
#multilingual | Multi-language support |
#structuring | Unstructured to structured text |
#gpu | GPU acceleration support |
#addTextLayer | Adds searchable text layers to PDFs |
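The tag lists above can double as a small controlled vocabulary. The sketch below shows one way to check an entry's tags against it and to filter tools by required tags. The function names and the `catalog` contents are made up for illustration (`tool-a` and friends are placeholders, not real wiki entries); only the tag names come from the tables.

```python
# Tag vocabulary copied from the two tables above.
PRIMARY_TAGS = {"#trad", "#model-based", "#llm", "#toolkit", "#ocr-free", "#cloud"}
FUNCTIONALITY_TAGS = {
    "#readingOrder", "#layoutAnalysis", "#pdf2markdown", "#layout",
    "#multiModal", "#PDF-wrangling", "#tableExtraction", "#formulaOCR",
    "#handwriting", "#multilingual", "#structuring", "#gpu", "#addTextLayer",
}
KNOWN_TAGS = PRIMARY_TAGS | FUNCTIONALITY_TAGS


def unknown_tags(tags):
    """Return the tags that are not part of the taxonomy, sorted for stable output."""
    return sorted(set(tags) - KNOWN_TAGS)


def filter_tools(catalog, required):
    """Return names of tools whose tag list contains every required tag."""
    wanted = set(required)
    return [name for name, tags in catalog.items() if wanted <= set(tags)]


# Placeholder catalog: hypothetical tools, not taken from the wiki.
catalog = {
    "tool-a": ["#llm", "#pdf2markdown", "#gpu"],
    "tool-b": ["#trad", "#multilingual"],
    "tool-c": ["#toolkit", "#tableExtraction", "#addTextLayer"],
}

if __name__ == "__main__":
    print(filter_tools(catalog, ["#llm", "#gpu"]))  # tools tagged with both
    print(unknown_tags(["#llm", "#ocr"]))           # tags outside the taxonomy
```

Tag matching here is case-sensitive, so `#readingorder` would be reported as unknown; whether the wiki treats tags that way is an assumption.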
🤝 Contributing
Found a tool we're missing? Want to suggest improvements? This wiki is a living document that grows with the community's contributions.