Home - Vermont-Complex-Systems/pdf-zoo GitHub Wiki

📄 PDF Zoo Wiki

Welcome to the PDF Zoo - a comprehensive collection of tools, libraries, and models for PDF processing, OCR, and document understanding. From traditional OCR engines to cutting-edge LLM-powered solutions, find the right tool for your document processing needs.

🧩 The PDF Parsing Challenge

PDF parsing has evolved from simple OCR to sophisticated layout understanding and vision-language models. Learn about the history and ongoing challenges of document processing and why different tools exist for different needs.

⭐ My Faves

  • olmocr - Currently one of the most accurate OCR tools available, leveraging cutting-edge vision-language models
  • PyMuPDF - Excellent documentation, cost-effective to run, and reliable for most PDF processing tasks
  • surya - Great performance with bounding boxes and preserved reading order for layout-aware extraction
  • s2orc-doc2json - Excellent for parsing and structuring academic papers down to the section level, with robust citation extraction
  • MinerU - More complex but impressive at intelligently detecting and separating content from boilerplate text

🚀 Quick Start

Browse the curated collection in the sidebar, organized by type and capability, or jump straight to these key categories:

📚 OCR Solutions

🔧 Processing Tools

☁️ Cloud Services

📋 Entry Format

Each tool entry follows this standardized format:

### [Tool Name](repository-url)
tags: `#category`, `#feature1`, `#feature2`  
deps: [dependency1](link), [dependency2](link)  
inst: `organization/company`  
paper: https://arxiv.org/abs/paper-id  
date: Month Year  
live: https://demo-url.com  
models: [model_name](link)  
limitations: Brief description of constraints or downsides   
> Brief description of capabilities and key strengths

<img src="workflow-diagram.png" alt="Tool workflow" width="600">
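
Because every entry uses the same field prefixes, the format is straightforward to read programmatically. The sketch below is one possible way to turn an entry into a dictionary; it is not tooling shipped with this wiki. The function name `parse_entry`, the sample entry, and its `example.com` URLs are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical sample entry, written to match the template above.
ENTRY = """\
### [Example Tool](https://example.com/example-tool)
tags: `#toolkit`, `#tableExtraction`
deps: [dependency1](https://example.com/dep1)
inst: `example-org`
date: January 2024
limitations: Placeholder limitation text
> Placeholder description of capabilities
"""

# Field names are taken from the entry template; the patterns are assumptions.
TITLE_RE = re.compile(r"^###\s*\[(.+?)\]\((.+?)\)\s*$")
FIELD_RE = re.compile(r"^(tags|deps|inst|paper|date|live|models|limitations):\s*(.*)$")


def parse_entry(text):
    """Parse one entry into a dict; fields missing from the entry are omitted."""
    entry = {}
    description = []
    for raw in text.splitlines():
        line = raw.strip()  # also drops the trailing double-space line breaks
        if not line:
            continue
        title = TITLE_RE.match(line)
        if title:
            entry["name"], entry["url"] = title.groups()
            continue
        field = FIELD_RE.match(line)
        if field:
            key, value = field.groups()
            entry[key] = value
        elif line.startswith(">"):
            # Blockquote lines carry the free-text description.
            description.append(line.lstrip("> ").strip())
    # Split the raw tags string into individual '#tag' tokens.
    entry["tags"] = re.findall(r"#[\w-]+", entry.get("tags", ""))
    entry["description"] = " ".join(description)
    return entry


if __name__ == "__main__":
    print(parse_entry(ENTRY))
```

Other fields such as `deps` and `models` are kept as raw strings here; splitting their markdown links into name and URL pairs would be a natural next step.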

🏷️ Tag System (TAGxonomy)

Our classification system helps you find tools by type and functionality:

Primary Categories

| Tag | Description |
|---|---|
| `#trad` | Traditional/classic OCR approaches |
| `#model-based` | Modern neural network-based OCR |
| `#llm` | Large language model-based OCR |
| `#toolkit` | Multi-purpose libraries with various features |
| `#ocr-free` | Transformer models that bypass traditional OCR |
| `#cloud` | Enterprise cloud-based document processing services |

Functionality Tags

| Tag | Description |
|---|---|
| `#readingOrder` | Text reading sequence determination |
| `#layoutAnalysis` | Document layout understanding |
| `#pdf2markdown` | PDF to markdown conversion |
| `#layout` | General layout processing |
| `#multiModal` | Multiple input types support |
| `#PDF-wrangling` | PDF manipulation and processing |
| `#tableExtraction` | Table detection and extraction |
| `#formulaOCR` | Mathematical formula recognition |
| `#handwriting` | Handwritten text recognition |
| `#multilingual` | Multi-language support |
| `#structuring` | Unstructured to structured text |
| `#gpu` | GPU acceleration support |
| `#addTextLayer` | Adds searchable text layers to PDFs |
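
As a rough illustration of how the tags can be combined, the sketch below filters a small catalogue by required tags and picks out each tool's primary category. The tool names (`tool-a`, `tool-b`, `tool-c`) and their tag assignments are made up for the example and do not describe real entries in this wiki; `find_tools` and `primary_category` are likewise hypothetical helper names.

```python
# Primary category tags, copied from the "Primary Categories" list above.
PRIMARY_TAGS = {"#trad", "#model-based", "#llm", "#toolkit", "#ocr-free", "#cloud"}

# Hypothetical catalogue: placeholder tool names mapped to their tag sets.
TOOLS = {
    "tool-a": {"#llm", "#pdf2markdown", "#gpu"},
    "tool-b": {"#toolkit", "#PDF-wrangling", "#tableExtraction"},
    "tool-c": {"#model-based", "#readingOrder", "#gpu"},
}


def find_tools(tools, required):
    """Return the names of tools that carry every tag in `required`, sorted."""
    required = set(required)
    return sorted(name for name, tags in tools.items() if required <= tags)


def primary_category(tags):
    """Return the primary category tags found in a tool's tag set, sorted."""
    return sorted(set(tags) & PRIMARY_TAGS)


if __name__ == "__main__":
    print(find_tools(TOOLS, ["#gpu"]))
    print(find_tools(TOOLS, ["#gpu", "#llm"]))
    print(primary_category(TOOLS["tool-b"]))
```

Using sets keeps the "has all of these tags" check to a single subset comparison, which stays readable as more functionality tags are added.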

🤝 Contributing

Found a tool we're missing? Want to suggest improvements? This wiki is a living document that grows with the community's contributions.