docling - Vermont-Complex-Systems/pdf-zoo GitHub Wiki

tags: #layout, #structuring, #ocr
inst: IBM Research
paper: https://arxiv.org/pdf/2408.09869
link: rag_langchain colab
deps: easyocr (ocr, default), tesseract (ocr, optional), pypdfium2

Document layout analysis toolkit with integrated OCR options.

What Makes Docling Stand Out

Enterprise Quality: Developed by IBM Research's DS4SD team and documented in a technical report (see the paper link above).

Flexible OCR Backends: Switch between EasyOCR (default) and Tesseract depending on your needs; you are not locked into one engine.
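
A minimal sketch of how the OCR backend could be selected. The import paths and option classes (`PdfPipelineOptions`, `EasyOcrOptions`, `TesseractOcrOptions`, `PdfFormatOption`) are assumed from the Docling 2.x Python API and may differ in other versions; `build_converter` and `SUPPORTED_ENGINES` are hypothetical names introduced for illustration only.

```python
# Hypothetical helper: names below are illustrative, not part of Docling.
SUPPORTED_ENGINES = ("easyocr", "tesseract")


def build_converter(ocr_engine: str = "easyocr"):
    """Return a Docling DocumentConverter configured for the given OCR engine."""
    if ocr_engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unknown OCR engine {ocr_engine!r}; expected one of {SUPPORTED_ENGINES}"
        )

    # Imported lazily so the argument check above works without docling installed.
    # Assumption: these import paths match the installed Docling 2.x release.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    ocr_options = (
        EasyOcrOptions() if ocr_engine == "easyocr" else TesseractOcrOptions()
    )

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = ocr_options

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

Calling `build_converter("tesseract")` would require the Tesseract backend to be installed separately, since it is an optional dependency.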

Layout-First Approach: Models document structure and preserves the relationships between elements (headings, paragraphs, tables, figures) rather than only extracting text.
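
To illustrate what a structured result offers beyond raw text, the sketch below tallies the element types found in a converted document. It assumes the document object exposes `iterate_items()` yielding `(item, level)` pairs with a `label` attribute on each item, as the Docling 2.x document model appears to; `count_element_labels` is a hypothetical helper, not a Docling function.

```python
from collections import Counter


def count_element_labels(document) -> Counter:
    """Count document elements by their layout label (e.g. text, table)."""
    counts = Counter()
    # Assumption: iterate_items() yields (item, nesting_level) tuples.
    for item, _level in document.iterate_items():
        label = getattr(item, "label", "unknown")
        counts[str(label)] += 1
    return counts
```

With a real conversion result, this would be called as `count_element_labels(result.document)`; the exact label strings depend on the installed version.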

RAG-Ready: Ships examples for Retrieval Augmented Generation (RAG) workflows with LangChain integration (see the rag_langchain colab).
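
The sketch below is not the LangChain integration itself. It is a plain-Python stand-in showing one way exported Markdown could be split into heading-scoped chunks before being passed to a retriever. `split_markdown_by_heading` is a hypothetical helper, and the chunk format (a dict with `heading` and `text`) is an assumption chosen for illustration.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, skipping empty sections."""
    chunks = []
    heading = ""
    lines: list[str] = []

    def flush():
        # Only emit a chunk if the section has non-blank body text.
        if any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()
            heading = line.lstrip("#").strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks
```

Keeping the heading alongside each chunk preserves some of the document structure, which can help a retriever return passages with their context.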

Complete Pipeline: Handles everything from PDF input to structured output in one tool: OCR, layout analysis, element detection, and structuring.
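
As a rough end-to-end sketch, the function below converts a PDF to Markdown with default settings. It assumes the Docling 2.x entry points `DocumentConverter().convert(source)` and `result.document.export_to_markdown()`; `pdf_to_markdown` and its optional `converter` parameter are hypothetical, added so a preconfigured converter can be reused.

```python
def pdf_to_markdown(source: str, converter=None) -> str:
    """Convert a PDF (local path or URL) to Markdown using Docling."""
    if converter is None:
        # Imported lazily; assumes docling is installed when no converter is given.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(source)
    return result.document.export_to_markdown()
```

For example, `pdf_to_markdown("https://arxiv.org/pdf/2408.09869")` would be expected to fetch and convert the Docling paper itself. The first run may take a while because model weights are downloaded.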