6 Best Open Source Alternatives to Docling

6 open source alternatives · 100% OSI-approved licenses · Updated June 2026

Looking to replace Docling? These open source alternatives cover the same core workflow without the lock-in: they are free to use, often self-hostable, and auditable on GitHub. Compare licenses, languages, and project activity, then switch on your own terms.

1. PaddleOCR

82.2k · Apache-2.0 · Python · Self-host

PaddleOCR is an OCR toolkit and document AI engine for converting PDFs and images into structured data. It is aimed at extracting text, tables, formulas, charts, and layout from documents so the output can be used by LLM and RAG workflows.

  • Converts PDFs and images into Markdown or JSON
  • Document parsing for text, tables, formulas, and charts
  • PP-OCRv6 single model supports 50 languages
  • Browser inference with PaddleOCR.js

2. Tesseract

74.7k · Apache-2.0 · C++

Tesseract is an open source OCR engine with a command line program for turning images into text. It supports Unicode, more than 100 languages out of the box, and a traineddata-based workflow for adding language models. It also includes libtesseract for embedding OCR in other applications.

  • LSTM OCR engine with legacy mode support
  • Unicode support and more than 100 languages
  • Input images in PNG, JPEG, and TIFF
  • Outputs plain text, hOCR, PDF, TSV, ALTO, and PAGE
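
As a rough illustration of the command line workflow described above, the sketch below assembles a `tesseract` invocation for a given language and output format. The helper name `build_tesseract_command` and the format-to-config mapping are assumptions made for this example (only a subset of the listed output formats is covered), and the function only builds the argument list; it does not run Tesseract.

```python
# Hypothetical helper: builds (but does not run) a Tesseract CLI command.
# Assumed mapping from a friendly format name to the Tesseract config name
# passed after the output base. Plain text is the default, so it needs none.
OUTPUT_CONFIGS = {
    "text": None,
    "hocr": "hocr",
    "pdf": "pdf",
    "tsv": "tsv",
    "alto": "alto",
}


def build_tesseract_command(image, output_base, lang="eng", fmt="text"):
    """Return the argument list for `tesseract IMAGE OUTPUTBASE -l LANG [CONFIG]`."""
    if fmt not in OUTPUT_CONFIGS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    command = ["tesseract", str(image), str(output_base), "-l", lang]
    config = OUTPUT_CONFIGS[fmt]
    if config is not None:
        command.append(config)
    return command
```

With Tesseract and the relevant traineddata installed, the returned list could be passed to `subprocess.run(command, check=True)`. Tesseract appends the file extension to the output base itself, so `out` with the `hocr` config would be expected to produce `out.hocr`.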

3. MinerU

67.5k · Other · Python · Self-host

MinerU turns complex documents into structured Markdown or JSON for LLM, RAG, and agent workflows. It handles PDF, DOCX, PPTX, XLSX, images, and web pages, with support for scanned documents, handwriting, multi-column layouts, and cross-page table merging.

  • PDF, DOCX, PPTX, XLSX, images, and web pages to Markdown or JSON
  • VLM plus OCR dual engine with 109-language recognition
  • Formulas to LaTeX and tables to HTML
  • Scanned docs, handwriting, multi-column layouts, and cross-page table merging

4. Surya

20.8k · Apache-2.0 · Python · Self-host

Surya is a 650M parameter OCR model for document intelligence. It turns document images into text and structured page data, including detected text, bounding boxes, and page-level OCR results.

  • 650M parameter OCR model
  • Layout analysis with reading order
  • Table recognition for rows and columns
  • Multilingual OCR across 90+ languages

5. docTR

6.1k · Apache-2.0 · Python · Self-host

docTR is an optical character recognition library for parsing textual information from documents. It handles the core OCR workflow of locating words and identifying the characters in each word, with pretrained models for document analysis.

  • Two-stage OCR with text detection and text recognition
  • Pretrained models for document analysis
  • KIE predictor for structured predictions
  • PDF and image input support

6. Teedy

2.6k · GPL-2.0 · JavaScript · Self-host

Teedy is an open source document management system for individuals and businesses. It is built to store, organize, search, and share documents in one place, with support for common office files and video files.

  • Optical character recognition and full text search
  • Supports image, PDF, ODT, DOCX, PPTX, and video files
  • Workflow system with file versioning and tags
  • LDAP authentication, 2-factor authentication, and audit logs
