PaddleOCR

OCR toolkit for turning PDFs and images into structured, LLM-ready JSON and Markdown

Open Source Alternative to

Docling

Repository activity

Stars: 82.2k
Forks: 10.8k
Open Issues: 226

License

Apache-2.0

Languages

Python
C++
TypeScript

Get it: Website · GitHub

About PaddleOCR

PaddleOCR is an OCR toolkit and document AI engine for converting PDFs and images into structured data. It is aimed at extracting text, tables, formulas, charts, and layout from documents so the output can be used by LLM and RAG workflows.

It supports document parsing with PaddleOCR-VL and PP-StructureV3, producing Markdown or JSON. It also covers scene OCR and multilingual text recognition: PP-OCRv6 supports 50 languages in a single unified model, and the project states support for 100+ languages overall. Browser inference is available through PaddleOCR.js, and inference can run on Paddle static graph, Paddle dynamic graph, or Transformers backends.
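To make the "structured JSON to Markdown" idea concrete, here is a minimal sketch of how typed document blocks might be turned into Markdown chunks for an LLM or RAG pipeline. The block layout (`type` and `content` keys) and the helper names `table_to_markdown` and `blocks_to_markdown` are hypothetical and chosen only for illustration; this is not PaddleOCR's actual output schema or API, and it uses only the Python standard library.

```python
import json

# Hypothetical parsed-document layout: a list of blocks, each with a
# "type" and "content". This is NOT PaddleOCR's actual output schema.
SAMPLE_JSON = """
[
  {"type": "title", "content": "Quarterly Report"},
  {"type": "text", "content": "Revenue grew in Q2."},
  {"type": "table", "content": [["Quarter", "Revenue"], ["Q1", "10"], ["Q2", "12"]]},
  {"type": "formula", "content": "g = (12 - 10) / 10"}
]
"""


def table_to_markdown(rows):
    """Render a list of rows as a Markdown table; the first row is the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def blocks_to_markdown(blocks):
    """Convert typed blocks into a list of Markdown chunks, one per block."""
    chunks = []
    for block in blocks:
        kind, content = block["type"], block["content"]
        if kind == "title":
            chunks.append("# " + content)
        elif kind == "table":
            chunks.append(table_to_markdown(content))
        elif kind == "formula":
            chunks.append("$$" + content + "$$")
        else:  # plain text and any unrecognized block type
            chunks.append(str(content))
    return chunks


chunks = blocks_to_markdown(json.loads(SAMPLE_JSON))
markdown = "\n\n".join(chunks)
print(markdown)
```

Keeping one chunk per block is a deliberate choice in this sketch: a retrieval index can then embed a table or formula separately from the surrounding prose, while joining the chunks yields a single Markdown document.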

PaddleOCR is developed as part of the PaddlePaddle ecosystem and is used by projects such as Dify, RAGFlow, and Cherry Studio. Models are available on Hugging Face and ModelScope, and the software can be deployed on NVIDIA GPUs, Intel CPUs, Kunlunxin XPUs, and other AI accelerators.

Key features

Converts PDFs and images into Markdown or JSON
Document parsing for text, tables, formulas, and charts
A single PP-OCRv6 model supports 50 languages
Browser inference with PaddleOCR.js
Supports Paddle, Transformers, and multiple hardware backends

Details

First released: 2020
Platforms: Web · Docker · CLI
Deployment: Self-hostable · Docker
Languages: 100+ languages
Inference: Paddle · Transformers
Outputs: Markdown · JSON · DOCX