OCR toolkit for turning PDFs and images into structured, LLM-ready JSON and Markdown
Apache-2.0
- Python
- C++
- TypeScript
About PaddleOCR
PaddleOCR is an OCR toolkit and document AI engine for converting PDFs and images into structured data. It extracts text, tables, formulas, charts, and layout from documents so that the output can feed LLM and RAG workflows.
It supports document parsing with PaddleOCR-VL and PP-StructureV3, producing Markdown or JSON. It also covers scene OCR and multilingual text recognition: PP-OCRv6 supports 50 languages in a single model, and the project states support for 100+ languages overall. Browser inference is available through PaddleOCR.js, and inference can run on Paddle static graph, Paddle dynamic graph, or Transformers backends.
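As a rough illustration of the document-parsing path described above, the sketch below runs PP-StructureV3 on a file and then gathers the saved Markdown into one string for an LLM or RAG step. It assumes the `paddleocr` Python package (3.x) exposes `PPStructureV3` with `predict`, `save_to_json`, and `save_to_markdown` as in the project's published examples; check this against the installed version. The helper names `parse_document` and `collect_markdown`, the `output` directory, and the assumption that Markdown files land directly in that directory are illustrative and not part of PaddleOCR.

```python
from pathlib import Path


def parse_document(input_path: str, out_dir: str = "output") -> None:
    """Parse a PDF or image with PP-StructureV3 and save Markdown and JSON.

    Hypothetical wrapper: the class and method names are assumed to match
    the paddleocr 3.x Python API.
    """
    # Imported lazily so the rest of this module works without paddleocr.
    from paddleocr import PPStructureV3

    pipeline = PPStructureV3()
    # predict() is assumed to yield one result object per page or image.
    for res in pipeline.predict(input=input_path):
        res.save_to_json(save_path=out_dir)
        res.save_to_markdown(save_path=out_dir)


def collect_markdown(out_dir: str) -> str:
    """Join every .md file in out_dir into a single string.

    Files are read in lexicographic filename order, which is an assumption
    about how pages are named and may not match page order for long PDFs.
    """
    paths = sorted(Path(out_dir).glob("*.md"))
    return "\n\n".join(p.read_text(encoding="utf-8") for p in paths)
```

A caller might run `parse_document("report.pdf")` once and pass `collect_markdown("output")` to a chunker or prompt template. Keeping the parsing and collection steps separate means the downstream step can be exercised without loading any models.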
PaddleOCR is developed as part of the PaddlePaddle ecosystem and is used by projects such as Dify, RAGFlow, and Cherry Studio. Models are available on Hugging Face and ModelScope, and the software can be deployed on NVIDIA GPUs, Intel CPUs, Kunlunxin XPUs, and other AI accelerators.
Key features
- Converts PDFs and images into Markdown or JSON
- Document parsing for text, tables, formulas, and charts
- A single PP-OCRv6 model supports 50 languages
- Browser inference with PaddleOCR.js
- Supports Paddle and Transformers inference backends across multiple hardware targets
Details
- First released: 2020
- Platforms: Web · Docker · CLI
- Deployment: Self-hostable · Docker
- Languages: 100+ languages
- Inference: Paddle · Transformers
- Outputs: Markdown · JSON · DOCX
