Open source OCR engine and command line tool for extracting text from images
Apache-2.0
- C++
- CMake
- Java

About Tesseract
Tesseract is an open source OCR engine with a command line program for turning images into text. It supports Unicode, more than 100 languages out of the box, and a traineddata-based workflow for adding language models. It also includes libtesseract for embedding OCR in other applications.
It can use a neural net LSTM engine focused on line recognition, while still supporting the legacy OCR engine mode. Input formats include PNG, JPEG, and TIFF, and output formats include plain text, hOCR, PDF, invisible-text-only PDF, TSV, ALTO, and PAGE. The C and C++ APIs let developers build their own applications on top of the engine.
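The engine modes and output formats above map onto command line options. The following is a minimal sketch, assuming the `tesseract` executable is installed and on `PATH`. The helper names `build_command` and `run_ocr` are hypothetical, and the format table covers only a subset of the outputs listed above. Legacy mode may also need traineddata files that include the legacy model.

```python
import shutil
import subprocess

# Output format -> arguments appended after the output base.
# Plain text is the default output, so it needs no config name.
OUTPUT_CONFIGS = {
    "txt": [],
    "hocr": ["hocr"],
    "pdf": ["pdf"],
    "pdf-textonly": ["-c", "textonly_pdf=1", "pdf"],  # invisible-text-only PDF
    "tsv": ["tsv"],
    "alto": ["alto"],
}

# Values for --oem: 0 = legacy engine only, 1 = LSTM engine only.
OEM = {"legacy": 0, "lstm": 1}


def build_command(image, output_base, lang="eng", engine="lstm", fmt="txt"):
    """Return the argv list for one tesseract run (hypothetical helper)."""
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "--oem", str(OEM[engine]),
        *OUTPUT_CONFIGS[fmt],
    ]


def run_ocr(image, output_base, **kwargs):
    """Run tesseract if it is available; raise if the executable is missing."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    return subprocess.run(build_command(image, output_base, **kwargs), check=True)
```

Keeping command construction separate from execution makes the argument list easy to inspect before anything runs. For example, `run_ocr("scan.png", "out", fmt="hocr")` would ask Tesseract to write `out.hocr`.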
Tesseract is licensed under Apache 2.0, with its source code hosted on GitHub; the current stable major version is 5. It can be built from source or installed as a pre-built binary package. The core project does not include a GUI application, and the engine depends on the Leptonica library to read input images.
Key features
- LSTM OCR engine with legacy mode support
- Unicode support and more than 100 languages
- Input images in PNG, JPEG, and TIFF
- Outputs plain text, hOCR, PDF, TSV, ALTO, and PAGE
- C and C++ APIs through libtesseract
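Of the output formats listed, TSV is the simplest to post-process. The sketch below assumes the usual TSV column layout (`level`, `left`, `top`, `width`, `height`, `conf`, `text`, among others), where rows with `level` 5 describe individual words. The function name `words_from_tsv` and the `min_conf` parameter are hypothetical.

```python
import csv
import io


def words_from_tsv(tsv_text, min_conf=0.0):
    """Extract word text, confidence, and bounding box from Tesseract TSV output."""
    # QUOTE_NONE keeps quote characters in recognized text from confusing the parser.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        # Only level-5 rows are words; lower levels are pages, blocks,
        # paragraphs, and lines, which carry no text of their own.
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        if text and conf >= min_conf:
            box = (
                int(row["left"]),
                int(row["top"]),
                int(row["width"]),
                int(row["height"]),
            )
            words.append({"text": text, "conf": conf, "box": box})
    return words
```

A caller might read a file produced with the `tsv` output option and keep only confident words, for example `words_from_tsv(open("out.tsv", encoding="utf-8").read(), min_conf=60)`.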
Details
- First released: 2014
- Platforms: Windows · macOS · Linux
- Deployment: offline-first
- License: Apache 2.0
- Engine: LSTM and legacy OCR
- Languages: More than 100
