Tesseract

Open source OCR engine and command line tool for extracting text from images

Open Source Alternative to

Docling

Repository activity

Stars: 74.7k
Forks: 10.7k
Open Issues: 481


License

Apache-2.0

Languages

C++
CMake
Java

Get it: Website · GitHub

About Tesseract

Tesseract is an open source OCR engine with a command line program for turning images into text. It supports Unicode, more than 100 languages out of the box, and a traineddata-based workflow for adding language models. It also includes libtesseract for embedding OCR in other applications.
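As a rough illustration of the command line workflow described above, the sketch below builds and runs a basic `tesseract` invocation from Python. The helper names `build_ocr_command` and `run_ocr` are hypothetical, and the example assumes the `tesseract` binary is on the PATH and that the requested language models are installed. Multiple language codes are joined with `+` for the `-l` option.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(image, output_base, languages=("eng",)):
    """Build the argv list for a basic Tesseract run (hypothetical helper).

    Multiple language codes are joined with '+', which is how the
    tesseract CLI accepts more than one language for the -l option.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]


def run_ocr(image, output_base, languages=("eng",)):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract executable is installed and on PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_ocr_command(image, output_base, languages), check=True)
    # For plain-text output, Tesseract appends ".txt" to the output base.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For example, `run_ocr("scan.png", "scan_out", ("eng", "deu"))` would, under these assumptions, write `scan_out.txt` and return its contents.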

It can use a neural net LSTM engine focused on line recognition, while still supporting the legacy OCR engine mode. Input formats include PNG, JPEG, and TIFF, and output formats include plain text, hOCR, PDF, invisible-text-only PDF, TSV, ALTO, and PAGE. The C and C++ APIs let developers build their own applications on top of the engine.
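The engine modes and output formats listed above map onto command line options. The following sketch shows one way to assemble such a command; `build_format_command` and the two lookup tables are hypothetical names introduced for illustration. It assumes the `--oem` values 0 to 3, the standard config names (`txt`, `hocr`, `pdf`, `tsv`, `alto`, `page`), and the `textonly_pdf` variable for invisible-text-only PDF. The `page` config may only be present in recent 5.x builds, and the legacy engine needs traineddata files that include legacy models.

```python
# Engine modes accepted by the --oem option.
ENGINE_MODES = {"legacy": 0, "lstm": 1, "legacy+lstm": 2, "default": 3}

# Config names given after the output base select the output formats.
# Assumption: "page" is only available in recent 5.x installations.
OUTPUT_CONFIGS = {
    "text": "txt",
    "hocr": "hocr",
    "pdf": "pdf",
    "tsv": "tsv",
    "alto": "alto",
    "page": "page",
}


def build_format_command(image, output_base, engine="lstm",
                         outputs=("text",), text_only_pdf=False):
    """Build a tesseract argv list for a given engine mode and output formats."""
    if text_only_pdf and "pdf" not in outputs:
        raise ValueError("text_only_pdf requires 'pdf' in outputs")
    cmd = ["tesseract", str(image), str(output_base),
           "--oem", str(ENGINE_MODES[engine])]
    if text_only_pdf:
        # Produces a PDF that contains only the invisible text layer.
        cmd += ["-c", "textonly_pdf=1"]
    # Options come first, config names last; one output file per config.
    cmd += [OUTPUT_CONFIGS[name] for name in outputs]
    return cmd
```

Passing several names in `outputs` requests several output files from a single recognition pass, which avoids running OCR once per format.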

Tesseract is licensed under Apache 2.0, its source code is hosted on GitHub, and the current stable major version is 5. It can be built from source or installed as a pre-built binary package. The core project does not include a GUI application, and the engine depends on Leptonica for image input.
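After installing a package or building from source, it can be useful to confirm which major version is present. The sketch below parses the first line of `tesseract --version` output; `parse_major_version` and `installed_major_version` are hypothetical helpers, and the expected line shape (`tesseract 5.x.y`, optionally with a leading `v` before the number) is an assumption based on common builds.

```python
import re
import subprocess


def parse_major_version(version_output):
    """Extract the major version from `tesseract --version` output."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    # Assumed shape: "tesseract 5.3.4" or "tesseract v5.0.0-alpha...".
    match = re.match(r"tesseract\s+v?(\d+)", first_line)
    if match is None:
        raise ValueError(f"unrecognized version output: {first_line!r}")
    return int(match.group(1))


def installed_major_version():
    """Run the installed binary and return its major version.

    Assumes the tesseract executable is on PATH. Very old releases are
    reported to print the version to stderr, so both streams are checked.
    """
    result = subprocess.run(["tesseract", "--version"],
                            capture_output=True, text=True, check=True)
    return parse_major_version(result.stdout or result.stderr)
```

A script that relies on version 5 behavior could compare `installed_major_version()` against 5 before running any OCR.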

Key features

LSTM OCR engine with legacy mode support
Unicode support and more than 100 languages
Input images in PNG, JPEG, and TIFF
Outputs plain text, hOCR, PDF, TSV, ALTO, and PAGE
C and C++ APIs through libtesseract

Details

First released: 2005 as open source (hosted on GitHub since 2014)
Platforms: Windows · macOS · Linux
Deployment: offline-first
License: Apache 2.0
Engine: LSTM and legacy OCR
Languages: More than 100