OCRmyPDF

Command-line OCR for scanned PDFs that adds a searchable text layer and keeps the original images intact

Repository activity

Stars: 33.9k
Forks: 2.3k
Open Issues: 107

License

MPL-2.0

Languages

Python
Shell
Dockerfile

Get it: Website · PyPI

About OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files so that their text can be searched or copied. It turns image-only PDFs into searchable PDF/A files while keeping the exact resolution of embedded images, which helps preserve the original document layout and content.

It uses Tesseract OCR for recognition, supports more than 100 languages, and can deskew or clean pages when requested. It validates input and output files, distributes work across CPU cores, and can insert OCR information as a lossless operation without disrupting other content.
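As a rough illustration of how these options map onto the command line, the sketch below assembles an `ocrmypdf` invocation from Python. The helper name `build_ocrmypdf_command` and the file names `scan.pdf` and `scan_ocr.pdf` are hypothetical. The flags used (`-l`, `--deskew`, `--clean`, `--jobs`, `--output-type`) are ones the OCRmyPDF CLI documents, but it is worth confirming them against `ocrmypdf --help` for the installed version.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng",),
                           deskew=False, clean=False, jobs=None):
    """Return the argument list for an ocrmypdf run (hypothetical helper)."""
    cmd = ["ocrmypdf", "--output-type", "pdfa"]
    # Tesseract language codes are joined with "+", for example "eng+deu".
    cmd += ["-l", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")
    if clean:
        # Page cleaning relies on the optional external program unpaper.
        cmd.append("--clean")
    if jobs is not None:
        # Caps how many CPU cores are used in parallel.
        cmd += ["--jobs", str(jobs)]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


if __name__ == "__main__":
    cmd = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf",
                                 languages=("eng", "deu"), deskew=True, jobs=4)
    print(" ".join(cmd))
    # Run only when ocrmypdf is installed and the placeholder input exists.
    if shutil.which("ocrmypdf") and Path("scan.pdf").exists():
        subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the example usable on machines where OCRmyPDF is not installed, and makes the chosen options easy to inspect before any file is touched.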

OCRmyPDF runs on Linux, Windows, macOS, and FreeBSD, with Docker images available for x64 and ARM. It is written in pure Python and licensed under MPL-2.0, and it depends on two external programs, Ghostscript and Tesseract OCR. It also provides a plugin interface and is integrated into paperless-ngx.
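Because Ghostscript and Tesseract OCR are separate programs, a quick pre-flight check can confirm they are on the `PATH` before running OCRmyPDF. The following is a minimal sketch, not part of OCRmyPDF itself: the helper `find_missing_tools` is hypothetical, and the executable names are assumptions (`tesseract` for Tesseract, `gs` on Unix-like systems, `gswin64c` or `gswin32c` for Ghostscript on Windows).

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = {
    "Tesseract OCR": ("tesseract",),
    "Ghostscript": ("gs", "gswin64c", "gswin32c"),
}


def find_missing_tools(which=shutil.which):
    """Return the labels of required tools with no executable on PATH."""
    missing = []
    for label, candidates in REQUIRED_TOOLS.items():
        # A tool counts as present if any candidate name resolves.
        if not any(which(name) for name in candidates):
            missing.append(label)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("Tesseract OCR and Ghostscript were both found.")
```

The `which` parameter defaults to `shutil.which` and exists only so that the lookup can be swapped out, for example when testing the function without the tools installed.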

Key features

Adds a searchable text layer to scanned PDFs
Creates searchable PDF/A files
Keeps embedded image resolution intact
Deskews and cleans pages when requested
Uses Tesseract OCR with more than 100 languages

Details

First released: 2013
Platforms: Windows · macOS · Linux · FreeBSD
Deployment: Self-hostable · Docker
Input: Scanned PDF files
Output: Searchable PDF/A
OCR engine: Tesseract OCR