Command-line OCR for scanned PDFs that adds a searchable text layer and keeps the original images intact
- License: MPL-2.0
- Languages: Python · Shell · Dockerfile

About OCRmyPDF
OCRmyPDF adds an OCR text layer to scanned PDF files so that their text can be searched and copied. It turns regular PDFs into searchable PDF/A files while keeping the exact resolution of embedded images, which helps preserve the original document layout and content.
It uses Tesseract OCR for recognition, supports more than 100 languages, and can deskew or clean pages when requested. It validates input and output files, distributes work across CPU cores, and can insert the OCR text layer losslessly, without disrupting other content.
OCRmyPDF runs on Linux, Windows, macOS, and FreeBSD, with Docker images available for x64 and ARM. It is written in pure Python, is licensed under MPL-2.0, and depends on Ghostscript and Tesseract OCR. It also provides a plugin interface and is integrated into paperless-ngx.
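As a rough illustration of how the options described above map onto a command line, the sketch below assembles an `ocrmypdf` invocation in Python. The helper name `build_ocrmypdf_command` and the file names are hypothetical and exist only for this example. The flags (`-l`, `--deskew`, `--clean`, `--jobs`, `--output-type`) are those of the `ocrmypdf` CLI as generally documented, so check them against the version you have installed.

```python
import shlex
from typing import List, Optional, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = ("eng",),
    deskew: bool = False,
    clean: bool = False,
    jobs: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the ocrmypdf CLI.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    # Ask for PDF/A output explicitly.
    cmd = ["ocrmypdf", "--output-type", "pdfa"]
    # Multiple Tesseract language codes are joined with "+", e.g. "eng+deu".
    cmd += ["-l", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")  # straighten skewed pages
    if clean:
        cmd.append("--clean")  # clean pages before OCR
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]  # cap the number of parallel workers
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    # Placeholder file names; nothing is executed here, only printed.
    command = build_ocrmypdf_command(
        "scan.pdf", "scan_ocr.pdf", languages=["eng", "deu"], deskew=True, jobs=4
    )
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where OCRmyPDF, Tesseract, and Ghostscript are installed. Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.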
Key features
- Adds a searchable text layer to scanned PDFs
- Creates searchable PDF/A files
- Keeps embedded image resolution intact
- Deskews and cleans pages when requested
- Uses Tesseract OCR with more than 100 languages
Details
- First released: 2013
- Platforms: Windows · macOS · Linux
- Deployment: Self-hostable · Docker
- Input: Scanned PDF files
- Output: Searchable PDF/A
- OCR engine: Tesseract OCR
