OCRmyPDF

Command-line OCR for scanned PDFs that adds a searchable text layer and keeps the original images intact

Repository activity
  • Stars: 33.9k
  • Forks: 2.3k
  • Open Issues: 107
License

MPL-2.0

Languages
  • Python
  • Shell
  • Dockerfile

About OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files so they can be searched or copy-pasted. It turns regular PDFs into searchable PDF/A files while keeping the exact resolution of embedded images, which helps preserve the original document layout and content.

It uses Tesseract OCR for recognition, supports more than 100 languages, and can deskew or clean pages when requested. It validates input and output files, distributes work across CPU cores, and can insert OCR information as a lossless operation without disrupting other content.
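As a rough illustration of the options described above, the sketch below assembles an `ocrmypdf` command line in Python. The helper `build_ocrmypdf_command` and the file names are hypothetical and exist only for this example. The flags it emits (`-l`, `--deskew`, `--clean`, `--jobs`) are ones the OCRmyPDF command line documents, but it is worth confirming them against `ocrmypdf --help` for the installed version.

```python
from typing import List, Optional, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = ("eng",),
    deskew: bool = False,
    clean: bool = False,
    jobs: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the ocrmypdf CLI (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language code is required")

    # Tesseract language codes are combined with '+', e.g. "eng+deu".
    cmd = ["ocrmypdf", "-l", "+".join(languages)]

    if deskew:
        cmd.append("--deskew")  # straighten pages before OCR
    if clean:
        cmd.append("--clean")  # clean pages before OCR (needs the unpaper tool)
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]  # cap the number of CPU cores used

    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    command = build_ocrmypdf_command(
        "scan.pdf", "scan_ocr.pdf", languages=["eng", "deu"], deskew=True, jobs=4
    )
    print(" ".join(command))
    # To actually run OCR, pass `command` to subprocess.run(command, check=True)
    # on a machine where ocrmypdf and its dependencies are installed.
```

Building the argument list separately from running it keeps the example usable without OCRmyPDF installed and makes the chosen options easy to inspect before any file is processed.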

OCRmyPDF runs on Linux, Windows, macOS, and FreeBSD, with Docker images available for x64 and ARM. It is written in pure Python, licensed under MPL-2.0, and depends on Ghostscript and Tesseract OCR as external programs. It also provides a plugin interface and is integrated into paperless-ngx.
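Because Ghostscript and Tesseract are external programs, a quick check that both are on the `PATH` can save a failed run. The following is a minimal sketch, not part of OCRmyPDF itself: the helper `find_missing_tools` is hypothetical, and the executable names `tesseract` and `gs` are assumptions that hold on typical Linux and macOS installs (Windows Ghostscript builds usually ship a differently named binary such as `gswin64c`).

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names; adjust for your platform (see note above).
DEFAULT_TOOLS = ("tesseract", "gs")


def find_missing_tools(
    tools: Iterable[str] = DEFAULT_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of tools that cannot be located on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external dependencies:", ", ".join(missing))
    else:
        print("Tesseract and Ghostscript were found on PATH.")
```

The `which` parameter is injectable only so the lookup can be exercised without touching the real system; in normal use the default `shutil.which` is enough.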

Key features

  • Adds a searchable text layer to scanned PDFs
  • Creates searchable PDF/A files
  • Keeps embedded image resolution intact
  • Deskews and cleans pages when requested
  • Uses Tesseract OCR with more than 100 languages

Details

First released
2013
Platforms
Windows · macOS · Linux · FreeBSD
Deployment
Self-hostable · Docker
Input
Scanned PDF files
Output
Searchable PDF/A
OCR engine
Tesseract OCR