Open Source OCR
OCR is rarely run on harmless text. It turns invoices, contracts, passports, and medical records into searchable data, so the choice of tool depends on where the page is read, not just on how accurately it is read. The open source tools here run recognition locally, on your own hardware, which keeps confidential pages off third-party servers and lets you process thousands of documents without a per-page API bill.

PaddleOCR
OCR toolkit for turning PDFs and images into structured, LLM-ready JSON and Markdown

Tesseract
Open source OCR engine and command line tool for extracting text from images

MinerU
High-accuracy document parsing for PDFs, Office files, images, and web pages into Markdown or JSON

Paperless-ngx
Self-hosted document management system for scanning, indexing, and archiving paper files

ShareX
Free and open source Windows app for screen capture, recording, and file sharing

OCRmyPDF
Command-line OCR for scanned PDFs that adds a searchable text layer and keeps the original images intact

EasyOCR
Ready-to-use OCR for 80+ languages and major writing scripts

Surya
OCR and document analysis model for 90+ languages with layout, reading order, and tables

docTR
OCR library for document text detection and recognition with PyTorch and TensorFlow 2
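
To make the "runs locally" point concrete, here is a minimal sketch of batch OCR with the Tesseract command line tool driven from Python. It assumes the `tesseract` binary is installed and on your PATH and that the scans are PNG files. The function names and the folder name are hypothetical and chosen for illustration only. Nothing in it makes a network call.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list that OCRs one image and prints text to stdout."""
    # `tesseract <image> stdout -l <lang>` writes the recognized text to
    # standard output instead of creating an output file.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_folder(folder, lang="eng"):
    """Run Tesseract on every PNG in `folder` and return {filename: text}.

    Hypothetical helper: assumes PNG input and a locally installed tesseract.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    results = {}
    for image in sorted(Path(folder).glob("*.png")):
        completed = subprocess.run(
            build_tesseract_command(image, lang),
            capture_output=True,
            text=True,
            check=True,  # raise if tesseract exits with an error
        )
        results[image.name] = completed.stdout
    return results
```

Calling `ocr_folder("scans")` on a hypothetical `scans` directory would return the recognized text keyed by file name. Because each page is handled by a local process, the cost of a large batch is compute time rather than a per-page fee.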