PDF OCR

Turn scanned PDFs into searchable, copyable, indexable PDFs. Free, private, runs in your browser.

100% private — your files and text never leave your browser. All processing happens locally on your device.

PDF · Up to 100 MB

Upload a scanned PDF — we'll run OCR in your browser. Keep it as a searchable PDF, or export the text only.

You might also need

PDF to Text: Extract text content from PDF files

Scan to PDF: Scan documents with your camera into a PDF

Compress PDF: Reduce PDF file size

PDF to Images: Convert PDF pages to JPG or PNG

What Is a Scanned PDF?

A scanned PDF is a PDF produced by photographing or scanning a paper document. The pages look like text, but under the hood they are just images: you can read the words, but the PDF has no record of which characters they are. You can't select text, copy and paste a sentence, search for a phrase, or have a screen reader announce the words. For any serious use, such as finding a specific clause, quoting from a contract, or making archived correspondence findable through Spotlight or Windows Search, a scanned PDF is effectively opaque.

What OCR Does

OCR (optical character recognition) analyses the image of each page and works out which characters it contains. The output of this tool is a PDF that looks exactly like the original (same layout, same images) but with a hidden text layer placed over each page at the positions where the OCR engine recognised words. The hidden layer is invisible to the eye but can be selected, searched, and indexed by PDF viewers and search tools. When you select a sentence, you are grabbing the OCR-recognised words; when you search the PDF, you are searching those words. The visual output is unchanged.
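To make the positioning step concrete, here is a minimal sketch of how a recognised word's bounding box might be mapped onto the page. It assumes the OCR engine reports boxes in image pixels with the origin at the top-left, while PDF page space is measured in points (1/72 inch) with the origin at the bottom-left. The names `PixelBox`, `PdfRect`, and `toPdfRect` are illustrative and are not taken from this tool's code.

```typescript
// Bounding box in image pixels, origin at the top-left corner (assumed OCR output shape).
interface PixelBox { x0: number; y0: number; x1: number; y1: number; }

// Rectangle in PDF points, origin at the bottom-left corner of the page.
interface PdfRect { x: number; y: number; width: number; height: number; }

function toPdfRect(
  box: PixelBox,
  imageWidthPx: number,
  imageHeightPx: number,
  pageWidthPt: number,
  pageHeightPt: number,
): PdfRect {
  // Scale factors from image pixels to PDF points.
  const scaleX = pageWidthPt / imageWidthPx;
  const scaleY = pageHeightPt / imageHeightPx;
  return {
    x: box.x0 * scaleX,
    // Flip the vertical axis: the bottom edge of the box (y1) becomes the PDF y coordinate.
    y: pageHeightPt - box.y1 * scaleY,
    width: (box.x1 - box.x0) * scaleX,
    height: (box.y1 - box.y0) * scaleY,
  };
}

// A US Letter page (612 x 792 pt) rendered at 300 DPI is 2550 x 3300 px.
const rect = toPdfRect({ x0: 300, y0: 300, x1: 600, y1: 375 }, 2550, 3300, 612, 792);
console.log(rect); // approximately x = 72, y = 702, width = 72, height = 18
```

The invisible text for each word would then be drawn inside the resulting rectangle, which is why a selection in the output PDF lines up with the printed word underneath it.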

How Good Is the Recognition?

We use Tesseract 5 with the eng_best language model, which reaches roughly 95% word accuracy on clean printed documents scanned at 300 DPI or higher. Messy scans (low resolution, tilted pages, poor lighting, heavy JPEG compression) will lower that figure. Handwriting is not supported: Tesseract recognises printed text, not handwriting. For best results, rescan poor-quality originals at 300 DPI or higher before running OCR. If rescanning isn't possible, run the scan through our image-enhancement tools first, or retake the photos in better lighting.
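If you are unsure whether a scan meets the 300 DPI guideline, the effective resolution can be estimated from the page image's pixel width and the PDF page width. The sketch below assumes the scanned image fills the whole page; `effectiveDpi` and `meetsRecommendedDpi` are hypothetical helper names, not part of this tool.

```typescript
// PDF page sizes are expressed in points; 72 points equal one inch.
const POINTS_PER_INCH = 72;

// Effective resolution of a page image that fills a page of the given width.
function effectiveDpi(imageWidthPx: number, pageWidthPt: number): number {
  if (imageWidthPx <= 0 || pageWidthPt <= 0) {
    throw new RangeError("image width and page width must be positive");
  }
  return imageWidthPx / (pageWidthPt / POINTS_PER_INCH);
}

// The 300 DPI default mirrors the guideline above; treat it as a rule of thumb.
function meetsRecommendedDpi(
  imageWidthPx: number,
  pageWidthPt: number,
  minDpi = 300,
): boolean {
  return effectiveDpi(imageWidthPx, pageWidthPt) >= minDpi;
}

// US Letter is 612 pt (8.5 in) wide.
console.log(effectiveDpi(2550, 612));        // 300
console.log(meetsRecommendedDpi(1275, 612)); // false (150 DPI)
```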

Does It Change the File?

The visible content is untouched. We copy the original pages verbatim into the output PDF, then overlay the invisible text layer. Vectors, embedded fonts, page sizes, and metadata are preserved. The output file will be slightly larger than the input — that's the OCR text layer on each page. Expect 5–15% growth depending on page density and text quantity.
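As a quick worked example of that growth figure, the sketch below applies the 5 to 15% range to an input size. The function name `estimateOutputSizeBytes` is illustrative, and the range is only the rough expectation stated above, not a guarantee.

```typescript
// Rough output-size range, assuming the 5 to 15% growth quoted above.
function estimateOutputSizeBytes(inputBytes: number): { low: number; high: number } {
  return {
    low: Math.round((inputBytes * 105) / 100),
    high: Math.round((inputBytes * 115) / 100),
  };
}

// A 10,000,000-byte scan would be expected to land between 10.5 and 11.5 million bytes.
console.log(estimateOutputSizeBytes(10_000_000)); // { low: 10500000, high: 11500000 }
```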

Is It Private?

Yes. Tesseract runs in your browser via WebAssembly. Your PDF is never uploaded. The only network fetch is the English language model (~10 MB, downloaded once and cached) from a public CDN (tessdata.projectnaptha.com); that download is the model itself, not your document. If you're on a restricted network that blocks the CDN, OCR silently falls back to the smaller default English model that is bundled with tesseract.js.
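The fallback behaviour described above can be pictured with a small sketch. This is not the tool's actual code and does not use the tesseract.js API: `ModelLoader`, `LoadedModel`, and `loadModelWithFallback` are hypothetical names, and the two loaders are stand-ins for "fetch from the CDN" and "read the bundled model".

```typescript
// A loader resolves to the raw bytes of a trained-data file.
type ModelLoader = () => Promise<Uint8Array>;

interface LoadedModel {
  source: "cdn" | "bundled";
  data: Uint8Array;
}

// Try the CDN first; on any failure, fall back to the bundled model.
async function loadModelWithFallback(
  fromCdn: ModelLoader,
  fromBundle: ModelLoader,
): Promise<LoadedModel> {
  try {
    return { source: "cdn", data: await fromCdn() };
  } catch {
    // Silent fallback mirrors the behaviour described above; a real app might log this.
    return { source: "bundled", data: await fromBundle() };
  }
}

// Usage with stand-in loaders: a blocked network is simulated by a rejected promise.
const blockedCdn: ModelLoader = () => Promise.reject(new Error("CDN blocked"));
const bundledModel: ModelLoader = () => Promise.resolve(new Uint8Array([1, 2, 3]));

loadModelWithFallback(blockedCdn, bundledModel).then((model) => {
  console.log(model.source); // "bundled"
});
```

Note that in both branches only model bytes are requested; nothing from the document is passed to either loader.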

FAQ

What does OCR do?

OCR (optical character recognition) reads the image on each page and figures out what the letters are. The output PDF looks identical to your input but the text is now selectable, copyable, and indexable by search tools like Spotlight, Everything, or Windows Search.

How accurate is it?

Tesseract with the English model (we ship eng_best) reaches roughly 95% word accuracy on clean printed documents scanned at 300 DPI. Low-resolution scans and tilted pages will lower that figure significantly, and handwriting is not supported.

Can I OCR non-English PDFs?

Right now only English is bundled. If you need another language, open a feedback issue; we can add German, French, Spanish, and other languages on request.

Is this private?

Yes. Tesseract runs in your browser via WebAssembly. Your PDF never leaves your device. The English language model is downloaded once from a public CDN (tessdata.projectnaptha.com) and cached.