How to OCR a PDF file and get the text stored within the PDF?

The best and easiest way I have found is pypdfocr, a Python module, because it leaves the original PDF untouched.

pypdfocr your_document.pdf

When it finishes you will have a second file, your_document_ocr.pdf, with searchable text. The tool does not change the image quality; it only increases the file size slightly by adding the text overlay.

I think the command is simple enough that it doesn't need a GUI. Installing pypdfocr takes a little more typing:

sudo dnf -y install tesseract 
pip install pypdfocr 

Update, 3 November 2018:

pypdfocr has not been supported since 2016, and I have noticed some problems because it is no longer maintained. ocrmypdf (module) does a similar job and can be used like this:

ocrmypdf in.pdf out.pdf

To install:

pip install ocrmypdf

or

sudo apt install ocrmypdf      # Ubuntu
sudo dnf -y install ocrmypdf   # Fedora
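
If you have a whole folder of scans, the ocrmypdf command above can be wrapped in a short Python script. This is only a sketch: the _ocr output suffix and the helper names (ocr_output_path, build_command, ocr_directory) are my own choices for illustration, not part of ocrmypdf, and it assumes the ocrmypdf executable is on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_output_path(pdf):
    """Map scan.pdf to scan_ocr.pdf (the suffix is an arbitrary choice)."""
    pdf = Path(pdf)
    return pdf.with_name(pdf.stem + "_ocr" + pdf.suffix)


def build_command(pdf):
    """Build the same `ocrmypdf in.pdf out.pdf` call as on the command line."""
    return ["ocrmypdf", str(Path(pdf)), str(ocr_output_path(pdf))]


def ocr_directory(folder):
    """Run ocrmypdf on every PDF in `folder` that has no *_ocr.pdf yet."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Skip our own output files and inputs that were already processed.
        if pdf.stem.endswith("_ocr") or ocr_output_path(pdf).exists():
            continue
        subprocess.run(build_command(pdf), check=True)
        written.append(ocr_output_path(pdf))
    return written


if __name__ == "__main__":
    for out in ocr_directory("."):
        print(out)
```

Run it from the folder that holds the PDFs. The originals are left alone and each result is written next to its source file.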

After learning that Tesseract can now also produce searchable PDFs, I found the script pdfsandwich: http://www.tobias-elze.de/pdfsandwich/

After installing the dependencies (this might not be the complete list):

sudo dnf install svn ocaml unpaper tesseract

I followed the script's guide for compiling from source:

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:

svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

cd pdfsandwich
./configure
make
sudo make install

This now allows me to run

pdfsandwich multipaged-non-searchable.pdf

resulting in a searchable PDF.
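
To actually get the text out of the result, or just to check that the OCR layer is there, one option is pdftotext. The sketch below assumes pdftotext (from poppler-utils, which is not mentioned above) is installed; the function names and the 20-character threshold are made up for illustration.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Command that dumps a PDF's text layer; the final "-" means stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text(pdf_path):
    """Return the text layer of the PDF as a string (needs pdftotext on PATH)."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def looks_searchable(text, min_chars=20):
    """Rough heuristic: a scan without OCR yields little or no real text."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


if __name__ == "__main__":
    import sys

    text = extract_text(sys.argv[1])
    print("searchable" if looks_searchable(text) else "no text layer found")
```

Pass it the PDF produced by the OCR step. A scan that was never OCRed will normally report no text layer.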

pdfsandwich is also packaged in a number of repositories (e.g., Debian Stable, AUR, Homebrew).


An easy tool available in Ubuntu is 'ocrfeeder'. It generates PDFs with the OCR text overlaid on the original documents. It uses Tesseract plus other OCR engines (I'm not sure which) and also provides image rotation, 'unpaper' cleanup, and so on.

  • http://live.gnome.org/OCRFeeder
  • https://github.com/GNOME/ocrfeeder