How do I convert a scanned PDF into a PDF with text?

gImageReader is a simple GTK+ front-end to tesseract-ocr.

 sudo apt-get install gimagereader tesseract-ocr


You can try pdfocr:

 sudo add-apt-repository ppa:gezakovacs/pdfocr
 sudo apt-get update
 sudo apt-get install pdfocr

To run it, the syntax is:

 pdfocr -i input.pdf -o output.pdf

where input.pdf is the name of the input file and output.pdf the output file.
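If you have a whole folder of scans, the same call can be wrapped in a loop. This is only a rough sketch: the function name ocr_dir, the OCR_CMD variable and the "-ocr" output suffix are my own choices for illustration, not something pdfocr provides. It assumes pdfocr accepts the -i/-o options exactly as shown above.

```shell
#!/bin/sh
# Sketch: run pdfocr on every PDF in a directory.
# OCR_CMD and the "-ocr" suffix are illustrative, not part of pdfocr itself.
OCR_CMD="${OCR_CMD:-pdfocr}"

ocr_dir() {
    dir="${1:-.}"
    for in_file in "$dir"/*.pdf; do
        # Skip the literal pattern when no PDF matched.
        [ -e "$in_file" ] || continue
        # Skip files produced by an earlier run.
        case "$in_file" in *-ocr.pdf) continue ;; esac
        out_file="${in_file%.pdf}-ocr.pdf"
        "$OCR_CMD" -i "$in_file" -o "$out_file" || echo "failed: $in_file" >&2
    done
}

# Usage: ocr_dir ~/scans
```

Each input keeps its name and the result is written next to it, so scan.pdf becomes scan-ocr.pdf and the original is left untouched.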

By default pdfocr uses Tesseract as its OCR engine. To install it:

 sudo apt-get install tesseract-ocr

pdfocr creates an embedded text layer.


It pulls in tesseract and other dependencies on install. It is an easy one-step solution and can be scripted. It can use hocr2pdf to create a plain-text PDF, but that is not ready for prime time yet. The default uses tesseract and creates a "sandwiched" PDF: the page image with the text underneath.

The embedded image can be removed with commands like:

 gs -o ocr_noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE ocr_image.pdf

but the text layer is hidden, so the result looks like a blank page.
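To convince yourself that the text really is still in the file even though the page looks empty, you can try extracting it. A minimal sketch, assuming pdftotext from the poppler-utils package is installed (it is not mentioned above, it is just one tool that can dump the text of a PDF). The helper name has_text_layer and the PDFTOTEXT variable are made up for this example.

```shell
#!/bin/sh
# Sketch: check whether a PDF still carries extractable text.
# Assumes pdftotext (poppler-utils); PDFTOTEXT is an illustrative override.
PDFTOTEXT="${PDFTOTEXT:-pdftotext}"

has_text_layer() {
    # Write the extracted text to stdout ("-") and succeed if it
    # contains at least one non-whitespace character.
    "$PDFTOTEXT" "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Usage:
# has_text_layer ocr_noIMG.pdf && echo "text layer present"
```

If the check succeeds on ocr_noIMG.pdf, the OCR text survived the image removal and is simply not being drawn visibly.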

Loading the PDF into LibreOffice Draw exposes the text and the image can be deleted manually.

