How do I convert a scanned PDF into a PDF with text?

gImageReader is a simple GTK+ front-end to tesseract-ocr.

sudo apt-get install gimagereader tesseract-ocr



You can try pdfocr:

 sudo add-apt-repository ppa:gezakovacs/pdfocr
 sudo apt-get update
 sudo apt-get install pdfocr

To run it, the syntax is:

 pdfocr -i input.pdf -o output.pdf

where input.pdf is the name of the input file and output.pdf the output file.

By default, pdfocr uses Tesseract as its OCR engine. To install it:

 sudo apt-get install tesseract-ocr

pdfocr creates an embedded text layer.
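
One way to confirm that the text layer is really there is to try extracting it with pdftotext. This is only a minimal sketch: it assumes pdftotext (from the poppler-utils package) is installed and that output.pdf is the file produced above. count_text_chars is just a helper name made up for this example.

```shell
#!/bin/sh
# Count the non-whitespace characters arriving on stdin.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Assumption: pdftotext (poppler-utils) is installed and output.pdf exists.
if command -v pdftotext >/dev/null 2>&1 && [ -f output.pdf ]; then
    # "-" makes pdftotext write the extracted text to stdout.
    n=$(pdftotext output.pdf - | count_text_chars)
    if [ "$n" -gt 0 ]; then
        echo "text layer found ($n characters)"
    else
        echo "no extractable text"
    fi
fi
```

If the count is zero, the OCR step probably did not add any text to the file.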


pdfsandwich

Installing it pulls in tesseract and its other dependencies. It's an easy one-step solution and can be scripted. It can use hocr2pdf to create a plain-text PDF, but that mode is not ready for prime time yet. By default it uses tesseract and creates a "sandwiched" PDF: the page image with a text layer underneath.
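
As a rough sketch of what "can be scripted" might look like, the loop below runs pdfsandwich over every PDF in the current directory. The function names ocr_name and ocr_all are made up for this example, and the `_ocr.pdf` suffix is just the naming chosen here (passed explicitly with -o); adjust to taste.

```shell
#!/bin/sh
# Derive the output name: scan.pdf -> scan_ocr.pdf
ocr_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# OCR every PDF in the current directory (hypothetical helper).
ocr_all() {
    for f in ./*.pdf; do
        [ -f "$f" ] || continue                    # glob matched nothing
        case "$f" in *_ocr.pdf) continue ;; esac   # skip earlier results
        pdfsandwich -o "$(ocr_name "$f")" "$f"
    done
}

# Only run when pdfsandwich is actually installed.
if command -v pdfsandwich >/dev/null 2>&1; then
    ocr_all
fi
```

Skipping files that already end in `_ocr.pdf` keeps a second run from OCRing its own output.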

The embedded image can be removed with commands like:

gs -o ocr_noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE ocr_image.pdf

but the text layer is invisible, so the result looks like a blank page.

Loading the PDF into LibreOffice Draw exposes the text, and the image can then be deleted manually.
