Is there a better pdf to text converter than pdftotext?

If you are using pdftotext, you can use the -layout flag to preserve the layout of the text on the pages of your input PDF file:

pdftotext -layout input.pdf output.txt

You could try ebook-convert from Calibre.

If anything, I'd say it errs in the other direction: too many line breaks.

Another thing I'd definitely consider, though, is converting to HTML using pdfreflow and then converting the HTML to TXT.


As a fan of open source (and automation) I hate to say this, but the best results I just got (on quite a large, complex PDF) came from opening it in Adobe Reader and choosing File|Save As Text.

(I am pre-processing for text analysis experiments, not for reading, but I think my first and second choices would be the same either way.)

I've been comparing the output side-by-side. My second choice is ebook-convert.

Adobe: left in FFs (form feed characters) for page breaks, left in page numbers, hasn't converted headings/paragraphs to single lines, but it has fixed hyphens. Junk that was hidden in the PDF did not get output. Correctly got the big capitals at start of sections, e.g. "The", not "T he" or even "T\n\nhe".

ebook-convert: Left in page numbers, and some hidden junk in header/footer (but no FFs). Converts most paragraphs to be single lines. The ones it missed are double-spaced though! Bullets don't always line up with the text. Correctly got "The" at the start of the chapter.

pdftotext (without -layout): Not bad, bullets line up, but header/footer noise. FFs are in there. Hyphens removed. Worst for start of chapter big letters: "T\n\nhe".

pdftotext (with -layout): Similar, but more indents. "T he" for start of chapter.

pdftohtml >> pdfreflow >> htmltotext: It removed page numbers, but still junk in header/footer. "T he" for start of chapter. Hyphens removed. (It uses multiple lines per paragraph, yet they are not the same line breaks as in the other versions!)
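
Since several of these tools leave form feeds and page numbers in the output, a small filter can tidy things up before text analysis. This is only a rough sketch: the function name clean_text is made up for illustration, and it assumes each page number sits alone on its own line, so it will also drop any genuine line that consists only of digits.

```shell
# clean_text: hypothetical helper, reads text on stdin and writes to stdout.
# 1. tr deletes form feed (FF) characters used as page breaks.
# 2. sed deletes lines containing nothing but digits and whitespace
#    (assumed here to be page numbers).
clean_text() {
  tr -d '\f' | sed -E '/^[[:space:]]*[0-9]+[[:space:]]*$/d'
}
```

It could then be used as: clean_text < output.txt > cleaned.txt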
