How to copy text out of a PDF without losing formatting?

Firstly, you have to understand what a PDF is. PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. A PDF is basically a map containing the exact location of characters (individual letters, punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks for paragraph endings.

(A few recent PDFs do store some information about this stuff, but that's a new technology, and you'd be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it.)

Anyway, it's up to your software to implement some kind of "artificial intelligence" to extract merely from the locations of individual characters what is a word, what is a paragraph, and so on. Different software is going to do this better than others, and it's also going to depend on how the PDF was made. In any case, you should never expect perfect results. Having the output PDF is not the same as having the source document. Far better to try to obtain that if you can.
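To make that guessing concrete, here is a minimal Python sketch of the kind of heuristic an extractor might use. It is not how any particular viewer works: the `Glyph` record, the `glyphs_to_text` function and the two thresholds are made up for illustration, and the coordinates are assumed to grow downward on the page.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """One positioned character, roughly what a PDF page gives you (hypothetical record)."""
    char: str
    x: float      # left edge of the character
    y: float      # vertical position; assumed to grow downward
    width: float  # horizontal space the character occupies


def glyphs_to_text(glyphs: List[Glyph], word_gap: float = 1.5, line_tol: float = 2.0) -> str:
    """Guess lines and words from character positions alone."""
    # Step 1: group glyphs into lines by similar vertical position.
    lines: List[List[Glyph]] = []
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    # Step 2: within each line, read left to right and insert a space
    # wherever the horizontal gap between neighbours exceeds word_gap.
    out_lines = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            if cur.x - (prev.x + prev.width) > word_gap:
                text += " "
            text += cur.char
        out_lines.append(text)
    return "\n".join(out_lines)


# Characters arrive in arbitrary order, as they may in a real PDF.
page = [
    Glyph("o", 17, 10, 5), Glyph("H", 0, 10, 5), Glyph("k", 5, 30, 5),
    Glyph("i", 5, 10.5, 2), Glyph("y", 12, 10, 5), Glyph("o", 0, 30, 5),
]
print(glyphs_to_text(page))
```

Even this toy version has to pick thresholds for "same line" and "new word", and nothing in it can tell a paragraph break from an ordinary line wrap. That is why results vary so much between programs and between PDFs.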

The standard solution to your kind of problem is to use Adobe Acrobat Professional (the expensive one, not the free reader) to convert the PDF to HTML. Even that is not going to get perfect results.

There is free software that can be used to extract text from PDFs with some of the formatting intact, but again, don't expect perfect results. See, e.g., calibre (which can convert to RTF format), pdftohtml/pdfreflow or the AbiWord word processor (with all import/export plugins enabled). There's also a PDF import plugin for OpenOffice.
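If you want to script the pdftohtml route, something like the following Python sketch should do. It assumes the Poppler version of pdftohtml is installed and on your PATH; the function names and file names are made up for illustration, and the two flags (`-noframes` for a single HTML file, `-i` to skip images) are the only options used.

```python
import shutil
import subprocess
from typing import List


def build_pdftohtml_command(pdf_path: str, output_base: str) -> List[str]:
    """Build the command line; output_base is the name pdftohtml derives its output file from."""
    # -noframes: write one HTML file instead of a frameset
    # -i: do not extract images
    return ["pdftohtml", "-noframes", "-i", pdf_path, output_base]


def convert_pdf_to_html(pdf_path: str, output_base: str) -> bool:
    """Run pdftohtml if it is available; return True if it reports success."""
    if shutil.which("pdftohtml") is None:
        return False  # tool not installed (or not on PATH)
    result = subprocess.run(build_pdftohtml_command(pdf_path, output_base), check=False)
    return result.returncode == 0
```

Calling `convert_pdf_to_html("paper.pdf", "paper")` would then attempt the conversion. Check the resulting HTML by eye, since the caveats above apply to this tool as much as to any other.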

But please don't expect perfection with any of these results. You're going against the grain here. PDF just is not meant as an editable input format.


Another option is to download and start using the free PDF viewer, Foxit (it's good). Then you can 'Save As' and choose .txt to convert it to a text file. That keeps the text and its line breaks, though a plain text file can't hold things like fonts or bold. Dunno whether you can do the same in Adobe because I stopped using it a while ago when I switched to Foxit.


There is a very good online tool called Sejda. It deals with advanced PDF manipulation. There is no software to download. As it is a new online tool, it is currently still in beta. It allows you to extract text from a PDF, and it provides a myriad of other PDF functions.

http://www.sejda.com/

A brief video review of Sejda's functions was done on 14th November 2012 by Revision 3. It can be found here:

http://revision3.com/tzdaily/sejda-online-pdf
