How to cut-paste from PDF with non-ASCII encoding?

Can you paste text copied from the file into other programs, such as Notepad or Word?

Some PDF files, even ones produced by Adobe tools, are created without the information that is crucial for extracting text from them. Specifically, such files do not contain glyph-to-character mapping information.

Such files display and print just fine, but text cannot be properly copied or extracted from them.

For example, Distiller produces such files when the "Smallest File Size" preset is used.
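In PDF terms, the glyph-to-character mapping is normally stored as a /ToUnicode entry in each font dictionary. A rough way to see whether a file has any such entry at all is to scan its bytes for that key. The sketch below is only a heuristic: the function name is my own, it only looks inside Flate-compressed streams, it does not handle encrypted files, and it cannot tell which font is missing the mapping. A real PDF library would give a more reliable answer.

```python
import re
import zlib


def has_tounicode(pdf_bytes):
    """Heuristic: True if a /ToUnicode key appears anywhere in the file.

    Checks the raw bytes first, then every stream that happens to
    decompress as Flate (zlib) data, since newer PDFs may keep font
    dictionaries inside compressed object streams.
    """
    if b"/ToUnicode" in pdf_bytes:
        return True
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream"
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream (or corrupt); skip it
        if b"/ToUnicode" in data:
            return True
    return False
```

Usage would be something like `has_tounicode(open("file.pdf", "rb").read())`, where `file.pdf` is a placeholder path. If it returns False, copy/paste of non-ASCII text from that file is likely to produce garbage.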

I have the same problem. It is explained here: http://forums.adobe.com/thread/915012

My solution was to convert the PDF to Word using Acrobat's export tool and then extract the information I needed from the Word file.

It's frustrating, but it works.

Another solution I found is to convert the PDF to images (JPEG, PNG, etc.) and then run OCR on them.

It is quite possible that the text contains characters that are copied correctly but that your browser cannot display for lack of a suitable font. A PDF document may contain embedded fonts, so Adobe Reader displays the characters correctly, but a browser has no access to those fonts.

You can check whether this is the reason by trying to copy and paste the characters here (it might be useful info about the problem anyway). You could also download and install the Code200x fonts, which contain pretty much any character you can normally expect to encounter. (It is not guaranteed, but probable, that Firefox will be able to use those fonts automatically when needed.)
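If pasting here is not convenient, you can also inspect the copied text yourself. The following is a minimal sketch using Python's standard `unicodedata` module; the function names are made up for illustration, and the "looks unmapped" rule is only my own heuristic. Ordinary letters with proper Unicode names suggest the copy worked and the problem is a missing font, while private-use characters, control characters, or U+FFFD suggest the PDF itself lacks the mapping.

```python
import unicodedata


def describe_characters(text):
    """Return one line per character: code point, category, Unicode name."""
    lines = []
    for ch in text:
        name = unicodedata.name(ch, "<unnamed>")
        category = unicodedata.category(ch)
        lines.append(f"U+{ord(ch):04X} {category} {name}")
    return lines


def looks_unmapped(text):
    """Heuristic: True if the text contains characters typical of a failed copy.

    Flags the replacement character U+FFFD, private-use (Co) and
    unassigned (Cn) code points, and control characters (Cc) other
    than ordinary whitespace such as newlines and tabs.
    """
    for ch in text:
        if ch == "\ufffd":
            return True
        category = unicodedata.category(ch)
        if category in ("Co", "Cn"):
            return True
        if category == "Cc" and not ch.isspace():
            return True
    return False
```

For example, `print("\n".join(describe_characters(pasted)))` lists every character in a string `pasted` that you copied from the PDF.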
