PDF and text layer

The PDF specification has no mention of a 'text layer'. Normally, there is just one way to 'store' text: by means of text showing operators. These operators draw text at a specific location, using a specific color, font, font size and text rendering mode. There are several text rendering modes. For the purpose of answering your question, text can be visible or invisible.
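To make the operators a bit more concrete, here is a minimal sketch in Python that assembles the content stream fragment for a single piece of text. The helper names (`escape_pdf_string`, `show_text`) and the font resource name `/F1` are made up for illustration, and the sketch assumes a simple single-byte font encoding. It only builds the operator text; it does not write a complete PDF file.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def show_text(text, x, y, font="/F1", size=12, rgb=(0, 0, 0), render_mode=0):
    """Return content stream operators that show `text` at (x, y).

    `font` is a hypothetical resource name; a real page must define it
    in its /Resources dictionary.
    """
    r, g, b = rgb
    return "\n".join([
        "BT",                               # begin text object
        f"{font} {size} Tf",                # select font and font size
        f"{r} {g} {b} rg",                  # fill colour
        f"{render_mode} Tr",                # text rendering mode
        f"{x} {y} Td",                      # move to the text position
        f"({escape_pdf_string(text)}) Tj",  # text showing operator
        "ET",                               # end text object
    ])


visible = show_text("Hello (world)", 72, 720)                    # mode 0: fill
invisible = show_text("Hello (world)", 72, 720, render_mode=3)   # mode 3: invisible
print(visible)
```

The two results differ only in the operand of `Tr`, which is the whole difference between visible and invisible text.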

A scanner that performs OCR renders both the raster image and the recognized text into the PDF document. The text is drawn using the invisible text rendering mode. The result is that you can select the text with the mouse (the highlighted area is shown at the expected location on top of the image) and you can search for text. Again, the search result is shown at the correct location.
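As a rough illustration of what such a page can look like, the following Python sketch builds a content stream that first paints the scanned image and then places each recognized word in invisible mode. The function name `ocr_page_stream`, the resource names `/Im0` and `/F1`, and the word list are all assumptions made for this example; actual OCR software will structure its output differently.

```python
def ocr_page_stream(page_w, page_h, words, image="/Im0", font="/F1"):
    """Build a page content stream: scanned image plus invisible text.

    `words` is an iterable of (text, x, y, size) tuples, as a
    hypothetical OCR step might report them.
    """
    ops = [
        "q",                               # save graphics state
        f"{page_w} 0 0 {page_h} 0 0 cm",   # scale the unit square to the page
        f"{image} Do",                     # paint the scanned image
        "Q",                               # restore graphics state
        "BT",                              # begin text object
        "3 Tr",                            # rendering mode 3: invisible
    ]
    for text, x, y, size in words:
        escaped = (text.replace("\\", "\\\\")
                       .replace("(", "\\(")
                       .replace(")", "\\)"))
        ops.append(f"{font} {size} Tf")      # font and size for this word
        ops.append(f"1 0 0 1 {x} {y} Tm")    # absolute position of the word
        ops.append(f"({escaped}) Tj")        # show the (invisible) word
    ops.append("ET")
    return "\n".join(ops)


stream = ocr_page_stream(612, 792, [("Invoice", 72, 700, 10), ("(42)", 130, 700, 10)])
print(stream)
```

Because the words are positioned where they appear in the image, selection and search highlights land on top of the matching pixels.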

What happens when you generate a PDF from a Word document depends on the software that you use for the conversion. To my knowledge, these converters do not generate an image; they generate visible text.

XMP is metadata, as opposed to visual data.

Finally, with respect to your question about detecting whether a PDF has text data, here is a similar question (visible to users with 10k+ reputation only).
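For what it is worth, the basic idea of such a check can be sketched in a few lines of Python: look for text showing operators (`Tj`, `TJ`, `'`, `"`) and track the current `Tr` mode. This sketch assumes you already have a decoded (uncompressed) content stream as a string; getting there from a real file needs a PDF library or tool. The function names are made up, and it ignores inline images, form XObjects and other corner cases.

```python
import re

TEXT_SHOWING = {"Tj", "TJ", "'", '"'}


def strip_strings(stream: str) -> str:
    """Blank out literal (...) and hex <...> strings so that their
    contents cannot be mistaken for operators."""
    out, depth, i = [], 0, 0
    while i < len(stream):
        ch = stream[i]
        if depth:                      # inside a literal string
            if ch == "\\":
                i += 1                 # skip the escaped character
            elif ch == "(":
                depth += 1             # balanced parentheses may nest
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    out.append(" ")
        elif ch == "(":
            depth = 1
        else:
            out.append(ch)
        i += 1
    return re.sub(r"<[0-9A-Fa-f\s]*>", " ", "".join(out))


def classify_text(stream: str) -> dict:
    """Count text showing operators, split by visible / invisible mode."""
    tokens = strip_strings(stream).replace("[", " ").replace("]", " ").split()
    mode, saved = 0, []                # rendering mode 0 is the default
    counts = {"visible": 0, "invisible": 0}
    for prev, tok in zip([""] + tokens, tokens):
        if tok == "q":
            saved.append(mode)         # the mode is part of the graphics state
        elif tok == "Q" and saved:
            mode = saved.pop()
        elif tok == "Tr" and prev.isdigit():
            mode = int(prev)
        elif tok in TEXT_SHOWING:
            counts["invisible" if mode == 3 else "visible"] += 1
    return counts


scanned = ("q 612 0 0 792 0 0 cm /Im0 Do Q\n"
           "BT 3 Tr /F1 10 Tf 1 0 0 1 72 700 Tm (Invoice (Tj) 42) Tj ET")
converted = "BT /F1 12 Tf 72 720 Td [(Hel) -20 (lo)] TJ ET"
print(classify_text(scanned), classify_text(converted))
```

A page with only invisible text next to an image is a strong hint for a scan with OCR; a page with visible text was most likely produced by a converter.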


I upvoted Frank Rem's answer because it is 'complete'.

Let me add a few details, however:

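  1. The 'invisibility' of text comes from the Tr operator, which sets the text rendering mode. Its operand selects the mode, and mode 3 (written as "3 Tr" in the page content) means "Neither fill nor stroke text (invisible)" (PDF-1.7 spec, Section 9.3.6).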
  2. Have a look at this SuperUser question: "PDF has an extra blank in all words after running through Ghostscript" and my answers over there to learn a few more things about the technical details (esp. look at the one with the headline "How can we make the invisible text visible?").
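If you want to experiment with this yourself, the following Python sketch shows the basic idea of switching invisible text to filled text by rewriting the `Tr` operand. It is a naive illustration, not necessarily the method used in the linked answer: it assumes a decoded (uncompressed) content stream given as a string, the function name is made up, and the regular expression would also alter a "3 Tr" that happens to occur inside a string.

```python
import re


def make_invisible_text_visible(stream: str) -> str:
    """Replace every `3 Tr` (invisible) with `0 Tr` (fill text).

    The lookbehind keeps operands such as `13` or `0.3` untouched.
    """
    return re.sub(r"(?<![\w.])3(\s+)Tr\b", r"0\1Tr", stream)


before = "BT 3 Tr /F1 10 Tf 1 0 0 1 72 700 Tm (Invoice) Tj ET"
after = make_invisible_text_visible(before)
print(after)
```

With mode 0 the text is painted in the current fill colour, so it will appear on top of the scanned image.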
