Why can't "fi" be separated when being copied from a compiled pdf?

The cmap package was written to solve this problem. From the documentation:

The cmap package is intended to make the PDF files generated by pdflatex "searchable and copyable" in acrobat reader and other compliant PDF viewers.

Just put the line \usepackage{cmap} at the beginning of your preamble, and it will pre-load the needed CMap translations for the fonts used in the document, provided that there exists the file <encoding>.cmap for the font encoding.

Current version of the CMap package includes CMap files for the following LaTeX font encodings: T1, T2A, T2B, T2C, T5, OT1, OT1tt, OT6, LGR, LAE, LFE.

This works for me:

\documentclass{article}
\usepackage{cmap}
\usepackage[T1]{fontenc}
\begin{document}
final
\end{document}

Update: Ulrike Fischer says in the comments that \pdfglyphtounicode is the better approach; see "Make ligatures in Linux Libertine copyable (and searchable)".
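
A minimal sketch of the \pdfglyphtounicode route mentioned in the update, assuming pdflatex and a distribution that ships the file glyphtounicode.tex (recent TeX Live installations do). The file contains a long list of \pdfglyphtounicode declarations mapping glyph names to Unicode code points, and \pdfgentounicode=1 asks pdfTeX to write the corresponding ToUnicode tables into the PDF. I have not compared its output against cmap here, so treat it as a starting point rather than a guaranteed fix:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
% Load the glyph-name -> Unicode mappings that come with pdfTeX.
\input{glyphtounicode}
% Ask pdfTeX to emit ToUnicode CMaps for the fonts it embeds.
\pdfgentounicode=1
\begin{document}
final iffy fig flat office baffle
\end{document}
```

After compiling, copy the line from the PDF viewer into a plain-text editor and check whether the ligatures come back as separate letters.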


The following is taken "verbatim" from The TeXbook (Chapter 9, "TeX's Roman Fonts", p. 51):

Let's begin with the rules for the normal roman font (\rm or \tenrm); plain TeX will use this font for everything unless you specify otherwise. Most of the ordinary symbols that you need are readily available and you can type them in the ordinary way: There's nothing special about

  • the letters A to Z and a to z
  • the digits 0 to 9
  • common punctuation marks : ; ! ? ( ) [ ] - * / . , @

except that TeX recognizes certain combinations as ligatures

  • ff yields ﬀ
  • fi yields ﬁ
  • fl yields ﬂ
  • ffi yields ﬃ
  • ffl yields ﬄ
  • -- yields – (an en-dash)
  • --- yields — (an em-dash)
  • `` yields “
  • '' yields ”
  • !` yields ¡
  • ?` yields ¿

TeX also produces single glyphs for accented letters, as in \^o, although strictly speaking that goes through its accent mechanism rather than the font's ligature table. The best way to think about a ligature is that it is a single character in a font. As such, MS Word's "Insert Symbol" dialog is probably a good representation of this:

[Screenshot: MS Word's "Insert Symbol" dialog]

Note how some of the symbols occur in a single box, implying that they are "joined at the hip", so to speak, and represent a single character (or ligature) in the typeset output. This is also font-specific: different fonts provide more or fewer ligatures.
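
Since a ligature is a single glyph chosen by TeX from adjacent input letters, one way to keep "f" and "i" as two separate characters in the PDF is to stop the ligature from forming in the source. The sketch below uses the italic correction \/ for that, following the "shelf\/ful" trick from The TeXbook; the sample words are my own, and the \/ may add a sliver of extra space after the "f":

```latex
\documentclass{article}
\begin{document}
% Default behaviour: "fi" and "ff" are each replaced by one ligature glyph.
field shelfful

% An italic correction between the letters counts as an explicit kern,
% so no ligature is formed and the letters are typeset as separate glyphs.
f\/ield shelf\/ful
\end{document}
```

This only makes sense for the odd word here and there; for a whole document, fixing the text extraction (as in the other answers) is the more practical route.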


You wrote:

I notice that in a pdf file compiled from Latex, "fi" such as in "field" cannot be separated as "f" and "i" when copying text out of the pdf file. I wonder why and if this can be changed?

If it's already in the compiled pdf, there's not much you can do. Consider the following MWE:

\documentclass{standalone}
\usepackage[OT1]{fontenc}
\begin{document}
iffy fig flat office baffle
\end{document} 

If you compile this document with pdflatex, you should get this:

[Image: typeset output of the MWE, showing the five ligatures]

Note the appearance of the five ligated characters. However, if you copy-and-paste the output (I'm using TeXLive 2011 and TeXworks as the front end as I'm writing this), you'll get:

iy g 
at oce bae

I'm afraid that the data entry system for this site doesn't seem to render the various weird symbols properly, so you'll have to trust me when I say I see a pair of musical notes, a masculine and a feminine gender symbol, and some unrecognizable shapes. (Interestingly, the "fl" glyph appears to be represented by an invisible newline character, hence the characters "at" show up on the second line.)

However, if I change "OT1" to "T1" (the more-modern font encoding scheme for English-language text -- also OK for many non-English languages) in the MWE, recompile it, and copy-and-paste the output from the resulting pdf file to an ascii editor, I get:

iffy fig flat office baffle

as one would hope to get, i.e., all five ligated character combinations are now recovered correctly.

I haven't repeated this experiment with other modern font encodings, but I suspect that the problem (the ligature glyphs coming out wrong when copied from the pdf file to a plain-text file) is specific to the OT1 font encoding. If you come across a pdf file that was created with LaTeX and the copied ligatures look all funny, and if the opportunity presents itself, you may wish to ask the paper's author whether they would be willing to recompile it with the \usepackage[T1]{fontenc} instruction in the preamble.
