Add page and line numbers to a pdf

Alright, here's a go at numbering lines in a PDF (or any other image format) without access to the source.

I wrote a little shell script that uses ImageMagick (at least version 6.6.9-4) to find the text lines. It converts a given PDF into a separate raster image for each page, splits each of these into half pages (one per column), and shrinks every half page to a width of one pixel, which basically takes the horizontal average. It then turns the result into a monochrome image with a given threshold (black = text, white = no text) and shrinks every black run down to a single pixel (= the middle of a line). Finally, it outputs the image as text, pipes that through sed to clean it up and remove all the non-text rows, and writes a txt file with the position of each line in units of 1/1000 of the page height.

findlines.sh:

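#!/bin/sh
# Usage: sh findlines.sh <basename>   (expects <basename>.pdf)
# Split every page into a left and a right half: <basename>-0, <basename>-1, ...
convert "$1.pdf" -crop 50x100% "png:$1"
for f in "$1"-*; do
case "$f" in *.txt) continue;; esac  # skip position files left over from an earlier run
convert "$f" -flatten -resize 1X1000! -black-threshold 99% -white-threshold 10% -negate -morphology Erode Diamond -morphology Thinning:-1 Skeleton -black-threshold 50% txt:-| sed -e '1d' -e '/#000000/d' -e 's/^[^,]*,//' -e 's/[(]//g' -e 's/:.*//' -e 's/,/ /g' > "$f.txt";
done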

Running the script takes about 1 second per page and produces a number of files named basename-<number>.txt. Numbering starts at 0: even <numbers> contain the line positions of the left column, and odd <numbers> those of the right column. These files can then be read by pgfplotstable (at least v 1.4) and used to typeset the line numbers on top of the imported PDF file. I defined a command that takes the page number and four line numbers as arguments. The four line numbers tell the macro at which "raw" line numbers the "real" text lines start and end in the left and right column. By setting \pgfkeys{print raw line numbers=true}, the raw line numbers as found by the algorithm are shown in red.
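
To make the argument convention concrete, here is the arithmetic that the macro below carries out, written as a small worked example. The symbols p, r, a, b, c, d, s and N_left are labels introduced here for illustration and do not appear in the code: p is a position read from a txt file, r is the zero-based raw line number, a to d stand for the arguments #2 to #5, and N_left is the number of rows in the left-column file. The block assumes amsmath.

```latex
\begin{align*}
y &= 1 - \frac{p}{1000}
  && \text{vertical placement, e.g. } p = 250 \Rightarrow y = 0.75 \\
n_{\mathrm{left}}(r) &= r - a + 1
  && \text{for } a \le r \le b \\
s &= \min\bigl(N_{\mathrm{left}} - a,\; b - a + 1\bigr)
  && \text{number of numbered lines in the left column} \\
n_{\mathrm{right}}(r) &= r - c + s + 1
  && \text{for } c \le r \le d
\end{align*}
```

For a call such as \addlinenumbers{1}{20}{50}{2}{65}, raw lines 20 to 50 of the left column get the numbers 1 to 31. Provided the left-column file has at least 51 rows, s = 31, so raw line 2 of the right column becomes line 32 and raw line 65 becomes line 95.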

\documentclass{article}
\usepackage{tikz}
\usepackage{pgfplotstable}

\newif\ifprintrawlinenumbers
\pgfkeys{print raw line numbers/.is if=printrawlinenumbers,
  print raw line numbers=true}
\newcommand{\addlinenumbers}[5]{
  \pgfmathtruncatemacro{\leftnumber}{(#1-1)*2}
  \pgfmathtruncatemacro{\rightnumber}{(#1-1)*2+1}
  \pgfplotstableread{\pdfname-\leftnumber.txt}\leftlines
  \pgfplotstableread{\pdfname-\rightnumber.txt}\rightlines
  \begin{tikzpicture}[font=\tiny,anchor=east]
  \node[anchor=south west,inner sep=0] (image) at (0,0) {\includegraphics[width=14cm,page=#1]{\pdfname.pdf}};
    \begin{scope}[x={(image.south east)},y={(image.north west)}]
      \pgfplotstableforeachcolumnelement{[index] 0}\of\leftlines\as\position{
        \ifprintrawlinenumbers
          \node [font=\tiny,red] at (0.04,1-\position/1000)         {\pgfplotstablerow};
        \fi
        \pgfmathtruncatemacro{\checkexcluded}{
          (\pgfplotstablerow>=#2 && \pgfplotstablerow<=#3) ? 1 : 0
        }
        \ifnum\checkexcluded=1
          \pgfmathtruncatemacro\linenumber{\pgfplotstablerow-#2+1}
          \node [font=\tiny,align=right,anchor=east] at (0.08,1-\position/1000) {\linenumber};
        \fi
      }
      \pgfplotstablegetrowsof{\leftlines}
      \pgfmathtruncatemacro\rightstart{min((\pgfplotsretval-#2),(#3-#2+1))}
      \pgfplotstableforeachcolumnelement{[index] 0}\of\rightlines\as\position{
        \ifprintrawlinenumbers
          \node [font=\tiny,red,anchor=east] at (1.0,1-\position/1000) {\pgfplotstablerow};
        \fi
        \pgfmathtruncatemacro{\checkexcluded}{
          (\pgfplotstablerow>=#4 && \pgfplotstablerow<=#5) ? 1 : 0
        }
        \ifnum\checkexcluded=1
          \pgfmathtruncatemacro\linenumber{\pgfplotstablerow-#4+\rightstart+1}
          \node [font=\tiny] at (0.96,1-\position/1000) {\linenumber};
        \fi
      }
    \end{scope}
  \end{tikzpicture}
}

\begin{document}

\def\pdfname{article}
\addlinenumbers{1}{20}{50}{2}{65}
\pgfkeys{print raw line numbers=false}
\addlinenumbers{2}{0}{69}{0}{64}
\addlinenumbers{3}{19}{47}{21}{48}

\end{document}

As a proof of concept, here's the output for the first two pages of an article from the journal Environmental Science & Technology. I think it works really well. I haven't been able to call findlines.sh from within LaTeX, though, so this step has to be performed manually before compiling the .tex file.

first page of a pdf with line numbers

second page of a pdf with line numbers
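
On calling the script from within LaTeX: one route that may work, untested with this script, is TeX's shell escape. The fragment below is only a sketch. It assumes that pdfLaTeX is run with -shell-escape, that findlines.sh sits in the working directory, and that ImageMagick is on the PATH. It would replace the first two lines of the document body in the example above.

```latex
% Compile with: pdflatex -shell-escape <texfile>
\def\pdfname{article}
% \immediate runs the command right away, so the position files
% exist before \addlinenumbers tries to read them.
\immediate\write18{sh findlines.sh \pdfname}
\addlinenumbers{1}{20}{50}{2}{65}
```

Shell escape lets the document run arbitrary commands, so enable it only for files you trust. With recent LuaLaTeX, \write18 is not available directly; the shellesc package's \ShellEscape is the usual substitute.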


You can do (1), adding page numbers, easily with the pdfpages package.

\documentclass{article}
\usepackage{pdfpages}
\begin{document}
\includepdf[pages=1-,pagecommand={\thispagestyle{plain}}]{<pdffile>}
\end{document}

In the example document, I simply passed the pagestyle plain to the pagecommand, but using the fancyhdr package you can make any kind of extra header/footer you like. To place the page number appropriately you may also need to adjust the margins using the geometry package. For example:

\documentclass{article}
\usepackage[margin=.5in]{geometry}
\usepackage{pdfpages}
\usepackage{fancyhdr}
\fancyhf{}
\renewcommand{\headrulewidth}{0pt}
\lfoot{\textit{My pdf document}}
\rfoot{\thepage}
\begin{document} 
\includepdf[pages=1-,pagecommand={\thispagestyle{fancy}}]{<pdffile>}
\end{document}

This places a footer containing "My pdf document" on the left and the page number on the right. The margin is made very small so that the page number is unlikely to interfere with the included document.

To make sure the paper size of the output PDF is the same as the included PDF, add the fitpaper option to \includepdf. From the pdfpages manual:

fitpaper Adjusts the paper size to the one of the inserted document.
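
As a minimal sketch, this is the first example with fitpaper added. <pdffile> is still a placeholder for your own file name.

```latex
\documentclass{article}
\usepackage{pdfpages}
\begin{document}
% fitpaper adjusts the output paper size to that of the inserted document
\includepdf[pages=1-,fitpaper,pagecommand={\thispagestyle{plain}}]{<pdffile>}
\end{document}
```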

See Jake's answer for a very ingenious method of adding line numbers to an existing pdf.


If I understand correctly that you need to add line numbers to the PDF, you can do so with the lineno package. It does, however, only add line numbers according to how LaTeX itself sets up the text, and the resulting line breaks can be quite different from those in the source document.

\documentclass[11pt,a4paper]{article}
\usepackage{lineno}
\usepackage{lipsum}
\begin{document}
    \linenumbers
    \lipsum
\end{document}

Line number example