Diff of two pdf files?

You can use DiffPDF for this. From the description:

DiffPDF is used to compare two PDF files. By default the comparison is of the text on each pair of pages, but comparing the appearance of pages is also supported (for example, if a diagram is changed or a paragraph reformatted). It is also possible to compare particular pages or page ranges. For example, if there are two versions of a PDF file, one with pages 1-12 and the other with pages 1-13 because of an extra page having been added as page 4, they can be compared by specifying two page ranges, 1-12 for the first and 1-3, 5-13 for the second. This will make DiffPDF compare pages in the pairs (1, 1), (2, 2), (3, 3), (4, 5), (5, 6), and so on, to (12, 13).
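
To see how those two page ranges pair up, here is a small sketch. It is not DiffPDF's own code; `expand_ranges` and `pair_pages` are hypothetical helper names, and it assumes `seq` is available.

```shell
#!/bin/bash
# Illustration only: how two page-range specs line up into page pairs.

# Expand a spec such as "1-3, 5-13" into one page number per line.
expand_ranges() {
    printf '%s\n' "$1" | tr -d ' ' | tr ',' '\n' |
    while IFS=- read -r first last; do
        # A single page such as "4" has no end, so it expands to itself.
        seq "$first" "${last:-$first}"
    done
}

# Print the pairs (page of file 1, page of file 2) in order.
pair_pages() {
    local pages1 p
    pages1=$(expand_ranges "$1")
    set -- $(expand_ranges "$2")   # positional parameters now hold the second list
    for p in $pages1; do
        [ $# -gt 0 ] || break      # stop when the second list runs out
        printf '(%s, %s)\n' "$p" "$1"
        shift
    done
}

# The example from the description: page 4 of the second file is skipped.
pair_pages "1-12" "1-3, 5-13"
```

The fourth line of output is (4, 5) and the last is (12, 13), matching the pairs listed in the description.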


I just figured out a hack to make DiffPDF (the program suggested by @qbi) usable for more than minor changes. What I do is concatenate all the pages of each PDF into one long scroll using pdfjam and then compare the two scrolls. It works even when large sections are removed or inserted!

Here is a bash script that does the job:

#!/bin/bash
#
# Compare two PDF files.
# Dependencies:
#  - pdfinfo (xpdf or poppler-utils)
#  - pdfjam  (texlive-extra-utils)
#  - diffpdf
#  - bc
#

MAX_HEIGHT=15840  #The maximum height of a page (in points), limited by pdfjam.

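TMPFILE1=$(mktemp /tmp/XXXXXX.pdf)
TMPFILE2=$(mktemp /tmp/XXXXXX.pdf)
# Remove the temporary files on any exit, including the error path below.
trap 'rm -f "$TMPFILE1" "$TMPFILE2"' EXIT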

usage="usage: scrolldiff -h FILE1.pdf FILE2.pdf
  -h print this message

v0.0"

while getopts "h" OPTIONS ; do
    case ${OPTIONS} in
        h|-help) echo "${usage}"; exit;;
    esac
done
shift $(($OPTIND - 1))

if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
  echo "ERROR: input files do not exist." >&2
  echo >&2
  echo "$usage" >&2
  exit 1
fi

    #Get the number of pages:
pages1=$( pdfinfo "$1" | grep 'Pages' - | awk '{print $2}' )
pages2=$( pdfinfo "$2" | grep 'Pages' - | awk '{print $2}' )
numpages=$pages2
# Compare numerically; inside [[ ]] the > operator compares strings, so "9" would beat "10".
if [ "$pages1" -gt "$pages2" ]
then
  numpages=$pages1
fi

     #Get the paper size:
width1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $3}' )
height1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $5}' )
width2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $3}' )
height2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $5}' )

if [ $(bc <<< "$width1 < $width2") -eq 1 ]
then
  width1=$width2
fi
if [ $(bc <<< "$height1 < $height2") -eq 1 ]
then
  height1=$height2
fi

height=$( echo "scale=2; $height1 * $numpages" | bc )
if [ $(bc <<< "$MAX_HEIGHT < $height") -eq 1 ]
then
  height=$MAX_HEIGHT
fi
papersize="${width1}pt,${height}pt"



    #Make the scrolls:
pdfj="pdfjam --nup 1x$numpages --papersize {${papersize}} --outfile"
$pdfj "$TMPFILE1" "$1"
$pdfj "$TMPFILE2" "$2"

diffpdf "$TMPFILE1" "$TMPFILE2"

rm -f "$TMPFILE1" "$TMPFILE2"
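
For what it's worth, the height calculation in the script boils down to "page height times number of pages, capped at MAX_HEIGHT". Here is a standalone sketch of just that step. The function name `scroll_height` is made up for illustration, and I use awk instead of bc here only so the snippet has one fewer dependency.

```shell
#!/bin/bash
# Height (in points) of the scroll page: pages stacked vertically, capped at a maximum.
scroll_height() {
    # $1 = page height in pt, $2 = number of pages, $3 = cap in pt (default 15840)
    awk -v h="$1" -v n="$2" -v max="${3:-15840}" 'BEGIN {
        total = h * n
        if (total > max) total = max
        printf "%.2f\n", total
    }'
}

scroll_height 792 12    # 12 US Letter pages: prints 9504.00
scroll_height 792 30    # 30 pages would need 23760 pt, so the cap applies: prints 15840.00
```

With US Letter pages (792 pt tall) the cap is reached at 20 pages. Beyond that the scroll cannot grow any taller, so presumably pdfjam has to shrink each page to fit them all on it.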

Even though this doesn't solve the issue directly, here is a nice way to do it all from the command line with few dependencies:

diff <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

https://linux.die.net/man/1/pdftotext

It works really well for basic PDF comparisons. If you have a newer version of pdftotext, you can try -bbox instead of -layout; it writes XHTML with a bounding box for each word, so changes in position show up as well.
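
If you use this often, the one-liner can be wrapped in a small function. This is only a sketch: the name `pdf_textdiff` is made up, it assumes a diff that supports `--label` (GNU diff does), and it uses temporary files rather than process substitution so it also runs in shells without `<( )`.

```shell
#!/bin/bash
# Hypothetical helper: unified text diff of two PDFs via pdftotext.
# Returns 0 if the extracted text is identical, 1 if it differs, 2 on error.
pdf_textdiff() {
    local old=$1 new=$2 tmp status
    tmp=$(mktemp -d) || return 2
    # "-" as the output name makes pdftotext write to stdout.
    if ! pdftotext -layout "$old" - > "$tmp/old.txt" ||
       ! pdftotext -layout "$new" - > "$tmp/new.txt"; then
        rm -rf "$tmp"
        return 2
    fi
    status=0
    # Label the two sides with the PDF names instead of the temp file names.
    diff -u --label "$old" --label "$new" "$tmp/old.txt" "$tmp/new.txt" || status=$?
    rm -rf "$tmp"
    return "$status"
}

# Usage: pdf_textdiff old.pdf new.pdf
```

The exit status follows diff's convention, so it can be used in scripts, e.g. `pdf_textdiff old.pdf new.pdf > /dev/null && echo "same text"`.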

As far as diffing programs go, I like using diffuse, so the command changes ever so slightly:

diffuse <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

http://diffuse.sourceforge.net/

Hope that helps.
