Split PDF into documents with several pages each

pdftk is able to cut out a fixed set of pages efficiently. With a bit of scripting glue, this does what I want:

number=$(pdfinfo -- "$file" 2> /dev/null | awk '$1 == "Pages:" {print $2}')
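# Round up so that a final, shorter chunk is not dropped.
count=$(( (number + pagesper - 1) / pagesper ))
filename=${file%.pdf}

counter=0
while [ "$count" -gt "$counter" ]; do
  start=$((counter*pagesper + 1))
  end=$((start + pagesper - 1))
  # Clamp the last chunk to the end of the document.
  if [ "$end" -gt "$number" ]; then end=$number; fi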

  counterstring=$(printf %04d "$counter")
  pdftk "$file" cat "${start}-${end}" output "${filename}_${counterstring}.pdf"

  counter=$((counter + 1))
done

This assumes that you have the number of pages per chunk in $pagesper and the filename of the source PDF in $file.
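
The chunk arithmetic is easy to get wrong at the end of the document, so here is a minimal Python sketch of just that part. The function name chunk_ranges is made up for illustration. It returns the 1-based, inclusive start-end pairs that would be handed to pdftk's cat operation, with the last range shortened when the page count is not a multiple of the chunk size.

```python
def chunk_ranges(num_pages, pages_per_chunk):
    """Return 1-based inclusive (start, end) page ranges, one per chunk."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    ranges = []
    for start in range(1, num_pages + 1, pages_per_chunk):
        # The last chunk may be shorter than pages_per_chunk.
        end = min(start + pages_per_chunk - 1, num_pages)
        ranges.append((start, end))
    return ranges


# A 12-page document in chunks of 5 gives 1-5, 6-10 and 11-12.
for start, end in chunk_ranges(12, 5):
    print("%d-%d" % (start, end))
```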

If you have acroread installed, you can also use

acroread -size a4 -start "$start" -end "$end" -pairs "$file" "${filename}_${counterstring}.ps"

acroread offers the option -toPostScript which may be useful.


See also pdfseparate and pdfunite from poppler-utils. pdfseparate breaks the file into one file per page, which makes it relatively easy to reassemble the pages later with pdfunite, manually or (semi-)automatically.

Like with zsh:

autoload zargs

reunite() pdfunite "$@" file-$1-$argv[-1].pdf

pdfseparate file.pdf p%d
zargs -n 5 p<->(n) -- reunite
rm -f p<->

This would split file.pdf into file-p1-p5.pdf, file-p6-p10.pdf, and so on.
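
For readers without zsh, the grouping step can be sketched in Python. This is only an illustration: unite_commands is a made-up name, the p1, p2, ... file names follow the pdfseparate pattern p%d used above, and the sketch only builds the pdfunite argument lists without running them. Sorting on the number in the name plays the role of the (n) glob qualifier.

```python
import re


def unite_commands(page_files, pages_per_file, prefix="file"):
    """Build one pdfunite argument list per group of page files."""
    # Sort by the number in the name so that p10 comes after p9, not after p1.
    ordered = sorted(page_files,
                     key=lambda name: int(re.search(r"\d+", name).group()))
    commands = []
    for i in range(0, len(ordered), pages_per_file):
        group = ordered[i:i + pages_per_file]
        # Name the output after the first and last page file of the group.
        output = "%s-%s-%s.pdf" % (prefix, group[0], group[-1])
        commands.append(["pdfunite"] + group + [output])
    return commands


for command in unite_commands(["p%d" % n for n in range(1, 13)], 5):
    print(" ".join(command))
```

Each returned list could then be passed to subprocess.run to do the actual merge.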


I find Python with the pyPdf library convenient for those jobs that pdftk doesn't do conveniently (or at all).

#!/usr/bin/env python
import sys
from pyPdf import PdfFileWriter, PdfFileReader

# Command line parsing
if len(sys.argv) < 3 or sys.argv[1][-4:] != '.pdf':
    sys.stderr.write('Usage: ' + sys.argv[0] + ''' FILE.pdf N
Split FILE.pdf into chunks of N pages each.
''')
    sys.exit(3)
pages_per_file = int(sys.argv[2])

base_name = sys.argv[1][:-4] + '-'
input_pdf = PdfFileReader(open(sys.argv[1], 'rb'))
output_pdf = PdfFileWriter()
num_pages = input_pdf.getNumPages()
for i in xrange(num_pages):
    output_pdf.addPage(input_pdf.getPage(i))
    if (i + 1) % pages_per_file == 0 or i + 1 == num_pages:
        output_file = open(base_name + str(i // pages_per_file + 1) + '.pdf', "wb")
        output_pdf.write(output_file)
        output_file.close()
        output_pdf = PdfFileWriter()
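
pyPdf is no longer maintained; its successor is published as pypdf. A rough equivalent with that package might look like the sketch below. It assumes a recent pypdf release that provides PdfReader, PdfWriter and add_page, and the helper names output_path and split_pdf are made up here.

```python
import os


def output_path(source, index):
    """Name of the index-th chunk (1-based): FILE.pdf -> FILE-<index>.pdf."""
    root, ext = os.path.splitext(source)
    return "%s-%d%s" % (root, index, ext)


def split_pdf(source, pages_per_file):
    """Write consecutive chunks of pages_per_file pages taken from source."""
    # Imported here so that output_path can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    total = len(reader.pages)
    for index, first in enumerate(range(0, total, pages_per_file), start=1):
        writer = PdfWriter()
        # The last chunk stops at the end of the document.
        for i in range(first, min(first + pages_per_file, total)):
            writer.add_page(reader.pages[i])
        with open(output_path(source, index), "wb") as out:
            writer.write(out)
```

Calling split_pdf("FILE.pdf", 5) would then write FILE-1.pdf, FILE-2.pdf and so on next to the source file.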