Multiple PDFs with page group included in a single page warning

PDF has a feature called "Page Groups" (PDF Reference, section 11.4.7). These describe transparency effects between top-level objects on one page. When pdfTeX (or LuaTeX or XeTeX) includes a page from a PDF, it converts each included page into a "Form XObject" (section 8.10.1). pdfTeX also converts the page's Page Group into a /Group entry of that XObject.

The problem is that Adobe products also need a /Group entry (whose content should not matter) in the /Page object that contains these XObjects in order to render transparency correctly. The entry is only needed to select the right rendering engine; the transparency information for the included pages should be taken from the included pages themselves.

pdfTeX will either use the first /Group it encounters when including PDFs or synthesize one when including PNGs with transparency. The warning is triggered when multiple Page Groups are encountered on one page, since the engine will then use the first one encountered and this may not be the "correct" one. The warning can probably be ignored. Of course this should be described somewhere in the pdfTeX documentation...


Update 2016-03-30:

Since version 1.40.15 (TeX Live 2014), pdfTeX has a parameter \pdfsuppresswarningpagegroup:

Ordinarily, pdfTeX gives a warning when more than one included pdf file has a so-called “page group object” (/Group), because only one can “win” — that is, be propagated to the page level. Usually the page groups are identical, but when they are not, the result is unpredictable. It would be ideal if pdfTeX in fact detected whether the page groups were the same and only gave the warning in the problematic case; unfortunately, this is not easy (a patch would be welcome). Nevertheless, often one observes that there is no actual problem. Then seeing the warnings on every run is just noise, and can be suppressed by setting this parameter to a positive number.

So by adding \pdfsuppresswarningpagegroup=1 to the top of your file you can suppress this warning.


The problem is also reported in a German forum, mrunix.de. It might be a bug in the TeX distribution (pdfTeX). The problem occurs only when multiple PDF pages created in a specific manner (e.g. by MS Office products) are included in a single page.

Solution: Convert the PDF files to PS and then back to PDF using Ghostscript (pdf2ps, then ps2pdf); the warning will then go away. This conversion probably removes the "page group" information from the PDF files. (Caveat: This re-renders your PDF, and some text might no longer be selectable or searchable.)

Editing the colorspace of the PDF files with Ghostscript also resolves the issue (provided the PDF file does not contain multiple pages):

gs -o fixed-image.pdf -sDEVICE=pdfwrite -dColorConversionStrategy=/sRGB \
   -dProcessColorModel=/DeviceRGB original-image.pdf

CMYK conversion if RGB does not work for you:

gs -o fixed-image.pdf -sDEVICE=pdfwrite -dColorConversionStrategy=/CMYK \
   -dProcessColorModel=/DeviceCMYK original-image.pdf
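
If you have many figures to convert, the two Ghostscript calls above can be wrapped in a small helper. This is only a sketch: the function names gs_convert_command and convert are made up for illustration, the flags are copied verbatim from the commands above, and it assumes the gs executable is on your PATH.

```python
import subprocess


def gs_convert_command(src, dst, colors="rgb"):
    """Build the Ghostscript argument list shown above (hypothetical helper)."""
    if colors == "rgb":
        strategy, model = "/sRGB", "/DeviceRGB"
    elif colors == "cmyk":
        strategy, model = "/CMYK", "/DeviceCMYK"
    else:
        raise ValueError("colors must be 'rgb' or 'cmyk'")
    # Passing a list avoids shell quoting problems with odd file names.
    return [
        "gs", "-o", dst, "-sDEVICE=pdfwrite",
        "-dColorConversionStrategy=" + strategy,
        "-dProcessColorModel=" + model,
        src,
    ]


def convert(src, dst, colors="rgb"):
    """Run Ghostscript; raises CalledProcessError if gs reports a failure."""
    subprocess.check_call(gs_convert_command(src, dst, colors))
```

Calling convert("original-image.pdf", "fixed-image.pdf") should be equivalent to the first command, and passing colors="cmyk" to the second. As with the commands themselves, keep the output name different from the input name.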

P.S. Some programs generate "page groups" in PDF files, for example when you superimpose different images/objects in Illustrator or Inkscape. It seems that pdfTeX is unable to handle multiple page groups in a single output page. The reason might be that each page group specifies a different color space or transparency space.


Martin Schröder has done a wonderful job of explaining the underlying cause, so I won’t repeat that here. Other than telling pdfLaTeX to shut up, the solution would be to remove/strip the page groups from the PDF inputs. However, all the proposed solutions seem to suffer from one of these problems:

  • Lossy: Ghostscript-related solutions apparently rasterize the PDF file, which defeats the whole point of using PDF figures! I’m very picky when it comes to image quality, so this direction is a no-go.
  • Fragile: Naively doing a find-and-replace (i.e. sed) to fix a PDF file is probably not a good idea. This could corrupt a PDF file.

Turns out there’s this neat open-source tool called QPDF, which can “unpack” PDF files into a regular, quasi-textual format, affectionately named “QDF”. After running this tool, it was really easy to identify the page group inside the QDF file using a plain text editor. A fragment is shown below:

%% Page 1
%% Original object ID: 5 0
4 0 obj
<<
  /Contents 5 0 R
  /Group <<
    /CS /DeviceRGB
    /I true
    /S /Transparency
    /Type /Group
  >>
  /MediaBox [
    0
    0
    460.799988
    345.600006
  ]
  /Parent 3 0 R
  /Resources 7 0 R
  /Type /Page
>>

Mine was created in Inkscape. Yours may be a bit different. Notice the /Group << … >> dictionary. This is what needs to be removed. This can be automated using a Python script:

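import sys

# Use the binary streams on Python 3; fall back to the plain ones on Python 2.
stdin = getattr(sys.stdin, "buffer", sys.stdin)
stdout = getattr(sys.stdout, "buffer", sys.stdout)
stderr = getattr(sys.stderr, "buffer", sys.stderr)

# Copy lines through until the first "/Group <<" dictionary, then collect
# that dictionary's lines instead of writing them.
page_group = None
for line in stdin:
    if page_group is None:
        if line.rstrip() == b"  /Group <<":
            page_group = [line]
        else:
            stdout.write(line)
    else:
        page_group.append(line)
        if line.rstrip() == b"  >>":
            break
else:
    # Input ended without a closing ">>": put the collected lines back.
    if page_group:
        stdout.write(b"".join(page_group))
        page_group = None
# Copy the remainder of the file unchanged.
for line in stdin:
    stdout.write(line)
stdout.flush()

# Report what was removed (or that nothing was found) on stderr.
if page_group:
    stderr.write(b"".join(page_group))
else:
    stderr.write(b"note: did not find page group\n")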

Save that script to, say, strip_page_group.py and then chain all the commands together:

qpdf --qdf input.pdf - | python strip_page_group.py | fix-qdf >output.pdf

Note 1: Make sure the output filename (output.pdf) is different from the input filename (input.pdf) or you’ll lose the PDF file entirely!

Note 2: If you need deterministic output, supply qpdf with the --deterministic-id option.
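
Note 3: The script above stops after the first /Group dictionary it finds. For multi-page inputs, a variant along the following lines might help. It is only a sketch under assumptions: the name strip_page_groups is made up, and it relies on QDF marking each page object with a "%% Page N" comment and indenting top-level keys by two spaces, as in the fragment shown earlier. It has not been checked against QDF output from other tools or QPDF versions.

```python
def strip_page_groups(qdf_lines):
    """Remove the /Group dictionary from every page object (hypothetical helper).

    qdf_lines is an iterable of bytes lines in QDF form.
    Returns (kept_lines, removed_groups).
    """
    kept, removed = [], []
    current = None      # lines of the /Group dictionary being collected
    in_page = False     # True between a "%% Page N" comment and "endobj"
    for line in qdf_lines:
        stripped = line.rstrip()
        if current is not None:
            current.append(line)
            if stripped == b"  >>":
                removed.append(b"".join(current))
                current = None
            continue
        if line.startswith(b"%% Page "):
            in_page = True
        elif stripped == b"endobj":
            in_page = False
        if in_page and stripped == b"  /Group <<":
            current = [line]
        else:
            kept.append(line)
    if current is not None:
        # Unterminated dictionary: keep the lines rather than lose data.
        kept.extend(current)
    return kept, removed
```

On Python 3 it could take the place of strip_page_group.py in the pipeline by passing sys.stdin.buffer as qdf_lines and writing the kept lines with sys.stdout.buffer.writelines. Restricting the match to page objects is a deliberate choice, so that /Group entries of other objects (for example Form XObjects) are left alone.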