How can I distinguish a digitally-created PDF from a searchable PDF?

With PyMuPDF you can easily remove all text as is required for @ypnos' suggestion.

As an alternative, PyMuPDF also lets you check whether the text in a PDF is hidden. In the PDF content-stream "mini-language", hidden text is switched on by the operator 3 Tr (Tr sets the "text render mode", and mode 3 means neither fill nor stroke; e.g. see page 402 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf). So if all text on a page is under the influence of this operator, none of it will be rendered, which allows the conclusion "this is an OCR'ed page".
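As a rough illustration of the hidden-text check, here is a minimal sketch that scans an already decoded content stream for Tr operators and reports whether every one of them selects mode 3. The helper names (text_render_modes, all_text_hidden) are made up for this example, and the regex is deliberately naive: it ignores text drawn before the first Tr (where the default mode 0 applies), graphics-state save/restore via q/Q, Form XObjects, and the chance that "3 Tr" appears inside a string literal. I am assuming the stream bytes come from something like PyMuPDF's page.read_contents().

```python
import re

# Matches a text render mode operator such as b"3 Tr" in a decoded
# content stream. The lookbehind keeps "13 Tr" or "1.3 Tr" from
# matching on their last digit.
TR_PATTERN = re.compile(rb"(?<![\w.])([0-7])\s+Tr\b")


def text_render_modes(stream: bytes) -> list:
    """Return every render mode set by a Tr operator, in stream order."""
    return [int(m.group(1)) for m in TR_PATTERN.finditer(stream)]


def all_text_hidden(stream: bytes) -> bool:
    """Naive heuristic: True if at least one Tr is present and all are mode 3."""
    modes = text_render_modes(stream)
    return bool(modes) and all(mode == 3 for mode in modes)


# Assumed usage with PyMuPDF (not executed here):
#   import fitz
#   page = fitz.open("some.pdf")[0]            # hypothetical file name
#   hidden = all_text_hidden(page.read_contents())
```

A page for which all_text_hidden returns True is a candidate for "scanned image plus OCR layer"; a page with no Tr operator at all returns False, because its text is drawn in the default visible mode.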


I modified this answer from "How to check if PDF is scanned image or contains text".

In this solution you don't have to render the PDF, so I would guess it is faster. Basically, the answer I modified used the percentage of the page area covered by text to determine whether a PDF is a text document or a scanned document (image).

I added similar reasoning for images: the code sums the area covered by images and divides it by the page area. If a page is mostly covered by images, you can assume it is a scanned document. You can move the thresholds around to fit your document collection.
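To make the thresholds easier to tune, the decision rule can be pulled out into its own function. This is only a sketch: classify_page and its parameter names min_text and max_img are names I made up, and the defaults mirror the 0.01 and 0.8 cut-offs used in the code in this answer.

```python
def classify_page(text_perc, img_perc, min_text=0.01, max_img=0.8):
    """Classify a page from the fraction of its area covered by text and images.

    text_perc, img_perc: proportions of the page area (0.0 to 1.0, possibly
    higher if blocks overlap).
    min_text, max_img: tunable thresholds (assumed defaults: 0.01 and 0.8).
    """
    if text_perc < min_text:   # (almost) no text: plain scan
        return "Scanned"
    if img_perc > max_img:     # text plus a page-sized image: OCR'ed scan
        return "Searchable text"
    return "Digitally created"


# Lowering max_img makes the rule more eager to call a page an OCR'ed scan.
print(classify_page(0.3, 0.7))               # Digitally created
print(classify_page(0.3, 0.7, max_img=0.6))  # Searchable text
```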

I also added logic to check page by page. This is because at least in the document collection I have, some documents might have a digitally created first page and then the rest is scanned.

Modified code:

import fitz  # pip install PyMuPDF

def page_type(page):
    """Return "Scanned", "Searchable text" or "Digitally created" for one page."""

    page_area = abs(page.rect)  # Total page area

    # Sum the area of all image blocks. "dict" is enough here; "rawdict" adds
    # per-character detail that is not needed. Older PyMuPDF releases spell
    # this method page.getText(...).
    img_area = 0.0
    for block in page.get_text("dict")["blocks"]:
        if block["type"] == 1:  # type 1 = image block
            bbox = block["bbox"]
            img_area += (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # width * height
    img_perc = img_area / page_area
    print("Image area proportion: " + str(img_perc))

    # Sum the area of all text blocks (formerly page.getTextBlocks()).
    text_area = 0.0
    for b in page.get_text("blocks"):
        r = fitz.Rect(b[:4])  # Rectangle where block text appears
        text_area += abs(r)
    text_perc = text_area / page_area
    print("Text area proportion: " + str(text_perc))

    if text_perc < 0.01:  # No text = scanned
        return "Scanned"
    elif img_perc > 0.8:  # Has text but very large images = searchable
        return "Searchable text"
    else:
        return "Digitally created"


doc = fitz.open(pdffilepath)  # pdffilepath: path to the PDF to inspect

for page in doc:  # Iterate through pages to find different types
    print(page_type(page))
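Since a document can mix page types (for example a digitally created cover page followed by scans), it can help to collapse the per-page labels into runs of consecutive pages. A minimal sketch, where page_runs is a hypothetical helper name and the labels are the strings returned by page_type:

```python
from itertools import groupby


def page_runs(labels):
    """Collapse per-page labels into (label, first_page, last_page) runs.

    Page numbers are 1-based; labels can be any iterable of strings.
    """
    runs = []
    page_no = 1
    for label, group in groupby(labels):
        count = len(list(group))  # number of consecutive pages with this label
        runs.append((label, page_no, page_no + count - 1))
        page_no += count
    return runs


# Example with hand-written labels instead of a real PDF:
for label, first, last in page_runs(["Digitally created", "Scanned", "Scanned"]):
    print(f"pages {first}-{last}: {label}")
```

With a real document this could be called as page_runs(page_type(p) for p in doc).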
