How to automatically find non-searchable PDFs

I'm not sure this is a 100% solution, but I came up with the following script, which should get you a good part of the way, if not the whole way (I have not gone through the PDF spec). Pass it the directory that contains the PDFs as its only argument; it searches subdirectories too.

#! /bin/bash

if [[ ! "$#" = "1" ]]
  then
      echo "Usage: $0 /path/to/PDFDirectory"
      exit 1
fi

PDFDIRECTORY="$1"

while IFS= read -r -d $'\0' FILE; do
    PDFFONTS_OUT="$(pdffonts "$FILE" 2>/dev/null)"
    RET_PDFFONTS="$?"
    if [[ ! "$RET_PDFFONTS" = "0" ]]
      then
          READ_ERROR=1
          echo "Error while reading $FILE. Skipping..."
          continue
    fi
    # pdffonts prints a two-line table header; every line after it is one font.
    FONTS="$(( $(echo "$PDFFONTS_OUT" | wc -l) - 2 ))"
    if [[ "$FONTS" = "0" ]]
      then
          echo "NOT SEARCHABLE: $FILE"
      else
          echo "SEARCHABLE: $FILE"
    fi
done < <(find "$PDFDIRECTORY" -type f -iname '*.pdf' -print0)

echo "Done."
if [[ "$READ_ERROR" = "1" ]]
  then
      echo "There were some errors."
fi

It works by counting the fonts specified in each PDF. If a file has no fonts, it is assumed to consist only of images. (This might trip up on password-protected files; I don't have any to test against.) If a document mixes searchable text with scanned images, this won't detect the image parts, but it should be useful for separating scanned image documents in a PDF container from "real" PDFs.
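For documents that mix searchable pages with scanned ones, the same font check could be applied one page at a time. This is only a sketch, not part of the script above: it assumes poppler-utils is installed, that pdfinfo prints a "Pages:" line, and that pdffonts accepts -f and -l to restrict the page range. The function name pages_without_fonts is made up for this example.

```bash
#!/bin/bash
# Print the page numbers of one PDF that declare no fonts
# (likely scanned images). Assumes pdfinfo and pdffonts from poppler-utils.
pages_without_fonts() {
    local file="$1" pages page rows
    # pdfinfo reports the page count on a line such as "Pages:  12".
    pages="$(pdfinfo "$file" 2>/dev/null | awk '/^Pages:/ { print $2 }')"
    [ -n "$pages" ] || return 1    # unreadable file or no page count
    page=1
    while [ "$page" -le "$pages" ]; do
        # Skip the two header lines and count the remaining non-empty rows.
        rows="$(pdffonts -f "$page" -l "$page" "$file" 2>/dev/null |
                awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }')"
        if [ "$rows" -eq 0 ]; then
            echo "$page"
        fi
        page=$((page + 1))
    done
}

# Hypothetical usage: pages_without_fonts mixed.pdf
```

A page that pdffonts cannot read is also reported as having no fonts, so treat the output as a list of candidates rather than a verdict.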

You can, of course, comment out the branch of the if-then-else statement that does not apply if you only want to print the files that are not searchable.


I will use a trick based on a side effect I noticed: if a PDF file doesn't have any fonts, it is usually not searchable. Knowing this, we can use pdffonts.
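As a rough illustration of that idea, the sketch below counts the font rows in pdffonts-style output read from standard input. It assumes the output starts with a column-title line followed by a dashed underline; the function name count_font_rows is made up for this example.

```bash
#!/bin/bash
# Count font rows in pdffonts-style output read from stdin.
# Assumes lines 1 and 2 are the column titles and the dashed underline.
count_font_rows() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Hypothetical usage against a real file:
#   pdffonts scan.pdf | count_font_rows
```

A result of 0 suggests the file has no fonts and is probably a scanned image.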

The first two lines of pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create a script:

gedit check_pdf_searchable.sh

then paste this

#!/bin/bash
#set -vx
# Two lines of output are just the pdffonts table header, i.e. no fonts.
if (( $(pdffonts "$1" | wc -l) < 3 )); then
    echo "$1"
    pypdfocr "$1" # alternatively you can use ocrmypdf "$1" "${1}_ocr.pdf"
fi

then make it executable

chmod +x check_pdf_searchable.sh

then list all non-searchable PDFs in the current directory (the script also runs OCR on each one it finds):

ls -1 ./*.pdf | xargs -L1 -I {} ./check_pdf_searchable.sh {}

or in the directory and its subdirectories:

tree -fai . | grep -P '\.pdf$' | xargs -L1 -I {} ./check_pdf_searchable.sh {}
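Pipelines through xargs can misbehave on file names that contain quotes or backslashes. A possible alternative, sketched here under the assumption that your find supports -iname (GNU and BSD find do), is to let find call the checker directly. The helper name run_on_pdfs is made up for this example.

```bash
#!/bin/bash
# Run a command once per PDF found under a directory (recursively).
# find passes each path as a single argument, so spaces and quotes
# in file names are preserved.
run_on_pdfs() {
    local dir="$1"
    shift
    # "$@" is the command to run; {} is replaced by each matching path.
    find "$dir" -type f -iname '*.pdf' -exec "$@" {} \;
}

# Hypothetical usage: run_on_pdfs . ./check_pdf_searchable.sh
```

Because -iname is case-insensitive, this also picks up files ending in .PDF.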
