Python Data Extraction from an Encrypted PDF

LAST UPDATED 10-11-2019

I'm unsure if I understand your question completely. The code below can be refined, but it reads in either an encrypted or unencrypted PDF and extracts the text. Please let me know if I misunderstood your requirements.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def extract_encrypted_pdf_text(path, encryption_true, decryption_password):
  output = StringIO()

  resource_manager = PDFResourceManager()
  laparams = LAParams()

  device = TextConverter(resource_manager, output, codec='utf-8', laparams=laparams)
  interpreter = PDFPageInterpreter(resource_manager, device)

  # an empty set means "process every page"
  page_numbers = set()

  # only pass the password when the file is encrypted
  password = decryption_password if encryption_true else ''

  with open(path, 'rb') as pdf_infile:
    for page in PDFPage.get_pages(pdf_infile, page_numbers, maxpages=0, password=password, caching=True, check_extractable=True):
      interpreter.process_page(page)

  text = output.getvalue()
  device.close()
  output.close()
  return text

results = extract_encrypted_pdf_text('encrypted.pdf', True, 'password')
print(results)

I noted that your pikepdf code used to open an encrypted PDF was missing a password, which should have thrown this error message:

pikepdf._qpdf.PasswordError: encrypted.pdf: invalid password

import pikepdf

with pikepdf.open("encrypted.pdf", password='password') as pdf:
  num_pages = len(pdf.pages)
  del pdf.pages[-1]
  pdf.save("decrypted.pdf")

You can use tika to extract the text from the decrypted.pdf created by pikepdf.

from tika import parser

parsedPDF = parser.from_file("decrypted.pdf")
pdf = parsedPDF["content"]
pdf = pdf.replace('\n\n', '\n')

Additionally, pikepdf does not currently implement text extraction; this includes the latest release, v1.6.4.
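
The clean-up step in the tika snippet above can be made a little more defensive. The sketch below assumes that parsedPDF["content"] can be None when tika finds no text (I have seen this with image-only PDFs), and it collapses any run of blank lines rather than only pairs. The helper name clean_tika_content is my own and is not part of tika.

```python
import re


def clean_tika_content(content):
    """Collapse runs of blank lines in text returned by tika.

    `content` is assumed to be parsedPDF["content"], which may be None
    when no text could be extracted.
    """
    if content is None:
        return ''
    # Replace two or more consecutive newlines with a single newline,
    # then drop leading and trailing whitespace.
    return re.sub(r'\n{2,}', '\n', content).strip()


# Usage with the snippet above (requires tika and a Java runtime):
#   parsedPDF = parser.from_file("decrypted.pdf")
#   print(clean_tika_content(parsedPDF["content"]))
```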


I decided to run a couple of tests using various encrypted PDF files.

I named all the encrypted files 'encrypted.pdf' and they all used the same encryption and decryption password.

  1. Adobe Acrobat 9.0 and later - encryption level 256-bit AES

    • pikepdf was able to decrypt this file
    • PyPDF2 could not extract the text correctly
    • tika could extract the text correctly
  2. Adobe Acrobat 6.0 and later - encryption level 128-bit RC4

    • pikepdf was able to decrypt this file
    • PyPDF2 could not extract the text correctly
    • tika could extract the text correctly
  3. Adobe Acrobat 3.0 and later - encryption level 40-bit RC4

    • pikepdf was able to decrypt this file
    • PyPDF2 could not extract the text correctly
    • tika could extract the text correctly
  4. Adobe Acrobat 5.0 and later - encryption level 128-bit RC4

    • created with Microsoft Word
    • pikepdf was able to decrypt this file
    • PyPDF2 could extract the text correctly
    • tika could extract the text correctly
  5. Adobe Acrobat 9.0 and later - encryption level 256-bit AES

    • created using pdfprotectfree
    • pikepdf was able to decrypt this file
    • PyPDF2 could extract the text correctly
    • tika could extract the text correctly

PyPDF2 was able to extract text from decrypted PDF files not created with Adobe Acrobat.

I would assume that the failures have something to do with embedded formatting in the PDFs created by Adobe Acrobat. More testing is required to confirm this conjecture about the formatting.

tika was able to extract text from all the documents decrypted with pikepdf.
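
To make a comparison like the one above repeatable, a small harness could run each extractor against the same file and record the outcome. This is only a sketch: run_extractors and the extractor callables are hypothetical names, and "returned some non-empty text" is a much weaker check than the manual "extracted the text correctly" judgment I used in the tests above.

```python
def run_extractors(path, extractors):
    """Run each extractor on `path` and summarise the outcome.

    `extractors` maps a label (for example 'PyPDF2' or 'tika') to a
    callable that takes a file path and returns the extracted text.
    """
    results = {}
    for name, extract in extractors.items():
        try:
            text = extract(path)
        except Exception as error:
            # Record the failure instead of stopping the whole run.
            results[name] = f'error: {type(error).__name__}: {error}'
            continue
        results[name] = 'text extracted' if text and text.strip() else 'no text'
    return results
```

Each callable would wrap one of the extraction snippets in this answer, changed to return the text instead of printing it.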


import pikepdf

with pikepdf.open("encrypted.pdf", password='password') as pdf:
  num_pages = len(pdf.pages)
  del pdf.pages[-1]
  pdf.save("decrypted.pdf")


from PyPDF2 import PdfFileReader

def text_extractor(path):
  with open(path, 'rb') as f:
    pdf = PdfFileReader(f)
    # getPage() is zero-based, so this reads the second page
    page = pdf.getPage(1)
    print('Page type: {}'.format(str(type(page))))
    text = page.extractText()
    print(text)

text_extractor('decrypted.pdf')

PyPDF2 cannot decrypt Acrobat PDF files >= 6.0

This issue has been open with the module owners since September 15, 2015. It is unclear from the comments on the issue when the project owners will fix this problem. The last commit was on June 25, 2018.

PyPDF4 decryption issues

PyPDF4 is the replacement for PyPDF2. This module also has decryption issues with certain algorithms used to encrypt PDF files.

test file: Adobe Acrobat 9.0 and later - encryption level 256-bit AES

PyPDF2 error message: only algorithm code 1 and 2 are supported

PyPDF4 error message: only algorithm code 1 and 2 are supported. This PDF uses code 5
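
As far as I can tell, the "algorithm code" in these messages is the /V entry of the PDF encryption dictionary. The sketch below maps the values I know from the PDF specification to a short description and flags the ones the error message says are supported. The names ALGORITHM_CODES, PYPDF_SUPPORTED_CODES and describe_algorithm are mine, and the table is a summary for illustration, not code taken from PyPDF2 or PyPDF4.

```python
# /V values of the PDF encryption dictionary, summarised from the PDF
# specification (ISO 32000). Illustration only, not PyPDF code.
ALGORITHM_CODES = {
    1: 'RC4, 40-bit key',
    2: 'RC4, key length 40 to 128 bits',
    4: 'crypt filters (RC4 or AES, 128-bit)',
    5: 'AES, 256-bit key',
}

# Taken from the error message: "only algorithm code 1 and 2 are supported".
PYPDF_SUPPORTED_CODES = {1, 2}


def describe_algorithm(code):
    """Describe an algorithm code and whether PyPDF2/PyPDF4 report support."""
    description = ALGORITHM_CODES.get(code, 'unknown')
    supported = 'yes' if code in PYPDF_SUPPORTED_CODES else 'no'
    return f'code {code}: {description}; PyPDF2/PyPDF4 support: {supported}'


print(describe_algorithm(5))
```

This would explain why the 256-bit AES test file (code 5) is rejected by both modules.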


UPDATE SECTION 10-11-2019

This section is in response to your updates on 10-07-2019 and 10-08-2019.

In your update you stated that you could open a 'secured pdf with Adobe Reader' and print the document to another PDF, which removes the 'SECURED' flag. After doing some testing, I believe that I have figured out what is occurring in this scenario.

Adobe PDFs level of security

Adobe PDFs have multiple types of security controls that can be enabled by the owner of the document. The controls can be enforced with either a password or a certificate.

  1. Document encryption (enforced with a document open password)

    • Encrypt all document contents (most common)
    • Encrypt all document contents except metadata (Acrobat 6.0 and later)
    • Encrypt only file attachments (Acrobat 7.0 and later)
  2. Restrictive editing and printing (enforced with a permissions password)

    • Printing Allowed
    • Changes Allowed

The image below shows an Adobe PDF being encrypted with 256-bit AES encryption. A password is required to open or print this PDF. When you open this document in Adobe Reader with the password, the title will state SECURED.

password_level_encryption

This document requires a password to open with the Python modules mentioned in this answer. If you attempt to open an encrypted PDF with Adobe Reader, you should see this:

password_prompt

If you don't get this warning, then the document either has no security controls enabled or has only the restrictive editing and printing controls enabled.

The image below shows restrictive editing being enabled with a password in a PDF document. Note that printing is enabled. No password is required to open or print this PDF. When you open this document in Adobe Reader without a password, the title will state SECURED. This is the same warning shown for the encrypted PDF that was opened with a password.

When you print this document to a new PDF the SECURED warning is removed, because the restrictive editing has been removed.

password_level_restrictive_editing

All Adobe products enforce the restrictions set by the permissions password. However, if third-party products do not support these settings, document recipients are able to bypass some or all of the restrictions set.
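
In the PDF specification these restrictions are stored as bit flags in the /P entry of the encryption dictionary, which is why a tool that ignores the flags can bypass them. The sketch below decodes such a value. The labels in PERMISSION_BITS and the function name decode_permissions are my own, and the sample value is made up for illustration rather than read from a real file.

```python
# Bit positions (1-based) of the user access permissions in /P, as listed
# in the PDF specification. The keys are descriptive labels chosen for
# this example.
PERMISSION_BITS = {
    'print': 3,
    'modify': 4,
    'copy_or_extract': 5,
    'annotate': 6,
    'fill_forms': 9,
    'extract_for_accessibility': 10,
    'assemble': 11,
    'print_high_quality': 12,
}


def decode_permissions(p_value):
    """Return {label: allowed} for a /P value (a signed 32-bit integer)."""
    unsigned = p_value & 0xFFFFFFFF  # view the signed value as 32 raw bits
    return {
        label: bool(unsigned & (1 << (bit - 1)))
        for label, bit in PERMISSION_BITS.items()
    }


# Made-up sample value: only the "print" bit is set among the flags above.
for label, allowed in decode_permissions(-3900).items():
    print(f'{label}: {"allowed" if allowed else "restricted"}')
```

For a document like the one in the screenshot (printing allowed, changes restricted), I would expect the print flag to be set and the modify flag to be cleared.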

So I assume that the document that you are printing to PDF has restrictive editing enabled and does not have a password required to open enabled.

Concerning breaking PDF encryption

Neither PyPDF2 nor PyPDF4 is designed to break the document open password of a PDF document. Both modules will throw the following error if they attempt to open an encrypted, password-protected PDF file.

PyPDF2.utils.PdfReadError: file has not been decrypted

The opening password function of an encrypted PDF file can be bypassed using a variety of methods, but a single technique might not work and some will not be acceptable because of several factors, including password complexity.

PDF encryption internally works with encryption keys of 40, 128, or 256 bit depending on the PDF version. The binary encryption key is derived from a password provided by the user. The password is subject to length and encoding constraints.

For example, PDF 1.7 Adobe Extension Level 3 (Acrobat 9 - AES-256) introduced Unicode characters (65,536 possible characters) and bumped the maximum length to 127 bytes in the UTF-8 representation of the password.
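
To put these numbers in perspective, the sketch below computes the number of possible keys for each key length and checks a password against the 127-byte limit. Note that the limit applies to the UTF-8 bytes, so a password with non-ASCII characters reaches it with fewer characters. The function names are my own and nothing here comes from a PDF library.

```python
def key_space(bits):
    """Number of distinct keys for a key length given in bits."""
    return 2 ** bits


def fits_aes256_password_limit(password, max_bytes=127):
    """True if the UTF-8 encoding of `password` is at most `max_bytes` long."""
    return len(password.encode('utf-8')) <= max_bytes


for bits in (40, 128, 256):
    print(f'{bits}-bit key: {key_space(bits):.3e} possible keys')

# 64 two-byte characters already exceed the limit (128 bytes in UTF-8).
print(fits_aes256_password_limit('é' * 64))
```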


The code below will open a PDF with restrictive editing enabled. It will save this file to a new PDF without the SECURED warning being added. The tika code will parse the contents from the new file.

from tika import parser
import pikepdf

# opens a PDF with restrictive editing enabled, but that still 
# allows printing.
with pikepdf.open("restrictive_editing_enabled.pdf") as pdf:
  pdf.save("restrictive_editing_removed.pdf")

# plain text output
parsedPDF = parser.from_file("restrictive_editing_removed.pdf")

# XHTML output
# parsedPDF = parser.from_file("restrictive_editing_removed.pdf", xmlContent=True)

content = parsedPDF["content"]
content = content.replace('\n\n', '\n')
print(content)

This code checks whether a password is required to open the file. It can be refined, and several other features could be added, but the pikepdf documentation does not match the comments in the code base, so more research is required to improve this.

# this would be removed once logging is used
############################################
import sys
sys.tracebacklimit = 0
############################################

import pikepdf
from tika import parser

def create_pdf_copy(pdf_file_name):
  with pikepdf.open(pdf_file_name) as pdf:
    new_filename = f'copy_{pdf_file_name}'
    pdf.save(new_filename)
    return  new_filename

def extract_pdf_content(pdf_file_name):
  # plain text output
  # parsedPDF = parser.from_file("restrictive_editing_removed.pdf")

  # XHTML output
  parsedPDF = parser.from_file(pdf_file_name, xmlContent=True)

  pdf = parsedPDF["content"]
  pdf = pdf.replace('\n\n', '\n')
  return pdf

def password_required(pdf_file_name):
  # returns an error string, or None when the file opens without a password
  try:
    with pikepdf.open(pdf_file_name):
      return None

  except pikepdf.PasswordError:
    return 'password required'

  except pikepdf.PdfError:
    return 'cannot open file'


filename = 'decrypted.pdf'
error = password_required(filename)
if error is not None:
  print(error)
else:
  pdf_file = create_pdf_copy(filename)
  results = extract_pdf_content(pdf_file)
  print(results)
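
One thing to watch in create_pdf_copy: prefixing the whole argument with copy_ only works for bare file names. With 'reports/decrypted.pdf' it yields 'copy_reports/decrypted.pdf', which points into a directory that probably does not exist. A possible fix using pathlib is sketched below; copy_filename is a hypothetical helper, not part of pikepdf.

```python
from pathlib import Path


def copy_filename(pdf_file_name):
    """Return the path for the copy, placed next to the original file."""
    source = Path(pdf_file_name)
    # Only the final path component gets the "copy_" prefix.
    return str(source.with_name(f'copy_{source.name}'))


print(copy_filename('decrypted.pdf'))
```

The result could then be used in place of the f-string when calling pdf.save().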

For tabula-py, you can try the password option of read_pdf. It depends on tabula-java's functionality, so I'm not sure which encryption types are supported.
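
A minimal sketch of that call, assuming tabula-py and a Java runtime are installed. The wrapper name read_encrypted_tables, the file name and the password are placeholders, and I have not checked how it behaves for each encryption type.

```python
def read_encrypted_tables(pdf_path, pdf_password):
    """Read tables from a password-protected PDF with tabula-py.

    tabula is imported lazily because it needs tabula-py and a Java
    runtime to be installed.
    """
    import tabula

    # `password` is handed through to tabula-java; pages='all' reads
    # every page instead of only the first one.
    return tabula.read_pdf(pdf_path, pages='all', password=pdf_password)


# Example call (placeholders):
# tables = read_encrypted_tables('encrypted.pdf', 'password')
```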


You can handle the error that these files raise when they are opened without a password.

import pikepdf

def open_pdf(pdf_file_path, pdf_password=''):
    try:
        # first attempt: no password
        pdf_obj = pikepdf.Pdf.open(pdf_file_path)

    except pikepdf.PasswordError:
        # retry with the supplied password; a wrong password raises PasswordError again
        pdf_obj = pikepdf.Pdf.open(pdf_file_path, password=pdf_password)

    return pdf_obj

You can use the returned pdf_obj for your parsing work. Also, you can provide the password in case you have an encrypted PDF.