Best tool for text extraction from PDF in Python 3.4

You need to install PyPDF2 module to be able to work with PDFs in Python 3.4. PyPDF2 cannot extract images, charts or other media but it can extract text and return it as a Python string. To install it run pip install PyPDF2 from the command line. This module name is case-sensitive so make sure to type 'y' in lowercase and all other characters as uppercase.

>>> import PyPDF2
>>> pdfFileObj = open('my_file.pdf','rb')     #'rb' for read binary mode
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
56
>>> pageObj = pdfReader.getPage(9)          #'9' is the page number
>>> pageObj.extractText()

last statement returns all the text that is available in page-9 of 'my_file.pdf' document.


Complementing @Sarah's answer. PDFMiner is a pretty good choice. I have been using it from quite some time, and until now it works pretty good on extracting the text content from a PDF. What I did is to create a function which uses the CLI client from pdfminer, and then it saves the output into a variable (which I can use later on somewhere else). The Python version I am using is 3.6, and the function works pretty good and does the required job, so maybe this can work for you:

def pdf_to_text(filepath):
    print('Getting text content for {}...'.format(filepath))
    process = subprocess.Popen(['pdf2txt.py', filepath], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    stdout, stderr = process.communicate()

    if process.returncode != 0 or stderr:
        raise OSError('Executing the command for {} caused an error:\nCode: {}\nOutput: {}\nError: {}'.format(filepath, process.returncode, stdout, stderr))

    return stdout.decode('utf-8')

You will have to import the subprocess module of course: import subprocess


pdfminer.six ( https://github.com/pdfminer/pdfminer.six ) has also been recommended elsewhere and is intended to support Python 3. I can't personally vouch for it though, since it failed during installation MacOS. (There's an open issue for that and it seems to be a recent problem, so there might be a quick fix.)

Tags:

Pdf

Python 3.X