How to extract text from a PDF file with Apache PDFBox

PDFBox 2.0.3 also ships with a command-line tool.

  1. Download the pdfbox-app-2.0.3.jar file.
  2. Run: java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]
  -password  <password>        : Password to decrypt document
  -encoding  <output encoding> : UTF-8 (default) or ISO-8859-1, UTF-16BE, UTF-16LE, etc.
  -console                     : Send text to console instead of file
  -html                        : Output in HTML format instead of raw text
  -sort                        : Sort the text before writing
  -ignoreBeads                 : Disables the separation by beads
  -debug                       : Enables debug output about the time consumption of every stage
  -startPage <number>          : The first page to start extraction(1 based)
  -endPage <number>            : The last page to extract(inclusive)
  <inputfile>                  : The PDF document to use
  [output-text-file]           : The file to write the text to
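
If you want to call the command-line tool from a program, the arguments above can be assembled into a list and handed to ProcessBuilder. This is only a sketch: the ExtractTextCommand class and its build method are hypothetical names of mine, not part of PDFBox, and it assumes java is on the PATH and the jar sits in the working directory. It only uses flags from the option list above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): assembles the ExtractText command line.
final class ExtractTextCommand {

    // Builds: java -jar <jar> ExtractText -encoding UTF-8 -startPage N -endPage M <input> <output>
    static List<String> build(String jarPath, File input, File output, int startPage, int endPage) {
        // -startPage is 1-based and -endPage is inclusive, so reject impossible ranges early
        if (startPage < 1 || endPage < startPage) {
            throw new IllegalArgumentException("pages are 1-based and endPage must be >= startPage");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("ExtractText");
        cmd.add("-encoding");
        cmd.add("UTF-8");
        cmd.add("-startPage");
        cmd.add(String.valueOf(startPage));
        cmd.add("-endPage");
        cmd.add(String.valueOf(endPage));
        cmd.add(input.getPath());
        cmd.add(output.getPath());
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdfbox-app-2.0.3.jar", new File("test.pdf"), new File("test.txt"), 1, 3);
        System.out.println(String.join(" ", cmd));
        // To actually run it (assumes java and the jar are available):
        // int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when the PDF path contains spaces.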

Using PDFBox 2.0.7, this is how I get the text of a PDF:

static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}

Call it like this:

try {
    String text = getText(new File("/home/me/test.pdf"));
    System.out.println("Text in PDF: " + text);
} catch (IOException e) {
    e.printStackTrace();
}

Since user oivemaria asked in the comments:

You can use PDFBox in your application by adding it to your dependencies in build.gradle:

dependencies {
  // on Gradle 7+ use 'implementation' here; the 'compile' configuration was removed
  compile group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
}

Here's more on dependency management using Gradle.

If you want to keep the PDF's format in the parsed text, give PDFLayoutTextStripper a try.

Maven dependency: groupId org.apache.pdfbox, artifactId pdfbox (use the version you need).

Then the function to get the PDF text as a String:

private static String readPDF(File pdf) throws InvalidPasswordException, IOException {
    try (PDDocument document = PDDocument.load(pdf)) {
        if (!document.isEncrypted()) {
            PDFTextStripper tStripper = new PDFTextStripper();
            String pdfFileInText = tStripper.getText(document);

            // split into lines (handles both \n and \r\n), then rejoin with \n
            String[] lines = pdfFileInText.split("\\r?\\n");
            StringBuilder sb = new StringBuilder();
            for (String line : lines) {
                sb.append(line).append("\n");
            }
            return sb.toString();
        }
    }
    // encrypted documents are skipped
    return null;
}
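
To see what the split-and-rejoin step does on its own, here is a small sketch without PDFBox. LineNormalizer is a hypothetical name used only for illustration; it applies the same "\\r?\\n" split to a plain string.

```java
// Hypothetical helper illustrating the split-and-rejoin step; not part of PDFBox.
final class LineNormalizer {

    // Splits on \n or \r\n and rejoins every line with a single \n.
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (String line : text.split("\\r?\\n")) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mixed Windows and Unix line endings in one string
        String mixed = "first\r\nsecond\nthird";
        System.out.print(normalize(mixed));
    }
}
```

Note that String.split drops trailing empty strings, so blank lines at the very end of the extracted text are lost; blank lines in the middle are kept.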

I ran your code and it worked. Your problem may be the file path you passed to File. I put my PDF on the C: drive and hard-coded the path. Here is my code:

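// PDFBox 2.0.8: PDFParser requires an org.apache.pdfbox.io.RandomAccessRead
// (for example org.apache.pdfbox.io.RandomAccessFile), not a plain InputStream.
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFReader {
    public static void main(String[] args) throws IOException {
        File file = new File("C:/my.pdf");
        PDFParser parser = new PDFParser(new RandomAccessFile(file, "r"));
        parser.parse(); // parse before asking for the document
        try (COSDocument cosDoc = parser.getDocument()) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            PDDocument pdDoc = new PDDocument(cosDoc);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        }
    }
}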
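
Since a wrong path is a likely cause, it may help to check the file before handing it to PDFBox. A rough sketch using only the JDK follows; PdfPathCheck and requireReadableFile are names I made up for illustration.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check; the class and method names are illustrative only.
final class PdfPathCheck {

    // Returns the absolute path if it points at a readable regular file, otherwise throws.
    static Path requireReadableFile(String location) throws IOException {
        Path path = Paths.get(location).toAbsolutePath();
        if (!Files.isRegularFile(path)) {
            throw new FileNotFoundException("Not a file: " + path);
        }
        if (!Files.isReadable(path)) {
            throw new IOException("File is not readable: " + path);
        }
        return path;
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the hard-coded path from the answer when no argument is given
        Path pdf = requireReadableFile(args.length > 0 ? args[0] : "C:/my.pdf");
        System.out.println("Ready to load: " + pdf);
    }
}
```

Printing the absolute path makes it obvious when a relative path resolves somewhere other than where you expected.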


