Converting DjVu to PDF and preserving the table of contents: how is it possible?

update: user3124688 has coded up this process in the script dpsprep.


I don't know of any tools that will do the conversion for you. You certainly should be able to do it, but it might take a little work. I'll outline the basic process. You'll need the open source command line utilities pdftk and djvused (part of DjVuLibre). These are available from your package manager (GNU/Linux) or their websites (Windows, OS X).

  • step 1: convert the file text

    First, use any tool to convert the DJVU file to a PDF (without bookmarks).

    Suppose the files are called filename.djvu and filename.pdf.

  • step 2: extract DJVU outline

    Next, output the DJVU outline data to a file, like this:

    djvused "filename.djvu" -e 'print-outline' > bmarks.out
    

    This file lists the DJVU document's bookmarks in a serialized tree format. In fact it's just a SEXPR (S-expression), and can be easily parsed. The format is as follows:

    file ::= (bookmarks
               <bookmark>*)
    bookmark ::= (name
                   page
                   <bookmark>*)
    name ::= "<character>*"
    page ::= "#<digit>+"
    

    For example:

    (bookmarks
      ("bmark1"
        "#1")
      ("bmark2"
        "#5"
        ("bmark2subbmark1"
          "#6")
        ("bmark2subbmark2"
          "#7"))
      ("bmark3"
        "#9"
        ...))
    
  • step 3: convert DJVU outline to PDF metadata format

    Now we need to convert these bookmarks into the format pdftk uses for PDF metadata. The bookmark section of that format looks like this:

    file ::= <entry>*
    entry ::= BookmarkBegin
              BookmarkTitle: <title>
              BookmarkLevel: <number>
              BookmarkPageNumber: <number>
    title ::= <character>*
    

    So our example would become:

     BookmarkBegin
     BookmarkTitle: bmark1
     BookmarkLevel: 1
     BookmarkPageNumber: 1
     BookmarkBegin
     BookmarkTitle: bmark2
     BookmarkLevel: 1
     BookmarkPageNumber: 5
     BookmarkBegin
     BookmarkTitle: bmark2subbmark1
     BookmarkLevel: 2
     BookmarkPageNumber: 6
     BookmarkBegin
     BookmarkTitle: bmark2subbmark2
     BookmarkLevel: 2
     BookmarkPageNumber: 7
     BookmarkBegin
     BookmarkTitle: bmark3
     BookmarkLevel: 1
     BookmarkPageNumber: 9
    

    Basically, you just need to write a script to walk the SEXPR tree, keeping track of the level, and output the name, page number and level of each entry it comes to, in the correct format.

  • step 4: extract PDF metadata and splice in converted bookmarks

    Once you've got the converted list, output the PDF metadata from your converted PDF file:

    pdftk "filename.pdf" dump_data > pdfmetadata.out
    

    Now open pdfmetadata.out, find the line that begins with NumberOfPages: and insert the converted bookmarks immediately after it. Save the result as pdfmetadata.in

  • step 5: create PDF with bookmarks

    Now we can create a new PDF file incorporating this metadata:

    pdftk "filename.pdf" update_info "pdfmetadata.in" output out.pdf
    

    The file out.pdf should be a copy of your PDF with the bookmarks imported from the DJVU file.
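
As an illustration of steps 3 and 4, here is a minimal Python sketch of the tree walk and the splice. It is not the dpsprep implementation. The function names (parse_outline, to_pdftk, splice, convert_files) are made up for this example. It assumes page references have the "#<digit>+" form shown above and only unescapes \" and \\ in titles, so any other escape sequences in the djvused output are left as they are.

```python
import re

# Tokens: "(", ")", a double-quoted string (backslash escapes allowed), or a bare word.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse_outline(text):
    """Parse `djvused -e print-outline` output into nested Python lists."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')' in outline")
            done = stack.pop()
            stack[-1].append(done)
        elif tok.startswith('"'):
            # Strip the quotes and undo \" and \\ (other escapes are kept as-is).
            stack[-1].append(re.sub(r'\\(["\\])', r"\1", tok[1:-1]))
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '(' in outline")
    # An empty file (document without an outline) yields an empty tree.
    return stack[0][0] if stack[0] else []


def to_pdftk(tree):
    """Walk the parsed tree and return pdftk bookmark lines, tracking the level."""
    if not tree:
        return []
    if tree[0] != "bookmarks":
        raise ValueError("expected a (bookmarks ...) expression")
    lines = []

    def walk(node, level):
        title, page, *children = node
        number = page.lstrip("#")
        if not number.isdigit():
            raise ValueError(f"unsupported page reference: {page!r}")
        lines.extend([
            "BookmarkBegin",
            f"BookmarkTitle: {title}",
            f"BookmarkLevel: {level}",
            f"BookmarkPageNumber: {number}",
        ])
        for child in children:
            walk(child, level + 1)

    for node in tree[1:]:
        walk(node, 1)
    return lines


def splice(metadata_lines, bookmark_lines):
    """Insert the bookmark lines right after the first NumberOfPages: line."""
    out, inserted = [], False
    for line in metadata_lines:
        out.append(line)
        if not inserted and line.startswith("NumberOfPages:"):
            out.extend(bookmark_lines)
            inserted = True
    if not inserted:
        raise ValueError("no 'NumberOfPages:' line found in metadata")
    return out


def convert_files(outline_path, metadata_path, out_path):
    """Read bmarks.out and pdfmetadata.out style files, write pdfmetadata.in."""
    with open(outline_path, encoding="utf-8") as f:
        bookmarks = to_pdftk(parse_outline(f.read()))
    with open(metadata_path, encoding="utf-8") as f:
        metadata = f.read().splitlines()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(splice(metadata, bookmarks)) + "\n")
```

With the file names used above, calling convert_files("bmarks.out", "pdfmetadata.out", "pdfmetadata.in") after step 2 and the dump_data command would produce the input for the update_info command in step 5.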


Based on the very clear outline above given by user @pyrocrasty (thank you!), I have implemented a DJVU to PDF converter which preserves both OCR'd text and the bookmark structure. You may find it here:

https://github.com/kcroker/dpsprep

Acknowledgements for the OCR data go to @zetah on the Ubuntu forums!