How do you programmatically redact PDF files?

In order to properly redact a PDF, you need to Alter The Content Stream. This is Very Hard.

If you can find the portion of the content stream that draws the text you want removed, you're halfway there.
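As a rough illustration of that first half, here is a minimal Python sketch that scans an already-decompressed content stream for literal strings followed directly by a Tj operator. The function name `find_tj_strings` is made up for this example. It assumes the stream has already been decoded to text, only understands backslash-escaped characters in the simplest way (octal escapes such as \226 and escapes like \n are not decoded), and ignores TJ arrays, hex strings, comments and inline images, so treat it as a starting point rather than a parser.

```python
import re

# Matches optional whitespace followed by the Tj operator.
_TJ_AFTER = re.compile(r"\s*Tj\b")


def find_tj_strings(stream):
    """Return (offset, text) for each literal string followed directly by Tj."""
    hits = []
    i, n = 0, len(stream)
    while i < n:
        if stream[i] != "(":
            i += 1
            continue
        start, depth, chars = i, 1, []
        i += 1
        while i < n and depth:
            ch = stream[i]
            if ch == "\\" and i + 1 < n:
                # Simplification: keep the escaped character verbatim.
                chars.append(stream[i + 1])
                i += 2
                continue
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    i += 1
                    break
            chars.append(ch)
            i += 1
        if _TJ_AFTER.match(stream, i):
            hits.append((start, "".join(chars)))
    return hits
```

Each hit gives the offset of the opening parenthesis, so you know where in the stream the edit would have to happen.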

The other half is figuring out how to change the content stream such that you don't modify the rest of the document. If the next text draw operator is preceded by a "Tm" operator (set the text matrix, which absolutely positions the next piece of text), it's easy. If not... you have to calculate the exact width of the text you're replacing (several different PDF libraries can do this), and alter the drawing commands to skip over that much stuff.

For example:

BT
/F1 10 Tf
1 0 0 1 30 720 Tm
(Here's some text, and you only want to REDACT that upper case "redact" over there)Tj
T*
(This text is positioned relative to the previous line)Tj
1 0 0 1 30 650 Tm
(This text is positioned absolutely, starting at 30, 650)Tj

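So you'd have to break up that first (...)Tj line into (Here's some text, and you only want to)Tj, N 0 Td, and (that upper case "redact" over there)Tj... where the 'N' properly adjusts the position of the following text drawing operation such that it lands in EXACTLY THE SAME SPOT. So you'd need to know the precise width of " REDACT " using the font resource /F1 (whatever that turned out to be), sized to 10 points. One catch: Td offsets from the start of the current line, not from where the last string ended, so N also has to include the width of the text you kept in front of the gap, and the relatively positioned line that follows will inherit that shift unless you undo it.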
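Here is a minimal Python sketch of the width arithmetic, under some loud assumptions. The glyph widths below happen to be the standard Helvetica values, but /F1 could be any font, so the table is a stand-in for whatever the font's real widths are. Character spacing, word spacing and horizontal scaling are assumed to be at their defaults. Instead of Td, the sketch emits a TJ array with a numeric adjustment, which is a swapped-in alternative: TJ numbers are in thousandths of a text-space unit, are scaled by the font size automatically, and leave the line start untouched. All function names are made up for this example.

```python
# ASSUMPTION: widths in 1/1000 text-space units, standing in for /F1's real ones.
ASSUMED_WIDTHS = {" ": 278, "A": 667, "C": 722, "D": 722, "E": 667, "R": 722, "T": 611}


def glyph_units(text, widths):
    """Total advance of `text` in thousandths of a text-space unit."""
    return sum(widths[ch] for ch in text)


def width_in_points(text, widths, font_size):
    """Advance of `text` at `font_size`, ignoring Tc, Tw and Tz."""
    return glyph_units(text, widths) * font_size / 1000


def escape_pdf_literal(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def redact_tj(text, secret, widths):
    """Rewrite `(text) Tj` as a TJ array with a gap where `secret` was."""
    before, found, after = text.partition(secret)
    if not found:
        raise ValueError("secret not present in this string")
    gap = glyph_units(secret, widths)
    # A negative TJ number moves the following string to the right.
    return "[(%s) %d (%s)] TJ" % (
        escape_pdf_literal(before), -gap, escape_pdf_literal(after))


print(width_in_points(" REDACT ", ASSUMED_WIDTHS, 10))
print(redact_tj("you only want to REDACT that upper case", " REDACT ", ASSUMED_WIDTHS))
# second line printed: [(you only want to) -4667 (that upper case)] TJ
```

With these assumed widths, " REDACT " comes to 4667 glyph units, or about 46.67 points at 10 points. The points figure is what a Td-based rewrite would need; the TJ form only needs the glyph units.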

Just to make your life more exciting, you have to worry about kerned text too. You can provide little spacing adjustments inline with text thusly:

(This is taken from the first text drawn in the PDF Spec)

[(Adobe Sys)5(t)1(ems Inc)5(orporated)5( 20)5(08 \226 All rights)5( reser)-9(ved)]TJ

To properly redact "Incorporated", you need to determine that it's been split across two strings, and adjust the positioning of the string following it so it's in Exactly The Same Spot.
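To make the bookkeeping concrete, here is a hedged Python sketch that works on a TJ array already parsed into a list of strings and numbers. It removes the target even when it spans several strings, folds any kerning numbers that sat inside the removed run into the total, and leaves a single number that reproduces the same advance. The function name and the `glyph_width` callback are hypothetical, it assumes one character per glyph (a simple single-byte font), and it only handles the first occurrence.

```python
def redact_in_tj_array(elements, target, glyph_width):
    """Replace `target` in a parsed TJ array with one equivalent adjustment.

    `elements` mixes str (shown text) and numbers (TJ adjustments).
    `glyph_width(ch)` returns a width in 1/1000 text-space units (assumed).
    """
    text = "".join(e for e in elements if isinstance(e, str))
    start = text.find(target)
    if start < 0:
        return list(elements)
    end = start + len(target)

    out = []
    pos = 0        # index into the concatenated text
    removed = 0    # advance of everything cut so far, in glyph units
    for e in elements:
        if isinstance(e, str):
            lo = min(max(start - pos, 0), len(e))
            hi = min(max(end - pos, 0), len(e))
            before, cut, after = e[:lo], e[lo:hi], e[hi:]
            if before:
                out.append(before)
            removed += sum(glyph_width(ch) for ch in cut)
            if cut and pos + hi == end:
                # Last removed glyph: emit the gap (negative moves right).
                out.append(-removed)
            if after:
                out.append(after)
            pos += len(e)
        elif start < pos < end:
            # Adjustment between two removed glyphs: it reduced the advance.
            removed -= e
        else:
            out.append(e)
    return out


parts = ["Adobe Sys", 5, "t", 1, "ems Inc", 5, "orporated", 5, " 20"]
# ASSUMPTION: every glyph is 500 units wide, purely to keep the sums readable.
print(redact_in_tj_array(parts, "Incorporated", lambda ch: 500))
# prints ['Adobe Sys', 5, 't', 1, 'ems ', -5995, 5, ' 20']
```

The 5995 is twelve glyphs at the assumed 500 units each, minus the 5 that used to sit between "Inc" and "orporated". Two numbers in a row inside a TJ array are fine, so the surviving 5 after the gap is left alone.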

And strings can be <DEADBEEF> hex values rather than (plain old ASCII).
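Turning a hex string back into bytes is the easy part, and a small Python sketch shows it. Whitespace between digits is ignored, and an odd number of digits is padded with a trailing 0, as the PDF format defines. The function name is made up for this example, and what those bytes mean as characters still depends on the font's encoding, which this does not attempt to resolve.

```python
def decode_pdf_hex_string(token):
    """Decode a PDF hex string token such as <DEADBEEF> into raw bytes."""
    body = token.strip()
    if body.startswith("<<") or not (body.startswith("<") and body.endswith(">")):
        raise ValueError("not a PDF hex string")
    digits = "".join(body[1:-1].split())  # whitespace is insignificant
    if len(digits) % 2:
        digits += "0"                     # odd length: final digit padded with 0
    return bytes.fromhex(digits)


print(decode_pdf_hex_string("<DEADBEEF>"))
```

The same redaction logic then has to run on the decoded bytes and be re-encoded afterwards, either back to hex or as a literal string.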

Get the idea? And I haven't covered all the possibilities here, just the most common ones.

Like I said: This is Very Hard.


There's an Acrobat plugin called Appligent Redax (no connection) that lets you draw annotations (or generate them via templates, regex, etc.) and then run their code to handle the redaction. It should be possible to programmatically create their annotations and perhaps even activate their plugin: JS in a document can run a menu item.


Here's a web page that goes through what you need to do. As others mentioned, you have to do this in JavaScript, as that's Acrobat's native scripting language.

http://acrobatusers.com/tutorials/2008/07/auto_redaction_with_javascript

While I use Acrobat regularly, I've surprisingly never had a need to script it. I checked its AppleScript dictionary, and it looks like you'll have to write a JavaScript file, save it, and then run it from AppleScript if that's what you want to do (say, as a service).

tell application "Adobe Acrobat Professional"
   do script "this.info.title;"
end tell

Here's Adobe's JavaScript for Acrobat documentation:

http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=JavaScript_SectionPage.70.1.html


You can use GroupDocs.Redaction for .NET to programmatically redact text in PDF documents. It supports exact phrase, case-sensitive, and regular expression redaction. This is how you can perform an exact phrase redaction:

using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
     doc.RedactWith(new ExactPhraseRedaction("candy", new ReplacementOptions("[redacted]")));
     // Save the document to "*_Redacted.*" file.
     doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false }); 
} 

Disclosure: I work as Developer Evangelist at GroupDocs.