How to detect malicious JavaScript in a PDF file?

I did some additional searching and found an interesting research paper (easy to read and only 12 pages): Detecting Malicious JavaScript in PDF through Document Instrumentation.

In the paper, the authors introduce a context-aware approach that detects and confines malicious JavaScript in PDF files through static document instrumentation and runtime behavior monitoring.

The following quotes and figure show how their detection system approaches malicious PDF detection.

Detection architecture

Our system consists of two major components, front-end and back-end, working in two phases. In Phase-I, the front-end component statically parses the document, analyzes the structure, and finally instruments the PDF objects containing JavaScript. Then, in Phase-II when an instrumented document is opened, the back-end component detects suspicious behaviors of a PDF reader process in context of JavaScript execution and confines malicious attempts.

System Architecture

Phase-I Static Analysis and Instrumentation

For suspicious PDF, the front-end first parses the document structure and then decompresses the objects and streams. A set of static features are extracted in this process. When a document has been decompressed, the front-end will instrument it and add context monitoring code for JavaScript. In some cases, if the document is encrypted using an owner’s password, i.e., a mode of PDF in which the document is readable but non-modifiable, we need to remove the owner’s password. With the help of PDF password recovery tools like [28], this can be done easily and very fast.
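To make the static step more concrete, here is a minimal Python sketch of my own. It is not the authors' front-end and it does not instrument anything. It only inflates Flate-compressed streams with `zlib` and counts a few PDF names that usually indicate embedded or auto-triggered JavaScript. The name list and the helper names (`inflate_streams`, `count_names`) are assumptions made for illustration, and a real parser would also have to handle object streams, other filters, and name obfuscation.

```python
import re
import zlib

# Assumed list of PDF names worth flagging; not taken from the paper.
SUSPICIOUS_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA")

# Naive stream matcher: good enough for a sketch, not a full PDF parser.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_streams(data):
    """Return all stream bodies, inflated where zlib can decode them."""
    bodies = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            bodies.append(zlib.decompress(raw))
        except zlib.error:
            bodies.append(raw)  # not Flate-encoded (or damaged): keep raw bytes
    return bodies


def count_names(data):
    """Count suspicious names outside streams and inside decoded streams."""
    outside_streams = STREAM_RE.sub(b"", data)
    counts = {name.decode(): 0 for name in SUSPICIOUS_NAMES}
    for chunk in [outside_streams] + inflate_streams(data):
        for name in SUSPICIOUS_NAMES:
            # Reject longer names that merely start with the same characters.
            pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
            counts[name.decode()] += len(re.findall(pattern, chunk))
    return counts
```

Calling `count_names(open("sample.pdf", "rb").read())` on a file of your choice returns a dictionary of hit counts per name. A non-zero count only says the document contains JavaScript-related objects, not that it is malicious.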

Phase-II Runtime Detection

The back-end component works in two steps, runtime monitoring and runtime detection. When an instrumented PDF is loaded, the context monitoring code inside will cooperate with our runtime monitor, which tries to collect evidence of potential infection attempts. When JavaScript executes to the end or a critical operation occurs, the runtime detector will compute a malscore. If the malscore exceeds a predefined threshold, the document will be classified as malicious.
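The scoring idea in the quote can be sketched as a weighted sum compared against a threshold. This is only my illustration of the general pattern. The weights, the threshold, and the split into "static hits" and "runtime hits seen in JavaScript context" are hypothetical values chosen for the example, so consult the paper for the actual features and formula.

```python
# Hypothetical weights and threshold, chosen only for this example.
STATIC_WEIGHT = 1.0
RUNTIME_WEIGHT = 3.0
THRESHOLD = 5.0


def malscore(static_hits, runtime_hits_in_js_context):
    """Weighted sum of static features and in-JavaScript-context runtime evidence."""
    return (STATIC_WEIGHT * static_hits
            + RUNTIME_WEIGHT * runtime_hits_in_js_context)


def is_malicious(static_hits, runtime_hits_in_js_context):
    """Classify as malicious only when the score strictly exceeds the threshold."""
    return malscore(static_hits, runtime_hits_in_js_context) > THRESHOLD
```

With these example values, a document with two static hits and no suspicious runtime behavior scores 2.0 and is not flagged, while one static hit plus two suspicious runtime events during JavaScript execution scores 7.0 and is flagged.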

The research paper (PDF) is available at https://cs.gmu.edu/~astavrou/research/Daiping_dsn14.pdf, which is also linked at the beginning of this answer. A mirror copy can be found at http://www.pdf-archive.com/2016/07/25/daiping-dsn14/daiping-dsn14.pdf.

Credits to Daiping Liu and Haining Wang of the College of William and Mary, and Angelos Stavrou of George Mason University.