Malicious PDF documents are commonly used in spear-phishing attacks because PDF is a file format that businesses rely on every day.  In addition, PDFs offer many features for embedding content (JavaScript, Flash, shellcode, etc.), which makes them an ideal carrier for delivering malware.

Below is a diagram of the PDF file format:

pdf_objects

A common way malicious PDFs execute code on a victim machine is through embedded JavaScript.  The JavaScript is commonly used to fill the heap with NOPs followed by shellcode, a technique known as heap spraying.  This technique is effective when the attacker isn’t sure where EIP will land when a buffer overflow condition occurs.  By filling the heap with large chunks of NOPs leading to shellcode, the attacker has a high probability that EIP will land in a NOP sled and slide its way down to the malicious shellcode.  Below is an example of the heap before and after the JavaScript is executed:

heap_spray
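From the analyst's side, a heap spray tends to leave recognizable traces in the JavaScript source. The sketch below flags three of them: a run of encoded NOPs, a call to unescape(), and a loop that keeps growing a string until it reaches a target length. The function name and the patterns are my own illustrative assumptions; obfuscated scripts will often evade simple checks like these.

```python
import re

# Heuristic patterns (illustrative assumptions, not a complete signature set).
NOP_SLED = re.compile(r"(?:%u9090|\\x90){2,}", re.IGNORECASE)   # encoded NOP runs
UNESCAPE_CALL = re.compile(r"\bunescape\s*\(")                  # decodes %uXXXX strings
GROWTH_LOOP = re.compile(r"while\s*\([^)]*\.length\s*<")        # string doubled until large


def spray_indicators(js_source):
    """Return the names of the heap-spray indicators found in js_source."""
    found = []
    if NOP_SLED.search(js_source):
        found.append("nop_sled")
    if UNESCAPE_CALL.search(js_source):
        found.append("unescape")
    if GROWTH_LOOP.search(js_source):
        found.append("growth_loop")
    return found


if __name__ == "__main__":
    # Harmless fragment with the typical shape of a spray, no shellcode attached.
    sample = 'var s = unescape("%u9090%u9090"); while (s.length < 0x40000) s += s;'
    print(spray_indicators(sample))
```

A hit on all three is a strong hint that the script deserves a closer look, but an empty result does not mean the script is clean.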

Normally I begin my analysis of a PDF by statically analyzing the embedded objects.  PDFs have several types of objects that can be embedded, but the most interesting are streams, because this is where embedded content (scripts, shellcode, files, etc.) is stored.  Below are a few interesting keywords I look for when statically analyzing a PDF:

  • /AA, /OpenAction – Automatically executes an action or script
  • /JS, /JavaScript – Embeds JavaScript code in an object
  • /RichMedia – Can be used to embed Flash in a PDF
  • /ObjStm – Can be used to hide objects inside an object stream
  • /Names, /AcroForm, /Action – Can launch scripts or actions
  • /GoTo – Changes the view to a specified destination in the PDF file
  • /Launch – Runs a program or opens a document
  • /URI – Accesses a resource by its URL
  • /SubmitForm, /GoToR – Can send data to a URL
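To make the keyword hunt concrete, here is a minimal sketch that counts these names in the raw bytes of a file. It is not how pdfid.py is implemented; the function name and regular expressions are my own assumptions. It does decode #xx hex escapes in names, since a name written as /J#61vaScript is equivalent to /JavaScript.

```python
import re
from collections import Counter

# Keywords from the list above.
KEYWORDS = {
    "/AA", "/OpenAction", "/JS", "/JavaScript", "/RichMedia", "/ObjStm",
    "/Names", "/AcroForm", "/Action", "/GoTo", "/Launch", "/URI",
    "/SubmitForm", "/GoToR",
}

# A PDF name: a slash followed by characters that are not whitespace or delimiters.
NAME = re.compile(rb"/[^\s/<>\[\]()%{}]+")
# Inside a name, "#" plus two hex digits encodes a single byte.
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def count_keywords(data):
    """Count occurrences of the risky keywords in raw PDF bytes."""
    counts = Counter()
    for match in NAME.finditer(data):
        # Undo hex escapes so obfuscated names compare equal to plain ones.
        name = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), match.group(0))
        text = name.decode("latin-1")
        if text in KEYWORDS:
            counts[text] += 1
    return counts


if __name__ == "__main__":
    sample = (b"<< /Type /Catalog /OpenAction 31 0 R >>\n"
              b"31 0 obj << /S /JavaScript /JS 32 0 R >>\n"
              b"<< /J#61vaScript (x) >>")
    print(count_keywords(sample))
```

Because this scans raw bytes, it cannot see names hidden inside compressed object streams, and compressed data can occasionally produce a false match.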

The following tools are useful for analyzing PDFs and checking for these keywords:

  • pdfid.py – Python script used to count risky keywords
  • pdf-parser.py – Python script to search and examine the embedded content
  • Jsunpack-n – Tool to analyze PDFs, SWF, HTML, JavaScript, and pcap
  • Origami Framework – Suite of tools, written in Ruby, for analyzing PDF documents
  • pyew – Tool written in Python that has several modules to parse PDF files
  • SpiderMonkey – Mozilla’s JavaScript engine, useful for analyzing and deobfuscating JavaScript

Below is an example of scanning a PDF for embedded objects with pdfid.py.  This PDF contains objects that indicate embedded JavaScript:

pdfid_1

Next we can use pdf-parser.py to identify the object that contains the embedded JavaScript.  In the screenshot below we can see that object 31 references object 32:

pdf_parser_1

Now we can use the “--object” option to further inspect object 32.  In this case we can see that the object’s stream is compressed with “FlateDecode”:

pdf_parser_2

To decompress the object we can use the “--filter” and “--raw” options to expose the JavaScript that is used to perform the heap spraying:

pdf_parser_3
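FlateDecode is zlib/deflate compression, so the same decompression can be sketched in a few lines of Python. The function name and the regular expression below are my own assumptions, and this is not what pdf-parser.py does internally: a real parser uses the stream's /Length entry and handles filter chains, whereas this sketch simply tries to inflate everything between "stream" and "endstream".

```python
import re
import zlib

# Capture the bytes between the "stream" and "endstream" keywords.
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed contents of every stream that inflates cleanly."""
    results = []
    for match in STREAM.finditer(pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            data = inflater.decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (another filter, or no filter at all)
        if inflater.eof:  # keep only streams that decompressed completely
            results.append(data)
    return results


if __name__ == "__main__":
    script = b"var x = 1;"
    sample = (b"32 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
              + zlib.compress(script)
              + b"\nendstream\nendobj\n")
    print(inflate_streams(sample))
```

Using a decompression object rather than zlib.decompress() lets the trailing end-of-line before "endstream" be ignored instead of treated as part of the compressed data.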

At this point we have determined this PDF is malicious.  We can continue our analysis dynamically in an analysis VM or disarm the PDF by modifying the keywords with pdfid.py:

pdf_disarm
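The idea behind disarming can be sketched as follows: rewrite the risky names so the reader no longer recognizes them, while keeping every replacement the same length so the byte offsets in the cross-reference table stay valid. The sketch below swaps the case of each keyword. The keyword set and function name are my own assumptions, and pdfid.py's actual implementation may differ.

```python
import re

# Names to neutralize (assumed set: JavaScript and automatic actions).
RISKY = (b"/JavaScript", b"/OpenAction", b"/JS", b"/AA")

# Longest names first so /JavaScript is tried before /JS; the lookahead
# prevents matching a prefix of a longer, unrelated name.
PATTERN = re.compile(
    rb"(" + b"|".join(re.escape(k) for k in sorted(RISKY, key=len, reverse=True))
    + rb")(?![A-Za-z0-9])"
)


def disarm(pdf_bytes):
    """Swap the case of risky names; the output has the same length as the input."""
    return PATTERN.sub(lambda m: m.group(1).swapcase(), pdf_bytes)


if __name__ == "__main__":
    sample = b"<< /OpenAction 31 0 R /S /JavaScript /JS 32 0 R >>"
    print(disarm(sample))
```

This only touches names that appear in plain form; names written with hex escapes or stored inside compressed object streams would pass through unchanged, so a disarmed file should still be handled with care.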