# analyzing-pdf-malware-with-pdfid

This Claude Code skill automates malicious PDF triage using PDFiD, pdf-parser, and peepdf to detect embedded JavaScript, shellcode, exploit code, and suspicious PDF structures without opening the file. Use it when a suspicious PDF attachment needs rapid analysis to identify dangerous objects, such as auto-executing actions, embedded executables, or obfuscated code, before sandbox testing or visual inspection.
git clone --depth 1 https://github.com/mukul975/Anthropic-Cybersecurity-Skills /tmp/analyzing-pdf-malware-with-pdfid && cp -r /tmp/analyzing-pdf-malware-with-pdfid/skills/analyzing-pdf-malware-with-pdfid ~/.claude/skills/analyzing-pdf-malware-with-pdfid
# Analyzing PDF Malware with PDFiD
## When to Use
- A suspicious PDF attachment has been flagged by email security or reported by a user
- You need to determine if a PDF contains embedded JavaScript, shellcode, or exploit code
- Triaging PDF documents before opening them in a sandbox or analysis environment
- Extracting embedded executables, scripts, or URLs from malicious PDF objects
- Analyzing PDF exploit kits targeting Adobe Reader or other PDF viewer vulnerabilities
**Do not use** for analyzing the rendered visual content of a PDF; this skill covers structural analysis of the PDF file format to find malicious objects.
## Prerequisites
- Python 3.8+ with Didier Stevens' PDF tools installed (`pip install pdfid pdf-parser`)
- peepdf installed for interactive PDF analysis (`pip install peepdf`)
- pdftotext from poppler-utils for extracting text content safely
- YARA with PDF-specific rules for malware family identification
- Isolated analysis VM without a PDF reader installed (prevent accidental opening)
- CyberChef for decoding embedded Base64, hex, or deflate streams
## Workflow
### Step 1: Initial Triage with PDFiD
Scan the PDF for suspicious keywords and structures:
```bash
# Run PDFiD to identify suspicious elements
pdfid suspect.pdf
# Expected output analysis:
# /JS - JavaScript (HIGH risk)
# /JavaScript - JavaScript object (HIGH risk)
# /AA - Additional Actions, triggered automatically by events such as page open (HIGH risk)
# /OpenAction - Action on document open (HIGH risk)
# /Launch - Launch external application (HIGH risk)
# /EmbeddedFile - Embedded file (MEDIUM risk)
# /RichMedia - Flash content (MEDIUM risk)
# /ObjStm - Object stream (used for obfuscation)
# /URI - URL reference (contextual risk)
# /AcroForm - Interactive form (MEDIUM risk)
# Run with extra detail
pdfid -e suspect.pdf
# Run with disarming (rename suspicious keywords)
pdfid -d suspect.pdf
```
```
PDFiD Risk Assessment:
━━━━━━━━━━━━━━━━━━━━━
HIGH RISK indicators (any count > 0):
/JS, /JavaScript -> Embedded JavaScript code
/AA -> Additional Actions (trigger automatically on events such as page open)
/OpenAction -> Code runs when document is opened
/Launch -> Can launch external executables
/JBIG2Decode -> Associated with CVE-2009-0658 exploit
MEDIUM RISK indicators:
/EmbeddedFile -> Contains embedded files (could be EXE/DLL)
/RichMedia -> Flash/multimedia (Flash exploits)
/AcroForm -> Form with possible submit action
/XFA -> XML Forms Architecture (complex attack surface)
LOW RISK indicators:
/ObjStm -> Object streams (obfuscation technique)
/URI -> External URL references
/Page -> Number of pages (context only)
```
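The tiers above can be applied mechanically to PDFiD's text output. The following is a minimal sketch, not part of PDFiD itself: the function names `parse_pdfid` and `classify` are hypothetical, the `SAMPLE` text is made-up output shortened for illustration, and the parser assumes the usual `keyword  count` layout, where a count such as `1(1)` carries the number of obfuscated occurrences in parentheses.

```python
import re

# Keyword tiers copied from the risk table above.
HIGH = {"/JS", "/JavaScript", "/AA", "/OpenAction", "/Launch", "/JBIG2Decode"}
MEDIUM = {"/EmbeddedFile", "/RichMedia", "/AcroForm", "/XFA"}

# Matches lines such as " /JS   1" or " /JS   1(1)" (obfuscated count in parentheses).
COUNT_LINE = re.compile(r"^\s*(/\w+)\s+(\d+)(?:\((\d+)\))?\s*$")


def parse_pdfid(output):
    """Return {keyword: count} for every /Name line in PDFiD's text output."""
    counts = {}
    for line in output.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def classify(counts):
    """Map keyword counts to the highest matching tier: HIGH, MEDIUM, or LOW."""
    present = {name for name, count in counts.items() if count > 0}
    if present & HIGH:
        return "HIGH"
    if present & MEDIUM:
        return "MEDIUM"
    return "LOW"


# Hypothetical PDFiD output, shortened for illustration.
SAMPLE = """\
PDFiD 0.2.8 suspect.pdf
 PDF Header: %PDF-1.4
 obj                   12
 /Page                  1
 /JS                    1(1)
 /OpenAction            1
 /AcroForm              0
"""

if __name__ == "__main__":
    counts = parse_pdfid(SAMPLE)
    print(counts, classify(counts))
```

A tier is only a triage hint: a HIGH result says which objects to inspect first, not that the file is malicious.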
### Step 2: Parse PDF Structure with pdf-parser
Examine suspicious objects identified by PDFiD:
```bash
# List all objects referencing JavaScript
pdf-parser --search "/JavaScript" suspect.pdf
pdf-parser --search "/JS" suspect.pdf
# List all objects with OpenAction
pdf-parser --search "/OpenAction" suspect.pdf
# Extract a specific object by ID (example: object 5)
pdf-parser --object 5 suspect.pdf
# Extract and decompress stream content
pdf-parser --object 5 --filter --raw suspect.pdf
# Search for embedded files
pdf-parser --search "/EmbeddedFile" suspect.pdf
# List all objects with their types
pdf-parser --stats suspect.pdf
```
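When a search hit only points at another object (for example `/JS 6 0 R`), the next step is to follow that reference. The sketch below shows one way to pull object IDs, types, and references out of pdf-parser's text output. It assumes the usual `obj <id> <generation>` header followed by `Type:` and `Referencing:` lines; `parse_objects` and the `SAMPLE` text are hypothetical and only illustrate the idea.

```python
import re

HEADER = re.compile(r"^obj (\d+) (\d+)\s*$")
REFERENCE = re.compile(r"(\d+) (\d+) R")


def parse_objects(output):
    """Map each object ID to its declared type and the object IDs it references."""
    objects = {}
    current = None
    for line in output.splitlines():
        header = HEADER.match(line)
        stripped = line.strip()
        if header:
            current = int(header.group(1))
            objects[current] = {"type": None, "references": []}
        elif current is not None and stripped.startswith("Type:"):
            objects[current]["type"] = stripped.split(":", 1)[1].strip() or None
        elif current is not None and stripped.startswith("Referencing:"):
            refs = REFERENCE.findall(stripped)
            objects[current]["references"] = [int(obj) for obj, _gen in refs]
    return objects


# Hypothetical pdf-parser output for two objects.
SAMPLE = """\
obj 1 0
 Type: /Catalog
 Referencing: 2 0 R, 5 0 R

  <<
    /Type /Catalog
    /Pages 2 0 R
    /OpenAction 5 0 R
  >>

obj 5 0
 Type: /Action
 Referencing: 6 0 R

  <<
    /S /JavaScript
    /JS 6 0 R
  >>
"""

if __name__ == "__main__":
    for obj_id, info in parse_objects(SAMPLE).items():
        print(obj_id, info["type"], "->", info["references"])
```

In this sample, object 5 references object 6, so `pdf-parser --object 6 --filter --raw suspect.pdf` would be the natural next command.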
### Step 3: Extract and Analyze Embedded JavaScript
Pull out JavaScript code from PDF objects:
```bash
# Extract JavaScript using pdf-parser
pdf-parser --search "/JS" --raw --filter suspect.pdf > extracted_js.txt
# Alternative: Use peepdf for interactive JavaScript extraction
peepdf -f -i suspect.pdf << 'EOF'
js_analyse
EOF
# peepdf interactive commands for JS analysis:
# js_analyse - Extract and show all JavaScript code
# js_beautify - Format extracted JavaScript
# js_eval <object> - Evaluate JavaScript in sandboxed environment
# object <id> - Display object content
# rawobject <id> - Display raw object bytes
# stream <id> - Display decompressed stream
# offsets - Show object offsets in file
```
```python
# Python script for comprehensive PDF JavaScript extraction
import re
import subprocess

PDF = "suspect.pdf"

# Find object IDs whose content references JavaScript.
# pdf-parser prints an "obj <id> <generation>" header for each search hit.
js_objects = []
for keyword in ("/JavaScript", "/JS"):
    result = subprocess.run(
        ["pdf-parser", "--search", keyword, PDF],
        capture_output=True, text=True
    )
    for match in re.finditer(r"^obj (\d+) \d+", result.stdout, re.MULTILINE):
        if match.group(1) not in js_objects:
            js_objects.append(match.group(1))

# Extract each JavaScript-containing object
for obj_id in js_objects:
    result = subprocess.run(
        ["pdf-parser", "--object", obj_id, "--filter", "--raw", PDF],
        capture_output=True, text=True
    )
    print(f"\n=== Object {obj_id} ===")
    print(result.stdout[:2000])
```
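Once the JavaScript is extracted, a quick static pass can highlight constructs that commonly appear in obfuscated or exploit code. The sketch below is a heuristic only: the indicator names, patterns, and thresholds are assumptions chosen for illustration, and `flag_javascript` is a hypothetical helper, not a feature of pdf-parser or peepdf.

```python
import re

# Heuristic patterns; the names and thresholds are illustrative assumptions.
INDICATORS = {
    "eval": re.compile(r"\beval\s*\("),
    "unescape": re.compile(r"\bunescape\s*\("),
    "fromCharCode": re.compile(r"String\.fromCharCode"),
    # Eight or more consecutive %uXXXX escapes, typical of encoded payloads.
    "unicode_escape_run": re.compile(r"(?:%u[0-9a-fA-F]{4}){8,}"),
    # Very long unbroken hex strings.
    "long_hex_string": re.compile(r"[0-9a-fA-F]{200,}"),
}


def flag_javascript(source):
    """Return the sorted names of all indicators found in the JavaScript source."""
    return sorted(name for name, pattern in INDICATORS.items() if pattern.search(source))


if __name__ == "__main__":
    # Hypothetical snippet resembling extracted PDF JavaScript.
    sample = 'var s = unescape("' + "%u9090" * 8 + '"); eval(s);'
    print(flag_javascript(sample))
```

An empty result does not clear the script; heavily obfuscated code can avoid every pattern listed here, so manual review is still needed.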
### Step 4: Analyze Embedded Shellcode
Extract and examine shellcode from PDF exploits:
```bash
# Extract raw stream data for shellcode analysis
pdf-parser --object 7 --filter --raw --dump shellcode.bin suspect.pdf
# Analyze shellcode with scdbg (shellcode debugger)
scdbg /f shellcode.bin
# Alternative: Use speakeasy for shellcode emulation
python3 - << 'EOF'
import speakeasy

se = speakeasy.Speakeasy()
sc_addr = se.load_shellcode('shellcode.bin', arch='x86')
se.run_shellcode(sc_addr)

# Review API calls made by shellcode
for entry_point in se.get_report()['entry_points']:
    for call in entry_point.get('apis', []):
        print(f"{call['api_name']}: {call['args']}")
EOF
# Use CyberChef to decode hex/base64 encoded shellcode
# Input: Extracted stream data
# Recipe: From Hex -> Disassemble x86
```
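JavaScript-based PDF exploits often carry shellcode as `%uXXXX` escapes passed to `unescape()`, so the extracted script has to be converted to raw bytes before scdbg or speakeasy can run it. The following is a minimal sketch of that conversion; `decode_percent_u` is a hypothetical helper, and it assumes each escape maps to one 16-bit value stored little-endian, as it would be in memory on x86.

```python
import re

UNICODE_ESCAPE = re.compile(r"%u([0-9a-fA-F]{4})")


def decode_percent_u(text):
    """Convert each %uXXXX escape in text to two bytes in little-endian order."""
    return b"".join(
        int(hex_value, 16).to_bytes(2, "little")
        for hex_value in UNICODE_ESCAPE.findall(text)
    )


if __name__ == "__main__":
    # Hypothetical escaped payload: two NOP pairs followed by 0x4142.
    payload = decode_percent_u("%u9090%u9090%u4142")
    print(payload.hex())
```

Writing the returned bytes to `shellcode.bin` (for example with `open("shellcode.bin", "wb").write(payload)`) produces a file that the tools above can take as input.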
### Step 5: Extract Embedded Files and URLs
Pull out embedded executables and linked resources:
```python
# Extract embedded files from PDF
import subprocess
import hashlib
# Find embedded file objects
result = subprocess.run(
    ["pdf-
```