

# analyzing-pdf-malware-with-pdfid

This Claude Code skill automates malicious PDF triage using PDFiD, pdf-parser, and peepdf to detect embedded JavaScript, shellcode, exploit code, and suspicious PDF structures without opening the file. Use it when a suspicious PDF attachment needs rapid analysis to identify dangerous objects such as auto-executing actions, embedded executables, or obfuscated code before sandbox testing or visual inspection.

Repository: Anthropic-Cybersecurity-Skills

Install in Claude Code


git clone --depth 1 https://github.com/mukul975/Anthropic-Cybersecurity-Skills /tmp/analyzing-pdf-malware-with-pdfid && cp -r /tmp/analyzing-pdf-malware-with-pdfid/skills/analyzing-pdf-malware-with-pdfid ~/.claude/skills/analyzing-pdf-malware-with-pdfid

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Analyzing PDF Malware with PDFiD

## When to Use

- A suspicious PDF attachment has been flagged by email security or reported by a user
- You need to determine if a PDF contains embedded JavaScript, shellcode, or exploit code
- You are triaging PDF documents before opening them in a sandbox or analysis environment
- You need to extract embedded executables, scripts, or URLs from malicious PDF objects
- You are analyzing PDF exploit kits that target Adobe Reader or other PDF viewer vulnerabilities

**Do not use** for analyzing the rendered visual content of a PDF; this is for structural analysis of the PDF file format for malicious objects.

## Prerequisites

- Python 3.8+ with Didier Stevens' PDF tools installed (`pip install pdfid pdf-parser`)
- peepdf installed for interactive PDF analysis (`pip install peepdf`)
- pdftotext from poppler-utils for extracting text content safely
- YARA with PDF-specific rules for malware family identification
- Isolated analysis VM without a PDF reader installed (prevent accidental opening)
- CyberChef for decoding embedded Base64, hex, or deflate streams
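
Before starting, it can help to confirm that the command-line tools are actually reachable. The following is a minimal sketch, assuming the tools are exposed on `PATH` under the names used in this document; some installations expose `pdfid.py` and `pdf-parser.py` instead, so the `REQUIRED_TOOLS` list is an assumption to adjust.

```python
import shutil

# Assumed command names; adjust if your install uses pdfid.py / pdf-parser.py.
REQUIRED_TOOLS = ["pdfid", "pdf-parser", "peepdf", "pdftotext", "yara"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All analysis tools found on PATH.")
```

CyberChef is omitted because it is typically used through a browser rather than as a command on `PATH`.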

## Workflow

### Step 1: Initial Triage with PDFiD

Scan the PDF for suspicious keywords and structures:

```bash
# Run PDFiD to identify suspicious elements
pdfid suspect.pdf

# Expected output analysis:
# /JS           - JavaScript (HIGH risk)
# /JavaScript   - JavaScript object (HIGH risk)
# /AA           - Additional actions triggered by events such as page open (HIGH risk)
# /OpenAction   - Action on document open (HIGH risk)
# /Launch       - Launch external application (HIGH risk)
# /EmbeddedFile - Embedded file (MEDIUM risk)
# /RichMedia    - Flash content (MEDIUM risk)
# /ObjStm       - Object stream (used for obfuscation)
# /URI          - URL reference (contextual risk)
# /AcroForm     - Interactive form (MEDIUM risk)

# Run with extra detail
pdfid -e suspect.pdf

# Run with disarming (rename suspicious keywords)
pdfid -d suspect.pdf
```

```
PDFiD Risk Assessment:
━━━━━━━━━━━━━━━━━━━━━
HIGH RISK indicators (any count > 0):
  /JS, /JavaScript  -> Embedded JavaScript code
  /AA               -> Additional Actions (fire on events such as page open, without an explicit click)
  /OpenAction       -> Code runs when document is opened
  /Launch           -> Can launch external executables
  /JBIG2Decode      -> Associated with CVE-2009-0658 exploit

MEDIUM RISK indicators:
  /EmbeddedFile     -> Contains embedded files (could be EXE/DLL)
  /RichMedia        -> Flash/multimedia (Flash exploits)
  /AcroForm         -> Form with possible submit action
  /XFA              -> XML Forms Architecture (complex attack surface)

LOW RISK indicators:
  /ObjStm           -> Object streams (obfuscation technique)
  /URI              -> External URL references
  /Page             -> Number of pages (context only)
```
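
The risk tiers above can be applied mechanically to PDFiD's text output. The sketch below assumes the output consists of lines holding a keyword followed by a count, optionally with a second count in parentheses for obfuscated names. The `SAMPLE` text is made up for illustration, and the helper names `parse_pdfid` and `risk_level` are hypothetical.

```python
import re

# Keyword tiers taken from the risk assessment table in this document.
HIGH = {"/JS", "/JavaScript", "/AA", "/OpenAction", "/Launch", "/JBIG2Decode"}
MEDIUM = {"/EmbeddedFile", "/RichMedia", "/AcroForm", "/XFA"}

# Matches lines such as " /JS   1" or " /JS   1(1)" (count plus obfuscated count).
LINE_RE = re.compile(r"^\s*(/\w+)\s+(\d+)(?:\((\d+)\))?\s*$")


def parse_pdfid(text):
    """Map each /Keyword in PDFiD-style output to its integer count."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def risk_level(counts):
    """Return HIGH, MEDIUM, or LOW based on which keyword tiers are present."""
    if any(counts.get(key, 0) > 0 for key in HIGH):
        return "HIGH"
    if any(counts.get(key, 0) > 0 for key in MEDIUM):
        return "MEDIUM"
    return "LOW"


# Illustrative sample only; real output contains more keywords.
SAMPLE = """\
PDFiD suspect.pdf
 PDF Header: %PDF-1.4
 obj                   12
 endobj                12
 /Page                  1
 /JS                    1
 /JavaScript            1(1)
 /OpenAction            1
 /EmbeddedFile          0
"""

counts = parse_pdfid(SAMPLE)
print(risk_level(counts), counts)
```

A tier is only a triage hint: a HIGH result means the file deserves the deeper inspection in the following steps, not that it is confirmed malicious.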

### Step 2: Parse PDF Structure with pdf-parser

Examine suspicious objects identified by PDFiD:

```bash
# List all objects referencing JavaScript
pdf-parser --search "/JavaScript" suspect.pdf
pdf-parser --search "/JS" suspect.pdf

# List all objects with OpenAction
pdf-parser --search "/OpenAction" suspect.pdf

# Extract a specific object by ID (example: object 5)
pdf-parser --object 5 suspect.pdf

# Extract and decompress stream content
pdf-parser --object 5 --filter --raw suspect.pdf

# Search for embedded files
pdf-parser --search "/EmbeddedFile" suspect.pdf

# Show document statistics (object counts and IDs grouped by type)
pdf-parser --stats suspect.pdf
```
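
Suspicious keys often point to another object through an indirect reference (for example `/OpenAction 5 0 R`), so the interesting payload may sit several hops away. The sketch below assumes pdf-parser prints an `obj <id> <gen>` header and a `Referencing:` line per object. It builds a reference graph from that text and walks it from a starting object. `SAMPLE_OUTPUT` is illustrative, and `build_reference_graph` and `reachable_from` are hypothetical helper names.

```python
import re

OBJ_RE = re.compile(r"^obj (\d+) (\d+)\s*$")
REF_RE = re.compile(r"(\d+) (\d+) R")


def build_reference_graph(parser_output):
    """Map each object ID to the list of object IDs it references."""
    graph = {}
    current = None
    for line in parser_output.splitlines():
        header = OBJ_RE.match(line)
        if header:
            current = int(header.group(1))
            graph.setdefault(current, [])
        elif current is not None and line.strip().startswith("Referencing:"):
            graph[current] = [int(num) for num, _gen in REF_RE.findall(line)]
    return graph


def reachable_from(graph, start):
    """Depth-first walk returning every object reachable from start, in visit order."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        stack.extend(reversed(graph.get(node, [])))
    return seen


# Illustrative text in the assumed pdf-parser layout.
SAMPLE_OUTPUT = """\
obj 1 0
 Type: /Catalog
 Referencing: 2 0 R, 5 0 R

obj 5 0
 Type: /Action
 Referencing: 7 0 R

obj 7 0
 Type:
 Referencing:
 Contains stream
"""

graph = build_reference_graph(SAMPLE_OUTPUT)
print(reachable_from(graph, 5))
```

Each ID in the result is a candidate for `pdf-parser --object <id> --filter --raw`.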

### Step 3: Extract and Analyze Embedded JavaScript

Pull out JavaScript code from PDF objects:

```bash
# Extract JavaScript using pdf-parser
pdf-parser --search "/JS" --raw --filter suspect.pdf > extracted_js.txt

# Alternative: Use peepdf for interactive JavaScript extraction
peepdf -f -i suspect.pdf << 'EOF'
js_analyse
EOF

# peepdf interactive commands for JS analysis:
# js_analyse          - Extract and show all JavaScript code
# js_beautify         - Format extracted JavaScript
# js_eval <object>    - Evaluate JavaScript in sandboxed environment
# object <id>         - Display object content
# rawobject <id>      - Display raw object bytes
# stream <id>         - Display decompressed stream
# offsets             - Show object offsets in file
```

```python
# Python script for comprehensive PDF JavaScript extraction
import re
import subprocess

PDF_PATH = "suspect.pdf"

# Find object IDs containing JavaScript references.
# --search prints every matching object, each starting with an "obj <id> <gen>" header.
js_objects = []
for keyword in ("/JavaScript", "/JS"):
    result = subprocess.run(
        ["pdf-parser", "--search", keyword, PDF_PATH],
        capture_output=True, text=True
    )
    for obj_id in re.findall(r"^obj (\d+) \d+", result.stdout, re.MULTILINE):
        if obj_id not in js_objects:
            js_objects.append(obj_id)

# Extract each JavaScript-containing object (decompressed, raw output)
for obj_id in js_objects:
    result = subprocess.run(
        ["pdf-parser", "--object", obj_id, "--filter", "--raw", PDF_PATH],
        capture_output=True, text=True
    )
    print(f"\n=== Object {obj_id} ===")
    print(result.stdout[:2000])  # truncate long streams for readability
```
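
Once the JavaScript is extracted, a quick pattern scan can show where to look first. This is a rough sketch, not a detection engine: the `PATTERNS` table is an assumed, non-exhaustive set of constructs commonly seen in obfuscated PDF JavaScript, and `flag_javascript` is a hypothetical helper name.

```python
import re

# Assumed heuristics; extend or tune for the samples you actually see.
PATTERNS = {
    "eval": re.compile(r"\beval\s*\("),
    "unescape": re.compile(r"\bunescape\s*\("),
    "fromCharCode": re.compile(r"String\.fromCharCode"),
    "percent_u_blob": re.compile(r"(?:%u[0-9a-fA-F]{4}){8,}"),
    "long_hex_string": re.compile(r"[0-9a-fA-F]{200,}"),
}


def flag_javascript(code):
    """Return {pattern_name: match_count} for every pattern found in the code."""
    hits = {}
    for name, regex in PATTERNS.items():
        matches = regex.findall(code)
        if matches:
            hits[name] = len(matches)
    return hits


# Illustrative snippet resembling a heap-spray loader.
sample_js = 'var s = unescape("' + "%u9090" * 8 + '"); eval(s);'
print(flag_javascript(sample_js))
```

A hit on `percent_u_blob` or `long_hex_string` suggests an encoded payload worth decoding before the shellcode analysis in Step 4.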

### Step 4: Analyze Embedded Shellcode

Extract and examine shellcode from PDF exploits:

```bash
# Extract raw stream data for shellcode analysis
pdf-parser --object 7 --filter --raw --dump shellcode.bin suspect.pdf

# Analyze shellcode with scdbg (shellcode debugger)
scdbg /f shellcode.bin

# Alternative: Use speakeasy for shellcode emulation
python3 -c "
import speakeasy

se = speakeasy.Speakeasy()
sc_addr = se.load_shellcode('shellcode.bin', arch='x86')
se.run_shellcode(sc_addr, count=1000)

# Review API calls recorded in the report (layout assumed: entry_points -> apis)
for ep in se.get_report().get('entry_points', []):
    for call in ep.get('apis', []):
        print(f\"{call.get('api_name')}: {call.get('args')}\")
"

# Use CyberChef to decode hex/base64 encoded shellcode
# Input: Extracted stream data
# Recipe: To Hex -> Disassemble x86 (Disassemble x86 expects hex input; skip To Hex if the data is already hex)
```
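
Shellcode in PDF JavaScript is frequently stored as `%uXXXX` escapes passed to `unescape()`, where each escape is a 16-bit value that ends up in memory in little-endian byte order. The sketch below converts such a string to raw bytes that could be written to `shellcode.bin`. The helper name `decode_percent_u` is hypothetical, and it deliberately handles only the `%uXXXX` form.

```python
import re
import struct


def decode_percent_u(payload):
    """Convert %uXXXX escapes to bytes, emitting each 16-bit value little-endian."""
    out = bytearray()
    for match in re.finditer(r"%u([0-9a-fA-F]{4})", payload):
        out += struct.pack("<H", int(match.group(1), 16))
    return bytes(out)


# Illustrative input: a short NOP run followed by one more encoded word.
encoded = "%u9090%u9090%uCCEB"
raw = decode_percent_u(encoded)
print(raw.hex())

# To feed the result to scdbg or speakeasy, write it out, for example:
# with open("shellcode.bin", "wb") as handle:
#     handle.write(raw)
```

Note the byte swap: `%uCCEB` becomes the bytes `EB CC`, which is why disassembling the escaped text directly gives misleading results.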

### Step 5: Extract Embedded Files and URLs

Pull out embedded executables and linked resources:

```python
# Extract embedded files from PDF
import subprocess
import hashlib

# Find embedded file objects
result = subprocess.run(
    ["pdf-