uap-release-analyzer

This skill analyzes folders of declassified UAP/UFO documents by extracting text, cataloging files by agency and source, identifying redaction patterns, surfacing named entities and locations, and detecting cross-document patterns, producing a standardized CSV inventory, per-file digest, entities JSON, and markdown report. Use it when analyzing any FOIA tranche, war.gov/UFO releases, FBI Vault materials, AARO publications, or any declassified government PDF corpus where structured document-level and corpus-level insights are needed.

Repository: uap-release-analyzer

Install in Claude Code

git clone https://github.com/ckpxgfnksd-max/uap-release-analyzer ~/.claude/skills/uap-release-analyzer

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# UAP / Declassified Release Analyzer

This skill turns a folder of declassified UAP/UFO documents into a structured analytic report. It was built from a real workflow against the May 2026 war.gov/UFO/ "PURSUE" tranche (162 files, 4,000+ pages, mixed FBI/DOW/NASA/DOS/NARA sources), so it's tuned to the quirks of that universe — but it generalizes to any tranche of FOIA-released government PDFs.

## When to use

Trigger on prompts like "analyze the UFO files I just downloaded", "build me a report on this UAP release", "what's in `~/Downloads/release_01/`?", "compare release 1 and release 2", "find redaction patterns in these FBI files", "summarize this AARO PDF", or whenever the user references a directory of declassified documents and wants any kind of summary, inventory, or pattern surfacing. Also trigger if the user just dumps a path and asks "what's interesting in here?" — this skill is the right tool.

## Why a skill

The work has a fixed shape that repeats across every new tranche:

1. Inventory — what files came down, sizes, page counts, which agency.
2. Text extraction — pull text where there is a text layer; flag the files (often the majority) that are scanned and need OCR.
3. Entity surfacing — locations, agencies, phenomena vocabulary, named people.
4. Redaction pattern analysis — which FOIA exemptions show up where, which files are most redacted.
5. Cross-document patterns — year clusters, agency × location heatmap, names that appear in 5+ files.
6. A standardized report the user can read in ten minutes.

Doing this freshly every time wastes effort and produces inconsistent outputs. The bundled scripts make every tranche analyzable the same way.
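As an illustration of the cross-document step, here is a minimal sketch of finding names that appear in 5+ files. It assumes entity extraction has already produced a set of names per file; the function name `names_in_many_files` and the input shape are hypothetical, not taken from the bundled scripts.

```python
from collections import Counter


def names_in_many_files(names_by_file: dict[str, set[str]], min_files: int = 5) -> dict[str, int]:
    """Return names found in at least `min_files` distinct files, with their file counts."""
    doc_freq = Counter()
    for names in names_by_file.values():
        # A set guarantees each name is counted once per file,
        # no matter how often it occurs inside that file.
        doc_freq.update(set(names))
    return {name: n for name, n in doc_freq.most_common() if n >= min_files}
```

Counting files rather than raw mentions keeps one long, repetitive document from dominating the list.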

## The standard workflow

Run scripts in this order. Each writes intermediate artifacts that the next step consumes. They are **idempotent and incremental** — re-running on the same folder skips work that's already done.

```
release_root/
  release_NN/            # the actual PDFs/PNGs/JPGs (input)
  text/                  # extracted text per PDF (created)
  inventory.csv          # one row per file (created)
  analytics/             # aggregated outputs (created)
    top_terms.csv
    terms_by_agency.csv
    entities.json
    per_file_digest.csv
    cross_doc.json
  REPORT.md              # human-readable analytic writeup (created)
```

**Step 1 — Inventory.** Run `scripts/inventory.py <release_root>`. This walks the release directory, classifies each file by filename prefix (see `references/agency_vocab.md`), reads PDF page counts, and writes `inventory.csv`. Don't write inventory by hand — the script handles encrypted PDFs, weird filenames with spaces or em-dashes, and files that pypdf can't open.
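Prefix-based classification can be sketched roughly as follows. The patterns are the ones quoted under "Agency classification"; the lowercasing, the `UNKNOWN` fallback label, and the function name `classify_agency` are assumptions for illustration, and the real `inventory.py` may behave differently.

```python
from fnmatch import fnmatchcase

# Patterns quoted from the "Agency classification" section of this document.
# The full vocabulary lives in references/agency_vocab.md.
AGENCY_PATTERNS = {
    "FBI": ["65_hs1*", "fbi-photo-*", "usper-*", "serial*", "2024-04-30-*"],
    "DOW": ["dow-uap*", "western_us_event*"],
    "NASA": ["nasa-uap*"],
    "DOS": ["dos-uap*", "059uap*"],
}


def classify_agency(filename: str) -> str:
    """Map a filename to an agency by prefix; 'UNKNOWN' is a hypothetical fallback."""
    name = filename.lower()  # assumption: matching is case-insensitive
    for agency, patterns in AGENCY_PATTERNS.items():
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return agency
    return "UNKNOWN"
```

Filenames with spaces or unusual punctuation are matched as-is, since only the leading characters matter to these patterns.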

**Step 2 — Text extraction.** Run `scripts/extract_text.py <release_root> [start] [end]`. Extracts text via pdfplumber, writing one `.txt` per PDF into `text/`. Skips files that already have a non-empty `.txt`. Many FBI / NARA / older photo-PDFs have **no text layer**. Those produce 0-char files, which is expected: the analytics treat them as "scanned, OCR needed". The optional `[start] [end]` slice arguments let you process in chunks if your sandbox has a per-call timeout (the war.gov FBI sections are 200+ pages each; extract them in batches of ~25 if your environment limits each call to 45 seconds).

**Run scripts in the foreground of your turn**, not via background-and-end-turn patterns. The pipeline is fast enough (a few minutes from cold) that you can stay in-turn. If a single `extract_text.py` call would actually time out, prefer the `[start] [end]` chunking pattern over backgrounding — chunked calls each finish quickly, the script is idempotent, and progress is visible.
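The skip-if-done and chunking behavior described above might look something like this sketch. It treats `start` and `end` as indices into a sorted file list, which is an assumption about how `extract_text.py` interprets them. The `extract` callable is a hypothetical stand-in for the pdfplumber-based extraction, so the sketch stays dependency-free.

```python
from pathlib import Path
from typing import Callable, Optional


def extract_slice(pdf_dir: Path, text_dir: Path,
                  extract: Callable[[Path], str],
                  start: int = 0, end: Optional[int] = None) -> list[str]:
    """Extract text for pdfs[start:end], skipping PDFs that already have a non-empty .txt."""
    text_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(pdf_dir.rglob("*.pdf"))  # stable order keeps slices repeatable
    processed = []
    for pdf in pdfs[start:end]:
        out = text_dir / (pdf.stem + ".txt")
        if out.exists() and out.stat().st_size > 0:
            continue  # already extracted on an earlier run
        out.write_text(extract(pdf), encoding="utf-8")
        processed.append(pdf.name)
    return processed
```

Under this skip rule, a scanned PDF that yields a 0-char file is retried on every run, because only non-empty outputs count as done.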

**Step 3 — Analytics.** Run `scripts/analyze.py <release_root>`. Reads the extracted text + inventory, then writes the contents of `analytics/`. This is fast even on 800K+ characters of text.
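One piece of the analytics, counting redaction markers, could be sketched like this. It assumes redactions are cited in the parenthesized FOIA form such as `(b)(6)` or `(b)(7)(C)`; the marker list that `analyze.py` actually uses is not shown here, and the names `count_exemptions` and `most_redacted` are hypothetical.

```python
import re
from collections import Counter

# Matches citations like (b)(6) or (b)(7)(C). The lookahead stops a following
# "(b)(7)" from being misread as a sub-letter of the previous marker.
EXEMPTION_RE = re.compile(
    r"\(b\)\s*\((\d)\)(?:\(([A-F])\)(?!\s*\(\d\)))?", re.IGNORECASE
)


def count_exemptions(text: str) -> Counter:
    """Count normalized exemption markers in one document's text."""
    counts = Counter()
    for m in EXEMPTION_RE.finditer(text):
        digit, letter = m.group(1), m.group(2)
        marker = f"(b)({digit})" + (f"({letter.upper()})" if letter else "")
        counts[marker] += 1
    return counts


def most_redacted(texts_by_file: dict[str, str], top: int = 5) -> list[tuple[str, int]]:
    """Rank files by total number of exemption markers, highest first."""
    totals = {name: sum(count_exemptions(t).values()) for name, t in texts_by_file.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Files with no text layer will always score zero here, so a low count does not mean a file is lightly redacted.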

**Step 4 — Report.** Run `scripts/build_report.py <release_root>`. Reads inventory + analytics and writes a `REPORT.md` with the sections listed under "Report structure" below.

When the user just says "analyze the release at `<path>`", run all four in sequence with that path. When they ask a narrower question ("how many files?", "which file is most redacted?"), call only the relevant script or read the existing artifacts directly.
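For the run-everything case, a small driver along these lines would do. The script names come from the steps above; the driver itself, including `pipeline_commands`, `run_pipeline`, and the `scripts_dir` default, is a hypothetical convenience and not part of the skill.

```python
import subprocess
import sys
from pathlib import Path

# The four steps, in the order the workflow requires.
STEPS = ["inventory.py", "extract_text.py", "analyze.py", "build_report.py"]


def pipeline_commands(release_root: str, scripts_dir: str = "scripts") -> list[list[str]]:
    """Build one argv list per step, all pointed at the same release root."""
    return [[sys.executable, str(Path(scripts_dir) / step), release_root] for step in STEPS]


def run_pipeline(release_root: str, scripts_dir: str = "scripts") -> None:
    """Run the steps in the foreground; stop at the first failure."""
    for cmd in pipeline_commands(release_root, scripts_dir):
        subprocess.run(cmd, check=True)
```

Because `check=True` raises on a non-zero exit, a failed extraction never feeds stale artifacts into the analytics or report steps.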

## Report structure

Always use this exact section order in `REPORT.md` so reports across tranches stay comparable. If a section has no data for this tranche, leave a one-line "no data" note — don't omit the heading.

```
# <Release name> — Raw Analytics
**Source:** ... · **Cleared for release:** ...
**Files in this analysis:** N of M (note any gaps)

## 1. Inventory — counts, total size, page counts, by agency
## 2. What's actually in the release — narrative summary of the major buckets
## 3. Where the activity is concentrated — top locations
## 4. Phenomena terminology — UAP/craft/orb/disc/etc. with counts
## 5. Agency cross-references — agencies named in text
## 6. Year clusters — when is this material from
## 7. Redactions — top markers + most-redacted files
## 8. Notable individual files
## 9. Cross-document patterns
## 10. What's missing / caveats — OCR gaps, files we couldn't pull, etc.
## 11. Files in this analysis — paths to inventory.csv / analytics/*
```

The "What's missing" section matters — it's what makes the report honest. Always call out files we couldn't OCR, files referenced on a source page but not downloaded, and heuristic limits of the entity extraction.
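The fixed section order and the keep-the-heading rule can be enforced mechanically. The sketch below assumes section bodies arrive as a dict keyed by a short heading; the shortened headings, the `render_skeleton` name, and the exact wording of the placeholder note are illustrative choices, not what `build_report.py` necessarily emits.

```python
# Short forms of the eleven headings, in the required order.
SECTIONS = [
    "1. Inventory",
    "2. What's actually in the release",
    "3. Where the activity is concentrated",
    "4. Phenomena terminology",
    "5. Agency cross-references",
    "6. Year clusters",
    "7. Redactions",
    "8. Notable individual files",
    "9. Cross-document patterns",
    "10. What's missing / caveats",
    "11. Files in this analysis",
]


def render_skeleton(title: str, bodies: dict[str, str]) -> str:
    """Render every heading in order; empty sections get a one-line note."""
    lines = [f"# {title}", ""]
    for heading in SECTIONS:
        lines.append(f"## {heading}")
        lines.append(bodies.get(heading, "").strip() or "_No data for this tranche._")
        lines.append("")
    return "\n".join(lines)
```

Driving the output from one ordered list means a tranche with sparse data still produces a report that lines up section-for-section with earlier ones.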

## Agency classification

Files are classified by filename prefix. The full vocabulary is in `references/agency_vocab.md`. The high-confidence prefixes from the war.gov universe:

- `65_hs1*`, `fbi-photo-*`, `usper-*`, `serial*`, `2024-04-30-*` → FBI
- `dow-uap*`, `western_us_event*` → DOW (Department of War)
- `nasa-uap*` → NASA
- `dos-uap*`, `059uap*` → DOS (St