mapping-documents
Generate navigable semantic maps from PDF documents. Extracts section structure via font analysis, then runs LLM extraction per section for claims, symbols, and dependencies — all page-anchored. Produces _MAP.md (progressive disclosure), .symbols.json (definition index), .anchors.json (claim references), and a _USAGE.md snippet for CLAUDE.md. Use when analyzing papers, specs, or legal docs; when asked to "map this document", "index this PDF", "what does this paper say"; or when a coding agent needs grounded reference material from a PDF source. Analogous to mapping-codebases but for prose documents.
git clone --depth 1 https://github.com/oaustegard/claude-skills /tmp/mapping-documents && cp -r /tmp/mapping-documents/mapping-documents ~/.claude/skills/mapping-documentsSKILL.md
# Mapping Documents
Generate `_MAP.md` files providing hierarchical document structure with semantic annotations. Maps show section summaries, typed claims (result/definition/method/caveat/open-question), symbol definitions, and cross-section dependencies — all anchored to page numbers.
The structural analog to `mapping-codebases`: tree-sitter parses code via grammar, docmap parses documents via font analysis + LLM extraction.
## Installation
```bash
pip install pdfplumber anthropic --break-system-packages -q
```
## Generate Maps
```bash
# Full run (structure + semantic extraction via Claude API)
python /mnt/skills/user/mapping-documents/scripts/docmap.py paper.pdf \
--out docs/ --genre paper --workers 4
# Structure only (no API calls, no cost)
python /mnt/skills/user/mapping-documents/scripts/docmap.py paper.pdf \
--out docs/ --structure-only
```
API key resolution: `--api-key` flag > `ANTHROPIC_API_KEY` env > `API_KEY` env.
## Output Artifacts
Four files, forming a three-layer progressive-disclosure stack:
```
CLAUDE.md / project instructions ← curated invariants (you write this)
↕ (_USAGE.md bridges the gap)
_MAP.md + JSON indexes ← navigable document map (docmap generates)
↕
raw PDF ← the source document
```
| File | Purpose | When to read |
|------|---------|--------------|
| `{stem}_USAGE.md` | Snippet for pasting into CLAUDE.md / AGENTS.md / project knowledge. Describes the reading order and JSON query patterns. | Once, at setup |
| `{stem}_MAP.md` | Section map: TOC with summaries, typed claims, defined symbols, dependencies. All page-anchored. | Any question about what the document says |
| `{stem}.symbols.json` | Flat symbol index: where defined, where used, what it means. | "Where is X defined?" |
| `{stem}.anchors.json` | Every claim: section ID, type, text, page number. | "What caveats exist?" / "What does §3 claim?" |
## After Generating: Wire It Up
Generating the map is step 1. Step 2 is telling the agent the map exists.
**For a code repo (CLAUDE.md / AGENTS.md):**
```bash
# Paste the generated usage snippet into your agent instructions
cat docs/paper_USAGE.md >> CLAUDE.md
```
**For Claude.ai project knowledge:**
Upload `_MAP.md` as a project knowledge file, or paste the `_USAGE.md` content into project instructions.
The `_USAGE.md` snippet includes copy-pasteable query commands for the JSON indexes. Replace `QUERY` and `SECTION_ID` placeholders with actual values.
## Navigate Via Maps
After generating and wiring up, use the map for navigation — read `_MAP.md`, not the raw PDF.
**Workflow:**
1. Read `_USAGE.md` block in CLAUDE.md for orientation
2. Read top-level TOC in `_MAP.md` for structure and section summaries
3. Drill into relevant sections for typed claims and symbol definitions
4. Query `.symbols.json` for "where is X defined?" lookups
5. Query `.anchors.json` for claim filtering by type or section
6. Read the raw PDF only when exact wording or figures are needed
**Querying the JSON indexes:**
```bash
# Symbol lookup
python3 -c "import json; [print(f'§{s[\"defined_in\"]} p.{s[\"defined_at_page\"]}') \
for s in json.load(open('docs/paper.symbols.json')) if 'edl' in s['symbol']]"
# All caveats in the document
python3 -c "import json; [print(f'p.{c[\"page\"]} {c[\"text\"]}') \
for c in json.load(open('docs/paper.anchors.json')) if c['type'] == 'caveat']"
# All claims in a section
python3 -c "import json; [print(f'[{c[\"type\"]}] {c[\"text\"]}') \
for c in json.load(open('docs/paper.anchors.json')) if c['section'] == '4.3']"
```
## Genre Support
Genre controls the claim taxonomy used in semantic extraction.
| Genre | Claim types | Best for |
|-------|-------------|----------|
| `paper` (default) | definition, result, method, claim, caveat, open-question | Academic papers, arXiv preprints |
| `spec` | requirement, definition, constraint, example, note | RFCs, API specs, technical standards |
| `legal` | definition, obligation, right, exception, condition, reference | Contracts, policy documents, regulations |
## Limitations (v0.1.x)
- **PDF-only.** No DOCX, HTML, or plain text input yet.
- **Single-column layout assumed.** Two-column papers may mis-order text within sections.
- **No caching.** Re-running re-extracts everything.
- **No citation cross-referencing.**
- **Genre must be specified manually.**
- **Semantic extraction can hallucinate.** Every claim is page-anchored, but the page number comes from the LLM. Verify critical claims against the source.
## CLI Reference
```
python docmap.py paper.pdf [options]
Options:
--genre {paper,spec,legal} Claim taxonomy (default: paper)
--structure-only Skip LLM pass (free, fast)
--out DIR Output directory (default: .)
--api-key KEY Anthropic API key
--model MODEL Model (default: claude-sonnet-4-6)
--workers N Parallel workers (default: 4)
--no-usage-snippet Skip _USAGE.md generation
-v Verbose structural parsing
```GitHub repository access in containerized environments using REST API and credential detection. Use when git clone fails, or when accessing private repos/writing files via API.
Securely manages API credentials for multiple providers (Anthropic Claude, Google Gemini, GitHub). Use when skills need to access stored API keys for external service invocations.
Guidance for asking clarifying questions when user requests are ambiguous, have multiple valid approaches, or require critical decisions. Use when implementation choices exist that could significantly affect outcomes.
>-
>-
Browse Bluesky content via API and firehose - search posts, fetch user activity, sample trending topics, read feeds and lists, analyze and categorize accounts. Supports authenticated access for personalized feeds. Use for Bluesky research, user monitoring, trend analysis, feed reading, firehose sampling, account categorization.
Generate progressive disclosure indexes for GitHub repositories to use as Claude project knowledge. Use when setting up projects referencing external documentation, creating searchable indexes of technical blogs or knowledge bases, combining multiple repos into one index, or when user mentions "index", "github repo", "project knowledge", or "documentation reference".
Analyze and categorize Bluesky accounts by topic using keyword extraction. Use when users mention Bluesky account analysis, following/follower lists, topic discovery, account curation, or network analysis.