Skip to main content
ClaudeWave
Skill126 estrellas del repoactualizado 3d ago

repo-map

Ranked symbol map of a codebase within a token budget — a compact "what matters in this repo" before reading files. Use when starting work in an unfamiliar/large codebase, before a refactor or deep-review fan-out, when you need JIT context instead of dumping whole files, or asked "give me a map of this repo / where are the important functions / what's the structure". Zero-dependency (stdlib only); faithful reimplementation of Aider's repo-map (tree-sitter+PageRank → ranked tags).

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/AnastasiyaW/claude-code-config /tmp/repo-map && cp -r /tmp/repo-map/skills/development/repo-map ~/.claude/skills/repo-map
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# repo-map

Produces a **ranked list of the structurally-important symbol definitions** in a
codebase, capped to a token budget. The point: instead of dumping whole files
(wasteful, blows the context window), get a cheap "skeleton" of the repo —
the functions/classes everything else depends on — and read full files only
where it matters.

This is our zero-dependency port of [Aider's repo-map](https://aider.chat/docs/repomap.html).

## When to use

- **Onboarding** to an unfamiliar or large codebase — get the lay of the land first.
- **Before a refactor / `deep-review` / `workflow-orchestration` fan-out** — feed each
  subagent the map so it knows the important entry points without re-scanning.
- **JIT context** (principle 07 / `practice_context_engineering`) — load the map, not the files.
- Anytime the instinct is "let me read 20 files to understand this" → run this first.

## Usage

```bash
python scripts/repo_map.py [ROOT] [--budget-tokens N] [--top N] [--json] [--no-signature] [--max-files N]
```

- `ROOT` — repo root (default: cwd). If it's a git repo, only **tracked files** are
  scanned (honors `.gitignore`); otherwise a filtered directory walk is used.
- `--budget-tokens` (default 1024) — stop emitting once ~this many tokens are used
  (≈4 chars/token heuristic). Use 256–512 for a quick orientation, 2048+ for depth.
- `--top N` — hard cap on symbols before the budget applies.
- `--json` — machine-readable output (for piping into a workflow / another agent).
- `--no-signature` — emit `path:line: name` instead of the full signature line.

### Typical recipes

```bash
# Quick orientation in a new repo
python scripts/repo_map.py . --budget-tokens 512

# Feed a fan-out: JSON map of the most important 40 symbols
python scripts/repo_map.py /path/to/repo --top 40 --json > repo_map.json

# Focus a subdirectory (a single layer/service)
python scripts/repo_map.py /path/to/repo/services/auth --budget-tokens 800
```

## How ranking works

1. Extract definitions (functions/classes/types) and identifier references per file
   via per-language regexes (py/js/ts/tsx/go/rust/java/ruby/c/cpp/csharp/php/kotlin/swift).
2. Build a directed graph: edge `F → D` when file `F` references an identifier
   **defined** in file `D`. Rare identifiers (defined in few files) weigh more.
3. Run **PageRank** over the file graph → structurally-central files float up
   (shared utilities, core models, base classes).
4. Symbol score = `pagerank(def_file) × (1 + total_refs) × rarity`. Emit highest
   first until the token budget is hit.

So the map surfaces the code **everything else leans on**, not just the first files
alphabetically.

## Gotchas

- **Regex extraction, not tree-sitter.** Fidelity is good for ranking but not a
  parser. Exotic syntax, heavy macros, or unusual formatting can miss/misattribute a
  def. This is a deliberate trade for zero-install + runs-anywhere. To upgrade
  fidelity, replace `extract()` with `tree-sitter-language-pack` tags — the graph/
  PageRank stages stay identical. (Don't install tree-sitter casually: it trips the
  7-day supply-chain gate; gate it like any fresh dep.)
- **Token count is a heuristic** (`chars/4`), not a real tokenizer. Treat the budget
  as approximate; it errs slightly high on code with many short tokens.
- **Same name in many files** (e.g. `get`, `handle`) ranks each def site separately —
  expected, since each is a distinct definition. Use `--top` to trim noise.
- **Non-git dirs** fall back to a denylist walk (`node_modules`, `dist`, `venv`, …).
  If your build output lives somewhere unusual, it may get scanned — point ROOT at the
  source dir instead.
- **Generated code** (protobuf, `src/gen/`) can dominate ranks because it's referenced
  everywhere. Point ROOT at hand-written source, or filter after the fact.

## Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `0 files scanned` | ROOT has no recognized source extensions, or all gitignored | Check `git ls-files`; point ROOT at the source subdir |
| A key function is missing | Regex didn't match its signature style | Lower `--budget-tokens` pressure (raise budget) or accept the limitation; verify by `grep` |
| Map dominated by one vendored file | A `vendor/`/generated tree got scanned (non-git mode) | Run inside the git repo, or point ROOT at hand-written source |
| Wrong/old map | File moved; map is a point-in-time snapshot | Re-run — it's cheap and stateless |

## Verification

`scripts/repo_map.py` is stdlib-only and was verified to run on real trees (it
correctly ranks shared-utility files to the top via PageRank). Re-run on any repo to
confirm; there is no state to corrupt.