Skill · 126 repo stars · updated 3d ago
harness-audit
Score a project's agent harness across 5 subsystems (Instructions / State / Verification / Scope / Lifecycle), identify the bottleneck, and produce a prioritized improvement plan. Use when assessing if a project is ready to graduate to [LONG-RUN] status, when an agent keeps failing despite good models, or when adopting our stack on a new codebase.
Install in Claude Code
`git clone --depth 1 https://github.com/AnastasiyaW/claude-code-config /tmp/harness-audit && cp -r /tmp/harness-audit/skills/operational/harness-audit ~/.claude/skills/harness-audit`
Then open a new Claude Code session; the skill loads automatically.
Definition
SKILL.md
# Harness Audit
Score a project's agent harness across five subsystems and tell the user which one to fix first.
**Source**: Five-subsystem framework adapted from [Learn Harness Engineering](https://walkinglabs.github.io/learn-harness-engineering/) (walkinglabs, MIT). Adapted to our concrete stack: CLAUDE.md, `.claude/rules/`, PROBLEMS.md, `feature_list.json`, `init.sh`, hooks, handoffs, chronicles.
## What This Skill Does
Given a project directory, produces a scorecard like this:
```
=== Harness Audit: project-xyz ===

Instructions 4/5 ✓ CLAUDE.md present, modular rules in .claude/rules/
                 ✗ No project-level REVIEW.md for PR review guidance
State        2/5 ✓ .claude/handoffs/ exists (3 files)
                 ✗ No PROBLEMS.md - issues scattered in handoffs
                 ✗ No feature_list.json - scope state not machine-readable
Verification 3/5 ✓ Tests run, pytest configured
                 ✗ No init.sh - new sessions take 15+ min to bootstrap
                 ✗ 3-layer gate not documented in CLAUDE.md
Scope        3/5 ✓ no-pre-existing-evasion principle in CLAUDE.md
                 ✗ No WIP=1 (no feature_list.json to enforce it)
                 ✗ Definition of Done not explicit
Lifecycle    2/5 ✗ No SessionStart hook (no .claude/settings.json)
                 ✗ No Stop hook for clean-state check
                 ~ Manual cleanup convention exists but not enforced

Bottleneck: State (2/5) - lack of structured progress tracking

Top 3 improvements (in order):
1. Create PROBLEMS.md (1h) ↗ State 2→4
   Template: claude-code-skills/templates/long-run-project/ has examples
2. Create feature_list.json + init.sh (30min) ↗ State 2→5, Verification 3→4
   Drop-in: claude-code-skills/templates/long-run-project/
3. Add Stop hook stop-test-gate.py (15min) ↗ Lifecycle 2→4
   Source: claude-code-skills/hooks/stop-test-gate.py

After top 3: Instructions 4 + State 5 + Verification 4 + Scope 3 + Lifecycle 4 = 20/25 (was 14/25)
```
The skill does **not** make changes. It produces the scorecard. The user decides whether to apply recommendations.
---
## The Five Subsystems (Our Adaptation)
| Subsystem | Concrete files/conventions in our stack |
|---|---|
| **Instructions** | `CLAUDE.md` (root + `~/.claude/`), `.claude/rules/*.md` (project), `~/.claude/rules/*.md` (global), optional `REVIEW.md` |
| **State** | `PROBLEMS.md`, `feature_list.json`, `.claude/handoffs/`, `.claude/chronicles/` |
| **Verification** | `init.sh`, tests configured, 3-Layer Validation Gate referenced in CLAUDE.md, Proof Loop usage |
| **Scope** | `no-pre-existing-evasion.md` rule applied, WIP=1 enforced (one `in-progress` in feature_list.json), explicit Definition of Done |
| **Lifecycle** | SessionStart hooks, Stop hooks (stop-test-gate, check-problems-md), cleanup convention |
See `references/checklist-per-subsystem.md` for per-subsystem concrete checks.
See `references/scoring-rubric.md` for how to interpret 1-5 scores.
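
The WIP=1 convention in the Scope row lends itself to a mechanical check. The following is a minimal sketch, assuming `feature_list.json` holds a top-level `features` array whose entries carry a `status` field. That schema and the function names are hypothetical, since the file's layout is not defined in this skill, and "one `in-progress`" is read here as "at most one".

```python
import json
from pathlib import Path


def count_in_progress(feature_list_text: str) -> int:
    """Count features marked in-progress.

    Assumed (hypothetical) schema: {"features": [{"name": ..., "status": ...}]}.
    """
    data = json.loads(feature_list_text)
    return sum(1 for f in data.get("features", []) if f.get("status") == "in-progress")


def wip_is_one(project_root: str) -> bool:
    """True when feature_list.json exists and at most one feature is in-progress."""
    path = Path(project_root) / "feature_list.json"
    if not path.is_file():
        return False  # no file, so WIP=1 cannot be enforced
    return count_in_progress(path.read_text(encoding="utf-8")) <= 1
```

A missing file is treated as a failed check, which matches the sample scorecard's "No WIP=1 (no feature_list.json to enforce it)" finding.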
---
## How to Run an Audit
### Phase 1 — Gather
Read these files in order (skip silently if missing):
1. `CLAUDE.md` in project root
2. `AGENTS.md` in project root (some projects use this name)
3. `.claude/rules/*.md` (project-level rules)
4. `.claude/settings.json` and `.claude/settings.local.json` (hooks config)
5. `PROBLEMS.md` in root
6. `feature_list.json` in root
7. `init.sh` in root (and `Makefile` / `package.json` scripts as fallback)
8. `.claude/handoffs/` (count files, check `INDEX.md` existence)
9. `.claude/chronicles/` (count files)
10. Sample test config: `pytest.ini` / `package.json` test script / `Cargo.toml`
Use `Glob` + `Read`. Don't `grep` across the entire codebase; this is a metadata audit, not a code review.
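
Outside Claude Code, the gather phase amounts to probing a fixed set of paths. The sketch below uses `pathlib` in place of the `Glob` and `Read` tools; the function name `gather` and the shape of the returned dictionary are illustrative assumptions, not part of the skill.

```python
from pathlib import Path

# Files from the Phase 1 list, relative to the project root.
SINGLE_FILES = [
    "CLAUDE.md", "AGENTS.md",
    ".claude/settings.json", ".claude/settings.local.json",
    "PROBLEMS.md", "feature_list.json",
    "init.sh", "Makefile", "package.json",
    "pytest.ini", "Cargo.toml",
]
# Directories whose files are only counted, with the glob pattern to count.
COUNTED_DIRS = {
    ".claude/rules": "*.md",
    ".claude/handoffs": "*",
    ".claude/chronicles": "*",
}


def gather(project_root: str) -> dict:
    """Record which harness files exist; missing entries are skipped silently."""
    root = Path(project_root)
    found = {name: True for name in SINGLE_FILES if (root / name).is_file()}
    for rel_dir, pattern in COUNTED_DIRS.items():
        directory = root / rel_dir
        if directory.is_dir():
            found[rel_dir] = sum(1 for p in directory.glob(pattern) if p.is_file())
    found["handoffs_index"] = (root / ".claude" / "handoffs" / "INDEX.md").is_file()
    return found
```

Only metadata is collected: nothing under the source tree is opened, in keeping with the "metadata audit" constraint.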
### Phase 2 — Score
For each subsystem, run the checks in `references/checklist-per-subsystem.md`. Each check is a binary pass/fail. Score:
- **5** = all checks pass + documented + consistently followed
- **4** = most checks pass, 1-2 gaps
- **3** = covers basics, missing polish
- **2** = weak, several checks fail
- **1** = missing or actively harmful
For each subsystem, list:
- ✓ what's present and working
- ✗ what's missing or broken
- ~ partial / unclear
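
One way to turn binary check results into the 1-5 scale is to threshold on the number and share of passing checks. The thresholds below are illustrative assumptions only; the authoritative interpretation is in `references/scoring-rubric.md`, and a score of 5 also requires the "documented + consistently followed" judgment, which a pass count cannot capture.

```python
def score_subsystem(results: list[bool]) -> int:
    """Map binary check results to a 1-5 score.

    Thresholds are hypothetical, chosen only to mirror the shape of the rubric.
    """
    if not results:
        return 1  # nothing to check counts as missing
    failed = results.count(False)
    ratio = (len(results) - failed) / len(results)
    if failed == 0:
        return 5  # all checks pass
    if failed <= 2 and ratio >= 0.6:
        return 4  # most checks pass, 1-2 gaps
    if ratio >= 0.5:
        return 3  # covers basics
    if ratio > 0:
        return 2  # weak, several checks fail
    return 1  # every check fails
```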
### Phase 3 — Identify Bottleneck
The lowest-scoring subsystem is the bottleneck. **Even if another subsystem fails more checks in absolute terms**, the lowest score is the one to fix first because it limits the value of the rest.
Tie-breaker (multiple subsystems at same low score): pick the one whose improvement *unlocks* progress in others. State usually wins ties because feature_list.json + PROBLEMS.md unlock Verification and Scope checks.
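
The selection rule can be sketched as a minimum over scores with an explicit tie-break order. The order below is an assumption: the text only says State usually wins ties, so the positions of the other four subsystems are hypothetical.

```python
# Hypothetical tie-break order; only "State first" comes from the text above.
TIE_BREAK_ORDER = ["State", "Verification", "Scope", "Instructions", "Lifecycle"]


def find_bottleneck(scores: dict[str, int]) -> str:
    """Return the lowest-scoring subsystem, preferring earlier TIE_BREAK_ORDER entries."""
    lowest = min(scores.values())
    tied = [name for name, value in scores.items() if value == lowest]
    # Raises ValueError if a subsystem name is not listed in TIE_BREAK_ORDER.
    return min(tied, key=TIE_BREAK_ORDER.index)
```

With the sample scorecard's scores, State and Lifecycle tie at 2 and the function returns `"State"`, matching the bottleneck reported there.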
### Phase 4 — Prioritized Improvement Plan
Output exactly 3 next steps in order, each with:
- **Effort** estimate (15min / 30min / 1h / 1d)
- **Subsystem(s)** it improves and by how much (2→4, etc.)
- **Pointer** to a template or example in `claude-code-skills/` if available
The 3 steps must:
1. Address the bottleneck first
2. Be independently shippable (no step depends on a later one)
3. Together raise the total score by at least 4 points (out of 25)
Do not give more than 3. Three is enough scope for one focused session.
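
The mechanical parts of these rules can be checked before the plan is shown. This is a minimal sketch; the `Step` type and `plan_problems` function are hypothetical names, and independent shippability is left to judgment because it cannot be verified from the numbers alone.

```python
from dataclasses import dataclass

ALLOWED_EFFORTS = {"15min", "30min", "1h", "1d"}


@dataclass
class Step:
    title: str
    effort: str            # one of ALLOWED_EFFORTS
    gains: dict[str, int]  # subsystem -> points gained by this step


def plan_problems(steps: list[Step], bottleneck: str) -> list[str]:
    """Return rule violations for a plan; an empty list means it passes."""
    problems = []
    if len(steps) != 3:
        problems.append("plan must contain exactly 3 steps")
    if steps and bottleneck not in steps[0].gains:
        problems.append("first step must address the bottleneck")
    if any(s.effort not in ALLOWED_EFFORTS for s in steps):
        problems.append("unknown effort estimate")
    total_gain = sum(sum(s.gains.values()) for s in steps)
    if total_gain < 4:
        problems.append("plan must raise the total score by at least 4 points")
    return problems
```

Gains are recorded per step rather than as before/after pairs, so the sample plan's 14 to 20 improvement corresponds to gains of 2, 2, and 2 across its three steps.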
---
## Output Format
Use the visual scorecard format shown at the top of this skill. Sections:
1. **Header**: `=== Harness Audit: <project-name> ===` (one line)
2. **Scorecard**: 5 entries, one per subsystem, each with score + ✓/✗/~ findings (one finding per line)
3. **Bottleneck**: one line naming the subsystem and score
4. **Top 3 improvements**: numbered list with effort + impact + pointer
5. **Projected total**: optional, only if user asked for "after" state
Keep the entire output under 50 lines. The user is scanning for next steps, not reading an essay. Detail goes into the per-subsystem checklist file, not the audit output.
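
A rough sketch of how the header, scorecard, and bottleneck sections might be rendered with the 50-line limit enforced. The function name and the `findings` structure are assumptions, the improvement list is omitted for brevity, and ties are resolved by insertion order rather than by the tie-breaker described in Phase 3.

```python
def render_scorecard(project: str, findings: dict[str, tuple[int, list[str]]]) -> str:
    """Render header, per-subsystem findings, and the bottleneck line.

    `findings` maps subsystem name -> (score, list of "✓ ..." / "✗ ..." / "~ ..." notes).
    """
    lines = [f"=== Harness Audit: {project} ==="]
    for name, (score, notes) in findings.items():
        first, *rest = notes or [""]
        # Subsystem name padded to 13 columns so the findings line up.
        lines.append(f"{name:<13}{score}/5 {first}".rstrip())
        lines.extend(f"{'':<17}{note}" for note in rest)
    lowest = min(findings, key=lambda n: findings[n][0])
    lines.append(f"Bottleneck: {lowest} ({findings[lowest][0]}/5)")
    if len(lines) >= 50:
        raise ValueError("audit output must stay under 50 lines")
    return "\n".join(lines)
```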
---
## What This Skill Is NOT
- **Not a code review** — does not look at source code quality
- **Not a security audit** — does not check for vulnerabilities (use `/security-review` instead)
- **No