harness-engineering
Harness Engineering designs the control infrastructure for autonomous agents, defining what surfaces they can edit, how they receive feedback, where state persists, and how failures recover. Use this skill when building research loops, evaluation scaffolds, or PR-producing agents that must run safely for extended periods without corruption, and when implementing locked metrics, durable logs, novelty gates, and human approval boundaries.
git clone --depth 1 https://github.com/guanyang/open-agent-hub /tmp/harness-engineering && cp -r /tmp/harness-engineering/skills/harness-engineering ~/.claude/skills/harness-engineering
SKILL.md
# Harness Engineering

Harness engineering designs the control system around an agent: what it may edit, how it receives feedback, where it writes state, how failures recover, and who can approve irreversible actions. The harness is the difference between a helpful agent session and an autonomous loop that can run for days without corrupting its objective.

## When to Activate

Activate this skill when:

- Building autonomous research or experimentation loops
- Designing an agent environment with locked metrics and editable code or content
- Creating PR-producing or background agents
- Evaluating whether an agent can safely run without frequent human prompts
- Adding novelty, ablation, pruning, rollback, or durable logging to an agent workflow
- Preventing agents from gaming benchmarks, weakening rubrics, or losing state across compaction

Do not activate this skill for adjacent work owned by other skills:

- General quality gates, regression suites, or outcome metrics without autonomous control surfaces: `evaluation`.
- Tool schemas, response formats, and recovery errors for harness tools: `tool-design`.
- Project-level task-model fit, pipeline shape, and cost planning: `project-development`.
- Remote sandbox, warm-pool, and hosted session infrastructure: `hosted-agents`.

## Core Concepts

### Harness Boundary

Separate the agent from the environment it operates inside. The agent proposes actions; the harness defines allowed surfaces, feedback, persistence, and promotion rules.
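The split between proposing and deciding can be made concrete as a small policy check. The following is a minimal sketch, not a reference implementation: it assumes the four surface classes this skill uses (locked, editable, append-only, human-controlled), and the path patterns, operation names, and the `decide` function are all hypothetical names introduced for illustration.

```python
from enum import Enum
from fnmatch import fnmatchcase


class Surface(Enum):
    LOCKED = "locked"
    EDITABLE = "editable"
    APPEND_ONLY = "append-only"
    HUMAN_CONTROLLED = "human-controlled"


# Hypothetical repository layout; the first matching pattern wins.
SURFACE_RULES = [
    ("eval/*", Surface.LOCKED),
    ("rubrics/*", Surface.LOCKED),
    ("logs/*", Surface.APPEND_ONLY),
    ("experiments/*", Surface.EDITABLE),
    ("deploy/*", Surface.HUMAN_CONTROLLED),
]

# Operations the agent may perform on its own (assumed operation names).
AGENT_OPS = {
    Surface.LOCKED: {"read"},
    Surface.EDITABLE: {"read", "write", "append"},
    Surface.APPEND_ONLY: {"read", "append"},
    Surface.HUMAN_CONTROLLED: set(),
}


def classify(path: str) -> Surface:
    """Map a path to its surface class."""
    for pattern, surface in SURFACE_RULES:
        if fnmatchcase(path, pattern):
            return surface
    # Assumption: unknown paths fall into the most restrictive class.
    return Surface.HUMAN_CONTROLLED


def decide(op: str, path: str, human_approved: bool = False) -> bool:
    """Return True if the harness lets the proposed operation proceed."""
    surface = classify(path)
    if surface is Surface.HUMAN_CONTROLLED:
        return human_approved
    return op in AGENT_OPS[surface]
```

The agent never calls the filesystem directly in this design; it submits `(op, path)` proposals, and only the harness turns an approved proposal into an action.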
Use four surface classes:

| Surface | Examples | Rule |
| --- | --- | --- |
| Locked | Eval metric, rubric, validation script, merge policy | Agent may read and propose changes, but cannot score itself with modified rules |
| Editable | Skill draft, experiment file, prompt, config under test | Agent may mutate during the loop |
| Append-only | Results log, research thread, rejected ideas | Agent may append, not rewrite |
| Human-controlled | Merge, production deploy, credentials, destructive operations | Requires explicit human approval |

### Tight Feedback Loops

Autonomy works when feedback is fast, unambiguous, and hard to game. Karpathy's `autoresearch` is the minimal pattern: one editable file, one locked evaluation file, fixed wall-clock budget, one scalar metric, git rollback, and a durable results log. The lesson is not that every harness needs one metric; it is that ambiguous feedback creates ambiguous autonomy.

For open-ended research-to-skill work, replace the scalar metric with locked rubrics, deterministic structure checks, source traceability, and human review thresholds.

### Durable State

Long-running agents must externalize state. Store plans, source queues, results, failures, and handoffs in files so future agents can resume without relying on chat history. Prime Intellect's autonomous nanoGPT work showed the value of durable scratchpads and `THREAD.md`-style logs for recovery, monitoring, and audit.

Use append-only logs for:

- What was tried
- What improved or failed
- Why a candidate was kept, discarded, or routed to review
- Which upstream sources were checked
- What the next agent should do

### Search Discipline

Agents tend to exploit the nearest surface, stack complexity, and under-run pruning. Add explicit search rules:

1. Refresh upstream sources on a schedule.
2. Require novelty checks before spending large budgets.
3. Preserve rejected attempts to avoid rediscovery.
4. Run leave-one-out pruning when a stack has multiple additions.
5. Reward simplification when quality is equal.
6. Use separate verification before promotion.

### Mechanism Registry

For research-to-skill systems, track accepted mechanisms separately from prose. A mechanism record should include a stable `mechanism_id`, `owning_skill`, `status`, activation scenario, behavior change, evidence, and failure modes. Novelty gates should compare against this registry before using broader corpus overlap, because keyword overlap catches stale phrasing while mechanism comparison catches real duplication.

### Governance

Autonomous agents may prepare PRs, but governance must be explicit. They can draft changes, run checks, and write PR summaries. They should not merge, deploy, or push without human approval unless the user has explicitly granted that permission for the specific action.

## Detailed Topics

### Autoresearch-Style Loop

Use this pattern when optimizing an artifact against a stable evaluator:

```text
read locked context -> choose hypothesis -> edit allowed surface -> commit/checkpoint -> run evaluator -> log result -> keep if better -> discard or rollback if worse -> repeat
```

Required properties:

- The evaluator is outside the editable surface.
- The feedback cadence is fixed enough to compare attempts.
- Failed attempts leave an audit trail.
- Rollback is cheap.
- The agent has a policy for crashes and timeouts.

### Research-To-Skill Loop

Use this pattern when sources become skill changes:

```text
discover -> retrieve -> gate -> score -> extract mechanism -> map to existing or new skill -> draft proposal -> validate structure -> prepare PR -> human review
```

The locked evaluator is a combination of source rubrics, skill-change rubrics, structure checks, and reviewer approval. The editable artifact is the proposed skill delta.

### Metric Gaming Resistance

Assume an optimizing agent will learn the harness.
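One way a harness might detect that a locked evaluator or rubric was modified during a run is to fingerprint the locked files when the run starts and re-check them before any score is accepted. This is a minimal sketch under that assumption; `fingerprint` and `changed_files` are hypothetical helper names, and SHA-256 over file contents is one reasonable choice rather than a prescribed mechanism.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return {path: sha256 hex digest} for each locked file."""
    return {
        str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest()
        for p in paths
    }


def changed_files(baseline, paths):
    """List locked files whose content differs from the baseline snapshot."""
    current = fingerprint(paths)
    return sorted(
        path for path, digest in current.items()
        if baseline.get(path) != digest
    )
```

A harness using this would take the baseline once at run start, call `changed_files` before scoring, and treat any non-empty result as a rejected attempt to be logged and routed to human review rather than scored.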
Guard against:

- Editing evaluation code or rubrics and then using the new version for self-approval
- Adding verbose content that pleases a judge but harms skill activation
- Citing unretrieved sources
- Optimizing aggregate scores while failing a critical dimension
- Avoiding failed results in the log

Mitigation: lock rubrics per run, report per-dimension scores, require source retrieval evidence, preserve rejected attempts, and route governance changes to human review.

### Monitoring Agents

Use monitoring agents for long runs, but restrict them to read-only reporting unless explicitly t
Principal Software Architect specializing in system design, database modeling, API engineering, and system resilience.
Principal Diagnostics Engineer specializing in root cause analysis, error troubleshooting, and hotfixes.
Principal Clean Code Specialist specializing in code simplification, performance tuning, and refactoring loops.
Senior Technical Lead and Security Auditor specializing in code quality, correctness, and security audits.
Senior QA Automation Engineer specializing in unit, integration, and E2E test suite creation.
Run when user calls /commit or asks to generate a commit message. Analyzes staged changes and writes a structured commit message.
Run when user calls /review. Analyzes local changes and runs a comprehensive code review using the agent-reviewer prompt.
Run when user calls /test-tdd. Scans modified files, locates their corresponding unit/integration test suites, and runs them.