Skill · 116 repo stars · updated 5d ago

harness-engineering

Design runtime infrastructure around AI agents — permissions, tools, feedback loops, observability. Use when deploying agents to production or designing multi-agent systems.

View source · Repository: third-brain-v5-skills

Install in Claude Code

git clone --depth 1 https://github.com/Mark393295827/third-brain-v5-skills /tmp/harness-engineering && cp -r /tmp/harness-engineering/skills/harness-engineering ~/.claude/skills/harness-engineering

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Harness Engineering

Design the system *around* AI agents for reliable, safe production use.

Harness Engineering protects the quality ceiling of Agentic Engineering. It turns fast agent output into controlled execution through permissions, observability, recovery, and adversarial validation.

## Agent Runtime Model

Treat the agent harness as the kernel around an LLM OS:

| Kernel concern | Agent harness responsibility |
|---|---|
| Memory management | Curate context, summarize bulky outputs, persist state to wiki/logs. |
| Syscall boundary | Expose tools with contracts, allowlists, deny rules, and retries. |
| Process isolation | Separate write scopes, sandboxes, credentials, and state per agent. |
| Scheduling | Decide sequential, parallel, or event-driven execution. |
| Interrupts | Stop, ask approval, rollback, or route to a safer action. |
| Observability | Log tool calls, decisions, outputs, costs, and verification evidence. |
| Garbage collection | Close idle agents, remove stale tasks, compact context, and record risks. |
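
The syscall-boundary row can be sketched as a small gateway that every tool call passes through. This is a minimal illustration, not part of the skill: `ToolGateway`, `ToolDenied`, the substring-based deny rules, and the retry-on-any-exception policy are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolDenied(Exception):
    """Raised when a call is blocked at the harness boundary."""


@dataclass
class ToolGateway:
    # All names here are hypothetical; they only illustrate the table row.
    tools: dict[str, Callable[..., Any]]   # registered tool implementations
    allowlist: set[str]                    # tool names this agent may call
    deny_substrings: tuple[str, ...] = ()  # deny rules matched against arguments
    max_retries: int = 2                   # extra attempts after a failed call

    def call(self, name: str, **kwargs: Any) -> Any:
        # Allowlist check: unknown or unlisted tools never execute.
        if name not in self.allowlist or name not in self.tools:
            raise ToolDenied(f"tool not allowlisted: {name}")
        # Deny rules: block any argument containing a forbidden substring.
        for value in kwargs.values():
            if any(bad in str(value) for bad in self.deny_substrings):
                raise ToolDenied(f"argument blocked by deny rule: {value!r}")
        # Retries: re-run the tool on failure, then surface the last error.
        last_error: Exception | None = None
        for _ in range(self.max_retries + 1):
            try:
                return self.tools[name](**kwargs)
            except Exception as exc:  # simplification: retries every error type
                last_error = exc
        assert last_error is not None
        raise last_error
```

A real harness would likely retry only transient errors and match deny rules on structured arguments rather than substrings.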

## Productized Agent Harness

Google I/O '26 added a practical pressure test for harness design: the same runtime pattern now appears in developer tools, personal agents, search, commerce, generative media, and smart glasses.

| Product surface | Harness control that must exist |
|---|---|
| Agent-first IDE | task queue, subagent ownership, hooks, sandbox, test proof |
| Personal agent | user mandate, memory scope, tool allowlist, resumable log |
| Agentic search | source provenance, comparison criteria, action preview |
| Agentic commerce | budget, merchant/payment boundary, mandate, receipt trail |
| Generative media | prompt/edit history, watermark/disclosure, content credentials |
| Ambient eyewear/device | sensor consent, privacy mode, physical-world fallback |

If a harness cannot produce an audit trail for what the agent saw, decided, called, changed, and verified, the agent is not ready for delegated action.
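
One way to make that audit-trail rule checkable is to require an entry for each of the five things named above. The sketch below is an assumption about shape only: `AuditTrail` and `ready_for_delegated_action` are hypothetical names, and it assumes a read-only action records an explicit "no changes" entry rather than leaving `changed` empty.

```python
from dataclasses import dataclass, field

# The five things the harness must be able to account for.
AUDIT_FIELDS = ("saw", "decided", "called", "changed", "verified")


@dataclass
class AuditTrail:
    # Hypothetical record shape: one list of human-readable entries per field.
    saw: list[str] = field(default_factory=list)
    decided: list[str] = field(default_factory=list)
    called: list[str] = field(default_factory=list)
    changed: list[str] = field(default_factory=list)
    verified: list[str] = field(default_factory=list)


def missing_audit_fields(trail: AuditTrail) -> list[str]:
    """Return the audit fields that have no recorded entries."""
    return [name for name in AUDIT_FIELDS if not getattr(trail, name)]


def ready_for_delegated_action(trail: AuditTrail) -> bool:
    """An agent is ready only when every audit field has evidence."""
    return not missing_audit_fields(trail)
```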

## Managed Agent Runtime Model

For production-like agents, separate the runtime into three resources:

| Resource | Defines | Harness questions |
|---|---|---|
| Agent | Model, persona, system prompt, skills, MCP/tools | Is the role narrow enough? Are capabilities necessary? |
| Environment | Execution space, container, network, credentials, filesystem | What can the agent reach? What is allowlisted or denied? |
| Session | Agent instance, mounted context, event stream, durable state | How is state resumed, deleted, audited, and recovered? |
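
The three resources can be modelled as separate types so that one agent definition can run in many sessions against a fixed environment. This is a rough sketch under assumptions: the field names, the `tool_hosts` mapping, and the `start_session` check are hypothetical and not defined by the skill.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Agent:
    name: str
    model: str
    system_prompt: str
    tools: tuple[str, ...]          # capabilities the role declares


@dataclass(frozen=True)
class Environment:
    container_image: str
    allowed_hosts: frozenset[str]   # network allowlist
    credentials: frozenset[str]     # names of mounted secrets, not values
    writable_paths: tuple[str, ...]


@dataclass
class Session:
    session_id: str
    agent: Agent
    environment: Environment
    events: list[dict] = field(default_factory=list)  # per-session state


def start_session(session_id: str, agent: Agent, environment: Environment,
                  tool_hosts: dict[str, str]) -> Session:
    """Refuse to start when a tool needs a host the environment does not allow."""
    blocked = sorted(
        tool for tool in agent.tools
        if tool in tool_hosts and tool_hosts[tool] not in environment.allowed_hosts
    )
    if blocked:
        raise PermissionError(f"tools outside the environment boundary: {blocked}")
    return Session(session_id, agent, environment)
```

Keeping `Agent` and `Environment` frozen means a session cannot widen its own capabilities or reach at runtime; only the session's event list is mutable.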

The session event log is the backbone of reliability. It should capture user messages, tool calls, tool results, agent responses, verification evidence, errors, and recovery actions. A response without an inspectable event trail is not enough for delegated production work.
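
A session event log of this kind can be as simple as an append-only JSON Lines file. The sketch below is illustrative: `SessionEventLog`, the event type names, and the record fields (`ts`, `type`, `payload`) are assumptions, not a format the skill prescribes.

```python
import json
import time
from pathlib import Path

# Hypothetical event types, mirroring the categories listed above.
EVENT_TYPES = {
    "user_message", "tool_call", "tool_result", "agent_response",
    "verification", "error", "recovery",
}


class SessionEventLog:
    """Append-only JSON Lines log; one event per line."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def append(self, event_type: str, payload: dict) -> dict:
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        event = {"ts": time.time(), "type": event_type, "payload": payload}
        # Open in append mode so earlier events are never rewritten.
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(event) + "\n")
        return event

    def replay(self) -> list[dict]:
        """Read all events back in order, e.g. to resume or audit a session."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
```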

## Permission Bike Method

Escalate autonomy by proven reliability, not by confidence in a prompt:

| Stage | Allowed capability | Required proof |
|---|---|---|
| Observe | Read, search, summarize, recommend | Sources and assumptions are inspectable. |
| Co-drive | Draft, simulate, prepare changes | Human approves every external action. |
| Training wheels | Execute low-risk scoped actions | Logs, rollback, and post-action checks pass. |
| Supervised autonomy | Run reversible routines | Alerts, receipts, and anomaly review exist. |
| Autonomy | Run high-frequency low-risk loops | Periodic permission audit and failure review. |

Do not hand an agent a tool key and rely on prompt instructions to prevent misuse. The real boundary is the key, endpoint, account, filesystem, network, budget, and approval path.
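
The stage table can be read as a promotion rule driven by recorded evidence. The following sketch assumes proof is cumulative (a stage is reached only if every lower stage's proof is also present), which the table implies but does not state; the evidence labels and `highest_stage` are hypothetical names.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    OBSERVE = 0
    CO_DRIVE = 1
    TRAINING_WHEELS = 2
    SUPERVISED_AUTONOMY = 3
    AUTONOMY = 4


# Hypothetical evidence labels, one set per "Required proof" cell.
REQUIRED_PROOF = {
    Stage.OBSERVE: {"inspectable_sources"},
    Stage.CO_DRIVE: {"human_approval_on_external_actions"},
    Stage.TRAINING_WHEELS: {"logs", "rollback", "post_action_checks"},
    Stage.SUPERVISED_AUTONOMY: {"alerts", "receipts", "anomaly_review"},
    Stage.AUTONOMY: {"permission_audit", "failure_review"},
}


def highest_stage(evidence: set[str]) -> Optional[Stage]:
    """Return the highest stage whose proof, and all lower proofs, are present."""
    reached = None
    for stage in Stage:  # iterates in ascending order
        if not REQUIRED_PROOF[stage] <= evidence:
            break
        reached = stage
    return reached
```

Under this rule, an agent with rollback and logs but no human-approval record stays at Observe: missing proof at one stage blocks every stage above it.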

## Usage Template

**Prompt**
```text
Use harness-engineering for this agent workflow. Design permissions, tools, feedback loops, observability, and failure handling.
```

**Use Case**
- Moving an agent workflow from ad hoc prompting toward a reliable runtime architecture.

**Expected Result**
- The agent produces a harness design with permission tiers, tool boundaries, logs, evals, and recovery paths.

**Output Example**
- A runtime spec with permission matrix, tool allowlist, approval gates, logs, evals, and incident response.

**Verification Case**
- The design names what the agent can do automatically, what needs approval, and what is denied.

**Verified Effect**
- An ad hoc agent workflow becomes a controlled runtime with explicit permissions, observability, and failure handling.

## Success Metrics

- The design specifies permissions, tool contracts, observability, failure handling, and a recovery path.
- High-risk actions have approval or sandbox boundaries.
- Verification evidence is defined before deployment or automation.
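
These metrics can be turned into a mechanical review of a harness design. The sketch below assumes the design is captured as a plain dictionary; the section keys, the `actions` list, and `review_spec` are hypothetical, chosen only to mirror the three bullets above.

```python
# Hypothetical section names for a harness design document.
REQUIRED_SECTIONS = (
    "permissions",
    "tool_contracts",
    "observability",
    "failure_handling",
    "recovery_path",
    "verification_evidence",
)


def review_spec(spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the metrics are met."""
    problems = [
        f"missing section: {section}"
        for section in REQUIRED_SECTIONS
        if not spec.get(section)
    ]
    # Every high-risk action needs an approval gate or a sandbox boundary.
    for action in spec.get("actions", []):
        bounded = action.get("approval") or action.get("sandbox")
        if action.get("risk") == "high" and not bounded:
            problems.append(
                f"high-risk action without approval or sandbox: {action['name']}"
            )
    return problems
```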

## When to Use

- Deploying agents to production
- Setting up permissions/guardrails/approval workflows
- Designing multi-agent systems
- Tightening constraints after an agent behaved unpredictably
- Configuring auto-mode or permission tiers

---

## Three Domains

| Domain | Object | Maturity |
|--------|--------|----------|
| **Physical** | Wire harnesses (automotive/aerospace) | ⭐ Mature |
| **Software** | CI/CD pipelines (Harness.io) | ⭐ Mature |
| **Cognitive** ⭐ | AI Agents | 🌱 Emerging |

---

## Six Components

### 1. Context & Knowledge Layer
- Curated access to code, docs, schemas, logs
- Use `CLAUDE.md` for project-level context
- Use `context-manager` for token budgeting
- Never inject raw files of 10K+ tokens into context
- Persist reusable outputs to wiki, docs, logs, or state files; chat history is not durable memory
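
The 10K-token rule can be enforced with a guard in front of whatever loads files into context. This is a minimal sketch, not the `context-manager` skill's behavior: the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and `prepare_for_context` and its truncation strategy are assumptions (a production harness would more likely summarize than truncate).

```python
MAX_RAW_TOKENS = 10_000  # threshold taken from the rule above
CHARS_PER_TOKEN = 4      # assumption: crude estimate, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count by character length, rounded up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def prepare_for_context(name: str, text: str, budget_tokens: int = 2_000) -> str:
    """Pass small files through; replace bulky ones with a labelled excerpt."""
    total = estimate_tokens(text)
    if total < MAX_RAW_TOKENS:
        return text
    excerpt = text[: budget_tokens * CHARS_PER_TOKEN]
    header = f"[{name}: ~{total} tokens, truncated to ~{budget_tokens}]"
    return header + "\n" + excerpt
```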

### 2. Tooling & API Surface
**Three-Tier Permission Model:**

| Tier | Scope | Mechanism | Examples |
|:----:|-------|-----------|----------|
| **1** | Safe tools | Always allowed | Read, search, grep, glob |
| **2** | In-project | Auto-approve (git reviewable) | Write/edit in project dir |
| **3** | High-risk | Classifier/human approval | Shell, API calls, deletes |
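
The tier table can be sketched as a classifier that runs before each tool call. The tool names and the `classify` function below are hypothetical; the sketch assumes lowercase tool identifiers and treats any write outside the project directory, and any unrecognized tool, as Tier 3.

```python
from pathlib import Path
from typing import Optional

# Hypothetical tool identifiers taken from the Examples column.
TIER_1_TOOLS = {"read", "search", "grep", "glob"}
TIER_2_TOOLS = {"write", "edit"}


def classify(tool: str, project_dir: str,
             target: Optional[str] = None) -> tuple[int, str]:
    """Return (tier, mechanism) for a proposed tool call."""
    if tool in TIER_1_TOOLS:
        return 1, "allow"
    if tool in TIER_2_TOOLS and target is not None:
        # resolve() collapses "..", so path escapes fall through to Tier 3.
        inside = Path(target).resolve().is_relative_to(Path(project_dir).resolve())
        if inside:
            return 2, "auto-approve"
    # Shell, API calls, deletes, out-of-project writes, and unknown tools.
    return 3, "needs-approval"
```

Defaulting unknown tools to Tier 3 keeps the classifier fail-closed: adding a new tool requires an explicit decision to lower its tier.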

**Tier 3 heuristic:**
```
1. Can destroy data irreversibly? → BLOCK
2. Accesses credentials? → BLOCK
3. Affects shared infrastructure? → BLOCK
4. T
```

From the same repository

daily-okr (Skill)

Execute a daily knowledge compound closed loop — 7 Key Results from input to feedback with scoring. Use when the user wants to do a daily review, plan their day, or run a knowledge workflow.

session-learn (Skill)

Extract reusable knowledge from a work session and save concepts, entities, corrections, patterns, ideas, decisions, and gaps to the wiki. Use when ending a session or when the user says to extract knowledge.

token-cost-tracker (Slash Command)

Estimate and track token usage and cost across the knowledge pipeline. Run before expensive tasks to budget, after tasks to log actuals.

wiki-lint (Skill)

Health-check the knowledge wiki — find orphans, broken links, missing frontmatter, contradictions, stale content, and statistical drift. Use when the user says "lint the wiki", "health check", or periodically for maintenance.

agent-teams-command (Skill)

Command multi-agent work with bounded roles, ownership, integration gates, and verification loops. Use when the user needs Claude Code Agent Teams, parallel agents, delegation strategy, or multi-agent orchestration.

agentic-engineering (Skill)

Design or refactor agent skills, workflows, and operating loops for model-native Agentic Engineering. Use when making skills more autonomous, concise, verifiable, long-horizon capable, token-efficient, and lower-friction for human-LLM collaboration.

ai-six-sigma-property-os (Skill)

Design an AI Six Sigma Black Belt operating model for property service, maintenance dispatch, environmental testing, quote generation, CRM follow-up, and workflow quality dashboards. Use when the user needs a Property Agent OS, AI + Ontology + DMAIC management system, CTQ metrics, agent-team roles, work-order states, or MVP roadmap for operations quality.

anthropic-os (Skill)

Improve a personal or team operating system with self-evolving loops, CASH allocation, 3B creativity, predictive coding, and diagnostics. Use when the user wants to redesign a work method, learning loop, or cognitive operating system.