bare-eval
Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation.
git clone --depth 1 https://github.com/yonatangross/orchestkit /tmp/bare-eval && cp -r /tmp/bare-eval/plugins/ork/skills/bare-eval ~/.claude/skills/bare-eval
# Bare Eval — Isolated Evaluation Calls
Run `claude -p --bare` for fast, clean eval/grading without plugin overhead.
**CC 2.1.81 or later required.** The `--bare` flag skips hooks, LSP, plugin sync, and skill directory walks.
## When to Use
- Grading skill outputs against assertions
- Trigger classification (which skill matches a prompt)
- Description optimization iterations
- Any scripted `-p` call that doesn't need plugins
## When NOT to Use
- Testing skill routing (needs `--plugin-dir`)
- Testing agent orchestration (needs full plugin context)
- Interactive sessions
## Prerequisites
```bash
# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."
# Verify CC version
claude --version # Must be >= 2.1.81
```
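
Both prerequisites can be checked before a pipeline starts. The following is a minimal sketch, not part of OrchestKit: `version_ge` and `require_bare_support` are hypothetical helper names, and it assumes `sort -V` is available and that `claude --version` prints a token of the form `x.y.z`.

```bash
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on `sort -V` (version sort), which is an assumption about the host.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical guard: fails fast when the API key or the CC version is missing.
# Assumes `claude --version` output contains an x.y.z version token.
require_bare_support() {
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set; --bare cannot use OAuth/keychain" >&2
    return 1
  fi
  installed="$(claude --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
  if [ -z "$installed" ] || ! version_ge "$installed" "2.1.81"; then
    echo "CC >= 2.1.81 required for --bare (found: ${installed:-none})" >&2
    return 1
  fi
}
```

Calling `require_bare_support || exit 1` at the top of an eval script would stop the run before any grading call is made.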
## Quick Reference
| Call Type | Command Pattern |
|-----------|----------------|
| Grading | `claude -p "$prompt" --bare --max-turns 1 --output-format text` |
| Trigger | `claude -p "$prompt" --bare --json-schema "$schema" --output-format json` |
| Streaming grade | `claude -p "$prompt" --bare --max-turns 1 --output-format stream-json` |
| Optimize | `echo "$prompt" \| claude -p --bare --max-turns 1 --output-format text` |
| Force-skill | `claude -p "$prompt" --bare --print --append-system-prompt "$content"` |
| @-file in prompt | `claude -p "grade @fixtures/case-1.md against rubric" --bare` (CC 2.1.113 Remote Control autocomplete) |
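
To make the Trigger row concrete, here is a sketch of what `$schema` could hold. The field names `skill` and `confidence` and the function name `trigger_call` are illustrative assumptions; the schemas this skill actually uses live in `references/grading-schemas.md`.

```bash
# Hypothetical schema for trigger classification.
# The field names are illustrative only, not the skill's real schema.
schema='{"type":"object","properties":{"skill":{"type":"string"},"confidence":{"type":"number"}},"required":["skill"]}'

# Hypothetical wrapper around the Trigger pattern from the table above.
# $1 is the classification prompt.
trigger_call() {
  claude -p "$1" --bare --json-schema "$schema" --output-format json
}
```

A runner would then call something like `trigger_call "Which skill matches: $user_prompt"` and parse the JSON it prints.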
### `--output-format stream-json`
Newline-delimited JSON events (one per token/tool-call) — lets a runner score partial output or abort early on a failing probe without waiting for the full response.
```bash
claude -p "$prompt" --bare --max-turns 1 --output-format stream-json \
  | while IFS= read -r line; do
      # Each line is a single JSON event; keep only content_block_delta events.
      jq -r 'select(.type == "content_block_delta") | .delta.text' <<< "$line"
    done
```
Use `stream-json` over `json` when:
- grading long outputs and you want incremental scoring,
- piping into another CLI step-by-step (e.g. `ork:eval-runner`),
- you need per-token timing data alongside the content.
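
The early-abort case can be sketched as a small stdin filter. This is an assumption-laden example: `abort_on_marker` is a hypothetical name, and the marker string is whatever your grader emits on failure, not something the CLI defines.

```bash
# Hypothetical early-abort filter: reads newline-delimited events on stdin,
# echoes each one, and returns 1 as soon as a line contains the marker.
# The marker is matched as a literal substring, not as a pattern.
abort_on_marker() {
  marker="$1"
  while IFS= read -r line; do
    printf '%s\n' "$line"
    case "$line" in
      *"$marker"*) return 1 ;;
    esac
  done
  return 0
}
```

Piping a `stream-json` call into `abort_on_marker '"verdict":"fail"'` would stop reading at the first matching event. Once the reader exits, the upstream process typically receives SIGPIPE on its next write.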
## Invocation Patterns
Load detailed patterns and examples:
```
Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")
```
## Grading Schemas
JSON schemas for structured eval output:
```
Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")
```
## Pipeline Integration
OrchestKit's eval scripts (`npm run eval:skill`) auto-detect bare mode:
```bash
# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically
```
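
The detection described above might look roughly like the sketch below. It is not the actual `eval-common.sh`; the function names `detect_bare_mode` and `claude_flags` are hypothetical, and only the `BARE_MODE` variable comes from the text above.

```bash
# Sketch only: mirrors the behaviour described above, not the real eval-common.sh.
# Sets BARE_MODE=true when an API key is present, false otherwise.
detect_bare_mode() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    BARE_MODE=true
  else
    BARE_MODE=false
  fi
}

# Hypothetical helper: prints the flags for a non-plugin call.
claude_flags() {
  if [ "${BARE_MODE:-false}" = true ]; then
    echo "-p --bare"
  else
    echo "-p"
  fi
}
```

Keeping the decision in one place means plugin-dependent calls such as `run_with_skill` can simply skip the helper and never receive `--bare`.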
**Bare calls:** Trigger classification, force-skill, baseline, all grading.
**Never bare:** `run_with_skill` (needs plugin context for routing tests).
### CC 2.1.119: `--print` honors agent `tools:` / `disallowedTools:` (M122)
Before CC 2.1.119, `--print` mode ran with the full default tool set regardless of the agent's frontmatter `tools:` and `disallowedTools:`. Bare-eval grading was effectively ungated — graders could call any tool they wanted, even if the agent definition restricted them.
**As of 2.1.119, `--print` enforces the agent's declared tool surface.** Implications for eval design:
| Consequence | Action |
|---|---|
| Eval graders that relied on unrestricted tool access may now fail | Audit grader prompts for tools they actually need; whitelist explicitly via the agent's `tools:` frontmatter |
| Eval results match interactive runs | Reproducibility improves — grading what the model can actually do, not what it could do in an unsandboxed `--print` |
| `--agent <name>` also honors `permissionMode` in `--print` | Permission-gated tools (Bash, Edit) require either `permissionMode: acceptEdits` or explicit allowlists in the agent definition |
Migration test:
```bash
# Run an eval against an agent with a deliberately tight tools: list.
# Graders that previously called Read/Bash freely will now fail unless those
# tools are declared on the agent.
claude -p "$prompt" --bare --print --agent grader-test
```
If the grader fails with a "tool not permitted" error, add the required tool to the agent's `tools:` frontmatter and re-run.
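
Before re-running, it can help to confirm what the agent actually declares. The sketch below assumes the agent definition is a Markdown file with YAML frontmatter between `---` lines and a single-line, comma-separated `tools:` entry; `agent_declares_tool` is a hypothetical helper, and the file path in the usage note is an assumption.

```bash
# Hypothetical helper: succeeds if the frontmatter `tools:` line of the agent
# file ($1) lists the tool ($2). Only the first frontmatter block is inspected,
# and a single-line comma-separated list is assumed.
agent_declares_tool() {
  agent_file="$1"
  tool="$2"
  awk '/^---[[:space:]]*$/ { fences++; next } fences == 1 && /^tools:/ { print }' "$agent_file" \
    | sed 's/^tools:[[:space:]]*//' \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fxq -- "$tool"
}
```

For example, `agent_declares_tool .claude/agents/grader-test.md Read || echo "Read is not declared"` would flag a grader that needs `Read` but has not listed it.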
### CC 2.1.121: `CLAUDE_CODE_FORK_SUBAGENT=1` for grader determinism (#1545)
Before CC 2.1.121, the env var only worked in interactive sessions. As of 2.1.121, **non-interactive paths (`claude -p`, SDK) honor it too** — each grader invocation gets a fresh forked subagent context.
**The cross-eval state-leak problem this fixes:**
Without forking, sequential `claude -p --bare` graders inherit harness state:
| Inherited | Symptom |
|---|---|
| memory MCP query cache | grader sees stale hit from previous run; same fixture grades differently |
| `.claude/chain/*.json` on disk | grader for "implement" thinks "explore" already ran (file is from previous test) |
| ToolSearch deferred-tool cache | first grader's MCP loads bleed into next grader's tool registry |
| model picker pref | grader N inherits `--model=opus` from grader N-1 |
This produced a ~5–10% retry rate and non-reproducible scores: the eval baseline drifted between runs, and engineers chased phantom regressions.
**Fix:** `tests/evals/scripts/lib/eval-common.sh` exports `CLAUDE_CODE_FORK_SUBAGENT=1`, so every script that sources it (run-trigger-eval, run-quality-eval, run-agent-eval, optimize-description, etc.) gets forked graders automatically. The CI workflow `.github/workflows/orchestkit-eval.yml` also sets it at the workflow level. Older CC silently ignores the env var (no-op).
**Determinism contract:** running the same grader on the same fixture twice in a row produces the **same score**. Verified by `tests/evals/scripts/test-grader-determinism.sh`.
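
The contract can be spot-checked by hand with a sketch like the one below. It is not the contents of `test-grader-determinism.sh`; `same_output_twice` is a hypothetical helper, and comparing raw stdout assumes the grader prints only its score.

```bash
# Hypothetical check: runs the given command twice and compares its stdout.
# Returns 0 when both runs match, 1 when they differ, 2 when a run fails.
same_output_twice() {
  first="$("$@")" || return 2
  second="$("$@")" || return 2
  [ "$first" = "$second" ]
}
```

With `export CLAUDE_CODE_FORK_SUBAGENT=1` set beforehand, a call such as `same_output_twice claude -p "$prompt" --bare --max-turns 1 --output-format text` should succeed if the grader behaves as the contract describes.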
## Performance
| Scenario | Without --bare | With --bare | Savings |
|----------|---------------|-------------|---------|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |