agent-runtime-cache-benchmark
This Claude Code skill compares two structured agent run artifacts to analyze prompt-cache effectiveness, identifying whether changes in prompt layout, tool manifest, or history preserve cache reuse. Use it when investigating why cached tokens increased or decreased between runs, evaluating whether a workflow structure supports prompt caching before deploying automation, or diagnosing cache breaks by examining stable hashes and latency metrics across cold and warm runs.
git clone --depth 1 https://github.com/Prompthon-IO/agent-systems-handbook /tmp/agent-runtime-cache-benchmark && cp -r /tmp/agent-runtime-cache-benchmark/skills/agent-runtime-cache-benchmark ~/.claude/skills/agent-runtime-cache-benchmark
# Agent Runtime Cache Benchmark
For a student-facing explanation of why this package exists and how the
end-to-end workflow fits into the handbook, read `README.md` first. This file
is the invocation contract for Codex.
## Overview
Use this skill to compare a colder first run with a warmer rerun of the same
workflow. The goal is not to guess provider internals. It is to inspect the run
artifacts you already have and explain whether your prompt spine stayed stable
enough to benefit from prompt caching.
This skill is local-first and report-first:
1. collect or normalize two structured run artifacts
2. compare latency, prompt tokens, cached tokens, and stable hashes
3. explain likely cache breaks
4. write a small Markdown report for operator review
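
The four steps above can be sketched in a few lines of Python. This is an illustration only, not the source of `scripts/cache_benchmark.py`: the names `load_run`, `cached_share`, and `compare_runs` and the output keys are hypothetical, and the field names are taken from the example artifact in this file.

```python
import json
from pathlib import Path

HASH_FIELDS = ("system_prompt_hash", "tool_manifest_hash", "history_hash", "prefix_hash")


def load_run(path):
    """Read one structured run artifact from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def cached_share(run):
    """Return cached_tokens / prompt_tokens, or None when either field is missing."""
    prompt, cached = run.get("prompt_tokens"), run.get("cached_tokens")
    if not prompt or cached is None:
        return None
    return cached / prompt


def compare_runs(cold, warm):
    """Hypothetical comparison: metric deltas plus the hash fields that changed."""
    changed = [
        field for field in HASH_FIELDS
        if field in cold and field in warm and cold[field] != warm[field]
    ]
    latency_delta = None
    if "latency_ms" in cold and "latency_ms" in warm:
        latency_delta = warm["latency_ms"] - cold["latency_ms"]
    return {
        "latency_delta_ms": latency_delta,
        "cold_cached_share": cached_share(cold),
        "warm_cached_share": cached_share(warm),
        "changed_hashes": changed,
    }
```

A negative `latency_delta_ms` means the warm run was faster. Fields absent from either artifact are skipped or reported as `None`, never estimated.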
## When To Use
Use this skill when the user asks for tasks such as:
- compare a cold run and a warm rerun for the same agent workflow
- explain why cached tokens dropped between two agent runs
- estimate whether a prompt or tool layout is cache-friendly before automation
- separate prompt-cache behavior from durable memory or retrieval behavior
Do not use this skill to claim provider-side savings that are not supported by
the input artifacts. If the run metadata does not expose token or cache fields,
say that clearly and limit the report to structural stability checks.
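
One way to honor this boundary is to decide the report scope from the fields that are actually present. The sketch below is hypothetical (`report_scope` is not part of the helper); it assumes token and cache data arrive as `prompt_tokens` and `cached_tokens`, the field names used in this file's example artifact.

```python
METRIC_FIELDS = ("prompt_tokens", "cached_tokens")


def report_scope(cold, warm):
    """Return ('metrics', []) when both runs expose token and cache fields,
    otherwise ('structural-only', [missing fields])."""
    missing = sorted(
        f"{label}.{field}"
        for label, run in (("cold", cold), ("warm", warm))
        for field in METRIC_FIELDS
        if run.get(field) is None
    )
    scope = "structural-only" if missing else "metrics"
    return scope, missing
```

When the scope is `structural-only`, the report can list the missing fields verbatim so the operator sees why no savings figure is given.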
## Expected Input Shape
The helper accepts two JSON files with small, explicit fields such as:
```json
{
  "label": "warm-rerun",
  "latency_ms": 2900,
  "prompt_tokens": 1840,
  "cached_tokens": 1536,
  "system_prompt_hash": "sys-v1",
  "tool_manifest_hash": "tools-v1",
  "history_hash": "history-v2",
  "notes": [
    "User-specific inputs were appended at the end."
  ]
}
```
Useful optional fields:
- `output_tokens`
- `prompt_cache_key`
- `prefix_hash`
- `notes`
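
The hash fields only need to be stable: identical content must always produce the identical value. This file does not specify how they are computed, so the following is one possible approach, assuming SHA-256 over canonical JSON. The name `stable_hash` and the 12-character truncation are illustrative choices, not the helper's behavior.

```python
import hashlib
import json


def stable_hash(value, length=12):
    """Hash any JSON-serializable value so equal content yields an equal digest.

    sort_keys and fixed separators make the serialization canonical, so dict
    key order and whitespace do not affect the result.
    """
    canonical = json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


tools_a = [{"name": "search", "description": "Web search"}]
tools_b = [{"description": "Web search", "name": "search"}]  # same content, different key order
assert stable_hash(tools_a) == stable_hash(tools_b)
```

Reordering the list itself still changes the hash. If your client serializes keys in insertion order, key order can also change the prompt bytes, in which case hashing the exact serialized payload is the safer choice.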
## Local State And Outputs
Keep runtime artifacts outside git. The helper does not create this directory
layout automatically unless you point `--output` there; treat it as a
recommended convention:
```text
~/.codex/state/agent-runtime-cache-benchmark/
  inputs/
  reports/
```
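
Because the helper does not create this layout for you, a small setup step can do it. A minimal sketch, assuming the recommended paths above; `ensure_state_dirs` is a hypothetical name, not a function shipped with the skill.

```python
from pathlib import Path


def ensure_state_dirs(base=None):
    """Create inputs/ and reports/ under the state directory and return their paths."""
    if base is None:
        base = Path.home() / ".codex" / "state" / "agent-runtime-cache-benchmark"
    base = Path(base)
    created = []
    for name in ("inputs", "reports"):
        target = base / name
        target.mkdir(parents=True, exist_ok=True)  # idempotent: safe to rerun
        created.append(target)
    return created
```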
## Commands
Run the helper relative to this skill directory.
Preview the CLI:
```bash
python3 scripts/cache_benchmark.py --help
```
Generate a Markdown report:
```bash
python3 scripts/cache_benchmark.py \
  --cold-run /path/to/cold-run.json \
  --warm-run /path/to/warm-run.json \
  --output /path/to/cache-benchmark.md
```
Emit JSON instead:
```bash
python3 scripts/cache_benchmark.py \
  --cold-run /path/to/cold-run.json \
  --warm-run /path/to/warm-run.json \
  --format json
```
## Interpretation Rules
- Treat changes to `system_prompt_hash`, `tool_manifest_hash`, or `prefix_hash`
as likely cache-break events.
- Treat changes to `history_hash` as a likely warm-path spoiler when the run is
expected to reuse a long prefix.
- Keep durable memory or retrieval changes separate from cache analysis unless
they directly altered the prompt prefix.
- Prefer appending variable user input at the end of the prompt spine when the
provider's cache rules reward exact prefix stability.
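
The last rule follows from how exact-prefix caching behaves: only the leading part of the prompt that is identical between runs can be reused. The sketch below illustrates this with plain word lists standing in for tokens. It is a toy model, not a provider's tokenizer or cache implementation, and `shared_prefix_len` is a hypothetical helper.

```python
def shared_prefix_len(a, b):
    """Count leading items that are identical in both sequences."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


stable = "system rules tool manifest shared history".split()

# Variable input appended at the end: the whole stable spine is shared.
run1 = stable + ["question-A"]
run2 = stable + ["question-B"]

# Variable input placed first: the prefix diverges at position 0.
run3 = ["question-A"] + stable
run4 = ["question-B"] + stable

print(shared_prefix_len(run1, run2))  # 6
print(shared_prefix_len(run3, run4))  # 0
```

The same content is present in all four prompts; only its position decides how much of it a prefix-based cache could reuse.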
## Safety Boundaries
- Do not require committing transcripts, logs, or API credentials.
- Keep reports local by default.
- Do not infer business-sensitive content from hashes or token counts alone.
- Do not blur prompt caching with durable memory design; they solve different
problems.
## Response Pattern
When reporting back, include:
- cold vs warm latency
- cold vs warm cached-token share
- the most likely cache-break fields
- whether the prompt spine appears stable enough for reuse
- what to move earlier or later in the prompt if the user wants better hits
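
The items above could be assembled into a short Markdown summary. This is a hypothetical formatter, not the report format produced by `scripts/cache_benchmark.py`: `render_summary` and its wording are assumptions, and a missing field is reported as unavailable instead of being estimated.

```python
def _share(run):
    """Cached-token share as a percentage string, or 'n/a' when fields are missing."""
    prompt, cached = run.get("prompt_tokens"), run.get("cached_tokens")
    if not prompt or cached is None:
        return "n/a"
    return f"{100 * cached / prompt:.1f}%"


def render_summary(cold, warm, changed_hashes):
    """Build Markdown bullets covering latency, cached share, and likely breaks."""
    def ms(run):
        return f"{run['latency_ms']} ms" if "latency_ms" in run else "n/a"

    breaks = ", ".join(f"`{name}`" for name in changed_hashes) or "none detected"
    verdict = "appears stable" if not changed_hashes else "changed between runs"
    return "\n".join([
        f"- Latency: cold {ms(cold)}, warm {ms(warm)}",
        f"- Cached-token share: cold {_share(cold)}, warm {_share(warm)}",
        f"- Likely cache-break fields: {breaks}",
        f"- Prompt spine: {verdict}",
    ])
```

The recommendation on what to move earlier or later is left out on purpose; it depends on reading the changed fields in context.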