agent-wiki-tasks
Discover task families across summaries and write per-family comparison pages with findings narrative. Updates wiki-twobatch/_config.yaml task definitions and writes tasks/<slug>__task.md.
git clone --depth 1 https://github.com/AgentToolkit/altk-evolve /tmp/agent-wiki-tasks && cp -r /tmp/agent-wiki-tasks/explorations/agent-wiki/skills/agent-wiki-tasks ~/.claude/skills/agent-wiki-tasks
# Agent Wiki — Task Comparisons
## Overview
Two cognitive moves in one pass:
1. **Discover** — read across all summaries and identify task families
(groups of sessions that attempted the same thing across trials and
conditions).
2. **Compare** — for each family, write a `tasks/<slug>__task.md` page with a
per-trial table and a findings narrative that calls out the
experimental signal.
This is the cross-trajectory **analysis** pass of the `agent-wiki` family.
## When to run
- After enough summaries exist that a comparative pattern is visible
(typically ≥3 sessions per family).
- When the experiment design (e.g. trial × condition matrices) explicitly
cries out for a comparison page.
## Workflow
### Step 1: Read the corpus
```bash
uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py dump-summaries > /tmp/summaries.json
```
Output is a JSON array of one row per summary: `{session_id, goal, family,
trial, condition, tool_calls, errors, recall_used, summary_filename}`.
`family`, `trial`, `condition` come from existing classification rules —
they may be null if no rule has matched yet.
Read the file:
```
Read /tmp/summaries.json
```
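Before deciding families, it can help to see how the dump is already grouped. The following is a minimal sketch, assuming the row shape described above; the function names `load_rows` and `group_by_family` are hypothetical and not part of `build_agent_wiki.py`, and the sample rows are invented for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_rows(path="/tmp/summaries.json"):
    """Load the JSON array written by the dump-summaries step."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def group_by_family(rows):
    """Map each family label to its session ids.

    Rows whose `family` is null (no rule matched yet) are collected
    under the key None so they are easy to review.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row.get("family")].append(row["session_id"])
    return dict(groups)


# Invented rows for illustration only; real rows come from load_rows().
sample = [
    {"session_id": "s1", "family": "focal-length"},
    {"session_id": "s2", "family": "focal-length"},
    {"session_id": "s3", "family": None},
]
groups = group_by_family(sample)
unclassified = groups.get(None, [])
```

The sessions listed under `None` are the ones that need either a new family-match rule or an override.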
### Step 2: Decide task families
For each candidate task family:
- **Slug**: kebab-case identifier (e.g. `extract-focal-length`).
- **Family**: short label used to group sessions (often equals the slug, but
can be looser, e.g. `focal-length` for the slug `extract-focal-length`).
- **Family-match rules**: how a future session gets classified. Currently
supported: `goal_substring: [list of substrings]`. A session matches
the family if its `goal` contains any substring (case-insensitive).
- **Tags**: a few short tags.
- **Intro**: 1–2 sentences setting up the question.
- **Findings**: 2–5 bullets summarizing what the data shows. **This is
the actual product** — a comparison page without findings is just a
table.
Rules:
1. **A family needs ≥3 sessions.** Smaller groups should not get their own page.
2. **Findings must be evidence-grounded.** Cite tool-call counts, error counts, recall-used Y/N from the dump.
3. **Don't repeat what's in the table.** Findings should explain *why* the metrics differ, not restate them.
4. **Use overrides** for sessions whose `goal` doesn't auto-match. The override key in `_config.yaml/session_family_overrides` is the session id.
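The `goal_substring` rule and the three-session minimum can be checked against the dump before writing any page. This is a sketch under stated assumptions: `matches_family` and `eligible_families` are hypothetical names, and how the real helper handles a session that matches several families is not specified here, so the sketch simply lists it under each.

```python
def matches_family(goal, substrings):
    """True if `goal` contains any of the substrings, ignoring case."""
    goal_lower = (goal or "").lower()
    return any(s.lower() in goal_lower for s in substrings)


def eligible_families(rows, rules, min_sessions=3):
    """Return {family: [session_id, ...]} for families with enough matches.

    `rules` maps a family label to its goal_substring list.
    Assumption: a session matching several families is listed under each.
    """
    matched = {family: [] for family in rules}
    for row in rows:
        for family, substrings in rules.items():
            if matches_family(row.get("goal"), substrings):
                matched[family].append(row["session_id"])
    return {f: ids for f, ids in matched.items() if len(ids) >= min_sessions}
```

Families that fall out of the result have fewer than three matching sessions and should not get their own page unless overrides bring them up to the threshold.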
### Step 3: For each family, output JSON
```json
{
"slug": "extract-focal-length",
"title": "Extract focal length from JPEG EXIF",
"family": "focal-length",
"family_match": {
"goal_substring": ["focal length"]
},
"intro": "Question template: *what focal length was used to take @sample.jpg?* FocalLength (tag 0x920A) and FocalLengthIn35mmFilm (tag 0xA405) live in the Exif sub-IFD.",
"findings": "**Net signal:** the gap between IFD0/GPS-only scripts and the Exif sub-IFD is the dominant cost. Sessions whose recall pointed at a script that already covered the sub-IFD finished in 2-3 tool calls; sessions that had to write an inline parser took 5+.",
"tags": ["exif", "focal-length", "comparison"]
}
```
Pipe to:
```bash
echo '<json>' | uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py render-task
```
The helper:
- Updates the `_config.yaml/tasks.<slug>` entry.
- Reads classified sessions; selects those matching `family`.
- Writes `tasks/<slug>__task.md` with the per-trial table + findings.
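Findings text often contains apostrophes or quotes that break an `echo '<json>'` one-liner. One way around that is to build the payload in Python and pass it on stdin. This is a sketch, not the helper's documented interface: `build_task_payload` and `render_task` are hypothetical names, and treating all seven keys from the example above as required is an assumption.

```python
import json
import subprocess

SCRIPT = "explorations/agent-wiki/skills/scripts/build_agent_wiki.py"

# Assumption: every key shown in the example payload is required.
REQUIRED_KEYS = ("slug", "title", "family", "family_match", "intro", "findings", "tags")


def build_task_payload(**fields):
    """Serialize one task family to JSON, failing early on missing keys."""
    missing = [key for key in REQUIRED_KEYS if key not in fields]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")
    return json.dumps(fields, ensure_ascii=False)


def render_task(payload_json):
    """Pipe the payload to the render-task subcommand on stdin."""
    subprocess.run(
        ["uv", "run", "python", SCRIPT, "render-task"],
        input=payload_json,
        text=True,
        check=True,
    )


payload = build_task_payload(
    slug="extract-focal-length",
    title="Extract focal length from JPEG EXIF",
    family="focal-length",
    family_match={"goal_substring": ["focal length"]},
    intro="Question template: *what focal length was used to take @sample.jpg?*",
    findings="**Net signal:** sessions that couldn't reuse a script took 5+ tool calls.",
    tags=["exif", "focal-length", "comparison"],
)
# render_task(payload)  # run from the repository root
```

Passing the JSON through `input=` avoids shell quoting entirely, so the findings narrative can be written as ordinary prose.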
### Step 4: Add overrides if needed
If a session that *should* be in a family didn't classify automatically,
patch `_config.yaml`:
```bash
echo '{"session_family_overrides": {"<session-id>": {"family": "image-dims", "trial": 0, "condition": "claude_md_strong"}}}' \
| uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py update-config
```
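When several sessions need overrides, the payload can be assembled in one go rather than hand-written per session. A minimal sketch, assuming `update-config` accepts multiple session ids under `session_family_overrides` in a single payload (only the single-session form is shown above); `build_overrides` is a hypothetical name.

```python
import json


def build_overrides(entries):
    """Build an update-config payload from (session_id, family, trial, condition) tuples."""
    overrides = {
        session_id: {"family": family, "trial": trial, "condition": condition}
        for session_id, family, trial, condition in entries
    }
    return json.dumps({"session_family_overrides": overrides})


# Placeholder session id; substitute the real id from the dump.
override_json = build_overrides([
    ("<session-id>", "image-dims", 0, "claude_md_strong"),
])
```

The resulting string can be piped to the `update-config` subcommand in the same way as the shell example.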
### Step 5: Subtask pass — mandatory before refresh
Before refreshing indexes, scan the corpus for **subtask candidates**. The
default reflex of "the dataset is uniform, no subtasks needed" is wrong
for almost every dataset; even a 30-session benchmark of short workflows
typically has 4-6 subtask-worthy sessions. See "## Subtasks" below for
the heuristics + JSON contract + a worked example.
The minimum viable subtask layer for a condition × trial dataset: one
subtask per condition, anchored in the session that best demonstrates
that condition's distinctive behavior. Don't write 5 redundant subtasks
when 1 representative captures the pattern.
### Step 6: Refresh indexes
```bash
uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py catalog
```
This re-reads `_config.yaml`, re-classifies every summary, regenerates
each `tasks/<slug>__task.md`, scans `tasks/<slug>__subtask.md` files,
and regenerates `tasks/index.md` and the root `index.md`.
## Subtasks: per-session workstream pages
The `tasks/` directory holds *two* kinds of pages distinguished by filename
suffix:
- **`<slug>__task.md`** — cross-session task-comparisons (the workflow above).
- **`<slug>__subtask.md`** — narrative slices of a *single* session.
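The suffix convention is simple enough to check mechanically, for example when auditing which slugs already have pages. This is an illustrative sketch; `parse_page` is a hypothetical helper, not a function of `build_agent_wiki.py`.

```python
from pathlib import Path

# Filename suffixes that distinguish the two page kinds in tasks/.
SUFFIXES = {"__task.md": "task", "__subtask.md": "subtask"}


def parse_page(filename):
    """Return (slug, kind) for a tasks/ page, or None for other files."""
    name = Path(filename).name
    for suffix, kind in SUFFIXES.items():
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], kind
    return None
```

Files such as `tasks/index.md` return `None`, so they can be skipped when listing pages by kind.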
Once the task pages from Steps 1-4 exist, run a **second pass** (Step 5) to scan for subtask candidates.
Don't skip this just because the dataset is uniform — a 30-session benchmark
of short workflows still has 4-6 subtask-worthy sessions. The default
"there are no subtasks worth writing" reflex is wrong for almost every
dataset.
### When to propose a subtask
Treat each session in the corpus as a potential subtask candidate.
**Promote** to a subtask page when at least one of these is true:
1. **Exemplar of a condition or arc.** When the corpus has experimental
conditions (`no_recall` / `guidelines` / `skill`, or arc-1 / arc-2),
pick the session that best demonstrates *that condition's* distinctive
behavior — its representative-best, representative-worst, or
representative-failure trace — and write a subtask. Aim for one subtask
per condition × dataset, not one per session.
2. **Multi-iteration debug arc.** A session where the agent retried 3+
times against the s