ClaudeWave
Skill · 92 repo stars · updated today

agent-wiki-tasks

Discover task families across summaries and write per-family comparison pages with findings narrative. Updates wiki-twobatch/_config.yaml task definitions and writes tasks/<slug>__task.md.

Install in Claude Code
git clone --depth 1 https://github.com/AgentToolkit/altk-evolve /tmp/agent-wiki-tasks && cp -r /tmp/agent-wiki-tasks/explorations/agent-wiki/skills/agent-wiki-tasks ~/.claude/skills/agent-wiki-tasks
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Agent Wiki — Task Comparisons

## Overview

Two cognitive moves in one pass:

1. **Discover** — read across all summaries and identify task families
   (groups of sessions that attempted the same thing across trials and
   conditions).
2. **Compare** — for each family, write a `tasks/<slug>__task.md` page with a
   per-trial table and a findings narrative that calls out the
   experimental signal.

This is the cross-trajectory **analysis** pass of the `agent-wiki` family.

## When to run

- After enough summaries exist that a comparative pattern is visible
  (typically ≥3 sessions per family).
- When the experiment design (e.g. a trial × condition matrix) explicitly
  calls for a comparison page.

## Workflow

### Step 1: Read the corpus

```bash
uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py dump-summaries > /tmp/summaries.json
```

The output is a JSON array with one object per summary: `{session_id, goal, family,
trial, condition, tool_calls, errors, recall_used, summary_filename}`.
`family`, `trial`, and `condition` come from existing classification rules, so
they may be null if no rule has matched yet.

Read the file:

```
Read /tmp/summaries.json
```
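As a rough sketch of a first pass over the dump, the rows can be grouped by `family` to see which families already reach the ≥3-session threshold and which sessions are still unclassified. The function names and the sample rows are illustrative assumptions, not part of `build_agent_wiki.py`; only the field names come from the dump format described above.

```python
import json
from collections import defaultdict

MIN_SESSIONS = 3  # threshold taken from the "When to run" guidance


def load_rows(path="/tmp/summaries.json"):
    """Load the JSON array written by the dump-summaries command."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def group_by_family(rows):
    """Map each non-null family label to the list of its session ids."""
    families = defaultdict(list)
    for row in rows:
        family = row.get("family")
        if family is not None:  # null means no classification rule matched yet
            families[family].append(row["session_id"])
    return dict(families)


# Illustrative rows only; real rows would come from load_rows().
sample_rows = [
    {"session_id": "s1", "goal": "what focal length was used?", "family": "focal-length"},
    {"session_id": "s2", "goal": "find the focal length", "family": "focal-length"},
    {"session_id": "s3", "goal": "focal length of sample.jpg", "family": "focal-length"},
    {"session_id": "s4", "goal": "report the image dimensions", "family": None},
]

grouped = group_by_family(sample_rows)
ready = {name: ids for name, ids in grouped.items() if len(ids) >= MIN_SESSIONS}
unclassified = [row["session_id"] for row in sample_rows if row.get("family") is None]
```

Sessions that land in `unclassified` are the ones to read by hand when looking for new families or override candidates.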

### Step 2: Decide task families

For each candidate task family:

- **Slug**: kebab-case identifier (e.g. `extract-focal-length`).
- **Family**: short label used to group sessions (often the same as the slug,
  but can be looser, e.g. `focal-length` for the slug `extract-focal-length`).
- **Family-match rules**: how a future session gets classified. Currently
  supported: `goal_substring: [list of substrings]`. A session matches
  the family if its `goal` contains any substring (case-insensitive).
- **Tags**: a few short tags.
- **Intro**: 1–2 sentences setting up the question.
- **Findings**: 2–5 bullets summarizing what the data shows. **This is
  the actual product** — a comparison page without findings is just a
  table.
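The `goal_substring` rule described in the list above can be pictured as a case-insensitive containment check. This is a minimal sketch of that rule as stated, not the helper's actual code; `matches_family` is a hypothetical name, and the real implementation may normalize text differently.

```python
def matches_family(goal, substrings):
    """Return True if `goal` contains any of `substrings`, ignoring case."""
    goal_lower = goal.lower()
    return any(sub.lower() in goal_lower for sub in substrings)


# Rule shape taken from the family_match example in this document.
rule = {"goal_substring": ["focal length"]}

hit = matches_family("What Focal Length was used to take @sample.jpg?", rule["goal_substring"])
miss = matches_family("Report the image dimensions", rule["goal_substring"])
```

Short substrings match broadly, so it is worth checking a candidate substring against every `goal` in the dump before committing it to `_config.yaml`.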

Rules:

1. **A family needs ≥3 sessions.** Smaller groups should not get their own page.
2. **Findings must be evidence-grounded.** Cite tool-call counts, error counts, recall-used Y/N from the dump.
3. **Don't repeat what's in the table.** Findings should explain *why* the metrics differ, not restate them.
4. **Use overrides** for sessions whose `goal` doesn't auto-match. Overrides live under `session_family_overrides` in `_config.yaml`, keyed by session id.
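To ground findings in the dump (rule 2), it can help to aggregate the cited metrics per condition before writing any narrative. The sketch below assumes `tool_calls` and `errors` are integers and `recall_used` is a boolean; the dump's actual types are not specified here, and `evidence_by_condition` and the sample rows are hypothetical.

```python
from collections import defaultdict


def evidence_by_condition(rows):
    """Summarize tool calls, errors, and recall usage for each condition."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["condition"]].append(row)

    report = {}
    for condition, members in buckets.items():
        report[condition] = {
            "sessions": len(members),
            "mean_tool_calls": sum(r["tool_calls"] for r in members) / len(members),
            "total_errors": sum(r["errors"] for r in members),
            "recall_used": sum(1 for r in members if r["recall_used"]),
        }
    return report


# Illustrative rows for one family; not real experimental data.
family_rows = [
    {"condition": "no_recall", "tool_calls": 5, "errors": 1, "recall_used": False},
    {"condition": "no_recall", "tool_calls": 7, "errors": 2, "recall_used": False},
    {"condition": "skill", "tool_calls": 2, "errors": 0, "recall_used": True},
    {"condition": "skill", "tool_calls": 3, "errors": 0, "recall_used": True},
]

report = evidence_by_condition(family_rows)
```

The numbers only say that conditions differ; per rule 3, the findings still have to explain why.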

### Step 3: For each family, output JSON

```json
{
  "slug": "extract-focal-length",
  "title": "Extract focal length from JPEG EXIF",
  "family": "focal-length",
  "family_match": {
    "goal_substring": ["focal length"]
  },
  "intro": "Question template: *what focal length was used to take @sample.jpg?* FocalLength (tag 0x920A) and FocalLengthIn35mmFilm (tag 0xA405) live in the Exif sub-IFD.",
  "findings": "**Net signal:** the gap between IFD0/GPS-only scripts and the Exif sub-IFD is the dominant cost. Sessions whose recall pointed at a script that already covered the sub-IFD finished in 2-3 tool calls; sessions that had to write an inline parser took 5+.",
  "tags": ["exif", "focal-length", "comparison"]
}
```

Pipe to:

```bash
echo '<json>' | uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py render-task
```
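Findings text often contains quotes and asterisks that are awkward to escape inside `echo '<json>'`. One alternative, sketched here, is to build the payload in Python and pass it on stdin. The key list mirrors the example above; whether the helper requires every key is an assumption, and `encode_task` and `render_task` are hypothetical wrappers around the same `render-task` command.

```python
import json
import subprocess

HELPER = "explorations/agent-wiki/skills/scripts/build_agent_wiki.py"
# Keys seen in the example payload; treating them all as required is an assumption.
EXPECTED_KEYS = ("slug", "title", "family", "family_match", "intro", "findings", "tags")


def encode_task(task):
    """Check that the expected keys are present and return the JSON text."""
    missing = [key for key in EXPECTED_KEYS if key not in task]
    if missing:
        raise ValueError(f"task definition is missing keys: {missing}")
    return json.dumps(task)


def render_task(task):
    """Pipe the encoded task to the render-task subcommand on stdin."""
    return subprocess.run(
        ["uv", "run", "python", HELPER, "render-task"],
        input=encode_task(task),
        text=True,
        check=True,  # raise if the helper exits with a non-zero status
    )


task = {
    "slug": "extract-focal-length",
    "title": "Extract focal length from JPEG EXIF",
    "family": "focal-length",
    "family_match": {"goal_substring": ["focal length"]},
    "intro": "Question template: *what focal length was used to take @sample.jpg?*",
    "findings": "**Net signal:** sessions with a usable recalled script finished in fewer tool calls.",
    "tags": ["exif", "focal-length", "comparison"],
}
encoded = encode_task(task)
```

Calling `render_task(task)` from the repository root would then stand in for the `echo … | uv run …` line.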

The helper:

- Updates the `tasks.<slug>` entry in `_config.yaml`.
- Reads classified sessions; selects those matching `family`.
- Writes `tasks/<slug>__task.md` with the per-trial table + findings.

### Step 4: Add overrides if needed

If a session that *should* be in a family didn't classify automatically,
patch `_config.yaml`:

```bash
echo '{"session_family_overrides": {"<session-id>": {"family": "image-dims", "trial": 0, "condition": "claude_md_strong"}}}' \
  | uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py update-config
```

### Step 5: Subtask pass — mandatory before refresh

Before refreshing indexes, scan the corpus for **subtask candidates**. The
default reflex of "the dataset is uniform, no subtasks needed" is wrong
for almost every dataset; even a 30-session benchmark of short workflows
typically has 4–6 subtask-worthy sessions. See "Subtasks: per-session
workstream pages" below for the heuristics, the JSON contract, and a worked
example.

The minimum viable subtask layer for a condition × trial dataset: one
subtask per condition, anchored in the session that best demonstrates
that condition's distinctive behavior. Don't write 5 redundant subtasks
when 1 representative captures the pattern.
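One way to shortlist a representative per condition is to rank sessions by a simple proxy and then read the top candidate by hand. The proxy used here (most errors, ties broken by most tool calls) is purely an assumption for illustration; which session "best demonstrates" a condition remains a judgment call, and `candidate_per_condition` is a hypothetical name.

```python
def candidate_per_condition(rows):
    """Pick one candidate session id per condition using a crude proxy."""
    best = {}
    for row in rows:
        condition = row.get("condition")
        if condition is None:  # unclassified sessions cannot anchor a condition
            continue
        key = (row["errors"], row["tool_calls"])
        current = best.get(condition)
        if current is None or key > (current["errors"], current["tool_calls"]):
            best[condition] = row
    return {condition: row["session_id"] for condition, row in best.items()}


# Illustrative rows only.
rows = [
    {"session_id": "n1", "condition": "no_recall", "tool_calls": 5, "errors": 0},
    {"session_id": "n2", "condition": "no_recall", "tool_calls": 9, "errors": 2},
    {"session_id": "k1", "condition": "skill", "tool_calls": 2, "errors": 0},
    {"session_id": "k2", "condition": "skill", "tool_calls": 3, "errors": 0},
    {"session_id": "u1", "condition": None, "tool_calls": 4, "errors": 1},
]

candidates = candidate_per_condition(rows)
```

A different proxy (fewest tool calls, for a representative-best trace) would need only a different `key`.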

### Step 6: Refresh indexes

```bash
uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py catalog
```

This re-reads `_config.yaml`, re-classifies every summary, regenerates
each `tasks/<slug>__task.md`, scans `tasks/<slug>__subtask.md` files,
and regenerates `tasks/index.md` and the root `index.md`.

## Subtasks: per-session workstream pages

The `tasks/` directory holds *two* kinds of pages distinguished by filename
suffix:

- **`<slug>__task.md`** — cross-session task-comparisons (the workflow above).
- **`<slug>__subtask.md`** — narrative slices of a *single* session.
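The suffix convention can be checked mechanically. The sketch below splits a filename into its slug and page kind; `page_kind` is a hypothetical helper, `retry-arc` is a made-up slug, and the catalog command's own scanning logic may differ.

```python
from pathlib import PurePosixPath

# Filename suffixes (before ".md") and the page kind each one marks.
SUFFIXES = {"__task": "task", "__subtask": "subtask"}


def page_kind(path):
    """Return (slug, kind) for a tasks/ page, or None for other files."""
    stem = PurePosixPath(path).stem  # filename without directory or ".md"
    for suffix, kind in SUFFIXES.items():
        if stem.endswith(suffix):
            return stem[: -len(suffix)], kind
    return None


task_page = page_kind("tasks/extract-focal-length__task.md")
subtask_page = page_kind("tasks/retry-arc__subtask.md")
index_page = page_kind("tasks/index.md")
```

Because `__subtask` does not end in `__task` (the double underscore must directly precede `task`), the two suffixes never collide.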

This is the Step 5 pass: after Step 4 and before the Step 6 refresh, scan the
corpus a second time for subtask candidates. Don't skip it just because the
dataset looks uniform; a 30-session benchmark of short workflows still
typically has 4–6 subtask-worthy sessions, and the default "there are no
subtasks worth writing" reflex is wrong for almost every dataset.

### When to propose a subtask

Treat each session in the corpus as a potential subtask candidate.
**Promote** to a subtask page when at least one of these is true:

1. **Exemplar of a condition or arc.** When the corpus has experimental
   conditions (`no_recall` / `guidelines` / `skill`, or arc-1 / arc-2),
   pick the session that best demonstrates *that condition's* distinctive
   behavior — its representative-best, representative-worst, or
   representative-failure trace — and write a subtask. Aim for one subtask
   per condition × dataset, not one per session.
2. **Multi-iteration debug arc.** A session where the agent retried 3+
   times against the s

agent-wiki-consolidate-guidelines · Skill

Read all atomic guidelines in wiki-twobatch/guidelines/ and propose themed clusters that group near-duplicates. Writes cluster pages and updates _config.yaml; originals are preserved with a `superseded_by:` backref.

agent-wiki-consult · Skill

Consult an agent-wiki for guidelines relevant to the task at hand. The wiki itself documents how to retrieve from it (AGENTS.md). Use this skill once you know what task or sub-task you're about to do, not at session start.

agent-wiki-extract-guidelines · Skill

Read a normalized Claude Code trajectory JSON and extract reusable guidelines into wiki-twobatch/guidelines/. Use when mining saved trajectories for reusable lessons.

agent-wiki-ingest · Skill

Ingest one or more agent trajectories (raw bob/claude traces or normalized JSON) into an agent-wiki end-to-end: convert, summarize, extract guidelines, synthesize skills, consolidate into clusters, and catalog. Use when you have a batch of traces to turn into a wiki in one pass.

agent-wiki-summarize · Skill

Read a normalized Claude Code trajectory JSON and write an episodic summary page to wiki-twobatch/summaries/. Use when summarizing one or more saved trajectories into the agent wiki.

agent-wiki-synthesize-skill · Skill

Read a normalized Claude Code trajectory JSON and produce a wiki-resident SKILL.md page that future agents can invoke. Use when a trajectory captured a non-trivial successful workflow worth promoting from a free-text guideline to an executable, callable artifact.

evolve-lite:learn · Skill

Must be used near the end of any non-trivial turn that produced potentially reusable tools, guidance, errors, workarounds, or workflows, so those lessons are saved for future turns.

evolve-lite:provenance · Skill

Analyze saved trajectories and recall audit events offline to record whether recalled guidelines influenced completed sessions.