Skill134 repo starsupdated yesterday

optimizing-skills

optimizing-skills is a validation-gated workflow for revising existing skills by measuring performance changes against a fixed held-out check set before shipping edits. Use it when a skill exhibits observed failures, drift, or complaints and you need evidence that a proposed change actually improves behavior on the triggering criterion without regressing other dimensions. The discipline enforces scoring both the current version and candidate version on representative tasks, accepting edits only when the candidate strictly beats the baseline.

View source Repository: claude-skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/oaustegard/claude-skills /tmp/optimizing-skills && cp -r /tmp/optimizing-skills/optimizing-skills ~/.claude/skills/optimizing-skills

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Optimizing Skills

Treat the skill document as the parameter under optimization: change it only
when the change **demonstrably beats the version you already ship**. This is
the discipline distilled from SkillOpt (microsoft/SkillOpt, arXiv:2605.23904) —
its training apparatus dropped, its reproducibility discipline kept. The point
is to stop editing skills on intuition and start editing them on evidence.

## Core principle

A skill edit is only worth shipping if it strictly improves measured behavior
on a held-out check. Most edits that *feel* like improvements don't move the
needle, and some quietly regress. The gate below is what separates a real
improvement from a confident guess.

## The gate — run it every revision

1. **Assemble a held-out check set.** 3–8 representative tasks/prompts the skill
should handle well, and it **must include the failure(s) that prompted this
revision**. Keep the set fixed across the revision so before/after scores are
comparable.
2. **Hold two versions.** `best` = what you currently ship (never let it
silently degrade). `candidate` = `best` + your proposed edits.
3. **Score both on the check set.** "Run" here = dispatch each check task to the
**Agent tool** (`subagent_type=general-purpose`) with the skill version in
context, or evaluate by hand for small sets. Score **per criterion**, not one
collapsed pass/fail. When a task carries several criteria, the criterion that
decides accept/reject is **the failure that prompted this revision**; the
others are regression guards that must not get worse. Collapsing criteria
masks the win: in the down-skilling-v1.2.0 retro, the edit drove architectural
hallucination 60%→0% while an unrelated length criterion stayed 0/5 in both
arms — a single combined pass/fail scored that as a 0–0 tie and would have
rejected a large, real improvement.
4. **Accept only if `candidate` strictly beats `best`** on the triggering-failure
criterion, with no regression guard worse. Ties → reject, keep `best`. An edit
that doesn't move the needle does not ship.

**When the skill's own output is compiled by an Agent** (down-skilling and
creating-skill produce a prompt an author writes from the SKILL), score **≥2
author samples per version**, or fix one author across both arms. A single
author sample per arm lets author capability dominate the edit effect: the same
down-skilling edit measured 95%→0% with one author pair and 60%→0% with another
— real either way, but n=1 cannot tell a real edit from a lucky author.

This two-tier `best`/`candidate` split is the heart of it: a working revision
can explore, but the shipped skill only ever ratchets upward.

## Bounded edits (the "textual learning rate")

Cap edits per revision — **default ~4** distinct add/replace/delete operations,
fewer as the skill matures. Large speculative rewrites drift and destroy your
ability to attribute a regression to a cause. If a revision wants more edits
than the budget, rank and keep the top ones (below) and let the rest wait.

## Reflect: failures first, then successes

Separate the evidence before proposing edits:

- **Failure reflection.** Across the failing cases, find the *common,
systematic* pattern — not a one-off edge case. Propose edits that fix the
pattern. **Failures take priority** in any merge.
- **Success reflection.** Across cases that already work, find generalizable
patterns worth encoding so they survive future edits. Reinforce; don't
duplicate.

For both: edits must **generalize** (never hardcode task-specific values), and
must **not duplicate** content already in the skill — patch genuine gaps only.

## Rank when over budget

When candidate edits exceed the budget, keep them in this priority order:

1. **Systematic impact** — fixes a recurring failure across many cases, not one.
2. **Complementarity** — fills a real gap rather than restating existing content.
3. **Generality** — phrased as a durable principle, not tied to one task/entity.
4. **Actionability** — concrete, followable guidance over vague advice.

Drop the rest. They can return next revision if still warranted.

## Protect the hard-won core

If a skill has a battle-tested core that routine edits keep eroding, fence it
off and treat it as off-limits to fast edits. Revisit it only on a **deliberate
longitudinal review**: compare the *same* check tasks across several versions to
catch slow drift and regressions that single-edit review misses. (SkillOpt
fences this region with HTML-comment markers and only rewrites it at epoch
boundaries — the same idea, manual cadence.)

## Carry memory across revisions

After a revision, record what you learned about editing **this** skill — which
kinds of edits helped, which were brittle, redundant, or harmful — via
`remember()` tagged with the skill name. Before the next revision, `recall()`
it. This is the compounding part: each revision starts smarter than the last,
the way SkillOpt's optimizer-side meta-skill conditions its future edits.

## Edit mechanics

Edits are literal string operations (the `Edit` tool): the target text must
match **exactly** or the edit is a silent no-op. Keep targets unique and
verbatim. Prefer append / insert-after-heading / replace-exact / delete-exact,
and verify each edit landed before scoring.

## When NOT to use this

- Authoring a brand-new skill from scratch → **creating-skill**.
- Tracking/rolling back versions during development → **versioning-skills**.
- One-off prose with no reuse → just write it.

## Checklist

- [ ] Held-out check set assembled (includes the triggering failure), fixed for the revision
- [ ] Edits bounded (~4 max), each generalizable and non-duplicative
- [ ] Failure patterns addressed before success reinforcement
- [ ] Candidate scored per-criterion against `best`; accept decided by the triggering-failure criterion, others as regression guards; shipped only if strictly better
- [ ] For Agent-compiled artifacts (down-skilling, creating-