Skip to main content
ClaudeWave
Skill125 estrellas del repoactualizado today

optimizing-skills

Disciplined, validation-gated revision of an EXISTING skill so each edit is a measured improvement rather than a guess. Use when editing, revising, or tuning a skill that already exists and there is evidence it underperforms (observed failures, drift, complaints) — invoke by name, or have versioning-skills / creating-skill defer to it before applying edits. Not for authoring a brand-new skill from scratch (use creating-skill) or one-off prose.

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/oaustegard/claude-skills /tmp/optimizing-skills && cp -r /tmp/optimizing-skills/optimizing-skills ~/.claude/skills/optimizing-skills
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# Optimizing Skills

Treat the skill document as the parameter under optimization: change it only
when the change **demonstrably beats the version you already ship**. This is
the discipline distilled from SkillOpt (microsoft/SkillOpt, arXiv:2605.23904) —
its training apparatus dropped, its reproducibility discipline kept. The point
is to stop editing skills on intuition and start editing them on evidence.

## Core principle

A skill edit is only worth shipping if it strictly improves measured behavior
on a held-out check. Most edits that *feel* like improvements don't move the
needle, and some quietly regress. The gate below is what separates a real
improvement from a confident guess.

## The gate — run it every revision

1. **Assemble a held-out check set.** 3–8 representative tasks/prompts the skill
   should handle well, and it **must include the failure(s) that prompted this
   revision**. Keep the set fixed across the revision so before/after scores are
   comparable.
2. **Hold two versions.** `best` = what you currently ship (never let it
   silently degrade). `candidate` = `best` + your proposed edits.
3. **Score both on the check set.** "Run" here = dispatch each check task to the
   **Agent tool** (`subagent_type=general-purpose`) with the skill version in
   context, or evaluate by hand for small sets. Score **per criterion**, not one
   collapsed pass/fail. When a task carries several criteria, the criterion that
   decides accept/reject is **the failure that prompted this revision**; the
   others are regression guards that must not get worse. Collapsing criteria
   masks the win: in the down-skilling-v1.2.0 retro, the edit drove architectural
   hallucination 60%→0% while an unrelated length criterion stayed 0/5 in both
   arms — a single combined pass/fail scored that as a 0–0 tie and would have
   rejected a large, real improvement.
4. **Accept only if `candidate` strictly beats `best`** on the triggering-failure
   criterion, with no regression guard worse. Ties → reject, keep `best`. An edit
   that doesn't move the needle does not ship.

**When the skill's own output is compiled by an Agent** (down-skilling and
creating-skill produce a prompt an author writes from the SKILL), score **≥2
author samples per version**, or fix one author across both arms. A single
author sample per arm lets author capability dominate the edit effect: the same
down-skilling edit measured 95%→0% with one author pair and 60%→0% with another
— real either way, but n=1 cannot tell a real edit from a lucky author.

This two-tier `best`/`candidate` split is the heart of it: a working revision
can explore, but the shipped skill only ever ratchets upward.

## Bounded edits (the "textual learning rate")

Cap edits per revision — **default ~4** distinct add/replace/delete operations,
fewer as the skill matures. Large speculative rewrites drift and destroy your
ability to attribute a regression to a cause. If a revision wants more edits
than the budget, rank and keep the top ones (below) and let the rest wait.

## Reflect: failures first, then successes

Separate the evidence before proposing edits:

- **Failure reflection.** Across the failing cases, find the *common,
  systematic* pattern — not a one-off edge case. Propose edits that fix the
  pattern. **Failures take priority** in any merge.
- **Success reflection.** Across cases that already work, find generalizable
  patterns worth encoding so they survive future edits. Reinforce; don't
  duplicate.

For both: edits must **generalize** (never hardcode task-specific values), and
must **not duplicate** content already in the skill — patch genuine gaps only.

## Rank when over budget

When candidate edits exceed the budget, keep them in this priority order:

1. **Systematic impact** — fixes a recurring failure across many cases, not one.
2. **Complementarity** — fills a real gap rather than restating existing content.
3. **Generality** — phrased as a durable principle, not tied to one task/entity.
4. **Actionability** — concrete, followable guidance over vague advice.

Drop the rest. They can return next revision if still warranted.

## Protect the hard-won core

If a skill has a battle-tested core that routine edits keep eroding, fence it
off and treat it as off-limits to fast edits. Revisit it only on a **deliberate
longitudinal review**: compare the *same* check tasks across several versions to
catch slow drift and regressions that single-edit review misses. (SkillOpt
fences this region with HTML-comment markers and only rewrites it at epoch
boundaries — the same idea, manual cadence.)

## Carry memory across revisions

After a revision, record what you learned about editing **this** skill — which
kinds of edits helped, which were brittle, redundant, or harmful — via
`remember()` tagged with the skill name. Before the next revision, `recall()`
it. This is the compounding part: each revision starts smarter than the last,
the way SkillOpt's optimizer-side meta-skill conditions its future edits.

## Edit mechanics

Edits are literal string operations (the `Edit` tool): the target text must
match **exactly** or the edit is a silent no-op. Keep targets unique and
verbatim. Prefer append / insert-after-heading / replace-exact / delete-exact,
and verify each edit landed before scoring.

## When NOT to use this

- Authoring a brand-new skill from scratch → **creating-skill**.
- Tracking/rolling back versions during development → **versioning-skills**.
- One-off prose with no reuse → just write it.

## Checklist

- [ ] Held-out check set assembled (includes the triggering failure), fixed for the revision
- [ ] Edits bounded (~4 max), each generalizable and non-duplicative
- [ ] Failure patterns addressed before success reinforcement
- [ ] Candidate scored per-criterion against `best`; accept decided by the triggering-failure criterion, others as regression guards; shipped only if strictly better
- [ ] For Agent-compiled artifacts (down-skilling, creating-
accessing-github-reposSkill

GitHub repository access in containerized environments using REST API and credential detection. Use when git clone fails, or when accessing private repos/writing files via API.

api-credentialsSkill

Securely manages API credentials for multiple providers (Anthropic Claude, Google Gemini, GitHub). Use when skills need to access stored API keys for external service invocations.

asking-questionsSkill

Guidance for asking clarifying questions when user requests are ambiguous, have multiple valid approaches, or require critical decisions. Use when implementation choices exist that could significantly affect outcomes.

assessing-impactSkill

>-

bm25Skill

>-

browsing-blueskySkill

Browse Bluesky content via API and firehose - search posts, fetch user activity, sample trending topics, read feeds and lists, analyze and categorize accounts. Supports authenticated access for personalized feeds. Use for Bluesky research, user monitoring, trend analysis, feed reading, firehose sampling, account categorization.

building-github-indexSkill

Generate progressive disclosure indexes for GitHub repositories to use as Claude project knowledge. Use when setting up projects referencing external documentation, creating searchable indexes of technical blogs or knowledge bases, combining multiple repos into one index, or when user mentions "index", "github repo", "project knowledge", or "documentation reference".

categorizing-bsky-accountsSkill

Analyze and categorize Bluesky accounts by topic using keyword extraction. Use when users mention Bluesky account analysis, following/follower lists, topic discovery, account curation, or network analysis.