optimizing-skills
Disciplined, validation-gated revision of an EXISTING skill so each edit is a measured improvement rather than a guess. Use when editing, revising, or tuning a skill that already exists and there is evidence it underperforms (observed failures, drift, complaints) — invoke by name, or have versioning-skills / creating-skill defer to it before applying edits. Not for authoring a brand-new skill from scratch (use creating-skill) or one-off prose.
git clone --depth 1 https://github.com/oaustegard/claude-skills /tmp/optimizing-skills && cp -r /tmp/optimizing-skills/optimizing-skills ~/.claude/skills/optimizing-skillsSKILL.md
# Optimizing Skills Treat the skill document as the parameter under optimization: change it only when the change **demonstrably beats the version you already ship**. This is the discipline distilled from SkillOpt (microsoft/SkillOpt, arXiv:2605.23904) — its training apparatus dropped, its reproducibility discipline kept. The point is to stop editing skills on intuition and start editing them on evidence. ## Core principle A skill edit is only worth shipping if it strictly improves measured behavior on a held-out check. Most edits that *feel* like improvements don't move the needle, and some quietly regress. The gate below is what separates a real improvement from a confident guess. ## The gate — run it every revision 1. **Assemble a held-out check set.** 3–8 representative tasks/prompts the skill should handle well, and it **must include the failure(s) that prompted this revision**. Keep the set fixed across the revision so before/after scores are comparable. 2. **Hold two versions.** `best` = what you currently ship (never let it silently degrade). `candidate` = `best` + your proposed edits. 3. **Score both on the check set.** "Run" here = dispatch each check task to the **Agent tool** (`subagent_type=general-purpose`) with the skill version in context, or evaluate by hand for small sets. Score **per criterion**, not one collapsed pass/fail. When a task carries several criteria, the criterion that decides accept/reject is **the failure that prompted this revision**; the others are regression guards that must not get worse. Collapsing criteria masks the win: in the down-skilling-v1.2.0 retro, the edit drove architectural hallucination 60%→0% while an unrelated length criterion stayed 0/5 in both arms — a single combined pass/fail scored that as a 0–0 tie and would have rejected a large, real improvement. 4. **Accept only if `candidate` strictly beats `best`** on the triggering-failure criterion, with no regression guard worse. Ties → reject, keep `best`. An edit that doesn't move the needle does not ship. **When the skill's own output is compiled by an Agent** (down-skilling and creating-skill produce a prompt an author writes from the SKILL), score **≥2 author samples per version**, or fix one author across both arms. A single author sample per arm lets author capability dominate the edit effect: the same down-skilling edit measured 95%→0% with one author pair and 60%→0% with another — real either way, but n=1 cannot tell a real edit from a lucky author. This two-tier `best`/`candidate` split is the heart of it: a working revision can explore, but the shipped skill only ever ratchets upward. ## Bounded edits (the "textual learning rate") Cap edits per revision — **default ~4** distinct add/replace/delete operations, fewer as the skill matures. Large speculative rewrites drift and destroy your ability to attribute a regression to a cause. If a revision wants more edits than the budget, rank and keep the top ones (below) and let the rest wait. ## Reflect: failures first, then successes Separate the evidence before proposing edits: - **Failure reflection.** Across the failing cases, find the *common, systematic* pattern — not a one-off edge case. Propose edits that fix the pattern. **Failures take priority** in any merge. - **Success reflection.** Across cases that already work, find generalizable patterns worth encoding so they survive future edits. Reinforce; don't duplicate. For both: edits must **generalize** (never hardcode task-specific values), and must **not duplicate** content already in the skill — patch genuine gaps only. ## Rank when over budget When candidate edits exceed the budget, keep them in this priority order: 1. **Systematic impact** — fixes a recurring failure across many cases, not one. 2. **Complementarity** — fills a real gap rather than restating existing content. 3. **Generality** — phrased as a durable principle, not tied to one task/entity. 4. **Actionability** — concrete, followable guidance over vague advice. Drop the rest. They can return next revision if still warranted. ## Protect the hard-won core If a skill has a battle-tested core that routine edits keep eroding, fence it off and treat it as off-limits to fast edits. Revisit it only on a **deliberate longitudinal review**: compare the *same* check tasks across several versions to catch slow drift and regressions that single-edit review misses. (SkillOpt fences this region with HTML-comment markers and only rewrites it at epoch boundaries — the same idea, manual cadence.) ## Carry memory across revisions After a revision, record what you learned about editing **this** skill — which kinds of edits helped, which were brittle, redundant, or harmful — via `remember()` tagged with the skill name. Before the next revision, `recall()` it. This is the compounding part: each revision starts smarter than the last, the way SkillOpt's optimizer-side meta-skill conditions its future edits. ## Edit mechanics Edits are literal string operations (the `Edit` tool): the target text must match **exactly** or the edit is a silent no-op. Keep targets unique and verbatim. Prefer append / insert-after-heading / replace-exact / delete-exact, and verify each edit landed before scoring. ## When NOT to use this - Authoring a brand-new skill from scratch → **creating-skill**. - Tracking/rolling back versions during development → **versioning-skills**. - One-off prose with no reuse → just write it. ## Checklist - [ ] Held-out check set assembled (includes the triggering failure), fixed for the revision - [ ] Edits bounded (~4 max), each generalizable and non-duplicative - [ ] Failure patterns addressed before success reinforcement - [ ] Candidate scored per-criterion against `best`; accept decided by the triggering-failure criterion, others as regression guards; shipped only if strictly better - [ ] For Agent-compiled artifacts (down-skilling, creating-
GitHub repository access in containerized environments using REST API and credential detection. Use when git clone fails, or when accessing private repos/writing files via API.
Securely manages API credentials for multiple providers (Anthropic Claude, Google Gemini, GitHub). Use when skills need to access stored API keys for external service invocations.
Guidance for asking clarifying questions when user requests are ambiguous, have multiple valid approaches, or require critical decisions. Use when implementation choices exist that could significantly affect outcomes.
>-
>-
Browse Bluesky content via API and firehose - search posts, fetch user activity, sample trending topics, read feeds and lists, analyze and categorize accounts. Supports authenticated access for personalized feeds. Use for Bluesky research, user monitoring, trend analysis, feed reading, firehose sampling, account categorization.
Generate progressive disclosure indexes for GitHub repositories to use as Claude project knowledge. Use when setting up projects referencing external documentation, creating searchable indexes of technical blogs or knowledge bases, combining multiple repos into one index, or when user mentions "index", "github repo", "project knowledge", or "documentation reference".
Analyze and categorize Bluesky accounts by topic using keyword extraction. Use when users mention Bluesky account analysis, following/follower lists, topic discovery, account curation, or network analysis.