Skip to main content
ClaudeWave
Skill963 estrellas del repoactualizado 4d ago

runbook-writer

The runbook-writer skill generates structured operational guides for services, incidents, and deployments with sections covering overview, prerequisites, step-by-step procedures, rollback steps, troubleshooting, and escalation paths. Use it when tasked with creating runbooks, ops guides, incident response playbooks, or deployment procedures that enable on-call engineers unfamiliar with a system to execute operations reliably under pressure.

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/runbook-writer && cp -r /tmp/runbook-writer/plugins/pm-engineering/skills/runbook-writer ~/.claude/skills/runbook-writer
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# Runbook Writer Skill

Produces operational runbooks for services, incident types, and deployment procedures — structured so an on-call engineer who's never touched the system can follow them under pressure.

## Required Inputs

Ask for these if not provided:
- **What the runbook is for** (e.g. deploying the payment service, responding to a database failover, rotating API keys)
- **Runbook type** (Deployment / Incident Response / Maintenance / Disaster Recovery)
- **System/service name and what it does** (brief description)
- **Audience** (new on-call engineers / experienced SREs / DevOps team)
- **Tech stack** (where relevant — e.g. Kubernetes, AWS RDS, Node.js)
- **Monitoring tools** (e.g. Grafana, Datadog, CloudWatch, Splunk — used to name specific dashboards and alert links in the steps)
- **Key environment details** (e.g. Kubernetes cluster name, AWS account/region, relevant namespaces or resource names — paste what's relevant for exact commands)

## Output Format

---
**Runbook:** [Runbook Title]
**Service:** [Service Name]
**Type:** [Deployment / Incident Response / Maintenance / DR]
**Last Updated:** [Insert today's date in YYYY-MM-DD format]
**Owner:** [Team or person]
**Severity:** [P1 / P2 / P3 — if incident-type]

---

### Overview
**What this runbook covers:**
[1–2 sentences on the scenario this runbook handles]

**When to use this runbook:**
- [Specific trigger condition 1 — e.g. PagerDuty alert: `high-error-rate-payment-service`]
- [Specific trigger condition 2 — e.g. Deploy needed after PR merged to `main`]

**Estimated time to complete:** [X minutes / X–Y minutes depending on outcome]

**Impact if not completed correctly:** [e.g. Payment processing degraded / Data loss risk / Users locked out]

---

### Prerequisites

**Access required:**
- [ ] [System/tool access — e.g. AWS Console: `production-account`]
- [ ] [Credential — e.g. `vault read secret/payment-service`]
- [ ] [VPN / bastion access if needed]

**Tools required:**
- [ ] [Tool name and version — e.g. `kubectl` v1.28+]
- [ ] [CLI or dashboard name]

**Before you start:**
- [ ] [Prerequisite check — e.g. Verify current deployment is healthy in Grafana]
- [ ] [Prerequisite action — e.g. Announce in `#ops-live` that you're starting]

---

### Procedure

Number every step. Use exact commands. Do not paraphrase tool names or flags.

**Step 1: [Action name]**
[What you're doing and why — one sentence]
```bash
# Exact command
[command here]
```
**Expected output:** `[what should appear if this worked]`
**If this fails:** [Exact error message to look for] → [What to do, or see Troubleshooting]

**Step 2: [Action name]**
[Same structure as Step 1]

**Step 3: Verify**
Always include a verification step after the main procedure:
```bash
[verification command]
```
**Expected state:** [What a healthy system looks like after this runbook completes]

---

### Rollback

How to undo this procedure if something went wrong:

**Step R1: [Rollback action]**
```bash
[rollback command]
```
**Verify rollback:** `[command to confirm rollback succeeded]`

---

### Troubleshooting

| Symptom | Likely Cause | Resolution |
|---|---|---|
| [Error message or observable symptom] | [Why this happens] | [Exact fix or next step] |
| [Another symptom] | [Cause] | [Resolution] |

---

### Escalation

If this runbook does not resolve the issue:

| Condition | Who to Contact | How |
|---|---|---|
| [e.g. DB unavailable after 10 min] | [DBA on-call] | [PagerDuty policy: `db-oncall`] |
| [e.g. Payment provider unresponsive] | [Vendor contact] | [Contact in 1Password: `vendor-escalation`] |

**Always update the incident timeline in [tool] before escalating.**

---

### Post-Procedure Checklist

After completing the runbook:
- [ ] Announce completion in `#ops-live` with outcome
- [ ] Update the incident ticket / deploy log
- [ ] Verify alerts have resolved in monitoring dashboard
- [ ] If this revealed a gap in this runbook — update it now (link to edit process)

---

## Quality Checks
- [ ] Every step has an exact command (no "run the deploy script")
- [ ] Expected output is specified for each step so engineer knows if it worked
- [ ] Failure path is explicit for each step (not "if it fails, investigate")
- [ ] Rollback procedure is complete and independently testable
- [ ] Escalation table has no cells containing only "[Team name]" — every row must either have a real contact or be explicitly flagged as [FILL IN: on-call rotation link]
- [ ] Rollback section contains at least one concrete command (not left as "[rollback command]" placeholder)
- [ ] Runbook can be followed by someone who has never touched this system

## Usage Examples
- "Write a runbook for [service] deployment"
- "Create an incident response runbook for [alert type]"
- "I need a runbook for [procedure]"
- "Document the operational procedure for [X]"
- "Write an ops playbook for [scenario]"

## Anti-Patterns

- [ ] Do not write steps as vague actions like "run the deploy script" — every step must include the exact command
- [ ] Do not leave the rollback section as a placeholder — a runbook without a tested rollback procedure is incomplete and dangerous
- [ ] Do not omit expected output for each step — without it, the on-call engineer cannot tell if the step succeeded
- [ ] Do not write escalation contacts as "[Team name]" — every escalation row must have a real contact or an explicit flag to fill in
- [ ] Do not assume the reader knows the system — write for someone who has never touched it before
ai-ethics-reviewSkill

Conduct a structured ethical review of an AI or ML feature, model, or product. Use when preparing to deploy an AI system, assessing algorithmic risk, auditing a model for bias, or producing a responsible AI impact assessment. Produces a structured ethics review covering fairness, transparency, privacy, safety, accountability, and societal impact with a risk tier score, pre-deployment checklist, and prioritised mitigations.

ai-product-canvasSkill

Structure AI and ML product decisions with the rigour of any product decision. Use when building AI-powered features, evaluating LLM integrations, designing AI products, or assessing AI readiness. Produces a complete AI product canvas covering problem definition, model approach, data requirements, evaluation framework, UX design, responsible AI checklist, and launch monitoring plan.

design-handoff-briefSkill

Transform feature briefs into structured design briefs that give designers the context they need before opening Figma. Use when asked to write a design brief, create a design handoff, brief a designer on a new feature, or translate a PRD into design requirements. Produces a brief with user goal, emotional context, success criteria, constraints, edge cases, and out-of-scope boundaries.

experiment-designerSkill

Design statistically rigorous A/B tests and interpret experiment results. Use when asked to design an experiment, run an A/B test, calculate sample size, interpret test results, or assess whether an experiment was successful. Produces a complete experiment design with hypothesis, sample size, run time, success criteria, and risk flags — or a results interpretation with ship/iterate/kill recommendation.

multi-source-signal-synthesiserSkill

Synthesises user signals from multiple research sources into a unified, weighted insight brief. Use when you have data from interviews, support tickets, NPS verbatims, app reviews, or sales calls and need to reconcile contradictions, surface the underlying need behind requests, or answer 'what are users really telling us'. Produces ranked insights with confidence ratings, source weighting rationale, divergent signal analysis by user segment, and a research gap identification section.

data-analysis-standardSkill

Structure a product data analysis, metric deep-dive, funnel analysis, or cohort study. Use when asked to analyse product metrics, investigate a drop in conversion, explain a data change to stakeholders, or find the root cause of a metric movement. Produces a structured analysis with question, root cause, confidence level, and recommended action.

product-health-analysisSkill

Interpret product metrics against goals and surface actionable signals. Use when asked to analyse product health, review key metrics, investigate a performance issue, produce a health report, or assess product-market fit signals. Produces a structured health report with RAG status, trend analysis, root cause hypotheses, and prioritised actions.

retention-analysisSkill

Structure a retention analysis, churn investigation, or engagement deep-dive for any product team. Use when asked to analyse user retention, investigate churn, measure DAU/MAU, or build a retention improvement plan. Produces a retention snapshot with root cause hypotheses, aha-moment correlation, and prioritised interventions.