Skill · 963 repo stars · updated 4d ago

oncall-runbook

This skill generates a structured on-call runbook for a service, including alert response procedures, escalation matrices, diagnostic commands, and handoff templates designed for rapid incident response. Use it when documenting on-call procedures, creating alert response guides, establishing escalation paths, or preparing shift handoff materials for engineering teams managing service incidents.

Install in Claude Code
git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/oncall-runbook && cp -r /tmp/oncall-runbook/plugins/pm-engineering/skills/oncall-runbook ~/.claude/skills/oncall-runbook
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# On-Call Runbook Skill

Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help.

A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation.

## Required Inputs

Ask for these if not already provided:
- **Service name** and what it does
- **Team** and tech lead name
- **Alert list** — names of alerts that currently page on-call
- **Monitoring setup** — Datadog / Grafana / CloudWatch / PagerDuty / etc.
- **Common failure modes** — what breaks most often, and what fixes it
- **Escalation contacts** — who to call when on-call can't resolve it
- **Deployment setup** — can on-call roll back? How?
- **Service dependencies** — what does this service depend on, and what depends on it?

## Output Format

---

# On-Call Runbook: [Service Name]

**Team:** [Team name] | **Tech lead:** [Name]
**PagerDuty service:** [Link] | **Escalation policy:** [Policy name]
**Last updated:** [Date] | **Next review:** [Date + 90 days]

> **First time on-call for this service?** Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service.

---

## Quick Reference

**Dashboard:** [Link — the first thing to open when paged]
**Logs:** [Link — where to find logs]
**Runbook index:** Jump to the alert that paged you → [Alert list below]
**Can't resolve in 30 min?** Escalate to: [Name] via [Slack / PagerDuty]

**Rollback command (memorise this):**
```bash
[rollback command — e.g. kubectl rollout undo deployment/[service-name]]
```

---

## Escalation Matrix

| Situation | Escalate to | How | After how long |
|---|---|---|---|
| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes |
| Alert requires infra change | [Platform team] | `#platform` Slack | Immediately |
| Customer-facing impact | [CSM / Support lead] | `#incidents` Slack | Immediately (P1) |
| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately |
| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately |
| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour |

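The time-based rows of the matrix can be sketched as a small helper, for example to drive a reminder during an incident. This is a minimal sketch, not part of the skill itself: the function name `escalation_target` and the role labels it prints are hypothetical, and only the 30-minute and 1-hour thresholds come from the table above. The "Immediately" rows depend on the situation rather than on elapsed time, so they are not modelled here.

```bash
# Hypothetical helper: map minutes since the page to the time-based
# escalation rows (30 min -> tech lead, 1 hour -> engineering manager).
escalation_target() {
  local minutes="$1"
  # Reject anything that is not a non-negative integer.
  case "$minutes" in
    ''|*[!0-9]*) echo "usage: escalation_target <minutes>" >&2; return 2 ;;
  esac
  if [ "$minutes" -ge 60 ]; then
    echo "engineering-manager"
  elif [ "$minutes" -ge 30 ]; then
    echo "tech-lead"
  else
    echo "none"
  fi
}

escalation_target 45   # prints "tech-lead"
```
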
**Contacts:**

| Name | Role | Slack | Phone |
|---|---|---|---|
| [Name] | Tech lead | @[handle] | [Number] |
| [Name] | Engineering manager | @[handle] | [Number] |
| [Name] | Platform / infra | @[handle] | [Number] |
| [Platform team] | Infra on-call | `#platform` | PagerDuty |

---

## Service Architecture (Quick View)

```
[Upstream callers]
        │
        ▼
[This Service]
        │
        ├──→ [Primary Database]
        ├──→ [Cache — e.g. Redis]
        └──→ [Downstream Service / Queue]
```

**If this service is down, these are affected:** [List the callers and consumers of this service]
**If these are down, this service is affected:** [List this service's dependencies, e.g. database, cache, downstream services]

---

## Alert Runbooks

### ALERT: [Alert Name 1 — e.g. HighErrorRate]

**What it means:** [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"]
**Severity:** P1 / P2 / P3
**SLO impact:** Yes / No — [If yes: this alert means the error budget is burning at [X]× rate]

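As a worked example of the [X]× figure: burn rate is commonly defined as the observed error rate divided by the error budget, where the error budget is 1 minus the SLO target. A burn rate of 1 uses up the budget exactly at the end of the SLO window; anything higher exhausts it early. The sketch below assumes a 99.9% availability SLO, which is an illustrative value rather than one taken from this runbook, and the helper name `burn_rate` is hypothetical.

```bash
# Hypothetical helper: burn rate = observed error rate / (1 - SLO target).
# Both arguments are fractions, e.g. 0.05 for 5% errors, 0.999 for a 99.9% SLO.
burn_rate() {
  awk -v err="$1" -v slo="$2" 'BEGIN {
    budget = 1 - slo   # allowed error fraction
    if (budget <= 0) { print "SLO target must be below 1" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", err / budget
  }'
}

burn_rate 0.05 0.999   # 5% errors against a 99.9% SLO: prints 50.0
```
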
**Step 1 — Acknowledge and assess**
```bash
# Check current error rate
[query or dashboard link]

# Check which endpoints are erroring
[query or command]
```

**Step 2 — Check recent changes**
```bash
# Any deploys in the last hour?
[command or link to deployment log]

# Recent config changes?
[where to check]
```

**Step 3 — Check dependencies**
```bash
# Is the database healthy?
[health check command or link]

# Is [downstream service] healthy?
[health check command or link]
```

**Step 4 — Diagnose**

| If you see | It means | Do this |
|---|---|---|
| [Error pattern 1] | [Cause] | [Action] |
| [Error pattern 2] | [Cause] | [Action] |
| [Error pattern 3] | [Cause] | [Action] |
| No clear pattern | Unknown cause | Escalate to [name] |

**Step 5 — Fix or mitigate**
```bash
# If caused by bad deploy — roll back:
[rollback command]

# If caused by [specific issue]:
[fix command]

# If caused by upstream dependency:
[mitigation — e.g. enable circuit breaker, reduce traffic, etc.]
```

**After resolving:**
- [ ] Confirm error rate has returned to baseline
- [ ] Check no downstream services were affected
- [ ] If P1: open a post-incident review — see [incident-postmortem skill]
- [ ] Update `#incidents` with resolution summary

---

### ALERT: [Alert Name 2 — e.g. HighLatency]

**What it means:** [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"]
**Severity:** P1 / P2 / P3
**SLO impact:** Yes — latency SLO breach

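If the dashboard is unavailable, P99 can be estimated from raw latency samples. The sketch below uses the nearest-rank method on one latency value per line read from standard input. The helper name `p99` and the input format are assumptions, and monitoring tools may interpolate differently, so small differences from the dashboard value are expected.

```bash
# Hypothetical helper: nearest-rank 99th percentile of numbers on stdin.
p99() {
  sort -n | awk '
    { v[NR] = $1 }                        # store sorted samples by rank
    END {
      if (NR == 0) exit 1                 # no samples: fail instead of guessing
      rank = int((NR * 99 + 99) / 100)    # integer form of ceil(0.99 * NR)
      print v[rank]
    }'
}

seq 1 200 | p99   # prints 198
```
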
**Step 1 — Assess scope**
```bash
# Check which endpoints are slow
[query or dashboard — broken down by endpoint]

# Check if latency is across all regions or localised
[query or command]
```

**Step 2 — Common causes and fixes**

| Cause | Signal | Fix |
|---|---|---|
| Database slow queries | DB latency spike on dashboard | [Check slow query log: `command`] |
| Cache miss storm | Cache hit rate drops on dashboard | [command or action] |
| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. restart, scale up] |
| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call |
| Traffic spike | Request rate spike on dashboard | [Scale up: `command`] |

**Step 3 — Escalate if unresolved in 20 minutes**
Page [Tech lead] via PagerDuty / Slack.

---

### ALERT: [Alert Name 3 — e.g. DatabaseConnectionPoolExhausted]

**What it means:** [e.g. "The service has used all available database connections — new requests will fail"]
**Severity:** P1
**SLO impact:** Yes — will cause errors immediately

**Immediate mitigation:**
```bash
# Restart the service to flush stale connections
[restart command]

# Che
```