ClaudeWave
Skill · 963 repo stars · updated 3d ago

# slo-error-budget

This Claude Code skill generates a comprehensive Service Level Objectives (SLO) document for a service. The document defines what to measure through Service Level Indicators (SLIs), sets reliability targets, calculates error budgets, and establishes burn rate alerts. Use this skill when creating SLO frameworks, defining reliability targets, calculating how much downtime a service can tolerate, establishing error budget policies, or building governance around planned versus unplanned outages.

Install in Claude Code
git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/slo-error-budget && cp -r /tmp/slo-error-budget/plugins/pm-engineering/skills/slo-error-budget ~/.claude/skills/slo-error-budget
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# SLO and Error Budget Skill

Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.

A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.

## Required Inputs

Ask for these if not already provided:
- **Service name** and brief description of what it does
- **Primary users** — who depends on this service and how
- **User-facing interactions** to protect — e.g. API calls, page loads, transactions
- **Current reliability data** — error rate, latency, uptime (last 30–90 days if available)
- **Existing on-call setup** — who responds to alerts?
- **Deployment frequency** — how often does the team ship?
- **Any existing SLAs** with customers — these constrain SLO targets

## Key Definitions

Always establish these before writing the SLO:

| Term | Definition |
|---|---|
| **SLI** (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" |
| **SLO** (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" |
| **SLA** (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO |
| **Error budget** | The allowed headroom below 100% — the budget for planned and unplanned downtime |
| **Burn rate** | How fast the error budget is being consumed |

---

## Output Format

---

# SLO Document: [Service Name]

**Service:** [Name] | **Team:** [Team name]
**Owner:** [Name / role] | **Approved by:** [Name]
**Effective date:** [Date] | **Review date:** [Date + 3 months]
**Version:** [1.0]

---

## Why This SLO Exists

[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]

---

## Service Overview

**What this service does:** [One sentence]
**Who depends on it:** [Internal teams / external customers / both — describe]
**Critical user journeys protected by this SLO:**
1. [Journey 1 — e.g. "User completes a payment"]
2. [Journey 2]
3. [Journey 3]

---

## SLIs — What We Measure

Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.

### SLI 1: [Name — e.g. Request Success Rate]

| Field | Detail |
|---|---|
| **What it measures** | [e.g. "% of API requests that return a non-5xx response within 500ms"] |
| **Good event definition** | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] |
| **Bad event definition** | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] |
| **Measurement source** | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
| **Measured over** | Rolling 28-day window |
| **Exclusions** | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
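
The good/bad event split in the table above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Request` record, the `/healthz` exclusion path, and the 500 ms threshold are assumptions standing in for whatever the filled-in SLI table specifies. It treats any non-5xx response as good, which also counts 3xx responses as good.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to complete the request
    path: str          # request path, used for exclusions


# Assumed values: replace with the ones from the SLI table.
LATENCY_THRESHOLD_MS = 500
EXCLUDED_PATHS = {"/healthz"}


def is_good(req: Request) -> bool:
    """Good event: non-5xx response completed within the latency threshold."""
    return req.status < 500 and req.latency_ms <= LATENCY_THRESHOLD_MS


def sli_percent(requests: list[Request]) -> float:
    """Percentage of good events among eligible (non-excluded) requests."""
    eligible = [r for r in requests if r.path not in EXCLUDED_PATHS]
    if not eligible:
        # Assumption: with no eligible traffic the SLI is treated as met.
        return 100.0
    good = sum(is_good(r) for r in eligible)
    return 100.0 * good / len(eligible)
```

Classifying each request as good or bad, rather than averaging a metric, keeps the SLI a simple ratio that can be compared directly with the SLO target.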

### SLI 2: [Name — e.g. Latency]

| Field | Detail |
|---|---|
| **What it measures** | [e.g. "% of requests to the /checkout endpoint that complete within 500ms (a 99% target is equivalent to P99 ≤ 500ms)"] |
| **Good event definition** | [e.g. "Request completes in ≤500ms"] |
| **Bad event definition** | [e.g. "Request takes >500ms"] |
| **Measurement source** | [Source] |
| **Measured over** | Rolling 28-day window |
| **Exclusions** | [Any exclusions] |

### SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.]

[Same structure]

---

## SLO Targets

| SLI | Target | Window | Error Budget |
|---|---|---|---|
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |

**How targets were set:**
- Historical baseline (last 90 days): [X]%
- Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment]
- Rationale: [1–2 sentences]

**Why 100% is NOT the target:** [Brief explanation of why targeting 100% is counterproductive: it discourages feature development and doesn't reflect user reality]

---

## Error Budget Calculation

**For SLI 1 ([Name]), at [X]% target:**

```
Error budget = (100% - SLO target) × measurement window
             = (100% - [X]%) × 28 days × 24 hours × 60 minutes
             = [Y]% × 40,320 minutes
             = [N] minutes of allowed failure per 28-day window
```

**In plain terms:** We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
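
The calculation above can be checked with a short Python sketch. The function name `error_budget_minutes` is hypothetical, and the time-based view assumes bad events are spread evenly over the window; for request-based SLIs the budget is more precisely a count of bad requests.

```python
WINDOW_DAYS = 28  # rolling window used throughout this document


def error_budget_minutes(slo_target_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of allowed failure in one window for a given SLO target."""
    if not 0.0 < slo_target_percent < 100.0:
        raise ValueError("SLO target must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60  # 40,320 minutes for 28 days
    return (100.0 - slo_target_percent) / 100.0 * window_minutes


if __name__ == "__main__":
    for target in (99.0, 99.5, 99.9):
        print(f"{target}% -> {error_budget_minutes(target):.1f} minutes")
```

For example, a 99.5% target leaves 201.6 minutes per 28-day window, and a 99.9% target leaves about 40.3 minutes.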

---

## Burn Rate Alerts

Burn rate = how fast the error budget is being consumed relative to the budget window.
A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.

| Alert | Burn rate | Window | Severity | Response |
|---|---|---|---|---|
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately: at this rate a full budget is exhausted in <2 days |
| Page (high) | >6× | 6 hours | P2 | Page on-call: budget exhausted in <5 days |
| Ticket (warning) | >3× | 3 days | P3 | Create ticket and review at next team meeting |
| Info | >1× | 28 days | Info | Log only: budget on track to exhaust before the end of the window |

**Alert implementation:** [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
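
The arithmetic linking burn rate to budget exhaustion can be sketched as follows. This is an illustrative calculation under the 28-day window assumed in this document, not an alert configuration for any particular monitoring tool; the function names are hypothetical.

```python
WINDOW_DAYS = 28  # rolling SLO window


def burn_rate(bad_events: int, total_events: int, slo_target_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0  # assumption: no traffic means no budget is burned
    allowed_error_rate = (100.0 - slo_target_percent) / 100.0
    observed_error_rate = bad_events / total_events
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Days until a full budget is gone if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def budget_consumed_fraction(rate: float, alert_window_hours: float,
                             window_days: int = WINDOW_DAYS) -> float:
    """Fraction of the whole budget consumed during one alert window."""
    return rate * alert_window_hours / (window_days * 24)
```

Under these assumptions, a sustained burn rate of 14 exhausts a full 28-day budget in 2 days and consumes roughly 2% of it in a single hour, while a burn rate of 6 exhausts it in under 5 days.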

---

## Error Budget Policy

This policy defines what to do with the error budget — both when it's healthy and when it's burning.

### When budget is healthy (>50% remaining)

- Feature development and deployments proceed at normal pace
- The team may take on riskier experiments
- Reliability improvements are scheduled but not urgent

### When budget is at risk (25–50% remaining)

- Deployment frequency reduced — team ships only well-tested changes
- One reliability improvement added to current sprint
- Weekly error budget review added to team standup

### When budget is nearly exhausted (<25% remaining)

- Feature work paused in favour of relia