Skill1.2k repo starsupdated today

slo-error-budget

# slo-error-budget This Claude Code skill generates a comprehensive Service Level Objectives (SLO) document for a service, defining what to measure through Service Level Indicators (SLIs), setting reliability targets, calculating error budgets, and establishing burn rate alerts. Use this skill when tasked with creating SLO frameworks, defining reliability targets, calculating how much downtime a service can tolerate, establishing error budget policies, or building governance around planned versus unplanned outages.

View source Repository: pm-claude-skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/slo-error-budget && cp -r /tmp/slo-error-budget/plugins/pm-engineering/skills/slo-error-budget ~/.claude/skills/slo-error-budget

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# SLO and Error Budget Skill

Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.

A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.

## Required Inputs

Ask for these if not already provided:
- **Service name** and brief description of what it does
- **Primary users** — who depends on this service and how
- **User-facing interactions** to protect — e.g. API calls, page loads, transactions
- **Current reliability data** — error rate, latency, uptime (last 30–90 days if available)
- **Existing on-call setup** — who responds to alerts?
- **Deployment frequency** — how often does the team ship?
- **Any existing SLAs** with customers — these constrain SLO targets

## Key Definitions

Always establish these before writing the SLO:

| Term | Definition |
|---|---|
| **SLI** (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" |
| **SLO** (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" |
| **SLA** (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO |
| **Error budget** | The allowed headroom below 100% — the budget for planned and unplanned downtime |
| **Burn rate** | How fast the error budget is being consumed |

---

## Output Format

---

# SLO Document: [Service Name]

**Service:** [Name] | **Team:** [Team name]
**Owner:** [Name / role] | **Approved by:** [Name]
**Effective date:** [Date] | **Review date:** [Date + 3 months]
**Version:** [1.0]

---

## Why This SLO Exists

[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]

---

## Service Overview

**What this service does:** [One sentence]
**Who depends on it:** [Internal teams / external customers / both — describe]
**Critical user journeys protected by this SLO:**
1. [Journey 1 — e.g. "User completes a payment"]
2. [Journey 2]
3. [Journey 3]

---

## SLIs — What We Measure

Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.

### SLI 1: [Name — e.g. Request Success Rate]

| Field | Detail |
|---|---|
| **What it measures** | [e.g. "% of API requests that return a non-5xx response"] |
| **Good event definition** | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] |
| **Bad event definition** | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] |
| **Measurement source** | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
| **Measured over** | Rolling 28-day window |
| **Exclusions** | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |

### SLI 2: [Name — e.g. Latency]

| Field | Detail |
|---|---|
| **What it measures** | [e.g. "P99 response time for the /checkout endpoint"] |
| **Good event definition** | [e.g. "Request completes in ≤500ms at P99"] |
| **Bad event definition** | [e.g. "Request takes >500ms at P99"] |
| **Measurement source** | [Source] |
| **Measured over** | Rolling 28-day window |
| **Exclusions** | [Any exclusions] |

### SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.]

[Same structure]

---

## SLO Targets

| SLI | Target | Window | Error Budget |
|---|---|---|---|
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes/month] |
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes/month] |
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes/month] |

**How targets were set:**
- Historical baseline (last 90 days): [X]%
- Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment]
- Rationale: [1–2 sentences]

**What 100% is NOT the target:** [Brief explanation of why targeting 100% is counterproductive — it discourages feature development and doesn't reflect user reality]

---

## Error Budget Calculation

**For SLI 1 ([Name]), at [X]% target:**

```
Error budget = (100% - SLO target) × measurement window
             = (100% - [X]%) × 28 days × 24 hours × 60 minutes
             = [Y]% × [Z total minutes]
             = [N] minutes of allowed failure per 28-day window
```

**In plain terms:** We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.

---

## Burn Rate Alerts

Burn rate = how fast the error budget is being consumed relative to the budget window.
A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.

| Alert | Burn rate | Window | Severity | Response |
|---|---|---|---|---|
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately — budget exhausted in <2 hours |
| Page (high) | >6× | 6 hours | P2 | Page on-call — budget exhausted in <5 days |
| Ticket (warning) | >3× | 3 days | P3 | Create ticket — review at next team meeting |
| Info | >1× | 28 days | Info | Log only — budget on track to exhaust by end of window |

**Alert implementation:** [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]

---

## Error Budget Policy

This policy defines what to do with the error budget — both when it's healthy and when it's burning.

### When budget is healthy (>50% remaining)

- Feature development and deployments proceed at normal pace
- The team may take on riskier experiments
- Reliability improvements are scheduled but not urgent

### When budget is at risk (25–50% remaining)

- Deployment frequency reduced — team ships only well-tested changes
- One reliability improvement added to current sprint
- Weekly error budget review added to team standup

### When budget is nearly exhausted (<25% remaining)

- Feature work paused in favour of relia