Skill · 390 repo stars · updated 7mo ago

managing-incidents

This Claude Code skill provides structured guidance for managing incidents across their full lifecycle, from detection through post-mortem analysis, emphasizing SRE best practices like blameless culture, severity classification, and clear communication protocols. Use it when establishing incident response processes, designing on-call rotations and escalation policies, creating runbooks, conducting post-mortems, implementing communication protocols, selecting incident management tools, or improving mean time to resolution metrics.

Repository: ai-design-components

Install in Claude Code

git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/managing-incidents && cp -r /tmp/managing-incidents/skills/managing-incidents ~/.claude/skills/managing-incidents

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Incident Management

Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.

## When to Use This Skill

Apply this skill when:
- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics

## Core Principles

### Incident Management Philosophy

**Declare Early and Often:** Do not wait for certainty. Declaring an incident enables coordination and prevents delayed response; the severity can always be downgraded later.

**Mitigation First, Root Cause Later:** Stop customer impact immediately (rollback, disable feature, failover). Debug and fix the root cause after stability is restored.

**Blameless Culture:** Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.

**Clear Command Structure:** Assign an Incident Commander (IC) to own coordination. The IC delegates tasks but does not do hands-on debugging.

**Communication is Critical:** Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.

## Severity Classification

Standard severity levels with response times:

**SEV0 (P0) - Critical Outage:**
- Impact: Complete service outage, critical data loss, payment processing down
- Response: Page immediately 24/7, all hands on deck, executive notification
- Example: API completely down, entire customer base affected

**SEV1 (P1) - Major Degradation:**
- Impact: Major functionality degraded, significant customer subset affected
- Response: Page during business hours, escalate off-hours, IC assigned
- Example: 15% error rate, critical feature unavailable

**SEV2 (P2) - Minor Issues:**
- Impact: Minor functionality impaired, edge case bug, small user subset
- Response: Email/Slack alert, next business day response
- Example: UI glitch, non-critical feature slow

**SEV3 (P3) - Low Impact:**
- Impact: Cosmetic issues, no customer functionality affected
- Response: Ticket queue, planned sprint
- Example: Visual inconsistency, documentation error

For detailed severity decision framework and interactive classifier, see `references/severity-classification.md`.
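
As a rough illustration of the levels above, the following sketch maps a handful of impact flags to a severity and its expected response. The `ImpactReport` fields and the `classify_severity` helper are hypothetical names invented for this example; they are not taken from `references/severity-classification.md`, and a real classifier would likely weigh more signals.

```python
from dataclasses import dataclass


@dataclass
class ImpactReport:
    """Hypothetical triage inputs; the field names are illustrative only."""
    complete_outage: bool = False
    critical_data_loss: bool = False
    payments_down: bool = False
    major_degradation: bool = False  # e.g. a critical feature is unavailable
    customer_functionality_affected: bool = False


# Response expectations copied from the severity definitions above.
RESPONSE = {
    "SEV0": "Page immediately 24/7, all hands on deck, executive notification",
    "SEV1": "Page during business hours, escalate off-hours, IC assigned",
    "SEV2": "Email/Slack alert, next business day response",
    "SEV3": "Ticket queue, planned sprint",
}


def classify_severity(report: ImpactReport) -> str:
    """Return the most severe level whose impact criteria match."""
    if report.complete_outage or report.critical_data_loss or report.payments_down:
        return "SEV0"
    if report.major_degradation:
        return "SEV1"
    if report.customer_functionality_affected:
        return "SEV2"
    return "SEV3"  # cosmetic only, no customer functionality affected
```

Checking the most severe criteria first means an incident that matches several levels is always classified at the highest one, which fits the "declare early, downgrade later" principle.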

## Incident Roles

**Incident Commander (IC):**
- Owns overall incident response and coordination
- Makes strategic decisions (rollback vs. debug, when to escalate)
- Delegates tasks to responders (does NOT do hands-on debugging)
- Declares incident resolved when stability confirmed

**Communications Lead:**
- Posts status updates to internal and external channels
- Coordinates with stakeholders (executives, product, support)
- Drafts post-incident customer communication
- Cadence: Every 15 minutes for SEV0, every 30 minutes for SEV1

**Subject Matter Experts (SMEs):**
- Hands-on debugging and mitigation
- Execute runbooks and implement fixes
- Provide technical context to IC

**Scribe:**
- Documents timeline, actions, and decisions in real time
- Records incident notes for post-mortem reconstruction

Assign roles based on severity:
- SEV2/SEV3: Single responder
- SEV1: IC + SME(s)
- SEV0: IC + Communications Lead + SME(s) + Scribe

For detailed role responsibilities, see `references/incident-roles.md`.
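
The staffing rule above can be expressed as a small lookup plus a check for unfilled roles. This is a minimal sketch: `REQUIRED_ROLES` and `missing_roles` are hypothetical names, and treating "SME(s)" as a single required slot is a simplifying assumption.

```python
# Roles required at each severity, following the list above.
# "SME(s)" is modelled as one required slot (an assumption for brevity).
REQUIRED_ROLES = {
    "SEV0": ["Incident Commander", "Communications Lead", "SME", "Scribe"],
    "SEV1": ["Incident Commander", "SME"],
    "SEV2": ["Responder"],
    "SEV3": ["Responder"],
}


def missing_roles(severity: str, assignments: dict[str, str]) -> list[str]:
    """Return the required roles that have no one assigned yet."""
    return [role for role in REQUIRED_ROLES[severity] if not assignments.get(role)]
```

For example, a SEV0 with only an IC and one SME assigned would report `Communications Lead` and `Scribe` as still missing.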

## On-Call Management

### Rotation Patterns

**Primary + Secondary:**
- Primary: First responder
- Secondary: Backup if primary doesn't ack within 5 minutes
- Rotation length: 1 week (optimal balance)

**Follow-the-Sun (24/7):**
- Team A: US hours, Team B: Europe hours, Team C: Asia hours
- Benefit: No night shifts, improved work-life balance
- Requires: Multiple global teams

**Tiered Escalation:**
- Tier 1: Junior on-call (common issues, runbook-driven)
- Tier 2: Senior on-call (complex troubleshooting)
- Tier 3: Team lead/architect (critical decisions)
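
To make the Primary + Secondary pattern concrete, here is a sketch of a weekly rotation with a 5-minute acknowledgement timeout. The function names are hypothetical, and using the next engineer in the list as the secondary is an assumption; paging tools such as PagerDuty or Opsgenie handle this through their own schedule configuration rather than code like this.

```python
from datetime import date


def on_call_pair(engineers: list[str], rotation_start: date, today: date) -> tuple[str, str]:
    """Return (primary, secondary) for the 1-week rotation containing `today`.

    Assumption: the secondary is simply the next engineer in the list.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    weeks_elapsed = (today - rotation_start).days // 7
    primary = engineers[weeks_elapsed % len(engineers)]
    secondary = engineers[(weeks_elapsed + 1) % len(engineers)]
    return primary, secondary


def who_to_page(primary: str, secondary: str, minutes_unacked: int, ack_timeout: int = 5) -> str:
    """Page the secondary once the primary has not acked within the timeout."""
    return secondary if minutes_unacked >= ack_timeout else primary
```

With this ordering, each engineer serves as secondary the week before becoming primary, so they already have context on active issues at handoff.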

### Best Practices

- Rotation length: 1 week per rotation
- Handoff ceremony: 30-minute call to discuss active issues
- Compensation: On-call stipend + time off after major incidents
- Tooling: PagerDuty, Opsgenie, or incident.io
- Limits: Max 2-3 pages per night; escalate if exceeded

## Incident Response Workflow

Standard incident lifecycle:

```
Detection → Triage → Declaration → Investigation
  ↓
Mitigation → Resolution → Monitoring → Closure
  ↓
Post-Mortem (within 48 hours)
```
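
The lifecycle above can be tracked as a simple ordered state machine. This sketch assumes a strictly linear progression, which is a simplification: real incidents sometimes loop back from Monitoring to Investigation. The `IncidentLifecycle` class is a hypothetical name used only for illustration.

```python
# Stages in the order shown in the lifecycle diagram above.
LIFECYCLE = [
    "Detection", "Triage", "Declaration", "Investigation",
    "Mitigation", "Resolution", "Monitoring", "Closure", "Post-Mortem",
]


class IncidentLifecycle:
    """Tracks the current stage and the stages visited so far."""

    def __init__(self) -> None:
        self.stage = LIFECYCLE[0]
        self.history = [self.stage]

    def advance(self) -> str:
        """Move to the next stage; raise once the final stage is reached."""
        index = LIFECYCLE.index(self.stage)
        if index + 1 >= len(LIFECYCLE):
            raise ValueError("incident lifecycle is already complete")
        self.stage = LIFECYCLE[index + 1]
        self.history.append(self.stage)
        return self.stage
```

Keeping `history` alongside the current stage gives the scribe a skeleton timeline to annotate during the post-mortem.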

### Key Decision Points

**When to Declare:** When in doubt, declare (can always downgrade severity)

**When to Escalate:**
- No progress after 30 minutes
- Severity increases (SEV2 → SEV1)
- Specialized expertise needed

**When to Close:**
- Issue resolved and stable for 30+ minutes
- Monitoring shows all metrics at baseline
- No customer-reported issues

For complete workflow details, see `references/incident-workflow.md`.
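
The escalation and closure criteria above reduce to two boolean checks. The sketch below uses hypothetical function and parameter names; the 30-minute thresholds come from the lists above, and how "progress" or "baseline" is measured is left to the caller.

```python
def should_escalate(minutes_without_progress: int,
                    severity_increased: bool,
                    needs_specialist: bool) -> bool:
    """Escalate if ANY trigger applies."""
    return (minutes_without_progress >= 30
            or severity_increased
            or needs_specialist)


def can_close(stable_minutes: int,
              metrics_at_baseline: bool,
              open_customer_reports: int) -> bool:
    """Close only if ALL conditions hold."""
    return (stable_minutes >= 30
            and metrics_at_baseline
            and open_customer_reports == 0)
```

Note the asymmetry: any single trigger is enough to escalate, while closing requires every condition to hold.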

## Communication Protocols

### Internal Communication

**Incident Slack Channel:**
- Format: `#incident-YYYY-MM-DD-topic-description`
- Pin: Severity, IC name, status update template, runbook links
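
A channel name in the format above can be generated from a date and a free-text topic. This is a minimal sketch; `incident_channel_name` is a hypothetical helper, and the slug rules (lowercase, non-alphanumerics collapsed to hyphens) are an assumption chosen to produce names that chat tools generally accept.

```python
import re
from datetime import date


def incident_channel_name(day: date, topic: str) -> str:
    """Build a name like #incident-YYYY-MM-DD-topic-description."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    return f"#incident-{day.isoformat()}-{slug}"
```

For instance, `incident_channel_name(date(2024, 3, 5), "API Gateway 5xx errors!")` yields `#incident-2024-03-05-api-gateway-5xx-errors`.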

**War Room:** Video call for SEV0/SEV1 requiring real-time voice coordination

**Status Update Cadence:**
- SEV0: Every 15 minutes
- SEV1: Every 30 minutes
- SEV2: Every 1-2 hours or at major milestones
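
The cadence above can be turned into a reminder check for the Communications Lead. In this sketch the names are hypothetical, SEV2 uses the upper bound of the 1-2 hour range as an assumption, and SEV3 has no scheduled cadence because none is listed.

```python
from datetime import datetime, timedelta
from typing import Optional

# Intervals from the cadence list above; SEV2 uses the 2-hour upper bound.
UPDATE_INTERVAL = {
    "SEV0": timedelta(minutes=15),
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(hours=2),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if no cadence applies."""
    interval = UPDATE_INTERVAL.get(severity)
    return last_update + interval if interval else None


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True if the next update should already have been posted."""
    due = next_update_due(severity, last_update)
    return due is not None and now > due
```

A bot could call `is_overdue` on a timer and nudge the incident channel when an update has been missed.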

### External Communication

**Status Page:**
- Tools: Statuspage.io, Instatus, custom
- Stages: Investigating → Identified → Monitoring → Resolved
- Transparency: Acknowledge issue publicly, provide ETAs when possible

**Customer Email:**
- When: SEV0/SEV1 affecting customers
- Timing: Within 1 hour (acknowledge), post-resolution (full details)
- Tone: Apologetic, transparent, action-oriented

**Regulatory Notifications:**
- Data Breach: GDPR requires notifying the supervisory authority within 72 hours of becoming aware of a personal data breach
- Financial Services: Immediate noti