
incident-response

This incident response skill manages production incidents from initial detection through resolution by establishing structured roles, severity levels, and decision-making processes across five phases: detection, triage, mitigation, communication, and resolution. Use it when an active production incident occurs, a service is down, a security breach is suspected, or when building incident response procedures and on-call rotation protocols for a team.

Repository: claude-skills

Install in Claude Code


git clone --depth 1 https://github.com/rampstackco/claude-skills /tmp/incident-response && cp -r /tmp/incident-response/dist/pi/.agents/skills/incident-response ~/.claude/skills/incident-response

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Incident Response

Manage active production incidents from detection to resolution. Stack-agnostic. Tool-agnostic.

This skill is for active incidents and incident process. For after-the-fact analysis, use `after-action-report`. For planned launches, use `launch-runbook`.

---

## When to use

- An active incident is happening
- Building incident response procedures
- Defining severity levels
- Setting up on-call rotations
- Training a team on incident response

## When NOT to use

- Post-incident retrospective (use `after-action-report`)
- Planned launches (use `launch-runbook`)
- Pre-launch issue triage (use `qa-testing`)

---

## Required inputs

- Awareness of the incident (alert, customer report, internal observation)
- Access to production systems and monitoring
- Roles and authorities clearly defined
- Communication channels operational

---

## The framework: 5 phases

### 1. Detection

How the incident becomes known.

**Detection sources:**

- Automated alerts (monitoring, SLO violations, error rate spikes)
- Customer reports (support tickets, social media, status page subscribers)
- Internal observation (engineer notices something off)
- Third-party (security researchers, partners)

**On detection:**

- Acknowledge within the target time (typically 5 to 15 minutes for critical incidents)
- Assess severity (see severity rubric below)
- Page the on-call if not already paged
- Open the incident channel
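
As a minimal sketch of the acknowledgement target above, the following computes an acknowledgement deadline and checks whether it has been missed. The function names and the 15-minute default (the upper end of the range quoted for critical incidents) are assumptions for illustration, not part of any specific paging tool.

```python
from datetime import datetime, timedelta


def ack_deadline(detected_at: datetime, target_minutes: int = 15) -> datetime:
    """Return the time by which the incident should be acknowledged.

    target_minutes is an assumed default; set it to your team's own target.
    """
    return detected_at + timedelta(minutes=target_minutes)


def is_ack_overdue(detected_at: datetime, now: datetime, target_minutes: int = 15) -> bool:
    """True if the acknowledgement target has already passed."""
    return now > ack_deadline(detected_at, target_minutes)
```

A check like this could drive an escalation page when the first responder has not acknowledged in time.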

### 2. Triage

Establish severity and impact.

**Severity rubric:**

| Severity | Definition | Response |
|---|---|---|
| SEV-1 (Critical) | Major customer-facing functionality broken. Data integrity at risk. Security breach. | All-hands. Incident commander. Active war room. Public communication required. |
| SEV-2 (Major) | Significant degradation. Some customers affected. Revenue impact. | Incident commander assigned. Active response. Internal communication. Public communication if warranted. |
| SEV-3 (Minor) | Limited impact. Workaround available. Affecting a small group of users. | Standard on-call response. Single owner. |
| SEV-4 (Low) | Cosmetic, edge-case, or low-frequency. No urgent action needed. | Tracked as bug. Addressed in normal queue. |

Severity can change. Re-evaluate as more information emerges.
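
The rubric above can be sketched as a small classification function. This is a simplification under stated assumptions: the `Impact` fields are hypothetical boolean signals invented for illustration, and a real triage call involves judgment that a lookup cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    # Hypothetical signals, loosely mirroring the rubric's definitions.
    security_breach: bool = False
    data_integrity_at_risk: bool = False
    major_functionality_broken: bool = False
    significant_degradation: bool = False
    revenue_impact: bool = False
    limited_user_impact: bool = False


def classify(impact: Impact) -> str:
    """Map impact signals to a severity label, checking the worst case first."""
    if (impact.security_breach
            or impact.data_integrity_at_risk
            or impact.major_functionality_broken):
        return "SEV-1"
    if impact.significant_degradation or impact.revenue_impact:
        return "SEV-2"
    if impact.limited_user_impact:
        return "SEV-3"
    return "SEV-4"
```

Because severity can change, a function like this would be re-run whenever the known impact is updated rather than called once at detection.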

### 3. Mitigation

Stop the bleeding before fixing the cause.

**Mitigation patterns (faster than full fix):**

- **Rollback** (revert recent deploy)
- **Feature flag off** (disable the broken feature without deploy)
- **Failover** (route to healthy replica or region)
- **Scale up** (more capacity to absorb the load)
- **Throttle** (reject some traffic to protect the rest)
- **Graceful degradation** (turn off non-essential features to keep core functional)
- **Maintenance mode** (last resort, blocks all users)

**Mitigation principle:** Stop user impact first. Cause analysis second.
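
To make the throttle pattern concrete, here is a minimal sketch of a fixed-window limiter that rejects traffic beyond a set budget. The class name, parameters, and the fixed-window approach are assumptions chosen for brevity; production systems usually throttle at a load balancer or gateway, and often with other algorithms.

```python
class FixedWindowThrottle:
    """Admit at most `limit` requests per `window_seconds`; reject the rest."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = None  # start time of the current window
        self.count = 0            # requests admitted in the current window

    def allow(self, now: float) -> bool:
        # Start a new window on first use or once the current one has elapsed.
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # over budget: shed this request to protect the rest
```

Passing `now` explicitly keeps the sketch deterministic; a caller would typically supply `time.monotonic()`.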

### 4. Communication

Three audiences during an incident:

**Internal team:**
- Real-time updates in incident channel
- Cadence: every 15 minutes minimum during active incident
- Format: timestamped status updates with what we know, what we're doing, ETA

**Internal stakeholders:**
- Higher-level updates to broader org
- Cadence: every 30 to 60 minutes
- Format: business-impact framing, not technical detail

**External / customers:**
- Status page updates
- Cadence: every 30 minutes minimum during active incident
- Format: plain language, no blame, what users are experiencing, what to expect

**Communication principles:**
- Acknowledge before you have answers ("We're aware and investigating")
- Update on schedule even if no progress ("Still investigating, no new information")
- Never speculate publicly about cause
- Confirm resolution explicitly when restored
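
The update cadences above can be tracked with a small helper that reports which audiences are overdue for an update. This is a sketch only: the audience keys are hypothetical names, and the 60-minute stakeholder value is an assumption taken from the upper end of the 30 to 60 minute range.

```python
from datetime import datetime, timedelta

# Maximum minutes between updates per audience (values from the cadences above).
CADENCE_MINUTES = {
    "internal_team": 15,
    "internal_stakeholders": 60,  # assumption: upper end of 30 to 60
    "external_customers": 30,
}


def next_update_due(last_update: datetime, audience: str) -> datetime:
    """Return when the next update to this audience is due."""
    return last_update + timedelta(minutes=CADENCE_MINUTES[audience])


def overdue_audiences(last_updates: dict, now: datetime) -> list:
    """Return audiences (sorted by name) whose next update is already late."""
    return sorted(
        audience
        for audience, last in last_updates.items()
        if now > next_update_due(last, audience)
    )
```

An overdue audience still gets an update even when there is no progress to report, in line with the "update on schedule" principle.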

### 5. Resolution

Verified fix, customers restored, incident closed.

**Resolution criteria:**

- Mitigation in place and verified
- Root cause identified (or explicitly deferred to the after-action report, AAR)
- All affected systems back to normal
- Customers can resume normal use
- Final status update posted (internal and external)
- Incident channel can be closed (or archived for AAR)

After closure:
- Schedule AAR within 1 to 2 weeks
- Capture initial timeline while memories are fresh
- Track follow-up action items
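
The resolution criteria can be treated as a checklist that gates closure. The sketch below assumes a hypothetical `state` dictionary keyed by invented criterion names; it only illustrates the "all criteria met before closing" rule and is not tied to any incident tool.

```python
# Hypothetical keys, one per resolution criterion listed above.
RESOLUTION_CRITERIA = [
    "mitigation_verified",
    "root_cause_identified_or_deferred",
    "systems_back_to_normal",
    "customers_can_resume",
    "final_status_update_posted",
]


def unmet_criteria(state: dict) -> list:
    """Return the criteria not yet satisfied; missing keys count as unmet."""
    return [c for c in RESOLUTION_CRITERIA if not state.get(c, False)]


def can_close(state: dict) -> bool:
    """The incident can be closed only when every criterion is met."""
    return not unmet_criteria(state)
```

Listing the unmet criteria, rather than returning only a yes or no, gives the incident commander something concrete to chase before announcing resolution.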

---

## Roles during an incident

| Role | Responsibility |
|---|---|
| Incident commander (IC) | Owns the response. Calls decisions. Assigns work. Not necessarily the most technical person; needs to coordinate. |
| Communications lead | Owns internal and external messaging. Reduces IC's communication burden. |
| Operations lead | Drives the technical investigation and mitigation. Often the most senior on-call engineer. |
| Scribe | Captures the timeline as the incident unfolds. Critical for AAR. |
| Subject matter experts | Pulled in as needed. Service owners, database experts, security experts. |

For small teams or low-severity incidents, one person can hold multiple roles. Each role's responsibilities should still be explicit.
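
One way to keep roles explicit when people double up is to record assignments per role and check for gaps. This is a minimal sketch; the role keys and helper names are assumptions, and subject matter experts are left out because they are pulled in as needed.

```python
# Hypothetical role keys for the standing roles in the table above.
REQUIRED_ROLES = [
    "incident_commander",
    "communications_lead",
    "operations_lead",
    "scribe",
]


def unassigned_roles(assignments: dict) -> list:
    """Return required roles that nobody currently holds."""
    return [role for role in REQUIRED_ROLES if not assignments.get(role)]


def roles_by_person(assignments: dict) -> dict:
    """Group roles by holder, so doubled-up responsibilities stay visible."""
    grouped = {}
    for role in REQUIRED_ROLES:
        person = assignments.get(role)
        if person:
            grouped.setdefault(person, []).append(role)
    return grouped
```

On a two-person response, for example, `roles_by_person` makes it obvious when the incident commander is also acting as communications lead.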

---

## Decision-making during an incident

**The IC's authority:**

- Call rollback or other mitigations
- Pull additional people in
- Escalate severity
- Make the call when the options are unclear

**Non-decisions to avoid:**

- "Let's wait and see" when mitigations are available and impact is occurring
- Discussing root cause while users are actively impacted (mitigate first)
- Premature resolution announcements before verification
- Death-by-committee (pull in lots of people, no one decides)

When in doubt: act. A wrong action that can be rolled back beats inaction while users suffer.

---

## Status page communication patterns

**Initial:**
> "We are investigating reports of [issue]. Updates to follow."

**Identified:**
> "We have identified the issue affecting [scope]. Engineers are working on a fix. Next update by [time]."

**Monitoring:**
> "A fix has been applied. We are monitoring to confirm resolution. Next update by [time]."

**Resolved:**
> "This incident has been resolved. Service has been restored. A full incide
