managing-incidents
This Claude Code skill provides structured guidance for managing incidents across their full lifecycle, from detection through post-mortem analysis, emphasizing SRE best practices like blameless culture, severity classification, and clear communication protocols. Use it when establishing incident response processes, designing on-call rotations and escalation policies, creating runbooks, conducting post-mortems, implementing communication protocols, selecting incident management tools, or improving mean time to resolution metrics.
git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/managing-incidents && cp -r /tmp/managing-incidents/skills/managing-incidents ~/.claude/skills/managing-incidents
SKILL.md
# Incident Management

Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.

## When to Use This Skill

Apply this skill when:

- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics

## Core Principles

### Incident Management Philosophy

**Declare Early and Often:** Do not wait for certainty. Declaring an incident enables coordination, can be downgraded if needed, and prevents delayed response.

**Mitigation First, Root Cause Later:** Stop customer impact immediately (rollback, disable feature, failover). Debug and fix root cause after stability restored.

**Blameless Culture:** Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.

**Clear Command Structure:** Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.

**Communication is Critical:** Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.

## Severity Classification

Standard severity levels with response times:

**SEV0 (P0) - Critical Outage:**
- Impact: Complete service outage, critical data loss, payment processing down
- Response: Page immediately 24/7, all hands on deck, executive notification
- Example: API completely down, entire customer base affected

**SEV1 (P1) - Major Degradation:**
- Impact: Major functionality degraded, significant customer subset affected
- Response: Page during business hours, escalate off-hours, IC assigned
- Example: 15% error rate, critical feature unavailable

**SEV2 (P2) - Minor Issues:**
- Impact: Minor functionality impaired, edge case bug, small user subset
- Response: Email/Slack alert, next business day response
- Example: UI glitch, non-critical feature slow

**SEV3 (P3) - Low Impact:**
- Impact: Cosmetic issues, no customer functionality affected
- Response: Ticket queue, planned sprint
- Example: Visual inconsistency, documentation error

For detailed severity decision framework and interactive classifier, see `references/severity-classification.md`.

## Incident Roles

**Incident Commander (IC):**
- Owns overall incident response and coordination
- Makes strategic decisions (rollback vs. debug, when to escalate)
- Delegates tasks to responders (does NOT do hands-on debugging)
- Declares incident resolved when stability confirmed

**Communications Lead:**
- Posts status updates to internal and external channels
- Coordinates with stakeholders (executives, product, support)
- Drafts post-incident customer communication
- Cadence: Every 15-30 minutes for SEV0/SEV1

**Subject Matter Experts (SMEs):**
- Hands-on debugging and mitigation
- Execute runbooks and implement fixes
- Provide technical context to IC

**Scribe:**
- Documents timeline, actions, decisions in real-time
- Records incident notes for post-mortem reconstruction

Assign roles based on severity:
- SEV2/SEV3: Single responder
- SEV1: IC + SME(s)
- SEV0: IC + Communications Lead + SME(s) + Scribe

For detailed role responsibilities, see `references/incident-roles.md`.

## On-Call Management

### Rotation Patterns

**Primary + Secondary:**
- Primary: First responder
- Secondary: Backup if primary doesn't ack within 5 minutes
- Rotation length: 1 week (optimal balance)

**Follow-the-Sun (24/7):**
- Team A: US hours, Team B: Europe hours, Team C: Asia hours
- Benefit: No night shifts, improved work-life balance
- Requires: Multiple global teams

**Tiered Escalation:**
- Tier 1: Junior on-call (common issues, runbook-driven)
- Tier 2: Senior on-call (complex troubleshooting)
- Tier 3: Team lead/architect (critical decisions)

### Best Practices

- Rotation length: 1 week per rotation
- Handoff ceremony: 30-minute call to discuss active issues
- Compensation: On-call stipend + time off after major incidents
- Tooling: PagerDuty, Opsgenie, or incident.io
- Limits: Max 2-3 pages per night; escalate if exceeded

## Incident Response Workflow

Standard incident lifecycle:

```
Detection → Triage → Declaration → Investigation
    ↓
Mitigation → Resolution → Monitoring → Closure
    ↓
Post-Mortem (within 48 hours)
```

### Key Decision Points

**When to Declare:** When in doubt, declare (can always downgrade severity)

**When to Escalate:**
- No progress after 30 minutes
- Severity increases (SEV2 → SEV1)
- Specialized expertise needed

**When to Close:**
- Issue resolved and stable for 30+ minutes
- Monitoring shows all metrics at baseline
- No customer-reported issues

For complete workflow details, see `references/incident-workflow.md`.

## Communication Protocols

### Internal Communication

**Incident Slack Channel:**
- Format: `#incident-YYYY-MM-DD-topic-description`
- Pin: Severity, IC name, status update template, runbook links

**War Room:** Video call for SEV0/SEV1 requiring real-time voice coordination

**Status Update Cadence:**
- SEV0: Every 15 minutes
- SEV1: Every 30 minutes
- SEV2: Every 1-2 hours or at major milestones

### External Communication

**Status Page:**
- Tools: Statuspage.io, Instatus, custom
- Stages: Investigating → Identified → Monitoring → Resolved
- Transparency: Acknowledge issue publicly, provide ETAs when possible

**Customer Email:**
- When: SEV0/SEV1 affecting customers
- Timing: Within 1 hour (acknowledge), post-resolution (full details)
- Tone: Apologetic, transparent, action-oriented

**Regulatory Notifications:**
- Data Breach: GDPR requires notification within 72 hours
- Financial Services: Immediate noti
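
The incident channel format and status update cadence listed under Internal Communication can be sketched in Python as below. This is a minimal illustration, not part of the skill itself: the names `channel_name`, `next_update_due`, and `UPDATE_INTERVAL` are hypothetical, the SEV2 interval assumes the upper bound of the "1-2 hours" range, and SEV3 has no entry because the text gives no cadence for it.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

# Status update cadence per severity, from the cadence list above.
# Assumption: SEV2 uses the upper bound of "every 1-2 hours".
UPDATE_INTERVAL = {
    "SEV0": timedelta(minutes=15),
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(hours=2),
}


def channel_name(day: date, topic: str) -> str:
    """Build a name in the #incident-YYYY-MM-DD-topic-description format."""
    # Lowercase the topic and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if no cadence is defined."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None
    return last_update + interval


if __name__ == "__main__":
    opened = datetime(2025, 1, 15, 10, 0)
    print(channel_name(opened.date(), "API Gateway 5xx errors"))
    print(next_update_due("SEV0", opened))
```

A Communications Lead bot or reminder script could call `next_update_due` after each posted update to schedule the following one; a real setup would take the intervals from team policy rather than hard-coding them.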