
Incident Response

Structured production incident triage, resolution, and post-mortem. Apply when production systems are down, degraded, or behaving unexpectedly. Covers detection, containment, resolution, and learning.

Install in Claude Code
git clone --depth 1 https://github.com/ThamJiaHe/claude-code-handbook /tmp/incident-response && cp -r /tmp/incident-response/skills/examples/incident-response- ~/.claude/skills/incident-response
Then open a new Claude Code session; the skill loads automatically.

incident-response-skill.md

# Incident Response

Structured workflow for production incidents: detect, contain, resolve, learn.

## When to Use

- Production service is down or degraded
- Users reporting errors or data issues
- Monitoring alerts firing
- Security incident detected
- Data integrity issue discovered

## Severity Classification

| Severity | Impact | Response Time | Examples |
|:--------:|--------|:---:|---------|
| **SEV-1** | Full outage, data breach, revenue loss | Immediate | Service down, auth broken, data leak |
| **SEV-2** | Major degradation, subset of users affected | < 30 min | Slow responses, feature broken, partial outage |
| **SEV-3** | Minor impact, workaround available | < 4 hours | UI bug, non-critical feature down |
| **SEV-4** | No user impact, internal issue | Next business day | Log errors, monitoring gaps |
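
The response-time column can be encoded so that alerting or paging scripts compute an acknowledgement window from a severity label. The following is a minimal sketch; the function name `response_window_minutes` is hypothetical, and SEV-4 returns a label rather than a number because "next business day" has no fixed length.

```bash
# Hypothetical helper: map a severity label to the response window in the table above.
response_window_minutes() {
  case "$1" in
    SEV-1) echo 0 ;;                    # Immediate
    SEV-2) echo 30 ;;                   # < 30 min
    SEV-3) echo 240 ;;                  # < 4 hours
    SEV-4) echo next-business-day ;;    # no fixed minute count
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_window_minutes SEV-2   # prints 30
```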

## Phase 1: Detection & Triage (First 5 Minutes)

```markdown
## Incident Triage

**Time detected:** [timestamp]
**Reporter:** [who noticed]
**Severity:** SEV-[1-4]

### What's happening?
[One sentence description of symptoms]

### Who's affected?
[All users / subset / internal only]

### What changed recently?
[Deployments, config changes, traffic spikes]
```
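
During the first five minutes it can help to generate the triage note rather than type it. As a rough sketch, the hypothetical `new_triage` function below prints the template above with the detection time (UTC), reporter, and severity filled in; the remaining fields are left as placeholders to complete by hand.

```bash
# Hypothetical helper: print a triage note with the known fields prefilled.
# Usage: new_triage <reporter> <severity 1-4>
new_triage() {
  local reporter="$1" severity="$2"
  cat <<EOF
## Incident Triage

**Time detected:** $(date -u +%Y-%m-%dT%H:%M:%SZ)
**Reporter:** ${reporter}
**Severity:** SEV-${severity}

### What's happening?
[One sentence description of symptoms]

### Who's affected?
[All users / subset / internal only]

### What changed recently?
[Deployments, config changes, traffic spikes]
EOF
}

# Example: new_triage "on-call engineer" 2 > triage.md
```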

### Quick Diagnostic Commands

```bash
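# Check recent deployments (up to 5 commits from the past 2 hours)
git log --oneline -5 --since="2 hours ago"

# Check the last 100 lines of the application log for errors
tail -n 100 /var/log/app/error.log | grep -iE "error|fatal|panic"

# Check system resources: processes/CPU, disk, memory
top -bn1 | head -n 20
df -h
free -h

# Check database connectivity (PostgreSQL; assumes DB_HOST is set)
pg_isready -h "$DB_HOST"

# Check external service health: any HTTP status means the host is reachable, 000 means no connection
curl -sI https://api.stripe.com/v1 -o /dev/null -w "%{http_code}\n"
```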
```
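
A single number is often easier to compare across checks than a page of log output. The sketch below wraps the log check from the commands above in a hypothetical `count_log_errors` function that counts matching lines in the tail of a log file; the keyword list is the same one used above and is an assumption about how the application logs.

```bash
# Hypothetical helper: count error-like lines in the last N lines of a log file.
# Usage: count_log_errors <log file> [lines, default 100]
count_log_errors() {
  local log_file="$1" lines="${2:-100}"
  # grep -c prints 0 and exits non-zero when nothing matches, so ignore its exit code.
  tail -n "$lines" "$log_file" | grep -ciE 'error|fatal|panic' || true
}

# Example: count_log_errors /var/log/app/error.log 500
```

Running it a minute apart gives a crude signal of whether the error rate is rising or falling.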

## Phase 2: Containment (Minutes 5-15)

**Goal:** Stop the bleeding. Don't fix the root cause yet.

| Containment Action | When to Use |
|---|---|
| **Rollback deployment** | Problem started after a deploy |
| **Feature flag disable** | New feature causing issues |
| **Scale up** | Traffic/load related |
| **Failover to backup** | Primary system unrecoverable |
| **Rate limit** | Being overwhelmed by requests |
| **Block bad actor** | Malicious traffic identified |

```bash
# Quick rollback (if deployment caused it)
# Reverts only the most recent commit; a merge commit also needs `-m 1`
git revert HEAD --no-edit && git push

# Or revert to last known good deployment
# (platform-specific: Vercel, Railway, etc.)
```
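
When the bad deploy spans more than one commit, reverting only `HEAD` is not enough. One possible approach, sketched here with a hypothetical `rollback_to` function, is to revert every commit after the last known good one. It assumes a linear history (merge commits in the range would need `-m`) and deliberately does not push, so the result can be inspected first.

```bash
# Hypothetical helper: revert every commit made after a known-good commit.
# Usage: rollback_to <known-good commit>
rollback_to() {
  local good_commit="$1"
  # Fail early if the commit does not exist in this repository.
  if ! git rev-parse --verify --quiet "${good_commit}^{commit}" >/dev/null; then
    echo "unknown commit: ${good_commit}" >&2
    return 1
  fi
  # Creates one revert commit per reverted commit, newest first.
  git revert --no-edit "${good_commit}..HEAD"
}

# Example: rollback_to abc1234 && git push
```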

## Phase 3: Resolution

### Hypothesis-Driven Debugging

```markdown
## Debugging Log

### Hypothesis 1: [Most likely cause]
Evidence for: [what supports this]
Evidence against: [what contradicts this]
Test: [how to verify]
Result: [confirmed / rejected]

### Hypothesis 2: [Second most likely]
...
```

### Common Root Causes

| Symptom | Common Cause | Quick Fix |
|---------|-------------|-----------|
| 500 errors spike | Bad deployment | Rollback |
| Slow responses | Database query regression | Kill slow queries, add index |
| Connection timeouts | Connection pool exhaustion | Restart, increase pool size |
| OOM crashes | Memory leak | Restart, set memory limits |
| Auth failures | Token/cert expiry | Rotate credentials |
| Data inconsistency | Race condition | Add locking, retry logic |
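
Several quick fixes in the table come down to retrying an operation that fails transiently. As an illustration only, the hypothetical `retry_with_backoff` function below retries a command with a delay that doubles after each failure; the attempt count and initial delay are parameters to tune, not recommendations.

```bash
# Hypothetical helper: retry a command, doubling the delay after each failure.
# Usage: retry_with_backoff <max attempts> <initial delay in seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example: retry_with_backoff 5 1 pg_isready -h "$DB_HOST"
```

Retries hide a race condition rather than remove it, so treat this as containment and record the underlying cause in the post-mortem.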

## Phase 4: Post-Mortem

Write the post-mortem within 48 hours of resolution. Keep it **blameless**: focus on systems, not people.

```markdown
# Post-Mortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** [start time] to [resolution time] ([X] minutes)
**Severity:** SEV-[X]
**Author:** [Name]

## Summary
[2-3 sentence summary of what happened and impact]

## Timeline
| Time | Event |
|------|-------|
| HH:MM | [First symptom / alert] |
| HH:MM | [Incident declared] |
| HH:MM | [Root cause identified] |
| HH:MM | [Fix deployed] |
| HH:MM | [Confirmed resolved] |

## Root Cause
[Technical explanation of what went wrong]

## Impact
- Users affected: [number or percentage]
- Duration: [minutes]
- Revenue impact: [if applicable]
- Data affected: [if applicable]

## What Went Well
- [Good thing 1]
- [Good thing 2]

## What Went Wrong
- [Process/system gap 1]
- [Process/system gap 2]

## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|:--------:|
| [Preventive measure] | [Name] | [Date] | P1 |
| [Detection improvement] | [Name] | [Date] | P2 |
| [Process improvement] | [Name] | [Date] | P3 |

## Lessons Learned
[What the team should take away from this]
```
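
The **Duration** field in the template is simple arithmetic on the first and last timeline entries. A small sketch, assuming a `date` that supports `-d` (GNU coreutils or BusyBox) and timestamps written as `YYYY-MM-DD HH:MM:SS` in UTC; the function name `incident_duration_minutes` is hypothetical.

```bash
# Hypothetical helper: whole minutes between two UTC timestamps.
# Usage: incident_duration_minutes "<start>" "<end>"   (format: YYYY-MM-DD HH:MM:SS)
incident_duration_minutes() {
  local start_epoch end_epoch
  start_epoch=$(date -u -d "$1" +%s) || return 1
  end_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( (end_epoch - start_epoch) / 60 ))
}

incident_duration_minutes "2024-01-15 14:02:00" "2024-01-15 14:49:00"   # prints 47
```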

## Communication Template

```markdown
## Status Update: [Incident Title]

**Status:** Investigating / Identified / Monitoring / Resolved
**Impact:** [Who/what is affected]
**Current action:** [What we're doing right now]
**ETA:** [When we expect resolution, if known]
**Next update:** [When the next status update will be]
```
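
Status updates stay consistent when they are rendered from the same fields every time. The hypothetical `status_update` function below is one way to do that: it accepts only the four status values listed in the template and prints `unknown` when no ETA is given.

```bash
# Hypothetical helper: render a status update from the template above.
# Usage: status_update <title> <status> <impact> <current action> <eta> <next update>
status_update() {
  local title="$1" status="$2" impact="$3" action="$4" eta="$5" next_update="$6"
  # Accept only the statuses named in the template.
  case "$status" in
    Investigating|Identified|Monitoring|Resolved) ;;
    *) echo "invalid status: ${status}" >&2; return 1 ;;
  esac
  cat <<EOF
## Status Update: ${title}

**Status:** ${status}
**Impact:** ${impact}
**Current action:** ${action}
**ETA:** ${eta:-unknown}
**Next update:** ${next_update}
EOF
}

# Example: status_update "Checkout errors" Identified "All users" "Rolling back" "" "15:30 UTC"
```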

## Sources

- [PagerDuty Incident Response Guide](https://response.pagerduty.com/)
- [Google SRE Book: Managing Incidents](https://sre.google/sre-book/managing-incidents/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)