Skip to main content
ClaudeWave
Skill963 repo starsupdated 3d ago

incident-postmortem

This skill generates a complete, blameless incident postmortem document following industry standards, including executive summary, impact analysis, chronological timeline, root cause identification, contributing factors, and action items. Use it when documenting any incident, outage, P1/P2/P3 review, or root cause analysis to ensure consistent, blameless framing that emphasizes system gaps over individual failures and drives toward specific, closeable remediation actions.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/incident-postmortem && cp -r /tmp/incident-postmortem/plugins/pm-engineering/skills/incident-postmortem ~/.claude/skills/incident-postmortem
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Incident Postmortem Skill

This skill produces a complete, blameless incident postmortem document following industry-standard format. Output enforces blameless framing throughout — system gaps over individual failures — and drives toward specific, closeable action items rather than vague process commitments.

## Required Inputs

Ask the user for these if not provided:
- **Incident title / ID**
- **Severity** (P1 / P2 / P3 or SEV1 / SEV2 / SEV3)
- **Date and duration** of the incident
- **What happened** (rough notes are fine — the skill will structure them)
- **Services or systems affected**
- **Customer impact** (how many users, what was degraded)
- **How it was detected**
- **How it was resolved**
- **Initial thoughts on root cause**
- **Action items already identified** (optional)
- **Responders** (who was on-call or responded — names or roles; used for the timeline, not for blame)
- **Customer or external communications sent** (optional — any status page updates, emails, or support messages with timestamps)

## Output Format

---

# Incident Postmortem: [Incident Title]

**Incident ID:** [ID]
**Severity:** [P1/P2/P3]
**Date:** [Date]
**Duration:** [Start time → Resolution time — total duration]
**Status:** [Resolved / Monitoring / Ongoing]
**Author:** [Leave blank for user to fill]
**Last updated:** [Date]

---

## Executive Summary

[3–5 sentences. Describe what happened, who was affected, and what was done to resolve it. Written for a non-technical stakeholder. No jargon. No blame.]

---

## Impact

| Dimension | Details |
|---|---|
| **Users affected** | [Number or percentage] |
| **Services degraded** | [List affected services] |
| **Business impact** | [Revenue, SLA breach, support tickets, etc. if known] |
| **Duration** | [Total time from first detection to full resolution] |

---

## Timeline

List events in chronological order. Each entry: `[HH:MM UTC] — [What happened. Who did what. What changed.]`

Rules for timeline entries:
- Use passive or system-focused language — avoid "X made a mistake"
- Include: first symptom, detection, escalation, hypothesis tested, fix applied, confirmation of resolution
- Note time between key events (e.g. "22 minutes between detection and escalation")

---

## Root Cause

**Primary root cause:** [One clear sentence. Technical but plain. "A misconfigured deployment config caused..."]

**Contributing factors:**
- [Factor 1 — e.g. lack of canary deployment meant change hit 100% of traffic immediately]
- [Factor 2 — e.g. alert threshold was set too high to catch the initial degradation]
- [Factor 3 — add as many as are relevant]

**Why did our existing safeguards not prevent this?**
[Honest paragraph explaining why monitoring, tests, or processes didn't catch this earlier. This is where blameless analysis matters most — focus on system gaps, not individual failures.]

---

## Detection

- **How was it first detected?** [Customer report / automated alert / internal monitoring / manual observation]
- **Time from incident start to detection:** [X minutes]
- **Should we have detected this faster?** [Yes / No — and why]

---

## Resolution

**What fixed it?** [Clear description of the actual fix — one paragraph]
**Why did this work?** [Brief technical explanation]
**Was there a temporary mitigation before full resolution?** [Yes/No — describe if yes]

---

## Action Items

| # | Action | Owner | Due Date | Priority |
|---|---|---|---|---|
| 1 | [Specific, testable action] | [Team or person] | [Date] | P1/P2/P3 |

Rules for action items:
- Each action must be specific enough to close as "done" or "not done" — no vague items like "improve monitoring"
- Distinguish between: **Prevent recurrence** (fix the root cause), **Improve detection** (catch it faster next time), **Improve response** (resolve it faster next time)
- Assign a real owner — not "team" or "TBD" if avoidable
- Flag P1 actions as items that block the incident from being marked fully closed

---

## What Went Well

[3–5 honest observations about the response. Include: fast collaboration, good runbooks used, effective escalation, clear communication. This section builds team confidence and reinforces good habits.]

---

## Lessons Learned

[3–5 key insights from this incident that are worth sharing beyond this team. Write these as transferable lessons — e.g. "Our runbook for database failover didn't account for read-replica lag. All runbooks involving database failover should be reviewed."]

---

## Communication Log

[Optional — list external communications sent: status page updates, customer emails, support responses. Include timestamps.]

---

## Quality Checks

- [ ] Timeline has no blame-focused language
- [ ] Root cause is specific (not "human error")
- [ ] Root cause answers "why did this happen?" not just "what happened?" — it names a system or process gap, not a symptom
- [ ] Contributing factors explain the systemic gaps
- [ ] Every action item has an owner and due date
- [ ] "What went well" section is genuine, not token
- [ ] No action item contains vague language like "improve monitoring", "increase resilience", or "better testing" — each must name a specific change
- [ ] Executive summary is readable by non-technical leadership

## Anti-Patterns

- [ ] Do not assign blame to individuals — postmortems must focus on system and process failures
- [ ] Do not write action items with vague language like "improve monitoring" — each must name a specific, ownable change
- [ ] Do not skip the contributing factors — root cause alone misses the systemic issues that enable incidents
- [ ] Do not omit the detection timeline — how long it took to detect matters as much as how long it took to resolve
- [ ] Do not treat the postmortem as closed until all action items have named owners and due dates

## Usage Examples
- "Write a postmortem for the [incident name] outage"
- "Help me write a P1 incident report"
- "Generate an RCA document for [service] going down on [date]"
- "Draft a blameless
ai-ethics-reviewSkill

Conduct a structured ethical review of an AI or ML feature, model, or product. Use when preparing to deploy an AI system, assessing algorithmic risk, auditing a model for bias, or producing a responsible AI impact assessment. Produces a structured ethics review covering fairness, transparency, privacy, safety, accountability, and societal impact with a risk tier score, pre-deployment checklist, and prioritised mitigations.

ai-product-canvasSkill

Structure AI and ML product decisions with the rigour of any product decision. Use when building AI-powered features, evaluating LLM integrations, designing AI products, or assessing AI readiness. Produces a complete AI product canvas covering problem definition, model approach, data requirements, evaluation framework, UX design, responsible AI checklist, and launch monitoring plan.

design-handoff-briefSkill

Transform feature briefs into structured design briefs that give designers the context they need before opening Figma. Use when asked to write a design brief, create a design handoff, brief a designer on a new feature, or translate a PRD into design requirements. Produces a brief with user goal, emotional context, success criteria, constraints, edge cases, and out-of-scope boundaries.

experiment-designerSkill

Design statistically rigorous A/B tests and interpret experiment results. Use when asked to design an experiment, run an A/B test, calculate sample size, interpret test results, or assess whether an experiment was successful. Produces a complete experiment design with hypothesis, sample size, run time, success criteria, and risk flags — or a results interpretation with ship/iterate/kill recommendation.

multi-source-signal-synthesiserSkill

Synthesises user signals from multiple research sources into a unified, weighted insight brief. Use when you have data from interviews, support tickets, NPS verbatims, app reviews, or sales calls and need to reconcile contradictions, surface the underlying need behind requests, or answer 'what are users really telling us'. Produces ranked insights with confidence ratings, source weighting rationale, divergent signal analysis by user segment, and a research gap identification section.

data-analysis-standardSkill

Structure a product data analysis, metric deep-dive, funnel analysis, or cohort study. Use when asked to analyse product metrics, investigate a drop in conversion, explain a data change to stakeholders, or find the root cause of a metric movement. Produces a structured analysis with question, root cause, confidence level, and recommended action.

product-health-analysisSkill

Interpret product metrics against goals and surface actionable signals. Use when asked to analyse product health, review key metrics, investigate a performance issue, produce a health report, or assess product-market fit signals. Produces a structured health report with RAG status, trend analysis, root cause hypotheses, and prioritised actions.

retention-analysisSkill

Structure a retention analysis, churn investigation, or engagement deep-dive for any product team. Use when asked to analyse user retention, investigate churn, measure DAU/MAU, or build a retention improvement plan. Produces a retention snapshot with root cause hypotheses, aha-moment correlation, and prioritised interventions.