
disaster-recovery-plan

This Claude Code skill generates a comprehensive disaster recovery plan for a service or system, including Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets, detailed runbooks for specific failure scenarios, backup and restore procedures, testing schedules, and communication templates. Use it when documenting DR strategies, creating recovery procedures, defining recovery targets, or preparing DR test exercises for on-call teams.

Install in Claude Code
git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/disaster-recovery-plan && cp -r /tmp/disaster-recovery-plan/plugins/pm-engineering/skills/disaster-recovery-plan ~/.claude/skills/disaster-recovery-plan
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Disaster Recovery Plan Skill

Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.

## Required Inputs

Ask for these if not already provided:
- **Service name** and what it does (business function and technical role)
- **Criticality tier** — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
- **Current infrastructure setup** — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
- **RPO/RTO requirements** — Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long the service can be down)
- **Backup strategy** — what is backed up, how often, where backups are stored, retention policy
- **On-call contacts** — names and contact details for the responder chain

## Output Format

---

# Disaster Recovery Plan: [Service Name]

**Team:** [Team name] | **Tech lead:** [Name]
**Criticality tier:** [Tier 1 / Tier 2 / Tier 3] | **Last tested:** [Date]
**Next DR test:** [Date] | **Document owner:** [Name]
**Last updated:** [Date] | **Review cycle:** Quarterly

> **Emergency? Skip to Section 3 — Failure Scenario Runbooks.** Find the scenario that matches your situation and follow the steps exactly.

---

## 1. Recovery Targets

| Target | Value | Rationale |
|---|---|---|
| RPO (Recovery Point Objective) | [X minutes/hours] | [e.g. "Last committed transaction — database replication is synchronous"] |
| RTO (Recovery Time Objective) | [Y minutes/hours] | [e.g. "Revenue impact begins at 30 min; target recovery in 15 min"] |
| MTTR target (non-disaster) | [Z minutes] | [Operational incidents, not DR events] |
| Data retention (backups) | [N days/weeks] | [Compliance requirement or operational policy] |
| Backup frequency | [Every X hours] | [RPO-driven — backup interval must be ≤ RPO] |

**What these mean in practice:**
- If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
- The service must be operational again within [Y minutes/hours] of declaring a DR event.
- If either target cannot be met, escalate to [Engineering Manager] immediately.
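
The targets above reduce to simple comparisons, which makes them easy to check mechanically during a drill. The following is a minimal sketch, assuming every value is expressed in whole minutes; `backup_interval_ok` and `rto_met` are hypothetical helper names, not part of any existing tool, and the example values are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: all arguments are whole minutes.

# Succeeds when the backup interval is short enough to satisfy the RPO
# (backup interval must be <= RPO).
backup_interval_ok() {
  local interval_min=$1 rpo_min=$2
  (( interval_min <= rpo_min ))
}

# Succeeds when recovery finished within the RTO, measured from the
# moment the DR event was declared.
rto_met() {
  local elapsed_min=$1 rto_min=$2
  (( elapsed_min <= rto_min ))
}

# Example values (assumptions for illustration only):
# hourly backups checked against a 15-minute RPO.
if backup_interval_ok 60 15; then
  echo "backup interval satisfies RPO"
else
  echo "backup interval exceeds RPO: escalate"
fi
```

Running the same checks after each DR test gives a pass/fail record for the targets instead of a judgment call.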

---

## 2. Failure Scenario Inventory

| Scenario | Likelihood | Impact | RTO target | RPO target | Runbook |
|---|---|---|---|---|---|
| Single availability zone failure | Medium | [Partial / Full outage] | [15 min] | [0 — no data loss] | Section 3.1 |
| Full region failure | Low | Full outage | [60 min] | [5 min] | Section 3.2 |
| Database corruption / data loss | Low | Full outage | [90 min] | [RPO value] | Section 3.3 |
| Critical dependency outage | High | [Partial degradation] | [30 min] | [N/A] | Section 3.4 |
| Security breach / ransomware | Very low | Full outage + investigation | [4 hours] | [Last clean backup] | Section 3.5 |
| Accidental bulk data deletion | Low | Partial or full data loss | [60 min] | [RPO value] | Section 3.6 |

---

## 3. Failure Scenario Runbooks

### 3.1 Single Availability Zone Failure

**Trigger:** One AZ becomes unreachable — pods/instances in that zone stop responding.
**Detection:** PagerDuty alert `[AlertName]` fires, or cloud provider status page shows AZ degradation.
**Expected RTO:** [15 minutes] | **Expected RPO:** Zero (no data loss if multi-AZ replication is working)

**Step 1 — Confirm the failure**
```bash
# Check pod/instance health across zones
kubectl get pods -o wide -n [namespace] | grep -v Running

# Check which nodes are affected (-w matches "Ready" as a whole word, so NotReady nodes stay in the output)
kubectl get nodes -o wide | grep -vw Ready

# Verify cloud provider AZ status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com
```
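
One pitfall when filtering node status as text: `NotReady` contains the substring `Ready`, so a plain substring match can hide exactly the nodes being searched for. The following is a minimal sketch that tests the STATUS column explicitly. `not_ready_nodes` is a hypothetical helper name, and the sketch assumes the default column layout of `kubectl get nodes --no-headers` (NAME first, STATUS second).

```bash
# Print the names of nodes whose STATUS column is not "Ready".
# A status such as "Ready,SchedulingDisabled" (a cordoned but healthy
# node) is still treated as ready.
# Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" && $2 !~ /^Ready,/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Because the helper reads from stdin, it can be exercised against saved `kubectl` output during a drill without touching the cluster.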

**Step 2 — Assess whether auto-recovery has occurred**
```bash
# If using auto-scaling, check if replacement instances launched
kubectl get pods -n [namespace] --watch

# Check deployment replica count
kubectl get deployment [service-name] -n [namespace]

# Verify load balancer health checks are passing
# e.g. for an AWS ALB/NLB target group:
aws elbv2 describe-target-health --target-group-arn [target-group-arn]
```
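
To decide whether auto-recovery has finished without watching the pod list by eye, the ready and desired replica counts can be compared directly. This is a sketch under assumptions: `replicas_ready` is a hypothetical helper, and the usage comment reads `.status.readyReplicas` and `.spec.replicas` from the Deployment object via `kubectl` jsonpath.

```bash
# Succeeds when ready replicas have caught up with the desired count.
# An empty ready count is treated as 0, because the readyReplicas field
# is omitted from the Deployment status when no replicas are ready.
replicas_ready() {
  local ready=${1:-0} desired=$2
  (( desired > 0 && ready >= desired ))
}

# Usage against a live cluster:
#   ready=$(kubectl get deployment [service-name] -n [namespace] -o jsonpath='{.status.readyReplicas}')
#   desired=$(kubectl get deployment [service-name] -n [namespace] -o jsonpath='{.spec.replicas}')
#   replicas_ready "$ready" "$desired" && echo "auto-recovery complete"
```

If the check still fails after a few minutes, that is the signal to move on to forced rescheduling.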

**Step 3 — Force rescheduling if auto-recovery stalled**
```bash
# Cordon the affected node so no new pods schedule on it
kubectl cordon [node-name]

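# Drain the node: evicts its pods so they are recreated on healthy nodes
# (--timeout stops the command from waiting indefinitely if the node is unreachable)
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data --timeout=120s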

# Verify pods have rescheduled successfully
kubectl get pods -o wide -n [namespace]
```

**Step 4 — Verify service health**
```bash
# Smoke test key endpoints (each command prints the HTTP status code on its own line)
curl -s -o /dev/null -w "%{http_code}\n" https://[service-url]/health
curl -s -o /dev/null -w "%{http_code}\n" https://[service-url]/[critical-endpoint]

# Check error rate in monitoring
[dashboard link or query]
```

**Recovery confirmed when:** All pods are Running, health check returns 200, error rate is at baseline.
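
The health-check part of this criterion can be polled rather than checked once, since pods often take a moment to pass readiness after rescheduling. The following is a minimal sketch, assuming `curl` is available; `probe` and `wait_for_healthy` are hypothetical helper names, and the attempt count and delay in the usage line are illustrative values. It covers only the HTTP 200 condition, not pod status or error rate.

```bash
# Print the HTTP status code for a URL ("000" if the request fails).
probe() {
  curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$1"
}

# Poll a URL until it returns 200, up to a maximum number of attempts.
# Arguments: URL, number of attempts, delay between attempts in seconds.
wait_for_healthy() {
  local url=$1 attempts=$2 delay_s=$3 i code
  for (( i = 1; i <= attempts; i++ )); do
    code=$(probe "$url")
    if [[ $code == 200 ]]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay_s"
  done
  echo "not healthy after $attempts attempt(s); last status: $code"
  return 1
}

# Usage: wait_for_healthy "https://[service-url]/health" 30 10
```

Keeping `probe` as a separate function means the polling logic can be tested with a stub in place of a real endpoint.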

---

### 3.2 Full Region Failure

**Trigger:** The primary region is entirely unavailable.
**Detection:** All service health checks failing, cloud provider status page confirms region-wide event.
**Expected RTO:** [60 minutes] | **Expected RPO:** [5 minutes — based on cross-region replication lag]

**Step 1 — Confirm regional failure (5 minutes)**
```bash
# Confirm the primary region is unreachable (-c 3 stops after three probes;
# without it, ping on Linux runs until interrupted)
ping -c 3 [primary-region-endpoint] || echo "Primary region unreachable"

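# Check replication lag on standby region database
# e.g. for RDS: lag is the CloudWatch ReplicaLag metric (seconds), read in the DR region
aws cloudwatch get-metric-statistics --region [dr-region] --namespace AWS/RDS \
  --metric-name ReplicaLag --dimensions Name=DBInstanceIdentifier,Value=[replica-instance-id] \
  --start-time [start-time] --end-time [end-time] --period 60 --statistics Maximum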
```
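
The replication lag measured in this step determines how much data a failover would lose, so it is worth comparing against the RPO before proceeding. This is a minimal sketch, assuming lag is available as a whole number of seconds and the RPO in whole minutes; `lag_within_rpo` is a hypothetical helper, and the values in the example are illustrative.

```bash
# Succeeds when measured replication lag (seconds) is within the RPO (minutes).
lag_within_rpo() {
  local lag_s=$1 rpo_min=$2
  (( lag_s <= rpo_min * 60 ))
}

# Example with assumed values: 240 s of lag against a 5-minute RPO.
if lag_within_rpo 240 5; then
  echo "replica lag within RPO"
else
  echo "replica lag exceeds RPO: expect data loss beyond target, escalate"
fi
```

Metrics that report fractional seconds would need rounding up before being passed in, since shell arithmetic here is integer-only.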

**Step 2 — Declare DR event and notify (2 minutes)**

Post to `#incidents`:
```
🔴 DR EVENT — [Service Name] — Region Failure
Primary region: [region] — UNREACHABLE
Activating failover to: [dr-region]
Incident commander: [Name]
Next update: 15 minutes
```

Page [Engineering Manager] and [CTO/VP Eng] via Page