Skill | 390 repo stars | updated 7mo ago

planning-disaster-recovery

This skill provides comprehensive guidance for designing disaster recovery strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Use it when defining RTO/RPO objectives, implementing database backups with point-in-time recovery, configuring cross-region replication, testing disaster recovery through chaos engineering, or meeting compliance requirements.

Repository: ai-design-components

Install in Claude Code

git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/planning-disaster-recovery && cp -r /tmp/planning-disaster-recovery/skills/planning-disaster-recovery ~/.claude/skills/planning-disaster-recovery

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Disaster Recovery

## Purpose

Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.

## When to Use This Skill

Invoke this skill when:
- Defining recovery time objectives (RTO) and recovery point objectives (RPO)
- Implementing database backups with point-in-time recovery (PITR)
- Setting up Kubernetes cluster backup and restore workflows
- Configuring cross-region replication for high availability
- Testing disaster recovery procedures through chaos experiments
- Meeting compliance requirements (GDPR, SOC 2, HIPAA)
- Automating backup monitoring and alerting
- Designing multi-cloud disaster recovery architectures

## Core Concepts

### RTO and RPO Fundamentals

**Recovery Time Objective (RTO):** Maximum acceptable downtime after a disaster before business impact becomes unacceptable.

**Recovery Point Objective (RPO):** Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.

**Criticality Tiers:**
- **Tier 0 (Mission-Critical):** RTO < 1 hour, RPO < 5 minutes
- **Tier 1 (Production):** RTO 1-4 hours, RPO 15-60 minutes
- **Tier 2 (Important):** RTO 4-24 hours, RPO 1-6 hours
- **Tier 3 (Standard):** RTO > 24 hours, RPO > 6 hours
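The tier boundaries above can be expressed as a small lookup. This is a minimal sketch, not part of the skill itself: the names `Objectives` and `criticality_tier` are hypothetical, the upper bounds of Tiers 1 and 2 are assumed to be inclusive, an RPO between 5 and 15 minutes (not covered by the list) is assumed to fall into Tier 1, and when RTO and RPO point at different tiers the stricter one is assumed to win.

```python
from dataclasses import dataclass

# Upper bounds in minutes, taken from the tier list above.
# Index = tier number; anything above the last bound is Tier 3.
RTO_UPPER = [60, 240, 1440]
RPO_UPPER = [5, 60, 360]


@dataclass(frozen=True)
class Objectives:
    rto_minutes: float
    rpo_minutes: float


def _tier(value: float, upper_bounds: list[float]) -> int:
    """Return the first tier whose bound covers the value."""
    for tier, bound in enumerate(upper_bounds):
        # Tier 0 bounds are strict ("<"); later bounds are assumed inclusive.
        if value < bound or (tier > 0 and value == bound):
            return tier
    return len(upper_bounds)


def criticality_tier(obj: Objectives) -> int:
    """Assumption: the stricter (lower-numbered) of the two tiers applies."""
    return min(_tier(obj.rto_minutes, RTO_UPPER), _tier(obj.rpo_minutes, RPO_UPPER))
```

For example, `criticality_tier(Objectives(rto_minutes=120, rpo_minutes=30))` returns `1`, while an RTO of 30 minutes combined with an RPO of 2 hours returns `0` because the RTO alone demands Tier 0 infrastructure.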

### 3-2-1 Backup Rule

Maintain **3 copies** of data on **2 different media** types with **1 copy offsite**.

Example implementation:
- Primary: Production database
- Secondary: Local backup storage
- Tertiary: Offsite cloud object storage (S3, GCS, or Azure Blob Storage)
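A backup inventory can be checked against the rule mechanically. The sketch below is illustrative only: `DataCopy`, `check_3_2_1`, and the media labels are hypothetical names, and what counts as a distinct media type is left to whoever fills in the inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str      # free-form label, e.g. "block-storage", "nas", "object-storage"
    offsite: bool


def check_3_2_1(copies: list[DataCopy]) -> list[str]:
    """Return a list of rule violations; an empty list means the rule holds."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


inventory = [
    DataCopy("production database", "block-storage", offsite=False),
    DataCopy("local backup storage", "nas", offsite=False),
    DataCopy("cloud backup", "object-storage", offsite=True),
]
```

Calling `check_3_2_1(inventory)` on the example implementation returns an empty list; dropping the cloud copy would report both a missing copy and a missing offsite location.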

### Backup Types

**Full Backup:** Complete copy of all data. Slowest to create, fastest to restore.

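**Incremental Backup:** Only changes since the last backup of any type. Fastest to create, but a restore needs the last full backup plus every incremental taken after it.

**Differential Backup:** All changes since the last full backup. A restore needs only the full backup plus the latest differential, trading extra storage for a faster restore than an incremental chain.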

**Continuous Backup:** Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.
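The difference between incremental and differential backups shows up most clearly in what a restore has to read. The following sketch resolves the restore chain for a given backup under the definitions above; `restore_chain` and the `"full"`/`"diff"`/`"incr"` labels are hypothetical, and real tools track these dependencies in their own metadata.

```python
def restore_chain(backups: list[tuple[str, str]], target: str) -> list[str]:
    """Return the backups needed to restore `target`, oldest first.

    `backups` is a chronological list of (label, kind) pairs where kind is
    "full", "diff" (changes since the last full) or "incr" (changes since
    the previous backup of any kind).
    """
    labels = [label for label, _ in backups]
    i = labels.index(target)  # raises ValueError for an unknown label
    chain = []
    while True:
        label, kind = backups[i]
        chain.append(label)
        if kind == "full":
            return chain[::-1]
        if kind == "diff":
            # A differential depends only on the most recent full backup.
            candidates = [j for j in range(i) if backups[j][1] == "full"]
        elif kind == "incr":
            # An incremental depends on whatever backup came right before it.
            candidates = list(range(i))
        else:
            raise ValueError(f"unknown backup kind: {kind}")
        if not candidates:
            raise ValueError(f"{label} has no base backup")
        i = candidates[-1]


week = [("sun", "full"), ("mon", "incr"), ("tue", "incr"), ("wed", "diff"), ("thu", "incr")]
```

With this schedule, `restore_chain(week, "tue")` returns `["sun", "mon", "tue"]`, while `restore_chain(week, "thu")` returns `["sun", "wed", "thu"]` because the Wednesday differential short-circuits the earlier incrementals.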

## Quick Decision Framework

### Step 1: Map RTO/RPO to Strategy

```
RTO < 1 hour, RPO < 5 min
→ Active-Active replication, continuous archiving, automated failover
→ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
→ Cost: Highest

RTO 1-4 hours, RPO 15-60 min
→ Warm standby, incremental backups, automated failover
→ Tools: pgBackRest, WAL-G, RDS Multi-AZ
→ Cost: High

RTO 4-24 hours, RPO 1-6 hours
→ Daily full + incremental, cross-region backup
→ Tools: pgBackRest, Velero, Restic
→ Cost: Medium

RTO > 24 hours, RPO > 6 hours
→ Weekly full + daily incremental, single region
→ Tools: pg_dump, mysqldump, S3 versioning
→ Cost: Low
```
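A candidate schedule can be sanity-checked against its RPO before committing to a row of the mapping above. This is a rough sketch under a stated assumption: worst-case data loss is approximated by the interval between backups, or by the log archiving interval when continuous WAL/binlog archiving is in place. Backup duration and replication lag are ignored, and both function names are hypothetical.

```python
from typing import Optional


def worst_case_data_loss_minutes(
    backup_interval_min: float,
    archive_interval_min: Optional[float] = None,
) -> float:
    """Approximate the largest window of data that could be lost.

    Without log archiving, everything written since the last backup is at
    risk. With archiving, only what has not yet been shipped is at risk.
    """
    if archive_interval_min is not None:
        return min(backup_interval_min, archive_interval_min)
    return backup_interval_min


def meets_rpo(
    rpo_min: float,
    backup_interval_min: float,
    archive_interval_min: Optional[float] = None,
) -> bool:
    return worst_case_data_loss_minutes(backup_interval_min, archive_interval_min) <= rpo_min
```

Under this approximation, a backup every 24 hours does not meet a 6-hour RPO (`meets_rpo(360, 1440)` is `False`), hourly incrementals do (`meets_rpo(360, 60)` is `True`), and a 5-minute RPO is only reachable with log archiving (`meets_rpo(5, 1440, 1)` is `True`).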

### Step 2: Select Backup Tools by Use Case

| Use Case | Primary Tool | Alternative | Key Feature |
|----------|-------------|-------------|-------------|
| PostgreSQL production | pgBackRest | WAL-G | PITR, compression, multi-repo |
| MySQL production | Percona XtraBackup | WAL-G | Hot backups, incremental |
| MongoDB | Atlas Backup | mongodump | Continuous backup, PITR |
| Kubernetes cluster | Velero | ArgoCD + Git | PV snapshots, scheduling |
| File/object backup | Restic | Duplicity | Encryption, deduplication |
| Cross-region replication | Aurora Global DB | RDS Read Replica | Low-lag cross-region read replicas, write forwarding |

## Database Backup Patterns

### PostgreSQL with pgBackRest

**Use Case:** Production PostgreSQL with < 5 minute RPO

**Quick Start:** See `examples/postgresql/pgbackrest-config/`

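Configure continuous WAL archiving with full, differential, and incremental backups to S3, GCS, or Azure. Schedule weekly full and daily differential backups. Restore to the end of the archived WAL with `pgbackrest --stanza=main --delta restore`, or to a specific point in time by adding `--type=time` and `--target=<timestamp>`.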

**Detailed Guide:** `references/database-backups.md#postgresql`
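For runbooks that script the restore, the point-in-time command can be assembled programmatically. The sketch below only builds the argument list and does not run anything; `pitr_restore_command` is a hypothetical helper, and the stanza name `main` follows the example above. It uses pgBackRest's `--type=time` and `--target` restore options and normalizes the timestamp to UTC, which is an assumption about how the target should be expressed.

```python
from datetime import datetime, timezone


def pitr_restore_command(stanza: str, target: datetime) -> list[str]:
    """Build the argv for a pgBackRest point-in-time restore (not executed here)."""
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware to avoid ambiguous recovery points")
    # Normalize to UTC and use a PostgreSQL-style timestamp with an explicit offset.
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return [
        "pgbackrest",
        f"--stanza={stanza}",
        "--delta",            # reuse files already in the data directory where possible
        "--type=time",
        f"--target={stamp}",
        "restore",
    ]
```

Passing the list to `subprocess.run(..., check=True)` on the database host, with PostgreSQL stopped, would be one way to execute it; whether to promote automatically once the target is reached is left to the operator.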

### MySQL with Percona XtraBackup

**Use Case:** MySQL production requiring hot backups

**Quick Start:** See `examples/mysql/xtrabackup/`

Perform full (`xtrabackup --backup --parallel=4`) and incremental backups with binary log archiving for PITR. A restore runs these steps in order: decompress, prepare the full backup, apply each incremental, then copy back.

**Detailed Guide:** `references/database-backups.md#mysql`
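The order of the restore steps matters, in particular which prepare calls use `--apply-log-only`. The sketch below lays out the sequence as argument lists without executing them. `xtrabackup_restore_plan` and the directory paths are hypothetical; the flags follow the incremental restore procedure as documented by Percona, where every prepare step except the last keeps `--apply-log-only`.

```python
def xtrabackup_restore_plan(
    full_dir: str,
    incremental_dirs: tuple[str, ...] = (),
    compressed: bool = False,
) -> list[list[str]]:
    """Return the xtrabackup commands for a restore, in execution order."""
    steps: list[list[str]] = []
    if compressed:
        # Every backup directory has to be decompressed before it can be prepared.
        for d in (full_dir, *incremental_dirs):
            steps.append(["xtrabackup", "--decompress", f"--target-dir={d}"])

    prepare = ["xtrabackup", "--prepare"]
    incs = list(incremental_dirs)
    if incs:
        # Keep uncommitted transactions un-rolled-back until the last incremental.
        steps.append(prepare + ["--apply-log-only", f"--target-dir={full_dir}"])
        for d in incs[:-1]:
            steps.append(
                prepare + ["--apply-log-only", f"--target-dir={full_dir}", f"--incremental-dir={d}"]
            )
        # The final incremental is applied without --apply-log-only.
        steps.append(prepare + [f"--target-dir={full_dir}", f"--incremental-dir={incs[-1]}"])
    else:
        steps.append(prepare + [f"--target-dir={full_dir}"])

    # Copy the prepared files into an empty MySQL data directory.
    steps.append(["xtrabackup", "--copy-back", f"--target-dir={full_dir}"])
    return steps
```

After the copy-back step, file ownership usually has to be handed back to the MySQL user before the server starts, and binary logs are replayed afterwards if a point-in-time target is needed.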

### MongoDB Backup

**Quick Start:** Use `mongodump --gzip --numParallelCollections=4` for logical backups or MongoDB Atlas for continuous backup with PITR.

**Detailed Guide:** `references/database-backups.md#mongodb`

## Kubernetes Disaster Recovery

### Velero for Cluster Backups

**Quick Start:** `velero install --provider aws --bucket my-backups`

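The install command also needs the provider plugin (`--plugins`) and credentials (`--secret-file`, or `--no-secret` when using workload identity). Configure scheduled backups (daily full, hourly production namespace) with PV snapshots. Restore with `velero restore create --from-backup <name>`. Velero supports selective restore (namespace mappings, storage class remapping).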

**Examples:** `examples/kubernetes/velero/`
**Detailed Guide:** `references/kubernetes-dr.md`
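The schedule and selective-restore commands described above can be generated from a few parameters. This sketch builds argument lists only and is not taken from the skill's examples directory: the helper names, schedule names, and the 720-hour default TTL are assumptions, while `--schedule`, `--ttl`, `--include-namespaces`, `--from-backup`, and `--namespace-mappings` are standard Velero CLI flags.

```python
from typing import Optional


def velero_schedule_command(
    name: str,
    cron: str,
    namespaces: tuple[str, ...] = (),
    ttl_hours: int = 720,  # assumed retention of 30 days
) -> list[str]:
    """Build `velero schedule create`; no namespaces means the whole cluster."""
    cmd = ["velero", "schedule", "create", name, f"--schedule={cron}", f"--ttl={ttl_hours}h"]
    if namespaces:
        cmd.append("--include-namespaces=" + ",".join(namespaces))
    return cmd


def velero_restore_command(
    backup: str,
    namespace_mappings: Optional[dict[str, str]] = None,
) -> list[str]:
    """Build `velero restore create`, optionally restoring into renamed namespaces."""
    cmd = ["velero", "restore", "create", f"--from-backup={backup}"]
    if namespace_mappings:
        pairs = ",".join(f"{src}:{dst}" for src, dst in namespace_mappings.items())
        cmd.append(f"--namespace-mappings={pairs}")
    return cmd
```

A daily cluster-wide schedule would be `velero_schedule_command("daily-full", "0 2 * * *")`, and an hourly production schedule `velero_schedule_command("hourly-prod", "0 * * * *", ("production",), ttl_hours=72)`. Restoring into a scratch namespace for a drill is `velero_restore_command("daily-full-20240501", {"production": "dr-test"})`, where the backup name is a placeholder.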

### etcd Backup

**Quick Start:** `ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db`

Create periodic etcd snapshots for control plane recovery. A restore recreates the etcd cluster from the snapshot data, so every member must be restored from the same snapshot file.

**Examples:** `examples/kubernetes/etcd/`
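Periodic snapshots need distinct file names and some form of retention, otherwise each run overwrites the previous snapshot or the backup volume fills up. The sketch below shows one possible convention; the `snapshot-<UTC timestamp>.db` naming scheme and both function names are assumptions, not something etcd prescribes.

```python
from datetime import datetime, timezone


def snapshot_name(taken_at: datetime) -> str:
    """Timestamped file name; UTC keeps lexical order equal to chronological order."""
    return taken_at.astimezone(timezone.utc).strftime("snapshot-%Y%m%dT%H%M%SZ.db")


def snapshots_to_prune(names: list[str], keep: int) -> list[str]:
    """Return the oldest snapshot names, leaving the `keep` most recent untouched."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(names)  # valid because names embed a sortable UTC timestamp
    return ordered[: max(len(ordered) - keep, 0)]
```

A cron job could pass `/backups/etcd/` plus `snapshot_name(datetime.now(timezone.utc))` to `etcdctl snapshot save`, then delete whatever `snapshots_to_prune` returns for the files in that directory.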

## Cloud-Specific DR Patterns

### AWS

**Key Services:**
- RDS: Automated backups (up to 35-day retention), PITR, Multi-AZ
- Aurora Global DB: Cross-region active-passive replication with managed, operator-initiated failover
- S3 CRR: Cross-region replication; Replication Time Control (RTC) adds a 15-minute replication SLA

**Examples:** `examples/cloud/aws/`
**Detailed Guide:** `references/cloud-dr-patterns.md#aws`

### GCP

**Key Services:**
- Cloud SQL: PITR with 7-day transaction logs, 30-day retention
- GCS Multi-Regional: Automatic replication across 100+ mile separation
- Regional HA: Synchronous replication within region

**Detailed Guide:** `references/cloud-dr-patterns.md#gcp`

### Azure

**Key Services:**
- Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
- Azure Site Recovery: Cross-region VM re