Skip to main content
ClaudeWave
Skill374 estrellas del repoactualizado 6mo ago

designing-distributed-systems

This Claude Code skill provides comprehensive guidance for architecting distributed systems using proven patterns including CAP and PACELC theorem trade-offs, consistency models spanning from linearizable to eventual, replication strategies such as leader-follower and leaderless approaches, partitioning techniques, transaction patterns like saga and event sourcing, and resilience patterns including circuit breakers. Use when designing microservices architectures, building multi-region systems, selecting between consistency and availability during failures, implementing distributed transactions, or establishing service discovery and inter-service communication protocols.

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/designing-distributed-systems && cp -r /tmp/designing-distributed-systems/skills/designing-distributed-systems ~/.claude/skills/designing-distributed-systems
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# Designing Distributed Systems

Design scalable, reliable, and fault-tolerant distributed systems using proven patterns and consistency models.

## Purpose

Distributed systems are the foundation of modern cloud-native applications. Understanding fundamental trade-offs (CAP theorem, PACELC), consistency models, replication patterns, and resilience strategies is essential for building systems that scale globally while maintaining correctness and availability.

## When to Use This Skill

Apply when:
- Designing microservices architectures with multiple services
- Building systems that must scale across multiple datacenters or regions
- Choosing between consistency vs availability during network partitions
- Selecting replication strategies (single-leader, multi-leader, leaderless)
- Implementing distributed transactions (saga pattern, event sourcing, CQRS)
- Designing partition-tolerant systems with proper consistency guarantees
- Building resilient services with circuit breakers, bulkheads, retries
- Implementing service discovery and inter-service communication

## Core Concepts

### CAP Theorem Fundamentals

**CAP Theorem:** In a distributed system experiencing a network partition, choose between Consistency (C) or Availability (A). Partition tolerance (P) is mandatory.

```
Network partitions WILL occur → Always design for P

During partition:
├─ CP (Consistency + Partition Tolerance)
│  Use when: Financial transactions, inventory, seat booking
│  Trade-off: System unavailable during partition
│  Examples: HBase, MongoDB (default), etcd
│
└─ AP (Availability + Partition Tolerance)
   Use when: Social media, caching, analytics, shopping carts
   Trade-off: Stale reads possible, conflicts need resolution
   Examples: Cassandra, DynamoDB, Riak
```

**PACELC:** Extends CAP to consider normal operations (no partition).
- **If Partition:** Choose Availability (A) or Consistency (C)
- **Else (normal):** Choose Latency (L) or Consistency (C)

### Consistency Models Spectrum

```
Strong Consistency ◄─────────────────────► Eventual Consistency
      │                    │                      │
  Linearizable      Causal Consistency     Convergent
  (Slowest,         (Middle Ground,        (Fastest,
   Most Consistent)  Causally Ordered)     Eventually Consistent)
```

**Strong Consistency (Linearizability):**
- All operations appear atomically in sequential order
- Reads always return most recent write
- Use for: Bank balances, inventory stock, seat booking
- Trade-off: Higher latency, reduced availability

**Eventual Consistency:**
- If no new updates, all replicas eventually converge
- Use for: Social feeds, product catalogs, user profiles, DNS
- Trade-off: Stale reads possible, conflict resolution needed

**Causal Consistency:**
- Causally related operations seen in same order by all nodes
- Use for: Chat apps, collaborative editing, comment threads
- Trade-off: More complex than eventual, requires causality tracking

**Bounded Staleness:**
- Staleness bounded by time or version count
- Use for: Real-time dashboards, leaderboards, monitoring
- Trade-off: Must monitor lag, more complex than eventual

### Replication Patterns

**1. Leader-Follower (Single-Leader):**
- All writes to leader, replicated to followers
- Followers handle reads (load distribution)
- **Synchronous:** Wait for follower ACK (strong consistency, higher latency)
- **Asynchronous:** Don't wait (eventual consistency, possible data loss)
- Use for: Most common pattern, strong consistency with sync replication

**2. Multi-Leader:**
- Multiple leaders accept writes in different datacenters
- Leaders replicate to each other
- **Conflict resolution required:** Last-Write-Wins, application merge, vector clocks
- Use for: Multi-datacenter, low write latency, geo-distributed users
- Trade-off: Conflict resolution complexity

**3. Leaderless (Dynamo-style):**
- No single leader, quorum-based reads/writes
- **Quorum rule:** W + R > N (W=write quorum, R=read quorum, N=replicas)
- Example: N=5, W=3, R=2 → Strong consistency (overlap guaranteed)
- Use for: Maximum availability, partition tolerance
- Trade-off: Complexity, read repair needed

### Partitioning Strategies

**Hash Partitioning (Consistent Hashing):**
- Key → Hash(Key) → Partition assignment
- Even distribution, minimal rebalancing when nodes added/removed
- Use for: Point queries by ID, even distribution critical
- Examples: Cassandra, DynamoDB, Redis Cluster

**Range Partitioning:**
- Key ranges assigned to partitions (A-F, G-M, N-S, T-Z)
- Enables range queries, ordered data
- Risk: Hot spots if data skewed
- Use for: Time-series data, leaderboards, range scans
- Examples: HBase, Bigtable

**Geographic Partitioning:**
- Partition by location (US-East, EU-West, APAC)
- Use for: Data locality, GDPR compliance, low latency
- Examples: Spanner, Cosmos DB

### Resilience Patterns

**Circuit Breaker:**
```
[Closed] → Normal operation
   │ (failures exceed threshold)
   ▼
[Open] → Fail fast (don't call failing service)
   │ (timeout expires)
   ▼
[Half-Open] → Try single request
   │ success → [Closed]
   │ failure → [Open]
```
- Prevents cascading failures
- Fast-fail instead of waiting for timeout
- See references/resilience-patterns.md

**Bulkhead Isolation:**
- Isolate resources (thread pools, connection pools)
- Failure in one partition doesn't affect others
- Like ship compartments preventing total flooding

**Timeout and Retry:**
- **Timeout:** Set deadlines, fail fast if exceeded
- **Retry:** Exponential backoff with jitter
- **Idempotency:** Ensure safe retry (critical)

**Rate Limiting and Backpressure:**
- Protect services from overload
- Token bucket, leaky bucket algorithms
- Backpressure: Signal upstream to slow down

### Transaction Patterns

**Saga Pattern:**
- Coordinate distributed transactions across services
- No distributed 2PC (two-phase commit)

**Choreography:** Services react to events
```
Order Service → OrderCreated event
Payment Service → listens →
administering-linuxSkill

Manage Linux systems covering systemd services, process management, filesystems, networking, performance tuning, and troubleshooting. Use when deploying applications, optimizing server performance, diagnosing production issues, or managing users and security on Linux servers.

ai-data-engineeringSkill

Data pipelines, feature stores, and embedding generation for AI/ML systems. Use when building RAG pipelines, ML feature serving, or data transformations. Covers feature stores (Feast, Tecton), embedding pipelines, chunking strategies, orchestration (Dagster, Prefect, Airflow), dbt transformations, data versioning (LakeFS), and experiment tracking (MLflow, W&B).

architecting-dataSkill

Strategic guidance for designing modern data platforms, covering storage paradigms (data lake, warehouse, lakehouse), modeling approaches (dimensional, normalized, data vault, wide tables), data mesh principles, and medallion architecture patterns. Use when architecting data platforms, choosing between centralized vs decentralized patterns, selecting table formats (Iceberg, Delta Lake), or designing data governance frameworks.

architecting-networksSkill

Design cloud network architectures with VPC patterns, subnet strategies, zero trust principles, and hybrid connectivity. Use when planning VPC topology, implementing multi-cloud networking, or establishing secure network segmentation for cloud workloads.

architecting-securitySkill

Design comprehensive security architectures using defense-in-depth, zero trust principles, threat modeling (STRIDE, PASTA), and control frameworks (NIST CSF, CIS Controls, ISO 27001). Use when designing security for new systems, auditing existing architectures, or establishing security governance programs.

assembling-componentsSkill

Assembles component outputs from AI Design Components skills into unified, production-ready component systems with validated token integration, proper import chains, and framework-specific scaffolding. Use as the capstone skill after running theming, layout, dashboard, data-viz, or feedback skills to wire components into working React/Next.js, Python, or Rust projects.

building-ai-chatSkill

Builds AI chat interfaces and conversational UI with streaming responses, context management, and multi-modal support. Use when creating ChatGPT-style interfaces, AI assistants, code copilots, or conversational agents. Handles streaming text, token limits, regeneration, feedback loops, tool usage visualization, and AI-specific error patterns. Provides battle-tested components from leading AI products with accessibility and performance built in.

building-ci-pipelinesSkill

Constructs secure, efficient CI/CD pipelines with supply chain security (SLSA), monorepo optimization, caching strategies, and parallelization patterns for GitHub Actions, GitLab CI, and Argo Workflows. Use when setting up automated testing, building, or deployment workflows.