Skill · 1.8k repo stars · updated 5d ago

system-design

The system-design skill provides a structured four-step framework for architecting distributed systems, covering requirements clarification, high-level design proposals, deep dives into critical components, and tradeoff analysis. Use it when designing new services, preparing for system design interviews, estimating capacity for scaling challenges, or choosing between architectural patterns like microservices versus monoliths for high-availability systems.

Repository: skills

Install in Claude Code

git clone --depth 1 https://github.com/wondelai/skills /tmp/system-design && cp -r /tmp/system-design/system-design ~/.claude/skills/system-design

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# System Design Framework

A structured approach to designing large-scale distributed systems. Apply these principles when architecting new services, reviewing designs, estimating capacity, or preparing for system design discussions.

## Core Principle

**Start with requirements, not solutions.** Jumping to architecture before understanding constraints produces over- or under-engineered systems. Scalable systems are assembled from well-understood building blocks (load balancers, caches, queues, databases, CDNs) — the skill lies in choosing the right blocks, sizing them with estimates, and owning the tradeoffs each choice introduces.

## Scoring

**Goal: 10/10.** Rate any system design 0-10: a 10/10 states requirements explicitly, includes back-of-the-envelope estimates, uses appropriate building blocks, addresses scaling and reliability, and acknowledges tradeoffs. Always state the current score and the specific improvements needed to reach 10/10.

## The System Design Framework

Six areas for building reliable, scalable distributed systems:

### 1. The Four-Step Process

**Core concept:** Every design follows four stages: (1) understand the problem and establish scope, (2) propose a high-level design and get buy-in, (3) dive deep into critical components, (4) wrap up with tradeoffs and future improvements.

**Why it works:** Without structure, designs either stay too abstract or get lost in premature detail. The four steps invest time proportionally — broad strokes first, depth where it matters.

**Key insights:**
- Step 1 (~5-10 min): clarifying questions, functional and non-functional requirements, agreed scale (DAU, QPS, storage)
- Step 2 (~15-20 min): high-level diagram with APIs, services, data stores, data flow arrows
- Step 3 (~15-20 min): design the 2-3 hardest or most critical components in detail
- Step 4 (~5 min): tradeoffs, bottlenecks, future improvements
- Never skip Step 1 — ambiguous scope wastes all downstream effort; get explicit agreement on assumptions

**Code applications:**

| Context | Pattern | Example |
|---------|---------|---------|
| **New service kickoff** | One-page design doc covering all four steps before coding | Requirements, API contract, data model, capacity estimate, then implementation |
| **Architecture review** | Walk reviewers through the steps sequentially | Scope, diagram, deep-dive on riskiest component, open questions |
| **Incident postmortem** | Trace the failure through the four-step lens | Which requirement was missed? Which block failed? What tradeoff bit us? |

See: [references/four-step-process.md](references/four-step-process.md)

### 2. Back-of-the-Envelope Estimation

**Core concept:** Use powers of two, latency numbers, and simple arithmetic to estimate QPS, storage, bandwidth, and server count before committing to an architecture.

**Why it works:** Estimation prevents over-provisioning (wasted money) and under-provisioning (outages under load). A 2-minute calculation can save weeks of rework.

**Key insights:**
- Powers of two: 2^10 ≈ 1 thousand, 2^20 ≈ 1 million, 2^30 ≈ 1 billion, 2^40 ≈ 1 trillion
- Latency: memory read ~100 ns, SSD read ~100 us, disk seek ~10 ms, same-datacenter round trip ~0.5 ms, cross-continent ~150 ms
- Availability nines: 99.9% = 8.77 hours downtime/year; 99.99% = 52.6 minutes/year
- QPS: DAU x actions-per-day / 86,400 seconds; peak is typically 2-5x average
- Storage: records-per-day x record-size x retention
- Round aggressively — the goal is order of magnitude, not precision

**Code applications:**

| Context | Pattern | Example |
|---------|---------|---------|
| **Capacity planning** | Estimate average QPS, then multiply by a peak factor | 100M DAU x 5 actions / 86,400 s = ~5,800 QPS avg, ~30K peak at 5x |
| **Storage budgeting** | Per-record size x volume x retention | 500M tweets/day x 300 bytes x 365 days = ~55 TB/year |
| **SLA definition** | Convert nines to allowed downtime | Four nines = ~52 minutes downtime per year |

See: [references/estimation-numbers.md](references/estimation-numbers.md)
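
The three table rows above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original skill: the function names are hypothetical, the 5x peak factor is the upper end of the 2-5x range quoted above, and a year is assumed to be 365.25 days, which matches the 8.77 hours and 52.6 minutes figures.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def average_qps(dau: int, actions_per_user_per_day: float) -> float:
    """QPS = DAU x actions-per-day / 86,400 seconds."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float = 5.0) -> float:
    """Peak is typically 2-5x average; 5x is used here as a worst case."""
    return avg_qps * peak_factor


def storage_bytes(records_per_day: int, record_size_bytes: int, retention_days: int) -> int:
    """Storage = records-per-day x record-size x retention."""
    return records_per_day * record_size_bytes * retention_days


def allowed_downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability target such as 99.99 into minutes of downtime per year."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    avg = average_qps(100_000_000, 5)
    print(f"avg QPS:  {avg:,.0f}")             # avg QPS:  5,787
    print(f"peak QPS: {peak_qps(avg):,.0f}")   # peak QPS: 28,935
    tb = storage_bytes(500_000_000, 300, 365) / 1e12
    print(f"storage:  {tb:.2f} TB/year")       # storage:  54.75 TB/year
    print(f"99.99%:   {allowed_downtime_minutes_per_year(99.99):.1f} min/year")    # 52.6
    print(f"99.9%:    {allowed_downtime_minutes_per_year(99.9) / 60:.2f} h/year")  # 8.77
```

Rounding these results to ~5,800 QPS, ~30K peak, and ~55 TB/year is in keeping with the order-of-magnitude goal.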

### 3. Building Blocks

**Core concept:** Scalable systems are assembled from a standard toolkit: DNS, CDN, load balancers, reverse proxies, application servers, caches, message queues, and consistent hashing.

**Why it works:** Each block solves a specific scaling or reliability problem. Knowing when and why to introduce each prevents both premature complexity and avoidable bottlenecks.

**Key insights:**
- Load balancers: L4 (transport layer — fast, simple) vs L7 (application layer — content-aware routing)
- Cache layers: client, CDN, web server, application (Redis/Memcached), database query cache
- Cache strategies: cache-aside (app manages), read-through, write-through (synchronous), write-behind (asynchronous)
- Message queues (Kafka, RabbitMQ, SQS): decouple producers from consumers, absorb spikes, enable async processing
- Consistent hashing: distributes keys across nodes with minimal redistribution when nodes change
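
As an illustration of the cache-aside strategy listed above, here is a minimal sketch in which the application checks the cache, falls back to the database on a miss, and invalidates the entry on write. Everything in it is hypothetical: `TTLCache` is an in-memory stand-in for Redis or Memcached, a plain dict plays the database, and `ProfileStore` is an invented name. A real deployment would also need to consider concurrent writers and cache stampedes.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for Redis/Memcached (assumption: single process)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._items[key] = (self._clock() + self._ttl, value)

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


class ProfileStore:
    """Cache-aside: the application, not the cache, talks to the database."""

    def __init__(self, database: Dict[str, dict], cache: TTLCache) -> None:
        self.database = database
        self.cache = cache
        self.db_reads = 0  # counter kept only to make cache hits observable

    def get_profile(self, user_id: str) -> Optional[dict]:
        cached = self.cache.get(user_id)
        if cached is not None:
            return cached                      # cache hit
        self.db_reads += 1                     # cache miss: read the database
        profile = self.database.get(user_id)
        if profile is not None:
            self.cache.set(user_id, profile)   # populate for later reads
        return profile

    def update_profile(self, user_id: str, profile: dict) -> None:
        self.database[user_id] = profile       # write to the source of truth
        self.cache.delete(user_id)             # invalidate on write
```

Deleting the entry on write, rather than overwriting it, keeps the database as the single source of truth; the next read repopulates the cache.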

**Code applications:**

| Context | Pattern | Example |
|---------|---------|---------|
| **Read-heavy workload** | Cache-aside Redis in front of database | Cache user profiles with TTL; invalidate on write |
| **Traffic spikes** | Message queue between API and workers | Enqueue image-resize jobs; workers pull at their own pace |
| **Global users** | CDN for static assets | Serve JS/CSS/images from edge; origin serves only API |
| **Uneven load** | Consistent hashing for shard assignment | Adding a node moves only ~1/n of the keys (n = node count after the addition) |

See: [references/building-blocks.md](references/building-blocks.md)
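
The claim that consistent hashing limits key movement can be checked with a small experiment. The sketch below is one common way to build a hash ring, using virtual nodes and a sorted list of ring positions; it is not taken from the skill's reference files. `HashRing`, `count_moved_keys`, the choice of SHA-256, and the default of 100 virtual nodes per server are all assumptions made for illustration.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List, Sequence, Tuple


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str] = (), vnodes: int = 100) -> None:
        self._vnodes = vnodes            # virtual nodes per server smooth the distribution
        self._points: List[int] = []     # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> server name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        if not self._points:
            raise ValueError("ring has no nodes")
        # First ring position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


def count_moved_keys(keys: Sequence[str], ring: HashRing, new_node: str) -> Tuple[int, int]:
    """Add new_node and return (keys that changed owner, keys now owned by new_node)."""
    before = {key: ring.node_for(key) for key in keys}
    ring.add_node(new_node)
    moved = [key for key in keys if ring.node_for(key) != before[key]]
    moved_to_new = sum(1 for key in moved if ring.node_for(key) == new_node)
    return len(moved), moved_to_new


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(10_000)]
    ring = HashRing(["node-a", "node-b", "node-c"])
    moved, moved_to_new = count_moved_keys(keys, ring, "node-d")
    # Expect roughly a quarter of the keys to move, all of them to node-d.
    print(f"moved {moved / len(keys):.1%} of keys; {moved_to_new} landed on node-d")
```

With a naive `hash(key) % n` assignment, going from three to four nodes would remap most keys; here only the keys that fall just before the new node's ring positions change owner.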

### 4. Database Design and Scaling

**Core concept:** Choose SQL vs NoSQL based on data shape and access patterns; scale vertically first, then horizontally (replication and sharding) when vertical limits are reached.

**Why it works:** The database is usually the first bottleneck. Understanding replication, sharding, and denormalization tradeoffs delays expensive re-architectures and makes growth deliberate.

**K
