observability-sre

Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.

Install in Claude Code
git clone --depth 1 https://github.com/majiayu000/spellbook /tmp/observability-sre && cp -r /tmp/observability-sre/skills/observability-sre ~/.claude/skills/observability-sre
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Observability & Site Reliability Engineering

## Core Principles

- **Three Pillars** — Metrics, Logs, and Traces provide holistic visibility
- **Observability-First** — Build systems that explain their own behavior
- **SLO-Driven** — Define reliability targets that matter to users
- **Proactive Detection** — Find issues before customers do
- **Blameless Culture** — Learn from failures without blame
- **Automate Toil** — Reduce repetitive operational work
- **Continuous Improvement** — Each incident makes systems more resilient
- **Full-Stack Visibility** — Monitor from infrastructure to business metrics

---

## Hard Rules (Must Follow)

> These rules are mandatory. Violating them means the skill is not working correctly.

### Symptom-Based Alerts Only

**Alert on user-facing symptoms, not internal infrastructure metrics.**

```yaml
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70     # percent
  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80  # percent
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```

### Low Cardinality Labels

**Loki/Prometheus labels must have low cardinality: each label should take only a handful of distinct values (roughly 10 or fewer).**

```yaml
# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"      # Millions of values!
  order_id: "ord_456"     # Millions of values!
  request_id: "req_789"   # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values

# High cardinality data goes in the log body (JSON fields), not in labels:
# logger.info({
#   user_id: "usr_123",      // In JSON body, not label
#   order_id: "ord_456",
# }, "Order processed");
```
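
One way to enforce this rule in application code is to split each log record into a small allowlisted label set and a body that carries everything else. The sketch below is illustrative only: `LABEL_ALLOWLIST` and `splitLabels` are hypothetical names, not part of Loki or any logging library.

```typescript
// Hypothetical allowlist of label keys known to have few distinct values.
const LABEL_ALLOWLIST = new Set(["namespace", "app", "level", "method"]);

type Fields = Record<string, string>;

// Split a flat record into index labels (allowlisted keys) and body fields
// (everything else, including high-cardinality IDs).
function splitLabels(fields: Fields): { labels: Fields; body: Fields } {
  const labels: Fields = {};
  const body: Fields = {};
  for (const [key, value] of Object.entries(fields)) {
    if (LABEL_ALLOWLIST.has(key)) {
      labels[key] = value;
    } else {
      body[key] = value;
    }
  }
  return { labels, body };
}

const { labels, body } = splitLabels({
  app: "api-server",
  level: "error",
  user_id: "usr_123",
  order_id: "ord_456",
});
console.log(labels); // holds app and level only
console.log(body);   // holds user_id and order_id
```

An allowlist fails safe: a newly added field lands in the body unless someone deliberately promotes it to a label.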

### SLO-Based Error Budgets

**Every service must have defined SLOs with error budget tracking.**

```yaml
# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes of downtime per 30-day month

groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
```
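
The error budget figure in the comment above is 0.1% of a 30-day window (43,200 minutes). A short sketch of that arithmetic, together with the burn-rate ratio that budget alerts are commonly built on, might look like this. The function names are illustrative, and burn rate is taken here as the observed error ratio divided by the budgeted error ratio, so a burn rate of 1 spends the budget exactly over the window.

```typescript
// Allowed downtime in minutes for a given SLO target over a window of days.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Burn rate: observed error ratio relative to the budgeted error ratio.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

const budget = errorBudgetMinutes(0.999, 30);
console.log(budget.toFixed(1)); // prints 43.2

const rate = burnRate(0.005, 0.999);
console.log(rate.toFixed(1)); // prints 5.0
```

At a sustained burn rate of 5, a 30-day budget would be exhausted in about 6 days.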

### Trace Context in Logs

**All logs must include trace_id for correlation with distributed traces.**

```typescript
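import { trace } from "@opentelemetry/api";
// `logger` is assumed to be a pino-style structured logger.

// ❌ FORBIDDEN: Logs without trace context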
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
```
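
The IDs in the sample output above are shortened for readability. On the wire, trace and span IDs normally arrive in the W3C `traceparent` header (`version-traceid-parentid-flags`, with a 32-hex-character trace ID and a 16-hex-character parent span ID). As a rough sketch with no OpenTelemetry dependency, the hypothetical `parseTraceparent` helper below (not an OpenTelemetry API) extracts the fields a log line would need. In practice the OpenTelemetry SDK's propagators handle this for you.

```typescript
interface TraceFields {
  trace_id: string;
  span_id: string;
}

// Matches version 00 of the W3C traceparent header.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/;

function parseTraceparent(header: string): TraceFields | undefined {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return undefined;
  const [, traceId, spanId] = match;
  // All-zero IDs are defined as invalid by the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { trace_id: traceId, span_id: spanId };
}

const fields = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
// Emits a JSON log line carrying trace_id, span_id, and msg.
console.log(JSON.stringify({ ...fields, msg: "Payment processed" }));
```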

---

## Quick Reference

### When to Use What

| Scenario | Tool/Pattern | Reason |
|----------|--------------|--------|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, so storage is typically much cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

### The Three Pillars

| Pillar | What | When | Tools |
|--------|------|------|-------|
| **Metrics** | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| **Logs** | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| **Traces** | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

**Fourth Pillar (Emerging):** Continuous Profiling — Code-level performance data (CPU, memory usage at function level)

---

## Observability Architecture

### Layered Prometheus Setup

```yaml
# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
        - 'cluster-prom-us-east.internal:9090'
        - 'cluster-prom-eu-west.internal:9090'
```
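
For orientation, the scrape job above amounts to an HTTP GET against each cluster's `/federate` endpoint with one `match[]` query parameter per selector. The sketch below builds that URL with the standard `URL` API so the request can be reproduced by hand when debugging federation. `buildFederateUrl` is an illustrative helper, and plain `http` is an assumption since the config above does not set a scheme.

```typescript
function buildFederateUrl(target: string, selectors: string[]): string {
  // Assumption: plain HTTP, matching Prometheus' default scrape scheme.
  const url = new URL(`http://${target}/federate`);
  for (const selector of selectors) {
    // One repeated match[] parameter per series selector.
    url.searchParams.append("match[]", selector);
  }
  return url.toString();
}

const federateUrl = buildFederateUrl("cluster-prom-us-east.internal:9090", [
  '{job="kubernetes-nodes"}',
  '{__name__=~"job:.*"}',
]);
// Prints the percent-encoded URL the global Prometheus would request.
console.log(federateUrl);
```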

### Recording Rules for Performance

```yaml
# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /