Skill · 241 repo stars · updated 11 days ago

observability-sre

This Claude Code skill provides guidance on implementing observability and Site Reliability Engineering practices. Use it when designing monitoring and alerting systems, setting up logging and distributed tracing infrastructure, defining service-level objectives (SLOs) with error budgets, or conducting incident response and post-mortems. It emphasizes alert design based on user-facing symptoms rather than infrastructure metrics, proper data cardinality management, and establishing reliability targets tied to business outcomes.

Repository: spellbook

Install in Claude Code

git clone --depth 1 https://github.com/majiayu000/spellbook /tmp/observability-sre && cp -r /tmp/observability-sre/skills/observability-sre ~/.claude/skills/observability-sre

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Observability & Site Reliability Engineering

## Core Principles

- **Three Pillars** — Metrics, Logs, and Traces provide holistic visibility
- **Observability-First** — Build systems that explain their own behavior
- **SLO-Driven** — Define reliability targets that matter to users
- **Proactive Detection** — Find issues before customers do
- **Blameless Culture** — Learn from failures without blame
- **Automate Toil** — Reduce repetitive operational work
- **Continuous Improvement** — Each incident makes systems more resilient
- **Full-Stack Visibility** — Monitor from infrastructure to business metrics

---

## Hard Rules (Must Follow)

> These rules are mandatory. Violating them means the skill is not working correctly.

### Symptom-Based Alerts Only

**Alert on user-facing symptoms, not internal infrastructure metrics.**

```yaml
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70  # percent
  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80  # percent
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```
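
To make the two symptom thresholds above concrete, here is a minimal sketch of the same checks applied to raw request samples. The `RequestSample` shape and the function names are hypothetical, and the percentile uses the simple nearest-rank method rather than the histogram interpolation Prometheus would apply, so treat it as an illustration of the idea only.

```typescript
// Hypothetical request sample; field names are illustrative only.
interface RequestSample {
  durationSeconds: number;
  status: number;
}

// Nearest-rank percentile: smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating-point surprises in the rank.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Applies the same thresholds as the alert rules above (200 ms p95, 0.1% errors).
function evaluateSymptoms(samples: RequestSample[]) {
  const p95 = percentile(samples.map((s) => s.durationSeconds), 95);
  const errors = samples.filter((s) => s.status >= 500).length;
  const errorRate = samples.length === 0 ? 0 : errors / samples.length;
  return {
    p95,
    errorRate,
    latencyAlert: p95 > 0.2,
    errorAlert: errorRate > 0.001,
  };
}
```

Both checks describe what a user would notice (slow or failed requests), which is why neither needs to know anything about CPU or memory.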

### Low Cardinality Labels

**Loki/Prometheus labels must have low cardinality: each label should take a small, bounded set of values (roughly 10 or fewer).**

```yaml
# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"      # Millions of values!
  order_id: "ord_456"     # Millions of values!
  request_id: "req_789"   # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values

# High-cardinality data goes in the log body (application code, not label config):
#   logger.info({
#     user_id: "usr_123",    // In JSON body, not label
#     order_id: "ord_456",
#   }, "Order processed");
```
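
One way to enforce this rule in application code is to route fields through an allowlist before they reach the log shipper. The sketch below assumes a hypothetical `ALLOWED_LABELS` set built from the low-cardinality examples above; the function name and field shapes are illustrative and not part of Loki's or Prometheus's API.

```typescript
// Assumed allowlist of label keys, taken from the low-cardinality examples above.
const ALLOWED_LABELS = new Set(["namespace", "app", "level", "method"]);

type Fields = Record<string, string>;

// Splits fields into index labels (allowlisted) and log-body fields (everything else).
function splitLabels(fields: Fields): { labels: Fields; body: Fields } {
  const labels: Fields = {};
  const body: Fields = {};
  for (const [key, value] of Object.entries(fields)) {
    if (ALLOWED_LABELS.has(key)) {
      labels[key] = value;
    } else {
      body[key] = value; // e.g. user_id, order_id, request_id
    }
  }
  return { labels, body };
}
```

An allowlist fails safe: a newly added field such as `request_id` lands in the body by default instead of silently creating a new high-cardinality label.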

### SLO-Based Error Budgets

**Every service must have defined SLOs with error budget tracking.**

```yaml
# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes of downtime per 30-day month

groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
```
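
The budget arithmetic in the comment above can be checked with a short sketch. The function names are hypothetical. The burn rate here is the usual ratio of the observed error ratio to the allowed error ratio, where a value of 1 means the budget would be spent exactly at the end of the window.

```typescript
// Allowed downtime in minutes for a given SLO target over a window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Burn rate: how many times faster than "sustainable" the budget is being consumed.
// 1.0 means the budget lasts exactly the window; 2.0 means it is gone in half the window.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

// 99.9% over 30 days: 0.001 * 43,200 minutes, approximately 43.2 minutes.
const budget = errorBudgetMinutes(0.999, 30);

// 0.2% of requests failing against a 99.9% target burns budget at about 2x.
const rate = burnRate(0.002, 0.999);
```

Expressing alerts as a burn-rate multiple rather than a raw availability threshold makes the same rule reusable across services with different SLO targets.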

### Trace Context in Logs

**All logs must include trace_id for correlation with distributed traces.**

```typescript
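import { trace } from "@opentelemetry/api";
// `logger` is assumed to be a pino-style structured logger (object first, message second).

// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();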
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
```
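
Repeating the `trace_id` and `span_id` lookup at every call site is easy to forget, so a small helper can centralize it. The sketch below is an assumption-laden illustration: `SpanLike` is a hypothetical structural stand-in for the part of an OpenTelemetry span used here (`spanContext()` returning `traceId` and `spanId`), and `withTraceContext` is not a library function.

```typescript
// Hypothetical structural stand-in for the part of an OpenTelemetry Span used here.
interface SpanLike {
  spanContext(): { traceId: string; spanId: string };
}

// Returns a copy of `fields` with trace correlation IDs added when a span is active.
function withTraceContext(
  span: SpanLike | undefined,
  fields: Record<string, unknown>,
): Record<string, unknown> {
  if (!span) return { ...fields }; // No active span: log without correlation IDs.
  const { traceId, spanId } = span.spanContext();
  return { trace_id: traceId, span_id: spanId, ...fields };
}
```

With the real API this would be called as `logger.info(withTraceContext(trace.getActiveSpan(), { order_id: "ord_123" }), "Payment processed")`, assuming a logger that accepts a fields object first.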

---

## Quick Reference

### When to Use What

| Scenario | Tool/Pattern | Reason |
|----------|--------------|--------|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, so it is often far cheaper (commonly cited as ~10x) than full-text indexing |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

### The Three Pillars

| Pillar | What | When | Tools |
|--------|------|------|-------|
| **Metrics** | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| **Logs** | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| **Traces** | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

**Fourth Pillar (Emerging):** Continuous Profiling — Code-level performance data (CPU, memory usage at function level)

---

## Observability Architecture

### Layered Prometheus Setup

```yaml
# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
        - 'cluster-prom-us-east.internal:9090'
        - 'cluster-prom-eu-west.internal:9090'
```
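
The scrape job above translates into an HTTP request against each cluster Prometheus's `/federate` endpoint with one `match[]` parameter per selector. As a rough sketch for checking selectors by hand, the hypothetical helper below builds that URL. The host name is taken from the example config and the function name is illustrative.

```typescript
// Builds the /federate URL that a global Prometheus would scrape for the given selectors.
function federateUrl(baseUrl: string, matchers: string[]): string {
  const url = new URL("/federate", baseUrl);
  for (const matcher of matchers) {
    // Each selector becomes its own match[] query parameter (URL-encoded).
    url.searchParams.append("match[]", matcher);
  }
  return url.toString();
}

const exampleUrl = federateUrl("http://cluster-prom-us-east.internal:9090", [
  '{job="kubernetes-nodes"}',
  '{__name__=~"job:.*"}',
]);
```

Fetching the resulting URL with any HTTP client returns the current samples for the matched series, which is a quick way to confirm that a selector matches only the recording rules intended for the global layer.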

### Recording Rules for Performance

```yaml
# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
```
