observability-sre
Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.
git clone --depth 1 https://github.com/majiayu000/spellbook /tmp/observability-sre && cp -r /tmp/observability-sre/skills/observability-sre ~/.claude/skills/observability-sre
# Observability & Site Reliability Engineering
## Core Principles
- **Three Pillars** — Metrics, Logs, and Traces provide holistic visibility
- **Observability-First** — Build systems that explain their own behavior
- **SLO-Driven** — Define reliability targets that matter to users
- **Proactive Detection** — Find issues before customers do
- **Blameless Culture** — Learn from failures without blame
- **Automate Toil** — Reduce repetitive operational work
- **Continuous Improvement** — Each incident makes systems more resilient
- **Full-Stack Visibility** — Monitor from infrastructure to business metrics
---
## Hard Rules (Must Follow)
> These rules are mandatory. Violating them means the skill is not working correctly.
### Symptom-Based Alerts Only
**Alert on user-facing symptoms, not internal infrastructure metrics.**
```yaml
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70
  # Users don't care about CPU, they care about latency
- alert: MemoryHigh
  expr: memory_usage > 80
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"
- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```
### Low Cardinality Labels
**Loki/Prometheus labels must have low cardinality: each label should take only a small, bounded set of values (on the order of ten or fewer).**
```yaml
# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"       # Millions of values!
  order_id: "ord_456"      # Millions of values!
  request_id: "req_789"    # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values

# High cardinality data goes in the log body, not in labels:
#   logger.info({ user_id: "usr_123", order_id: "ord_456" }, "Order processed");
```
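One way to apply this rule in application code is to route every field through an allow-list before it becomes a label. The sketch below is illustrative only: `ALLOWED_LABELS` and `splitLabels` are hypothetical names, not part of Loki, Prometheus, or any logging library, and the allow-list simply mirrors the low-cardinality keys shown above.

```typescript
// Hypothetical allow-list: only keys with a small, bounded set of values.
const ALLOWED_LABELS = new Set(["namespace", "app", "level", "method"]);

type Fields = Record<string, string>;

// Split fields into indexed labels (allow-listed keys) and log body (everything else).
function splitLabels(input: Fields): { labels: Fields; body: Fields } {
  const labels: Fields = {};
  const body: Fields = {};
  for (const key of Object.keys(input)) {
    if (ALLOWED_LABELS.has(key)) {
      labels[key] = input[key];
    } else {
      body[key] = input[key];
    }
  }
  return { labels, body };
}

const splitResult = splitLabels({
  app: "api-server",
  level: "error",
  user_id: "usr_123",
  order_id: "ord_456",
});
console.log(splitResult.labels); // app and level stay as labels
console.log(splitResult.body);   // user_id and order_id move to the body
```

An allow-list fails safe: a new high-cardinality field added by a developer lands in the body by default instead of silently creating new label values.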
### SLO-Based Error Budgets
**Every service must have defined SLOs with error budget tracking.**
```yaml
# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime
groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
```
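The budget arithmetic in the comments above can be checked with a few lines of code. This is a minimal sketch, assuming a 30-day month; `errorBudgetMinutes` and `burnRate` are hypothetical helper names used only for illustration.

```typescript
// Allowed downtime in minutes for an SLO target over a window of `days`.
function errorBudgetMinutes(sloTarget: number, days: number): number {
  return (1 - sloTarget) * days * 24 * 60;
}

// Burn rate: observed error ratio divided by the error ratio the SLO allows.
// A value of 1 means the budget is used up exactly at the end of the window.
function burnRate(availability: number, sloTarget: number): number {
  return (1 - availability) / (1 - sloTarget);
}

console.log(errorBudgetMinutes(0.999, 30).toFixed(1)); // "43.2"
console.log(burnRate(0.995, 0.999).toFixed(1));        // "5.0"
```

Read this way, the `ErrorBudgetBurnRate` alert above (availability below 0.999) corresponds to a burn rate greater than 1 over the 5-minute window.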
### Trace Context in Logs
**All logs must include trace_id for correlation with distributed traces.**
```typescript
import { trace } from "@opentelemetry/api";
// `logger` is assumed to be a pino-style structured logger created elsewhere.

// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
```
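To avoid repeating the span lookup at every call site, the correlation fields can be merged in by a small helper. The sketch below is an assumption-laden illustration, not an OpenTelemetry API: `withTraceContext` and `SpanContextLike` are hypothetical, and the span lookup is injected as a function so the example runs without any tracing library installed. In a real service the injected function might wrap `trace.getActiveSpan()?.spanContext()`.

```typescript
// Minimal shape of a span context; mirrors the traceId/spanId fields used above.
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

type LogFields = Record<string, unknown>;

// Hypothetical helper: merges trace identifiers into the fields of a log record.
function withTraceContext(
  getSpanContext: () => SpanContextLike | undefined,
  logFields: LogFields,
): LogFields {
  const ctx = getSpanContext();
  if (!ctx) {
    return logFields; // no active span: log without correlation ids
  }
  return { trace_id: ctx.traceId, span_id: ctx.spanId, ...logFields };
}

const enriched = withTraceContext(
  () => ({ traceId: "abc123", spanId: "def456" }),
  { order_id: "ord_123" },
);
console.log(JSON.stringify(enriched));
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123"}
```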
---
## Quick Reference
### When to Use What
| Scenario | Tool/Pattern | Reason |
|----------|--------------|--------|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, so storage and indexing are typically much cheaper than full-text search |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |
### The Three Pillars
| Pillar | What | When | Tools |
|--------|------|------|-------|
| **Metrics** | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| **Logs** | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| **Traces** | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |
**Fourth Pillar (Emerging):** Continuous Profiling — Code-level performance data (CPU, memory usage at function level)
---
## Observability Architecture
### Layered Prometheus Setup
```yaml
# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down
# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)
# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level
# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level
# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
          - 'cluster-prom-us-east.internal:9090'
          - 'cluster-prom-eu-west.internal:9090'
```
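When debugging federation, it can help to see the request the global Prometheus would send to a cluster instance, so the same URL can be fetched by hand. The sketch below builds that URL from the `match[]` selectors in the config above; `federateUrl` is a hypothetical helper, and the `http://` scheme is an assumption since the config lists only host and port.

```typescript
// Build a /federate URL with one match[] parameter per selector.
// Selector values are percent-encoded; the parameter name is left as-is.
function federateUrl(base: string, selectors: string[]): string {
  const query = selectors
    .map((selector) => `match[]=${encodeURIComponent(selector)}`)
    .join("&");
  return `${base}/federate?${query}`;
}

console.log(
  federateUrl("http://cluster-prom-us-east.internal:9090", [
    '{job="kubernetes-nodes"}',
    '{__name__=~"job:.*"}',
  ]),
);
```

Fetching the printed URL directly shows which series each selector exposes, which is a quick way to confirm that only recording-rule output reaches the global layer.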
### Recording Rules for Performance
```yaml
# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)
      # Error rate (share of requests returning 5xx)
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
```