
monitoring-setup-guide

This Claude Code skill generates complete monitoring setup guides for production services, specifying metrics, alerts, logs, traces, and dashboards. Use it when asked to establish monitoring for a new service, define or review alerting strategies, create observability plans, design dashboards, or document logging standards for a team. Outputs include metric definition tables, alert rule specifications with actionable thresholds, dashboard wireframes, log schemas, distributed tracing checklists, and monitoring gap analyses that ensure on-call engineers have a single source of truth for service health.

Repository: pm-claude-skills

Install in Claude Code


git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/monitoring-setup-guide && cp -r /tmp/monitoring-setup-guide/plugins/pm-engineering/skills/monitoring-setup-guide ~/.claude/skills/monitoring-setup-guide

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Monitoring Setup Guide Skill

Produce a complete monitoring setup guide for a service — defining exactly what to measure, how to structure logs, how to configure alerts with actionable thresholds, and how to build dashboards that answer real operational questions. A good monitoring guide eliminates "we don't know what's happening in production" as a root cause category, and gives on-call engineers a single source of truth for what healthy looks like.

## Required Inputs

Ask for these if not already provided:
- **Service name and description** — what the service does and its role in the system
- **Tech stack** — language, framework, and infrastructure (e.g. Go/gRPC on Kubernetes, Python/FastAPI on ECS)
- **Current monitoring tooling** — Datadog, Prometheus + Grafana, CloudWatch, New Relic, Honeycomb, or none yet
- **Key user journeys** — the 2–4 most important things a user or consumer does with the service (these drive what to alert on)
- **Existing alerts** — paste any existing alert configurations or describe what's currently monitored

## Output Format

---

# Monitoring Setup Guide: [Service Name]

**Team:** [Team name] | **Tech lead:** [Name]
**Stack:** [Language/Framework] on [Infrastructure]
**Monitoring platform:** [Datadog / Prometheus+Grafana / CloudWatch / etc.]
**Date:** [Date] | **Review cycle:** Quarterly

---

## 1. Monitoring Philosophy

Good monitoring answers three questions:
1. **Is the service healthy right now?** (alerting)
2. **Was it healthy in the past, and is it trending worse?** (dashboards + SLO tracking)
3. **Why did something fail?** (logs + traces)

This guide defines the answers for [Service Name]. Every alert must be actionable — if an on-call engineer cannot take a specific action in response to the alert, the alert should not exist.

**Key user journeys monitored:**
- Journey 1: [e.g. "User submits a payment — POST /charges, receives confirmation"]
- Journey 2: [e.g. "User views transaction history — GET /transactions"]
- Journey 3: [e.g. "Subscription renewal job runs — background worker processes billing events"]

---

## 2. The Four Golden Signals

Apply the four golden signals specifically to [Service Name]:

### Latency

Latency measures how long requests take to complete. Track it separately for successful and failed requests: failures that return quickly pull aggregate latency down and can mask slow successes, while slow failures are a problem in their own right.

| Metric | Description | Source | Dimensions |
|---|---|---|---|
| `[service].request.duration_ms` | End-to-end request latency | Application instrumentation | `endpoint`, `method`, `status_code` |
| `[service].db.query_duration_ms` | Database query latency | ORM / query instrumentation | `query_name`, `table` |
| `[service].external.request_duration_ms` | Outbound call latency to dependencies | HTTP client instrumentation | `target_service`, `endpoint` |
| `[service].queue.processing_duration_ms` | Time to process one message (if applicable) | Consumer instrumentation | `queue_name`, `message_type` |

**Latency SLO targets:**

| Endpoint / operation | p50 target | p95 target | p99 target |
|---|---|---|---|
| `GET /api/v1/[resource]` | < [50] ms | < [200] ms | < [500] ms |
| `POST /api/v1/[resource]` | < [100] ms | < [400] ms | < [1000] ms |
| `GET /health` | < [10] ms | < [20] ms | < [50] ms |
| [Background job name] | < [5] sec | < [15] sec | < [60] sec |
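
One way to check the percentile targets above is sketched below. It is a minimal illustration, not the platform's own percentile implementation: the function names, the nearest-rank method, and the sample latencies are all assumptions chosen for the example. It also shows why successful and failed requests are measured separately.

```python
import math
from typing import Dict, List, Tuple


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_latency_slo(
    samples_ms: List[float], targets_ms: Dict[int, float]
) -> Dict[int, Tuple[float, float, bool]]:
    """Map each percentile to (observed_ms, target_ms, within_target)."""
    report = {}
    for pct, target in targets_ms.items():
        observed = percentile(samples_ms, pct)
        report[pct] = (observed, target, observed < target)
    return report


# Hypothetical latencies in milliseconds, split by request outcome.
successes_ms = [12, 15, 18, 20, 22, 25, 30, 45, 120, 480]
failures_ms = [3, 4, 5, 900]

# Placeholder targets taken from the GET row of the table: p50 < 50, p95 < 200, p99 < 500.
targets = {50: 50, 95: 200, 99: 500}

success_report = check_latency_slo(successes_ms, targets)
aggregate_p50 = percentile(successes_ms + failures_ms, 50)

for pct, (observed, target, ok) in success_report.items():
    print(f"p{pct}: {observed} ms (target < {target} ms) {'OK' if ok else 'BREACH'}")
print(f"aggregate p50 including failures: {aggregate_p50} ms")
```

With these made-up samples, the fast failures lower the aggregate median below the success-only median, which is the masking effect that separate tracking avoids.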

### Traffic

Traffic measures demand on the system. Use it to detect unexpected spikes and traffic drops (which can indicate upstream failures), and to plan capacity.

| Metric | Description | Source |
|---|---|---|
| `[service].request.count` | Total requests, charted as requests per second (RPS) | Application / load balancer |
| `[service].request.count_by_endpoint` | RPS broken down by endpoint | Application |
| `[service].queue.messages_consumed_per_second` | Consumer throughput | Queue consumer |
| `[service].queue.depth` | Messages waiting in queue | Queue metrics |

**Traffic baselines (update after observing production for 2+ weeks):**

| Time period | Expected RPS | Low-traffic floor | Spike ceiling |
|---|---|---|---|
| Peak (weekday business hours) | [N] RPS | [N × 0.5] RPS | [N × 5] RPS |
| Off-peak (nights/weekends) | [N × 0.2] RPS | [N × 0.05] RPS | [N] RPS |
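
The multipliers in the baseline table can be turned into concrete bands once the peak rate `N` is known. The sketch below assumes a hypothetical peak of 200 RPS; the `TrafficBand` class and `bands_from_peak` helper are illustrative names, not part of any monitoring platform.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TrafficBand:
    expected_rps: float
    floor_rps: float
    ceiling_rps: float

    def classify(self, observed_rps: float) -> str:
        """Label an observed rate relative to this band."""
        if observed_rps < self.floor_rps:
            return "drop"
        if observed_rps > self.ceiling_rps:
            return "spike"
        return "normal"


def bands_from_peak(peak_rps: float) -> Dict[str, TrafficBand]:
    """Apply the table's multipliers to a measured peak rate N."""
    return {
        # Peak: expected N, floor N x 0.5, ceiling N x 5.
        "peak": TrafficBand(peak_rps, peak_rps * 0.5, peak_rps * 5),
        # Off-peak: expected N x 0.2, floor N x 0.05, ceiling N.
        "off_peak": TrafficBand(peak_rps * 0.2, peak_rps * 0.05, peak_rps),
    }


bands = bands_from_peak(200.0)  # assumed peak of 200 RPS, for illustration only
print(bands["peak"].classify(60.0))      # below the peak floor
print(bands["off_peak"].classify(35.0))  # inside the off-peak band
```

The multipliers are starting points from the template; replace them with observed values after the initial observation period.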

### Errors

Errors measure the fraction of requests that fail. Distinguish between client errors (4xx — caller is doing something wrong) and server errors (5xx — the service is broken).

| Metric | Description | Alert on? |
|---|---|---|
| `[service].request.error_rate` | 5xx errors / total requests | Yes — see alert rules |
| `[service].request.client_error_rate` | 4xx errors / total requests | Threshold alert — sudden spike may indicate API misuse |
| `[service].dependency.error_rate` | Errors calling the dependencies this service relies on | Yes, as a dependency health signal |
| `[service].queue.dlq_depth` | Messages in dead-letter queue | Yes — indicates processing failures |
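
As a rough illustration of how the first two rates in the table are derived, the sketch below computes them from a window of HTTP status codes. The `error_rates` function and the sample window are assumptions made for the example; in practice the monitoring platform usually computes these from counters.

```python
from collections import Counter
from typing import Dict, Iterable


def error_rates(status_codes: Iterable[int]) -> Dict[str, float]:
    """Split failures into server (5xx) and client (4xx) rates over one window."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        # No traffic in the window: report zero rather than dividing by zero.
        return {"error_rate": 0.0, "client_error_rate": 0.0}
    return {
        "error_rate": classes[5] / total,         # 5xx / total requests
        "client_error_rate": classes[4] / total,  # 4xx / total requests
    }


# Hypothetical one-minute window: 94 successes, four 404s, two server errors.
window = [200] * 94 + [404] * 4 + [500, 503]
rates = error_rates(window)
print(f"5xx rate: {rates['error_rate']:.1%}, 4xx rate: {rates['client_error_rate']:.1%}")
```

Keeping the two rates separate means a burst of 4xx responses from one misbehaving caller does not page on-call as if the service itself were failing.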

### Saturation

Saturation measures how "full" the service is: how close its most constrained resources are to maximum capacity.

| Resource | Metric | Alert threshold | Source |
|---|---|---|---|
| CPU | `[service].cpu.utilisation_pct` | >80% sustained 5 min | Container / VM metrics |
| Memory | `[service].memory.utilisation_pct` | >85% sustained 5 min | Container / VM metrics |
| DB connections | `[service].db.connection_pool.utilisation_pct` | >75% | Application / DB metrics |
| Thread pool / goroutines | `[service].runtime.goroutine_count` / `thread_count` | >N (establish baseline) | Runtime metrics |
| Disk (if applicable) | `[service].disk.utilisation_pct` | >75% | Infrastructure |
| Queue depth (if applicable) | `[service].queue.depth` | >[backlog threshold] | Queue metrics |
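
The "sustained 5 min" qualifier in the thresholds above means a single spike should not fire the alert. A minimal sketch of that logic follows, assuming evenly spaced samples of `(timestamp_seconds, utilisation_pct)`; the function name and sample data are hypothetical, and real platforms express the same idea through their own alert-duration settings.

```python
from typing import List, Optional, Tuple


def breached_sustained(
    samples: List[Tuple[float, float]], threshold_pct: float, window_s: float
) -> bool:
    """Return True if an unbroken run of samples above threshold_pct spans at least window_s.

    samples must be (timestamp_seconds, utilisation_pct) pairs sorted by timestamp.
    """
    run_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold_pct:
            if run_start is None:
                run_start = ts  # first sample of a new above-threshold run
            if ts - run_start >= window_s:
                return True
        else:
            run_start = None  # any sample at or below the threshold resets the run
    return False


# Hypothetical CPU samples taken once a minute.
short_spike = [(0, 70.0), (60, 90.0), (120, 95.0), (180, 60.0), (240, 85.0)]
sustained = [(t * 60, 85.0) for t in range(6)]  # 0 s to 300 s, all above 80%

print(breached_sustained(short_spike, threshold_pct=80.0, window_s=300))
print(breached_sustained(sustained, threshold_pct=80.0, window_s=300))
```

Resetting the run on any recovery keeps short, self-correcting spikes from paging on-call, which is in line with the requirement that every alert be actionable.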

---

## 3. Business Metrics

Beyond the golden signals, track metrics that measure whether the service is delivering business value. These matter for SLO reporting and product dashboards.

| Metric | Description | Source | Alert? |
|---|---|---|---|
| `[service].[primary_action].success_rate` | [e.g. "Payment success rate"]