
monitoring-and-alerting

This Claude Code skill guides teams through designing comprehensive monitoring and alerting systems for websites and applications. Use it when establishing uptime checks, defining service level objectives, configuring error tracking, determining alert policies, structuring on-call rotations, or addressing alert fatigue issues. It organizes monitoring into four layers: availability, correctness, performance, and error tracking, with specific thresholds and tools for each. The skill distinguishes itself from incident response and post-mortems by focusing on preventive system design rather than reactive management.

Repository: claude-skills

Install in Claude Code


git clone --depth 1 https://github.com/rampstackco/claude-skills /tmp/monitoring-and-alerting && cp -r /tmp/monitoring-and-alerting/dist/pi/.agents/skills/monitoring-and-alerting ~/.claude/skills/monitoring-and-alerting

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Monitoring and Alerting

Decide what to watch, what to alert on, and how to make sure the right person finds out when things break.

---

## When to use

- Setting up monitoring on a new site or service
- Defining SLOs (service level objectives) and error budgets
- Choosing which alerts page someone vs which go to a quiet channel
- Designing or fixing on-call rotation
- Diagnosing alert fatigue
- Filling monitoring gaps revealed by an incident
- Migrating monitoring vendors

## When NOT to use

- Responding to an active incident (use `incident-response`)
- Writing the post-mortem (use `after-action-report`)
- Designing analytics dashboards for product metrics (use `analytics-strategy`)
- Performance optimization itself (use `performance-optimization`)

---

## Required inputs

- The system you're monitoring (URLs, services, dependencies)
- Existing monitoring tools (uptime, errors, logs, APM)
- Business hours and team timezone(s)
- Who is on-call or available for incidents
- Existing SLOs or success metrics, if any

---

## The framework: 4 layers

Monitoring works in layers. Skip a layer and you'll miss a class of problems.

### Layer 1: Availability

Is the site up? The simplest, most important layer.

- HTTP checks from multiple regions (every 1-5 minutes)
- DNS resolution checks
- Certificate expiration checks
- Status code checks (alert on 5xx, not just timeout)

Threshold: any sustained downtime (more than 2 consecutive failed checks) pages the on-call.
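
As a minimal sketch of the "more than 2 consecutive failed checks" rule, assuming check results arrive as an ordered list of booleans (the function name `should_page` and its parameters are hypothetical, not part of any monitoring tool):

```python
from typing import Sequence


def should_page(results: Sequence[bool], max_consecutive_failures: int = 2) -> bool:
    """Return True when the newest checks form a failure streak longer than the limit.

    `results` is ordered oldest to newest; True means the check passed.
    """
    streak = 0
    for passed in reversed(results):
        if passed:
            break  # the current failure streak ends at the last success
        streak += 1
    return streak > max_consecutive_failures
```

A single failed check, or two in a row, stays quiet under this rule; only the third consecutive failure pages. Most uptime tools expose an equivalent "confirm after N failures" setting, so this is illustrative rather than something to deploy.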

### Layer 2: Correctness

The site is up, but is it serving the right thing?

- Synthetic checks (a script that loads the homepage, clicks a button, validates expected text)
- Critical user journeys (signup, checkout, search)
- Content presence checks (homepage hasn't gone blank)
- API contract checks (response shape and key fields are present)

Threshold: a failing critical-path synthetic pages the on-call. Failures of non-critical, page-level synthetics notify during business hours only.
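
An API contract check can be as small as comparing a decoded response against the expected fields. The sketch below assumes a hypothetical user endpoint whose contract is `id`, `email`, and `plan`; the names `USER_CONTRACT` and `contract_violations` are illustrative only:

```python
from typing import Any, List, Mapping

# Hypothetical contract for a hypothetical user endpoint: field name -> expected type.
USER_CONTRACT = {"id": int, "email": str, "plan": str}


def contract_violations(payload: Mapping[str, Any], contract: Mapping[str, type]) -> List[str]:
    """List every way the decoded JSON payload departs from the expected shape."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems
```

An empty list means the response still has the shape callers depend on. Whether a non-empty list pages or only notifies depends on whether the endpoint sits on a critical path.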

### Layer 3: Performance

The site is up and correct, but is it fast enough?

- Core Web Vitals (LCP, INP, CLS) from real users (RUM)
- Synthetic performance (Lighthouse, WebPageTest, custom)
- API response times (p50, p95, p99)
- Database query times for slow queries
- Dependency response times (third-party APIs)

Threshold: alert on regressions from baseline (e.g., p95 latency doubles within 5 minutes). Don't alert on absolute thresholds without baselines.
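
A baseline comparison might look like the following sketch. It assumes latency samples are plain lists of numbers and uses the nearest-rank percentile method; the function names and the default `factor` of 2.0 (for "p95 doubled") are assumptions for illustration:

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_regression(baseline: Sequence[float], current: Sequence[float],
                  pct: float = 95.0, factor: float = 2.0) -> bool:
    """True when the current percentile is at least `factor` times the baseline percentile."""
    return percentile(current, pct) >= factor * percentile(baseline, pct)
```

Here `baseline` would hold samples from a known-good period and `current` the latest 5-minute window. Comparing against the baseline rather than a fixed number keeps the alert meaningful for both fast and slow endpoints.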

### Layer 4: Errors and anomalies

The site is up, correct, and fast for most, but errors are happening.

- Error rate (% of requests returning 5xx)
- Client-side error rate (uncaught JS exceptions)
- Log error volume (unexpected spikes)
- Anomaly detection (traffic falling off a cliff)
- Background job failures
- Queue depth

Threshold: rate-based, not count-based. "Error rate above 1% for 5 minutes" beats "more than 100 errors per minute."
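
To see why a rate beats a count, the sketch below evaluates both rules over per-minute `(errors, requests)` pairs. The function names, the tuple layout, and the defaults are assumptions chosen to mirror the two quoted rules:

```python
from typing import Sequence, Tuple

Minute = Tuple[int, int]  # (errors, requests) observed in one minute


def rate_alert(minutes: Sequence[Minute], threshold: float = 0.01, duration: int = 5) -> bool:
    """'Error rate above 1% for 5 minutes': every one of the last `duration` minutes exceeds the rate."""
    if len(minutes) < duration:
        return False
    recent = minutes[-duration:]
    return all(requests > 0 and errors / requests > threshold for errors, requests in recent)


def count_alert(minutes: Sequence[Minute], max_errors: int = 100) -> bool:
    """'More than 100 errors per minute': looks only at the raw count in the latest minute."""
    return bool(minutes) and minutes[-1][0] > max_errors
```

With 150 errors per minute out of 1,000,000 requests, the count rule fires even though only 0.015% of requests fail. With 5 errors out of 200 requests, the count rule stays silent while 2.5% of users hit errors. The rate rule gets both cases right.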

---

## SLOs and error budgets

A Service Level Objective is the target for reliability. Common form: "99.9% of homepage requests succeed in under 2 seconds, measured over 30 days."

The components:
- **The thing you're measuring** (homepage requests)
- **The success criterion** (returns 2xx in under 2 seconds)
- **The target** (99.9% of them)
- **The window** (over 30 days)

The error budget is the complement of the target: with a 99.9% target, 0.1% of requests in the window can fail. Once the whole budget is spent, hold off on risky changes.
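
The four components and the budget arithmetic can be captured in a small structure. This is a sketch under assumptions: the class name `SLO`, its field names, and the example traffic numbers are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    measured: str      # the thing you're measuring
    criterion: str     # the success criterion
    target: float      # the target as a fraction, e.g. 0.999 for 99.9%
    window_days: int   # the window

    def allowed_failures(self, total_requests: int) -> int:
        """Error budget expressed as a number of requests in the window."""
        return round(total_requests * (1 - self.target))

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the budget left; negative once the budget is overspent."""
        allowed = self.allowed_failures(total_requests)
        if allowed == 0:
            return 0.0
        return 1 - failed_requests / allowed


homepage = SLO("homepage requests", "returns 2xx in under 2 seconds", 0.999, 30)
```

For example, with 1,000,000 requests in the window, `homepage.allowed_failures(1_000_000)` is 1000, and 250 failures so far leaves 75% of the budget.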

### Picking SLOs

Don't aim for 100%. Don't aim for "five nines" (99.999%) unless you really need it. As a rule of thumb, each additional nine costs roughly an order of magnitude more.

| SLO | Allowed downtime per month (approximate) |
|---|---|
| 99% | 7 hours, 18 minutes |
| 99.9% | 43 minutes |
| 99.95% | 21 minutes |
| 99.99% | 4 minutes, 22 seconds |
| 99.999% | 26 seconds |

For most marketing sites, 99.9% is plenty. For SaaS, 99.95% is reasonable. Anything higher needs significant infrastructure investment.
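
The downtime figures in the table can be reproduced with a few lines of arithmetic. They appear consistent with an average month of 365.25 / 12 days (about 30.44 days), truncated to whole units; that month length is an inference from the numbers, not something the table states. The names `AVERAGE_MONTH` and `allowed_downtime` are hypothetical:

```python
from datetime import timedelta

# Assumption: an "average month" of 365.25 / 12 days, which matches the table's figures.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def allowed_downtime(slo_percent: float, window: timedelta = AVERAGE_MONTH) -> timedelta:
    """Downtime permitted by an availability SLO over the given window."""
    return window * (1 - slo_percent / 100)


for slo in (99, 99.9, 99.95, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime(slo)}")
```

Passing `window=timedelta(days=30)` instead gives slightly smaller numbers (7 hours 12 minutes at 99%), which is worth knowing when an SLO is measured over a strict 30-day window.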

### Using error budgets

When the budget is healthy, ship aggressively. When the budget is half-spent, slow down. When the budget is exhausted, freeze risky changes until reliability recovers.

This is what makes SLOs useful: they create a feedback loop between reliability and velocity.
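
The policy above can be written down as a tiny decision function so it is applied consistently. This is a sketch: the function name, the return strings, and treating "half-spent" as exactly 50% consumed are assumptions:

```python
def release_stance(budget_consumed: float) -> str:
    """Map the fraction of error budget consumed (0.0 to 1.0 and beyond) to a release posture."""
    if budget_consumed >= 1.0:
        return "freeze risky changes"
    if budget_consumed >= 0.5:
        return "slow down"
    return "ship aggressively"
```

The value of encoding it is less the code than the agreement: the thresholds are decided ahead of time, not argued during a bad week.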

---

## Workflow

### Step 1: Inventory what's already monitored

What tools are in place? What checks exist? What dashboards? What alerts?

Many teams have a tangle of half-configured tools. The first job is the inventory.

### Step 2: Map the system

Draw the architecture. Front-end, back-end, database, third-party APIs, queues, workers. Each box is a candidate for monitoring.

For each box, ask:
- What does "up" mean?
- What does "correct" mean?
- What does "fast" mean?
- What's the most common failure mode?

### Step 3: Define the SLOs

Pick 3-5 SLOs. They should be:
- Tied to user-visible behavior (not internal metrics)
- Achievable with current infrastructure
- Measured automatically
- Reviewed at least quarterly

### Step 4: Set up checks across the 4 layers

For each box, configure checks at each layer. Some boxes won't have all four; that's fine.

| Box | Availability | Correctness | Performance | Errors |
|---|---|---|---|---|
| Homepage | HTTP check | Synthetic | LCP/INP | JS errors |
| Login API | HTTP check | Synthetic flow | p95 latency | 5xx rate |
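
Keeping the coverage table as data makes gaps easy to list during a review. In this sketch the first two entries mirror the table above, while the "Worker queue" box and the names `LAYERS`, `coverage`, and `missing_layers` are hypothetical additions for illustration:

```python
from typing import Dict, List

LAYERS = ("availability", "correctness", "performance", "errors")

coverage: Dict[str, Dict[str, str]] = {
    "Homepage": {"availability": "HTTP check", "correctness": "Synthetic",
                 "performance": "LCP/INP", "errors": "JS errors"},
    "Login API": {"availability": "HTTP check", "correctness": "Synthetic flow",
                  "performance": "p95 latency", "errors": "5xx rate"},
    # Hypothetical box that only has an error check so far.
    "Worker queue": {"errors": "job failure rate"},
}


def missing_layers(matrix: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return, for each box with gaps, the layers that have no check configured."""
    gaps = {}
    for box, checks in matrix.items():
        absent = [layer for layer in LAYERS if layer not in checks]
        if absent:
            gaps[box] = absent
    return gaps
```

A gap is a prompt for a decision, not automatically a defect: some boxes legitimately lack a layer, and the review should record that as a choice.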

### Step 5: Decide what pages and what doesn't

Three tiers:

1. **Page (wakes someone up):** site down, critical flow broken, error rate spike, security incident.
2. **Notify (during business hours):** non-critical synthetic failure, performance regression, slow query, dependency degradation.
3. **Log (no notification):** anomalies for later review, low-priority warnings, info-level events.

Anything in tier 1 must be:
- Actionable (the on-call can do something about it)
- Important (it represents real impact)
- Rare (the goal is fewer than 1-2 pages per week)

If tier 1 alerts fire frequently, alert fatigue sets in. People stop responding.
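
One way to keep the "rare" criterion honest is to count tier 1 pages over the trailing week. The sketch below assumes page history is available as `(alert_name, fired_at)` pairs and reads the goal as an overall limit of 2 pages per week; the function name and report keys are hypothetical:

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def weekly_page_report(pages: List[Tuple[str, datetime]], now: datetime,
                       weekly_goal: int = 2) -> Dict[str, object]:
    """Summarize tier 1 pages from the trailing 7 days and flag when the goal is exceeded."""
    cutoff = now - timedelta(days=7)
    recent = [name for name, fired_at in pages if cutoff < fired_at <= now]
    return {
        "total": len(recent),
        "over_goal": len(recent) > weekly_goal,
        "noisiest": Counter(recent).most_common(1),  # [(alert_name, count)] or []
    }
```

When `over_goal` is true, the `noisiest` entry is the first candidate to demote to tier 2, tune, or fix at the source.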

### Step 6: Configure routing

Where do alerts go?

- Tier 1: paging system (e.g., PagerDuty, Opsgenie). Direct to on-call.
- Tier 2: chat channel (Slack, Teams). Tagged with the
