oma-observability

oma-observability routes intent-based observability and traceability work across metrics, events, logs, traces, and profiles (MELT+P) using vendor-category taxonomy and transport tuning. Use this skill for designing observability pipelines, tracing across service boundaries, selecting vendor categories, tuning Collector topology, incident forensics across six dimensions, implementing observability-as-code with Grafana and PrometheusRule, meta-observability for pipeline health, and tool migrations like Fluentd to OTel.

Repository: oh-my-agent

Install in Claude Code

git clone --depth 1 https://github.com/first-fluke/oh-my-agent /tmp/oma-observability && cp -r /tmp/oma-observability/benchmarks/runs/oma/.agents/skills/oma-observability ~/.claude/skills/oma-observability

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Observability Agent - Intent-based Router

## Scheduling

### Goal
Route, design, tune, and review observability work across MELT+P signals, layers, boundaries, vendor categories, transport choices, meta-observability, and incident forensics.

### Intent signature
- User asks for observability, telemetry, OTel, metrics, logs, traces, profiles, SLOs, RUM, APM, incident forensics, trace propagation, transport tuning, or observability-as-code.
- User needs vendor/category routing or observability architecture, not a single-vendor setup that the vendor's own skill already covers.

### When to use
- Setting up an observability pipeline (OTel SDK + Collector + vendor backend)
- Designing traceability across service and domain boundaries (W3C propagators, baggage, multi-tenant, multi-cloud)
- Tuning transport layer (UDP/MTU, OTLP gRPC vs HTTP, Collector DaemonSet vs sidecar topology)
- Running incident forensics (6-dimension localization: code / service / layer / host / region / infra)
- Selecting a vendor category (OSS full-stack vs commercial SaaS vs high-cardinality specialist vs profiling specialist)
- Implementing observability-as-code (Grafana Jsonnet dashboards, PrometheusRule CRD, OpenSLO YAML, SLO burn-rate alerts)
- Meta-observability (pipeline self-health, clock skew detection, cardinality guardrails, retention matrix)
- Covering the MELT+P signal set and adjacent signals: metrics, logs, traces, profiles (OTEP 0239), cost (OpenCost), audit (SOC2/ISO), privacy (GDPR/PIPA)
- Migrating off deprecated tools (Fluentd → Fluent Bit or OTel Collector, per CNCF 2025-10 guide)
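
For the traceability case above, the W3C `traceparent` header is what carries trace identity across service boundaries. The following is a minimal sketch of validating that header, assuming only the version `00` layout (`version-traceid-parentid-flags`); the names `TraceParent` and `parse_traceparent` are illustrative and not part of this skill or of any OTel SDK.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero trace-id / parent-id are invalid per the spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A check like this is useful at a boundary (gateway, queue consumer) to confirm that context survives the hop before investigating broken traces further downstream.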

### When NOT to use
- LLM ops (prompt versioning, evals, gen_ai span deep dive) — use Langfuse, Arize Phoenix, LangSmith, or Braintrust directly
- Data pipeline lineage — use OpenLineage + Marquez, dbt test, or Airflow lineage backends
- IoT / hardware / datacenter physical-layer telemetry (IPMI, BMC, SNMP) — use vendor DCIM tooling (Nlyte, Sunbird, Device42)
- Chaos engineering orchestration — use Chaos Mesh, Litmus, Gremlin, or ChaosToolkit (this skill consumes their telemetry; it does not orchestrate chaos)
- GPU / TPU infrastructure observability — use NVIDIA DCGM Exporter + Prometheus
- Software supply chain (SBOM, attestation) — use sigstore (cosign / rekor), in-toto framework, SLSA level attestations
- Incident response workflow (on-call rotation, paging, escalation) — use PagerDuty, OpsGenie, or Grafana OnCall
- Single-vendor setup already fully covered by that vendor's own published skill — invoke the vendor skill directly

### Expected inputs
- Observability intent, target system, architecture boundary, signals, vendor context, and incident symptoms if any
- Existing OTel/collector/vendor configs, dashboards, SLOs, trace/log/metric examples, or deployment topology

### Expected outputs
- Routed observability guidance, setup/migration/tuning plan, incident-forensics path, alerting/SLO guidance, or observability-as-code recommendations
- Transport, meta-observability, privacy, audit, and retention checks
- Vendor delegation target when appropriate
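
As an illustration of the alerting/SLO guidance listed above, here is a small sketch of a multi-window burn-rate check. Burn rate is the observed error ratio divided by the error budget (`1 - SLO target`). The function names and the default threshold of 14.4 (a commonly cited paging threshold for a 1h/5m window pair on a 30-day SLO) are assumptions for this example, not values taken from the skill.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening, which avoids paging on an already-recovered spike.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

With a 99.9% target, a sustained 2% error ratio in both windows burns the budget roughly 20 times too fast and would page; the same long-window ratio with a recovered short window would not.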

### Dependencies
- OTel/W3C/CNCF references and resources under `resources/`
- Vendor categories, matrix, standards, incident forensics, meta-observability, transport, layers, boundaries, and signal guides

### Control-flow features
- Branches by intent, vendor category, layer/boundary/signal matrix, transport topology, privacy/audit risk, and incident localization dimension
- May read/write observability config and docs; generally delegates vendor-specific implementation
- Requires verifying live CNCF/vendor project status whenever a recommendation depends on that status being current

## Structural Flow

### Entry
1. Classify the intent: setup, migrate, investigate, alert, trace, tune, or route.
2. Identify layers, boundaries, signals, and vendor category.
3. Load only the relevant resource guide(s).
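
Step 1 of the entry sequence can be sketched as a keyword-based classifier. This is only an illustration: the keyword lists and the `classify_intent` name are hypothetical, and the skill itself does not specify how intents are detected. Unmatched requests fall through to `route`.

```python
import re

# Hypothetical keyword lists; the seven intent names come from the entry step.
# Dict order sets priority: the first intent with a matching keyword wins.
INTENT_KEYWORDS = {
    "migrate": ("migrate", "migration", "fluentd"),
    "investigate": ("incident", "outage", "root cause", "forensics"),
    "alert": ("alert", "alerts", "slo", "burn rate"),
    "trace": ("trace", "traces", "propagation", "propagator", "baggage"),
    "tune": ("tune", "tuning", "mtu", "grpc", "daemonset", "sidecar"),
    "setup": ("set up", "setup", "install", "instrument"),
}


def classify_intent(request: str) -> str:
    """Return the first matching intent, or 'route' when nothing matches."""
    text = request.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        # Whole-word matching so that "slow" does not match "slo".
        if any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords):
            return intent
    return "route"
```

A request that mentions several intents resolves to the earliest entry in the table, so the ordering is itself a design decision worth reviewing.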

### Scenes
1. **PREPARE**: Classify intent and matrix coverage.
2. **ACQUIRE**: Read configs, topology, telemetry examples, or incident signals.
3. **REASON**: Route vendor/category, tune transport, assess meta-observability, or localize incident.
4. **ACT**: Produce setup/migration/tuning/alert/trace/forensics guidance or config changes.
5. **VERIFY**: Check pipeline health, clock skew, cardinality, retention, privacy, and audit concerns.
6. **FINALIZE**: Report route, evidence, risks, and handoff references.
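
The clock-skew check in the VERIFY scene can be approximated with a simple heuristic: a child span that starts noticeably before its parent usually points to unsynchronized clocks between hosts. The sketch below assumes a simplified `Span` record and a 1 ms tolerance, both hypothetical; it is not the skill's own detection method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    start_ns: int
    end_ns: int


def suspected_skew(spans: List[Span], tolerance_ns: int = 1_000_000) -> List[str]:
    """Return IDs of spans that start before their parent by more than the tolerance."""
    by_id = {s.span_id: s for s in spans}
    flagged = []
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id else None
        if parent is None:
            continue  # root span, or the parent is missing from this batch
        if span.start_ns + tolerance_ns < parent.start_ns:
            flagged.append(span.span_id)
    return flagged
```

Flagged spans are a prompt to check NTP/PTP status on the emitting hosts rather than proof of skew.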

### Transitions
- If a vendor-owned skill fully covers setup, delegate instead of duplicating docs.
- If Fluentd appears, recommend Fluent Bit or OTel Collector migration.
- If incident investigation is requested, use 6-dimension localization (code / service / layer / host / region / infra).
- If transport tuning appears, load transport-specific resources.
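
The transition rules above can be written as one function that maps the current context to follow-up steps. This is a sketch under stated assumptions: the step labels, the `next_steps` name, and detecting Fluentd by a substring match on config text are all hypothetical simplifications.

```python
from typing import List


def next_steps(intent: str, config_text: str, vendor_skill_covers_setup: bool) -> List[str]:
    """Apply the four transition rules in order and collect the resulting steps."""
    steps: List[str] = []
    if vendor_skill_covers_setup:
        steps.append("delegate-to-vendor-skill")
    # Naive detection: any mention of "fluentd" in the supplied config.
    if "fluentd" in config_text.lower():
        steps.append("recommend-fluent-bit-or-otel-collector")
    if intent == "investigate":
        steps.append("six-dimension-localization")
    if intent == "tune":
        steps.append("load-transport-resources")
    return steps
```

The rules are independent, so several steps can fire for the same request, for example a tuning request against a stack that still ships logs through Fluentd.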

### Failure and recovery
- If a recommendation depends on current CNCF/vendor project status, verify that status against a live source before relying on it.
- If telemetry samples are missing, provide instrumentation/collection steps before analysis.
- If the request falls in an out-of-scope domain (see "When NOT to use"), route to the external authoritative tools listed there.

### Exit
- Success: observability path is routed, evidence-backed, and checks are explicit.
- Partial success: missing telemetry, stale vendor status, or external-domain handoff is explicit.

## Logical Operations

### Actions
| Action | SSL primitive | Evidence |
|--------|---------------|----------|
| Classify observability intent | `SELECT` | Intent rules |
| Read telemetry/config evidence | `READ` | OTel/vendor configs, dashboards, samples |
| Route vendor/category | `SELECT` | Vendor categories |
| Infer coverage gaps | `INFER` | Matrix and signal/boundary mapping |
| Validate meta-observability | `VALIDATE` | Clock, cardinality, retention, health |
| Write guidance/config | `WRITE` | OaC/config/docs when requested |
| Notify result | `NOTIFY` | Routed recommendation |
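
The `VALIDATE` row covers cardinality among other checks. One way to sketch a cardinality guardrail is to count distinct label sets per metric and flag any metric above a budget. The function names and the sample shape (metric name plus label mapping) are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

Sample = Tuple[str, Mapping[str, str]]  # (metric name, labels)


def series_cardinality(samples: Iterable[Sample]) -> Dict[str, int]:
    """Count distinct label sets (time series) seen for each metric."""
    seen = defaultdict(set)
    for metric, labels in samples:
        # frozenset makes the label set hashable and order-independent.
        seen[metric].add(frozenset(labels.items()))
    return {metric: len(label_sets) for metric, label_sets in seen.items()}


def over_budget(samples: Iterable[Sample], max_series_per_metric: int) -> Dict[str, int]:
    """Return only the metrics whose series count exceeds the budget."""
    counts = series_cardinality(samples)
    return {m: n for m, n in counts.items() if n > max_series_per_metric}
```

Unbounded labels such as user or request IDs are the usual cause of a metric landing in the over-budget set.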

### Tools and instruments
- OTel/CNCF/W3C standards references
- Vendor categories, matrix, incident forensics, meta-observability, transport and signal guides
- Optional CLI/config tooling from the target stack

### Canonical workflow path
```text
1. Classify intent: setup, migrate, investigate
```
