oma-observability
oma-observability routes intent-based observability and traceability work across metrics, events, logs, traces, and profiles (MELT+P) using vendor-category taxonomy and transport tuning. Use this skill for designing observability pipelines, tracing across service boundaries, selecting vendor categories, tuning Collector topology, incident forensics across six dimensions, implementing observability-as-code with Grafana and PrometheusRule, meta-observability for pipeline health, and tool migrations like Fluentd to OTel.
git clone --depth 1 https://github.com/first-fluke/oh-my-agent /tmp/oma-observability && cp -r /tmp/oma-observability/benchmarks/runs/oma/.agents/skills/oma-observability ~/.claude/skills/oma-observability
SKILL.md
# Observability Agent - Intent-based Router

## Scheduling

### Goal

Route, design, tune, and review observability work across MELT+P signals, layers, boundaries, vendor categories, transport choices, meta-observability, and incident forensics.

### Intent signature

- User asks for observability, telemetry, OTel, metrics, logs, traces, profiles, SLOs, RUM, APM, incident forensics, trace propagation, transport tuning, or observability-as-code.
- User needs vendor/category routing or observability architecture instead of a single vendor's already-covered setup.

### When to use

- Setting up an observability pipeline (OTel SDK + Collector + vendor backend)
- Designing traceability across service and domain boundaries (W3C propagators, baggage, multi-tenant, multi-cloud)
- Tuning transport layer (UDP/MTU, OTLP gRPC vs HTTP, Collector DaemonSet vs sidecar topology)
- Running incident forensics (6-dimension localization: code / service / layer / host / region / infra)
- Selecting a vendor category (OSS full-stack vs commercial SaaS vs high-cardinality specialist vs profiling specialist)
- Implementing observability-as-code (Grafana Jsonnet dashboards, PrometheusRule CRD, OpenSLO YAML, SLO burn-rate alerts)
- Meta-observability (pipeline self-health, clock skew detection, cardinality guardrails, retention matrix)
- Covering the MELT+P signal set: metrics, logs, traces, profiles (OTEP 0239), cost (OpenCost), audit (SOC2/ISO), privacy (GDPR/PIPA)
- Migrating off deprecated tools (Fluentd → Fluent Bit or OTel Collector, per CNCF 2025-10 guide)

### When NOT to use

- LLM ops (prompt versioning, evals, gen_ai span deep dive): use Langfuse, Arize Phoenix, LangSmith, or Braintrust directly
- Data pipeline lineage: use OpenLineage + Marquez, dbt test, or Airflow lineage backends
- IoT / hardware / datacenter physical-layer telemetry (IPMI, BMC, SNMP): use vendor DCIM tooling (Nlyte, Sunbird, Device42)
- Chaos engineering orchestration: use Chaos Mesh, Litmus, Gremlin, or ChaosToolkit (this skill consumes their telemetry; it does not orchestrate chaos)
- GPU / TPU infrastructure observability: use NVIDIA DCGM Exporter + Prometheus
- Software supply chain (SBOM, attestation): use sigstore (cosign / rekor), in-toto framework, SLSA level attestations
- Incident response workflow (on-call rotation, paging, escalation): use PagerDuty, OpsGenie, or Grafana OnCall
- Single-vendor setup already fully covered by that vendor's own published skill: invoke the vendor skill directly

### Expected inputs

- Observability intent, target system, architecture boundary, signals, vendor context, and incident symptoms if any
- Existing OTel/collector/vendor configs, dashboards, SLOs, trace/log/metric examples, or deployment topology

### Expected outputs

- Routed observability guidance, setup/migration/tuning plan, incident-forensics path, alerting/SLO guidance, or observability-as-code recommendations
- Transport, meta-observability, privacy, audit, and retention checks
- Vendor delegation target when appropriate

### Dependencies

- OTel/W3C/CNCF references and resources under `resources/`
- Vendor categories, matrix, standards, incident forensics, meta-observability, transport, layers, boundaries, and signal guides

### Control-flow features

- Branches by intent, vendor category, layer/boundary/signal matrix, transport topology, privacy/audit risk, and incident localization dimension
- May read/write observability config and docs; generally delegates vendor-specific implementation
- Requires live status verification for load-bearing CNCF/vendor currency

## Structural Flow

### Entry

1. Classify the intent: setup, migrate, investigate, alert, trace, tune, or route.
2. Identify layers, boundaries, signals, and vendor category.
3. Load only the relevant resource guide(s).

### Scenes

1. **PREPARE**: Classify intent and matrix coverage.
2. **ACQUIRE**: Read configs, topology, telemetry examples, or incident signals.
3. **REASON**: Route vendor/category, tune transport, assess meta-observability, or localize incident.
4. **ACT**: Produce setup/migration/tuning/alert/trace/forensics guidance or config changes.
5. **VERIFY**: Check pipeline health, clock skew, cardinality, retention, privacy, and audit concerns.
6. **FINALIZE**: Report route, evidence, risks, and handoff references.

### Transitions

- If a vendor-owned skill fully covers setup, delegate instead of duplicating docs.
- If Fluentd appears, recommend Fluent Bit or OTel Collector migration.
- If incident investigation is requested, use 6-dimensional localization.
- If transport tuning appears, load transport-specific resources.

### Failure and recovery

- If live CNCF/vendor status is load-bearing, verify current status.
- If telemetry samples are missing, provide instrumentation/collection steps before analysis.
- If scope belongs to out-of-scope domains, route to external authoritative tools.

### Exit

- Success: observability path is routed, evidence-backed, and checks are explicit.
- Partial success: missing telemetry, stale vendor status, or external-domain handoff is explicit.

## Logical Operations

### Actions

| Action | SSL primitive | Evidence |
|--------|---------------|----------|
| Classify observability intent | `SELECT` | Intent rules |
| Read telemetry/config evidence | `READ` | OTel/vendor configs, dashboards, samples |
| Route vendor/category | `SELECT` | Vendor categories |
| Infer coverage gaps | `INFER` | Matrix and signal/boundary mapping |
| Validate meta-observability | `VALIDATE` | Clock, cardinality, retention, health |
| Write guidance/config | `WRITE` | OaC/config/docs when requested |
| Notify result | `NOTIFY` | Routed recommendation |

### Tools and instruments

- OTel/CNCF/W3C standards references
- Vendor categories, matrix, incident forensics, meta-observability, transport and signal guides
- Optional CLI/config tooling from the target stack

### Canonical workflow path

```text
1. Classify intent: setup, migrate, investigate
```
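
The intent classification in the Entry step and the rules listed under Transitions could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the keyword lists, the `INTENT_KEYWORDS` table, and the function names `classify_intent` and `transitions` are all hypothetical, and naive substring matching is assumed only to keep the example short.

```python
# Hypothetical sketch of intent classification and transition routing.
# Keyword lists and names are illustrative assumptions, not part of the skill.

# Intents are checked in insertion order; the first keyword hit wins.
INTENT_KEYWORDS = {
    "migrate": ("migrate", "migration", "fluentd"),
    "investigate": ("incident", "outage", "forensics", "root cause"),
    "tune": ("mtu", "udp", "grpc", "sidecar", "daemonset", "transport"),
    "alert": ("alert", "burn rate", "burn-rate"),
    "trace": ("trace", "propagat", "baggage"),
    "setup": ("set up", "setup", "install", "pipeline"),
}


def classify_intent(request: str) -> str:
    """Return one of: setup, migrate, investigate, alert, trace, tune, route."""
    text = request.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    # Nothing matched: fall back to vendor/category routing.
    return "route"


def transitions(request: str) -> list[str]:
    """Collect the transition rules from the skill that apply to a request."""
    text = request.lower()
    intent = classify_intent(request)
    notes = []
    if "fluentd" in text:
        notes.append("recommend Fluent Bit or OTel Collector migration")
    if intent == "investigate":
        notes.append("use 6-dimensional localization")
    if intent == "tune":
        notes.append("load transport-specific resources")
    return notes


if __name__ == "__main__":
    for sample in (
        "We still run Fluentd; how do we migrate?",
        "Checkout latency incident in one region",
        "Should we use OTLP gRPC or HTTP?",
        "Which vendor category fits us?",
    ):
        print(classify_intent(sample), transitions(sample))
```

A real router would likely need something sturdier than substring checks (for example, a short keyword such as "slo" would also match "slow"), which is why SLO-related terms are left out of the hypothetical table above.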
Architecture specialist for software/system design, module and service boundaries, tradeoff analysis, and stakeholder synthesis. Uses context-aware methods such as diagnostic routing, design-twice comparison, ATAM-style risk analysis, CBAM-style prioritization, and ADR-style decision records.
Backend specialist for APIs, databases, authentication with clean architecture (Repository/Service/Router pattern). Use for API, endpoint, REST, database, server, migration, and auth work.
Design-first ideation that explores user intent, constraints, and approaches before any planning or implementation. Use for brainstorming, ideation, exploring concepts, and evaluating approaches.
Guide for coordinating PM, Frontend, Backend, Mobile, and QA agents on complex projects via CLI. Use for manual step-by-step coordination and workflow guidance.
Database specialist for SQL, NoSQL, and vector database modeling, schema design, normalization, indexing, transactions, integrity, concurrency control, backup, capacity planning, data standards, anti-pattern review, and compliance-aware database design. Use for database, schema, ERD, table design, document model, vector index design, RAG retrieval architecture, migration, query tuning, glossary, capacity estimation, backup strategy, database anti-pattern remediation work, and ISO 27001, ISO 27002, or ISO 22301-aware database recommendations.
Bug diagnosis and fixing specialist - analyzes errors, identifies root causes, provides fixes, and writes regression tests. Use for bug, debug, error, crash, traceback, exception, and regression work.