operations

# Operations Skill Description

The operations skill manages business continuity workflows including vendor evaluation, incident procedures, process documentation, risk assessment, capacity planning, change deployment, and compliance verification. Use this skill when addressing operational tasks like creating runbooks, assessing risks, reviewing vendors, documenting processes, managing changes, planning resources, preparing audits, or improving workflows. Route strategic decisions to csuite and financial planning to finance instead.

Repository: vexjoy-agent

Install in Claude Code

git clone --depth 1 https://github.com/notque/vexjoy-agent /tmp/operations && cp -r /tmp/operations/skills/business/operations ~/.claude/skills/operations

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Operations

Umbrella skill for business operations: vendor management, runbooks, process documentation, risk assessment, capacity planning, change management, compliance tracking, status reporting, and process optimization. Each mode loads its own reference files on demand.

**Scope**: Operational workflows that keep the business running. Use csuite for strategic decisions, finance for budgeting/forecasting, and hr for people operations.

---

## Mode Detection

Classify the request into exactly one mode. If it spans multiple, choose the primary and note the secondary.

| Mode | Signal Phrases | Reference |
|------|---------------|-----------|
| **RUNBOOK** | runbook, procedure, on-call, playbook, step-by-step, ops task | `references/runbook-authoring.md` |
| **RISK** | risk assessment, risk register, what could go wrong, risk matrix | `references/risk-assessment.md` |
| **VENDOR** | vendor review, vendor evaluation, contract review, procurement | `references/vendor-management.md` |
| **PROCESS** | process doc, SOP, RACI, workflow documentation, process map | `references/process-documentation.md` |
| **CHANGE** | change request, change management, CAB, rollout, deployment change | `references/change-management.md` |
| **CAPACITY** | capacity plan, resource allocation, utilization, headcount planning | `references/process-documentation.md` |
| **COMPLIANCE** | compliance, audit prep, SOC 2, ISO 27001, GDPR, regulatory | `references/risk-assessment.md` |
| **STATUS** | status report, weekly update, project health, KPIs | (no deep reference needed) |
| **OPTIMIZE** | process improvement, bottleneck, streamline, too many steps | `references/process-documentation.md` |

Always load `references/llm-ops-failure-modes.md` regardless of mode. It contains the failure patterns that apply across all operations work.
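The mode table above can be read as a lookup from signal phrases to a mode and its reference file. The sketch below is one possible way to express that, not the skill's actual implementation: the phrase lists and reference paths are copied from the table, while the function names (`detect_mode`, `references_for`), the hit-count scoring, and the tie-breaking by table order are assumptions made for illustration.

```python
import re

# Signal phrases copied from the Mode Detection table (lowercased for matching).
MODE_SIGNALS = {
    "RUNBOOK": ["runbook", "procedure", "on-call", "playbook", "step-by-step", "ops task"],
    "RISK": ["risk assessment", "risk register", "what could go wrong", "risk matrix"],
    "VENDOR": ["vendor review", "vendor evaluation", "contract review", "procurement"],
    "PROCESS": ["process doc", "sop", "raci", "workflow documentation", "process map"],
    "CHANGE": ["change request", "change management", "cab", "rollout", "deployment change"],
    "CAPACITY": ["capacity plan", "resource allocation", "utilization", "headcount planning"],
    "COMPLIANCE": ["compliance", "audit prep", "soc 2", "iso 27001", "gdpr", "regulatory"],
    "STATUS": ["status report", "weekly update", "project health", "kpis"],
    "OPTIMIZE": ["process improvement", "bottleneck", "streamline", "too many steps"],
}

# Reference file per mode, as listed in the table. STATUS has no deep reference.
MODE_REFERENCE = {
    "RUNBOOK": "references/runbook-authoring.md",
    "RISK": "references/risk-assessment.md",
    "VENDOR": "references/vendor-management.md",
    "PROCESS": "references/process-documentation.md",
    "CHANGE": "references/change-management.md",
    "CAPACITY": "references/process-documentation.md",
    "COMPLIANCE": "references/risk-assessment.md",
    "STATUS": None,
    "OPTIMIZE": "references/process-documentation.md",
}

ALWAYS_LOAD = "references/llm-ops-failure-modes.md"


def detect_mode(request):
    """Return (primary, secondary) modes; either may be None.

    Assumption: score each mode by how many of its phrases appear.
    Phrases match at a word start only, so "capacity plan" also matches
    "capacity planning"; short acronyms such as "cab" can over-match.
    """
    text = request.lower()
    scores = {}
    for mode, phrases in MODE_SIGNALS.items():
        hits = sum(1 for p in phrases if re.search(r"\b" + re.escape(p), text))
        if hits:
            scores[mode] = hits
    # Stable sort: ties keep the table's row order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    primary = ranked[0] if ranked else None
    secondary = ranked[1] if len(ranked) > 1 else None
    return primary, secondary


def references_for(mode):
    """Reference files to load: the cross-mode file plus the mode's own, if any."""
    refs = [ALWAYS_LOAD]
    if MODE_REFERENCE.get(mode):
        refs.append(MODE_REFERENCE[mode])
    return refs


primary, secondary = detect_mode(
    "Write a runbook for the on-call database failover and add it to the risk register"
)
print(primary, secondary, references_for(primary))
```

Keyword counting is deliberately crude. A request with no matching phrase returns `(None, None)`, which is the point at which asking the user to clarify is probably the better move.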

---

## Instructions

### Mode: RUNBOOK

**Framework**: SCOPE -> AUTHOR -> VERIFY

**Phase 1: SCOPE** -- Define what the runbook covers.

- Name the task, its frequency, and who runs it
- List prerequisites: access, tools, credentials, prior state
- Identify the trigger: scheduled, event-driven, or manual invocation
- Ask: "If a new hire ran this at 3am during an incident, what would they need?"

**Gate**: Task named. Prerequisites listed. Trigger defined.

**Phase 2: AUTHOR** -- Write the procedure with painful specificity.

Load `references/runbook-authoring.md`.

Critical rules:
- Every step has: exact command/action, expected result, failure handling
- "Run the script" is NOT a step. `python sync.py --prod --dry-run` from `/opt/ops/` as `deploy-user` IS a step
- Include verification after every state-changing step
- Rollback procedure for the entire runbook AND per-step rollback where applicable
- Escalation paths with names, contact methods, and when-to-escalate triggers

| Step Component | Required | Example |
|---------------|----------|---------|
| Action | Yes | `kubectl rollout restart deployment/api -n production` |
| Expected result | Yes | "Pods restart within 60s. `kubectl get pods` shows 3/3 Running." |
| Failure handling | Yes | "If pods stay in CrashLoopBackOff >2min, proceed to Rollback." |
| Verification | Yes | `curl -s https://api.example.com/health \| jq .status` returns `"ok"` |
| Rollback | Per-step | `kubectl rollout undo deployment/api -n production` |

**Gate**: Every step has the four required components, plus a per-step rollback where applicable. Runbook-level rollback procedure exists. Escalation path defined.
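As a rough illustration of the AUTHOR gate, a runbook step can be modeled as a small record and checked mechanically. This is a sketch under assumptions: the `RunbookStep` class and `author_gate` function are hypothetical names, per-step rollback is treated as optional in line with the "where applicable" rule, and the example values come from the step component table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunbookStep:
    action: str
    expected_result: str
    failure_handling: str
    verification: str
    rollback: Optional[str] = None  # per-step rollback, where applicable


# Components marked "Yes" in the Required column of the table.
REQUIRED = ("action", "expected_result", "failure_handling", "verification")


def author_gate(steps: List[RunbookStep], runbook_rollback: str, escalation_path: str) -> List[str]:
    """Return a list of gate violations; an empty list means the gate passes."""
    problems = []
    for i, step in enumerate(steps, start=1):
        for name in REQUIRED:
            if not getattr(step, name).strip():
                problems.append(f"step {i}: missing {name}")
    if not runbook_rollback.strip():
        problems.append("missing runbook-level rollback procedure")
    if not escalation_path.strip():
        problems.append("missing escalation path")
    return problems


restart = RunbookStep(
    action="kubectl rollout restart deployment/api -n production",
    expected_result="Pods restart within 60s. `kubectl get pods` shows 3/3 Running.",
    failure_handling="If pods stay in CrashLoopBackOff >2min, proceed to Rollback.",
    verification='curl -s https://api.example.com/health | jq .status returns "ok"',
    rollback="kubectl rollout undo deployment/api -n production",
)

# The rollback and escalation strings below are placeholders for illustration.
print(author_gate([restart], runbook_rollback="See Rollback section", escalation_path="Page the on-call lead"))
```

A check like this only confirms that the fields are filled in. Whether the content is specific enough still needs the Phase 3 walkthrough.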

**Phase 3: VERIFY** -- Validate the runbook is actually usable.

- Walk through the runbook as if you have never seen the system
- Flag any step that requires unstated knowledge
- Confirm the troubleshooting table covers symptoms from each step's failure mode
- Check: could someone follow this at 3am with no prior context?

**Gate**: All steps self-contained. No implicit knowledge. Troubleshooting table complete.

---

### Mode: RISK

**Framework**: IDENTIFY -> ASSESS -> MITIGATE

**Phase 1: IDENTIFY** -- Enumerate risks systematically by category.

Load `references/risk-assessment.md`.

| Category | What to Look For |
|----------|-----------------|
| Operational | Process failures, staffing gaps, system outages, single points of failure |
| Financial | Budget overruns, vendor cost increases, revenue impact, currency exposure |
| Compliance | Regulatory violations, audit findings, policy breaches, certification gaps |
| Strategic | Market changes, competitive threats, technology shifts, dependency risks |
| Reputational | Customer impact, public perception, partner relationships, data incidents |
| Security | Data breaches, access control failures, third-party vulnerabilities |

- Extend risk identification beyond obvious items. Ask: "What kills us if it happens, even if it seems unlikely?"
- Separate risks from issues. A risk might happen. An issue already has.

**Gate**: Risks enumerated across all applicable categories. Each risk has a clear description.

**Phase 2: ASSESS** -- Score each risk on probability and impact.

Apply the probability x impact matrix:

| | Low Impact | Medium Impact | High Impact |
|---|-----------|---------------|-------------|
| **High Probability** | Medium | High | Critical |
| **Medium Probability** | Low | Medium | High |
| **Low Probability** | Low | Low | Medium |

For each risk:
- Probability: base it on evidence, not optimism. "It hasn't happened yet" does not mean "low probability"
- Impact: quantify in dollars, hours, or affected users where possible
- Risk level: derived from matrix, not gut feel

**Gate**: Every risk scored. No unquantified "High" without supporting rationale.
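Deriving the level "from matrix, not gut feel" amounts to a table lookup. A minimal sketch follows; the cell values are copied from the probability x impact matrix above, and the `risk_level` function name and its case-insensitive input handling are assumptions.

```python
# (probability, impact) -> risk level, transcribed from the matrix.
RISK_MATRIX = {
    ("high", "low"): "Medium",
    ("high", "medium"): "High",
    ("high", "high"): "Critical",
    ("medium", "low"): "Low",
    ("medium", "medium"): "Medium",
    ("medium", "high"): "High",
    ("low", "low"): "Low",
    ("low", "medium"): "Low",
    ("low", "high"): "Medium",
}


def risk_level(probability: str, impact: str) -> str:
    """Look up the risk level; reject ratings outside low/medium/high."""
    key = (probability.strip().lower(), impact.strip().lower())
    if key not in RISK_MATRIX:
        raise ValueError(f"unknown probability/impact rating: {key}")
    return RISK_MATRIX[key]


print(risk_level("High", "High"))   # Critical
print(risk_level("Low", "Medium"))  # Low
```

The lookup leaves the hard part to the assessor: choosing the probability and impact ratings from evidence and recording the rationale.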

**Phase 3: MITIGATE** -- Plan mitigations and track residual risk.

For each High/Critical risk:
- Mitigation action (specific, not "monitor the situation")
- Owner (named person, not "the team")
- Timeline (date, not "soon")
- Residual risk after mitigation
- Acceptance criteria: what makes the residual risk acceptable?
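The five fields above lend themselves to a simple completeness check. The following is a sketch only: the `Mitigation` record, the `mitigation_gaps` function, and the lists of vague placeholder values are hypothetical, chosen to mirror the examples in the bullets ("monitor the situation", "the team", "soon"). The owner name and date in the usage example are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical lists of non-answers, based on the examples in the bullets.
VAGUE_ACTIONS = {"", "tbd", "monitor the situation"}
VAGUE_OWNERS = {"", "tbd", "the team"}


@dataclass
class Mitigation:
    action: str
    owner: str
    due: Optional[date]  # a real date; "soon" cannot be represented
    residual_risk: str
    acceptance_criteria: str


def mitigation_gaps(m: Mitigation) -> List[str]:
    """Return the fields that are missing or too vague to pass the gate."""
    gaps = []
    if m.action.strip().lower() in VAGUE_ACTIONS:
        gaps.append("action")
    if m.owner.strip().lower() in VAGUE_OWNERS:
        gaps.append("owner")
    if not isinstance(m.due, date):
        gaps.append("timeline")
    if not m.residual_risk.strip():
        gaps.append("residual_risk")
    if not m.acceptance_criteria.strip():
        gaps.append("acceptance_criteria")
    return gaps


plan = Mitigation(
    action="Add a second database replica in another availability zone",
    owner="A. Rivera",  # invented example name
    due=date(2025, 6, 30),  # invented example date
    residual_risk="Low",
    acceptance_criteria="Failover tested; recovery completes within 5 minutes",
)
print(mitigation_gaps(plan))
```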

**Gate**: All High/Critical risks have mitigations with owners and dates. Residual risk