ci-debug
Diagnose a failing CI run against a 10-pattern playbook. Classifies the failure, cites the relevant memory entry, proposes the exact fix command — but NEVER applies without explicit user approval. Use when a specific PR check or GitHub Actions run failed and you want a diagnosis instead of speculation. Don't use for org-wide CI sweeps (that's /status) or for app-level test failures (the playbook is CI-infra-specific).
git clone --depth 1 https://github.com/yonatangross/orchestkit /tmp/ci-debug && cp -r /tmp/ci-debug/plugins/ork/skills/ci-debug ~/.claude/skills/ci-debug
# /ci-debug — classify a failing CI run
Direct response to the recurring CI-debug pattern surfaced by `/insights`: ~12 sessions in 3 weeks doing the same classification dance. This skill encodes the 10 patterns so the dance becomes a lookup.
## Input
User invokes with one of:
- **PR number**: `/ci-debug 822` (default repo from context; ask if ambiguous)
- **Run URL**: `/ci-debug https://github.com/owner/repo/actions/runs/12345`
- **Job URL**: `/ci-debug https://github.com/owner/repo/actions/runs/X/job/Y`
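The three input forms can be told apart with plain pattern matching. The sketch below is a minimal illustration, assuming a hypothetical helper named `parse_ci_ref` (not part of the skill) and URLs shaped exactly like the examples above:

```bash
# parse_ci_ref is a hypothetical helper, not part of the skill.
# Prints one of:
#   pr|<number>
#   run|<owner>/<repo>|<run-id>
#   job|<owner>/<repo>|<run-id>|<job-id>
# Returns non-zero when the argument matches none of the three forms.
parse_ci_ref() {
  local ref="$1" out
  local url='https://github\.com/([^/]+)/([^/]+)/actions/runs/([0-9]+)'
  # Each sed expression rewrites one input form; at most one can match.
  out=$(printf '%s\n' "$ref" | sed -nE \
    -e 's#^([0-9]+)$#pr|\1#p' \
    -e "s#^${url}/job/([0-9]+)/?\$#job|\1/\2|\3|\4#p" \
    -e "s#^${url}/?\$#run|\1/\2|\3#p")
  [ -n "$out" ] && printf '%s\n' "$out"
}
```

A bare PR number carries no repository, which is why the skill falls back to context or asks when the repo is ambiguous.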
## Execution
### 1. Resolve the failing job
```bash
# From PR number:
gh pr checks <n> --repo <owner>/<repo> --json bucket,link,name \
  --jq '.[] | select(.bucket=="fail") | "\(.name)|\(.link)"'

# From run URL:
gh api repos/<owner>/<repo>/actions/runs/<run-id>/jobs \
  --jq '.jobs[] | select(.conclusion=="failure")
        | {id, name, runner_name, started_at, completed_at,
           steps: [.steps[] | select(.conclusion=="failure") | {name, number}]}'

# From job URL: the trailing path segment (Y) is already the <job_id>; go to step 2.
```
If multiple jobs failed, pick the one with the **shortest duration** — root cause is usually the first failure; later jobs cascade.
### 2. Fetch the failing log
```bash
gh api repos/<owner>/<repo>/actions/jobs/<job_id>/logs 2>&1 \
  | grep -iE '(error|fail|ERR_|CONFLICT|Process completed with exit code)' \
  | head -30
```
Capture the **FIRST distinct error message** (later lines often echo).
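One way to isolate the first distinct error is to strip the per-line timestamp and then drop repeats while preserving order. The sketch below is illustrative only: `first_distinct_errors` is a hypothetical helper, and it assumes each log line starts with an ISO-8601 timestamp followed by a space.

```bash
# first_distinct_errors is a hypothetical helper, not part of the skill.
# Reads a job log on stdin and prints up to N (default 3) distinct matching
# lines, in order of first appearance.
first_distinct_errors() {
  local limit="${1:-3}"
  # Assumption: lines begin with "<ISO-8601 timestamp>Z "; remove it so that
  # echoes of the same message compare equal.
  sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z //' \
    | grep -iE '(error|fail|ERR_|CONFLICT|Process completed with exit code)' \
    | awk '!seen[$0]++' \
    | head -n "$limit"
}

# Usage: pipe the output of the gh api .../logs call from step 2 into it, e.g.
#   ... | first_distinct_errors 3
```

The `awk '!seen[$0]++'` idiom keeps only the first occurrence of each line, so an error that is echoed again later in the log appears once.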
### 3. Classify against the playbook
Walk the patterns in order. **First match wins.**
| # | Pattern | Signature in logs | Memory ref | Proposed fix |
|---|---------|-------------------|------------|--------------|
| 1 | **Billing block** | runner_name empty + steps[] empty + ~3s duration + annotation: "recent account payments have failed or your spending limit needs to be increased" | `billing-surface-hosted-vs-self-hosted.md` | Org admin → Settings → Billing & plans → raise limit / update card. No code change. |
| 2 | **Root-lockfile drift** | `ERR_PNPM_OUTDATED_LOCKFILE` mentioning `<ROOT>/typescript/<pkg>/package.json` | `pnpm-lock-root-vs-workspace-duality.md` | `pnpm install --lockfile-only && git add pnpm-lock.yaml && git commit && git push`. |
| 3 | **uv.lock drift** | `error: The lockfile at uv.lock needs to be updated` | `changeset-release-uv-lock-drift.md` | `cd python && uv lock` then commit. |
| 4 | **ci-shared.yml missing permissions** | startup_failure pattern (empty runner_name + steps[]=[] + ~3s) BUT billing is resolved | `ci-shared-permissions-block-required.md` | Add `permissions: { contents: read, packages: read }` to the caller workflow. |
| 5 | **YAML python embed** | YAML parse error pointing at a multi-line block scalar with `python -c` | `yaml-python-embed.md` | Rewrite `python -c` as a separate shell script invocation; never inline multi-line python in YAML. |
| 6 | **actionlint shellcheck false-positive** | audit/actionlint job failing with SC2086/SC2046 on workflow YAMLs you didn't touch | `audit-actionlint-triggers-on-workflow-edit.md` | Not required check; safe to merge past if the warnings predate your change. Optional: add shellcheck disable comments. |
| 7 | **macOS BSD date %3N** | `%3N` printed literally in CI output / arithmetic fails | `macos-bsd-date-no-percent-3N.md` | Replace `date +%s%3N` with `node -e 'console.log(Date.now())'` or `python3 -c 'import time; print(int(time.time()*1000))'`. |
| 8 | **Runner pnpm Rosetta arch drift** | pnpm install fails with "wrong-arch native bin" / dlopen error on a self-hosted runner | `runner-pnpm-rosetta-arch-drift.md` | Restart the affected runner pool; root cause is node x64↔arm64 flips storing wrong-arch native bins in shared cache. |
| 9 | **Shallow clone false divergence** | `git status` reports diverged but PR was actually merged | `shallow-clone-false-divergence.md` | `git fetch origin <branch> --unshallow` then `gh pr view <n> --json state,mergeCommit` to verify. |
| 10 | **Publish run cancelled** | Publish-tag workflow run shows `conclusion=cancelled`; artifact never lands | `publish-runs-cancelled-need-redrive.md` | Re-fire via `gh workflow run publish-python.yml -f tag=<tag>` (adjust for your publish workflow). |
The memory references point at user-curated memory files (`~/.claude/projects/<project>/memory/*.md`). If your memory doesn't have them yet, the signature column is enough to classify — the memory citation is a nice-to-have, not required.
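The "first match wins" walk maps naturally onto a shell `case` statement, which tests its branches top to bottom and stops at the first hit. The sketch below is a partial illustration, not the skill's implementation: `classify_log` is a hypothetical helper, and it covers only the patterns whose signature is a plain log string (#2, #3, #7). The others need job metadata or human judgment.

```bash
# classify_log is a hypothetical helper, not part of the skill.
# Reads log text on stdin; prints the first matching pattern number or NOVEL.
classify_log() {
  local log
  log=$(cat)
  case "$log" in
    # #2: matched on the error code alone; the package.json path in the
    # signature still needs a human check.
    *ERR_PNPM_OUTDATED_LOCKFILE*) echo 2 ;;
    # #3: tolerate the filename being wrapped in backticks in the message.
    *'lockfile at uv.lock needs to be updated'*|*'lockfile at `uv.lock` needs to be updated'*) echo 3 ;;
    # #7: a literal %3N in output means date did not expand it.
    *%3N*) echo 7 ;;
    *) echo NOVEL ;;
  esac
}
```

Branch order mirrors table order, so a log containing both a pnpm lockfile error and a stray `%3N` is reported as #2.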
### 4. Report
For a **matched** pattern:
```markdown
## CI Debug: <repo> · <pr-or-run-ref>
**Failing job:** `<job name>` (<duration>s) on runner `<runner_name>`
**Failing step:** <step name> (#<step number>)
**Error excerpt:**
\`\`\`
<first 3 lines of grep'd error>
\`\`\`
**Classification:** Pattern #<n> — <pattern name>
**Reference:** memory `<memory-file.md>`
**Proposed fix:**
<exact commands, one per line>
**Will I apply this?** No — awaiting your approval. Reply "go" to ship.
```
For an **UNMATCHED** failure:
```markdown
## CI Debug: <repo> · <pr-or-run-ref> · NOVEL
**Failing step:** <step name>
**Unique log lines:**
\`\`\`
<top 10 distinct error lines>
\`\`\`
This doesn't match any of the 10 playbook patterns. Surfacing the raw
evidence for your read. Once you identify the root cause, consider
adding it to the playbook (in this SKILL.md) so the next run catches
it automatically.
```
## Headless invocation (permission mode)
`/ci-debug` REQUIRES Bash tool access — `gh pr checks`, `gh run view`, and `gh api .../jobs/.../logs` are how it fetches evidence. In headless `claude -p` runs (ci-sentinel, cron):
- Use `--permission-mode acceptEdits` — the headless "use tools without prompting" mode. The skill is read-only by design, so this grants nothing risky.
- **NEVER `dontAsk`** — it silently REFUSES permission-requiring tools (including Bash), so every analysis returns empty output with no error (#1862 Bug C).
- `claude -p` reports auth/permission failures as JSON on **stdout**, not stderr, so capture and log both streams on failure (see `shared/rules/cc-bare-auth-gotcha.md`).
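Capturing both streams can be done with a small wrapper around the headless call. The sketch below is a minimal illustration under stated assumptions: `run_capturing_both` is a hypothetical function, not part of the skill, and the example invocation in the trailing comment only combines the flags named above.

```bash
# run_capturing_both is a hypothetical wrapper, not part of the skill.
# Runs the given command with stdout and stderr kept in separate temp files.
# On success it prints stdout; on failure it logs BOTH streams to stderr,
# because the useful error may be JSON on stdout.
run_capturing_both() {
  local out err status=0
  out=$(mktemp) || return 1
  err=$(mktemp) || { rm -f "$out"; return 1; }
  "$@" >"$out" 2>"$err" || status=$?
  if [ "$status" -eq 0 ]; then
    cat "$out"
  else
    {
      printf 'command failed (exit %s)\n' "$status"
      printf -- '--- stdout ---\n'; cat "$out"
      printf -- '--- stderr ---\n'; cat "$err"
    } >&2
  fi
  rm -f "$out" "$err"
  return "$status"
}

# Example (not executed here):
#   run_capturing_both claude -p "/ci-debug 822" --permission-mode acceptEdits
```

The wrapper returns the command's own exit status, so a cron job or sentinel script can still branch on success or failure.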