
ci-debug

ci-debug diagnoses failing CI runs by matching error patterns against an 11-item playbook that covers billing blocks, lockfile drift, permission issues, and other infrastructure-level failures. Use this skill when a specific GitHub Actions run or PR check fails and you need structured classification with a proposed fix command rather than speculation. Do not use it for organization-wide CI status checks or application-level test failures.

Repository: orchestkit

Install in Claude Code

git clone --depth 1 https://github.com/yonatangross/orchestkit /tmp/ci-debug && cp -r /tmp/ci-debug/plugins/ork/skills/ci-debug ~/.claude/skills/ci-debug

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# /ci-debug — classify a failing CI run

Direct response to the recurring CI-debug pattern surfaced by `/insights`: ~12 sessions in 3 weeks doing the same classification dance. This skill encodes the 11 patterns so the dance becomes a lookup.

## Input

User invokes with one of:

- **PR number**: `/ci-debug 822` (default repo from context; ask if ambiguous)
- **Run URL**: `/ci-debug https://github.com/owner/repo/actions/runs/12345`
- **Job URL**: `/ci-debug https://github.com/owner/repo/actions/runs/X/job/Y`
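
As a rough sketch of how the three input forms could be told apart before any `gh` call, the helper below dispatches on the argument's shape. The function name `ci_debug_input_kind` is hypothetical and not part of the skill; the URL shapes are the ones listed above.

```bash
# Hypothetical helper: report which kind of /ci-debug argument was passed.
# Prints one of: pr, run, job, unknown.
ci_debug_input_kind() {
  case "$1" in
    # The job pattern must come first: a job URL also matches the run pattern.
    https://github.com/*/*/actions/runs/*/job/*) echo "job" ;;
    https://github.com/*/*/actions/runs/*)       echo "run" ;;
    # Empty, or contains any non-digit character: not a PR number.
    ''|*[!0-9]*)                                 echo "unknown" ;;
    *)                                           echo "pr" ;;
  esac
}

ci_debug_input_kind 822                                                    # pr
ci_debug_input_kind "https://github.com/owner/repo/actions/runs/12345"     # run
ci_debug_input_kind "https://github.com/owner/repo/actions/runs/1/job/2"   # job
```

Anything reported as `unknown` is a reasonable point to ask the user for clarification rather than guess.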

## Execution

### 1. Resolve the failing job

```bash
# From PR number:
gh pr checks <n> --repo <owner>/<repo> --json bucket,link,name \
  --jq '.[] | select(.bucket=="fail") | "\(.name)|\(.link)"'

# From run URL:
gh api repos/<owner>/<repo>/actions/runs/<run-id>/jobs \
  --jq '.jobs[] | select(.conclusion=="failure")
                | {id, name, runner_name, started_at, completed_at,
                   steps: [.steps[] | select(.conclusion=="failure") | {name, number}]}'
```

If multiple jobs failed, pick the one with the **shortest duration**: the root cause is usually the job that failed first, and later failures tend to cascade from it.
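
One possible way to rank failed jobs by duration is sketched below. It assumes the `started_at` and `completed_at` values from the jobs API have already been flattened to `name|started_at|completed_at` lines (a format invented for this example), that job names contain no `|`, and that GNU `date -d` is available; BSD/macOS `date` would need `-j -f` instead. The function name and sample rows are made up.

```bash
# Hypothetical helper: read "name|started_at|completed_at" lines on stdin and
# print "duration_seconds|name" for the shortest-running job.
# Assumes GNU date (the -d flag parses ISO 8601 timestamps).
shortest_failed_job() {
  while IFS='|' read -r name started completed; do
    start_s=$(date -u -d "$started" +%s)
    end_s=$(date -u -d "$completed" +%s)
    printf '%s|%s\n' "$((end_s - start_s))" "$name"
  done | sort -t'|' -k1,1n | head -n 1
}

# Made-up sample data: three failed jobs from one run.
printf '%s\n' \
  'build|2024-01-01T10:00:00Z|2024-01-01T10:04:10Z' \
  'lint|2024-01-01T10:00:00Z|2024-01-01T10:00:03Z' \
  'test|2024-01-01T10:00:05Z|2024-01-01T10:06:00Z' \
  | shortest_failed_job
# prints: 3|lint
```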

**No job in the `fail` bucket but a check won't settle?** If `gh pr checks` shows zero `fail`-bucket entries yet a status sits in `pending` and never resolves (and `gh pr view <n> --json mergeStateStatus,mergeable` returns `mergeStateStatus=UNSTABLE` with `mergeable=MERGEABLE`), this is a *stuck external status*, not a failure; jump straight to **Pattern #11**. There is no failing log to fetch, so classify on the commit-status metadata (`gh api repos/<o>/<r>/commits/<sha>/status`).

### 2. Fetch the failing log

```bash
gh api repos/<owner>/<repo>/actions/jobs/<job_id>/logs 2>&1 \
  | grep -iE '(error|fail|ERR_|CONFLICT|Process completed with exit code)' \
  | head -30
```

Capture the **FIRST distinct error message** (later lines often echo).
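
A minimal sketch of one way to isolate the first distinct messages: strip the per-line timestamp, apply the same filter as above, and drop repeats. It assumes each log line starts with an ISO 8601 timestamp followed by a space; the function name and the sample log are invented for illustration.

```bash
# Hypothetical helper: read a job log on stdin, print up to three distinct
# error lines in their original order.
first_distinct_errors() {
  # 1. Drop the leading timestamp so identical messages compare equal.
  # 2. Keep only lines matching the error filter used in step 2.
  # 3. awk prints a line only the first time it is seen.
  sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z //' \
    | grep -iE '(error|fail|ERR_|CONFLICT|Process completed with exit code)' \
    | awk '!seen[$0]++' \
    | head -n 3
}

# Made-up sample log with an echoed error line.
first_distinct_errors <<'EOF'
2024-01-01T10:00:01.0000000Z Run pnpm install --frozen-lockfile
2024-01-01T10:00:02.0000000Z ERR_PNPM_OUTDATED_LOCKFILE Cannot install with frozen-lockfile
2024-01-01T10:00:03.0000000Z ERR_PNPM_OUTDATED_LOCKFILE Cannot install with frozen-lockfile
2024-01-01T10:00:04.0000000Z ##[error]Process completed with exit code 1.
EOF
```

With the sample above, the repeated `ERR_PNPM_OUTDATED_LOCKFILE` line is reported once, followed by the exit-code line.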

### 3. Classify against the playbook

Walk the patterns in order. **First match wins.**

| # | Pattern | Signature in logs | Memory ref | Proposed fix |
|---|---------|-------------------|------------|--------------|
| 1 | **Billing block** | runner_name empty + steps[] empty + ~3s duration + annotation: "recent account payments have failed or your spending limit needs to be increased" | `billing-surface-hosted-vs-self-hosted.md` | Org admin → Settings → Billing & plans → raise limit / update card. No code change. |
| 2 | **Root-lockfile drift** | `ERR_PNPM_OUTDATED_LOCKFILE` mentioning `<ROOT>/typescript/<pkg>/package.json` | `pnpm-lock-root-vs-workspace-duality.md` | `pnpm install --lockfile-only && git add pnpm-lock.yaml && git commit && git push`. |
| 3 | **uv.lock drift** | `error: The lockfile at uv.lock needs to be updated` | `changeset-release-uv-lock-drift.md` | `cd python && uv lock` then commit. |
| 4 | **ci-shared.yml missing permissions** | startup_failure pattern (empty runner_name + steps[]=[] + ~3s) BUT billing is resolved | `ci-shared-permissions-block-required.md` | Add `permissions: { contents: read, packages: read }` to the caller workflow. |
| 5 | **YAML python embed** | YAML parse error pointing at a multi-line block scalar with `python -c` | `yaml-python-embed.md` | Rewrite `python -c` as a separate shell script invocation; never inline multi-line python in YAML. |
| 6 | **actionlint shellcheck false-positive** | audit/actionlint job failing with SC2086/SC2046 on workflow YAMLs you didn't touch | `audit-actionlint-triggers-on-workflow-edit.md` | Not required check; safe to merge past if the warnings predate your change. Optional: add shellcheck disable comments. |
| 7 | **macOS BSD date %3N** | `%3N` printed literally in CI output / arithmetic fails | `macos-bsd-date-no-percent-3N.md` | Replace `date +%s%3N` with `node -e 'console.log(Date.now())'` or `python3 -c 'import time; print(int(time.time()*1000))'`. |
| 8 | **Runner pnpm Rosetta arch drift** | pnpm install fails with "wrong-arch native bin" / dlopen error on a self-hosted runner | `runner-pnpm-rosetta-arch-drift.md` | Restart the affected runner pool; root cause is node x64↔arm64 flips storing wrong-arch native bins in shared cache. |
| 9 | **Shallow clone false divergence** | `git status` reports diverged but PR was actually merged | `shallow-clone-false-divergence.md` | `git fetch origin <branch> --unshallow` then `gh pr view <n> --json state,mergeCommit` to verify. |
| 10 | **Publish run cancelled** | Publish-tag workflow run shows `conclusion=cancelled`; artifact never lands | `publish-runs-cancelled-need-redrive.md` | Re-fire via `gh workflow run publish-python.yml -f tag=<tag>` (adjust for your publish workflow). |
| 11 | **Vercel status orphaned (path-skip)** | No job in the `fail` bucket, but `Vercel` appears as a *commit status* (not a check-run) stuck `state=pending` with `created_at == updated_at` and no terminal update; all GitHub Actions checks green; `mergeStateStatus=UNSTABLE` + `mergeable=MERGEABLE` on an unprotected base branch | `vercel-pending-orphaned-on-path-skip.md` | Not a failure — cosmetic. Vercel posted a `pending` status then **skipped** the build (project-root path filter, e.g. a docs-only change that never touches `apps/web`), orphaning the status. Safe to merge: `gh pr merge <n> --repo <owner>/<repo> --squash`. Permanent fix: the Vercel project's **Ignored Build Step** must exit `0` AND report success for skipped paths so the status flips instead of dangling. |

The memory references point at user-curated memory files (`~/.claude/projects/<project>/memory/*.md`). If your memory doesn't have them yet, the signature column is enough to classify — the memory citation is a nice-to-have, not required.
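
The "first match wins" walk can be sketched as an ordered `case` statement. This is only an illustration under simplifying assumptions: it covers just the patterns that have a literal string signature in the table (#1, #2, #3, #6, #7), it ignores the extra conditions those rows list (for example the `<ROOT>/typescript/<pkg>/package.json` path for #2 or the job name for #6), and the function name `classify_ci_log` is hypothetical.

```bash
# Hypothetical classifier: takes log text as $1 and prints "<n> <pattern name>"
# for the first matching playbook row, or "unmatched".
# case tries its branches top to bottom, which gives first-match-wins ordering.
classify_ci_log() {
  case "$1" in
    *"recent account payments have failed"*)         echo "1 Billing block" ;;
    *ERR_PNPM_OUTDATED_LOCKFILE*)                     echo "2 Root-lockfile drift" ;;
    *"The lockfile at uv.lock needs to be updated"*) echo "3 uv.lock drift" ;;
    *SC2086*|*SC2046*)                                echo "6 actionlint shellcheck false-positive" ;;
    *%3N*)                                            echo "7 macOS BSD date %3N" ;;
    *)                                                echo "unmatched" ;;
  esac
}

classify_ci_log "error: The lockfile at uv.lock needs to be updated"
# prints: 3 uv.lock drift
```

Patterns whose signature is job metadata rather than log text (#1 and #4's empty `runner_name`, #10's `conclusion=cancelled`, #11's commit status) would need the JSON from step 1 instead of a string match.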

### 4. Report

For a **matched** pattern:

```markdown
## CI Debug: <repo> · <pr-or-run-ref>

**Failing job:** `<job name>` (<duration>s) on runner `<runner_name>`
**Failing step:** <step name> (#<step number>)
**Error excerpt:**
\`\`\`
<first 3 lines of grep'd error>
\`\`\`

**Classification:** Pattern #<n> — <pattern name>
**Reference:** memory `<memory-file.md>`

**Proposed