Skill · 91 repo stars · updated 3mo ago

error-recovery

The error-recovery Claude Code skill classifies workflow failures as either critical blockers that require manual intervention or fixable issues that automatic recovery strategies can handle. Use it when a sprint or feature fails during implementation to decide whether the system can auto-fix the problem through retries (for example, test failures or dependency issues) or must stop for the user (for example, CI pipeline failures, security vulnerabilities, or deployment failures).

Repository: Spec-Flow

Install in Claude Code

git clone --depth 1 https://github.com/marcusgoll/Spec-Flow /tmp/error-recovery && cp -r /tmp/error-recovery/.claude/skills/error-recovery ~/.claude/skills/error-recovery

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

<objective>
Classify workflow failures into actionable categories and execute recovery strategies.

Use this skill when:
- Sprint fails during /implement-epic
- Feature fails during /implement
- Quality gate blocks deployment
- Any workflow phase returns FAILED status
</objective>

<quick_start>
When a workflow component fails:

1. **Classify the failure** using the decision tree below
2. **If fixable**: Execute auto-fix strategies (max 3 attempts)
3. **If critical**: Stop immediately, report to user
4. **If unknown**: Treat as critical (safe default)

Classification determines whether workflow can auto-recover or requires user intervention.
</quick_start>

<failure_classification>
## Critical Blockers (MUST STOP)

These failures require manual intervention - never auto-retry:

| Failure Type | Detection | Why Critical |
|--------------|-----------|--------------|
| CI Pipeline | `ci_pipeline_failed: true` in state.yaml | GitHub Actions/GitLab CI failed - indicates code issue |
| Security Scan | `security_scan_failed: true` | High/Critical CVEs found - security risk |
| Deployment | `deployment_failed: true` | Production/staging crashed - user-facing impact |
| Contract Violation | `contract_violations > 0` | API contract broken - breaks consumers |

**Action**: Stop workflow, report error details, require `/continue` after manual fix.

## Fixable Issues (CAN AUTO-RETRY)

These failures often resolve with automated fixes:

| Failure Type | Detection | Auto-Fix Strategies |
|--------------|-----------|---------------------|
| Test Failures | `tests_failed: true` (no CI failure) | re-run-tests, check-dependencies |
| Build Failures | `build_failed: true` | clear-cache, reinstall-deps, rebuild |
| Dependency Issues | `dependencies_failed: true` | clean-install, clear-lockfile |
| Infrastructure | `infrastructure_issues: true` | restart-services, check-ports |
| Type Errors | `type_check_failed: true` | (manual fix usually required) |

**Action**: Attempt auto-fix strategies (max 3 attempts), then escalate if all fail.
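The table above can be read as a lookup from failure flag to strategy names. A minimal sketch of that lookup follows; `strategies_for` is a hypothetical helper name, not something the skill defines. It copies the strategy names verbatim from the table, including `reinstall-deps` and `clear-lockfile`, for which the skill gives no command.

```bash
# Hypothetical helper: map a failure flag (as it appears in state.yaml)
# to the space-separated strategy names from the table above.
strategies_for() {
    case "$1" in
        tests_failed)          echo "re-run-tests check-dependencies" ;;
        build_failed)          echo "clear-cache reinstall-deps rebuild" ;;
        dependencies_failed)   echo "clean-install clear-lockfile" ;;
        infrastructure_issues) echo "restart-services check-ports" ;;
        type_check_failed)     echo "" ;;   # table lists no automated strategy
        *)                     return 1 ;;  # unknown flag: caller should treat as critical
    esac
}
```

An empty result for `type_check_failed` means there is nothing to attempt, so a caller would go straight to escalation.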

## Classification Decision Tree

```
Is CI pipeline failing?
├── Yes → CRITICAL (code issue, needs manual fix)
└── No → Continue...

Is security scan failing?
├── Yes → CRITICAL (security risk, needs review)
└── No → Continue...

Is deployment failing?
├── Yes → CRITICAL (production impact, needs rollback)
└── No → Continue...

Are tests failing (locally only)?
├── Yes → FIXABLE (try: re-run, check deps, clear cache)
└── No → Continue...

Is build failing?
├── Yes → FIXABLE (try: clear cache, reinstall, rebuild)
└── No → Continue...

Unknown failure?
└── CRITICAL (safe default - don't auto-retry unknown issues)
```
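
The decision tree could be sketched as a single shell function. This is an illustration under assumptions, not the skill's implementation: `classify_failure` is a hypothetical name, and state.yaml is assumed to hold flat `key: value` lines, as the grep-based detection elsewhere in this skill suggests. The sketch also includes the rows from the two tables that the tree omits (contract violations, dependency, infrastructure, and type-check flags).

```bash
# Sketch: print CRITICAL or FIXABLE for a given state.yaml path.
# Assumes flat "key: value" lines; nested YAML would need a real parser.
classify_failure() {
    state_file="$1"

    # An unreadable state file is an unknown failure: safe default.
    if [ ! -r "$state_file" ]; then
        echo "CRITICAL"
        return 0
    fi

    # Critical blockers are checked first, mirroring the tree order.
    if grep -Eq '^[[:space:]]*(ci_pipeline_failed|security_scan_failed|deployment_failed):[[:space:]]*true' "$state_file"; then
        echo "CRITICAL"
        return 0
    fi

    # contract_violations > 0 (any count starting with a non-zero digit).
    if grep -Eq '^[[:space:]]*contract_violations:[[:space:]]*[1-9]' "$state_file"; then
        echo "CRITICAL"
        return 0
    fi

    # Fixable flags from the "Fixable Issues" table.
    if grep -Eq '^[[:space:]]*(tests_failed|build_failed|dependencies_failed|infrastructure_issues|type_check_failed):[[:space:]]*true' "$state_file"; then
        echo "FIXABLE"
        return 0
    fi

    # Nothing recognized: treat as critical rather than auto-retry.
    echo "CRITICAL"
}
```

Because critical flags are tested before fixable ones, a state file that has both `tests_failed: true` and `ci_pipeline_failed: true` is classified as CRITICAL, which matches the "(no CI failure)" condition in the table.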
</failure_classification>

<auto_fix_strategies>
## Strategy Execution

Execute strategies in order until one succeeds, with progressively longer delays between attempts.

### re-run-tests
```bash
# Tests may be flaky - simple re-run often works
cd "${SPRINT_DIR}" && npm test
# or: pytest, cargo test, etc.
```
**Timeout**: 120s
**When**: Test failures that aren't CI-related

### check-dependencies
```bash
# Verify all dependencies installed
cd "${SPRINT_DIR}" && npm list --depth=0
# If missing deps found:
npm install
```
**Timeout**: 120s
**When**: Import errors, module not found

### clear-cache
```bash
# Clear build/test caches
cd "${SPRINT_DIR}"
rm -rf .next .cache coverage node_modules/.cache dist build
npm run build
```
**Timeout**: 180s
**When**: Stale cache causing build issues

### clean-install
```bash
# Nuclear option for dependency issues
cd "${SPRINT_DIR}"
npm cache clean --force
rm -rf node_modules package-lock.json
npm install
```
**Timeout**: 300s
**When**: Corrupted node_modules, lockfile conflicts

### rebuild
```bash
# Full rebuild from scratch
cd "${SPRINT_DIR}"
rm -rf dist build .next
npm run build
```
**Timeout**: 180s
**When**: Build artifacts corrupted

### restart-services
```bash
# Restart Docker/database services
docker-compose down
docker-compose up -d
sleep 10  # Wait for services to initialize
```
**Timeout**: 150s
**When**: Database connection failures, service unavailable

### check-ports
```bash
# Kill processes blocking required ports
lsof -ti:3000,5432,6379 | xargs kill -9 2>/dev/null || true
```
**Timeout**: 30s
**When**: Port already in use errors
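
The skill states a timeout for each strategy but does not say how it is enforced. One possible way, assuming GNU coreutils `timeout` is installed (on macOS it may only be available as `gtimeout`), is a small wrapper such as the hypothetical `run_with_timeout` below.

```bash
# Sketch: run one strategy command under a time limit in seconds.
# `timeout` exits with status 124 when the limit is reached.
# Note: `timeout` runs external commands, not shell functions, so a
# multi-step strategy would be passed as `bash -c '...'` or a script.
run_with_timeout() {
    limit_seconds="$1"
    shift
    status=0
    timeout "$limit_seconds" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "strategy timed out after ${limit_seconds}s" >&2
    fi
    return "$status"
}
```

For example, `run_with_timeout 120 npm test` would apply the 120s limit listed for re-run-tests.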

## Retry Logic

```
Attempt 1: Try all strategies in order
  ↓ (wait 5s)
Attempt 2: Try all strategies again
  ↓ (wait 10s)
Attempt 3: Final attempt
  ↓ (if still failing)
Escalate to user (all strategies exhausted)
```

Maximum total retry time: ~15 minutes before escalation.
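
The retry flow above can be sketched as a loop that doubles the wait after each failed attempt (5s, then 10s). The names `retry_with_backoff` and `RETRY_BASE_DELAY` are hypothetical; the command passed in stands for "try all strategies in order" and is expected to return 0 once the failure has cleared.

```bash
# Sketch: run a recovery command up to 3 times with 5s and 10s waits.
# RETRY_BASE_DELAY (seconds) is an assumed override for the first wait.
retry_with_backoff() {
    max_attempts=3
    delay="${RETRY_BASE_DELAY:-5}"
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            echo "recovered on attempt ${attempt}"
            return 0
        fi
        # Wait only if another attempt will follow.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
            delay=$((delay * 2))
        fi
        attempt=$((attempt + 1))
    done
    echo "all strategies exhausted after ${max_attempts} attempts; escalating to user" >&2
    return 1
}
```

A non-zero return is the signal to stop the workflow and report to the user, as described for exhausted auto-fix attempts.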
</auto_fix_strategies>

<usage_in_commands>
## How to Use This Skill

### In implement-epic.md or implement.md:

```markdown
When sprint/feature fails:

1. Load error-recovery skill:
   Skill("error-recovery")

2. Classify failure:
   - Read state.yaml for failure indicators
   - Apply decision tree from skill
   - Determine: CRITICAL or FIXABLE

3. If FIXABLE and auto-mode enabled:
   - Execute strategies from skill (max 3 attempts)
   - Re-check status after each attempt
   - If recovered: continue workflow
   - If exhausted: escalate

4. If CRITICAL or auto-fix exhausted:
   - Stop workflow
   - Report error with classification
   - Instruct: "Fix manually, then /continue"
```

### Example Integration:

```bash
# After sprint execution returns failure
SPRINT_STATE=$(cat "${EPIC_DIR}/sprints/${SPRINT_ID}/state.yaml")

# Check for critical blockers first
if echo "$SPRINT_STATE" | grep -q "ci_pipeline_failed: true"; then
    echo "CRITICAL: CI pipeline failed - manual fix required"
    exit 1
fi

if echo "$SPRINT_STATE" | grep -q "security_scan_failed: true"; then
    echo "CRITICAL: Security vulnerabilities detected - manual review required"
    exit 1
fi

# Check for fixable issues
if echo "$SPRINT_STATE" | grep -q "tests_failed: true"; then
    echo "FIXABLE: Tests failing - attempting auto-recovery..."
    # Execute strategies from skill
fi
```
</usage_in_commands>

<reporting>
## Error Reporting Format

When escalating t

From the same repository

implement-epic (Slash Command)

Execute multiple sprints in parallel based on dependency graph from sprint-plan.md

build-local (Slash Command)

Build and validate locally for projects without remote deployment (prototypes, experiments, local-only dev)

epic (Slash Command)

Execute multi-sprint epic workflow from interactive scoping through deployment with parallel sprint execution and self-improvement

feature (Slash Command)

Execute feature development workflow from specification through production deployment with automated quality gates

help (Slash Command)

Analyze workflow state and provide context-aware guidance with visual progress indicators and recommended next steps

init (Slash Command)

Initialize project documentation, preferences, or design tokens

quick (Slash Command)

Implement small bug fixes and features (<100 LOC) without full workflow. Use for single-file changes, bug fixes, refactors, and minor enhancements that can be completed in under 30 minutes.

ultrathink (Slash Command)

Enter deep craftsman mode: question everything, plan like Da Vinci, craft insanely great solutions, then materialize to roadmap