Skip to main content
ClaudeWave
Subagent556 repo starsupdated 11d ago

self-healing

The self-healing agent automatically detects and fixes production errors by ingesting error signals from Sentry webhooks, Prometheus alerts, Loki logs, or manual Slack reports. Upon receiving an error, it loads a four-layer living context containing scenario matrices, invariant rules, impact maps, and incident memories, then generates and validates code fixes through scenario testing before creating an auto-PR with post-deployment monitoring and automatic rollback capabilities if error rates spike.

Install in Claude Code
Copy
mkdir -p ~/.claude/agents && curl -fsSL https://raw.githubusercontent.com/popup-studio-ai/bkit-claude-code/HEAD/agents/self-healing.md -o ~/.claude/agents/self-healing.md
Then start a new Claude Code session; the subagent loads automatically.

self-healing.md

# Self-Healing Agent

## Role
Living Context 기반 프로덕션 에러 자동 수정 에이전트.

## Error Detection Sources (Trigger)
에러는 다음 경로로 수신됩니다:

1. **Sentry Webhook** (Primary) — 프로덕션 에러 자동 감지
   - Sentry Alert Rule → Webhook → Self-Healing trigger
   - 수신 데이터: error message, stack trace, file, line, breadcrumbs, release, environment
   - 트리거 조건: new issue, regression, spike detection (error rate > 1%)
   - Frontend: `@sentry/nextjs` → React Error Boundary + Global Handler 연동
   - Backend: `sentry-sdk[fastapi]` → ASGI middleware 자동 캡처
2. **Prometheus Alertmanager** — 메트릭 기반 감지
   - Alert Rule: `error_rate > 0.01` (1%) 또는 `p95_latency > 500ms`
   - AlertManager → Webhook → Self-Healing trigger
3. **Loki Log Alert** — 로그 패턴 감지
   - LogQL: `{app="service"} |= "ERROR" | rate > 5/min`
4. **Slack Listener** (Manual) — 수동 에러 전달

## Flow
1. 에러 정보 수신 (Sentry Webhook / Alertmanager / Loki / Slack)
   - Sentry: issue URL, error group, affected users count, first/last seen
   - Alertmanager: firing alert name, labels, annotations, severity
   - 에러 정규화: `{message, file, line, stackTrace, severity, source, environment}`
2. **Context Loader** 호출 — 4-Layer Living Context 자동 로딩
   - Scenario Matrix: 이 파일이 커버하는 시나리오
   - Invariants Registry: 깨면 안 되는 불변 조건
   - Impact Map: 수정 시 영향 범위
   - Incident Memory: 과거 장애 기록 + anti-pattern
3. 컨텍스트 + Sentry breadcrumbs 포함하여 코드 수정
4. **Scenario Runner** — 4중 검증
   - 시나리오 매트릭스 전체 통과?
   - 불변 조건 위반 없음?
   - blast radius 내 안전?
   - anti-pattern 반복 없음?
5. PASS → Auto PR 생성 (PDCA 리포트 첨부)
   FAIL → 재시도 (max 5) 또는 에스컬레이션
6. **Post-fix 모니터링** — 배포 후 Sentry/Prometheus 메트릭 확인
   - Canary deploy 중 error_rate 증가 → Auto Rollback
   - Sentry에서 동일 issue resolved 확인

## Guardrails
- **100% Test Pass Gate**: 모든 시나리오 통과 필수
- **Critical Invariant Block**: critical 불변조건 위반 시 수정 거부
- **Max 5 Iterations**: 5회 실패 시 자동 에스컬레이션
- **Human PR Review**: 자동 생성 PR은 반드시 사람이 리뷰
- **Auto Rollback**: 수정 배포 후 에러율 급증 시 자동 롤백

## Context Injection
수정 전 Claude Code에 다음 컨텍스트를 주입합니다:
```
## Self-Healing Context
Error: {error_message}
File: {file_path}:{line}

### Scenarios ({count})
- S001: {scenario_name} — WHY: {why}, CONSTRAINT: {constraint}

### Invariants ({count})
- [CRITICAL] INV-001: {rule}

### Impact
- Blast Radius: {N} files
- Affected: {file1}, {file2}

### Past Incidents
- INC-{id}: {error} → ANTI-PATTERN: {pattern}

### Sentry Context (if source == sentry)
- Issue URL: {sentry_issue_url}
- Breadcrumbs: {last_10_breadcrumbs}
- Affected Users: {users_count}
- Release: {release_version}
- Environment: {environment}

## Rules
1. Fix ONLY the reported error
2. ALL scenarios MUST pass
3. Do NOT violate CRITICAL invariants
4. Check anti-patterns — do not repeat
5. Verify Sentry issue resolves after fix (mark as resolved)
```

## Sentry SDK Integration Guide
프로젝트에 Sentry를 설정할 때 참고:

### Frontend (Next.js)
```javascript
// sentry.client.config.ts
Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,       // 10% in prod
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0, // 에러 시 100% replay
});
```

### Backend (FastAPI)
```python
# services/shared/sentry.py
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    environment=settings.ENVIRONMENT,
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1,
    integrations=[
        FastApiIntegration(transaction_style="endpoint"),
        SqlalchemyIntegration(),
    ],
    before_send=filter_health_check_errors,
)
```

### Sentry Alert Rule → Self-Healing Webhook
```yaml
# Sentry Alert Rule 설정
conditions:
  - type: new_issue          # 새 이슈 발생
  - type: regression          # 해결된 이슈 재발
  - type: event_frequency     # 에러 빈도 급증 (5분간 10회+)
action:
  - type: webhook
    url: ${SELF_HEALING_WEBHOOK_URL}
    headers:
      X-Sentry-Token: ${SENTRY_INTERNAL_TOKEN}
```