error-handling
Graceful degradation and meaningful error messages. Errors are first-class citizens, not afterthoughts. Every error path is designed, not discovered.
git clone --depth 1 https://github.com/DevelopersGlobal/ai-agent-skills /tmp/error-handling && cp -r /tmp/error-handling/skills/error-handling ~/.claude/skills/error-handlingSKILL.md
## Overview
Error handling is not defensive programming — it's a user experience. When things go wrong (and they will), the system should degrade gracefully, give users actionable information, and leave enough telemetry to diagnose and fix the problem.
## When to Use
- When writing any code that can fail (I/O, network, parsing, user input)
- When reviewing error handling in existing code
- Before any service goes to production
## Process
### Step 1: Design Error Paths Explicitly
1. For every operation, list: what can fail? What does failure look like?
2. Classify failures:
- **Transient**: Retry likely to succeed (network blip, temporary unavailability)
- **Client error**: Bad input from the caller (4xx) — don't retry
- **System error**: Internal failure (5xx) — alert, investigate
3. Design the failure path for each class before writing the happy path.
**Verify:** Error classes defined for every external operation.
### Step 2: Meaningful Error Messages
4. Every error message answers: what went wrong? How can the caller fix it?
- ✅ "Invalid email format. Expected: user@domain.com"
- ❌ "Validation error"
5. User-facing errors: friendly language, no stack traces.
6. Developer-facing errors (logs): full context, request ID, stack trace.
7. Never expose internal system details (DB schema, file paths) in user-facing errors.
**Verify:** Each error message would help a user or developer understand and fix the problem.
### Step 3: Retry with Backoff
8. Transient errors: retry with exponential backoff + jitter.
9. Maximum retries: 3 (not infinite).
10. After max retries: fail with a clear error, log the final failure.
11. Non-transient errors (validation, auth): never retry.
**Verify:** Retry logic has a maximum. Non-transient errors don't retry.
### Step 4: Graceful Degradation
12. Identify non-critical dependencies. If they fail, degrade — don't crash.
13. Example: recommendation engine fails → show default content, not 500.
14. Circuit breaker pattern for failing dependencies: fail fast after threshold, recover automatically.
**Verify:** Every non-critical dependency has a defined degraded state.
### Step 5: Structured Error Responses (APIs)
15. API errors return consistent structure:
```json
{
"error": {
"code": "INVALID_EMAIL",
"message": "The email address format is invalid.",
"requestId": "req_abc123"
}
}
```
16. HTTP status codes used correctly: 400 (client error), 404 (not found), 429 (rate limited), 500 (server error).
## Common Rationalizations (and Rebuttals)
| Excuse | Rebuttal |
|--------|----------|
| "I'll add error handling later" | Later means in production, under pressure, while users are impacted. |
| "This can't fail" | Everything can fail. Network calls, disk writes, parsing — all can fail. |
| "The error message doesn't matter" | It matters when a developer is debugging at 2am. |
## Verification
- [ ] Error classes defined for every external operation
- [ ] Error messages answer: what went wrong? How to fix?
- [ ] Transient errors retry with max limit
- [ ] Non-critical dependencies have graceful degraded states
- [ ] API errors return consistent structured format
- [ ] No internal system details in user-facing errors
## References
- [observability skill](../observability/SKILL.md)
- [debugging-methodology skill](../debugging-methodology/SKILL.md)Validates, parses, and sanitizes AI-generated outputs before they reach end users or downstream systems. Structured output enforcement, schema validation, and fallback handling.
Design stable, versioned, self-documenting APIs. Easy to use correctly, hard to use incorrectly. Apply Hyrum's Law from day one.
Automated quality gates from commit to production. Every merge to main is potentially shippable. No manual steps in the deployment path.
Get layered, context-aware explanations of unfamiliar code. Understand what it does, why it was written that way, and how to work with it safely.
Structured code review focusing on correctness, security, and maintainability. Correctness before style. Every reviewer comment must be actionable.