# fail-fast-no-hedging

This Claude Code skill audits Python codebases for component hedging: defensive error handling that masks failures of required infrastructure as normal operation. Use it during code reviews and architectural assessments to enforce fail-fast principles, so that systems fail loudly when critical dependencies break instead of degrading silently. Failing loudly makes infrastructure failures immediately detectable, which improves production reliability and operational visibility.
`git clone --depth 1 https://github.com/taylorsatula/mira-OSS /tmp/fail-fast-no-hedging && cp -r /tmp/fail-fast-no-hedging/.claude/skills/fail-fast-no-hedging ~/.claude/skills/fail-fast-no-hedging`

SKILL.md
# Fail-Fast Engineering: Architectural Honesty Over Silent Degradation
## 🚨 Core Principle
**Hedging** is treating **required** infrastructure as optional through defensive try/except blocks that mask failures as normal operation.
**The Critical Question:** Is this component required for the system to work correctly, or is it a genuine optional enhancement?
When **required** infrastructure fails, the system MUST fail loudly. When **optional** features fail, logging and continuing may be appropriate.
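The distinction can be sketched in a few lines. This is a minimal illustration, not code from the audited project: `DatabaseUnavailable`, `load_user`, `load_avatar`, and the `db` / `avatar_service` objects are hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger(__name__)


class DatabaseUnavailable(RuntimeError):
    """Hypothetical error raised when the required database cannot be reached."""


def load_user(user_id, db):
    # Required infrastructure: no try/except here, so a database failure
    # propagates to the caller instead of looking like "user not found".
    return db.fetch_user(user_id)


def load_avatar(user_id, avatar_service):
    # Genuinely optional enhancement: log the failure and continue without it.
    try:
        return avatar_service.fetch(user_id)
    except Exception:
        logger.warning(
            "Avatar lookup failed for %s; continuing without it",
            user_id,
            exc_info=True,
        )
        return None
```

Whether a component belongs in the first or the second category is a judgment about the system, which is why the Critical Question has to be answered before any handler is written.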
## 📋 Assessment Output Format (REQUIRED)
When analyzing a codebase, conclude with this structured list of all hedging issues found:
```markdown
## Hedging Anti-Patterns Found in [directory/module name]

**N files with component hedging violations:**

1. **filename.py** (SEVERITY)
   - Location: `function_name()` lines X-Y
   - Issue: Brief description of what's wrong
   - Hedging: Explanation of how it's treating required infrastructure as optional
   - Impact: What happens in production when infrastructure fails

2. **filename.py** (SEVERITY)
   - Location: `function_name()` lines X-Y
   - Issue: Brief description of what's wrong
   - Hedging: Explanation of how it's treating required infrastructure as optional
   - Impact: What happens in production when infrastructure fails
```
**Example (from lt_memory/ audit):**
```markdown
## Hedging Anti-Patterns Found in lt_memory/

**3 files with component hedging violations:**

1. **db_access.py** (HIGH SEVERITY)
   - Location: `get_or_create_entity()` lines 916-933
   - Issue: Returns `None` when database INSERT...RETURNING fails, masking query failure as "entity not found"
   - Hedging: Docstring declares return type as `Entity`, implementation silently returns `None` on infrastructure failure
   - Impact: Caller cannot distinguish between "entity doesn't exist" (legitimate) vs "database query failed" (infrastructure down)

2. **extraction.py** (HIGH SEVERITY)
   - Location: `_parse_extraction_response()` lines 396-397 and 430
   - Issue: Returns `[]` (empty list) when JSON parsing fails, instead of raising as docstring declares
   - Hedging: Docstring says "Raises: ValueError" but code returns `[]` on parse failures
   - Impact: Caller sees empty list and treats it as "no memories extracted" when actually LLM returned invalid JSON
```
This format makes all issues immediately scannable before remediation work begins.
---
## 🎯 Quick-Start Guide
When analyzing a new codebase:
1. **Identify Infrastructure Dependencies**
   - Database connections
   - Cache/session stores
   - External APIs
   - Message queues
   - File systems
2. **Apply Three Diagnostic Tests**
   - **Semantic Distinction Test**: Can you distinguish "no data" from "infrastructure down"?
   - **Never Executes Test**: Will this fallback realistically run during normal operation?
   - **Contract Match Test**: Does behavior match the docstring/type hints?
3. **Look for These Red Flags**
```
except Exception: return [] # Infrastructure failure → empty data
except Exception: return False # Connection error → "not allowed"
except Exception: return None # Database down → "not found"
```
## 🌍 Real-World Patterns from Production Codebases
Based on systematic removal of 40+ hedging anti-patterns across production systems, these specific patterns emerge repeatedly:
### 1. Infrastructure Failures Converted to Client Errors (CRITICAL)
**Pattern**: Catching database/service failures and raising ValidationError (400) instead of letting them propagate as 500s.
```python
# REAL EXAMPLE: CNS API Layer
def execute_action(self, action: str, data: Dict) -> Dict:
    try:
        session_manager = get_shared_session_manager()
        lt_db = LTMemoryDB(session_manager)
    except Exception as e:
        # Database down converted to "your input is invalid"!
        if "connection" in str(e) or "database" in str(e):
            raise ValidationError(f"Database connection failed: {e}")
```
**Impact**: Users see "Bad Request" when database is down. Monitoring doesn't alert (watches 500s, not 400s). Operators think users are sending bad data while infrastructure burns.
**Fix**: Remove business-layer exception translation. Let infrastructure exceptions bubble to API boundary where proper HTTP status translation happens.
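A framework-free sketch of that fix follows. Only the `ValidationError` name comes from the example above (here it is a local stand-in class). `handle_request`, the `connect` factory, and the set of valid actions are assumptions made for illustration, and a real project would do this translation in its web framework's error handlers.

```python
import logging

logger = logging.getLogger(__name__)


class ValidationError(Exception):
    """Stand-in for the project's ValidationError: the client sent bad input."""


def execute_action(action, data, connect):
    # Business layer: validate input, then use the database directly.
    if action not in {"search", "store"}:  # hypothetical action names
        raise ValidationError(f"Unknown action: {action}")
    db = connect()  # no try/except: connection failures propagate unchanged
    return db.run(action, data)


def handle_request(action, data, connect):
    """API boundary: the one place exceptions become HTTP status codes."""
    try:
        return 200, execute_action(action, data, connect)
    except ValidationError as exc:
        return 400, {"error": str(exc)}
    except Exception:
        # Infrastructure failure: log with traceback and report a 500
        # so monitoring that watches server errors can alert.
        logger.exception("Unhandled failure while executing %r", action)
        return 500, {"error": "Internal server error"}
```

The broad `except` at the boundary is not hedging, because it reports the failure as a server error instead of disguising it as a client mistake or as empty data.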
### 2. The Availability Flag Plague (PERVASIVE)
**Pattern**: Setting `self.component_available = False` during init, then checking it hundreds of times throughout the codebase.
```python
# REAL EXAMPLE: ValkeyClient with 200+ defensive checks
class ValkeyClient:
    def __init__(self):
        try:
            self._init_connections()
            self.valkey_available = True
        except Exception:
            self.valkey_available = False

    def get(self, key: str) -> Optional[Any]:
        if not self.valkey_available:  # One of 200+ checks!
            return None
        return self.valkey.get(key)
```
**Impact**:
- Every operation has defensive check overhead
- Infrastructure failures silently masked as "feature disabled"
- Dead code paths when component is actually required
- False sense of "graceful degradation"
**Fix**: Remove availability tracking entirely. If component is required, fail at initialization. The "graceful degradation" never actually helps - it just delays the inevitable failure.
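As a rough sketch of what the client looks like without the flag: `CacheClient` and the `connect` factory are hypothetical stand-ins, not the real ValkeyClient, and the sketch assumes the cache is required.

```python
class CacheClient:
    """Hypothetical required-cache client with no availability flag."""

    def __init__(self, connect):
        # `connect` is an assumed factory that raises if the server is
        # unreachable. No try/except: a failed connection aborts construction,
        # so the process fails at startup instead of limping along.
        self._conn = connect()

    def get(self, key):
        # No availability check. None can only mean "key not present",
        # never "cache down"; connection errors propagate to the caller.
        return self._conn.get(key)
```

With construction failing loudly, every `if not self.valkey_available` branch in the original design becomes unreachable and can be deleted.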
### 3. Silent Success Claims on Failure (DECEPTIVE)
**Pattern**: Returning `{"success": True, "value": None}` when infrastructure fails.
```python
# REAL EXAMPLE: Calendar configuration endpoint
def get_calendar_config(self, user_id: str) -> Dict:
    try:
        config = credential_service.get_credential(user_id, "calendar_url")
        return {"success": True, "calendar_url": config}
    except Exception:
        # Vault down? Claim success anyway!
        return {"success": True, "calendar_url": None, "message": "Not configured"}
```
**Impact**: Client cannot distinguish "user hasn't configured calendar" from "credential service is down".
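One possible direction for this pattern, sketched under an assumption the source does not confirm: that `get_credential` returns `None` when nothing is stored and raises when the credential service itself fails. The `credential_service` parameter is passed in here only to keep the example self-contained.

```python
def get_calendar_config(user_id, credential_service):
    # Assumed contract: get_credential returns None for "nothing stored"
    # and raises when the credential service is unreachable.
    # No try/except, so a service failure propagates to the API boundary.
    calendar_url = credential_service.get_credential(user_id, "calendar_url")
    if calendar_url is None:
        # Legitimate "not configured" answer, reached only when the lookup worked.
        return {"success": True, "calendar_url": None, "message": "Not configured"}
    return {"success": True, "calendar_url": calendar_url}
```

Under that contract, `"success": True` is only ever returned after the credential service has actually answered.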