Skill · 473 repo stars · updated 20d ago

fail-fast-no-hedging

This Claude Code skill audits Python codebases for component hedging: anti-patterns in which defensive error handling masks the failure of required infrastructure as normal operation. Use it during code reviews and architectural assessments to enforce fail-fast principles, so that systems fail loudly when critical dependencies break instead of degrading silently. Infrastructure failures then become immediately detectable, which improves production reliability and operational visibility.

Repository: mira-OSS

Install in Claude Code

git clone --depth 1 https://github.com/taylorsatula/mira-OSS /tmp/fail-fast-no-hedging && cp -r /tmp/fail-fast-no-hedging/.claude/skills/fail-fast-no-hedging ~/.claude/skills/fail-fast-no-hedging

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Fail-Fast Engineering: Architectural Honesty Over Silent Degradation

## 🚨 Core Principle

**Hedging** is treating **required** infrastructure as optional through defensive try/except blocks that mask failures as normal operation.

**The Critical Question:** Is this component required for the system to work correctly, or is it a genuine optional enhancement?

When **required** infrastructure fails, the system MUST fail loudly. When **optional** features fail, logging and continuing may be appropriate.
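
The required/optional split can be sketched as follows. This is a minimal illustration, not code from the audited codebase: `InfrastructureError`, `load_user_record`, `record_metric`, and the two `Broken*` stand-ins are all hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger(__name__)


class InfrastructureError(RuntimeError):
    """Hypothetical error type for a failed required dependency."""


def load_user_record(db, user_id):
    # REQUIRED: the system cannot work correctly without the database,
    # so there is no try/except here and any failure reaches the caller.
    return db.fetch_user(user_id)


def record_metric(metrics, name):
    # OPTIONAL: metrics are an enhancement, so logging and continuing is acceptable.
    try:
        metrics.increment(name)
    except Exception:
        logger.warning("metrics backend unavailable; skipping %s", name, exc_info=True)


class BrokenDB:
    """Stand-in for a database whose connection is down."""

    def fetch_user(self, user_id):
        raise InfrastructureError("database unreachable")


class BrokenMetrics:
    """Stand-in for a metrics backend that is down."""

    def increment(self, name):
        raise ConnectionError("metrics backend unreachable")
```

With these stand-ins, `record_metric(BrokenMetrics(), "login")` logs a warning and returns, while `load_user_record(BrokenDB(), "u1")` raises `InfrastructureError` instead of returning a placeholder value.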

## 📋 Assessment Output Format (REQUIRED)

When analyzing a codebase, conclude with this structured list of all hedging issues found:

```markdown
## Hedging Anti-Patterns Found in [directory/module name]

**N files with component hedging violations:**

1. **filename.py** (SEVERITY)
   - Location: `function_name()` lines X-Y
   - Issue: Brief description of what's wrong
   - Hedging: Explanation of how it's treating required infrastructure as optional
   - Impact: What happens in production when infrastructure fails

2. **filename.py** (SEVERITY)
   - Location: `function_name()` lines X-Y
   - Issue: Brief description of what's wrong
   - Hedging: Explanation of how it's treating required infrastructure as optional
   - Impact: What happens in production when infrastructure fails
```

**Example (from lt_memory/ audit):**

```markdown
## Hedging Anti-Patterns Found in lt_memory/

**3 files with component hedging violations:**

1. **db_access.py** (HIGH SEVERITY)
   - Location: `get_or_create_entity()` lines 916-933
   - Issue: Returns `None` when database INSERT...RETURNING fails, masking query failure as "entity not found"
   - Hedging: Docstring declares return type as `Entity`, implementation silently returns `None` on infrastructure failure
   - Impact: Caller cannot distinguish between "entity doesn't exist" (legitimate) vs "database query failed" (infrastructure down)

2. **extraction.py** (HIGH SEVERITY)
   - Location: `_parse_extraction_response()` lines 396-397 and 430
   - Issue: Returns `[]` (empty list) when JSON parsing fails, instead of raising as docstring declares
   - Hedging: Docstring says "Raises: ValueError" but code returns `[]` on parse failures
   - Impact: Caller sees empty list and treats it as "no memories extracted" when actually LLM returned invalid JSON
```

This format makes all issues immediately scannable before remediation work begins.
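
If findings are collected programmatically, the required format can be rendered from structured records. The sketch below is one possible helper, not part of the skill itself; `Finding` and `render_report` are hypothetical names, and the file count is assumed to mean the number of distinct filenames.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    filename: str
    severity: str  # e.g. "HIGH SEVERITY"
    location: str  # e.g. "`get_or_create_entity()` lines 916-933"
    issue: str
    hedging: str
    impact: str


def render_report(target: str, findings: List[Finding]) -> str:
    # Assumption: "N files" counts distinct filenames, not individual findings.
    file_count = len({finding.filename for finding in findings})
    lines = [
        f"## Hedging Anti-Patterns Found in {target}",
        "",
        f"**{file_count} files with component hedging violations:**",
    ]
    for number, finding in enumerate(findings, start=1):
        lines += [
            "",
            f"{number}. **{finding.filename}** ({finding.severity})",
            f"   - Location: {finding.location}",
            f"   - Issue: {finding.issue}",
            f"   - Hedging: {finding.hedging}",
            f"   - Impact: {finding.impact}",
        ]
    return "\n".join(lines)
```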

---

## 🎯 Quick-Start Guide

When analyzing a new codebase:

1. **Identify Infrastructure Dependencies**
   - Database connections
   - Cache/session stores
   - External APIs
   - Message queues
   - File systems

2. **Apply Three Diagnostic Tests**
   - **Semantic Distinction Test**: Can you distinguish "no data" from "infrastructure down"?
   - **Never Executes Test**: Will this fallback realistically run during normal operation?
   - **Contract Match Test**: Does behavior match the docstring/type hints?

3. **Look for These Red Flags**
   ```
   except Exception: return []     # Infrastructure failure → empty data
   except Exception: return False  # Connection error → "not allowed"
   except Exception: return None   # Database down → "not found"
   ```

## 🌍 Real-World Patterns from Production Codebases

Based on systematic removal of 40+ hedging anti-patterns across production systems, these specific patterns emerge repeatedly:

### 1. Infrastructure Failures Converted to Client Errors (CRITICAL)

**Pattern**: Catching database/service failures and raising ValidationError (400) instead of letting them propagate as 500s.

```python
# REAL EXAMPLE: CNS API Layer
def execute_action(self, action: str, data: Dict) -> Dict:
    try:
        session_manager = get_shared_session_manager()
        lt_db = LTMemoryDB(session_manager)
    except Exception as e:
        # Database down converted to "your input is invalid"!
        if "connection" in str(e) or "database" in str(e):
            raise ValidationError(f"Database connection failed: {e}")
```

**Impact**: Users see "Bad Request" when the database is down. Monitoring doesn't alert, because it watches 500s, not 400s. Operators think users are sending bad data while the infrastructure burns.

**Fix**: Remove business-layer exception translation. Let infrastructure exceptions bubble up to the API boundary, where proper HTTP status translation happens.
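
One way this fix might look is sketched below. It is framework-agnostic and makes several assumptions: `handle_request` stands in for whatever the real API boundary is, `connect` is a hypothetical injected connection factory, and `ValidationError` is redefined locally as a stand-in for the codebase's own class.

```python
import logging

logger = logging.getLogger(__name__)


class ValidationError(Exception):
    """Client sent bad input (stand-in for the codebase's own class)."""


def execute_action(action: str, data: dict, connect) -> dict:
    # Business layer: validate input, but never catch infrastructure errors.
    if not action:
        raise ValidationError("action is required")
    db = connect()  # may raise ConnectionError; deliberately not caught here
    return {"action": action, "stored": db.save(data)}


def handle_request(action: str, data: dict, connect) -> tuple:
    # API boundary: the single place where exceptions become HTTP statuses.
    try:
        return 200, execute_action(action, data, connect)
    except ValidationError as exc:
        return 400, {"error": str(exc)}
    except Exception:
        # Unknown failures are logged with a traceback and reported as 500,
        # so monitoring that watches server errors can see them.
        logger.exception("unhandled failure in action %r", action)
        return 500, {"error": "internal server error"}


class FakeDB:
    """Stand-in for a working database handle."""

    def save(self, data):
        return True


def broken_connect():
    """Stand-in for a connection factory while the database is down."""
    raise ConnectionError("database unreachable")
```

Here `handle_request("save", {}, broken_connect)` yields a 500 and `handle_request("", {}, FakeDB)` yields a 400, so a database outage can no longer be mistaken for bad user input.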

### 2. The Availability Flag Plague (PERVASIVE)

**Pattern**: Setting `self.component_available = False` during init, then checking it hundreds of times throughout the codebase.

```python
# REAL EXAMPLE: ValkeyClient with 200+ defensive checks
class ValkeyClient:
    def __init__(self):
        try:
            self._init_connections()
            self.valkey_available = True
        except Exception:
            self.valkey_available = False

    def get(self, key: str) -> Optional[Any]:
        if not self.valkey_available:  # One of 200+ checks!
            return None
        return self.valkey.get(key)
```

**Impact**:
- Every operation has defensive check overhead
- Infrastructure failures silently masked as "feature disabled"
- Dead code paths when component is actually required
- False sense of "graceful degradation"

**Fix**: Remove availability tracking entirely. If the component is required, fail at initialization. The "graceful degradation" never actually helps; it just delays the inevitable failure.
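
A fail-at-initialization client might be shaped roughly like this. `CacheClient` and its injected `connect` factory are hypothetical and do not reproduce the real `ValkeyClient` API; the point is only the absence of an availability flag.

```python
class CacheClient:
    """Hypothetical stand-in for a required cache client."""

    def __init__(self, connect):
        # No try/except and no availability flag: if the connection cannot be
        # established, construction raises and the process refuses to start.
        self._conn = connect()

    def get(self, key):
        # No availability check. None now only ever means "key not present";
        # a broken connection raises from the underlying call instead.
        return self._conn.get(key)


def broken_connect():
    """Stand-in for a connection factory while the cache server is down."""
    raise ConnectionError("cache server unreachable")
```

Using a plain `dict` as the fake connection, `CacheClient(lambda: {"a": 1}).get("a")` returns the stored value, while `CacheClient(broken_connect)` raises `ConnectionError` immediately at startup.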

### 3. Silent Success Claims on Failure (DECEPTIVE)

**Pattern**: Returning `{"success": True, "value": None}` when infrastructure fails.

```python
# REAL EXAMPLE: Calendar configuration endpoint
def get_calendar_config(self, user_id: str) -> Dict:
    try:
        config = credential_service.get_credential(user_id, "calendar_url")
        return {"success": True, "calendar_url": config}
    except Exception:
        # Vault down? Claim success anyway!
        return {"success": True, "calendar_url": None, "message": "Not configured"}
```

**Impact**: Client cannot distinguish "user hasn't configured calendar" from "credential service is

From the same repository

code-renamer (Subagent)

Use this agent when you need to rename classes, methods, functions, or variables in code files to align with specific naming requirements or conventions. Examples: <example>Context: User wants to clean up function names by removing a specific prefix. user: 'Please remove the prefix get_ from all function names in this file' assistant: 'I'll use the code-renamer agent to systematically rename all functions by removing the get_ prefix' <commentary>The user wants systematic renaming of functions, which is exactly what the code-renamer agent is designed for.</commentary></example> <example>Context: User wants to standardize method naming conventions. user: 'Can you rename all the camelCase methods to snake_case in this class?' assistant: 'I'll use the code-renamer agent to convert all camelCase method names to snake_case convention' <commentary>This is a systematic renaming task that requires careful attention to naming conventions.</commentary></example>

investigative-sidekick (Subagent)

Use this agent when the user makes offhanded comments, rhetorical questions, or expresses wishes about understanding something better. Trigger on patterns like:

<example>
Context: User is reviewing code and sees an assistant's explanation about how a function works.
user: "Can you believe this? Is this even right?"
assistant: "Let me use the investigative-sidekick agent to verify if that explanation is accurate."
<task tool_call to investigative-sidekick with context about what needs verification>
</example>

<example>
Context: User is debugging and expresses frustration.
user: "I wish I could figure out what's causing this memory leak in the session handler"
assistant: "I'll use the investigative-sidekick agent to investigate the root cause of that memory leak."
<task tool_call to investigative-sidekick with the specific problem to investigate>
</example>

<example>
Context: User reads a commit message claiming a performance improvement.
user: "Did this actually make things faster though?"
assistant: "Let me launch the investigative-sidekick agent to verify that performance claim."
<task tool_call to investigative-sidekick to fact-check the performance assertion>
</example>

<example>
Context: User is reviewing documentation that seems questionable.
user: "This doesn't seem right - are we really supposed to use sync calls in async contexts?"
assistant: "I'm going to use the investigative-sidekick agent to investigate whether that's actually correct."
<task tool_call to investigative-sidekick to verify the technical claim>
</example>

Activate proactively when the user:
- Questions accuracy or truthfulness ("Can you believe...", "Is this right?", "Really?")
- Expresses wishes about understanding ("I wish I could figure out...", "I'd love to know...")
- Shows skepticism ("Did this actually...", "Does this really...")
- Makes rhetorical questions that imply investigation ("What's causing...", "Why is this...")
- Doubts explanations or documentation they're reading

think (Slash Command)

Control thinking token limits via environment variable

validate-module (Slash Command)

Run complete two-agent validation on module+tests (contract extraction + test validation). Binary pass/fail with specific issues.

Code Consistency - Logging & Standards (Skill)

Check Python logging levels and patterns for correctness. Focus on identifying wrong severity levels and missing exception handling. Use when reviewing code quality.

contextvar-opportunity-finder (Skill)

Detect explicit user_id parameters in functions to identify potential opportunities for using ambient context. This is an investigation tool that flags instances for human review, not a prescriptive analyzer.

contextvar-remediation (Skill)

Git Workflow (Skill)

DO NOT COMMIT unless user explicitly tells you to. Use this skill EVERY SINGLE TIME before creating a git commit. Provides mandatory commit message format, staging rules, and post-commit summary requirements for the MIRA project