test-driven-development
This Claude Code skill enforces test-driven development by requiring developers to write failing tests before implementing features or bug fixes. Use it for all new features, bug fixes, refactoring, and behavior changes to ensure the red-green-refactor cycle is followed, with production code only written after confirming a test fails correctly.
git clone --depth 1 https://github.com/shiwenwen/hope-agent /tmp/test-driven-development && cp -r /tmp/test-driven-development/skills/test-driven-development ~/.claude/skills/test-driven-developmentSKILL.md
# Test-Driven Development (TDD)
## Overview
Write the test first. Watch it fail. Write minimal code to pass.
**Core principle:** If you didn't watch the test fail, you don't know if it tests the right thing.
**Violating the letter of the rules is violating the spirit of the rules.**
## When to Use
**Always:**
- New features
- Bug fixes
- Refactoring
- Behavior changes
**Exceptions (ask the user first):**
- Throwaway prototypes
- Generated code
- Configuration files
Thinking "skip TDD just this once"? Stop. That's rationalization.
## The Iron Law
```
NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
```
Write code before the test? Delete it. Start over.
**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete
Implement fresh from tests. Period.
## Red-Green-Refactor Cycle
### RED — Write Failing Test
Write one minimal test showing what should happen.
**Good test:**
```python
def test_retries_failed_operations_3_times():
attempts = 0
def operation():
nonlocal attempts
attempts += 1
if attempts < 3:
raise Exception('fail')
return 'success'
result = retry_operation(operation)
assert result == 'success'
assert attempts == 3
```
Clear name, tests real behavior, one thing.
**Bad test:**
```python
def test_retry_works():
mock = MagicMock()
mock.side_effect = [Exception(), Exception(), 'success']
result = retry_operation(mock)
assert result == 'success' # What about retry count? Timing?
```
Vague name, tests mock not real code.
**Requirements:**
- One behavior per test
- Clear descriptive name ("and" in name? Split it)
- Real code, not mocks (unless truly unavoidable)
- Name describes behavior, not implementation
### Verify RED — Watch It Fail
**MANDATORY. Never skip.**
```bash
# Use exec tool to run the specific test
pytest tests/test_feature.py::test_specific_behavior -v
```
Confirm:
- Test fails (not errors from typos)
- Failure message is expected
- Fails because the feature is missing
**Test passes immediately?** You're testing existing behavior. Fix the test.
**Test errors?** Fix the error, re-run until it fails correctly.
### GREEN — Minimal Code
Write the simplest code to pass the test. Nothing more.
**Good:**
```python
def add(a, b):
return a + b # Nothing extra
```
**Bad:**
```python
def add(a, b):
result = a + b
logging.info(f"Adding {a} + {b} = {result}") # Extra!
return result
```
Don't add features, refactor other code, or "improve" beyond the test.
**Cheating is OK in GREEN:**
- Hardcode return values
- Copy-paste
- Duplicate code
- Skip edge cases
We'll fix it in REFACTOR.
### Verify GREEN — Watch It Pass
**MANDATORY.**
```bash
# Run the specific test
pytest tests/test_feature.py::test_specific_behavior -v
# Then run ALL tests to check for regressions
pytest tests/ -q
```
Confirm:
- Test passes
- Other tests still pass
- Output pristine (no errors, warnings)
**Test fails?** Fix the code, not the test.
**Other tests fail?** Fix regressions now.
### REFACTOR — Clean Up
After green only:
- Remove duplication
- Improve names
- Extract helpers
- Simplify expressions
Keep tests green throughout. Don't add behavior.
**If tests fail during refactor:** Undo immediately. Take smaller steps.
### Repeat
Next failing test for next behavior. One cycle at a time.
## Why Order Matters
**"I'll write tests after to verify it works"**
Tests written after code pass immediately. Passing immediately proves nothing:
- Might test the wrong thing
- Might test implementation, not behavior
- Might miss edge cases you forgot
- You never saw it catch the bug
Test-first forces you to see the test fail, proving it actually tests something.
**"I already manually tested all the edge cases"**
Manual testing is ad-hoc. You think you tested everything but:
- No record of what you tested
- Can't re-run when code changes
- Easy to forget cases under pressure
- "It worked when I tried it" ≠ comprehensive
Automated tests are systematic. They run the same way every time.
**"Deleting X hours of work is wasteful"**
Sunk cost fallacy. The time is already gone. Your choice now:
- Delete and rewrite with TDD (high confidence)
- Keep it and add tests after (low confidence, likely bugs)
The "waste" is keeping code you can't trust.
**"TDD is dogmatic, being pragmatic means adapting"**
TDD IS pragmatic:
- Finds bugs before commit (faster than debugging after)
- Prevents regressions (tests catch breaks immediately)
- Documents behavior (tests show how to use code)
- Enables refactoring (change freely, tests catch breaks)
"Pragmatic" shortcuts = debugging in production = slower.
**"Tests after achieve the same goals — it's spirit not ritual"**
No. Tests-after answer "What does this do?" Tests-first answer "What should this do?"
Tests-after are biased by your implementation. You test what you built, not what's required. Tests-first force edge case discovery before implementing.
## Common Rationalizations
| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
| "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
| "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |
| "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
| "Need to explore first" | Fine. Throw away exploration, start with TDD. |
| "Test hard = design unclear" | Listen to the test. Hard to test = hard to use. |
| "TDD will slow me down" | TDD faster than debugging. Pragmatic = test-first. |
| "Manual test faster" | Manual doesn't prove edge cases. You'll re-test every change. |
| "Existing cod>
Use when the user asks to draft, polish, translate, or reply to an email. Produces a clean draft with subject line, greeting, body, and sign-off, plus a pre-send self-check.
Use when the user mentions 飞书 / Feishu / Lark workspace operations: docx (云文档) read/write, bitable (多维表格) records / views / dashboards, drive (云盘) upload/download, wiki (知识库) link resolution, approval (审批) instance create/cancel/query, calendar (日历) event create/list/update + attendees, contact (联系人) user/department lookup, hire (招聘) job/talent/application listing. Trigger on phrases like 'OKR 周报', '把这份文档发到飞书云盘', '给团队拉个评审会议', '查 [姓名] 的联系方式', '撤销那条审批', '/wiki 链接', or any request that mentions a feishu / lark URL / token (doxcn.../bascn.../wikcn.../boxcn.../om_...).
Hope Agent browser automation — the standard `status → tabs → snapshot → act` loop, stale-ref recovery rules, and what to do when login / 2FA / captcha / camera-prompt / dialog blocks progress. Load this skill whenever you reach for the `browser` tool. Trigger on: user asks the agent to open / control / click / scrape / log into / verify something in a web app ('open X and click Y', '打开 X 然后点击 Y', 'log into my Gmail', 'scrape this page', 'fill out the form on X'); user reports a flow that requires real browser context (cookies, JS-rendered content, OAuth).
Discover and install third-party skills from external registries when the user needs a capability that no currently-active skill covers. Trigger when: (1) the user explicitly asks 'find a skill for X', 'is there a skill that does X', 'install a skill to X', (2) the user requests a well-known integration (Slack, Notion, Trello, GitHub, Hue, Sonos, iMessage, weather, TTS, transcription …) that isn't in the active skill catalog, (3) you are about to hand-write ad-hoc shell / API code for a domain that almost certainly has a published skill. Do NOT trigger if an active skill already covers the need — scan the visible skill catalog first.
Self-service diagnostics — query Hope Agent's local SQLite databases (logs / sessions / async jobs) directly via the `exec` tool to investigate problems, analyze usage, and locate root causes. Trigger on: user reports something broken / failing / slow / stuck / not responding ('X 不工作', 'X 报错', 'X 卡住', '为什么 X 失败', 'why did X fail', 'show me the logs', 'check what happened'); ad-hoc data analysis ('this week's token usage', '最近调用最多的工具', 'how many subagent runs failed', 'tool error rate', 'find sessions where X happened'); verifying a fix ('did the error stop after I changed Y'). Use BEFORE asking the user to paste log snippets — the data is on disk, query it directly. Read-only — SELECT only, never UPDATE/DELETE/INSERT/DROP.
Hope Agent native macOS desktop control — the standard `mac_control` status / diagnostics / apps / dock / spaces / snapshot / visual / windows / menu / clipboard / dialog loop, target-first action rules, no-blind-coordinate policy, and recovery for stale AX/window/menu/dialog state. Load whenever using `mac_control`, or when the user asks to control local Mac apps, Dock, Spaces, click/type/menu/window/dialog/clipboard, automate Finder/TextEdit/System Settings, visually locate UI, or says 控制 Mac, macOS 自动化, 点按钮, 打开应用, Dock, Space, 关闭窗口, 菜单点击, 视觉定位.
Self-understanding and issue reporting for Hope Agent itself. Use when the user asks how Hope Agent works internally, asks about its own source code/docs/runtime behavior, reports a bug/failure/slowness/crash, asks to diagnose logs, or asks to create/submit a GitHub issue for a bug, feature request, or improvement (including when there is no bug). Chinese triggers: 自查, 了解自己, 自我诊断, 排查 Hope Agent, 提交 issue, 需求 issue, 功能改进.