coral-new-task
The coral-new-task skill is an end-to-end guide for creating a new CORAL task. It covers the three required components: the `task.yaml` configuration, the `seed/` starter-code directory, and the `grader/` packaged Python module with its `TaskGrader` implementation. Use it when adding a new task to CORAL, porting an existing benchmark, or migrating legacy `eval/grader.py` examples to the packaged grader format. It also covers common pitfalls such as incorrect `repo_path` references, reversed score directions, and missing `run()` function signatures.
git clone --depth 1 https://github.com/Human-Agent-Society/CORAL /tmp/coral-new-task && cp -r /tmp/coral-new-task/.claude/skills/coral-new-task ~/.claude/skills/coral-new-task
# Creating a new CORAL task
A CORAL task is **three things** that must line up:
```
examples/<task>/
├── task.yaml              # config: name, description, grader entrypoint, agent count
├── seed/                  # starter code agents see when they begin (the repo_path)
│   └── solution.py
└── grader/                # standalone Python package
    ├── pyproject.toml
    └── src/<task>_grader/
        ├── __init__.py
        └── grader.py      # class Grader(TaskGrader): ...
```
The packaged form is the only supported form. The package gives the grader its own venv and ships everything the eval needs — grader code, helper modules, and hidden data (see "Hidden data" below).
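Because the three pieces have to line up, a quick layout check can catch a misplaced file before the first eval. The following is a minimal sketch, not part of CORAL: the helper name `missing_task_files` is hypothetical, and the expected paths are taken only from the tree shown above.

```python
from pathlib import Path


def missing_task_files(task_dir: Path, package: str) -> list[str]:
    """Return the expected paths (relative to task_dir) that do not exist.

    `package` is the grader package directory name, e.g. "<task>_grader".
    The list of expected paths mirrors the layout tree in this guide.
    """
    expected = [
        "task.yaml",
        "seed",
        "grader/pyproject.toml",
        f"grader/src/{package}/__init__.py",
        f"grader/src/{package}/grader.py",
    ]
    return [rel for rel in expected if not (task_dir / rel).exists()]
```

For a complete task directory the function returns an empty list. Anything it does return is a path to create or rename.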
## Reference implementations
Look at these before writing anything new — copy the closest one and edit:
| Reference | When to copy it |
|---|---|
| [examples/erdos/](examples/erdos/) | Minimal packaged grader, single grader file, numpy-only deps |
| [examples/dna_design/](examples/dna_design/) | Packaged grader with bundled data files (`importlib.resources`) and `[ml]` optional-deps for heavy libs |
| [examples/swebench-verified/](examples/swebench-verified/) | Tiered eval (different instance counts per tier), private answer keys, harbor integration |
| [examples/circle_packing/](examples/circle_packing/) | Smallest packaged task end-to-end — single solution file, single grader file |
| [examples/mnist/](examples/mnist/) | Packaged grader with a hidden answer key shipped inside the package (`taskdata/answers/`) |
## 1. The seed
Whatever lives in `seed/` is what the agent sees on first checkout — it's the working directory the grader will later score. The contract between `seed/` and the grader is the **program file**: a Python file with a function the grader imports and calls.
The convention across examples is:
- `solution.py` (or `initial_program.py`) defining a top-level `run()` function.
- `task.yaml` passes `program_file: "solution.py"` to the grader via `grader.args`.
- `run()`'s signature is whatever the grader expects — usually `() -> result` or `(input_path) -> result`.
Put a real, runnable baseline here. Agents should be able to `coral eval` immediately and get a non-zero score, so they have a starting point to improve. A no-op skeleton that crashes is not a good baseline.
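As an illustration of a runnable baseline, here is a minimal sketch of a seed `solution.py`. The task itself (returning points in the unit square) and the parameter names are invented for the example. The only part taken from the convention above is a top-level `run()` that can be called with no arguments.

```python
# seed/solution.py (hypothetical baseline for an illustrative task)
import random


def run(n_points: int = 8, seed: int = 0) -> list[tuple[float, float]]:
    """Return n_points pseudo-random points in the unit square.

    The defaults let a grader call run() with no arguments, and the fixed
    seed keeps the baseline result reproducible between evals.
    """
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_points)]


if __name__ == "__main__":
    # Lets an agent sanity-check the baseline by running the file directly.
    print(run())
```

A weak but valid baseline like this gives agents a score to improve on, which a skeleton that raises on import does not.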
If the task needs data files at runtime (training data, fixtures), put them under `seed/data/` and reference them by relative path from `solution.py`. The grader will see them at `<codebase_path>/data/...`.
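One way to keep that relative path stable is to resolve it against the location of `solution.py` rather than the current working directory. This is a sketch under that assumption. The helper `data_path` and the file name `train.csv` are hypothetical.

```python
from pathlib import Path


def data_path(name: str, module_file: str) -> Path:
    """Return the path of `name` inside the data/ directory next to module_file.

    Pass __file__ as module_file from solution.py so the result does not
    depend on which directory the program was launched from.
    """
    return Path(module_file).resolve().parent / "data" / name


# Inside seed/solution.py this would be used as:
#     train_csv = data_path("train.csv", __file__)
```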
## 2. The grader
### Packaged grader — the recommended path
```
grader/
├── pyproject.toml
└── src/<task>_grader/
    ├── __init__.py
    └── grader.py
```
`pyproject.toml` is a thin Hatchling package. Crib from [examples/erdos/grader/pyproject.toml](examples/erdos/grader/pyproject.toml):
```toml
[project]
name = "<task>-grader"
version = "0.1.0"
description = "CORAL grader for the <task> task."
requires-python = ">=3.11"
dependencies = ["coral", "numpy"]  # Whatever the grader actually imports.

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/<task>_grader"]
```
Subclass `TaskGrader` and implement `evaluate()`:
```python
# grader/src/<task>_grader/grader.py
from coral.grader import TaskGrader
from coral.types import ScoreBundle


class Grader(TaskGrader):
    def evaluate(self) -> float | ScoreBundle:
        program_file = self.args.get("program_file", "solution.py")

        # self.codebase_path: the agent's commit checked out detached
        # self.private_dir:   .coral/private/ (your hidden answer keys live here)
        # self.args:          dict from task.yaml grader.args
        # self.timeout:       grader.timeout in seconds (or None)
        # self.eval_logs_dir: write subprocess logs / artifacts the agent should see post-grade

        try:
            # Placeholder: replace with the task's own run-and-score logic,
            # which should load program_file from self.codebase_path.
            result = run_program_and_score(...)
        except TimeoutError:
            return self.fail(f"Evaluation timed out after {self.timeout}s")
        except Exception as e:
            return self.fail(f"Evaluation failed: {e}")

        return self.score(result, explanation=f"score={result:.4f}")
```
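The `run_program_and_score(...)` call above is a placeholder. One plausible shape for its first half is to import the agent's program file and call its `run()` function. This is a sketch, assuming an in-process import is acceptable for the task. The helper name `call_run` is hypothetical and it uses only the standard library, not any CORAL API.

```python
import importlib.util
from pathlib import Path
from typing import Any


def call_run(codebase_path: Path, program_file: str = "solution.py") -> Any:
    """Import <codebase_path>/<program_file> and return the result of its run()."""
    path = Path(codebase_path) / program_file
    spec = importlib.util.spec_from_file_location("submission", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the agent's module top level; a missing file raises here.
    spec.loader.exec_module(module)
    if not callable(getattr(module, "run", None)):
        raise AttributeError(f"{program_file} does not define a callable run()")
    return module.run()
```

Any exception raised here would be caught by the `except Exception` branch in `evaluate()` and turned into a failed result. An in-process import runs agent code inside the grader process, so running the program as a subprocess is the safer choice when isolation matters.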
What you have available on `self`:
| Attribute / method | Use it for |
|---|---|
| `self.codebase_path` | Path to the commit being graded (detached worktree). Read-only — anything written here is discarded after the eval. |
| `self.private_dir` | `.coral/private/`. Your answer keys, hidden test data, anything from `grader.private` lives here. |
| `self.args` | `dict` from `task.yaml::grader.args`. Use `self.args.get("program_file", "solution.py")` etc. |
| `self.timeout` | Eval timeout in seconds (or `None` if `grader.timeout: 0`). |
| `self.eval_logs_dir` | Per-attempt directory for logs/artifacts that should outlive the grader. Symlinked into each agent worktree as `<shared_dir>/eval_logs/<hash>/`. |
| `self.score(value, explanation=...)` | Build a single-task `ScoreBundle` from a numeric score. |
| `self.fail(reason)` | Return a fail `ScoreBundle` with `reason` as feedback. |
| `self.get_python_command()` | Returns the command, as a list, for the `python` binary inside the codebase's env (uses `uv run` if a `pyproject.toml` is present). Always use this instead of `sys.executable` so task-specific deps are visible. |
| `self.run_program(filename, *args)` | Convenience: runs `<codebase_path>/<filename>` as a subprocess via `get_python_command()`. |
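To make the subprocess path concrete, the sketch below runs a program file with a given interpreter command and a timeout. It does not call CORAL: `run_in_codebase` is a hypothetical helper, and `python_cmd` stands in for whatever list `self.get_python_command()` returns. `subprocess.run` raises `subprocess.TimeoutExpired`, which is not a subclass of `TimeoutError`, so the sketch converts it. Whether `self.run_program` does the same is not stated here, so a grader that relies on `except TimeoutError` may want to handle both.

```python
import subprocess
from pathlib import Path
from typing import Optional


def run_in_codebase(
    python_cmd: list[str],
    codebase_path: Path,
    filename: str,
    timeout: Optional[float] = None,
) -> subprocess.CompletedProcess:
    """Run <codebase_path>/<filename> with python_cmd and capture its output."""
    try:
        return subprocess.run(
            [*python_cmd, filename],
            cwd=codebase_path,    # relative paths in the program resolve here
            capture_output=True,  # keep stdout/stderr for feedback and logs
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # Normalise to the built-in exception so one except clause covers it.
        raise TimeoutError(f"{filename} exceeded {timeout}s") from exc
```

The returned object exposes `returncode`, `stdout`, and `stderr`. A non-zero `returncode` is a natural trigger for `self.fail(...)` with the captured `stderr` as the reason.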
### Bundling data files with the grader
If the grader needs reference files (model weights, ground-truth answers, scoring fixtures), ship them inside the package and load via `importlib.resources`:
```python
import importlib.resources
scorer_dir = str(importlib.resources.files("<task>_grader.scorers"))
```
[examples/dna_design/grader/src/dna_design_grader/grader.py](examples/dna_design/grader/src/dna_design_grader/grader.py) is the canonical pattern — note the `scorers/` subpackage. Add the directory to `[tool.hatch.build.targets.wheel]` if it has non-Python files.
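For a single bundled file, the same `importlib.resources` approach can read the content directly rather than passing a directory path around. This is a minimal sketch. The helper `load_bundled_json` and any file name passed to it are hypothetical, and it assumes the file was included in the built wheel.

```python
import importlib.resources
import json
from typing import Any


def load_bundled_json(package: str, filename: str) -> Any:
    """Parse a JSON file shipped inside an importable package.

    `package` is a dotted package name such as "<task>_grader.scorers".
    The lookup goes through the import system, so it does not depend on
    the current working directory.
    """
    resource = importlib.resources.files(package).joinpath(filename)
    return json.loads(resource.read_text(encoding="utf-8"))
```

Reading through `files()` keeps working when the grader is installed into its own venv, where paths relative to the source tree no longer exist.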
### Heavy / optional dependencies