Skip to main content
ClaudeWave
Skill7.9k estrellas del repoactualizado 3d ago

evaluate

The evaluate skill compares baseline and new implementation results after Phase 4 completion, generating machine-readable reports in `result.json`, `experiments.json`, and `comparison.json`. It collects metrics from `log.json`, determines a verdict (BETTER, WORSE, INCONCLUSIVE, or FAILED) based on key metric performance, and produces a structured JSON report with detailed summary, explanation, and metric-by-metric comparison including differences and trade-offs.

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/Upsonic/Upsonic /tmp/evaluate && cp -r /tmp/evaluate/src/upsonic/prebuilt/applied_scientist/template/skills/evaluate ~/.claude/skills/evaluate
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# Evaluate Skill

## Purpose
Compare baseline and new implementation results. Produce the machine-readable final report `result.json`, update `experiments.json`, and append a row to `comparison.json`.

## When to Use
Phase 5 — after the new implementation is complete and metrics are collected.

## Input
| Parameter | Type | Description |
|-----------|------|-------------|
| experiment_path | path | `experiments/{research_name}/` |
| research_name | string | Name of this experiment |

## Actions

1. **Collect all metrics** from `log.json` (Phase 3 baseline entry + Phase 4 new method entry).

2. **Determine verdict:**
   - `BETTER`: new method outperforms baseline on the majority of key metrics
   - `WORSE`: new method underperforms baseline on the majority of key metrics
   - `INCONCLUSIVE`: mixed results or differences within noise margin
   - `FAILED`: experiment could not produce comparable results (dependency failure, implementation crash, data incompatibility)

3. **Write `{experiment_path}/result.json`** in the exact schema below. Always valid JSON; never leave fields undefined — use `null` for unknown values.

   ```json
   {
     "name": "{research_name}",
     "verdict": "BETTER",
     "summary": "2-3 paragraphs explaining what the new method does, how it fundamentally differs from the baseline, and what trade-offs it makes.",
     "explanation": "2-3 sentences explaining WHY this verdict was reached. Reference specific metrics and their differences. Be concrete — mention numbers, not vague statements.",
     "comparison": {
       "metrics": [
         {
           "name": "accuracy",
           "current": 0.853,
           "new":     0.872,
           "diff":    0.019,
           "diff_display": "+0.019",
           "unit": null,
           "higher_is_better": true,
           "better": "new"
         },
         {
           "name": "training_time_seconds",
           "current": 2.0,
           "new":     45.0,
           "diff":    43.0,
           "diff_display": "+43.0",
           "unit": "seconds",
           "higher_is_better": false,
           "better": "current"
         }
       ]
     },
     "file_locations": {
       "current_notebook":   "experiments/{research_name}/current.ipynb",
       "current_data":       "experiments/{research_name}/current_data/",
       "new_notebook":       "experiments/{research_name}/new.ipynb",
       "research_source":    "experiments/{research_name}/research.pdf",
       "experiment_log":     "experiments/{research_name}/log.json"
     }
   }
   ```

   ### Field rules
   - `verdict`: exactly one of `"BETTER"`, `"WORSE"`, `"INCONCLUSIVE"`, `"FAILED"`.
   - `summary` / `explanation`: plain text, no markdown headings. Short paragraphs only.
   - `comparison.metrics[]`:
     - `current` / `new` are numbers (or `null` if a side could not compute the metric).
     - `diff = new - current` (raw number). `diff_display` is the short string with sign (`"+0.019"`, `"-0.03"`).
     - `better`: `"new"` | `"current"` | `"tie"` | `null` — computed from `diff` and `higher_is_better`.
     - `unit` is a short unit string (`"seconds"`, `"%"`, etc.) or `null`.
   - `file_locations` uses paths relative to the experiments directory root. `research_source` must match whatever Phase 0 materialized — `research.pdf`, `research_source.{ext}`, or the `research_source/` directory for a cloned repo.

4. **Update `experiments/experiments.json`:**
   - Set `status` to `"completed"` (or `"failed"` if the experiment failed).
   - Fill in `verdict`, `key_metric`, `baseline_model`, `new_method`.
   - `key_metric` is an object: `{"name": "...", "baseline": <num>, "new": <num>}`.

5. **Update `experiments/comparison.json`:**
   - If the file does not exist, create it with `{"experiments": []}`.
   - Append an entry:
     ```json
     {
       "name": "{research_name}",
       "date": "YYYY-MM-DD",
       "baseline": "{baseline_model}",
       "new_method": "{new_method}",
       "key_metric": {"name": "accuracy", "baseline": 0.853, "new": 0.872},
       "verdict": "BETTER"
     }
     ```

6. **Update `{experiment_path}/log.json`** — append a Phase 5 entry:
   ```json
   {
     "name": "Phase 5: Evaluate",
     "completed_at": "2026-04-17T11:40:00Z",
     "verdict": "BETTER",
     "key_change": "accuracy +0.019 (new > current)",
     "files_written": ["result.json", "experiments.json", "comparison.json"]
   }
   ```

## Output
- `{experiment_path}/result.json` — the final machine-readable report.
- `experiments/experiments.json` — updated with this experiment's final verdict.
- `experiments/comparison.json` — new row appended.
- `log.json` — finalized with Phase 5 entry.
unittest-generatorSubagent

Use this agent when you need to create unit tests for your code in unittest.TestCase format, organized in a tests folder with concept-based subfolders. Examples: <example>Context: User has just written a new authentication module and needs comprehensive unit tests. user: 'I just finished writing my user authentication functions in auth.py. Can you help me create unit tests for them?' assistant: 'I'll use the unittest-generator agent to create comprehensive unit tests for your authentication module.' <commentary>Since the user needs unit tests created for their authentication code, use the unittest-generator agent to create properly structured tests in the tests folder with appropriate subfolder organization.</commentary></example> <example>Context: User has implemented new data validation functions and wants to ensure they're properly tested. user: 'I've added several validation functions to my utils.py file. I need unit tests to make sure they handle edge cases correctly.' assistant: 'Let me use the unittest-generator agent to create thorough unit tests for your validation functions.' <commentary>The user needs unit tests for their validation functions, so use the unittest-generator agent to create comprehensive tests with edge case coverage.</commentary></example>

analyze_currentSkill
benchmarkSkill
experiment_managementSkill
implementSkill
progressSkill
researchSkill
code-reviewSkill

Perform structured code reviews with actionable feedback. Use when a user asks to review code, check code quality, find bugs, audit security, improve performance, or assess maintainability. Trigger when user says things like "review this code", "check for bugs", "is this code secure", "any issues with this", "code quality check", or pastes code asking for feedback. Also trigger for pull request reviews and pre-merge code checks. Do NOT trigger for writing new code from scratch, refactoring requests without review context, or general programming questions.