benchmark
The benchmark skill defines comparison metrics and extracts baseline values from the current implementation during Phase 3 of an experiment workflow. It reads metric outputs from the current analysis notebook, identifies any missing computations by setting a `needs_computation` flag, and appends a structured Phase 3 entry to the experiment's log.json file so downstream evaluation phases can reference and compare against established baseline performance.
git clone --depth 1 https://github.com/Upsonic/Upsonic /tmp/benchmark && cp -r /tmp/benchmark/src/upsonic/prebuilt/applied_scientist/template/skills/benchmark ~/.claude/skills/benchmark
SKILL.md
# Benchmark Skill
## Purpose
Define the comparison metrics and extract baseline values from the current implementation. Record them as a structured JSON entry so downstream phases and final evaluation can read them directly.
## When to Use
Phase 3 — after both current analysis and research analysis are complete.
## Input
| Parameter | Type | Description |
|-----------|------|-------------|
| experiment_path | path | `experiments/{research_name}/` |
## Actions
1. **Define comparison metrics:**
   - Include ALL metrics already used in `current.ipynb`.
   - Add any additional metrics that are relevant for the new method.
   - For classification: accuracy, precision, recall, F1, AUC-ROC (as applicable).
   - For regression: MSE, RMSE, MAE, R² (as applicable).
   - Include training time if measurable.
2. **Extract baseline values:**
   - Read metric values from `current.ipynb` output cells.
   - If a metric is not computed in the notebook, record it as `null` and set `"needs_computation": true`; both notebooks must then compute it.
3. **Append a Phase 3 entry to `{experiment_path}/log.json`** under `phases`:
```json
{
  "name": "Phase 3: Benchmark",
  "completed_at": "2026-04-17T10:45:00Z",
  "metrics": [
    {
      "name": "accuracy",
      "description": "Fraction of correctly classified samples.",
      "higher_is_better": true,
      "baseline": 0.8726,
      "needs_computation": false
    },
    {
      "name": "f1",
      "description": "F1 score (binary, positive class).",
      "higher_is_better": true,
      "baseline": 0.7277,
      "needs_computation": false
    },
    {
      "name": "roc_auc",
      "description": "Area under the ROC curve.",
      "higher_is_better": true,
      "baseline": 0.9274,
      "needs_computation": false
    },
    {
      "name": "training_time_seconds",
      "description": "Wall-clock training time.",
      "higher_is_better": false,
      "baseline": null,
      "needs_computation": true
    }
  ],
  "notes": "training_time_seconds must be added to both notebooks for a fair comparison."
}
```
Do not overwrite earlier entries; append to the `phases` array.
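The extract-and-append steps above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names (`read_notebook_metrics`, `make_metric`, `append_benchmark_phase`) are hypothetical, and it assumes the notebook prints metrics to stdout as `name: value` lines and that `log.json` is a JSON object holding a `phases` array.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path

# Assumption: metrics appear in stdout as lines such as "accuracy: 0.8726".
METRIC_LINE = re.compile(r"^\s*([A-Za-z_][\w\-]*)\s*[:=]\s*(-?\d+(?:\.\d+)?)\s*$")


def read_notebook_metrics(notebook_path):
    """Collect name/value pairs from the stream outputs of code cells."""
    notebook = json.loads(Path(notebook_path).read_text())
    found = {}
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") != "stream":
                continue
            text = output.get("text", "")
            if isinstance(text, list):  # notebook JSON may store text as a list of lines
                text = "".join(text)
            for line in text.splitlines():
                match = METRIC_LINE.match(line)
                if match:
                    found[match.group(1)] = float(match.group(2))
    return found


def make_metric(name, description, higher_is_better, baseline=None):
    """Build one metric entry; a missing baseline is flagged for computation."""
    return {
        "name": name,
        "description": description,
        "higher_is_better": higher_is_better,
        "baseline": baseline,
        "needs_computation": baseline is None,
    }


def append_benchmark_phase(experiment_path, metrics, notes=""):
    """Append a Phase 3 entry to log.json, leaving earlier phases untouched."""
    log_path = Path(experiment_path) / "log.json"
    # Assumption: a missing log.json is treated as an empty log.
    log = json.loads(log_path.read_text()) if log_path.exists() else {}
    entry = {
        "name": "Phase 3: Benchmark",
        "completed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "metrics": metrics,
        "notes": notes,
    }
    log.setdefault("phases", []).append(entry)
    log_path.write_text(json.dumps(log, indent=2))
    return entry
```

A metric absent from the notebook output comes back as `None` from `found.get(name)`, so `make_metric` serializes it as `null` with `needs_computation` set to `true`, matching the rule in step 2.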
## Output
- `{experiment_path}/log.json` — updated with Phase 3 benchmark entry
- Clear list (in `metrics`) of what the new implementation must compute
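A downstream phase might read that list back along these lines. This is a sketch only: the function names `load_log` and `pending_metrics` are hypothetical, and it assumes the Phase 3 entry is identified by the `name` value `"Phase 3: Benchmark"` shown in the example entry.

```python
import json
from pathlib import Path


def load_log(experiment_path):
    """Load log.json from the experiment directory."""
    return json.loads((Path(experiment_path) / "log.json").read_text())


def pending_metrics(log):
    """Return names of metrics flagged needs_computation in the latest Phase 3 entry."""
    benchmark_entries = [
        phase for phase in log.get("phases", [])
        if phase.get("name") == "Phase 3: Benchmark"
    ]
    if not benchmark_entries:
        return []
    latest = benchmark_entries[-1]
    return [
        metric["name"]
        for metric in latest.get("metrics", [])
        if metric.get("needs_computation")
    ]
```

With the example entry above, this would return only `training_time_seconds`, the one metric whose baseline is `null`.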