arbor
The Arbor skill autonomously improves concrete artifacts like code, prompts, or pipelines through iterative experimentation using Hypothesis Tree Refinement. Use it when optimizing performance against a measurable objective over many trials while avoiding overfitting, such as raising model accuracy, improving agent harnesses, tuning data pipelines, or solving benchmark optimization tasks where cumulative learning across experiments matters more than a single fix.
git clone --depth 1 https://github.com/K-Dense-AI/scientific-agent-skills /tmp/arbor && cp -r /tmp/arbor/skills/arbor ~/.claude/skills/arbor
# Arbor — Autonomous Optimization via Hypothesis Tree Refinement
## Overview
This skill runs an **Autonomous Optimization (AO)** loop: starting from an existing artifact and a measurable objective, improve it through many rounds of experiment and evaluation — without step-by-step human supervision and without overfitting to the feedback signal. It's the right tool when the bottleneck isn't writing one good change, but *organizing dozens of trials* so that lessons accumulate instead of evaporating.
It implements **Hypothesis Tree Refinement (HTR)** from *Arbor* (Jin et al., 2026). The key idea: keep the research state in a persistent **hypothesis tree** rather than in conversation history. Each node binds a hypothesis, the distilled insight it produced, and a pointer to the artifact version that realizes it. You play the long-lived **coordinator** that owns this tree and decides where to search; short-lived **executor** subagents test one hypothesis each in isolated git worktrees and report back. A **held-out merge gate** admits a change only when it improves on a *test* evaluator the search never optimized against. This is what turns trial-and-error into cumulative, auditable research.
Use the `scripts/tree.py` state manager for all the bookkeeping (creating nodes, writing evidence, propagating insights, pruning, the merge gate, the Observe projection). It keeps the state consistent and frees you to spend judgment on what the evidence *means*.
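To make the node structure described above concrete, here is a minimal sketch of a hypothesis tree in Python. It is an illustration only, not the schema `scripts/tree.py` actually uses: the class names, field names, status values, and the `n0`, `n1`, ... id scheme are all assumptions chosen to mirror the prose.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HypothesisNode:
    """One tree node. Field names are hypothetical, not the tree.py schema."""
    node_id: str
    hypothesis: str                      # falsifiable claim under test
    parent: Optional[str] = None         # parent node id, None for the root
    insight: Optional[str] = None        # distilled lesson, filled in after execution
    artifact_ref: Optional[str] = None   # pointer to the artifact version, e.g. a git ref
    status: str = "pending"              # assumed values: pending | executed | pruned
    children: List[str] = field(default_factory=list)


class HypothesisTree:
    """Persistent research state, kept outside the conversation history."""

    def __init__(self, objective: str) -> None:
        # The root node carries the objective itself.
        self.nodes: Dict[str, HypothesisNode] = {"n0": HypothesisNode("n0", objective)}

    def add_node(self, parent: str, hypothesis: str) -> str:
        node_id = f"n{len(self.nodes)}"
        self.nodes[node_id] = HypothesisNode(node_id, hypothesis, parent=parent)
        self.nodes[parent].children.append(node_id)
        return node_id

    def depth(self, node_id: str) -> int:
        # Count edges from the node up to the root.
        d = 0
        while self.nodes[node_id].parent is not None:
            node_id = self.nodes[node_id].parent
            d += 1
        return d
```

In this sketch a broad direction added under `n0` sits at depth 1, and a concrete intervention added under that direction sits at depth 2.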
## When to use this skill
Reach for Arbor when the task is **iterative improvement of a concrete artifact under an evaluator**:
- Model training: optimizer/architecture/recipe changes to lower loss or hit a target in fewer steps.
- Harness/agent engineering: raising pass rate or accuracy of an agent loop, search harness, or tool-use scaffold.
- Data synthesis: improving a generation/filtering pipeline judged by downstream model behavior.
- Benchmark optimization: MLE-bench / Kaggle-style "improve the submission" tasks.
- Prompt/system optimization where you can score outputs automatically.
The distinguishing signals: there's an **artifact you can modify**, an **objective**, a way to **score** candidates, and you expect to run **many experiments**. If the user only wants a single fix or a one-shot answer, this is overkill — just do the work directly. If they want open-ended ideation with no evaluator, use `hypothesis-generation` or `scientific-brainstorming` instead.
## The AO setup — pin this down first
Before any experiments, establish the task tuple `(M_0, O, E_dev, E_test)`. Getting this right matters more than any later decision, so confirm it explicitly:
- **M_0 — initial material**: the artifact to improve (a repo, a script, a config, a prompt). Make sure it's under git and currently runs.
- **O — objective**: the natural-language goal and the metric *direction* (maximize accuracy? minimize loss/steps?).
- **E_dev — development evaluator**: a command you can run freely during search to score a candidate. Fast, repeatable.
- **E_test — held-out test evaluator**: a *separate* evaluator (different seeds, different split, or a larger run) used only at the merge gate. It must not be used as a search oracle — that's the whole point.
If the user hasn't given you a clean dev/test split, **construct one and say so**. The dev/test separation is the mechanism that catches overfitting: a candidate that wins on dev but not on test isn't a success, it's a warning that you're exploiting the feedback signal. Without it, autonomous search reliably overfits.
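The dev/test logic above can be sketched as a small decision function. This is a hedged illustration of the idea, not the merge gate implemented in `scripts/tree.py`: the function names, the verdict strings, and the strict-inequality comparison are assumptions, and a real gate would likely also account for evaluator noise.

```python
def is_better(candidate: float, baseline: float, direction: str) -> bool:
    """Compare two scores under the metric direction ("max" or "min")."""
    if direction == "max":
        return candidate > baseline
    if direction == "min":
        return candidate < baseline
    raise ValueError(f"unknown metric direction: {direction!r}")


def merge_gate(dev_cand: float, dev_best: float,
               test_cand: float, test_best: float,
               direction: str = "max") -> str:
    """Return a verdict for one candidate (verdict names are hypothetical).

    - "merge": the candidate improves on the held-out test evaluator.
    - "overfit-warning": it wins on dev but not on test.
    - "reject": it improves on neither.
    """
    if is_better(test_cand, test_best, direction):
        return "merge"
    if is_better(dev_cand, dev_best, direction):
        return "overfit-warning"
    return "reject"
```

For example, a candidate scoring 0.70 on dev against a best of 0.60, but 0.54 on test against a best of 0.55, yields `"overfit-warning"` rather than a merge.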
Initialize the run:
```bash
python scripts/tree.py init \
--objective "Improve BrowseComp answer accuracy on the search harness" \
--dev-eval "python eval.py --split dev --n 50" \
--test-eval "python eval.py --split test --n 300" \
--material "." --metric-direction max --branching 3 --max-depth 2 --budget 12
```
`--branching` is how many sibling hypotheses you propose per parent; `--max-depth 2` keeps directions at depth 1 and concrete interventions at depth 2 (the paper's default); `--budget` is the number of coordinator cycles. Start small (10–20 cycles) — structured search beats brute force, and you can extend if progress is still being made.
## The coordinator loop
You run repeated cycles of six steps. This is the heart of HTR; do not collapse it into ad-hoc editing. Run `python scripts/tree.py cycle` once per cycle to track the budget.
### 1. Observe
Begin every cycle by re-grounding in the tree, not in your memory of the conversation:
```bash
python scripts/tree.py observe
```
This prints the objective, global insights, the active frontier (selectable hypotheses), executed nodes with their evidence, pruned lessons (negative constraints), and the current best artifact. Treating the tree as the source of truth is what keeps you coherent over a long run, even after context compression has discarded the details.
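As a rough picture of what such a projection involves, the sketch below renders a stored tree state into a compact text summary. The state layout (a dict with `objective`, `insights`, `nodes`, and `best` keys) and the status values are assumptions made for illustration; the real output format of `tree.py observe` may differ.

```python
def observe(state: dict) -> str:
    """Render a hypothetical tree state as a plain-text summary."""
    lines = [f"Objective: {state['objective']}", "Global insights:"]
    lines += [f"  - {text}" for text in state.get("insights", [])]

    # Group nodes by status so the frontier, the evidence, and the
    # negative constraints are each listed together.
    sections = (("Frontier", "pending"),
                ("Executed", "executed"),
                ("Pruned", "pruned"))
    for title, status in sections:
        lines.append(f"{title}:")
        for node in state["nodes"]:
            if node["status"] == status:
                note = f" -> {node['insight']}" if node.get("insight") else ""
                lines.append(f"  - {node['id']}: {node['hypothesis']}{note}")

    lines.append(f"Best artifact: {state.get('best', 'none')}")
    return "\n".join(lines)
```

Because the summary is rebuilt from stored state on every call, it does not depend on anything remembered from earlier in the conversation.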
### 2. Ideate
Pick a promising parent and propose a few child hypotheses under it. **Condition on the tree's evidence** — this is the difference between Arbor and random search:
- Validated insights are assumptions you can build on.
- Pruned nodes are dead ends to avoid.
- A "half-right" result is a *starting point for a sharper hypothesis*, not a reason to abandon the direction.
Each hypothesis should be a **falsifiable claim about how changing the artifact will move the metric**, not a vague intention. Depth-1 nodes are broad directions ("the search harness loses correct answers it already retrieved"); depth-2 nodes are concrete, executable interventions ("run K=5 independent rollouts and aggregate by evidence dossier instead of majority vote").
```bash
python scripts/tree.py add-node --parent n0 --hypothesis "Verification, not retrieval, is the bottleneck: candidates are found but discarded"
python scripts/tree.py add-node --parent n4 --hypothesis "Decompose the question into atomic constraints and verify each independently"
```
### 3. Select
Choose which pending l