auto-experiment
auto-experiment launches an autonomous deep learning experiment loop that cycles through planning, executing, and reflecting on GPU training runs without human intervention. Use this skill when you have a research project with a `PROJECT_BRIEF.md` file and want an AI agent to iteratively refine experiments, run training jobs on GPUs, monitor them at minimal cost, and automatically decide whether to iterate, pivot, or report findings based on its memory of previous attempts.
git clone --depth 1 https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7 /tmp/auto-experiment && cp -r /tmp/auto-experiment/skills/auto-experiment ~/.claude/skills/auto-experiment
SKILL.md
# auto-experiment

Launch an autonomous experiment agent that runs your deep learning experiments 24/7.

## What This Does

This skill starts a **THINK → EXECUTE → REFLECT** loop that:

1. Reads your `PROJECT_BRIEF.md` to understand the research goal
2. Analyzes previous results in `MEMORY_LOG.md`
3. Plans the next experiment (hypothesis + success criteria)
4. Implements code changes and runs a **mandatory dry-run**
5. Launches GPU training via `nohup` (tracks PID)
6. **Monitors at zero LLM cost** (only `kill -0 PID` + `tail log` + `nvidia-smi`)
7. Wakes up when training finishes to analyze results
8. Updates memory and decides: iterate, pivot, or report
9. Repeats

## Usage

```
Claude Code: /auto-experiment
Claude Code: /auto-experiment --project /path/to/my_project --gpu 0
Claude Code: /auto-experiment --project . --max-cycles 5
Codex: $auto-experiment
```

## Prerequisites

The project directory must contain:

### `PROJECT_BRIEF.md` (required)

A frozen reference describing your research goal. Example:

```markdown
# Goal
Train a ViT-B/16 on ImageNet to reach 78%+ top-1 accuracy.

# Codebase
- Training: train.py
- Config: configs/vit_base.yaml
- Data: /data/imagenet/

# Constraints
- GPU 0-3 available (use DDP)
- Max 90 epochs per run
- Report val accuracy after each run

# Current Best
- ResNet-50 baseline: 76.1%
```

### `config.yaml` (optional)

Override default agent settings:

```yaml
agent:
  provider: "anthropic"        # or "openai" / "claude_cli" / "codex_cli"
  model: "claude-sonnet-4-6"
  base_url: ""                 # optional compatible endpoint override
  api_key_env: ""              # optional custom key env var
  auth_token_env: ""           # optional custom bearer token env var
  max_cycles: -1               # -1 = unlimited
  max_steps_per_cycle: 3       # max sub-agent dispatches per cycle
  cooldown_interval: 300       # 5 min smart polling
memory:
  brief_max_chars: 3000
  log_max_chars: 2000
monitor:
  poll_interval: 900           # check every 15 min during training
  zero_llm: true
experiment:
  mandatory_dry_run: true
```

If the user wants a compatible API endpoint instead of the official Anthropic or OpenAI API, keep the same `provider` values and set `base_url` plus a custom `api_key_env`. Do not invent provider names like `qwen` or `glm`.

Optional remote execution over SSH:

```yaml
execution:
  mode: "ssh"
  ssh_host: "user@server"
  remote_workspace: "/home/user/my_project/workspace"
  remote_python: "python3"
```

In SSH mode, the controller state stays local (`PROJECT_BRIEF.md`, `workspace/MEMORY_LOG.md`, `workspace/HUMAN_DIRECTIVE.md`, `state.json`), while code edits, shell commands, training, log tailing, PID checks, and GPU queries run on the configured remote host.

## Workflow Details

### Phase 1: THINK

- Read `PROJECT_BRIEF.md` (frozen, max 3000 chars)
- Read `MEMORY_LOG.md` (rolling, auto-compacted)
- Check for `HUMAN_DIRECTIVE.md` (highest priority, auto-archived after reading)
- Analyze: What's the current best? What hasn't been tried? What's most promising?
- Output: experiment plan with hypothesis and success criteria

### Phase 2: EXECUTE

- Dispatch to Code Agent (5 tools: `run_shell`, `launch_experiment`, `write_file`, `read_file`, `list_files`)
- Code Agent implements changes
- **Mandatory dry-run** (2-step verify, abort if it fails)
- Launch training via `nohup`, capture PID
- Enter zero-cost monitoring loop:
  - backend PID check: is the process alive?
  - backend `nvidia-smi`: GPU utilization
  - backend `tail -50 logfile`: latest training output
- **Zero LLM API calls during this phase**

### Phase 3: REFLECT

- Parse training logs for metrics (loss, accuracy, FGD, FID, etc.)
- Compare against previous best
- Log milestone if improved (auto-compacted at 1200 chars)
- Log decision (rolling last 15 entries)
- Decide: try another config / pivot direction / generate report

### Human Override (anytime)

```bash
# Drop a directive file; the agent reads it next cycle with highest priority
echo "Try learning rate 1e-5 with cosine schedule" > workspace/HUMAN_DIRECTIVE.md
```

## Memory System

Two tiers with a constant total size (~5K chars / ~1,500 tokens), no matter how long the agent runs:

| Tier | File | Content | Cap |
|------|------|---------|-----|
| 1 | `PROJECT_BRIEF.md` | Frozen project reference | 3,000 chars |
| 2 | `MEMORY_LOG.md` | Key Results + Recent Decisions | 2,000 chars |

**Auto-compaction rules:**

- Key Results: oldest dropped when section > 1,200 chars
- Recent Decisions: only last 15 entries kept
- Total log hard-capped at 2,000 chars

## Cost

| Phase | Duration | LLM Cost |
|-------|----------|----------|
| THINK | 5-10 min | ~$0.05 |
| EXECUTE (training) | hours/days | **$0.00** |
| REFLECT | 5-10 min | ~$0.03 |
| **24h cycle total** | | **~$0.08** |

## Example Output

After a few cycles, your `workspace/MEMORY_LOG.md` will look like:

```markdown
# Memory Log

## Key Results
[04-07 14:30] Exp001: ResNet-50 baseline, lr=0.1, acc=76.1%
[04-07 22:15] Exp002: ViT-B/16, lr=1e-3, acc=74.8% (underperforming, lr too high)
[04-08 06:00] Exp003: ViT-B/16, lr=3e-4 + cosine, acc=77.9% (new best!)
[04-08 14:45] Exp004: ViT-B/16, lr=3e-4 + cosine + mixup, acc=78.3% (target reached!)

## Recent Decisions
[04-07 14:30] Start with ResNet-50 baseline to establish reference
[04-07 22:15] ViT lr=1e-3 too high, try 3e-4 next
[04-08 06:00] Cosine schedule helped significantly, try adding regularization
[04-08 14:45] Target reached! Generate final report.
```
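
The auto-compaction rules listed under Memory System can be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names `render` and `compact` are hypothetical, the log layout is taken from the example output, and the order in which entries are dropped to satisfy the 2,000-character hard cap (oldest decisions first, then oldest key results) is an assumption, since the rules do not specify it.

```python
# Caps taken from the auto-compaction rules in this document.
KEY_RESULTS_MAX_CHARS = 1200
MAX_DECISIONS = 15
LOG_MAX_CHARS = 2000


def render(key_results, decisions):
    """Render the log in the layout shown in the example output (hypothetical helper)."""
    lines = [
        "# Memory Log",
        "",
        "## Key Results",
        *key_results,
        "",
        "## Recent Decisions",
        *decisions,
    ]
    return "\n".join(lines) + "\n"


def compact(key_results, decisions):
    """Return (key_results, decisions) trimmed to the documented caps.

    Both inputs are lists of one-line entries, oldest first.
    """
    results = list(key_results)
    # Rule: only the last 15 decisions are kept.
    recent = list(decisions)[-MAX_DECISIONS:]

    # Rule: drop the oldest key results while the section exceeds 1,200 chars.
    while results and len("\n".join(results)) > KEY_RESULTS_MAX_CHARS:
        results.pop(0)

    # Rule: the whole log is hard-capped at 2,000 chars.
    # ASSUMPTION: oldest decisions are sacrificed first, then oldest results.
    while len(render(results, recent)) > LOG_MAX_CHARS:
        if recent:
            recent.pop(0)
        elif results:
            results.pop(0)
        else:
            break  # nothing left to drop; only the headings remain

    return results, recent
```

Calling `compact` after every REFLECT phase and writing `render(*compact(...))` back to `workspace/MEMORY_LOG.md` would keep the file within its budget regardless of how many cycles have run.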
Experiment implementation, execution, and monitoring
Literature search and hypothesis formation
Central decision-maker that plans experiments and reflects on results
Report generation and paper writing
Search papers from top AI/ML conferences
Daily arXiv paper recommendations with automatic deduplication
Check status of running autonomous experiment loops
Check GPU status, running experiments, and available resources