baseline-establishment
Baseline-establishment systematically collects and standardizes performance data across machine learning methods through five coordinated strategies: method inventory, performance extraction, condition standardization, discrepancy analysis, and progress quantification. Use this skill when establishing state-of-the-art benchmarks, comparing competing approaches fairly across papers with different experimental conditions, identifying reproducibility gaps, or tracking performance improvements over time to quantify remaining headroom toward theoretical ceilings.
git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/baseline-establishment && cp -r /tmp/baseline-establishment/skills/baseline-establishment ~/.claude/skills/baseline-establishment
SKILL.md
# Baseline Establishment

## Strategy Routing

| User Intent | Route To |
|-------------|----------|
| Find all methods for a task | method-inventory |
| Extract scores from papers | performance-extraction |
| Normalize conditions across papers | condition-standardization |
| Check reproducibility / discrepancies | discrepancy-analysis |
| Track progress over time / headroom | progress-quantification |

## Manifest

### Strategies (5)

| Strategy | Purpose |
|----------|---------|
| method-inventory | Comprehensively identify all relevant methods for a task |
| performance-extraction | Systematically extract performance data and conditions from papers |
| condition-standardization | Standardize evaluation condition differences across papers |
| discrepancy-analysis | Identify discrepancies between reported and reproducible scores |
| progress-quantification | Track performance progress over time, quantify remaining headroom |

### Tactics (3)

| Tactic | Purpose |
|--------|---------|
| leaderboard-harvesting | Systematically collect performance data from platforms and papers |
| condition-normalization | Compare and standardize experimental conditions across papers |
| progress-curve-construction | Build performance-over-time progress curves |

### Subagent SOPs (10)

| SOP | Purpose |
|-----|---------|
| method-discovery | Identify methods via literature, leaderboards, citation chains |
| score-extraction | Extract (Task, Dataset, Metric, Score, Conditions) tuples |
| condition-cataloging | Record evaluation conditions per method |
| reproducibility-checklist-audit | Assess paper against ML Reproducibility Checklist |
| performance-table-assembly | Assemble unified comparison table |
| compute-normalization | Normalize results by compute budget |
| discrepancy-identification | Compare same-method scores across sources |
| headroom-estimation | Estimate ceiling vs current SOTA gap |
| progress-curve-fitting | Construct performance-over-time data |
| baseline-synthesis | Produce final structured baseline report |

## Budget Table

| Strategy | Methods | Data Points | Web Searches |
|----------|---------|-------------|--------------|
| method-inventory | 50 | 0 | 60 |
| performance-extraction | 30 | 150 | 40 |
| condition-standardization | 20 | 60 | 30 |
| discrepancy-analysis | 15 | 45 | 30 |
| progress-quantification | 30 | 100 | 40 |
| **TOTAL** | **145** | **355** | **200** |

## MCP Tools

| MCP Server | Tools |
|------------|-------|
| brave-search | brave_web_search, brave_llm_context |
| apify | rag-web-browser, google-scholar-scraper |
| alphaxiv | get_paper_content, answer_pdf_queries |
| semantic-scholar | ss_paper, ss_relevance_search, ss_citations, ss_references |

## Context Management

Campaign outputs are accumulated in the calling knowledge-acquisition context:

- `methods_inventory.json`: All discovered methods with metadata
- `performance_data.json`: Extracted scores with provenance
- `conditions_matrix.json`: Standardized conditions per method
- `discrepancy_report.json`: Flagged score inconsistencies
- `progress_curves.json`: Time-series performance data
- `baseline_report.md`: Final synthesized baseline document
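
## Example: Discrepancy and Headroom Sketch

The discrepancy-identification and headroom-estimation SOPs are described above only by purpose, so the following is a minimal sketch of one way they could work, not the skill's actual implementation. It assumes records shaped like the (Task, Dataset, Metric, Score, Conditions) tuples from score-extraction, extended with hypothetical `method` and `source` fields. The `tolerance` threshold, the `ceiling` value, the function names, and all sample data are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    # Field names are assumptions modeled on the score-extraction tuple.
    method: str
    task: str
    dataset: str
    metric: str
    score: float
    source: str          # provenance, e.g. "paper" or "leaderboard"
    conditions: str = ""


def find_discrepancies(records, tolerance=0.5):
    """Flag methods whose reported scores differ across sources by more than `tolerance`."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.method, r.task, r.dataset, r.metric)].append(r)

    flagged = []
    for key, recs in groups.items():
        scores = [r.score for r in recs]
        spread = max(scores) - min(scores)
        if len(recs) > 1 and spread > tolerance:
            flagged.append({
                "method": key[0],
                "spread": spread,
                "sources": sorted(r.source for r in recs),
            })
    return flagged


def headroom(records, ceiling):
    """Gap between an assumed ceiling and the best reported score (higher is better)."""
    return ceiling - max(r.score for r in records)


# Hypothetical sample data, not taken from any real paper or leaderboard.
records = [
    ScoreRecord("method-a", "classification", "toy-set", "accuracy", 90.0, "paper"),
    ScoreRecord("method-a", "classification", "toy-set", "accuracy", 88.0, "leaderboard"),
    ScoreRecord("method-b", "classification", "toy-set", "accuracy", 85.0, "paper"),
]

flagged = find_discrepancies(records)
gap = headroom(records, ceiling=100.0)
print(flagged)
print(gap)
```

Grouping on the full (method, task, dataset, metric) key keeps scores measured under different metrics or datasets from being compared against each other. A real pipeline would also need to account for differing evaluation conditions before treating a spread as a reproducibility problem.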
Experiment-specific - summarizes the DARE executor's research design into a clean research_result report, which must be written back into the spec file produced by formated-specs.
Experiment-specific - replaces writing-specs and emits DARE's 4-layer call plan as a clean research_graph schema. The last step forces loading of formated-result.
loss-1 judge - read a sample's full dialogue and decide whether the user simulator semantically enacted its Policy Card. check-blind.
loss-2 judge - pairwise quality comparison across the n rungs within one topic; decide monotonicity and endpoint separation. check-blind, D1-D5 only.
Strategy: inference to the best explanation in the face of anomalies
Remove components one by one and observe how the system changes, revealing hidden dependencies and generating ideas from structural gaps.
Map system architecture to ablatable units for ablation studies
Design ablation studies to isolate component contributions in ML systems
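
The ablation cards above describe removing components one at a time and observing the change. A minimal leave-one-out sketch of that idea follows. The `leave_one_out_ablation` helper, the component names, and the toy `evaluate` function (including its interaction bonus) are all hypothetical stand-ins for a real training and evaluation run.

```python
def leave_one_out_ablation(components, evaluate):
    """Return the full-system score and the score drop caused by removing each component."""
    full = frozenset(components)
    baseline = evaluate(full)
    drops = {}
    for component in components:
        ablated = full - {component}
        drops[component] = baseline - evaluate(ablated)
    return baseline, drops


# Hypothetical additive contributions for a toy system.
CONTRIBUTIONS = {"augmentation": 2.0, "pretraining": 5.0, "ensembling": 1.0}


def evaluate(active):
    """Toy scoring function: base score plus per-component gains plus one interaction."""
    score = 70.0 + sum(CONTRIBUTIONS[c] for c in active)
    # Assumed interaction: augmentation only pays off fully alongside pretraining.
    if "augmentation" in active and "pretraining" in active:
        score += 1.0
    return score


baseline, drops = leave_one_out_ablation(list(CONTRIBUTIONS), evaluate)
print(baseline)
print(drops)
```

In this toy setup the drops sum to more than the total gain over the base score, because the interaction bonus is lost when either of the two interacting components is removed. That kind of mismatch is one signal of a hidden dependency between components.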