benchmark-inventory
The benchmark-inventory skill systematically identifies and catalogs evaluation benchmarks within a specified research domain and capability area. It aggregates benchmark data from Papers With Code, Semantic Scholar, and web leaderboards, then structures findings with metadata including benchmark name, publication year, dataset size, primary metrics, and maintenance status. Use this skill when conducting literature reviews, selecting evaluation datasets for capability assessment, or analyzing the evaluation landscape of a research domain.
git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/benchmark-inventory && cp -r /tmp/benchmark-inventory/skills/benchmark-inventory ~/.claude/skills/benchmark-inventory
SKILL.md
# Benchmark Inventory SOP

Identify, catalog, and characterize all relevant benchmarks for a given research domain and capability focus area.

## Input

- **research_domain**: The broad research area (e.g., "natural language understanding", "code generation", "multimodal reasoning")
- **capability_focus**: The specific capability of interest (e.g., "commonsense reasoning", "mathematical problem solving")

## Procedure

1. Search Papers With Code for benchmarks tagged with the domain
2. Search Semantic Scholar for benchmark papers in the domain
3. Search the web for leaderboards and evaluation suites
4. For each benchmark found, collect: name, year, paper, size, primary metric, current SOTA, status
5. Classify by: capability tested, modality, difficulty level, maintenance status

## Output

A structured catalog of benchmarks with enough metadata to support downstream benchmark selection and analysis.
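
The catalog described above could be represented roughly as follows. This is a minimal sketch, not the skill's actual schema: the class `BenchmarkEntry`, the function `merge_entries`, the `sources` field, and the source labels are all hypothetical names introduced for illustration, and the sample records are placeholders rather than real benchmarks. It shows one record per benchmark carrying the fields from steps 4 and 5, plus a merge step that collapses the same benchmark found by more than one search.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional, Set


@dataclass
class BenchmarkEntry:
    """One catalog row; field names are illustrative assumptions."""
    # Metadata collected in step 4
    name: str
    year: int
    paper: str
    size: Optional[int]          # number of examples, if reported
    primary_metric: str
    current_sota: Optional[str]  # left as None when unknown
    status: str                  # maintenance status, e.g. "active"
    # Classification from step 5
    capability: str = ""
    modality: str = ""
    difficulty: str = ""
    # Which searches (steps 1-3) surfaced this benchmark
    sources: Set[str] = field(default_factory=set)


def merge_entries(entries: List[BenchmarkEntry]) -> List[BenchmarkEntry]:
    """Collapse duplicates by normalized name and union their sources."""
    catalog = {}
    for entry in entries:
        key = entry.name.strip().lower()
        if key in catalog:
            catalog[key].sources |= entry.sources
        else:
            # Copy so the caller's objects are not mutated
            catalog[key] = replace(entry, sources=set(entry.sources))
    return sorted(catalog.values(), key=lambda e: (e.year, e.name))


# Placeholder records only; none of these describe a real benchmark.
found = [
    BenchmarkEntry("ExampleBench", 2021, "placeholder paper", 1000,
                   "accuracy", None, "active",
                   sources={"papers_with_code"}),
    BenchmarkEntry(" examplebench ", 2021, "placeholder paper", 1000,
                   "accuracy", None, "active",
                   sources={"semantic_scholar"}),
    BenchmarkEntry("OtherBench", 2019, "placeholder paper", None,
                   "F1", None, "unmaintained",
                   sources={"web"}),
]

for row in merge_entries(found):
    print(row.name, row.year, sorted(row.sources))
```

Matching on a lowercased, stripped name is a deliberately simple choice; a real inventory would likely need fuzzier matching for benchmarks that appear under different names or versions.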
Experiment-specific - summarizes the DARE executor's research design into a clean research_result report, which must be written back into the spec file produced by formated-specs.
Experiment-specific - replaces writing-specs and emits DARE's 4-layer call plan as a clean research_graph schema. The last step forces a load of formated-result.
loss-1 judge - read a sample's full dialogue and decide whether the user simulator semantically enacted its Policy Card. check-blind.
loss-2 judge - pairwise quality comparison across the n rungs within one topic; decide monotonicity and endpoint separation. check-blind, D1-D5 only.
Strategy: inference to the best explanation when facing anomalies
Remove components one by one and observe how the system changes, revealing hidden dependencies and generating ideas from structural gaps.
Map system architecture to ablatable units for ablation studies
Design ablation studies to isolate component contributions in ML systems