Skill · 389 repo stars · updated 20d ago

benchmark-inventory

The benchmark-inventory skill systematically identifies and catalogs evaluation benchmarks within a specified research domain and capability area. It aggregates benchmark data from Papers With Code, Semantic Scholar, and web leaderboards, then structures findings with metadata including benchmark name, publication year, dataset size, primary metrics, and maintenance status. Use this skill when conducting literature reviews, selecting evaluation datasets for capability assessment, or analyzing the evaluation landscape of a research domain.

Repository: de-anthropocentric-research-engine

Install in Claude Code

git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/benchmark-inventory && cp -r /tmp/benchmark-inventory/skills/benchmark-inventory ~/.claude/skills/benchmark-inventory

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Benchmark Inventory SOP

Identify, catalog, and characterize all relevant benchmarks for a given research domain and capability focus area.

## Input

- **research_domain**: The broad research area (e.g., "natural language understanding", "code generation", "multimodal reasoning")
- **capability_focus**: Specific capability of interest (e.g., "commonsense reasoning", "mathematical problem solving")

## Procedure

1. Search Papers With Code for benchmarks tagged with the domain
2. Search Semantic Scholar for benchmark papers in the domain
3. Search web for leaderboards and evaluation suites
4. For each benchmark found, collect: name, year, paper, size, primary metric, current SOTA, status
5. Classify by: capability tested, modality, difficulty level, maintenance status
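The record collected in steps 4 and 5 could be sketched as a small data structure. This is a minimal illustration, not part of the skill itself: the `BenchmarkRecord` class, the `normalize` and `merge_hits` helpers, the source labels, and the "Example-Bench" entries are all hypothetical, and no real search API is called. It assumes each search step yields plain dictionaries whose keys match the record's field names, and that the same benchmark may surface from several sources under slightly different spellings.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchmarkRecord:
    # Metadata from step 4 (hypothetical field names).
    name: str
    year: Optional[int] = None
    paper: Optional[str] = None
    size: Optional[int] = None
    primary_metric: Optional[str] = None
    current_sota: Optional[str] = None
    status: Optional[str] = None  # maintenance status, also used in step 5
    # Classification from step 5.
    capability: Optional[str] = None
    modality: Optional[str] = None
    difficulty: Optional[str] = None
    # Which searches (steps 1-3) surfaced this benchmark.
    sources: set = field(default_factory=set)


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so spelling variants collide."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def merge_hits(hits):
    """Merge (source, raw_dict) pairs into one record per benchmark.

    The first non-empty value seen for a field wins; later sources only
    fill fields that are still missing.
    """
    catalog = {}
    for source, raw in hits:
        key = normalize(raw["name"])
        rec = catalog.setdefault(key, BenchmarkRecord(name=raw["name"]))
        rec.sources.add(source)
        for attr, value in raw.items():
            if attr == "name" or value is None:
                continue
            if getattr(rec, attr) is None:
                setattr(rec, attr, value)
    return list(catalog.values())


# Placeholder hits, not real benchmark data.
hits = [
    ("papers_with_code", {"name": "Example-Bench", "year": 2021}),
    ("semantic_scholar", {"name": "example bench", "primary_metric": "accuracy"}),
]
merged = merge_hits(hits)
```

Keying on a normalized name is only one possible deduplication rule; matching on the paper identifier would be stricter when it is available.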

## Output

Structured catalog of benchmarks, with enough metadata to support downstream selection and analysis.
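The SOP does not prescribe a serialization format, so the following is only one possible shape for the catalog, assuming JSON is acceptable downstream. The `to_catalog_json` function, the key names, and the placeholder entries are hypothetical.

```python
import json


def to_catalog_json(records, research_domain, capability_focus):
    """Serialize benchmark dicts to JSON, oldest first, undated entries last."""
    ordered = sorted(
        records,
        key=lambda r: (r.get("year") is None, r.get("year") or 0, r["name"]),
    )
    payload = {
        "research_domain": research_domain,
        "capability_focus": capability_focus,
        "benchmarks": ordered,
    }
    return json.dumps(payload, indent=2, ensure_ascii=False)


# Placeholder entries, not real benchmark data.
catalog_json = to_catalog_json(
    [
        {"name": "Example-Bench-B", "year": None, "status": "unknown"},
        {"name": "Example-Bench-A", "year": 2021, "status": "active"},
    ],
    research_domain="natural language understanding",
    capability_focus="commonsense reasoning",
)
```

Echoing the two inputs in the payload keeps each catalog self-describing when several domains are inventoried side by side.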

From the same repository

formated-result (Skill)

Experiment-specific: summarizes the DARE executor's research design into a clean research_result report, which must be written back into the spec file produced by formated-specs.

formated-specs (Skill)

Experiment-specific: replaces writing-specs and emits DARE's 4-layer call plan as a clean research_graph schema. The last step forces loading of formated-result.

injection-fidelity (Skill)

loss-1 judge - read a sample's full dialogue and decide whether the user simulator semantically enacted its Policy Card. check-blind.

ladder-quality-order (Skill)

loss-2 judge - pairwise quality comparison across the n rungs within one topic; decide monotonicity and endpoint separation. check-blind, D1-D5 only.

abductive-hypothesis-generation (Skill)

Strategy: Inference to the best explanation in the face of anomalies

ablation-brainstorm (Skill)

Remove components one by one, observe system changes to reveal hidden

ablation-component-mapping (Skill)

Map system architecture to ablatable units for ablation studies

ablation-design (Skill)

Design ablation studies to isolate component contributions in ML systems