ClaudeWave
Skill · 329 repo stars · updated 5d ago

benchmark-inventory

The benchmark-inventory skill systematically identifies and catalogs evaluation benchmarks within a specified research domain and capability area. It aggregates benchmark data from Papers With Code, Semantic Scholar, and web leaderboards, then structures findings with metadata including benchmark name, publication year, dataset size, primary metrics, and maintenance status. Use this skill when conducting literature reviews, selecting evaluation datasets for capability assessment, or analyzing the evaluation landscape of a research domain.

Install in Claude Code
git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/benchmark-inventory && cp -r /tmp/benchmark-inventory/skills/benchmark-inventory ~/.claude/skills/benchmark-inventory
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Benchmark Inventory SOP

Identify, catalog, and characterize all relevant benchmarks for a given research domain and capability focus area.

## Input

- **research_domain**: The broad research area (e.g., "natural language understanding", "code generation", "multimodal reasoning")
- **capability_focus**: Specific capability of interest (e.g., "commonsense reasoning", "mathematical problem solving")

## Procedure

1. Search Papers With Code for benchmarks tagged with the domain
2. Search Semantic Scholar for benchmark papers in the domain
3. Search web for leaderboards and evaluation suites
4. For each benchmark found, collect: name, year, paper, size, primary metric, current SOTA, status
5. Classify by: capability tested, modality, difficulty level, maintenance status
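Steps 4 and 5 imply a per-benchmark record, and steps 1 to 3 imply that the same benchmark can surface from more than one source. The sketch below shows one way to hold those fields and merge duplicates. The skill does not prescribe a schema, so `BenchmarkRecord`, `normalize_name`, `merge_records`, the `sources` field, and the source labels in the usage note are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field, fields, replace
from typing import List, Optional


@dataclass
class BenchmarkRecord:
    # Metadata listed in step 4 (None means "not found yet").
    name: str
    year: Optional[int] = None
    paper: Optional[str] = None
    size: Optional[int] = None
    primary_metric: Optional[str] = None
    current_sota: Optional[str] = None
    status: Optional[str] = None
    # Classification axes listed in step 5.
    capability: Optional[str] = None
    modality: Optional[str] = None
    difficulty: Optional[str] = None
    # Hypothetical bookkeeping field: which searches returned this benchmark.
    sources: List[str] = field(default_factory=list)


def normalize_name(name: str) -> str:
    """Lowercase and drop non-alphanumerics so spelling variants collide."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def merge_records(records: List[BenchmarkRecord]) -> List[BenchmarkRecord]:
    """Merge records that refer to the same benchmark.

    The first record seen for a name wins; later records only fill
    fields that are still None and contribute their source labels.
    """
    merged = {}
    for rec in records:
        key = normalize_name(rec.name)
        if key not in merged:
            # Copy so the caller's record (and its sources list) is not mutated.
            merged[key] = replace(rec, sources=list(rec.sources))
            continue
        kept = merged[key]
        for f in fields(BenchmarkRecord):
            if f.name in ("name", "sources"):
                continue
            if getattr(kept, f.name) is None:
                setattr(kept, f.name, getattr(rec, f.name))
        for src in rec.sources:
            if src not in kept.sources:
                kept.sources.append(src)
    return list(merged.values())
```

For example, merging `BenchmarkRecord(name="Example-Bench", year=2020, sources=["papers_with_code"])` with `BenchmarkRecord(name="example bench", primary_metric="accuracy", sources=["semantic_scholar"])` yields a single record that carries both the year and the metric. Name normalization this aggressive can also merge genuinely distinct benchmarks, so a real run would probably want a manual review pass.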

## Output

A structured catalog of benchmarks, with metadata sufficient to select benchmarks for downstream analysis.
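The skill does not fix a serialization format for the catalog. As a minimal sketch, assuming a JSON document keyed by the two inputs, the catalog could be assembled as below. The function names, the `missing` field, and the sort order are illustrative assumptions; the per-benchmark keys mirror step 4 of the procedure.

```python
import json

# Per-benchmark keys taken from step 4 of the procedure.
REQUIRED_FIELDS = (
    "name", "year", "paper", "size",
    "primary_metric", "current_sota", "status",
)


def build_catalog(research_domain, capability_focus, entries):
    """Assemble a catalog dict from a list of per-benchmark dicts.

    Each row records which required fields are still unknown, so
    downstream steps can see how complete the metadata is.
    """
    benchmarks = []
    for entry in entries:
        row = {key: entry.get(key) for key in REQUIRED_FIELDS}
        row["missing"] = [key for key in REQUIRED_FIELDS if row[key] is None]
        benchmarks.append(row)
    # Oldest first; benchmarks with no known year go last.
    benchmarks.sort(
        key=lambda r: (r["year"] is None, r["year"] or 0, r["name"] or "")
    )
    return {
        "research_domain": research_domain,
        "capability_focus": capability_focus,
        "benchmarks": benchmarks,
    }


def to_json(catalog):
    """Serialize the catalog with stable key order for easy diffing."""
    return json.dumps(catalog, indent=2, sort_keys=True)
```

Keeping unknown values as `null` and listing them under `missing` keeps gaps visible in the output, so a gap is not silently filled with a guessed value.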