ClaudeWave
Skill · 329 repo stars · updated 5d ago

benchmark-inventory

The benchmark-inventory skill systematically identifies and catalogs evaluation benchmarks within a specified research domain and capability area. It aggregates benchmark data from Papers With Code, Semantic Scholar, and web leaderboards, then structures findings with metadata including benchmark name, publication year, dataset size, primary metrics, and maintenance status. Use this skill when conducting literature reviews, selecting evaluation datasets for capability assessment, or analyzing the evaluation landscape of a research domain.

Install in Claude Code
git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/benchmark-inventory && cp -r /tmp/benchmark-inventory/skills/benchmark-inventory ~/.claude/skills/benchmark-inventory
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Benchmark Inventory SOP

Identify, catalog, and characterize all relevant benchmarks for a given research domain and capability focus area.

## Input

- **research_domain**: The broad research area (e.g., "natural language understanding", "code generation", "multimodal reasoning")
- **capability_focus**: Specific capability of interest (e.g., "commonsense reasoning", "mathematical problem solving")
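As a rough illustration, the two inputs could be held in a small container that also derives plain-text search queries. The `InventoryRequest` class, its `search_queries` method, and the query phrasing are all hypothetical; the skill itself does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryRequest:
    """Hypothetical container for the two skill inputs."""

    research_domain: str
    capability_focus: str

    def search_queries(self) -> list[str]:
        """Build plain-text queries; the phrasing is illustrative only."""
        domain = self.research_domain.strip()
        focus = self.capability_focus.strip()
        if not domain or not focus:
            raise ValueError("research_domain and capability_focus are both required")
        return [
            f"{domain} benchmark",
            f"{focus} benchmark",
            f"{domain} {focus} leaderboard",
        ]


# Example values taken from the input descriptions above.
request = InventoryRequest("natural language understanding", "commonsense reasoning")
queries = request.search_queries()
print(queries)
```

Keeping the inputs in one immutable object makes it easy to reuse the same query set across every search source.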

## Procedure

1. Search Papers With Code for benchmarks tagged with the domain
2. Search Semantic Scholar for benchmark papers in the domain
3. Search the web for leaderboards and evaluation suites
4. For each benchmark found, collect: name, year, paper, size, primary metric, current SOTA, status
5. Classify by: capability tested, modality, difficulty level, maintenance status
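Steps 4 and 5 can be sketched as a record type plus two helpers: one that merges duplicate hits from different sources, and one that groups the merged records by a classification attribute. This is a minimal sketch under assumptions: `BenchmarkRecord`, `merge_records`, and `group_by` are hypothetical names, matching duplicates by case-insensitive name is a simplification, and the sample entries are placeholders rather than real benchmarks.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkRecord:
    # Fields collected in step 4.
    name: str
    year: Optional[int] = None
    paper: Optional[str] = None
    size: Optional[int] = None
    primary_metric: Optional[str] = None
    current_sota: Optional[str] = None
    status: Optional[str] = None
    # Classification attributes from step 5.
    capability: Optional[str] = None
    modality: Optional[str] = None
    difficulty: Optional[str] = None


def merge_records(records):
    """Merge records sharing a case-insensitive name; first non-empty value wins."""
    merged = {}
    for rec in records:
        key = rec.name.strip().lower()
        if key not in merged:
            merged[key] = BenchmarkRecord(**vars(rec))
            continue
        kept = merged[key]
        for field_name, value in vars(rec).items():
            if getattr(kept, field_name) is None and value is not None:
                setattr(kept, field_name, value)
    return list(merged.values())


def group_by(records, attribute):
    """Group benchmark names by one attribute, using "unknown" for missing values."""
    groups = defaultdict(list)
    for rec in records:
        groups[getattr(rec, attribute) or "unknown"].append(rec.name)
    return dict(groups)


# Placeholder hits, as if returned by two different sources.
hits = [
    BenchmarkRecord(name="ExampleBench-A", year=2020, modality="text"),
    BenchmarkRecord(name="examplebench-a ", primary_metric="accuracy"),
    BenchmarkRecord(name="ExampleBench-B", modality="image"),
]
catalog = merge_records(hits)
print(group_by(catalog, "modality"))
```

Leaving every field except `name` optional lets partially described benchmarks enter the catalog and be completed as later sources fill the gaps.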

## Output

A structured catalog of benchmarks, with enough metadata per entry to support downstream benchmark selection and analysis.
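The skill does not prescribe an output format, so the following is only one possible shape: a list of dictionaries that serializes to JSON and renders as a markdown table. The `FIELDS` list mirrors the metadata named in the procedure; the function name, field keys, and the placeholder entries are assumptions.

```python
import json

# Column order follows the metadata listed in the procedure (assumed key names).
FIELDS = ["name", "year", "paper", "size", "primary_metric", "current_sota", "status"]


def to_markdown_table(entries):
    """Render catalog entries as a markdown table; missing values become empty cells."""
    header = "| " + " | ".join(FIELDS) + " |"
    divider = "| " + " | ".join("---" for _ in FIELDS) + " |"
    rows = []
    for entry in entries:
        cells = ["" if entry.get(f) is None else str(entry[f]) for f in FIELDS]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


# Placeholder entries, not real benchmarks.
entries = [
    {"name": "ExampleBench-A", "year": 2020, "primary_metric": "accuracy"},
    {"name": "ExampleBench-B", "size": 5000, "status": None},
]

table = to_markdown_table(entries)
as_json = json.dumps(entries, indent=2)
print(table)
```

The JSON form suits downstream tooling, while the table form is easier to scan during a literature review.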