Skill · 389 repo stars · updated 19d ago

benchmark-inventory

The benchmark-inventory skill systematically identifies and catalogs evaluation benchmarks within a specified research domain and capability area. It aggregates benchmark data from Papers With Code, Semantic Scholar, and web leaderboards, then structures findings with metadata including benchmark name, publication year, dataset size, primary metrics, and maintenance status. Use this skill when conducting literature reviews, selecting evaluation datasets for capability assessment, or analyzing the evaluation landscape of a research domain.

Repository: de-anthropocentric-research-engine

Install in Claude Code

git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/benchmark-inventory && mkdir -p ~/.claude/skills && cp -r /tmp/benchmark-inventory/skills/benchmark-inventory ~/.claude/skills/benchmark-inventory

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Benchmark Inventory SOP

Identify, catalog, and characterize all relevant benchmarks for a given research domain and capability focus area.

## Input

- **research_domain**: The broad research area (e.g., "natural language understanding", "code generation", "multimodal reasoning")
- **capability_focus**: Specific capability of interest (e.g., "commonsense reasoning", "mathematical problem solving")
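
The two inputs could be carried through the procedure as a small validated object. The sketch below is illustrative only: the `InventoryRequest` class and its `search_query` helper are hypothetical names, not part of the skill, and the query format is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryRequest:
    """Hypothetical container for the two skill inputs."""

    research_domain: str
    capability_focus: str

    def __post_init__(self) -> None:
        # Reject empty or whitespace-only values before any search runs.
        for name in ("research_domain", "capability_focus"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be a non-empty string")

    def search_query(self) -> str:
        # Assumed format: both inputs joined into one free-text query.
        return f"{self.research_domain} {self.capability_focus} benchmark"
```

For example, `InventoryRequest("code generation", "mathematical problem solving").search_query()` yields one query string that can be reused across all three search steps.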

## Procedure

1. Search Papers With Code for benchmarks tagged with the domain
2. Search Semantic Scholar for benchmark papers in the domain
3. Search the web for leaderboards and evaluation suites
4. For each benchmark found, collect: name, year, paper, size, primary metric, current SOTA, status
5. Classify by: capability tested, modality, difficulty level, maintenance status
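
Because steps 1 to 3 query different sources, the same benchmark is likely to turn up more than once under slightly different spellings. The following is a minimal sketch of how the records from steps 4 and 5 might be modeled and merged, assuming that names which differ only in case or punctuation refer to the same benchmark. `BenchmarkRecord`, `normalize_name`, and `merge_records` are hypothetical names, and the field types are assumptions.

```python
import re
from dataclasses import dataclass, fields, replace
from typing import Iterable, Optional


@dataclass
class BenchmarkRecord:
    # Metadata collected in step 4.
    name: str
    year: Optional[int] = None
    paper: Optional[str] = None
    size: Optional[int] = None
    primary_metric: Optional[str] = None
    current_sota: Optional[str] = None
    status: Optional[str] = None  # maintenance status
    # Classification assigned in step 5.
    capability: Optional[str] = None
    modality: Optional[str] = None
    difficulty: Optional[str] = None


def normalize_name(name: str) -> str:
    # Drop case, spaces, and punctuation so spelling variants share a key.
    return re.sub(r"[^a-z0-9]", "", name.lower())


def merge_records(records: Iterable[BenchmarkRecord]) -> list[BenchmarkRecord]:
    """Merge duplicates; the first value seen for a field wins."""
    merged: dict[str, BenchmarkRecord] = {}
    for rec in records:
        key = normalize_name(rec.name)
        if key not in merged:
            merged[key] = replace(rec)  # copy, so inputs are not mutated
            continue
        kept = merged[key]
        for f in fields(BenchmarkRecord):
            # Only fill gaps; never overwrite a value already collected.
            if getattr(kept, f.name) is None and getattr(rec, f.name) is not None:
                setattr(kept, f.name, getattr(rec, f.name))
    return list(merged.values())
```

Letting the first source win is a deliberate simplification. When sources disagree on a value such as year or current SOTA, a real run would probably want to keep both values and flag the conflict for review.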

## Output

A structured catalog of benchmarks, with enough metadata per entry to support downstream benchmark selection and analysis.
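
The skill does not prescribe a serialization format. As one possible shape, the sketch below emits the catalog as JSON, assuming each benchmark arrives as a plain dictionary. The key names mirror the fields listed in the procedure, but they and the `build_catalog` function are hypothetical.

```python
import json
from typing import Any

# Assumed key names, following steps 4 and 5 of the procedure.
FIELDS = [
    "name", "year", "paper", "size", "primary_metric", "current_sota",
    "status", "capability", "modality", "difficulty",
]


def build_catalog(
    research_domain: str,
    capability_focus: str,
    benchmarks: list[dict[str, Any]],
) -> str:
    # Keep only known fields; missing ones become null in the JSON output.
    entries = [{key: b.get(key) for key in FIELDS} for b in benchmarks]
    # Sort by name, case-insensitively, so repeated runs diff cleanly.
    entries.sort(key=lambda e: str(e["name"]).lower())
    catalog = {
        "research_domain": research_domain,
        "capability_focus": capability_focus,
        "benchmark_count": len(entries),
        "benchmarks": entries,
    }
    return json.dumps(catalog, indent=2, ensure_ascii=False)
```

Emitting every field for every entry, with explicit nulls where data is missing, keeps the gaps visible to whatever downstream step consumes the catalog.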