Skill · 389 repo stars · updated 20d ago

benchmark-audit

# ClaudeWave: benchmark-audit

The benchmark-audit skill conducts systematic quality evaluations of AI/ML benchmarks using the BetterBench 46-criterion framework combined with Datasheets for Datasets standards and psychometric principles. Use this skill when you need to assess benchmark documentation completeness, construct validity, statistical robustness, maintenance status, and known failure modes across multiple benchmarks through structured analysis of papers, web searches, and documentation audits.

Repository: de-anthropocentric-research-engine

Install in Claude Code


git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/benchmark-audit && cp -r /tmp/benchmark-audit/skills/benchmark-audit ~/.claude/skills/benchmark-audit

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Benchmark Audit Strategy

Systematic quality assessment of AI/ML benchmarks using the BetterBench 46-criterion framework, Datasheets for Datasets standards, and established psychometric evaluation principles.

## Purpose

Produce a structured quality report for each target benchmark covering: documentation completeness, construct validity indicators, statistical robustness, maintenance status, and known failure modes.

## Budget

| Resource | Floor | Target |
|----------|-------|--------|
| Benchmarks audited | 3 | 5 |
| Papers read | 20 | 30 |
| Web searches | 25 | 40 |

## State Ledger

```
<HARD-GATE>
| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Benchmarks audited | 0 | 5 | PENDING |
| Papers fetched | 0 | 30 | PENDING |
| Papers read | 0 | 20 | PENDING |
| Web searches | 0 | 40 | PENDING |
| Documentation audits complete | 0 | 5 | PENDING |
| Metric decompositions complete | 0 | 5 | PENDING |
| Contamination checks complete | 0 | 5 | PENDING |
| Synthesis reports produced | 0 | 5 | PENDING |
</HARD-GATE>
```

Do not exit until 80% of all targets are met.
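
As a minimal sketch of how the exit gate could be checked, the ledger can be held as a mapping from metric to target and compared against current counts. The source does not say whether the 80% threshold applies per metric or in aggregate; this sketch assumes the stricter per-metric reading. The names `LEDGER_TARGETS`, `unmet_metrics`, and `gate_open` are hypothetical and not part of the skill.

```python
# Targets copied from the State Ledger table; key names are illustrative.
LEDGER_TARGETS = {
    "benchmarks_audited": 5,
    "papers_fetched": 30,
    "papers_read": 20,
    "web_searches": 40,
    "documentation_audits": 5,
    "metric_decompositions": 5,
    "contamination_checks": 5,
    "synthesis_reports": 5,
}

EXIT_THRESHOLD_PERCENT = 80  # assumption: applied to each metric separately


def unmet_metrics(current: dict[str, int]) -> list[str]:
    """Return the metrics whose progress is below 80% of their target."""
    return [
        name
        for name, target in LEDGER_TARGETS.items()
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if 100 * current.get(name, 0) < EXIT_THRESHOLD_PERCENT * target
    ]


def gate_open(current: dict[str, int]) -> bool:
    """True when no metric is still below the exit threshold."""
    return not unmet_metrics(current)
```

Returning the list of unmet metrics, rather than only a boolean, makes it easy to update the Status column for each row.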

## Available Tactics

- **artifact-detection** — Probe for annotation artifacts and dataset shortcuts

## Available SOPs

- **benchmark-inventory** — Identify target benchmarks in domain
- **metric-decomposition** — Decompose composite metrics into constituent signals
- **contamination-audit** — Detect train-test data leakage
- **documentation-audit** — Assess documentation completeness (BetterBench/Datasheets)
- **benchmark-synthesis** — Produce final structured audit report
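
The skill does not specify how contamination-audit detects leakage. One common heuristic, shown here only as an illustrative sketch, is to measure how many word n-grams of a test item also appear in a candidate training corpus. The function names, whitespace tokenization, and the default `n = 8` are all assumptions, not part of the SOP.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased, whitespace-tokenized word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that occur in any corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        # Item shorter than n tokens: nothing to compare.
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A high ratio suggests the item may have been seen verbatim during training; a low ratio does not rule out paraphrased leakage, so this would be one signal among several.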

## Execution Guidance

1. **Inventory Phase**: Use benchmark-inventory to identify 5 benchmarks in target domain
2. **Per-Benchmark Loop** (repeat for each benchmark):
   a. Gather benchmark paper, documentation, leaderboard via web searches
   b. Run documentation-audit against BetterBench 46 criteria
   c. Run metric-decomposition on primary metric(s)
   d. Run contamination-audit checking known training corpora
   e. Run artifact-detection tactic if annotation-based benchmark
   f. Collect findings into per-benchmark report
3. **Synthesis Phase**: Run benchmark-synthesis to produce cross-benchmark comparison
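
The per-benchmark loop above can be sketched as a small driver. This is a hypothetical illustration: it models each SOP or tactic as a plain callable that takes a benchmark name and returns a dict of findings, which is an assumption about the interface, not how the skill actually dispatches. Step a (gathering sources) and the synthesis phase are left out for brevity.

```python
from typing import Callable

# Hypothetical interface: one callable per SOP/tactic, keyed by its name.
Step = Callable[[str], dict]


def audit_benchmarks(
    benchmarks: list[str],
    sops: dict[str, Step],
    is_annotation_based: Callable[[str], bool],
) -> dict[str, dict]:
    """Run steps b-f of the per-benchmark loop and collect one report each."""
    reports: dict[str, dict] = {}
    for name in benchmarks:
        report = {
            "documentation-audit": sops["documentation-audit"](name),
            "metric-decomposition": sops["metric-decomposition"](name),
            "contamination-audit": sops["contamination-audit"](name),
        }
        # Step e: the tactic runs only for annotation-based benchmarks.
        if is_annotation_based(name):
            report["artifact-detection"] = sops["artifact-detection"](name)
        reports[name] = report
    return reports
```

The returned mapping is what a benchmark-synthesis step would consume to build the cross-benchmark comparison.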

## Output Format

```yaml
benchmark_audit:
  benchmark_name: string
  version: string
  betterbench_score: float  # 0-1, proportion of 46 criteria met
  documentation_grade: A|B|C|D|F
  metric_analysis:
    primary_metric: string
    ceiling_effects: boolean
    polarity_issues: list
  contamination_risk: low|medium|high|critical
  artifact_risk: low|medium|high
  maintenance_status: active|stale|abandoned
  key_findings: list[string]
  recommendations: list[string]
```
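
As a rough sketch of how a report in this format might be checked before synthesis, the helpers below compute `betterbench_score` as the proportion of the 46 criteria met (as the schema comment states) and validate the enumerated fields. The helper names and the choice to return a list of problems are assumptions made for illustration.

```python
TOTAL_CRITERIA = 46  # criterion count taken from the schema comment

# Allowed values copied from the schema above.
ALLOWED_VALUES = {
    "documentation_grade": {"A", "B", "C", "D", "F"},
    "contamination_risk": {"low", "medium", "high", "critical"},
    "artifact_risk": {"low", "medium", "high"},
    "maintenance_status": {"active", "stale", "abandoned"},
}


def betterbench_score(criteria_met: int) -> float:
    """Proportion of the 46 criteria that the benchmark satisfies."""
    if not 0 <= criteria_met <= TOTAL_CRITERIA:
        raise ValueError(f"criteria_met must be between 0 and {TOTAL_CRITERIA}")
    return criteria_met / TOTAL_CRITERIA


def validate_report(report: dict) -> list[str]:
    """Return a list of schema problems; an empty list means none were found."""
    problems = []
    if not 0.0 <= report.get("betterbench_score", -1.0) <= 1.0:
        problems.append("betterbench_score must be in [0, 1]")
    for key, allowed in ALLOWED_VALUES.items():
        if report.get(key) not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    return problems
```

Note that `artifact_risk` has no `critical` level in the schema, so a value that is valid for `contamination_risk` can still be rejected there.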

<!-- BEGIN available-tables (generated) -->

## Available Tactics

Optional, no fixed order; the final leaf is always an SOP.

| Tactic | When to use |
| --- | --- |
| artifact-detection | Detect annotation artifacts and shortcuts in benchmarks |

## Available SOPs

Optional, no fixed order; the final leaf is always an SOP.

| SOP | When to use |
| --- | --- |
| benchmark-synthesis | Produce final structured audit report |
| contamination-audit | Detect train-test data leakage and memorization artifacts |
| documentation-audit | Assess documentation completeness against BetterBench/Datasheets standards |
| knowledge-acquisition-benchmark-inventory | Identify and catalog all relevant benchmarks in target domain |
| metric-decomposition | Decompose composite metrics into constituent signals, analyze polarity and ceiling effects |

<!-- END available-tables (generated) -->
