Skill, 223 repo stars, updated yesterday

design-ai-benchmarking

This skill pressure-tests AI-versus-human-expert benchmarks before any ratings are collected, so that the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the reliability metrics are interpretable. Use it when designing studies that score AI systems against human expert panels or competing models, when rubrics and protocols must be locked before reviewers begin, or when a benchmark is open to criticism about tautological items, low agreement that cannot be diagnosed, or uncontrolled rater drift and bias.

Repository: medsci-skills

Install in Claude Code

git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/design-ai-benchmarking && cp -r /tmp/design-ai-benchmarking/skills/design-ai-benchmarking ~/.claude/skills/design-ai-benchmarking

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Design-AI-Benchmarking Skill

## Purpose

This skill pressure-tests an AI-vs-human-expert benchmark **before any ratings are collected**, so that
the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the
reported reliability is interpretable. It is the AI-evaluation specialization of `/design-study`: where
`/design-study` reviews a study in general, this skill owns the specific machinery of comparing AI
system(s) to a panel of human experts (or to each other) on rated outputs.

Use it when:
- one or more AI systems will be scored against a human-expert reference (reader study, annotation
panel, AI-output evaluation, model-vs-model bench)
- a rubric and rating protocol must be locked before reviewers begin
- a benchmark feels vulnerable to "the highest score is just the most tautological item" or
"low agreement, but we cannot tell why" criticism
- a reviewer or editor asks how the evaluation controlled for rater drift, leakage, or judge bias

Do **not** use it for: general study/validity review (use `/design-study`); statistical execution such
as ICC or DeLong (use `/analyze-stats`); reporting-guideline item audits (use `/check-reporting`);
or reviewing an already-written manuscript (use `/peer-review` or `/self-review`).

---

## Communication Rules

- Communicate with the user in their preferred language.
- Use English for statistical, machine-learning, and reporting-guideline terminology.
- Be direct about evaluation-validity risks, but always propose the smallest feasible fix first.
- Never invent reviewer ratings, reference labels, or agreement statistics; those come from collected
data only.

---

## Standard Output

```text
## AI-Benchmark Design Review
Evaluation question: ...
Arms / systems compared: ...
Reference (human-expert panel): ...
Unit of rating: (item / case / output)

### Rubric (decoupled dimensions)
- dimension -> construct -> anchors (1..k)

### Calibration probes (blinded, randomized)
- positive-control / known-bad / instability / mechanism-contradiction

### Reviewer panel
- n reviewers, metadata captured, per-reviewer randomized order

### Reliability plan
- overall IRR target + control-item IRR (reported separately)

### Judge strategy
- human-as-judge / LLM-as-judge / both + adjudication rule

### Validity risks
1. ...

### Minimal fixes
- ...

### Decision
- Ready to collect / Needs rubric revision / Needs arm or judge redesign
```

---

## Workflow

### Phase 1: Define the evaluation question and arms

Pin down, in writing:
- the exact claim the benchmark must support (e.g., "system A's outputs are perceptually
indistinguishable from expert outputs", not "system A is deployment-ready")
- every arm/system being compared, and what each arm receives as input (same items, same information
access, same output format) so no arm has a hidden advantage
- the human-expert reference: who they are, and whether they set ground truth, provide a comparison
arm, or both
- the unit of rating (item, case, output) and how many units each reviewer sees

**Gate:** Present the reconstructed evaluation question, arms, and reference to the user and confirm
before designing the rubric. A wrong reconstruction misdirects the entire benchmark.
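
The "same items, same information access, same output format" requirement above can be checked mechanically before the gate. The following is a minimal sketch, not part of the skill itself: the `Arm` schema, its field names, and the example arms are all hypothetical and only illustrate one way to write the arms down and compare them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    """One arm of the benchmark (hypothetical schema, not defined by the skill)."""
    name: str
    item_ids: frozenset            # items this arm produces outputs for
    information_access: frozenset  # e.g. {"image", "clinical_history"}
    output_format: str             # e.g. "free_text_report"


def parity_problems(arms):
    """Compare every arm with the first one and describe each mismatch."""
    reference = arms[0]
    problems = []
    for arm in arms[1:]:
        for field in ("item_ids", "information_access", "output_format"):
            if getattr(arm, field) != getattr(reference, field):
                problems.append(f"{arm.name}: {field} differs from {reference.name}")
    return problems


# Illustrative arms only: the AI system lacks the clinical history the experts see.
items = frozenset({"case_01", "case_02", "case_03"})
arms = [
    Arm("expert_panel", items, frozenset({"image", "clinical_history"}), "free_text_report"),
    Arm("system_a", items, frozenset({"image"}), "free_text_report"),
]
for problem in parity_problems(arms):
    print(problem)
```

An empty result does not prove the comparison is fair; it only rules out the three mismatches listed in the text. Whether a remaining difference is acceptable is still a design decision to confirm with the user.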

### Phase 2: Design a decoupled multi-dimensional rubric

- **Decouple the axes.** Each rated dimension measures one construct. Keep "is the output valid/correct"
separate from "is it novel", "is it feasible/measurable", "does it add value over current tools", and
"would it change action". A candidate can be high-validity yet low-added-value ("real but redundant");
a single blended score hides this divergence.
- **Anchor every scale point** with a short verbal descriptor; pilot the anchors with at least one
reviewer before locking.
- **Pre-specify discriminant validity**: hypothesize which dimensions should correlate vs be orthogonal,
then report the full inter-dimension correlation matrix to confirm the rubric measures distinct
constructs.
- A worked rubric template lives in `${CLAUDE_SKILL_DIR}/references/elicitation_rubric_template.md`.
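
The discriminant-validity check described above could be sketched as follows. This is an illustration under stated assumptions: the dimension names, the pilot scores, and the 0.3 threshold are hypothetical, and Pearson correlation is used only for brevity (a rank-based coefficient may suit ordinal ratings better). The actual statistic and threshold belong in the pre-specified plan.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant dimension")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(ratings):
    """ratings: dimension name -> scores, all lists in the same item order."""
    dims = list(ratings)
    matrix = {d: {d: 1.0} for d in dims}
    for a, b in combinations(dims, 2):
        r = pearson(ratings[a], ratings[b])
        matrix[a][b] = r
        matrix[b][a] = r
    return matrix


def orthogonality_violations(matrix, expected_orthogonal, threshold=0.3):
    """Pairs hypothesized as orthogonal whose |r| exceeds the threshold."""
    return [
        (a, b, matrix[a][b])
        for a, b in expected_orthogonal
        if abs(matrix[a][b]) > threshold
    ]


# Hypothetical pilot scores for three rubric dimensions over six items.
pilot = {
    "validity": [5, 4, 5, 3, 4, 2],
    "novelty": [1, 3, 2, 4, 2, 5],
    "added_value": [2, 4, 1, 5, 3, 3],
}
matrix = correlation_matrix(pilot)
flagged = orthogonality_violations(matrix, [("validity", "novelty")])
```

The hypothesized pairs are written down before any data exist, and the full matrix is still reported regardless of which pairs are flagged.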

### Phase 3: Insert and randomize calibration probes

Plant a small number of deliberate control items, blinded and randomized across raters (record who
received which via a `probe_arm` flag), to (i) anchor the scale, (ii) measure rater drift/fatigue, and
(iii) audit the rubric and pipeline itself. Four useful flavors:
- **Positive control / "too-good" item** — a known-strong or near-tautological item; tests whether
raters equate "largest effect" with "best", and whether the construct-independence gate (Phase 7) works.
- **Known-bad negative control** — an engineered defect (fabricated reference, missing key statistic);
expected to score low.
- **Instability item** — an estimate that reverses or fails to replicate on a holdout; tests
caveat-handling.
- **Mechanism-contradiction item** — an empirical direction that opposes the proposed mechanism.

Probes are *planted or adjudicated*, never fabricated to fit a hypothesis.
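
One possible reading of "record who received which via a `probe_arm` flag" is sketched below. Everything except the `probe_arm` name is an assumption: the probe bank, the item IDs, the function names, and the choice to give each rater a random subset of probes are illustrative, not the skill's prescribed procedure.

```python
import random

# Hypothetical probe bank: probe item ID -> probe flavor.
# Probe IDs follow the same naming pattern as real items so raters stay blinded.
PROBES = {
    "item_101": "positive_control",
    "item_102": "known_bad",
    "item_103": "instability",
    "item_104": "mechanism_contradiction",
}


def assign_probes(real_item_ids, probes, rater_ids, probes_per_rater, seed):
    """Return one log row per (rater, item); probe rows carry a probe_arm flag."""
    rng = random.Random(seed)
    log = []
    for rater in rater_ids:
        # Each rater receives a random subset of the probe bank.
        chosen = rng.sample(sorted(probes), probes_per_rater)
        for item in real_item_ids:
            log.append({"rater_id": rater, "item_id": item, "probe_arm": None})
        for probe in chosen:
            log.append({"rater_id": rater, "item_id": probe, "probe_arm": probes[probe]})
    return log


def blinded_items(log, rater_id):
    """Item IDs handed to one rater, with the probe_arm flag stripped."""
    return sorted(row["item_id"] for row in log if row["rater_id"] == rater_id)


real_items = ["item_001", "item_002", "item_003"]
log = assign_probes(real_items, PROBES, ["r1", "r2"], probes_per_rater=2, seed=2024)
```

The log with the flag stays with the study coordinator; raters only ever see the blinded item list. Presentation order is deliberately left out here, since the source handles it per reviewer in Phase 4.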

### Phase 4: Construct the reviewer panel

- Recruit reviewers spanning the intended expertise gradient; pre-specify any expertise stratification.
- Capture reviewer metadata (years of experience, prior AI-evaluation experience, subspecialty) for
descriptive reporting and stratified analysis.
- Randomize item order **per reviewer** (not one global seed) and record the order; plan to analyze
order and fatigue effects.
- Require each item to be judged standalone; discourage cross-item references in free-text, which signal
non-independent rating.

**Gate:** Present the panel composition, stratification, and randomization plan for user review before
recruitment is finalized.
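
Per-reviewer randomization with a recorded order might look like the sketch below. The master seed string, reviewer IDs, and function names are hypothetical; the only idea taken from the text is that each reviewer gets their own reproducible shuffle and that the resulting positions are logged for later order and fatigue analysis.

```python
import hashlib
import random


def reviewer_seed(master_seed, reviewer_id):
    """Derive a stable integer seed for one reviewer from a study-level seed."""
    digest = hashlib.sha256(f"{master_seed}:{reviewer_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16)


def presentation_order(item_ids, master_seed, reviewer_id):
    """Shuffle the items with a generator seeded for this reviewer only."""
    rng = random.Random(reviewer_seed(master_seed, reviewer_id))
    order = list(item_ids)
    rng.shuffle(order)
    return order


def order_log(item_ids, master_seed, reviewer_ids):
    """Rows of (reviewer_id, position, item_id), with position starting at 1."""
    rows = []
    for reviewer in reviewer_ids:
        order = presentation_order(item_ids, master_seed, reviewer)
        for position, item in enumerate(order, start=1):
            rows.append((reviewer, position, item))
    return rows


items = [f"item_{i:03d}" for i in range(1, 11)]
rows = order_log(items, master_seed="pilot-study-seed", reviewer_ids=["r1", "r2", "r3"])
```

`hashlib` is used instead of the built-in `hash()` because string hashing in Python is salted per process, so `hash()` would not reproduce the same order across runs. The logged `position` column is what a later order or fatigue analysis would use as a covariate.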

### Phase 5: Set inter-rater reliability targets

- Pre-specify the agreement statistic (e.g., ICC for continuous ratings, weighted kappa for ordinal)
and a target with justification.
- **Report reliability on the planted control items separately** from the overall IRR.

More from this repository

skills (Skill)

academic-aio (Skill)

Medical AI paper optimization for AI search engines (Perplexity, ChatGPT web, Elicit, Consensus, SciSpace) and RAG-based literature tools. Applies when drafting or reviewing titles, abstracts, structured summary boxes (Key Points / Research in Context / Plain-Language Summary), manuscripts for high-impact medical AI journals (Lancet Digital Health, Radiology, Radiology-AI, npj Digital Medicine, Nature Medicine), preprints (medRxiv/arXiv), GitHub README + CITATION.cff + Zenodo archives, and Hugging Face model/dataset cards. Integrates TRIPOD+AI, CLAIM 2024, STARD-AI, TRIPOD-LLM, DECIDE-AI reporting requirements with generative engine optimization (GEO) principles. Produces a visible pass/fail checklist.

add-journal (Skill)

analyze-stats (Skill)

Statistical analysis for medical research papers. Generates reproducible Python/R code with publication-ready tables and figures. Supports diagnostic accuracy, inter-rater agreement, meta-analysis, survival analysis, survey data, group comparisons, regression, propensity score, and repeated measures.

author-strategy (Skill)

PubMed author profile analysis. Author name → PubMed fetch → study-type classification → visualization → strategy report → optional trajectory-archetype classification.

batch-cohort (Skill)

Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.

calc-sample-size (Skill)

check-reporting (Skill)

Check manuscript compliance with medical research reporting guidelines. Supports 36 guidelines including STROBE, CONSORT, CONSORT-AI, STARD, STARD-AI, TRIPOD, TRIPOD+AI, TRIPOD-LLM, ARRIVE, PRISMA, PRISMA-DTA, PRISMA-P, CARE, SPIRIT, SPIRIT-AI, CLAIM, DECIDE-AI, MI-CLEAR-LLM, SQUIRE 2.0, CLEAR, MOOSE, GRRAS, SWiM, AMSTAR 2, and risk of bias tools (QUADAS-2, QUADAS-C, RoB 2, ROBINS-I, ROBINS-E, ROBIS, ROB-ME, PROBAST, PROBAST+AI, NOS, COSMIN, RoB NMA). Generates item-by-item assessment with PRESENT/MISSING/PARTIAL status.