ClaudeWave
Skill · 146 repo stars · updated yesterday

design-ai-benchmarking

Install in Claude Code
git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/design-ai-benchmarking && cp -r /tmp/design-ai-benchmarking/skills/design-ai-benchmarking ~/.claude/skills/design-ai-benchmarking
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Design-AI-Benchmarking Skill

## Purpose

This skill pressure-tests an AI-vs-human-expert benchmark **before any ratings are collected**, so that
the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the
reported reliability is interpretable. It is the AI-evaluation specialization of `/design-study`: where
`/design-study` reviews a study in general, this skill owns the specific machinery of comparing AI
system(s) to a panel of human experts (or to each other) on rated outputs.

Use it when:
- one or more AI systems will be scored against a human-expert reference (reader study, annotation
  panel, AI-output evaluation, model-vs-model bench)
- a rubric and rating protocol must be locked before reviewers begin
- a benchmark feels vulnerable to "the highest score is just the most tautological item" or
  "low agreement, but we cannot tell why" criticism
- a reviewer or editor asks how the evaluation controlled for rater drift, leakage, or judge bias

Do **not** use it for: general study/validity review (use `/design-study`); statistical execution such
as ICC or DeLong (use `/analyze-stats`); reporting-guideline item audits (use `/check-reporting`);
or reviewing an already-written manuscript (use `/peer-review` or `/self-review`).

---

## Communication Rules

- Communicate with the user in their preferred language.
- Use English for statistical, machine-learning, and reporting-guideline terminology.
- Be direct about evaluation-validity risks, but always propose the smallest feasible fix first.
- Never invent reviewer ratings, reference labels, or agreement statistics; those come from collected
  data only.

---

## Standard Output

```text
## AI-Benchmark Design Review
Evaluation question: ...
Arms / systems compared: ...
Reference (human-expert panel): ...
Unit of rating: (item / case / output)

### Rubric (decoupled dimensions)
- dimension -> construct -> anchors (1..k)

### Calibration probes (blinded, randomized)
- positive-control / known-bad / instability / mechanism-contradiction

### Reviewer panel
- n reviewers, metadata captured, per-reviewer randomized order

### Reliability plan
- overall IRR target + control-item IRR (reported separately)

### Judge strategy
- human-as-judge / LLM-as-judge / both + adjudication rule

### Validity risks
1. ...

### Minimal fixes
- ...

### Decision
- Ready to collect / Needs rubric revision / Needs arm or judge redesign
```

---

## Workflow

### Phase 1: Define the evaluation question and arms

Pin down, in writing:
- the exact claim the benchmark must support (e.g., "system A's outputs are perceptually
  indistinguishable from expert outputs", not "system A is deployment-ready")
- every arm/system being compared, and what each arm receives as input (same items, same information
  access, same output format) so no arm has a hidden advantage
- the human-expert reference: who they are, and whether they set ground truth, provide a comparison
  arm, or both
- the unit of rating (item, case, output) and how many units each reviewer sees

**Gate:** Present the reconstructed evaluation question, arms, and reference to the user and confirm
before designing the rubric. A wrong reconstruction misdirects the entire benchmark.
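The "no hidden advantage" requirement can be checked mechanically before the gate. The following is a minimal sketch, not part of the skill itself: the `Arm` record, its field names, and `arm_parity_issues` are hypothetical names introduced for illustration, and it assumes each arm's inputs can be described as sets of item IDs and information sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    """Hypothetical description of one arm's inputs and output format."""
    name: str
    item_ids: frozenset            # items this arm is run on
    information_access: frozenset  # e.g. {"image", "history"}
    output_format: str


def arm_parity_issues(arms):
    """Return human-readable mismatches of each arm against the first arm."""
    if len(arms) < 2:
        return []
    ref = arms[0]
    issues = []
    for arm in arms[1:]:
        if arm.item_ids != ref.item_ids:
            issues.append(f"{arm.name}: item set differs from {ref.name}")
        if arm.information_access != ref.information_access:
            extra = sorted(arm.information_access - ref.information_access)
            missing = sorted(ref.information_access - arm.information_access)
            issues.append(
                f"{arm.name}: information access differs from {ref.name} "
                f"(extra={extra}, missing={missing})"
            )
        if arm.output_format != ref.output_format:
            issues.append(f"{arm.name}: output format differs from {ref.name}")
    return issues
```

An empty result means the arms match on these three attributes only; it does not rule out other asymmetries such as time limits or prompt wording, which still need to be written down and reviewed.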

### Phase 2: Design a decoupled multi-dimensional rubric

- **Decouple the axes.** Each rated dimension measures one construct. Keep "is the output valid/correct"
  separate from "is it novel", "is it feasible/measurable", "does it add value over current tools", and
  "would it change action". A candidate can be high-validity yet low-added-value ("real but redundant");
  a single blended score hides this divergence.
- **Anchor every scale point** with a short verbal descriptor; pilot the anchors with at least one
  reviewer before locking.
- **Pre-specify discriminant validity**: hypothesize which dimensions should correlate and which should be
  orthogonal, then report the full inter-dimension correlation matrix to confirm the rubric measures
  distinct constructs.
- A worked rubric template lives in `${CLAUDE_SKILL_DIR}/references/elicitation_rubric_template.md`.
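The inter-dimension correlation check can be sketched as follows. This is an illustrative sketch only: it assumes ordinal ratings and therefore uses Spearman rank correlation (the skill does not prescribe a coefficient), and the function names and the 0.7 flagging threshold are assumptions. It is meant to run on collected ratings, never on invented ones.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman_matrix(ratings):
    """ratings: dict of dimension -> scores, all in the same item order.

    Returns {frozenset({dim_a, dim_b}): Spearman rho} for every pair.
    """
    if len({len(v) for v in ratings.values()}) > 1:
        raise ValueError("all dimensions must be rated on the same items")
    ranks = {dim: average_ranks(scores) for dim, scores in ratings.items()}
    return {
        frozenset((a, b)): pearson(ranks[a], ranks[b])
        for a, b in combinations(list(ratings), 2)
    }


def unexpected_couplings(matrix, expected_orthogonal, threshold=0.7):
    """Pairs pre-specified as orthogonal whose |rho| meets the threshold."""
    return [
        pair for pair in expected_orthogonal
        if abs(matrix[frozenset(pair)]) >= threshold
    ]
```

Pairs returned by `unexpected_couplings` are candidates for rubric revision: two dimensions that were hypothesized to be distinct but move together may be measuring one construct.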

### Phase 3: Insert and randomize calibration probes

Plant a small number of deliberate control items, blinded and randomized across raters (record who
received which via a `probe_arm` flag), to (i) anchor the scale, (ii) measure rater drift/fatigue, and
(iii) audit the rubric and pipeline itself. Four useful flavors:
- **Positive control / "too-good" item** — a known-strong or near-tautological item; tests whether
  raters equate "largest effect" with "best", and whether the construct-independence gate (Phase 7) works.
- **Known-bad negative control** — an engineered defect (fabricated reference, missing key statistic);
  expected to score low.
- **Instability item** — an estimate that reverses or fails to replicate on a holdout; tests
  caveat-handling.
- **Mechanism-contradiction item** — an empirical direction that opposes the proposed mechanism.

Probes are *planted or adjudicated*, never fabricated to fit a hypothesis.
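One way to plant probes while keeping raters blinded is to hand each rater a sheet of item IDs only and store the `probe_arm` flag in a separate key file. The sketch below assumes probe IDs are formatted like real item IDs so they cannot be spotted; `plant_probes`, its parameters, and the arm labels are hypothetical names, not an interface defined by the skill.

```python
import random

# Labels for the four probe flavors; "none" marks a real item.
PROBE_ARMS = (
    "positive_control",
    "known_bad",
    "instability",
    "mechanism_contradiction",
)


def plant_probes(rater_ids, real_item_ids, probes, probes_per_rater, seed):
    """Mix a random subset of probes into each rater's item sheet.

    probes: dict of item_id -> probe_arm for the pre-built control items.
    Returns (sheets, key_rows). `sheets` maps rater -> shuffled item IDs and
    is what raters see; `key_rows` records who received which probe and
    stays with the study coordinator.
    """
    unknown = set(probes.values()) - set(PROBE_ARMS)
    if unknown:
        raise ValueError(f"unknown probe arms: {sorted(unknown)}")
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    probe_ids = sorted(probes)
    sheets, key_rows = {}, []
    for rater in sorted(rater_ids):
        chosen = rng.sample(probe_ids, probes_per_rater)
        items = list(real_item_ids) + chosen
        rng.shuffle(items)  # avoid probes clustering at the end of the sheet
        sheets[rater] = items
        for item in items:
            key_rows.append({
                "rater_id": rater,
                "item_id": item,
                "probe_arm": probes.get(item, "none"),
            })
    return sheets, key_rows
```

Keeping the key rows apart from the rating sheets lets the later analysis split results by `probe_arm` without exposing the flag during rating.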

### Phase 4: Construct the reviewer panel

- Recruit reviewers spanning the intended expertise gradient; pre-specify any expertise stratification.
- Capture reviewer metadata (years of experience, prior AI-evaluation experience, subspecialty) for
  descriptive reporting and stratified analysis.
- Randomize item order **per reviewer** (not one global seed) and record the order; plan to analyze
  order and fatigue effects.
- Require each item to be judged standalone; discourage cross-item references in free-text, which signal
  non-independent rating.

**Gate:** Present the panel composition, stratification, and randomization plan for user review before
recruitment is finalized.
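Per-reviewer randomization with a recorded order can be sketched by deriving one seed per reviewer from a study-level seed, so each order is independent yet reproducible. This is a minimal sketch under assumptions: the function names, the seed-derivation scheme, and the row layout are illustrative choices, not something the skill specifies.

```python
import hashlib
import random


def reviewer_seed(study_seed, reviewer_id):
    """Derive a stable integer seed for one reviewer from the study seed."""
    digest = hashlib.sha256(f"{study_seed}:{reviewer_id}".encode()).hexdigest()
    return int(digest[:16], 16)


def presentation_orders(study_seed, reviewer_ids, item_ids):
    """Return one row per (reviewer, item) with its 1-based position.

    Each reviewer gets an independent shuffle; the rows are the record of
    the order, usable later for order and fatigue analyses.
    """
    rows = []
    for reviewer in reviewer_ids:
        order = sorted(item_ids)  # fixed starting order before shuffling
        random.Random(reviewer_seed(study_seed, reviewer)).shuffle(order)
        for position, item in enumerate(order, start=1):
            rows.append({
                "reviewer_id": reviewer,
                "item_id": item,
                "position": position,
            })
    return rows
```

Storing `position` alongside each rating makes it possible to model score against position per reviewer when checking for fatigue or drift.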

### Phase 5: Set inter-rater reliability targets

- Pre-specify the agreement statistic (e.g., ICC for continuous ratings, weighted kappa for ordinal)
  and a target with justification.
- **Report reliability on the planted control items separately** from the overall IRR.
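To illustrate reporting control-item reliability apart from the overall figure, the sketch below computes a quadratic-weighted Cohen's kappa for two raters on a 1..k ordinal scale, once on all items and once per subset. It is an illustration of the reporting split only; the choice of quadratic weights, the two-rater setup, and the row layout (`rater_a`, `rater_b`, `probe_arm`) are assumptions, and the actual statistical execution belongs to `/analyze-stats`.

```python
def weighted_kappa(a, b, k):
    """Quadratic-weighted Cohen's kappa for two raters on a 1..k scale.

    Returns NaN when it is undefined (no items, or no expected disagreement).
    """
    if k < 2:
        raise ValueError("scale needs at least two points")
    n = len(a)
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - 1][y - 1] += 1 / n
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # disagreement weight
            num += w * observed[i][j]        # observed disagreement
            den += w * row[i] * col[j]       # chance-expected disagreement
    if den == 0:
        return float("nan")
    return 1 - num / den


def reliability_report(rows, k):
    """rows: dicts with 'rater_a', 'rater_b', and 'probe_arm' ('none' = real)."""
    def kappa_for(subset):
        return weighted_kappa(
            [r["rater_a"] for r in subset],
            [r["rater_b"] for r in subset],
            k,
        )
    return {
        "all_items": kappa_for(rows),
        "non_control_items": kappa_for([r for r in rows if r["probe_arm"] == "none"]),
        "control_items": kappa_for([r for r in rows if r["probe_arm"] != "none"]),
    }
```

Reporting the three values side by side shows whether agreement on deliberately clear-cut control items is inflating, or masking, agreement on the real items.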