git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/define-variables && cp -r /tmp/define-variables/skills/define-variables ~/.claude/skills/define-variables
# Define-Variables Skill
## Purpose
Every observational study operationalizes abstract constructs (MASLD, CKD, emphysema, obesity, incidentaloma) into concrete rules against the available data dictionary. When that operationalization is invented ad-hoc from the dictionary alone, reviewers reject on construct validity regardless of downstream statistics.
This skill forces a **literature-first** pass: each variable is mapped to a canonical guideline/consensus definition, cross-checked against prior operationalizations in comparable cohorts, then mapped to available DB variables. Ad-hoc deviations are flagged explicitly and justified, not hidden.
Use it when:
- a study question is known and variables are being selected
- inclusion/exclusion criteria or phenotype definitions need citation backing
- a data dictionary has ambiguous or derived variables (eGFR formula, BMI class, liver steatosis criteria, etc.)
- a reviewer asked "why this cutoff?"
- a retrospective audit reveals drifted definitions across projects in the same cohort
Call after `/design-study`, before `/write-protocol`.
## Communication Rules
- Communicate in the user's preferred language.
- Keep all variable names, guideline names, and cutoffs in English.
- Produce one artifact: `variable_operationalization.md` in the project root (or path the user specifies).
## Inputs
1. **Research question** (one sentence)
2. **Candidate variables** — exposure, outcome, key covariates, eligibility filters
3. **Data dictionary path** (xlsx / csv / markdown) OR explicit list of available DB columns
4. **Cohort type** (e.g., health-screening, NHANES-like, claims, registry) — informs which prior-art cohort to compare against
Missing inputs → ask once, then proceed.
## 4-Tier Pipeline (DB codebook + token-efficient literature)
### Tier 0 — DB codebook lookup (mandatory for DB-backed observational studies)
**Trigger**: project has a `project.yaml::db.dictionary_path` field pointing to a machine-readable codebook (xlsx/csv/markdown), OR user supplied a dictionary path in inputs. If neither, skip to Tier 1.
For every candidate DB variable — **before** touching literature — open the dictionary and record, verbatim, the sheet name, row number, and code→meaning mapping. This prevents the single most common observational-study error: assuming a column code (`status == 0`, `grade == 4`) means what it intuitively reads like, when the codebook says otherwise.
Concrete procedure per variable:
1. Locate the variable in the dictionary by exact column name.
2. Copy verbatim: the sheet title, row number, and full code→meaning mapping (or unit/range statement for continuous vars).
3. Paste into the `Dict. sheet & row` + `Dict. verbatim` columns of the operationalization table.
4. If the variable is not found, OR the codebook is silent on a specific code value, file a question to the DB owner / data steward. Do NOT infer from cross-tabs, do NOT guess, do NOT proceed with that variable until a verbatim answer exists.
Empirical checks (value distributions, cross-tabs with related columns) are useful for sanity testing **after** the verbatim codebook meaning is recorded — never as a substitute for it.
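The per-variable procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: it assumes the dictionary has been exported to CSV, and the column names (`sheet`, `row`, `column_name`, `codes`) and the sample entries are hypothetical.

```python
import csv
import io

# Hypothetical dictionary export; column names and entries are assumptions.
SAMPLE_DICTIONARY_CSV = """sheet,row,column_name,codes
Lifestyle,14,smoking_status,"0=never; 1=former; 2=current"
Labs,52,creatinine,"unit: mg/dL"
"""


def load_dictionary(text):
    """Parse the dictionary CSV into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def lookup_verbatim(dictionary_rows, column_name):
    """Return the verbatim codebook entry for an exact column name, or None."""
    for entry in dictionary_rows:
        if entry["column_name"] == column_name:  # exact match only, no fuzzy guessing
            return {
                "dict_sheet_row": f"{entry['sheet']}, row {entry['row']}",
                "dict_verbatim": entry["codes"],  # copied unchanged
            }
    return None  # not found: escalate to the data steward, do not infer


def tier0(dictionary_rows, candidates):
    """Split candidates into recorded entries and open questions for the DB owner."""
    recorded, questions = {}, []
    for name in candidates:
        hit = lookup_verbatim(dictionary_rows, name)
        if hit is None:
            questions.append(name)
        else:
            recorded[name] = hit
    return recorded, questions


rows = load_dictionary(SAMPLE_DICTIONARY_CSV)
recorded, questions = tier0(rows, ["smoking_status", "alcohol_grade"])
print(recorded)
print(questions)
```

Variables that land in `questions` stay blocked until the DB owner supplies a verbatim answer; the sketch deliberately has no fallback that guesses from the data.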
Project-level binding (recommended): commit a `DICTIONARY_FIRST_POLICY.md` at the project root (or shared-config path) capturing the canonical dictionary path + escalation contact. Cross-project rule template: `~/.claude/rules/dictionary-first.md`.
**Exit gate**: `check_dictionary_citations.py` (or equivalent) PASS on the operationalization table before running Tier 1.
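The internals of `check_dictionary_citations.py` are not described here, so the following is only a hypothetical stand-in for an "equivalent" check. It assumes the gate means that every row of the operationalization table has non-empty `Dict. sheet & row` and `Dict. verbatim` cells; the sample table is invented for illustration.

```python
REQUIRED_COLUMNS = ("Dict. sheet & row", "Dict. verbatim")

# Invented sample with a reduced column set, for illustration only.
SAMPLE_TABLE = """
| Variable | Dict. sheet & row | Dict. verbatim |
|---|---|---|
| smoking_status | Lifestyle, row 14 | 0=never; 1=former; 2=current |
| alcohol_grade | Lifestyle, row 20 | |
"""


def parse_markdown_table(text):
    """Parse a pipe-delimited markdown table into a list of dicts keyed by header."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the |---|---| separator row.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def check_dictionary_citations(rows):
    """Return (variable, column) pairs whose dictionary citation is missing."""
    failures = []
    for row in rows:
        for column in REQUIRED_COLUMNS:
            if not row.get(column, "").strip():
                failures.append((row.get("Variable", "?"), column))
    return failures


failures = check_dictionary_citations(parse_markdown_table(SAMPLE_TABLE))
print("PASS" if not failures else f"FAIL: {failures}")
```

An empty `failures` list corresponds to PASS; any entry means Tier 1 should not start yet.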
### Tier 1 — Canonical index lookup (no API calls)
Check `references/common_definitions.md` (shipped with skill) for the variable. Covers high-frequency constructs:
- Liver: MASLD (AASLD 2023), MetALD (AASLD 2023), MAFLD (2020), NAFLD (legacy), ALD, viral hepatitis (AASLD 2022/2024 HBV, AASLD-IDSA HCV)
- Metabolic: T2DM (ADA 2024), prediabetes (ADA 2024), metabolic syndrome (IDF 2009 / NCEP ATP III / K-NCEP), obesity/BMI (WHO Asian 2004 + WHO global), HTN (ACC/AHA 2017 + JNC-8), dyslipidemia (NCEP ATP III, 2023 AHA/ACC)
- Renal: CKD (KDIGO 2024), eGFR formulas (CKD-EPI 2021 race-free, MDRD legacy), incidental renal mass (ACR 2018 white paper, Bosniak 2019)
- Pulmonary: COPD (GOLD 2024), emphysema imaging (Fleischner 2015)
- CV: CAC scoring (Agatston 1990, MESA percentiles), CAD risk (2018 ACC/AHA cholesterol, PREVENT 2023)
- Cancer: gastric cancer H. pylori (Maastricht VI 2022), thyroid nodule (ACR TI-RADS 2017), gallbladder polyp (European 2022 joint guideline)
- Imaging incidentalomas: adrenal (ACR 2023), pancreas (ACR 2017), renal (ACR 2018), thyroid (ACR 2017)
If the variable hits Tier 1, record: guideline, year, canonical cutoff, BibTeX key. Done — no `/search-lit` call.
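The Tier 1 decision (record and stop, or fall through to Tier 2) might look like the sketch below. The in-memory `INDEX` is a hypothetical stand-in for `references/common_definitions.md`, whose actual format is not shown here. The BibTeX keys are invented, and the cutoffs are left as placeholders to be copied from the index rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CanonicalDefinition:
    guideline: str
    year: int
    cutoff: str      # placeholder here; copy the real cutoff from the index
    bibtex_key: str  # hypothetical key naming scheme


# Hypothetical stand-in for references/common_definitions.md.
INDEX = {
    "ckd": CanonicalDefinition("KDIGO", 2024, "<cutoff from index>", "kdigo2024ckd"),
    "copd": CanonicalDefinition("GOLD", 2024, "<cutoff from index>", "gold2024copd"),
}


def tier1_lookup(construct: str) -> Optional[CanonicalDefinition]:
    """Return the canonical record for a construct, or None if it is not indexed."""
    return INDEX.get(construct.strip().lower())


def route(constructs):
    """Split constructs into Tier 1 hits and those that need a Tier 2 search."""
    hits, needs_search = {}, []
    for construct in constructs:
        record = tier1_lookup(construct)
        if record is None:
            needs_search.append(construct)
        else:
            hits[construct] = record
    return hits, needs_search


hits, needs_search = route(["CKD", "obstructive sleep apnea"])
print(hits)
print(needs_search)
```

Only the constructs in `needs_search` would trigger a `/search-lit` call, which is what keeps Tier 1 free of API usage.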
### Tier 2 — Targeted `/search-lit` (focused queries only)
For variables NOT in Tier 1, OR when subgroup justification is needed (Asian-specific cutoff, pediatric, young-adult, pregnancy, etc.), call `/search-lit` with **one query per variable** — not a general sweep. Query pattern:
```
"{construct} definition {cohort type} {subgroup qualifier}"
e.g., "obstructive sleep apnea definition Korean health screening cohort"
```
Cap: 5 queries per session. Stop early if first 1-2 papers converge on the same definition.
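As a rough sketch of the one-query-per-variable rule and the session cap, the helper below builds query strings from the pattern and defers anything beyond five. The function names and the input shape (a list of dicts with `construct` and an optional `subgroup`) are assumptions made for illustration; how `/search-lit` is actually invoked is outside this sketch.

```python
MAX_QUERIES_PER_SESSION = 5


def build_query(construct, cohort_type, subgroup=""):
    """Fill the pattern '{construct} definition {cohort type} {subgroup qualifier}'."""
    parts = [construct, "definition", cohort_type, subgroup]
    return " ".join(part.strip() for part in parts if part.strip())


def plan_queries(variables, cohort_type, cap=MAX_QUERIES_PER_SESSION):
    """Build one query per variable; return (planned, deferred) under the cap."""
    queries = [
        build_query(v["construct"], cohort_type, v.get("subgroup", ""))
        for v in variables
    ]
    return queries[:cap], queries[cap:]


variables = [{"construct": f"construct {i}"} for i in range(7)]
planned, deferred = plan_queries(variables, "health screening cohort")
print(planned)
print(deferred)
```

Deferred queries are returned rather than silently dropped, so the user can decide whether a second session is warranted.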
### Tier 3 — Verification
Before finalizing, run `/verify-refs` on the accumulated BibTeX to confirm every citation exists in PubMed/CrossRef. Ad-hoc choices (no canonical source found) must be flagged `Ad-hoc: yes` and justified with 1-2 sentences — never hidden.
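The "never hidden" rule for ad-hoc choices can be checked mechanically before `/verify-refs` runs. The sketch below is one possible check under stated assumptions: the field names (`variable`, `canonical_source`, `ad_hoc`, `justification`) are hypothetical, and it only tests that a justification is present, not that it is 1-2 sentences long.

```python
def adhoc_violations(rows):
    """Return (variable, problem) pairs for unflagged or unjustified ad-hoc choices."""
    problems = []
    for row in rows:
        has_source = bool(row.get("canonical_source", "").strip())
        flagged = row.get("ad_hoc", "").strip().lower() == "yes"
        justified = bool(row.get("justification", "").strip())
        if not has_source and not flagged:
            problems.append((row["variable"], "no canonical source but not flagged Ad-hoc: yes"))
        elif flagged and not justified:
            problems.append((row["variable"], "flagged Ad-hoc: yes but no justification"))
    return problems


# Invented rows for illustration only.
sample_rows = [
    {"variable": "ckd", "canonical_source": "KDIGO 2024", "ad_hoc": "no"},
    {"variable": "night_shift", "canonical_source": "", "ad_hoc": "no"},
    {"variable": "snack_freq", "canonical_source": "", "ad_hoc": "yes", "justification": ""},
    {"variable": "nap_habit", "canonical_source": "", "ad_hoc": "yes",
     "justification": "No guideline found; cutoff follows the questionnaire categories."},
]

problems = adhoc_violations(sample_rows)
for variable, problem in problems:
    print(f"{variable}: {problem}")
```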
## Output Template
Write to `{project_root}/variable_operationalization.md` using `templates/variable_operationalization.md`. Required structure:
1. **Header**: research question, cohort type, date, author
2. **Operationalization table** — one row per variable:
| Variable | Role | Dict. sheet & row | Dict. verbatim | Canonical source | Definition | Cutoff | DB vars | Implementation | Ad-hoc? |
|---|---|---|---|---|---|---|---|---|---|
- `Role`: exposure / outcome / covariate
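To show how one record maps onto the ten required columns, here is a minimal rendering sketch. The column list is taken from the table header above; the `render_table` helper, the sample record, and its placeholder values are illustrative assumptions, not the contents of `templates/variable_operationalization.md`.

```python
COLUMNS = [
    "Variable", "Role", "Dict. sheet & row", "Dict. verbatim", "Canonical source",
    "Definition", "Cutoff", "DB vars", "Implementation", "Ad-hoc?",
]


def render_table(records):
    """Render records as a markdown table with the required column order."""
    header = "| " + " | ".join(COLUMNS) + " |"
    separator = "|" + "---|" * len(COLUMNS)
    lines = [header, separator]
    for record in records:
        # Escape pipes so verbatim codebook text cannot break the table layout.
        cells = [str(record.get(col, "")).replace("|", "\\|") for col in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Invented record; angle-bracket values are placeholders, not real definitions.
sample_record = {
    "Variable": "ckd",
    "Role": "outcome",
    "Dict. sheet & row": "Labs, row 52",
    "Dict. verbatim": "unit: mg/dL",
    "Canonical source": "KDIGO 2024",
    "Definition": "<definition>",
    "Cutoff": "<cutoff>",
    "DB vars": "creatinine, age, sex",
    "Implementation": "<implementation>",
    "Ad-hoc?": "no",
}

table = render_table([sample_record])
print(table)
```

Missing keys render as empty cells instead of raising, so gaps stay visible in the output and can be caught by the Tier 0 exit gate.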