batch-cohort
Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.
git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/batch-cohort && cp -r /tmp/batch-cohort/skills/batch-cohort ~/.claude/skills/batch-cohort
# Batch Cohort Analysis Skill
You are assisting a medical researcher in generating multiple analysis scripts from a single
validated methodology template, each differing only in the exposure/outcome variable combination.
This replicates the "80-person research team" pattern: one PI designs the methodology, and
many researchers execute the same approach with different variable swaps.
## When to Use
- Researcher has a **validated analysis template** (e.g., from /replicate-study or /cross-national)
- Wants to explore **multiple exposure → outcome combinations** on the same database
- Goal: systematic variable-swap code generation + batch execution + result matrix
## Inputs
1. **Database path(s)**: CSV/SAS data files (KNHANES, NHANES, NHIS, or any cleaned cohort)
2. **Methodology template**: One of:
   - Path to a validated R/Python analysis script (from /replicate-study or /cross-national)
   - A paper type template name: `nhis_cohort`, `cross_national`, `survey_weighted`
   - A source paper to extract methodology from (falls back to /replicate-study Phase 1)
3. **Combination spec**: A list of exposure/outcome pairs, provided as:
   - Inline list: `exposures: [depression, obesity, smoking]; outcomes: [diabetes, hypertension, CVD]`
   - CSV file with columns: `exposure`, `outcome`, (optional) `subgroup_vars`
   - `"all"` keyword: generates all pairwise combinations from the lists
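As a rough illustration of the `"all"` keyword, the sketch below expands two lists into every exposure/outcome pair. The function name `expand_combinations` and the rule that drops a pair whose exposure equals its outcome are assumptions made for this example, not part of the skill's specification.

```python
from itertools import product


def expand_combinations(exposures, outcomes, pairs="all"):
    """Return a list of (exposure, outcome) tuples.

    pairs="all" yields the full cross product of the two lists; any other
    value is treated as an explicit iterable of (exposure, outcome) pairs.
    """
    if pairs == "all":
        combos = list(product(exposures, outcomes))
    else:
        combos = [tuple(p) for p in pairs]
    # Assumption: a variable is never analysed as its own outcome.
    return [(e, o) for e, o in combos if e != o]


exposures = ["depression", "obesity", "smoking"]
outcomes = ["diabetes", "hypertension", "CVD"]
combos = expand_combinations(exposures, outcomes)
print(len(combos))  # 3 exposures x 3 outcomes = 9 pairs
```

An explicit pair list (for example, rows read from the CSV form of the spec) can be passed through the `pairs` argument instead.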
### Optional Inputs
- **Covariate set**: Fixed covariate list for all analyses (default: use template's set)
- **Subgroup variables**: Variables to stratify by (default: sex, age group)
- **Output format**: `code_only` (just scripts) | `execute` (run + collect results) | `full` (code + results + summary)
- **Cross-national mode**: If TRUE, generates paired scripts for both countries per combination
## Workflow
### Phase 1: Template Validation
1. Read the methodology template (R script or paper type reference).
2. Identify the **slot variables**, the parts that change per combination:
   - `EXPOSURE_VAR`: raw variable name in the database
   - `EXPOSURE_LABEL`: human-readable label for tables/figures
   - `EXPOSURE_CODING`: how to derive binary/categorical exposure
   - `OUTCOME_VAR`: raw variable name
   - `OUTCOME_LABEL`: human-readable label
   - `OUTCOME_CODING`: how to derive binary outcome
3. Verify the template runs successfully on at least one combination before batch generation.
4. Output: template summary with identified slots → user approval.
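One way to make the slots checkable is to mark them in the template text and scan for them before batch generation. The sketch below assumes a hypothetical `@@SLOT_NAME@@` marker convention, chosen here only because it is unlikely to collide with R syntax such as `df$var`. The helper names are illustrative as well.

```python
import re

# The six slot names listed in Phase 1.
REQUIRED_SLOTS = {
    "EXPOSURE_VAR", "EXPOSURE_LABEL", "EXPOSURE_CODING",
    "OUTCOME_VAR", "OUTCOME_LABEL", "OUTCOME_CODING",
}
# Hypothetical marker convention: @@SLOT_NAME@@
SLOT_PATTERN = re.compile(r"@@([A-Z_]+)@@")


def find_slots(template_text):
    """Return the set of slot names that appear in the template."""
    return set(SLOT_PATTERN.findall(template_text))


def missing_required(template_text):
    """Return required slots that the template never references."""
    return sorted(REQUIRED_SLOTS - find_slots(template_text))


def fill_slots(template_text, values):
    """Replace every marker; fail loudly if a slot has no value."""
    unfilled = find_slots(template_text) - set(values)
    if unfilled:
        raise KeyError(f"no value for slots: {sorted(unfilled)}")
    return SLOT_PATTERN.sub(lambda m: values[m.group(1)], template_text)


template = (
    "df$exposure <- as.integer(@@EXPOSURE_CODING@@)\n"
    "fit <- glm(@@OUTCOME_VAR@@ ~ exposure, data = df, family = binomial)\n"
)
filled = fill_slots(template, {
    "EXPOSURE_CODING": "df$BP_PHQ >= 10",
    "OUTCOME_VAR": "diabetes",
})
print(filled)
```

The list returned by `missing_required` could feed the template summary shown to the user for approval.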
### Phase 2: Variable Specification
For each exposure and outcome in the combination spec:
1. **Look up** the variable in the database:
   - KNHANES: check variable name exists in the CSV header
   - NHANES: check which table contains the variable (use codebook.csv if available)
   - NHIS: check claims code or variable name
2. **Define coding**:
   - Binary: threshold or category mapping (e.g., `HE_glu >= 126 → diabetes = 1`)
   - Categorical: level definitions (e.g., `smoking: current/former/never`)
3. **Check covariate overlap**: If the exposure IS one of the standard covariates, remove it from the adjustment set for that analysis (no self-adjustment).
4. Output: **combination matrix** with all variable specifications.
```
| # | Exposure | Exposure Coding | Outcome | Outcome Coding | Covariates (adjusted) | Notes |
|---|----------|-----------------|---------|----------------|----------------------|-------|
| 1 | Depression (PHQ≥10) | BP_PHQ sum ≥10 | Diabetes | HE_glu≥126 or HbA1c≥6.5 or DE1_dg=1 | age,sex,edu,income,smoking,alcohol,obesity,CVD | — |
| 2 | Obesity (BMI≥25) | HE_obe ≥4 | Diabetes | same | age,sex,edu,income,smoking,alcohol,depression,CVD | obesity removed from covariates |
| ... | | | | | | |
```
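The header lookup and the no-self-adjustment rule can be sketched as follows. The covariate list is inferred from the two example rows in the matrix above, and the function names are assumptions for illustration.

```python
import csv
import io

# Standard adjustment set, inferred from the example matrix rows.
STANDARD_COVARIATES = [
    "age", "sex", "edu", "income", "smoking",
    "alcohol", "depression", "obesity", "CVD",
]


def missing_variables(csv_file, required):
    """Return the required variable names absent from the CSV header."""
    header = next(csv.reader(csv_file))
    return [name for name in required if name not in header]


def adjust_covariates(exposure, covariates=STANDARD_COVARIATES):
    """Drop the exposure from the adjustment set (no self-adjustment)."""
    adjusted = [c for c in covariates if c != exposure]
    removed = len(adjusted) < len(covariates)
    note = f"{exposure} removed from covariates" if removed else ""
    return adjusted, note


def build_matrix(pairs):
    """Build one combination-matrix row per (exposure, outcome) pair."""
    rows = []
    for index, (exposure, outcome) in enumerate(pairs, start=1):
        covariates, note = adjust_covariates(exposure)
        rows.append({
            "#": index,
            "Exposure": exposure,
            "Outcome": outcome,
            "Covariates (adjusted)": ",".join(covariates),
            "Notes": note,
        })
    return rows


# In-memory stand-in for a data file; a real run would open the CSV instead.
sample = io.StringIO("age,sex,HE_glu,BP_PHQ\n45,1,130,12\n")
print(missing_variables(sample, ["HE_glu", "HE_obe"]))

matrix = build_matrix([("depression", "diabetes"), ("obesity", "diabetes")])
for row in matrix:
    print(row)
```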
### Phase 3: Batch Code Generation
For each combination in the matrix:
1. **Clone** the template script.
2. **Replace** slot variables with the combination-specific values.
3. **Adjust covariates**: Remove exposure variable from covariate list if present.
4. **Set output paths**: Each combination gets its own results subdirectory.
5. **Generate a master runner script** (`run_all.R` or `run_all.sh`) that:
   - Executes all N scripts sequentially (or in parallel via `future`/`parallel`)
   - Captures errors per script without stopping the batch
   - Logs execution time per analysis
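Steps 1 to 4 can be sketched in Python as below. This is a minimal illustration, not the skill's actual generator: the `@@SLOT@@` marker convention, the `COVARIATES` and `RESULTS_DIR` slot names, the spec dictionary layout, and the toy `glm` template are all assumptions. File names follow the `01_depression_diabetes.R` pattern from the output layout.

```python
import tempfile
from pathlib import Path


def generate_scripts(template_text, specs, batch_dir):
    """Write one filled-in script per spec and return the script paths."""
    batch_dir = Path(batch_dir)
    scripts_dir = batch_dir / "scripts"
    scripts_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, spec in enumerate(specs, start=1):
        name = f"{index:02d}_{spec['exposure']}_{spec['outcome']}".lower()
        # Each combination gets its own results subdirectory.
        results_dir = batch_dir / "results" / name
        results_dir.mkdir(parents=True, exist_ok=True)
        slots = dict(spec["slots"])
        slots["COVARIATES"] = " + ".join(spec["covariates"])
        slots["RESULTS_DIR"] = results_dir.as_posix()
        text = template_text
        for key, value in slots.items():
            text = text.replace(f"@@{key}@@", value)
        path = scripts_dir / f"{name}.R"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


# Toy template for illustration only; a real template is the validated script.
TEMPLATE = (
    "fit <- glm(@@OUTCOME_VAR@@ ~ @@EXPOSURE_VAR@@ + @@COVARIATES@@,\n"
    "           data = df, family = binomial)\n"
    'saveRDS(fit, file.path("@@RESULTS_DIR@@", "fit.rds"))\n'
)
specs = [
    {"exposure": "depression", "outcome": "diabetes",
     "slots": {"EXPOSURE_VAR": "depression", "OUTCOME_VAR": "diabetes"},
     "covariates": ["age", "sex", "obesity"]},
    {"exposure": "obesity", "outcome": "diabetes",
     "slots": {"EXPOSURE_VAR": "obesity", "OUTCOME_VAR": "diabetes"},
     "covariates": ["age", "sex", "depression"]},
]

with tempfile.TemporaryDirectory() as tmp:
    paths = generate_scripts(TEMPLATE, specs, tmp)
    names = [p.name for p in paths]
    first_script = paths[0].read_text(encoding="utf-8")
print(names)
```

The covariate lists passed in are expected to have the exposure already removed, as described in step 3.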
### Phase 4: Batch Execution (if `execute` or `full` mode)
1. Run the master script.
2. Collect results from each combination's output directory.
3. Handle failures gracefully:
   - Log which combinations failed and why
   - Common failures: convergence issues, too few events, empty subgroups
   - Suggest fixes for failed combinations
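The runner behaviour described above (keep going after a failure, record why it failed, log the time per analysis) might look like the following sequential sketch. It is written in Python rather than as `run_all.R`, and `run_batch`, its log fields, and the 500-character error excerpt are assumptions for illustration. `Rscript` is the default interpreter and must be on the `PATH` for a real run.

```python
import subprocess
import time


def run_batch(scripts, command=("Rscript",), timeout=None):
    """Run each script in turn, recording status, duration, and any error."""
    log = []
    for script in scripts:
        start = time.perf_counter()
        try:
            proc = subprocess.run(
                [*command, str(script)],
                capture_output=True, text=True, timeout=timeout,
            )
            ok = proc.returncode == 0
            # Keep only the tail of stderr, where the failing call usually is.
            error = "" if ok else proc.stderr.strip()[-500:]
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing interpreter or timeout: log it and continue the batch.
            ok, error = False, repr(exc)
        log.append({
            "script": str(script),
            "ok": ok,
            "seconds": round(time.perf_counter() - start, 2),
            "error": error,
        })
    return log


def failed_runs(log):
    """Return the log entries for combinations that did not finish."""
    return [entry for entry in log if not entry["ok"]]


# Example (requires R):
#   log = run_batch(sorted(Path("scripts").glob("[0-9]*.R")))
#   for entry in failed_runs(log):
#       print(entry["script"], entry["error"])
```

Because every exception is caught per script, one convergence failure or empty subgroup does not stop the remaining combinations.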
### Phase 5: Summary Matrix
Aggregate all results into a single summary:
**Main Results Matrix** (`summary_matrix.csv`):
| Exposure | Outcome | N | Events | Model 1 OR (95% CI) | Model 2 OR (95% CI) | Model 3 OR (95% CI) | p-value | Significant |
|----------|---------|---|--------|---------------------|---------------------|---------------------|---------|-------------|
| Depression | Diabetes | 5,811 | 487 | 2.14 (1.52–3.01) | 1.89 (1.33–2.69) | 1.36 (0.91–2.05) | 0.137 | No |
| Obesity | Diabetes | 5,811 | 487 | 3.45 (2.71–4.39) | 3.38 (2.65–4.32) | 3.12 (2.42–4.02) | <0.001 | Yes |
| ... | | | | | | | | |
**Subgroup Summary** (`subgroup_matrix.csv`): Same format, stratified by subgroup variables.
**Heatmap** (optional): Visual matrix of effect sizes × significance, exposure on Y-axis, outcome on X-axis.
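A minimal sketch of the aggregation step follows, assuming each combination's results have already been read into a dictionary. The field names (`n`, `events`, `models`, `p`), the 0.05 significance threshold, and the choice to test the last model's p-value are assumptions. The numbers are the illustrative depression/diabetes row from the table above.

```python
import csv
import io

COLUMNS = [
    "Exposure", "Outcome", "N", "Events",
    "Model 1 OR (95% CI)", "Model 2 OR (95% CI)", "Model 3 OR (95% CI)",
    "p-value", "Significant",
]


def format_or(estimate, lower, upper):
    """Format an odds ratio as 'OR (lower–upper)' with two decimals."""
    return f"{estimate:.2f} ({lower:.2f}–{upper:.2f})"


def format_p(p):
    return "<0.001" if p < 0.001 else f"{p:.3f}"


def summary_row(result, alpha=0.05):
    """Turn one combination's results into a summary-matrix row."""
    row = {
        "Exposure": result["exposure"],
        "Outcome": result["outcome"],
        "N": f"{result['n']:,}",
        "Events": str(result["events"]),
        "p-value": format_p(result["p"]),
        # Assumption: significance is judged on the final model's p-value.
        "Significant": "Yes" if result["p"] < alpha else "No",
    }
    for i, (est, lo, hi) in enumerate(result["models"], start=1):
        row[f"Model {i} OR (95% CI)"] = format_or(est, lo, hi)
    return row


def write_summary(results, handle):
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(summary_row(r) for r in results)


example = {
    "exposure": "Depression", "outcome": "Diabetes",
    "n": 5811, "events": 487, "p": 0.137,
    "models": [(2.14, 1.52, 3.01), (1.89, 1.33, 2.69), (1.36, 0.91, 2.05)],
}
buffer = io.StringIO()
write_summary([example], buffer)
print(buffer.getvalue())
```

Passing an open file for `summary_matrix.csv` instead of the `StringIO` buffer writes the matrix to disk.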
## Output Files
```
{working_dir}/batch_{timestamp}/
├── README.md — Batch run summary (N combinations, template used, date)
├── combination_matrix.csv — All exposure/outcome specs with coding
├── template/
│ └── base_template.R — The validated template (frozen copy)
├── scripts/
│ ├── 01_depression_diabetes.R
│ ├── 02_obesity_diabetes.R
│ ├── ...
│ └── run_all.R — Master execution script
├── results/
│ ├── 01_depression_diabetes/
│ │ ├── table1.csv
│ │ ├── main_r