Skill223 estrellas del repoactualizado yesterday

generate-codebook

**Generate Codebook** profiles a tabular dataset (CSV, TSV, Excel, Parquet, Stata, SAS) to produce a machine-readable codebook.json and a human-reviewable codebook.md file. It documents every variable's role, data type, missingness, cardinality, and distribution, while explicitly flagging coded values as [NEEDS DICTIONARY] instead of guessing their meanings. Use this skill as the first step in the dictionary-first workflow to establish what is actually in the data before defining variable meanings downstream.

Ver fuente Repositorio: medsci-skills

Instalar en Claude Code

Copiar

git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/generate-codebook && cp -r /tmp/generate-codebook/skills/generate-codebook ~/.claude/skills/generate-codebook

Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

Definición

SKILL.md

# Generate Codebook Skill

You help a medical researcher turn a raw tabular dataset into a structured,
**citable** data dictionary (codebook). This is the *generator* side of the
dictionary-first workflow: it produces the artifact that `/define-variables` and
dictionary-first QC later consume. You generate code and review output — you do
**not** invent the meaning of coded values.

## Communication Rules

- Communicate with the user in their preferred language.
- Variable names, codebook fields, and report output are in English.
- Medical terminology is always in English.

## Philosophy

A codebook describes *what is in the data*, not *what the codes mean*. Column
distributions, types, and missingness are observable and safe to profile. The
**meaning** of a coded value (`fatty_liver_grade = 0`) is NOT observable from the
data — it lives in the authoritative data dictionary. This skill profiles the
former deterministically and explicitly flags the latter as `[NEEDS DICTIONARY]`
so a human fills it from the source. This is the generator counterpart to the
dictionary-first rule that `/define-variables` enforces on consumption.

## Reference Files

- **Schema + role rules**: `${CLAUDE_SKILL_DIR}/references/codebook_schema.md` — the
  codebook.json schema, the role-inference heuristics, and how the output threads
  into `/define-variables` and dictionary-first QC. Read this before interpreting output.

## Deterministic Script

Run the bundled profiler rather than describing columns from memory:

```bash
python "${CLAUDE_SKILL_DIR}/scripts/generate_codebook.py" data.csv --out-dir .
```

Supports `.csv/.tsv/.xlsx/.parquet/.dta/.sas7bdat`. Flags: `--max-levels N`
(categorical cutoff, default 20), `--json-only`, `--md-only`. The script is
pandas-only, runs locally, and never sends data anywhere.

## Workflow

### Step 1: Profile (deterministic)

Run `generate_codebook.py` on the dataset. It writes `codebook.json` (machine-
readable) and `codebook.md` (review table), reporting per variable: role
(id / continuous / categorical / binary / date / text), dtype, missingness,
unique count, level frequencies or quantile summary, and a `needs_dictionary` flag.

### Step 2: Review with the researcher (gate)

Present `codebook.md` and walk the user through it. **Gate:** the user confirms
the inferred roles (e.g., an integer-coded scale mis-read as continuous, or an id
column). Do not proceed to definition work until the user approves the role
assignments.

### Step 3: Resolve [NEEDS DICTIONARY] items (gate)

For every variable flagged `needs_dictionary: true`, the level codes are
uninterpretable without the authoritative source. **Gate:** ask the user to
supply the meaning of each code from the real data dictionary (file/sheet/row),
or to confirm none exists. Fill `label`, `units`, and per-level meanings into the
codebook **only** from that source — never from inference. If the user cannot
supply it, leave the `[NEEDS DICTIONARY]` marker in place; do not erase it.

### Step 4: Hand off

The completed `codebook.json` becomes the input dictionary for `/define-variables`
(operationalization) and the citation source for dictionary-first QC. **Gate:**
confirm with the user that no `needs_dictionary` flags remain unresolved before
the codebook is treated as authoritative for downstream analysis.

## Scope Limitations

### Supported
- Tabular files: CSV, TSV, Excel, Parquet, Stata (`.dta`), SAS (`.sas7bdat`).
- Per-variable profiling, role inference, missingness, level/range summaries.

### NOT Supported
- Inventing or guessing the meaning of coded values (that is `[NEEDS DICTIONARY]`).
- Cleaning or transforming data — use `/clean-data`.
- De-identification — use `/deidentify` before sharing.
- Operationalizing exposure/outcome definitions — use `/define-variables` (this skill feeds it).

## Cross-Skill Integration

- **/define-variables** consumes `codebook.json` as its data dictionary input.
- **/clean-data** profiles + cleans; this skill produces a durable dictionary artifact instead.
- **/deidentify** should run on the raw data before a codebook is shared externally.

## Output Format

`codebook.json` (schema in references) and `codebook.md` (review table with a
"Columns requiring dictionary lookup" section). Summarize the counts
(rows, columns, `needs_dictionary_count`) in chat; do not paste the full JSON.

## Worked Example

Input `cohort.csv`:

```text
patient_id,age,sex,fatty_liver_grade,smoking_status,visit_date
1001,54,1,0,never,2023-01-15
1002,61,2,2,former,2023-02-03
```

Run:

```bash
python "${CLAUDE_SKILL_DIR}/scripts/generate_codebook.py" cohort.csv --out-dir .
# -> {"n_rows": ..., "n_columns": 6, "needs_dictionary_count": 2, "outputs": [...]}
```

`codebook.md` (excerpt):

```text
| Variable            | Role        | Missing % | Unique | Needs dictionary |
| `patient_id`        | id          | 0.0       | N      |                  |
| `age`               | continuous  | 0.0       | ...    |                  |
| `sex`               | binary      | 0.0       | 2      | ⚠️ YES           |
| `fatty_liver_grade` | categorical | 0.0       | 5      | ⚠️ YES           |
| `smoking_status`    | categorical | 0.0       | 3      |                  |
| `visit_date`        | date        | 0.0       | ...    |                  |
```

`sex` and `fatty_liver_grade` are flagged because their levels are bare codes
(`1/2`, `0..4`). `smoking_status` is **not** flagged — its levels are already
human-readable. The reviewer then:

1. Opens the project's authoritative data dictionary.
2. Fills `sex`: `1 = male, 2 = female` and `fatty_liver_grade`: `0 = none … 4 = suspected`
   into the codebook **from that source** (citing file > sheet > row).
3. Confirms no `[NEEDS DICTIONARY]` flags remain, then hands `codebook.json` to
   `/define-variables`.

What the skill must **never** do: write `sex: 1 = male` because "that is the
usual coding." If the dictionary is unavailable, the flag stays.

## Anti-Hallucination

- Never

Del mismo repositorio

skillsSkill

academic-aioSkill

Medical AI paper optimization for AI search engines (Perplexity, ChatGPT web, Elicit, Consensus, SciSpace) and RAG-based literature tools. Applies when drafting or reviewing titles, abstracts, structured summary boxes (Key Points / Research in Context / Plain-Language Summary), manuscripts for high-impact medical AI journals (Lancet Digital Health, Radiology, Radiology-AI, npj Digital Medicine, Nature Medicine), preprints (medRxiv/arXiv), GitHub README + CITATION.cff + Zenodo archives, and Hugging Face model/dataset cards. Integrates TRIPOD+AI, CLAIM 2024, STARD-AI, TRIPOD-LLM, DECIDE-AI reporting requirements with generative engine optimization (GEO) principles. Produces a visible pass/fail checklist.

add-journalSkill

analyze-statsSkill

Statistical analysis for medical research papers. Generates reproducible Python/R code with publication-ready tables and figures. Supports diagnostic accuracy, inter-rater agreement, meta-analysis, survival analysis, survey data, group comparisons, regression, propensity score, and repeated measures.

author-strategySkill

PubMed author profile analysis. Author name → PubMed fetch → study-type classification → visualization → strategy report → optional trajectory-archetype classification.

batch-cohortSkill

Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.

calc-sample-sizeSkill

check-reportingSkill

Check manuscript compliance with medical research reporting guidelines. Supports 36 guidelines including STROBE, CONSORT, CONSORT-AI, STARD, STARD-AI, TRIPOD, TRIPOD+AI, TRIPOD-LLM, ARRIVE, PRISMA, PRISMA-DTA, PRISMA-P, CARE, SPIRIT, SPIRIT-AI, CLAIM, DECIDE-AI, MI-CLEAR-LLM, SQUIRE 2.0, CLEAR, MOOSE, GRRAS, SWiM, AMSTAR 2, and risk of bias tools (QUADAS-2, QUADAS-C, RoB 2, ROBINS-I, ROBINS-E, ROBIS, ROB-ME, PROBAST, PROBAST+AI, NOS, COSMIN, RoB NMA). Generates item-by-item assessment with PRESENT/MISSING/PARTIAL status.