Skill223 estrellas del repoactualizado yesterday

deidentify

The deidentify skill guides medical researchers through safe de-identification of patient data using a local Python script that removes Protected Health Information without exposing raw data to the LLM. Use this skill when preparing datasets for research by running the standalone script to strip identifiers like names, medical record numbers, and dates while maintaining referential integrity through secure mapping files.

Ver fuente Repositorio: medsci-skills

Instalar en Claude Code

Copiar

git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/deidentify && cp -r /tmp/deidentify/skills/deidentify ~/.claude/skills/deidentify

Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

Definición

SKILL.md

# De-identification Skill

You are guiding a medical researcher through data de-identification. The actual
de-identification is performed by a **standalone Python script** that runs WITHOUT
any LLM. Your role is to explain, guide, and verify — not to see or process raw
PHI data.

## Critical Safety Rules

1. **NEVER ask the user to paste, show, or upload raw data containing PHI.**
   The script processes data locally. You never need to see patient-level data.
2. **NEVER read or display the mapping file contents.** It contains original PHI values.
3. **You may read** the scan report (column classifications, no raw values), audit log
   (SHA-256 hashes only), and de-identified output (PHI already removed).
4. **Always communicate in the user's preferred language** about the process, but use
   English for technical terms (PHI, HIPAA, Safe Harbor, etc.).

## Reference Files

- `${CLAUDE_SKILL_DIR}/references/hipaa_18_identifiers.md` — HIPAA Safe Harbor checklist
- `${CLAUDE_SKILL_DIR}/references/korean_phi_patterns.md` — Korean-specific regex patterns
- `${CLAUDE_SKILL_DIR}/references/date_shift_guide.md` — Date shifting best practices

Read relevant references before advising the researcher.

## Prerequisites

- Python 3.10+
- `openpyxl` (for .xlsx files): `pip install openpyxl`
- Supported formats: CSV, TSV, Excel (.xlsx)

## Five-Phase Workflow

### Phase 1: Assessment

Ask the researcher:
1. What file format is the data? (CSV, Excel, etc.)
2. What PHI do you expect in the data? (names, dates, IDs, etc.)
3. Does your IRB require specific de-identification documentation?
4. Do you need to re-identify later? (affects mapping file choice)

Based on answers, recommend the appropriate command:
- Full pipeline (most common): `python deidentify.py full <file> --locale <code>`
- Step-by-step (cautious): `python deidentify.py scan <file> --locale <code>` first

Available locale codes: `kr` (Korea), `us` (USA), `jp` (Japan), `cn` (China), `de` (Germany),
`uk` (United Kingdom), `fr` (France), `ca` (Canada), `au` (Australia), `in` (India).
If `--locale` is omitted, the script shows an interactive country selection menu.
Users can provide a custom locale file via `--locale-file custom.json`.

### Phase 2: Script Execution

Guide the researcher to run the script. The script is located at:
```
${CLAUDE_SKILL_DIR}/deidentify.py
```

**Full pipeline** (recommended for most users):
```bash
python ${CLAUDE_SKILL_DIR}/deidentify.py full data.xlsx \
    --locale kr \
    --output-dir ./deidentified/ \
    --auto-accept-safe
```

**Step-by-step** (for careful review):
```bash
# Step 1: Scan
python ${CLAUDE_SKILL_DIR}/deidentify.py scan data.xlsx --locale kr --output-dir ./deidentified/

# Step 2: Review (interactive)
python ${CLAUDE_SKILL_DIR}/deidentify.py review ./deidentified/scan_report.json

# Step 3: Apply
python ${CLAUDE_SKILL_DIR}/deidentify.py apply ./deidentified/reviewed_report.json
```

**Options:**
- `--locale CODE`: Country locale for PHI patterns (kr, us, jp, cn, de, uk, fr, ca, au, in)
- `--locale-file PATH`: Custom locale JSON file (copy `locales/_template.json` to create one)
- `--auto-accept-safe`: Skip confirmation for columns classified as SAFE (faster for large datasets)
- `--hash-mapping`: Store SHA-256 hashes instead of original values in mapping file (one-way, more secure)
- `--output-dir`: Where to save de-identified file, mapping, and audit log
- `-v/--verbose`: Enable debug logging

### Phase 3: Interactive Review Guidance

The script's terminal review has three passes:

1. **Pass 1 — Column Classification**: Each column is shown as PHI / REVIEW_NEEDED / SAFE.
   The researcher confirms or overrides each classification.
2. **Pass 2 — Undecided Items**: Columns that weren't resolved in Pass 1 get a second look
   with more sample values displayed.
3. **Pass 3 — Final Summary**: A table of all planned actions. The researcher can edit
   individual decisions before confirming.

Coach the researcher. Deliver these prompts in the researcher's preferred language:
- "Columns classified as PHI are anonymized by default. Press 'k' to keep the original value."
- "REVIEW_NEEDED are columns the script could not classify. Check the sample values and decide."
- "SAFE means no PHI detected. Press 'r' to request re-review if any column looks suspicious."

### Phase 4: Verify and Document

After the script completes, help the researcher verify:

1. **Read the audit log** (safe — contains only hashes):
   ```bash
   cat ./deidentified/audit_log.csv | head -20
   ```
   Verify the number of changes, affected columns, and PHI types.

2. **Spot-check the de-identified file** (safe — PHI already removed):
   Read a few rows to confirm pseudonyms (P0001, etc.), date shifts, and [REDACTED] markers
   appear where expected.

3. **Check that sensitive columns are actually removed**:
   Verify no original names, phone numbers, or RRN values remain.

4. **Mapping file security**:
   - Remind the researcher: "mapping.json contains original patient identifiers — treat it as restricted."
   - Recommend storing it separately from the de-identified data
   - File permissions are automatically set to 0600 (owner-only)

### Phase 5: Documentation

Generate a de-identification methods paragraph for the manuscript or IRB:

Template:
> Protected health information was removed from the dataset prior to analysis using
> a rule-based de-identification tool (deidentify.py, medsci-skills) with the [COUNTRY]
> locale pattern pack. The tool scanned column names and cell values using regex patterns
> for country-specific identifiers (e.g., national ID numbers, phone numbers), email
> addresses, dates, and addresses. Each column classification was reviewed by the
> researcher in an interactive terminal session. Names were replaced with pseudonyms
> (P0001, P0002, ...), dates were shifted by a random per-patient offset (±365 days)
> preserving relative temporal intervals, and direct identifiers (phone numbers, email
> addresses

Del mismo repositorio

skillsSkill

academic-aioSkill

Medical AI paper optimization for AI search engines (Perplexity, ChatGPT web, Elicit, Consensus, SciSpace) and RAG-based literature tools. Applies when drafting or reviewing titles, abstracts, structured summary boxes (Key Points / Research in Context / Plain-Language Summary), manuscripts for high-impact medical AI journals (Lancet Digital Health, Radiology, Radiology-AI, npj Digital Medicine, Nature Medicine), preprints (medRxiv/arXiv), GitHub README + CITATION.cff + Zenodo archives, and Hugging Face model/dataset cards. Integrates TRIPOD+AI, CLAIM 2024, STARD-AI, TRIPOD-LLM, DECIDE-AI reporting requirements with generative engine optimization (GEO) principles. Produces a visible pass/fail checklist.

add-journalSkill

analyze-statsSkill

Statistical analysis for medical research papers. Generates reproducible Python/R code with publication-ready tables and figures. Supports diagnostic accuracy, inter-rater agreement, meta-analysis, survival analysis, survey data, group comparisons, regression, propensity score, and repeated measures.

author-strategySkill

PubMed author profile analysis. Author name → PubMed fetch → study-type classification → visualization → strategy report → optional trajectory-archetype classification.

batch-cohortSkill

Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.

calc-sample-sizeSkill

check-reportingSkill

Check manuscript compliance with medical research reporting guidelines. Supports 36 guidelines including STROBE, CONSORT, CONSORT-AI, STARD, STARD-AI, TRIPOD, TRIPOD+AI, TRIPOD-LLM, ARRIVE, PRISMA, PRISMA-DTA, PRISMA-P, CARE, SPIRIT, SPIRIT-AI, CLAIM, DECIDE-AI, MI-CLEAR-LLM, SQUIRE 2.0, CLEAR, MOOSE, GRRAS, SWiM, AMSTAR 2, and risk of bias tools (QUADAS-2, QUADAS-C, RoB 2, ROBINS-I, ROBINS-E, ROBIS, ROB-ME, PROBAST, PROBAST+AI, NOS, COSMIN, RoB NMA). Generates item-by-item assessment with PRESENT/MISSING/PARTIAL status.