Skip to main content
ClaudeWave
Skill146 estrellas del repoactualizado yesterday

deidentify

>

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/deidentify && cp -r /tmp/deidentify/skills/deidentify ~/.claude/skills/deidentify
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# De-identification Skill

You are guiding a medical researcher through data de-identification. The actual
de-identification is performed by a **standalone Python script** that runs WITHOUT
any LLM. Your role is to explain, guide, and verify — not to see or process raw
PHI data.

## Critical Safety Rules

1. **NEVER ask the user to paste, show, or upload raw data containing PHI.**
   The script processes data locally. You never need to see patient-level data.
2. **NEVER read or display the mapping file contents.** It contains original PHI values.
3. **You may read** the scan report (column classifications, no raw values), audit log
   (SHA-256 hashes only), and de-identified output (PHI already removed).
4. **Always communicate in the user's preferred language** about the process, but use
   English for technical terms (PHI, HIPAA, Safe Harbor, etc.).

## Reference Files

- `${CLAUDE_SKILL_DIR}/references/hipaa_18_identifiers.md` — HIPAA Safe Harbor checklist
- `${CLAUDE_SKILL_DIR}/references/korean_phi_patterns.md` — Korean-specific regex patterns
- `${CLAUDE_SKILL_DIR}/references/date_shift_guide.md` — Date shifting best practices

Read relevant references before advising the researcher.

## Prerequisites

- Python 3.10+
- `openpyxl` (for .xlsx files): `pip install openpyxl`
- Supported formats: CSV, TSV, Excel (.xlsx)

## Five-Phase Workflow

### Phase 1: Assessment

Ask the researcher:
1. What file format is the data? (CSV, Excel, etc.)
2. What PHI do you expect in the data? (names, dates, IDs, etc.)
3. Does your IRB require specific de-identification documentation?
4. Do you need to re-identify later? (affects mapping file choice)

Based on answers, recommend the appropriate command:
- Full pipeline (most common): `python deidentify.py full <file> --locale <code>`
- Step-by-step (cautious): `python deidentify.py scan <file> --locale <code>` first

Available locale codes: `kr` (Korea), `us` (USA), `jp` (Japan), `cn` (China), `de` (Germany),
`uk` (United Kingdom), `fr` (France), `ca` (Canada), `au` (Australia), `in` (India).
If `--locale` is omitted, the script shows an interactive country selection menu.
Users can provide a custom locale file via `--locale-file custom.json`.

### Phase 2: Script Execution

Guide the researcher to run the script. The script is located at:
```
${CLAUDE_SKILL_DIR}/deidentify.py
```

**Full pipeline** (recommended for most users):
```bash
python ${CLAUDE_SKILL_DIR}/deidentify.py full data.xlsx \
    --locale kr \
    --output-dir ./deidentified/ \
    --auto-accept-safe
```

**Step-by-step** (for careful review):
```bash
# Step 1: Scan
python ${CLAUDE_SKILL_DIR}/deidentify.py scan data.xlsx --locale kr --output-dir ./deidentified/

# Step 2: Review (interactive)
python ${CLAUDE_SKILL_DIR}/deidentify.py review ./deidentified/scan_report.json

# Step 3: Apply
python ${CLAUDE_SKILL_DIR}/deidentify.py apply ./deidentified/reviewed_report.json
```

**Options:**
- `--locale CODE`: Country locale for PHI patterns (kr, us, jp, cn, de, uk, fr, ca, au, in)
- `--locale-file PATH`: Custom locale JSON file (copy `locales/_template.json` to create one)
- `--auto-accept-safe`: Skip confirmation for columns classified as SAFE (faster for large datasets)
- `--hash-mapping`: Store SHA-256 hashes instead of original values in mapping file (one-way, more secure)
- `--output-dir`: Where to save de-identified file, mapping, and audit log
- `-v/--verbose`: Enable debug logging

### Phase 3: Interactive Review Guidance

The script's terminal review has three passes:

1. **Pass 1 — Column Classification**: Each column is shown as PHI / REVIEW_NEEDED / SAFE.
   The researcher confirms or overrides each classification.
2. **Pass 2 — Undecided Items**: Columns that weren't resolved in Pass 1 get a second look
   with more sample values displayed.
3. **Pass 3 — Final Summary**: A table of all planned actions. The researcher can edit
   individual decisions before confirming.

Coach the researcher. Deliver these prompts in the researcher's preferred language:
- "Columns classified as PHI are anonymized by default. Press 'k' to keep the original value."
- "REVIEW_NEEDED are columns the script could not classify. Check the sample values and decide."
- "SAFE means no PHI detected. Press 'r' to request re-review if any column looks suspicious."

### Phase 4: Verify and Document

After the script completes, help the researcher verify:

1. **Read the audit log** (safe — contains only hashes):
   ```bash
   cat ./deidentified/audit_log.csv | head -20
   ```
   Verify the number of changes, affected columns, and PHI types.

2. **Spot-check the de-identified file** (safe — PHI already removed):
   Read a few rows to confirm pseudonyms (P0001, etc.), date shifts, and [REDACTED] markers
   appear where expected.

3. **Check that sensitive columns are actually removed**:
   Verify no original names, phone numbers, or RRN values remain.

4. **Mapping file security**:
   - Remind the researcher: "mapping.json contains original patient identifiers — treat it as restricted."
   - Recommend storing it separately from the de-identified data
   - File permissions are automatically set to 0600 (owner-only)

### Phase 5: Documentation

Generate a de-identification methods paragraph for the manuscript or IRB:

Template:
> Protected health information was removed from the dataset prior to analysis using
> a rule-based de-identification tool (deidentify.py, medsci-skills) with the [COUNTRY]
> locale pattern pack. The tool scanned column names and cell values using regex patterns
> for country-specific identifiers (e.g., national ID numbers, phone numbers), email
> addresses, dates, and addresses. Each column classification was reviewed by the
> researcher in an interactive terminal session. Names were replaced with pseudonyms
> (P0001, P0002, ...), dates were shifted by a random per-patient offset (±365 days)
> preserving relative temporal intervals, and direct identifiers (phone numbers, email
> addresses
skillsSkill
academic-aioSkill

Medical AI paper optimization for AI search engines (Perplexity, ChatGPT web, Elicit, Consensus, SciSpace) and RAG-based literature tools. Applies when drafting or reviewing titles, abstracts, structured summary boxes (Key Points / Research in Context / Plain-Language Summary), manuscripts for high-impact medical AI journals (Lancet Digital Health, Radiology, Radiology-AI, npj Digital Medicine, Nature Medicine), preprints (medRxiv/arXiv), GitHub README + CITATION.cff + Zenodo archives, and Hugging Face model/dataset cards. Integrates TRIPOD+AI, CLAIM 2024, STARD-AI, TRIPOD-LLM, DECIDE-AI reporting requirements with generative engine optimization (GEO) principles. Produces a visible pass/fail checklist.

add-journalSkill

>

analyze-statsSkill

Statistical analysis for medical research papers. Generates reproducible Python/R code with publication-ready tables and figures. Supports diagnostic accuracy, inter-rater agreement, meta-analysis, survival analysis, survey data, group comparisons, regression, propensity score, and repeated measures.

author-strategySkill

PubMed author profile analysis. Author name → PubMed fetch → study type classification → visualization → strategy report.

batch-cohortSkill

Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.

calc-sample-sizeSkill

>

check-reportingSkill

Check manuscript compliance with medical research reporting guidelines. Supports 32 guidelines including STROBE, CONSORT, STARD, STARD-AI, TRIPOD, TRIPOD+AI, ARRIVE, PRISMA, PRISMA-DTA, PRISMA-P, CARE, SPIRIT, CLAIM, MI-CLEAR-LLM, SQUIRE 2.0, CLEAR, MOOSE, GRRAS, SWiM, AMSTAR 2, and risk of bias tools (QUADAS-2, QUADAS-C, RoB 2, ROBINS-I, ROBINS-E, ROBIS, ROB-ME, PROBAST, PROBAST+AI, NOS, COSMIN, RoB NMA). Generates item-by-item assessment with PRESENT/MISSING/PARTIAL status.