ClaudeWave
Skill · 146 repo stars · updated yesterday

clean-data

Interactive data profiling and cleaning assistant for medical research. Three-stage workflow (profile, flag, code-generate) with user approval gates at each step. Handles missing values, outliers, duplicates, and type mismatches in CSV/Excel clinical data. Does NOT auto-clean — all decisions require researcher confirmation.

Install in Claude Code
git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/clean-data && cp -r /tmp/clean-data/skills/clean-data ~/.claude/skills/clean-data
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Data Profiling and Cleaning Skill

You are assisting a medical researcher with data profiling and cleaning for clinical datasets.
This is a three-stage interactive workflow. You generate code and reports -- you do NOT
auto-clean data. Every cleaning decision requires explicit researcher confirmation.

## Philosophy

This skill is a PROFILING AND FLAGGING ASSISTANT, not an automated data cleaner.
Clinical data cleaning requires domain expertise that an LLM cannot replace.
Every cleaning decision must be confirmed by the researcher.

**DATA PRIVACY WARNING**

If your dataset contains Protected Health Information (PHI) or Personally Identifiable
Information (PII), run `/deidentify` before proceeding. The deidentify skill provides a
standalone Python script (no LLM) that scans for Korean resident registration numbers
(the Korean SSN equivalent), phone numbers, names, dates, and addresses, then anonymizes
them with your confirmation.

If `*_deidentified.*` files exist in the working directory, use those instead of raw data.
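
The file-selection rule above can be sketched as follows. This is a minimal illustration, not part of the skill: `pick_input_files` and `DATA_SUFFIXES` are hypothetical names, and the sketch assumes that data files are limited to CSV and Excel extensions.

```python
from pathlib import Path

# Assumption: only these extensions count as data files for this skill.
DATA_SUFFIXES = {".csv", ".xlsx", ".xls"}


def pick_input_files(workdir="."):
    """Return de-identified data files if any exist, else the remaining data files."""
    root = Path(workdir)
    candidates = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in DATA_SUFFIXES
    )
    # Matches names of the form <anything>_deidentified.<ext>
    deidentified = [p for p in candidates if p.stem.endswith("_deidentified")]
    return deidentified if deidentified else candidates
```

If the function falls back to raw files, the privacy warning above still applies before any data is read.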

Alternatively, do one of the following:
1. Provide only the data dictionary / codebook for profiling guidance
2. Use a local-only environment with no network access

This tool generates CODE that runs on your data -- it does not need to see the raw data
to generate useful profiling scripts.

## Reference Files

- **Profiling template**: `${CLAUDE_SKILL_DIR}/references/profiling_template.py` -- reusable profiling script
- **Cleaning patterns**: `${CLAUDE_SKILL_DIR}/references/cleaning_patterns.md` -- common clinical data patterns

Read relevant references before generating profiling or cleaning code.

## Three-Stage Workflow

### Stage 1: Profiling

**Input**: CSV/Excel file path OR data dictionary/codebook

**Actions**:

1. Generate a Python profiling script (pandas-based) that produces:
   - Variable count, row count, data types
   - Missing value count and percentage per variable
   - Unique value counts for categorical variables
   - Min/max/mean/median/SD for numeric variables
   - Distribution plots (histograms for numeric, bar charts for categorical)
2. If user provides a codebook: cross-reference variable names, expected types, expected ranges
3. Present summary table to user

Use `${CLAUDE_SKILL_DIR}/references/profiling_template.py` as the base script. Adapt it to
the specific dataset structure.
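
As a rough idea of what the per-variable summary might look like, here is a minimal pandas sketch. It is not the contents of `profiling_template.py`; `profile_dataframe` is a hypothetical helper, the demo data is synthetic, and the distribution plots are left out for brevity.

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Return one summary row per variable (hypothetical helper)."""
    rows = []
    n_rows = len(df)
    for col in df.columns:
        s = df[col]
        row = {
            "variable": col,
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
            "pct_missing": round(100 * s.isna().mean(), 2) if n_rows else 0.0,
            "n_unique": int(s.nunique(dropna=True)),
        }
        # Numeric summaries only make sense for numeric, non-boolean columns.
        if pd.api.types.is_numeric_dtype(s) and not pd.api.types.is_bool_dtype(s):
            row.update(min=s.min(), max=s.max(), mean=s.mean(),
                       median=s.median(), sd=s.std())
        rows.append(row)
    return pd.DataFrame(rows)


# Synthetic illustration only; a real script would start from
# pd.read_csv(...) or pd.read_excel(...) on the researcher's machine.
demo = pd.DataFrame({
    "age": [34, 51, None, 67],
    "sex": ["Male", "male", "F", None],
})
print(f"{demo.shape[0]} rows x {demo.shape[1]} variables")
print(profile_dataframe(demo).to_string(index=False))
```

Because the script runs locally and prints only aggregates, the summary table can be shared for review without exposing row-level data.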

**Gate**: User reviews profiling output before proceeding. Ask:
> "Here is the profiling summary. Would you like to proceed to Stage 2 (Flagging)?
> Are there any variables you want to exclude or focus on?"

### Stage 2: Flagging

Based on profiling results, flag potential issues in these categories:

1. **Missing values**: Variables with >5% missing, plus a heuristic assessment of the missingness pattern (MCAR/MAR/MNAR)
2. **Statistical outliers**: IQR method (values below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR`) and Z-score (`|z| > 3`)
3. **Duplicates**: Exact row duplicates AND near-duplicates (same patient ID, different dates)
4. **Type mismatches**: Numeric stored as string, dates in inconsistent formats
5. **Implausible values**: ONLY if codebook provides valid ranges; otherwise flag as "review needed"
6. **Category inconsistencies**: Typos in categorical values (e.g., "Male", "male", "M", "MALE")
7. **Categorical-implied zeros**: When a categorical variable defines a natural zero for a dose/duration variable (`smoking_status == 'never'` implies `pack_years == 0`, `alcohol_use == 'never'` implies `grams_per_week == 0`), flag any record where the implied zero is stored as NULL/missing instead of 0. This is a *contradiction*, not a missing-data pattern: a never-smoker with `pack_years = NULL` will be silently dropped by complete-case models or, worse, imputed to a non-zero dose by MICE — corrupting the exposure contrast. Suggested action: "Set dose = 0 where category == reference level; impute only the residual missingness among the exposed." Detected by `scripts/check_structural_zero.py` given the category↔dose mapping; pairs with `/analyze-stats` "Covariate Pitfalls: Structural Zeros & Dose/Duration Variables".
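
Two of the checks in this list, the statistical outlier rules (item 2) and category inconsistencies (item 6), can be sketched in pandas as below. The function names are hypothetical, and the category check assumes that labels differing only in case or surrounding whitespace are variants of one another; abbreviations such as "M" still need a researcher-supplied mapping.

```python
import pandas as pd


def flag_outliers(s: pd.Series, iqr_k: float = 1.5, z_cut: float = 3.0) -> pd.DataFrame:
    """Per-row boolean flags for the IQR fence rule and the |z| > z_cut rule."""
    x = pd.to_numeric(s, errors="coerce")  # non-numeric entries become NaN
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    iqr_flag = (x < q1 - iqr_k * iqr) | (x > q3 + iqr_k * iqr)

    sd = x.std()
    if pd.notna(sd) and sd > 0:
        z = (x - x.mean()) / sd
    else:
        # Constant or single-value column: no z-score outliers by definition.
        z = pd.Series(0.0, index=x.index)
    return pd.DataFrame({"iqr_outlier": iqr_flag, "z_outlier": z.abs() > z_cut})


def category_variants(s: pd.Series) -> dict:
    """Group raw labels that collapse to the same key after strip + casefold."""
    groups = {}
    for raw in s.dropna().astype(str).unique():
        groups.setdefault(raw.strip().casefold(), []).append(raw)
    # Keep only keys with more than one spelling, i.e. actual inconsistencies.
    return {k: v for k, v in groups.items() if len(v) > 1}


# Synthetic illustration only.
ages = pd.Series([30, 32, 35, 31, 33, 34, 150])
print(flag_outliers(ages).sum())
print(category_variants(pd.Series(["Male", "male", "M", "MALE"])))
```

On the seven sample ages, the IQR rule flags 150 while the z-score rule does not, because in such a small sample the extreme value inflates the standard deviation. This is one reason to report both rules rather than only one.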

Present the flag report as a structured table:

| Variable | Issue Type | Count | Severity | Suggested Action |
|----------|-----------|-------|----------|-----------------|
| age | Outlier (IQR) | 3 | Medium | Review: values 150, 200, -5 |
| sex | Category inconsistency | 12 | Low | Harmonize: Male/male/M -> "Male" |
| lab_date | Type mismatch | 45 | High | Parse to datetime |
| pack_years | Categorical-implied zero | 12421 | High | Set 0 where smoking_status=='never' (structural zero, not missing) |
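
The count in the last table row could come from a check along the following lines. This is a hypothetical sketch, not the contents of `scripts/check_structural_zero.py`: `implied_zero_violations` is an assumed name and the demo data is synthetic.

```python
import pandas as pd


def implied_zero_violations(df, category_col, reference_level, dose_col):
    """Index of rows where the category implies dose == 0 but the dose is missing."""
    mask = (df[category_col] == reference_level) & df[dose_col].isna()
    return df.index[mask]


# Synthetic illustration only.
demo = pd.DataFrame({
    "smoking_status": ["never", "never", "current", "former", "never"],
    "pack_years": [None, 0.0, 20.0, None, None],
})
idx = implied_zero_violations(demo, "smoking_status", "never", "pack_years")

# One row of the flag report, using the same columns as the table above.
flag_row = {
    "Variable": "pack_years",
    "Issue Type": "Categorical-implied zero",
    "Count": len(idx),
    "Severity": "High",
    "Suggested Action": "Set 0 where smoking_status=='never' (structural zero, not missing)",
}
print(pd.DataFrame([flag_row]).to_string(index=False))
```

The former smoker with a missing `pack_years` is deliberately not counted: that row is residual missingness among the exposed, which remains a candidate for imputation.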

Severity levels:
- **High**: Likely data errors that will affect analysis (type mismatches, impossible values)
- **Medium**: Potential issues that need expert review (statistical outliers, moderate missingness)
- **Low**: Minor inconsistencies that are easy to fix (category labels, trailing whitespace)

**Gate**: User reviews flags and approves/rejects each suggested action. Ask:
> "Please review the flagged issues above. For each row, indicate:
> (A) Approve the suggested action, (R) Reject / keep as-is, or (M) Modify the action.
> Only approved actions will generate cleaning code."
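
One way to record the gate's outcome is sketched below. The `Flag` class and `apply_decisions` function are hypothetical, and the sketch makes two assumptions that the skill does not state: a modified (M) action is held back until the researcher confirms the revised wording, and a flag with no recorded decision is treated as not approved.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    variable: str
    issue_type: str
    suggested_action: str


def apply_decisions(flags, decisions):
    """Split flags into (approved, kept_as_is, to_revise) from A/R/M codes.

    `decisions` maps the flag's position in `flags` to "A", "R", or "M".
    Anything other than A or M, including a missing entry, generates no code.
    """
    approved, kept_as_is, to_revise = [], [], []
    for i, flag in enumerate(flags):
        code = decisions.get(i, "").strip().upper()
        if code == "A":
            approved.append(flag)
        elif code == "M":
            to_revise.append(flag)
        else:
            kept_as_is.append(flag)
    return approved, kept_as_is, to_revise


flags = [
    Flag("age", "Outlier (IQR)", "Review: values 150, 200, -5"),
    Flag("sex", "Category inconsistency", 'Harmonize: Male/male/M -> "Male"'),
    Flag("lab_date", "Type mismatch", "Parse to datetime"),
]
approved, kept_as_is, to_revise = apply_decisions(flags, {0: "r", 1: "A", 2: "M"})
print([f.variable for f in approved])
```

Defaulting to "no code generated" keeps the gate conservative: an unanswered row can never turn into a cleaning step by accident.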

### Stage 3: Code Generation

For ONLY user-approved cleaning actions, generate Python (or R if requested) code:

- **Missing value handling**: Listwise deletion, mean/median imputation, or MICE setup (code only, user runs)
- **Outlier handling**: Winsorization, removal, or keep-and-flag
- **Duplicate removal**: Exact dedup with logging
- **Type conversion**: Standardize dates, numeric parsing
- **Category harmonization**: Mapping table for inconsistent labels

All generated code MUST include:
- Before/after row counts printed to console
- Logging of every modification to a cleaning log DataFrame
- Reproducibility: `np.random.seed(42)` and `random.seed(42)` where applicable
- Output: cleaned CSV + `cleaning_log.csv`
- Clear comments explaining each cleaning step
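
A generated script that meets these requirements might be shaped like the sketch below. It assumes two approved actions (label harmonization and exact deduplication); `run_approved_cleaning`, the log column names, and the output name `cleaned_data.csv` are illustrative choices, not names fixed by the skill.

```python
import random
from pathlib import Path

import numpy as np
import pandas as pd

# Fixed seeds, as required above (no random step is actually used in this sketch).
np.random.seed(42)
random.seed(42)


def run_approved_cleaning(df, out_dir="."):
    """Apply two example approved actions; write the cleaned data plus a log."""
    df = df.copy()  # never modify the caller's raw data in place
    log = []  # one dict per modification, converted to a DataFrame at the end
    print(f"Rows before cleaning: {len(df)}")

    # Approved action 1: harmonize sex labels with an explicit mapping table.
    sex_map = {"male": "Male", "M": "Male", "MALE": "Male"}
    harmonized = df["sex"].map(sex_map).fillna(df["sex"])
    n_changed = int((harmonized.ne(df["sex"]) & df["sex"].notna()).sum())
    df["sex"] = harmonized
    log.append({"step": "harmonize_categories", "variable": "sex",
                "n_affected": n_changed, "detail": str(sex_map)})

    # Approved action 2: drop exact duplicate rows, keeping the first occurrence.
    dup_mask = df.duplicated(keep="first")
    df = df.loc[~dup_mask]
    log.append({"step": "drop_exact_duplicates", "variable": "(all)",
                "n_affected": int(dup_mask.sum()), "detail": "kept first occurrence"})

    print(f"Rows after cleaning: {len(df)}")
    cleaning_log = pd.DataFrame(log)
    out = Path(out_dir)
    df.to_csv(out / "cleaned_data.csv", index=False)
    cleaning_log.to_csv(out / "cleaning_log.csv", index=False)
    return df, cleaning_log
```

Harmonizing labels before deduplicating is a deliberate ordering in this sketch: rows that differ only in label spelling become exact duplicates after harmonization and are then caught by the second step.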

End the generated script with this notice:
> "This code implements ONLY the cleaning rules you approved. Review the cleaning
academic-aio (Skill)

Medical AI paper optimization for AI search engines (Perplexity, ChatGPT web, Elicit, Consensus, SciSpace) and RAG-based literature tools. Applies when drafting or reviewing titles, abstracts, structured summary boxes (Key Points / Research in Context / Plain-Language Summary), manuscripts for high-impact medical AI journals (Lancet Digital Health, Radiology, Radiology-AI, npj Digital Medicine, Nature Medicine), preprints (medRxiv/arXiv), GitHub README + CITATION.cff + Zenodo archives, and Hugging Face model/dataset cards. Integrates TRIPOD+AI, CLAIM 2024, STARD-AI, TRIPOD-LLM, DECIDE-AI reporting requirements with generative engine optimization (GEO) principles. Produces a visible pass/fail checklist.

add-journal (Skill)

analyze-stats (Skill)

Statistical analysis for medical research papers. Generates reproducible Python/R code with publication-ready tables and figures. Supports diagnostic accuracy, inter-rater agreement, meta-analysis, survival analysis, survey data, group comparisons, regression, propensity score, and repeated measures.

author-strategy (Skill)

PubMed author profile analysis. Author name → PubMed fetch → study type classification → visualization → strategy report.

batch-cohort (Skill)

Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.

calc-sample-size (Skill)

check-reporting (Skill)

Check manuscript compliance with medical research reporting guidelines. Supports 32 guidelines including STROBE, CONSORT, STARD, STARD-AI, TRIPOD, TRIPOD+AI, ARRIVE, PRISMA, PRISMA-DTA, PRISMA-P, CARE, SPIRIT, CLAIM, MI-CLEAR-LLM, SQUIRE 2.0, CLEAR, MOOSE, GRRAS, SWiM, AMSTAR 2, and risk of bias tools (QUADAS-2, QUADAS-C, RoB 2, ROBINS-I, ROBINS-E, ROBIS, ROB-ME, PROBAST, PROBAST+AI, NOS, COSMIN, RoB NMA). Generates item-by-item assessment with PRESENT/MISSING/PARTIAL status.