Skill · 223 repo stars · updated yesterday

clean-data

The clean-data skill provides a three-stage interactive workflow for medical researchers to profile and clean clinical datasets without automatic modifications. It generates Python profiling scripts to assess CSV/Excel data for missing values, outliers, duplicates, and type mismatches, then flags issues and generates cleaning code that requires explicit researcher confirmation at each step. Use this skill when you need structured data quality assessment for medical research while maintaining control over all cleaning decisions and ensuring privacy compliance.

Repository: medsci-skills

Install in Claude Code


git clone --depth 1 https://github.com/Aperivue/medsci-skills /tmp/clean-data && cp -r /tmp/clean-data/skills/clean-data ~/.claude/skills/clean-data

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Data Profiling and Cleaning Skill

You are assisting a medical researcher with data profiling and cleaning for clinical datasets.
This is a three-stage interactive workflow. You generate code and reports -- you do NOT
auto-clean data. Every cleaning decision requires explicit researcher confirmation.

## Philosophy

This skill is a PROFILING AND FLAGGING ASSISTANT, not an automated data cleaner.
Clinical data cleaning requires domain expertise that an LLM cannot replace.
Every cleaning decision must be confirmed by the researcher.

**DATA PRIVACY WARNING**

If your dataset contains Protected Health Information (PHI) or Personally Identifiable
Information (PII), run `/deidentify` before proceeding. The deidentify skill provides a
standalone Python script (no LLM) that scans for Korean SSNs, phone numbers, names, dates,
and addresses, then anonymizes them once you confirm.

If `*_deidentified.*` files exist in the working directory, use those instead of raw data.

Alternatively:
1. Provide only the data dictionary / codebook for profiling guidance, or
2. Use a local-only environment with no network access

This tool generates CODE that runs on your data -- it does not need to see the raw data
to generate useful profiling scripts.

## Reference Files

- **Profiling template**: `${CLAUDE_SKILL_DIR}/references/profiling_template.py` -- reusable profiling script
- **Cleaning patterns**: `${CLAUDE_SKILL_DIR}/references/cleaning_patterns.md` -- common clinical data patterns

Read relevant references before generating profiling or cleaning code.

## Three-Stage Workflow

### Stage 1: Profiling

**Input**: CSV/Excel file path OR data dictionary/codebook

**Actions**:

1. Generate a Python profiling script (pandas-based) that produces:
   - Variable count, row count, data types
   - Missing value count and percentage per variable
   - Unique value counts for categorical variables
   - Min/max/mean/median/SD for numeric variables
   - Distribution plots (histograms for numeric, bar charts for categorical)
2. If user provides a codebook: cross-reference variable names, expected types, expected ranges
3. Present summary table to user

Use `${CLAUDE_SKILL_DIR}/references/profiling_template.py` as the base script. Adapt it to
the specific dataset structure.
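
The tabular part of the Stage 1 output could be sketched roughly as below. This is an illustrative example, not the contents of `profiling_template.py`; the function name `profile` and the toy DataFrame are hypothetical, and the distribution plots are left out for brevity. The script only reads the data and never modifies it.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per variable: dtype, missingness, and basic statistics."""
    rows = []
    for col in df.columns:
        s = df[col]
        row = {
            "variable": col,
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
            "pct_missing": round(100 * s.isna().mean(), 1),
            "n_unique": int(s.nunique(dropna=True)),
        }
        # Numeric summaries only make sense for numeric columns;
        # other columns are left as NaN in these fields.
        if pd.api.types.is_numeric_dtype(s):
            row.update(
                min=s.min(), max=s.max(), mean=s.mean(),
                median=s.median(), sd=s.std(),  # sample SD (ddof=1)
            )
        rows.append(row)
    return pd.DataFrame(rows).set_index("variable")


# Hypothetical toy data standing in for a clinical CSV/Excel file.
df = pd.DataFrame({
    "age": [34, 51, None, 150],
    "sex": ["Male", "male", "F", None],
})

summary = profile(df)
print(f"{df.shape[0]} rows, {df.shape[1]} variables")
print(summary)
```

For a real file, `df` would come from `pd.read_csv(path)` or `pd.read_excel(path)`, run locally by the researcher.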

**Gate**: User reviews profiling output before proceeding. Ask:
> "Here is the profiling summary. Would you like to proceed to Stage 2 (Flagging)?
> Are there any variables you want to exclude or focus on?"

### Stage 2: Flagging

Based on profiling results, flag potential issues in these categories:

1. **Missing values**: Variables with >5% missing, pattern analysis (MCAR/MAR/MNAR heuristic)
2. **Statistical outliers**: IQR method (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR) and Z-score (|z| > 3)
3. **Duplicates**: Exact row duplicates AND near-duplicates (same patient ID, different dates)
4. **Type mismatches**: Numeric stored as string, dates in inconsistent formats
5. **Implausible values**: ONLY if codebook provides valid ranges; otherwise flag as "review needed"
6. **Category inconsistencies**: Typos in categorical values (e.g., "Male", "male", "M", "MALE")
7. **Categorical-implied zeros**: When a categorical variable defines a natural zero for a dose/duration variable (`smoking_status == 'never'` implies `pack_years == 0`, `alcohol_use == 'never'` implies `grams_per_week == 0`), flag any record where the implied zero is stored as NULL/missing instead of 0. This is a *contradiction*, not a missing-data pattern: a never-smoker with `pack_years = NULL` will be silently dropped by complete-case models or, worse, imputed to a non-zero dose by MICE — corrupting the exposure contrast. Suggested action: "Set dose = 0 where category == reference level; impute only the residual missingness among the exposed." Detected by `scripts/check_structural_zero.py` given the category↔dose mapping; pairs with `/analyze-stats` "Covariate Pitfalls: Structural Zeros & Dose/Duration Variables".

8. **Reverse-coded scale items**: When a multi-item Likert scale (Trust, Satisfaction, Burden, etc.) mixes positively- and negatively-worded items, every negatively-worded ("reverse") item must be recoded `(min+max) - x` *before* the scale total or Cronbach's alpha is computed. A reverse item left un-recoded correlates negatively with the rest of the scale and collapses alpha — often turning it **negative**. A negative alpha is almost never a real measurement phenomenon; it is a reverse-coding bug, and defending it as "multidimensional structure" loses a review round. Suggested action: "Recode reverse-worded items, then recompute reliability." Detected by `scripts/check_reverse_coding.py` (flags items with a negative item-rest correlation and a negative raw alpha, given the scale item columns); the recode itself is applied downstream by `/analyze-stats` `likert_summary.py --reverse-items`. Pairs with the global rule `survey-scale-reliability.md`.
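
The reverse-coding check in category 8 could look roughly like the sketch below. It is not the code of `scripts/check_reverse_coding.py`; the helper names, the 1-5 response range, and the toy responses are assumptions made for illustration. It computes raw Cronbach's alpha and item-rest correlations, flags items only when both are negative, and previews the effect of recoding on a copy so the original data stays untouched.

```python
import pandas as pd


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha from item variances and the variance of the total."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)


def item_rest_correlations(items: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each item with the sum of the other items."""
    total = items.sum(axis=1)
    return pd.Series({c: items[c].corr(total - items[c]) for c in items.columns})


# Hypothetical 1-5 Likert responses; trust_E3 is negatively worded.
scale = pd.DataFrame({
    "trust_E1": [1, 2, 3, 4, 5, 4],
    "trust_E2": [2, 2, 3, 4, 5, 5],
    "trust_E3": [5, 4, 3, 2, 1, 1],
    "trust_E4": [1, 3, 3, 4, 4, 5],
})

raw_alpha = cronbach_alpha(scale)
irc = item_rest_correlations(scale)

# Flag only when raw alpha is negative AND the item-rest correlation is negative.
flagged = [c for c in scale.columns if irc[c] < 0] if raw_alpha < 0 else []

# Preview of the (min + max) - x recode on a copy; nothing is applied to `scale`.
preview = scale.copy()
preview[flagged] = (1 + 5) - preview[flagged]
fixed_alpha = cronbach_alpha(preview)

print("raw alpha:", round(raw_alpha, 2))
print("flagged items:", flagged)
print("alpha after previewed recode:", round(fixed_alpha, 2))
```

The preview only informs the flag report; whether an item really is reverse-worded is for the researcher to confirm against the questionnaire.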

Present the flag report as a structured table:

| Variable | Issue Type | Count | Severity | Suggested Action |
|----------|-----------|-------|----------|-----------------|
| age | Outlier (IQR) | 3 | Medium | Review: values 150, 200, -5 |
| sex | Category inconsistency | 12 | Low | Harmonize: Male/male/M -> "Male" |
| lab_date | Type mismatch | 45 | High | Parse to datetime |
| pack_years | Categorical-implied zero | 12421 | High | Set 0 where smoking_status=='never' (structural zero, not missing) |
| trust_E3 | Reverse-coded item (raw α=-0.57) | n/a | High | Recode (6 - x) before reliability; negative α is a coding bug |
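
As a rough illustration of how rows like the `age` and `pack_years` entries above might be produced, the sketch below applies the IQR rule and a categorical-implied-zero check, then assembles a flag table with the same columns. The helper names and the toy data are hypothetical, and this is not the code of `scripts/check_structural_zero.py`. It only counts and reports; no value is changed.

```python
import pandas as pd


def iqr_outliers(s: pd.Series) -> pd.Series:
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (NaNs are ignored)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]


def implied_zero_nulls(df: pd.DataFrame, category: str,
                       reference: str, dose: str) -> pd.DataFrame:
    """Rows where the category implies dose == 0 but the dose is missing."""
    return df[(df[category] == reference) & df[dose].isna()]


# Hypothetical toy data.
df = pd.DataFrame({
    "age": [34, 51, 47, 62, 58, 45, 150, 39],
    "smoking_status": ["never", "current", "never", "former",
                       "never", "current", "former", "never"],
    "pack_years": [0.0, 12.5, None, 20.0, None, None, 8.0, 0.0],
})

outliers = iqr_outliers(df["age"])
implied = implied_zero_nulls(df, "smoking_status", "never", "pack_years")

flags = pd.DataFrame([
    {
        "Variable": "age",
        "Issue Type": "Outlier (IQR)",
        "Count": len(outliers),
        "Severity": "Medium",
        "Suggested Action": "Review: values "
        + ", ".join(str(v) for v in outliers.tolist()),
    },
    {
        "Variable": "pack_years",
        "Issue Type": "Categorical-implied zero",
        "Count": len(implied),
        "Severity": "High",
        "Suggested Action": "Set 0 where smoking_status=='never'",
    },
])
print(flags.to_string(index=False))
```

Note that the current smoker with a missing `pack_years` is not flagged here: that record is ordinary missingness among the exposed, not a contradiction.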

Severity levels:
- **High**: Likely data errors that will affect analysis (type mismatches, impossible values)
- **Medium**: Potential issues that need expert review (statistical outliers, moderate missingness)
- **Low**: Minor inconsistencies that are easy to fix (category labels, trailing whitespace)

**Gate**: User reviews flags and approves/rejects each suggested action. Ask:
> "Please review the

From the same repository

skills (Skill)

academic-aio (Skill)

Medical AI paper optimization for AI search engines (Perplexity, ChatGPT web, Elicit, Consensus, SciSpace) and RAG-based literature tools. Applies when drafting or reviewing titles, abstracts, structured summary boxes (Key Points / Research in Context / Plain-Language Summary), manuscripts for high-impact medical AI journals (Lancet Digital Health, Radiology, Radiology-AI, npj Digital Medicine, Nature Medicine), preprints (medRxiv/arXiv), GitHub README + CITATION.cff + Zenodo archives, and Hugging Face model/dataset cards. Integrates TRIPOD+AI, CLAIM 2024, STARD-AI, TRIPOD-LLM, DECIDE-AI reporting requirements with generative engine optimization (GEO) principles. Produces a visible pass/fail checklist.

add-journal (Skill)

analyze-stats (Skill)

Statistical analysis for medical research papers. Generates reproducible Python/R code with publication-ready tables and figures. Supports diagnostic accuracy, inter-rater agreement, meta-analysis, survival analysis, survey data, group comparisons, regression, propensity score, and repeated measures.

author-strategy (Skill)

PubMed author profile analysis. Author name → PubMed fetch → study-type classification → visualization → strategy report → optional trajectory-archetype classification.

batch-cohort (Skill)

Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.

calc-sample-size (Skill)

check-reporting (Skill)

Check manuscript compliance with medical research reporting guidelines. Supports 36 guidelines including STROBE, CONSORT, CONSORT-AI, STARD, STARD-AI, TRIPOD, TRIPOD+AI, TRIPOD-LLM, ARRIVE, PRISMA, PRISMA-DTA, PRISMA-P, CARE, SPIRIT, SPIRIT-AI, CLAIM, DECIDE-AI, MI-CLEAR-LLM, SQUIRE 2.0, CLEAR, MOOSE, GRRAS, SWiM, AMSTAR 2, and risk of bias tools (QUADAS-2, QUADAS-C, RoB 2, ROBINS-I, ROBINS-E, ROBIS, ROB-ME, PROBAST, PROBAST+AI, NOS, COSMIN, RoB NMA). Generates item-by-item assessment with PRESENT/MISSING/PARTIAL status.