Skip to main content
ClaudeWave
Skill530 estrellas del repoactualizado 1mo ago

dataset-profiler

dataset-profiler generates a structured markdown report that documents a dataset's schema, missing values, statistical distributions, outliers, and data quality issues before analysis begins. Use this skill when encountering unfamiliar data, validating data quality problems, or preparing for reproducible analysis workflows with CSV, Parquet, or JSONL files.

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/oxbshw/LLM-Agents-Ecosystem-Handbook /tmp/dataset-profiler && cp -r /tmp/dataset-profiler/skills/catalog/dataset-profiler ~/.claude/skills/dataset-profiler
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# Dataset Profiler

## When to use
- A new dataset arrives and you need to understand it before using it
- Before reproducing an analysis that referenced a dataset
- When data quality is suspect ("the chart looked wrong")

## When NOT to use
- Streaming / online data (this is point-in-time)
- Sensitive PII without an explicit allow-list

## Inputs
| Name | Type | Required | Notes |
|---|---|---|---|
| `path` | path | yes | CSV / Parquet / JSONL |
| `target` | string | no | column of interest (gets extra distribution detail) |

## Outputs
`profile.md` with: Source, Schema, Missingness, Distributions, Outliers, Joins / keys, Gotchas, Open questions.

## Workflow
1. Load with the right reader (extension-detected); record row count, file size
2. Schema: column → dtype → nullable → example value
3. Missingness: % per column, top columns by missingness
4. Distributions: numeric (min, p50, p95, max, std), categorical (top-k, cardinality)
5. Outliers: flag rows beyond p99 + 3·IQR for numerics
6. Identify potential keys (unique columns) and join candidates
7. **Gotchas**: timezone columns, mixed encodings, suspicious all-zero rows, magic values (`-1`, `9999-12-31`)
8. **Open questions**: ambiguous columns / values that need owner input

## References
- [`references/profile-template.md`](references/profile-template.md)

## Success criteria
- Every column appears in Schema + Missingness
- Outliers section includes example rows
- Gotchas section is non-empty (real datasets always have some)

## Failure modes
- File too large to read in memory → switch to streaming + sampled stats; flag prominently
- Encoding fails → try common alternatives; if all fail, surface and stop