dataset-profiler
dataset-profiler generates a structured markdown report that documents a dataset's schema, missing values, statistical distributions, outliers, and data quality issues before analysis begins. Use this skill when encountering unfamiliar data, validating data quality problems, or preparing for reproducible analysis workflows with CSV, Parquet, or JSONL files.
git clone --depth 1 https://github.com/oxbshw/LLM-Agents-Ecosystem-Handbook /tmp/dataset-profiler && cp -r /tmp/dataset-profiler/skills/catalog/dataset-profiler ~/.claude/skills/dataset-profilerSKILL.md
# Dataset Profiler
## When to use
- A new dataset arrives and you need to understand it before using it
- Before reproducing an analysis that referenced a dataset
- When data quality is suspect ("the chart looked wrong")
## When NOT to use
- Streaming / online data (this is point-in-time)
- Sensitive PII without an explicit allow-list
## Inputs
| Name | Type | Required | Notes |
|---|---|---|---|
| `path` | path | yes | CSV / Parquet / JSONL |
| `target` | string | no | column of interest (gets extra distribution detail) |
## Outputs
`profile.md` with: Source, Schema, Missingness, Distributions, Outliers, Joins / keys, Gotchas, Open questions.
## Workflow
1. Load with the right reader (extension-detected); record row count, file size
2. Schema: column → dtype → nullable → example value
3. Missingness: % per column, top columns by missingness
4. Distributions: numeric (min, p50, p95, max, std), categorical (top-k, cardinality)
5. Outliers: flag rows beyond p99 + 3·IQR for numerics
6. Identify potential keys (unique columns) and join candidates
7. **Gotchas**: timezone columns, mixed encodings, suspicious all-zero rows, magic values (`-1`, `9999-12-31`)
8. **Open questions**: ambiguous columns / values that need owner input
## References
- [`references/profile-template.md`](references/profile-template.md)
## Success criteria
- Every column appears in Schema + Missingness
- Outliers section includes example rows
- Gotchas section is non-empty (real datasets always have some)
## Failure modes
- File too large to read in memory → switch to streaming + sampled stats; flag prominently
- Encoding fails → try common alternatives; if all fail, surface and stopUse when capturing an architecture decision so it survives turnover — produces an ADR-NNNN.md from context, options considered, and the chosen path.
Use when reviewing a proposed REST or GraphQL API change before merge — checks contract clarity, backwards compatibility, errors, pagination, auth, and naming.
Use after an incident is resolved — drafts a blameless postmortem from timeline notes, alerts, and chat threads.
Use when opening a PR — produces a clean PR description (what / why / how to verify / risks) from a branch diff against base.
Use when planning the next sprint — turns ticket intake + team capacity into a planned sprint with explicit non-goals.
Use after a session to promote useful episodic notes from logs/episodic/ into distilled, dated entries in MEMORY.md and memory/semantic/.
Use before connecting a new MCP server to your agent — produces a structured security review covering source, permissions, tools, network, and approvals.