datachain-knowledge
The datachain-knowledge skill maintains a persistent knowledge base at `dc-knowledge/` for tracking datasets, storage operations, and data pipelines across S3, GCS, Azure Blob, and local storage. Use this skill whenever creating datasets, saving pipeline outputs, running data scripts, or referencing cloud storage buckets. It follows the four-layer CAST methodology and enforces strict rules around paths, authentication, and DataChain API usage.
git clone --depth 1 https://github.com/datachain-ai/datachain /tmp/datachain-knowledge && cp -r /tmp/datachain-knowledge/src/datachain/skill/knowledge ~/.claude/skills/datachain-knowledge
SKILL.md
Maintain a knowledge base at `dc-knowledge/`. `.md` files are the persistent
output. `.json` files are intermediate (generated in Step 3, consumed in
Step 4, then deleted).
`CAST.md` (sibling to this file) is the canonical methodology — the four
layers, naming + tagging, layer-ladder planning, calibration, dialogue
template, reuse rules, methodology transmission. Mode B reads it in full
as a precondition. When something methodology-related needs to change,
change `CAST.md`, not this file.
## Critical Rules
`CAST.md` §6 owns the CAST-doctrine rules (follow CAST, never bypass
DataChain, C/A/S substrate mandatory, one script per stage, one
`.save()` per script). The rules below are operational additions unique
to this skill.
1. **Path is `dc-knowledge/`** — NOT `.datachain/`. The `.datachain/` directory is the internal database; the knowledge base lives at `dc-knowledge/`.
2. **Never pass `update=True`** to `dc.read_storage()` in Task or exploration code unless the user explicitly asks to refresh the listing. L1/L2/L3 build scripts are the exception (`CAST.md` §5).
3. **Prefer DataChain operations** over plain Python for all metadata analysis.
4. **Bounded output** — JSON and markdown files stay small regardless of data size.
5. **Stop on auth/connection errors** — `bucket_scan.py` runs a fast access check. If it exits with an error JSON on stderr, **stop immediately** and show the error to the user. Do not retry with different regions, profiles, or endpoints — ask for the missing credentials.
6. **Follow the enrichment prompt template literally** in Step 4. Downstream tooling (`render_index.py`, `cast_layer` resolution) parses the exact frontmatter the prompt prescribes.
## Common gotchas in UDF scripts
- **`parallel=N` vs `workers=N`.** `parallel=N` is local multiprocessing (works anywhere). `workers=N` is Studio-only and MUST be guarded: always set `chain = chain.settings(parallel=N)`, and add `chain = chain.settings(workers=N)` only inside an `if dc.is_studio():` block.
- **No `from __future__ import annotations` in UDF modules.** It stringifies type hints and DataChain's signal-schema resolution rejects the string-vs-class mismatch.
- **Type the UDF return precisely.** `Iterator[object]` / `Iterator[Any]` / bare `dict` fail schema resolution. Return a specific `Iterator[T]`, a Pydantic `BaseModel`, or a primitive.
- **Generators aren't subscriptable.** Iterators returned by file APIs do not support `[:N]`. Use `enumerate` + `break`, or `list(...)` only when the result is genuinely small.
- **Use `datachain.__version__` to get the package version** (e.g. `dc.__version__`).
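The subscripting gotcha above can be reproduced with the standard library alone. The sketch below is illustrative only: `file_names` is a hypothetical stand-in for a lazy file-listing API, not a DataChain call, and `itertools.islice` is shown as a stdlib alternative to the `enumerate` + `break` pattern.

```python
from itertools import islice
from typing import Iterator


def file_names() -> Iterator[str]:
    """Hypothetical lazy listing: yields names one at a time."""
    for i in range(1_000_000):
        yield f"file_{i}.txt"


def first_n(items: Iterator[str], n: int) -> list[str]:
    """Take the first n items using enumerate + break."""
    taken: list[str] = []
    for i, item in enumerate(items):
        if i >= n:
            break
        taken.append(item)
    return taken


# Slicing a generator raises TypeError, which is the gotcha itself.
try:
    file_names()[:3]  # type: ignore[index]
    slicing_failed = False
except TypeError:
    slicing_failed = True

sample = first_n(file_names(), 3)
same_sample = list(islice(file_names(), 3))  # stdlib alternative
```

Both approaches stop after a handful of items instead of materializing the whole listing, which is why `list(...)` on the full iterator is reserved for results known to be small.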
---
## Workflow Mode Detection
**Mode A — Discovery/Exploration** (e.g., "what datasets exist", "show schema", "explore bucket"):
→ If the user references a specific bucket URI, run **Step 1** (Bucket Enlistment) for its root first.
→ Then run Steps 2–7.
**Mode B — Dataset Creation/Pipeline** (e.g., "create dataset X from ...", "process files and save"):
> **Precondition (do this FIRST — before ANY tool call):**
>
> $ cat dc-knowledge/index.md
> $ cat {skill_dir}/CAST.md
>
> If `index.md` exists and the task can be solved by reading an existing
> dataset, do not write a pipeline — read it directly with
> `dc.read_dataset("name")` and filter/merge/extend from there. This avoids
> recomputing expensive operations.
>
> `CAST.md` drives every layer / scope / shape decision. Re-read on each
> new task so the layer-ladder walk and dialogue template are in working
> context when you plan.
>
> **Never parse files under `dc-knowledge/datasets/*.json` or
> `dc-knowledge/buckets/**/*.json` directly** — those are pre-render
> intermediates that get deleted. The information you need is in `index.md`.
>
> If `dc-knowledge/index.md` does not exist, proceed with Steps 1–7 to build it.
→ **If the pipeline reads from a bucket**, run **Step 1** (Bucket Enlistment) for the bucket root first.
→ **Run the access check** (if not already done in Step 1): `datachain bucket status <uri>`. If `not found` / `denied`, stop and ask for credentials.
→ Read `{skill_dir}/../core/SKILL.md` for DataChain SDK rules.
→ Follow `CAST.md` §4 (planning) and §4.10 (dialogue) before writing pipeline code.
→ **While the pipeline is running**, enrich any Step 1 bucket JSON that does not yet have a `.md` (parallel work).
→ After the pipeline completes, run Steps 2–7 to update the knowledge base.
→ Report both: pipeline result AND knowledge base update status.
**Mode C — Script Execution** (e.g., user runs an existing `.py` file that touches data):
→ If the script references bucket URIs, run **Step 1** for each bucket root first.
→ Scripts can create datasets as side effects.
→ **While the script is running**, enrich Step 1 bucket JSON in parallel.
→ After ANY data-related script finishes, run Steps 2–7 to detect and record new/changed datasets.
**Mode D — Knowledge Base Maintenance** (e.g., "update the knowledge base", "refresh dataset docs"):
→ Run Steps 2–7. Existing session context in `.md` files is preserved automatically during re-enrichment.
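As a rough summary of the four modes, the step ordering can be written as a small function. This is a sketch for orientation, not part of the skill: the function name and the step labels are hypothetical, and it omits the Mode B precondition and the parallel enrichment work.

```python
def planned_steps(mode: str, bucket_roots: list[str]) -> list[str]:
    """Return the ordered actions for a workflow mode (A, B, C, or D)."""
    if mode not in {"A", "B", "C", "D"}:
        raise ValueError(f"unknown mode: {mode!r}")

    plan: list[str] = []
    # Modes A, B, and C enlist every referenced bucket root first (Step 1).
    if mode in {"A", "B", "C"}:
        plan += [f"step 1: enlist {root}" for root in bucket_roots]
    if mode == "B":
        plan.append("run pipeline")
    elif mode == "C":
        plan.append("run script")
    # Every mode ends by refreshing the knowledge base.
    plan.append("steps 2-7: update knowledge base")
    return plan
```

Mode D ignores bucket roots entirely, which matches its maintenance-only role.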
---
## Step 1 — Bucket Enlistment
When any storage URI is encountered, enlist the whole bucket first.
1. **Extract bucket root.** From any URI, derive `{scheme}://{bucket}/`.
2. **Check if already enlisted.** Look for `dc-knowledge/buckets/{scheme}/{bucket_slug}.md` or `.json`. If either exists, skip.
3. **Access check.** Run `datachain bucket status {root_uri}`. If denied / not found, stop and ask.
4. **Scan with timeout.** Default 60s; user can override:
```bash
python3 {skill_dir}/scripts/bucket_scan.py {root_uri} \
--output dc-knowledge/buckets/{scheme}/{bucket_slug}.json --timeout 60
```
5. **Handle timeout** (exit code 124). Run the hierarchical fallback:
```bash
python3 {skill_dir}/scripts/bucket_overview.py {root_uri} \
--bucket-json dc-knowledge/buckets/{scheme}/{bucket_slug}.json
```
6. **Report.** "Enlisted bucket {bucket} — {N} files, total size {size}, primarily {top 2-3 extensions}." Do **not**
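Items 1 and 2 of this step can be sketched with the standard library. Several things here are assumptions: `bucket_slug` is taken to be the bucket name unchanged (the slug rule is not defined in this file), the helper names are hypothetical, and only `scheme://bucket/...` URIs are handled, so local paths would need separate treatment.

```python
from pathlib import Path
from urllib.parse import urlsplit

KB_DIR = Path("dc-knowledge")


def bucket_root(uri: str) -> str:
    """Reduce any storage URI to its {scheme}://{bucket}/ root."""
    parts = urlsplit(uri)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not a bucket URI: {uri!r}")
    return f"{parts.scheme}://{parts.netloc}/"


def enlistment_paths(root_uri: str, kb_dir: Path = KB_DIR) -> list[Path]:
    """Files whose presence marks the bucket as already enlisted.

    Assumption: bucket_slug == bucket name. Suffixes are appended by hand
    because Path.with_suffix would mangle bucket names containing dots.
    """
    parts = urlsplit(root_uri)
    folder = kb_dir / "buckets" / parts.scheme
    slug = parts.netloc
    return [folder / f"{slug}.md", folder / f"{slug}.json"]


def is_enlisted(root_uri: str, kb_dir: Path = KB_DIR) -> bool:
    """True if either the .md or the .json file exists (skip the scan)."""
    return any(path.exists() for path in enlistment_paths(root_uri, kb_dir))
```

A caller would compute `bucket_root(uri)` once, skip the scan when `is_enlisted` returns true, and otherwise continue with the access check.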