ClaudeWave
Skill · 2.8k repo stars · updated today

datachain-core

**datachain-core** is a Claude Code skill containing expert-level guidance on DataChain SDK mechanics, including API usage, method signatures, UDF patterns, materialization strategies, and code generation rules. Use it for abstract questions about how to write correct DataChain Python code when no specific dataset or bucket exploration is involved; defer methodology questions about dataset design and layer selection to the datachain-knowledge skill.

Install in Claude Code
git clone --depth 1 https://github.com/datachain-ai/datachain /tmp/datachain-core && cp -r /tmp/datachain-core/src/datachain/skill/core ~/.claude/skills/datachain-core
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

You are now loaded with expert-level DataChain SDK context. Apply every rule below when generating DataChain Python code.

## Scope of this skill

This file is SDK mechanics — how to write DataChain code that runs correctly: API usage, UDF signatures, settings, delta semantics, materialization patterns, saving, exporting.

**It does not own methodology.** Decisions about *which* datasets to build, what scope, what shape (Container / Asset / Sense / Task), what fields to save, and when to dialogue with the user about layer choices — those are the CAST methodology, which lives in the **datachain-knowledge** skill at `{knowledge_skill_dir}/CAST.md`.

When knowledge is loaded, it is the orchestrator: it plans the layers (CAST §4), invokes the rules in this file to write the code, then runs the KB pipeline. When knowledge is *not* loaded (raw SDK use, no `dc-knowledge/` directory), this file is self-sufficient — CAST doctrine simply does not apply.

If you find yourself reasoning about "should I build a Sense layer here?" or "should this be scoped to the bucket or the directory?" from inside this file, stop — those questions belong upstream. Ask the user to load the knowledge skill, or fall through to a direct solve.

## Pre-Generation Checklist

- [ ] **Every UDF has a known output type.** Functions passed to `.map()`, `.gen()`, or `.agg()` must have their return type resolved. See §2 Rule 2 — the #1 runtime error.
- [ ] **No `from __future__ import annotations` in UDF modules.** It stringifies type hints; DataChain's signal-schema resolution then rejects the string-vs-class mismatch.
- [ ] **Bucket access: anonymous or authenticated?** Check `dc-knowledge/buckets/` for a `.md` file with `anon: true/false` in frontmatter. If none, run `datachain bucket status <uri>` to detect. If `denied` or `not found`, stop and ask the user.
- [ ] **Heavy-init resources load via `.setup()`**, not module-level lazy globals:
  ```python
  chain.setup(model=lambda: load_model()).map(result=run_model)
  ```
  Lazy globals leak across `parallel=N` workers and hide the dependency from the chain definition. See §2 Rule 20.
- [ ] **`.settings(parallel=N)` is the right tool only when the workload benefits.** See §2 Rule 6.
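
The second checklist item is easiest to see in plain Python. The sketch below does not call DataChain. It only shows what the future import does to a function's annotations, which is the raw input any signature-based type resolution has to work with. `caption_length` is a hypothetical UDF, and the source is run through `exec` only so that both variants fit in one snippet.

```python
# A hypothetical UDF, kept as source text so it can be compiled twice.
UDF_SOURCE = '''
def caption_length(caption: str) -> int:
    return len(caption)
'''


def return_annotation(source: str):
    """Execute the source and return the raw return annotation of the UDF."""
    namespace = {}
    exec(source, namespace)
    return namespace["caption_length"].__annotations__["return"]


# Without the future import, the annotation is the class object itself.
plain = return_annotation(UDF_SOURCE)

# With the future import, the same annotation is stored as a string.
stringified = return_annotation("from __future__ import annotations\n" + UDF_SOURCE)

print(type(plain).__name__, type(stringified).__name__)
```

The first annotation is the `int` class and the second is the string `"int"`, so a module that starts with the future import hands every UDF's types over as text rather than as classes.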

---

## Section 1 — Dataset Reuse (Highest Priority)

**Before writing any pipeline code, check what already exists.**

1. If `dc-knowledge/index.md` exists, read it **first**.
2. When the user's task overlaps with an existing dataset, read its `.md` under `dc-knowledge/datasets/` for schema, code patterns, and lineage.
3. **Reuse over rebuild.** Start from an existing dataset (`dc.read_dataset("name")`) whenever it covers the data the user needs — even partially. Filter, merge, or extend it instead of re-reading raw storage.

Only go to raw storage when no existing dataset covers the needed data, or the user explicitly asks to start fresh.
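
The lookup order above can be sketched with the standard library alone. `existing_dataset_notes` is a hypothetical helper, not part of DataChain, and it assumes only the layout named in this section: `dc-knowledge/index.md` plus one `.md` note per dataset under `dc-knowledge/datasets/`.

```python
from pathlib import Path


def existing_dataset_notes(root: str = ".") -> list[Path]:
    """Return the dataset notes to read before writing pipeline code.

    Hypothetical helper: an empty list means no knowledge base index was
    found, so there is nothing to reuse from it.
    """
    kb = Path(root) / "dc-knowledge"
    if not (kb / "index.md").is_file():
        return []
    # One note per dataset; sorted so the result is stable across runs.
    return sorted((kb / "datasets").glob("*.md"))


if __name__ == "__main__":
    for note in existing_dataset_notes():
        print(note.stem)  # candidate names for dc.read_dataset(...)
```

A non-empty result is a prompt to read those notes and start from `dc.read_dataset("name")`; an empty one leaves raw storage as the fallback.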

### Dataset-first reasoning

Datasets are the unit of reasoning. Chains that transform data through UDFs — or that produce a pipeline's final result — should be saved as named datasets.

**Core rule: always `.save()`, never just `.show()`.** A pipeline's terminal operation is `.save("descriptive_name")`, followed by `.show()` on the saved result for display. Two exceptions: (1) one-off exploratory queries where the user explicitly asks to "show me" or "print"; (2) Task-layer outputs per the CAST methodology — persist by exception, not by default. The always-save rule is absolute for C/A/S substrate layers.

**Critical anti-pattern: bypassing `.save()` by dumping in-memory rows to a file.** Reading the chain via `.to_list()` / `.to_values()` and writing to disk via `open()`, `json.dump`, `pandas.to_csv`, or any Python-side file handle is forbidden for UDF-bearing pipelines. The pipeline result must land as a saved dataset first via `.save()`. Once saved, exporting via `chain.to_csv()`, `chain.to_parquet()`, `chain.to_storage()` is fine.

**Not a bypass:** a UDF that materializes a payload to storage and returns a `dc.File` pointer. The dataset still lands via `.save()`; the file in storage is the row's payload, owned by DataChain via the pointer.

```python
import json

# `chain` and `encode_image` are assumed to be defined earlier in the pipeline.

# ✗ ANTI-PATTERN: UDF result pulled into Python and dumped to disk.
results = chain.map(emb=encode_image).to_list("file", "emb")
with open("similar_results.json", "w") as f:
    json.dump(results, f)

# ✓ Save the dataset first, then export from it if needed.
saved = chain.map(emb=encode_image).save("product_catalog_embeddings", attrs=[...])
saved.to_csv("similar_results.csv")
```

**What to save — the UDF rule:**

- **Any chain that runs a UDF (`.map()`, `.gen()`, `.agg()`)** must be saved with `.save("name")`. UDFs embody domain logic and produce structured output worth preserving.
- **Final pipeline results.** Rankings, filtered cohorts, evaluation outputs, aggregations — always `.save("name")`.
- **Chains with no UDFs** (`read_storage` + `filter`/`mutate`/`select` only) may remain transient — cheap to recompute, easy to read from the code.

**Prompt-trigger keywords for `.save()`.** When the user's task description contains "make available for downstream queries", "compute per-X aggregates", "build / extract / produce X", "store / persist / materialize / save", "process and save" — call `.save("name")` and print a short summary (name + row count or a few stats), not the full result set.

**`.persist()` is not `.save()`.** `.persist()` materializes a chain into an anonymous dataset — it prevents re-execution but creates no named dataset. When a chain should be saved per the rules above, use `.save("name")`.

### Code-level decomposition: one stage = one script

A multi-stage pipeline that produces multiple named datasets through expensive stages (LLM calls, embeddings, ML inference) belongs in MULTIPLE scripts — one per stage — not folded into one monolith. Each script reads from the previous stage's saved dataset via `dc.read_dataset(...)` and writes its own with `.save("name")`.

**Split when ANY**