ClaudeWave

The Context Layer for unstructured data: typed, versioned datasets over S3, GCS, Azure

Subagents · 2.8k stars · 145 forks · Python · Apache-2.0 · Updated today
Editor's note

DataChain is a Python library and MCP server that turns unstructured files stored in S3, GCS, and Azure into typed, versioned datasets queryable at warehouse speed, without copying bytes out of storage. Its three main components are a Compute Engine for parallel and distributed Python processing with async I/O and checkpoint recovery, a Dataset DB backed by Pydantic schemas that tracks versions, file pointers, and lineage in a local SQLite store, and a Knowledge Base that generates markdown summaries from LLM-enriched datasets. The Agent Harness connects all three to Claude Code, Cursor, Codex, GitHub Copilot, and Pi via a single install command such as `datachain skill install --target claude`, exposing tools like `read_storage`, `map`, and `save` so agents can build and reuse named dataset versions like `pets_embeddings@1.0.0` across sessions. Data engineers, MLOps practitioners, and teams running multimodal pipelines benefit most, particularly when agents need persistent data context rather than recomputing from raw files on every run.
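To make the `name@version` convention in the note concrete, here is a minimal sketch of parsing and comparing references such as `pets_embeddings@1.0.0`. It deliberately does not use the DataChain library: `DatasetRef`, `parse_dataset_ref`, and `latest` are hypothetical helpers, and the assumption that versions are always three numeric parts is ours, not something this page confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class DatasetRef:
    """Hypothetical parsed form of a reference like 'pets_embeddings@1.0.0'."""

    name: str
    version: tuple[int, int, int]

    def __str__(self) -> str:
        return f"{self.name}@{'.'.join(str(part) for part in self.version)}"


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split 'name@major.minor.patch' into a name and a numeric version tuple."""
    name, sep, version = ref.partition("@")
    if not name or not sep:
        raise ValueError(f"expected 'name@version', got {ref!r}")
    parts = version.split(".")
    # Assumption: exactly three numeric components.
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected a three-part numeric version, got {version!r}")
    major, minor, patch = (int(part) for part in parts)
    return DatasetRef(name, (major, minor, patch))


def latest(refs: list[str]) -> str:
    """Return the highest version among references, compared numerically."""
    return str(max(parse_dataset_ref(ref) for ref in refs))
```

Comparing versions as integer tuples rather than strings matters once a component reaches two digits: `1.10.0` sorts after `1.2.0` numerically, but before it as text.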

ClaudeWave Trust Score
100/100
Verified
Passed
  • Open-source license (Apache-2.0)
  • Actively maintained (<30d)
  • Healthy fork ratio
  • Clear description
  • Topics declared
  • Mature repo (>1y old)
Flags
  • README contains suspicious pattern: eval\s*\(
Last scanned: 6/11/2026
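For readers wondering what the flag above means in practice, the following sketch applies the quoted pattern to a few lines of text. It is not ClaudeWave's scanner, whose implementation is not described on this page; `find_matches` and the sample text are hypothetical, and the sample is not taken from the DataChain README.

```python
import re

# The pattern quoted in the flag: "eval", optional whitespace, then "(".
SUSPICIOUS = re.compile(r"eval\s*\(")


def find_matches(text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each line matching the pattern."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if SUSPICIOUS.search(line)
    ]


# Hypothetical sample text, for illustration only.
sample = """\
result = eval(user_input)
model.eval()
score = evaluate(model)
"""

hits = find_matches(sample)
```

Here `hits` contains lines 1 and 2 but not line 3. The pattern has no word boundary, so a method call such as `model.eval()` matches just as a bare `eval(...)` does. A flag of this kind is therefore a prompt to read the matched lines, not proof of unsafe code.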
Install as a Claude Code subagent
Method: Clone
Terminal
git clone https://github.com/datachain-ai/datachain && cp datachain/*.md ~/.claude/agents/
1. Clone the repository and copy the agent .md definitions into ~/.claude/agents (or .claude/agents inside a project).
2. Start a new Claude Code session to load the agents.
3. Delegate work to them with the Task/Agent tool or by name.
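On systems where the shell `cp` command is not convenient, the copy in step 1 can be sketched in Python. This mirrors the command shown above and nothing more: `install_agents` is a hypothetical helper, and like `datachain/*.md` it copies only `.md` files at the top level of the clone. Whether the agent definitions actually live there is not confirmed by this page.

```python
from pathlib import Path
import shutil


def install_agents(repo_dir: Path, agents_dir: Path) -> list[Path]:
    """Copy top-level .md files from a cloned repo into an agents directory."""
    agents_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # Non-recursive glob, matching the behaviour of `cp datachain/*.md`.
    for md_file in sorted(repo_dir.glob("*.md")):
        if md_file.is_file():
            copied.append(Path(shutil.copy2(md_file, agents_dir)))
    return copied


# Example call (assumes the repository was already cloned into ./datachain):
# install_agents(Path("datachain"), Path.home() / ".claude" / "agents")
```

For a per-project install, pass `Path(".claude") / "agents"` as the destination instead, as step 1 notes.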

3 items in this repository

Use ONLY for abstract DataChain SDK questions — API usage, method signatures, or code patterns — when no specific dataset or bucket is referenced. If the request mentions creating, saving, listing, exploring datasets or buckets, use datachain-knowledge instead.

Use when asked about Studio job analytics — compute hours, user spend, failure rates, cost estimation, cluster usage. Generates and maintains dc-knowledge/jobs/index.md.

Use whenever datasets, cloud storage buckets, or data pipelines are mentioned — creating, saving, querying, listing, exploring, deleting, or processing data in S3, GCS, Azure Blob, or local storage. Also use when running any script that may create datasets as a side effect. Maintains a knowledge base at dc-knowledge/ (JSON + markdown). ALWAYS use this skill when the user creates a dataset, saves pipeline output, runs a data script, or references any storage bucket.


Topics: ai-agents, claude-code, codex, data-context-layer, data-processing, harness-engineering, knowledge-base, mlops, multimodal, pydantic, unstructured-data

What people ask about datachain

What is datachain-ai/datachain?

datachain-ai/datachain is a set of subagents for the Claude AI ecosystem. Its tagline is "The Context Layer for unstructured data: typed, versioned datasets over S3, GCS, Azure." It has 2.8k GitHub stars and was last updated today.

How do I install datachain?

You can install datachain by cloning the repository (https://github.com/datachain-ai/datachain) or following the README instructions on GitHub. ClaudeWave also provides quick install blocks on this page.

Is datachain-ai/datachain safe to use?

Our security agent has analyzed datachain-ai/datachain and assigned a Trust Score of 100/100 (tier: Verified). See the full breakdown of passed checks and flags on this page.

Who maintains datachain-ai/datachain?

datachain-ai/datachain is maintained by datachain-ai. The last recorded GitHub activity is from today, with 66 open issues.

Are there alternatives to datachain?

Yes. On ClaudeWave you can browse similar subagents at /categories/agents, sorted by popularity or recent activity.

Maintain this repo? Add a badge to your README

Drop the badge into your GitHub README to show it's tracked on ClaudeWave. Each badge links back to this page and reflects the live Trust Score.

Featured on ClaudeWave: datachain-ai/datachain
[![Featured on ClaudeWave](https://claudewave.com/api/badge/datachain-ai-datachain)](https://claudewave.com/repo/datachain-ai-datachain)
<a href="https://claudewave.com/repo/datachain-ai-datachain"><img src="https://claudewave.com/api/badge/datachain-ai-datachain" alt="Featured on ClaudeWave: datachain-ai/datachain" width="320" height="64" /></a>
