The Context Layer for unstructured data: typed, versioned datasets over S3, GCS, Azure
DataChain is a Python library and MCP server that turns unstructured files stored in S3, GCS, and Azure into typed, versioned datasets queryable at warehouse speed, without copying bytes out of storage. It has three main components: a Compute Engine for parallel and distributed Python processing with async I/O and checkpoint recovery; a Dataset DB, backed by Pydantic schemas, that tracks versions, file pointers, and lineage in a local SQLite store; and a Knowledge Base that generates markdown summaries from datasets enriched by LLMs. The Agent Harness connects all three to Claude Code, Cursor, Codex, GitHub Copilot, and Pi via a single install command such as `datachain skill install --target claude`. It exposes tools like `read_storage`, `map`, and `save` so that agents can build and reuse named dataset versions like `pets_embeddings@1.0.0` across sessions. Data engineers, MLOps practitioners, and teams running multimodal pipelines benefit most, particularly when agents need persistent data context rather than recomputing from raw files on every run.
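A versioned reference such as `pets_embeddings@1.0.0` pairs a dataset name with a version. The following is a minimal sketch of how such a reference could be split and compared, assuming a `name@major.minor.patch` convention. The `DatasetRef` type and the `parse_dataset_ref` helper are hypothetical names used for illustration; they are not part of the DataChain API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical container for a dataset name and an optional version."""
    name: str
    version: Optional[Tuple[int, int, int]] = None


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split 'name@major.minor.patch' into its parts.

    Assumption: versions are three dot-separated integers, and a bare
    name (no '@') is treated as unversioned.
    """
    name, sep, version = ref.partition("@")
    if not name:
        raise ValueError(f"missing dataset name in {ref!r}")
    if not sep:
        return DatasetRef(name)
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected major.minor.patch in {ref!r}")
    major, minor, patch = (int(p) for p in parts)
    return DatasetRef(name, (major, minor, patch))


ref = parse_dataset_ref("pets_embeddings@1.0.0")
print(ref.name, ref.version)
```

Storing the version as a tuple of integers means versions compare numerically, so `1.2.10` sorts after `1.2.9`, which a plain string comparison would get wrong.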
- ✓ Open-source license (Apache-2.0)
- ✓ Actively maintained (<30d)
- ✓ Healthy fork ratio
- ✓ Clear description
- ✓ Topics declared
- ✓ Mature repo (>1y old)
- ! README contains suspicious pattern: eval\s*\(
git clone https://github.com/datachain-ai/datachain && cp datachain/*.md ~/.claude/agents/
3 items in this repository
Use ONLY for abstract DataChain SDK questions — API usage, method signatures, or code patterns — when no specific dataset or bucket is referenced. If the request mentions creating, saving, listing, exploring datasets or buckets, use datachain-knowledge instead.
Use when asked about Studio job analytics — compute hours, user spend, failure rates, cost estimation, cluster usage. Generates and maintains dc-knowledge/jobs/index.md.
Use whenever datasets, cloud storage buckets, or data pipelines are mentioned — creating, saving, querying, listing, exploring, deleting, or processing data in S3, GCS, Azure Blob, or local storage. Also use when running any script that may create datasets as a side effect. Maintains a knowledge base at dc-knowledge/ (JSON + markdown). ALWAYS use this skill when the user creates a dataset, saves pipeline output, runs a data script, or references any storage bucket.
Subagents overview
What people ask about datachain
What is datachain-ai/datachain?
datachain-ai/datachain is a set of subagents for the Claude AI ecosystem. Its tagline is "The Context Layer for unstructured data: typed, versioned datasets over S3, GCS, Azure". It has 2.8k GitHub stars and was last updated today.
How do I install datachain?
You can install datachain by cloning the repository (https://github.com/datachain-ai/datachain) or following the README instructions on GitHub. ClaudeWave also provides quick install blocks on this page.
Is datachain-ai/datachain safe to use?
Our security agent has analyzed datachain-ai/datachain and assigned a Trust Score of 100/100 (tier: Verified). See the full breakdown of passed checks and flags on this page.
Who maintains datachain-ai/datachain?
datachain-ai/datachain is maintained by datachain-ai. The last recorded GitHub activity is from today, with 66 open issues.
Are there alternatives to datachain?
Yes. On ClaudeWave you can browse similar subagents at /categories/agents, sorted by popularity or recent activity.
Deploy datachain to your cloud
Ship this repo to production in minutes. Each platform spins up its own environment with editable env vars.
Maintain this repo? Add a badge to your README
Drop the badge into your GitHub README to show it's tracked on ClaudeWave. Each badge links back to this page and reflects the live Trust Score.
<a href="https://claudewave.com/repo/datachain-ai-datachain"><img src="https://claudewave.com/api/badge/datachain-ai-datachain" alt="Featured on ClaudeWave: datachain-ai/datachain" width="320" height="64" /></a>
More Subagents
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
The agent that grows with you
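A Java interview and general backend interview guide, covering computer science fundamentals, databases, distributed systems, high concurrency, system design, and AI application development.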
Production-ready platform for agentic workflow development.
The agent engineering platform.
🤯 LobeHub is your Chief Agent Operator, organizing your agents into 7×24 operations by hiring, scheduling, and reporting on your entire AI team.