datachain-jobs
The datachain-jobs skill maintains a jobs analytics index file that tracks Studio job metrics including compute hours, user spend, failure rates, cost estimation, and cluster usage. Use this skill when users ask questions about job performance analytics, cost breakdowns, cluster utilization, or when they need historical job data summarized and stored for reference in the dc-knowledge directory.
git clone --depth 1 https://github.com/datachain-ai/datachain /tmp/datachain-jobs && cp -r /tmp/datachain-jobs/src/datachain/skill/jobs ~/.claude/skills/datachain-jobsSKILL.md
You are now loaded with the datachain-jobs skill. Maintain a jobs analytics file at `dc-knowledge/jobs/index.md`. Follow the 3-step flow below exactly.
---
## Step 1 — Check Staleness
```
python3 {skill_dir}/scripts/jobs.py --plan
```
- If `"studio_available": false` → report the `error` message and stop.
- If `"up_to_date": true` → skip to Step 3.
- If `"up_to_date": false` → continue to Step 2.
---
## Step 2 — Fetch & Write
```
python3 {skill_dir}/scripts/jobs.py --fetch [--days N] [--limit N] [--enrich]
```
- Use `--days N` from the user's request if stated (e.g. "last 7 days" → `--days 7`). Default: `--days 30`.
- Add `--enrich` only when the question requires duration, workers, or cluster data AND `enriched: false` in an existing index — tell the user it makes one API call per terminal job.
- If the script fails → report the error and stop.
Write `dc-knowledge/jobs/index.md` using EXACTLY this format:
```markdown
---
generated: <generated from script output>
days_covered: <days_covered>
total_jobs: <filtered_count>
failed_count: <failed_count>
complete_count: <complete_count>
running_count: <running_count>
other_count: <other_count>
enriched: <true|false>
duration_note: "Wall-clock duration (submit→finish). Null when enriched=false or job still running."
truncated: <true|false>
---
## Clusters
| Name | Cloud | Max Workers | Default |
|------|-------|-------------|---------|
| <name> | <cloud_provider> | <max_workers> | <yes if is_default else no> |
## Jobs
| Date | ID | Name | Status | User | Workers | Duration | Cluster | Python |
|------|----|------|--------|------|---------|----------|---------|--------|
| <created_display> | <id> | <name> | <status> | <created_by> | <workers> | <duration_str or —> | <cluster_name or —> | <python_version or —> |
```
**Section rules:**
- Omit `## Clusters` if the `clusters` array is empty.
- Duration cell: `duration_str` value (e.g. `"9000s"`) when known, `—` when null.
- Workers: always a number (`workers` field, defaults to 1).
- Cluster, Python: use `—` when null.
- Date column: `created_display` (`YYYY-MM-DD HH:MM` UTC).
- Rows: newest-first (already sorted by script).
- If `truncated: true`, add after the table: `_(Results truncated at <limit> jobs. Use --limit N for more.)_`
---
## Step 3 — Answer
Read `dc-knowledge/jobs/index.md` and answer the user's question.
### Duration arithmetic
Duration cells contain plain seconds strings like `"9000s"`. Parse the integer before `s`, sum, then convert:
- Example: filter rows for user "alice" in the last 7 days, sum all Duration values → total seconds → divide by 3600 for hours.
- If all Duration cells are `—` (enriched: false) → say: "Duration data requires enrichment. Re-fetch with: `python3 {skill_dir}/scripts/jobs.py --fetch --enrich`" and offer to do so.
### Failure rate
- Overall: `failed_count / total_jobs * 100` from frontmatter.
- Per user or per day: count rows matching Status = `failed` in the table.
### Price estimation
When the user asks for cost:
1. If hourly rate unknown → ask: "What is the instance hourly rate in $/hr? (e.g. `3.20` for $3.20/hr)"
2. If Workers column is all `—` → ask: "How many workers per job?" or compute single-worker cost and note it.
3. Compute per job: `duration_seconds / 3600 × rate × workers`. Group by user/day/cluster as requested.
4. Present as a table: User | Compute-hours | Est. cost (@$X/hr × N workers)
### Per-cluster / per-user analytics
Filter the Jobs table by the Cluster or User column. Aggregate `(Ns)` Duration values for totals.Use ONLY for abstract DataChain SDK questions — API usage, method signatures, or code patterns — when no specific dataset or bucket is referenced. If the request mentions creating, saving, listing, exploring datasets or buckets, use datachain-knowledge instead.
Use whenever datasets, cloud storage buckets, or data pipelines are mentioned — creating, saving, querying, listing, exploring, deleting, or processing data in S3, GCS, Azure Blob, or local storage. Also use when running any script that may create datasets as a side effect. Maintains a knowledge base at dc-knowledge/ (JSON + markdown). ALWAYS use this skill when the user creates a dataset, saves pipeline output, runs a data script, or references any storage bucket.