# llm-cost-optimizer
This skill helps teams analyze and reduce LLM spend on Vellum assistants by mapping individual call-site model overrides to three managed performance profiles: Balanced (Claude Sonnet), Quality (Claude Opus), and Speed (Claude Haiku). Use it to audit current spending, identify tasks that can run on cheaper models without degrading quality, and apply cost-optimized configurations across all call sites so that an expensive default model never runs unexpectedly.
Install: `git clone --depth 1 https://github.com/vellum-ai/vellum-assistant /tmp/llm-cost-optimizer && cp -r /tmp/llm-cost-optimizer/skills/llm-cost-optimizer ~/.claude/skills/llm-cost-optimizer`
## Overview
This skill walks through analyzing and reducing LLM spend on a Vellum assistant. There are three layers:
1. **Provider connections** — named auth configs (e.g. `anthropic-managed`, `my-personal-key`)
2. **Model profiles** — named presets (model + effort + thinking + contextWindow). Three managed defaults: `balanced`, `quality-optimized`, `cost-optimized`.
3. **Call-site overrides** (`llm.callSites.<id>`) — per-task model/profile pinning. Falls back to `llm.default` when absent.
UI labels for the three managed profiles:
- `balanced` → **Balanced** (Sonnet, good for agent loop)
- `quality-optimized` → **Quality** (Opus, for hard tasks)
- `cost-optimized` → **Speed** (Haiku, for utility/background tasks)
### 🚨 Critical: unoverridden call sites fall back to `llm.default`
If `llm.default` is Opus (or any expensive model), **every call site without an explicit override burns that rate**. Don't rely on just patching a few overrides — use the complete turnkey blob in Step 5 to cover every call site at once.
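The fallback rule can be pictured as a small lookup. The sketch below only models the behavior described above; it is not the assistant's actual resolver. The function name `resolve_profile`, the dict shapes, and the use of a profile name as the default label are assumptions made for illustration.

```python
from typing import Optional


def resolve_profile(call_site: str, call_sites: dict, default: str) -> str:
    """Return the profile a call site would use under the fallback rule.

    Hypothetical helper: models "explicit override wins, otherwise llm.default".
    """
    override: Optional[dict] = call_sites.get(call_site)
    if override and "profile" in override:
        return override["profile"]
    return default


# Example config: only one call site is pinned, and the default is expensive.
call_sites = {"mainAgent": {"profile": "balanced"}}
default = "quality-optimized"

for site in ("mainAgent", "memoryExtraction", "heartbeatAgent"):
    print(f"{site}: {resolve_profile(site, call_sites, default)}")
```

In this example the two unpinned call sites resolve to the expensive default, which is the situation the warning describes.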
---
## Step 1 — Understand current spend
```bash
# Weekly totals
assistant usage totals --range week
# Break down by call site (most useful — shows what's expensive)
assistant usage breakdown --group-by call_site --range week
# Break down by model
assistant usage breakdown --group-by model --range week
# Break down by profile
assistant usage breakdown --group-by inference_profile --range week
```
Check `llm.default` — if it's pointing at Opus, that's your biggest risk:
```bash
assistant config get llm.default
```
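Once you have the per-call-site numbers, ranking them by share of total spend makes the expensive call sites obvious. The following is a rough sketch that assumes you have transcribed the breakdown into rows by hand; the row keys `call_site` and `cost_usd` are hypothetical and do not reflect the CLI's real output format, and the figures are placeholders rather than measurements.

```python
def rank_by_cost(rows: list[dict]) -> list[tuple[str, float, float]]:
    """Sort call sites by cost, returning (call_site, cost, share_of_total)."""
    total = sum(r["cost_usd"] for r in rows)
    ranked = sorted(rows, key=lambda r: r["cost_usd"], reverse=True)
    return [
        (r["call_site"], r["cost_usd"], r["cost_usd"] / total if total else 0.0)
        for r in ranked
    ]


# Placeholder numbers, only to show the shape of the calculation.
rows = [
    {"call_site": "memoryExtraction", "cost_usd": 2.0},
    {"call_site": "mainAgent", "cost_usd": 6.0},
    {"call_site": "heartbeatAgent", "cost_usd": 2.0},
]

for site, cost, share in rank_by_cost(rows):
    print(f"{site:<20} ${cost:>6.2f}  {share:.0%}")
```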
---
## Step 2 — Read current overrides
```bash
assistant config get llm.callSites
assistant config get llm.profiles
assistant inference providers connections list
```
---
## Step 3 — Recommended profile assignment
| Profile | Call Sites |
| -------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `balanced` (Sonnet) | `mainAgent`, `subagentSpawn`, `compactionAgent`, `analyzeConversation`, `patternScan`, `narrativeRefinement`, `memoryConsolidation`, `recall`, `callAgent`, `emptyStateGreeting`, `conversationStarters`, `identityIntro`, `proactiveArtifactBuild` |
| `cost-optimized` (Haiku) | **Everything else** — `memoryRouter` (with 1M context override), memory extraction/retrieval, UI copy, classifiers, summarization, background tasks |
| `quality-optimized` (Opus) | **Do not pin.** Reserved for on-demand user escalation via `/model` |
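The table reduces to one rule: a fixed set of call sites gets `balanced`, and everything else gets `cost-optimized`. A minimal sketch of that rule follows; the helper name `recommended_profile` is hypothetical, and the call-site names are copied from the table.

```python
# Call sites the table assigns to the `balanced` profile.
BALANCED_CALL_SITES = frozenset({
    "mainAgent", "subagentSpawn", "compactionAgent", "analyzeConversation",
    "patternScan", "narrativeRefinement", "memoryConsolidation", "recall",
    "callAgent", "emptyStateGreeting", "conversationStarters",
    "identityIntro", "proactiveArtifactBuild",
})


def recommended_profile(call_site: str) -> str:
    """Map a call site to the profile recommended in the table."""
    # `quality-optimized` is never returned: the table says not to pin it.
    return "balanced" if call_site in BALANCED_CALL_SITES else "cost-optimized"


for site in ("mainAgent", "memoryRouter", "filingAgent"):
    print(site, "->", recommended_profile(site))
```

Note that `memoryRouter` still needs its 1M context override on top of the `cost-optimized` profile; this helper only picks the profile name.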
---
## Step 4 — Config gotchas
### ⚠️ JSON object value replaces the entire block
`assistant config set llm.callSites.<key> '{...}'` with a JSON object **replaces the entire `llm.callSites` block**, not just that key.
- ✅ Single leaf value (safe): `assistant config set llm.callSites.mainAgent.profile balanced`
- ✅ Multiple / object values: always set `llm.callSites` as a **single JSON blob** (see Step 5)
- ❌ Never do: `assistant config set llm.callSites.memoryExtraction '{"profile":"cost-optimized"}'` — wipes all other overrides
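To see why the object form is dangerous, it can help to model the two write styles on a plain dict. This is a sketch of the behavior as described above, not the CLI's implementation; `set_leaf` and `set_object` are hypothetical names, and the exact state left behind by an object write is an assumption based on "wipes all other overrides".

```python
import copy


def set_leaf(call_sites: dict, key: str, field: str, value) -> dict:
    """Model a leaf write: only one field of one call site changes."""
    updated = copy.deepcopy(call_sites)
    updated.setdefault(key, {})[field] = value
    return updated


def set_object(call_sites: dict, key: str, obj: dict) -> dict:
    """Model the gotcha: an object write leaves only the written key behind."""
    return {key: copy.deepcopy(obj)}


before = {
    "mainAgent": {"profile": "balanced"},
    "callAgent": {"profile": "balanced"},
}

after_leaf = set_leaf(before, "memoryExtraction", "profile", "cost-optimized")
after_object = set_object(before, "memoryExtraction", {"profile": "cost-optimized"})

print(sorted(after_leaf))    # all three call sites survive
print(sorted(after_object))  # only memoryExtraction is left
```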
### ⚠️ Always use profile references — never direct model
❌ Wrong (shows "Custom" with empty provider/model in UI, won't track profile updates):
```bash
assistant config set llm.callSites.memoryExtraction.model claude-haiku-4-5-20251001
```
✅ Correct (shows "Speed" in UI):
```bash
assistant config set llm.callSites.memoryExtraction.profile cost-optimized
```
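A quick way to audit an existing config for this mistake is to scan it for entries that set `model` without a `profile`. The sketch below assumes the `llm.callSites` block has already been loaded into a dict (for example from saved JSON); the loading step and the helper name `find_direct_model_pins` are assumptions.

```python
def find_direct_model_pins(call_sites: dict) -> list[str]:
    """Return call sites that set `model` without a `profile` reference."""
    return sorted(
        name for name, cfg in call_sites.items()
        if "model" in cfg and "profile" not in cfg
    )


call_sites = {
    "memoryExtraction": {"model": "claude-haiku-4-5-20251001"},
    "mainAgent": {"profile": "balanced"},
}

for name in find_direct_model_pins(call_sites):
    print(f"{name}: replace `model` with a `profile` reference")
```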
### Profile + tuning fields can coexist
`profile` sets provider/model/connection. You can still add `effort`, `maxTokens`, `temperature`, `thinking`, `contextWindow` alongside it:
```json
{
"profile": "cost-optimized",
"maxTokens": 4096,
"effort": "low",
"temperature": 0,
"thinking": { "enabled": false, "streamThinking": false }
}
```
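If you build the block programmatically, entries like the one above can be generated and serialized into one `assistant config set llm.callSites` command, which avoids hand-quoting mistakes. This is a sketch under assumptions: `make_override` and `config_set_command` are hypothetical helpers, and the set of accepted tuning fields is taken from the list above and may not be exhaustive.

```python
import json
import shlex

# Tuning fields named in this section; treated here as the allowed set.
TUNING_FIELDS = {"effort", "maxTokens", "temperature", "thinking", "contextWindow"}


def make_override(profile: str, **tuning) -> dict:
    """Build one call-site entry: a profile reference plus optional tuning."""
    unknown = set(tuning) - TUNING_FIELDS
    if unknown:
        # Rejects `model` among others, so entries always go through a profile.
        raise ValueError(f"unsupported tuning fields: {sorted(unknown)}")
    return {"profile": profile, **tuning}


def config_set_command(call_sites: dict) -> str:
    """Render the whole block as a single shell-quoted command string."""
    blob = json.dumps(call_sites, separators=(",", ":"))
    return "assistant config set llm.callSites " + shlex.quote(blob)


call_sites = {
    "mainAgent": make_override("balanced"),
    "heartbeatAgent": make_override(
        "cost-optimized", maxTokens=2048, effort="low", temperature=0
    ),
}
print(config_set_command(call_sites))
```

Printing the command rather than running it leaves room to review the blob before it replaces the existing `llm.callSites` block.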
---
## Step 5 — Apply the complete turnkey blob
This covers **every known call site** — nothing falls back to default. Copy, paste, apply:
> **Note:** The canonical shipped defaults live in `assistant/src/config/call-site-defaults.ts`. The blob below can be used to override a user's config, but call sites without explicit user overrides already resolve to the defaults defined in that file. If new call sites have been added since this skill was written, add them there (default to `cost-optimized` unless they involve reasoning or memory consolidation).
```bash
assistant config set llm.callSites '{
"mainAgent": {"profile":"balanced"},
"subagentSpawn": {"profile":"balanced"},
"compactionAgent": {"profile":"balanced"},
"analyzeConversation": {"profile":"balanced"},
"patternScan": {"profile":"balanced"},
"narrativeRefinement": {"profile":"balanced"},
"memoryRouter": {"profile":"cost-optimized","contextWindow":{"maxInputTokens":1000000}},
"heartbeatAgent": {"profile":"cost-optimized","maxTokens":2048,"effort":"low","temperature":0,"thinking":{"enabled":false,"streamThinking":false},"contextWindow":{"maxInputTokens":16000}},
"filingAgent": {"profile":"cost-optimized"},
"callAgent": {"profile":"balanced"},
"proactiveArtifactDecision":{"profile":"cost-optimized"},
"proactiveArtifactBuild": {"profile":"balanced"},
"memoryExtraction": {"profile":"cost-optimized"},
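"memoryConsolidation": {"profile":"balanced"},
…
}'
```
The rest of the blob is missing from this copy. Complete it from the Step 3 table before applying, because this command replaces the whole `llm.callSites` block.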