ClaudeWave
Skill · 2.5k repo stars · updated 27d ago

cascadeflow

Cascadeflow is an in-process agent runtime intelligence layer that enforces cost, latency, quality, budget, and compliance constraints at each step of an AI agent's execution loop with sub-5ms overhead. Use it when you are building or extending agents with LangChain, OpenAI Agents SDK, CrewAI, PydanticAI, or other frameworks and need drafter/verifier model pairs, per-step budget caps, KPI enforcement, tool-call routing, and auditable decision traces without external proxies.

Install in Claude Code
git clone --depth 1 https://github.com/lemony-ai/cascadeflow /tmp/cascadeflow && cp -r /tmp/cascadeflow/skills/cascadeflow ~/.claude/skills/cascadeflow
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# cascadeflow

## What it is

**Agent runtime intelligence layer.** An in-process harness that sits *inside* the agent execution loop (not at the HTTP boundary) and makes per-step decisions on cost, latency, quality, budget, compliance, and energy. Sub-5ms overhead. Works alongside LangChain, OpenAI Agents SDK, CrewAI, PydanticAI, Google ADK, n8n, and Vercel AI SDK.

Two complementary pieces:

1. **Cascading** — try a cheap "drafter" model first, validate quality, escalate to a "verifier" model only when needed (40–85% cost savings).
2. **Runtime intelligence (harness)** — instrument the agent loop with budget caps, KPI weights, compliance gates, and a full per-step decision trace.
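
The cascading idea in item 1 can be sketched in a few lines of plain Python. This is only an illustration of the control flow, not cascadeflow's implementation: `cascade`, `CascadeResult`, the stub models, and the toy quality check are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    content: str
    model_used: str
    escalated: bool


def cascade(
    prompt: str,
    drafter: Callable[[str], str],
    verifier: Callable[[str], str],
    is_good_enough: Callable[[str, str], bool],
) -> CascadeResult:
    """Try the cheap model first; call the expensive one only if the check fails."""
    draft = drafter(prompt)
    if is_good_enough(prompt, draft):
        return CascadeResult(draft, "drafter", escalated=False)
    return CascadeResult(verifier(prompt), "verifier", escalated=True)


# Stub models and a toy quality check, for illustration only.
def cheap_model(prompt: str) -> str:
    return "Paris" if "France" in prompt else ""


def strong_model(prompt: str) -> str:
    return f"detailed answer to: {prompt}"


def non_empty(prompt: str, answer: str) -> bool:
    return bool(answer.strip())


easy = cascade("What's the capital of France?", cheap_model, strong_model, non_empty)
hard = cascade("Summarize this contract.", cheap_model, strong_model, non_empty)
print(easy.model_used, hard.model_used)
```

In this sketch the verifier is only paid for when the draft fails the check, which is where the savings come from. A real quality check would be considerably more involved than testing for a non-empty string.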

Python (`pip install cascadeflow`) and TypeScript (`@cascadeflow/core`). Docs: https://docs.cascadeflow.ai

## Why "in the loop" matters (the core pitch)

cascadeflow is **not a proxy or a gateway**. It runs inside the agent's process and sees every model call, tool call, and sub-agent handoff as it happens — so it can act on running state (cost so far, tool calls used, compliance flag) at *each step*, not just per HTTP request.

| Dimension | External proxy | cascadeflow harness |
|---|---|---|
| Scope | HTTP request boundary | Inside the agent loop |
| What it can see | One request at a time | Full run state (cost-so-far, step #, tool-calls used, budget remaining) |
| Optimization axes | Cost only | Cost · latency · quality · budget · compliance · energy — simultaneously |
| Latency overhead | 10–50 ms network RTT per call | <5 ms in-process per call |
| 10-step agent loop | +400–600 ms avoidable | negligible |
| Enforcement | Observe only | `allow` · `switch_model` · `deny_tool` · `stop` |
| Auditability | Request logs | Per-step decision trace (one entry per LLM/tool/handoff decision) |
| Business logic | None | Live KPI weights + targets injected at runtime |

This is what unlocks: stop-after-step-7 budget enforcement, deny-this-tool-mid-loop, switch-models-on-this-call, and a full audit trail of *why* every step did what it did. None of that is possible from outside the loop.
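
As a rough illustration of acting on running state at each step, the sketch below keeps a small run-state object and picks one of the four actions per step. `RunState`, `decide`, and `run_steps` are hypothetical names, not cascadeflow's API. The decision rules and the 10x price ratio for the fallback model are assumptions, and amounts are integer cents so the arithmetic stays exact.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RunState:
    budget_cents: int                 # total spend allowed for the run
    max_tool_calls: int
    cost_so_far: int = 0
    tool_calls_used: int = 0
    trace: List[dict] = field(default_factory=list)


def decide(state: RunState, kind: str, estimated_cost: int) -> str:
    """Pick one of allow / switch_model / deny_tool / stop and record it."""
    remaining = state.budget_cents - state.cost_so_far
    if remaining <= 0:
        action = "stop"                      # budget exhausted
    elif kind == "tool" and state.tool_calls_used >= state.max_tool_calls:
        action = "deny_tool"                 # tool-call cap reached
    elif kind == "llm" and estimated_cost > remaining:
        action = "switch_model"              # too expensive, use a cheaper model
    else:
        action = "allow"
    state.trace.append({"step": len(state.trace) + 1, "kind": kind,
                        "action": action, "cost_so_far": state.cost_so_far})
    return action


def run_steps(state: RunState, steps: List[Tuple[str, int]]) -> List[str]:
    for kind, cost in steps:
        action = decide(state, kind, cost)
        if action == "stop":
            break
        if action == "deny_tool":
            continue                          # skip the tool, keep the run alive
        if action == "switch_model":
            cost //= 10                       # assumed 10x cheaper fallback model
        state.cost_so_far += cost
        if kind == "tool":
            state.tool_calls_used += 1
    return [entry["action"] for entry in state.trace]


state = RunState(budget_cents=50, max_tool_calls=1)
actions = run_steps(state, [("llm", 20), ("tool", 0), ("tool", 0),
                            ("llm", 40), ("llm", 26), ("llm", 5)])
print(actions)
```

Each decision sees the cost accumulated by earlier steps, which a per-request proxy has no view of. The `trace` list plays the role of the per-step decision trace: one entry per decision, including the ones that denied or stopped.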

## When to use this skill

- User is building an AI agent and wants cost/latency/quality control *inside* the loop
- Code imports `cascadeflow`, `@cascadeflow/core`, `@cascadeflow/langchain`, `@cascadeflow/vercel-ai`, or `@cascadeflow/n8n-nodes-cascadeflow`
- The user mentions budgets, compliance (GDPR/HIPAA/PCI), KPI weights, tool-call routing, decision traces, or drafter/verifier pairs *together with* a cascadeflow signal (an import, a repo path, or an explicit cascadeflow mention). Don't fire on unrelated compliance or budget conversations in user code.
- Working inside `lemony-ai/cascadeflow` (examples, integrations, gateway server)
- A bug is discovered in cascadeflow itself or any of its integrations and needs to be fixed upstream

## Pick the right entry point (30-second decision)

| Situation | Use | Notes |
|---|---|---|
| Existing OpenAI/Anthropic app, want instant observability | `cascadeflow.init(mode="observe")` | Auto-patches the SDKs. Zero code changes in the app. |
| Existing app, no code changes at all, want gateway | `python -m cascadeflow.server` | Drop-in OpenAI/Anthropic-compatible proxy; point client at `http://127.0.0.1:<port>/v1` |
| New agent, want the default "just works" cascade | `auto_agent()` or `get_cost_optimized_agent()` | Presets — fastest path; no model picking required |
| New agent, custom drafter+verifier | `CascadeAgent(models=[drafter, verifier])` | Both languages |
| Agent function with budget + policy metadata | `from cascadeflow.harness import agent` then `@agent(budget=..., compliance=..., kpi_weights=...)` | Attaches metadata; combine with `cascadeflow.run()` for enforcement. Note: import the decorator from `cascadeflow.harness` — `cascadeflow.agent` resolves to the module, not the decorator. |
| Scoped run with budget and full trace | `with cascadeflow.run(budget=0.50, max_tool_calls=10) as session:` | Primary harness pattern |
| Inside LangChain / OpenAI Agents / CrewAI / PydanticAI / Google ADK / Vercel AI / n8n | Use the integration package | Don't reinvent — the integrations preserve tool calling, streaming, callbacks |

## Minimum viable cascade

**Python:**

```python
import asyncio

from cascadeflow import CascadeAgent, ModelConfig

agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),  # drafter
    ModelConfig(name="gpt-4o",      provider="openai", cost=0.00625),   # verifier
])

async def main() -> None:
    # agent.run() is awaited, so a plain script needs an event loop around it.
    result = await agent.run("What's the capital of France?")
    print(result.content, result.model_used, result.total_cost, result.cost_saved)

asyncio.run(main())
```

**TypeScript:**

```ts
import { CascadeAgent } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o',      provider: 'openai', cost: 0.00625  },
  ],
});

const r = await agent.run('What is TypeScript?');
console.log(r.modelUsed, r.totalCost, r.savingsPercentage);
```

**Even faster — presets (Python):**

```python
from cascadeflow import auto_agent, get_cost_optimized_agent

agent = auto_agent()                       # picks a sensible pair
# or: get_cost_optimized_agent(), get_balanced_agent(),
#     get_quality_optimized_agent(), get_speed_optimized_agent(),
#     get_development_agent()
```
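
To get a feel for when a drafter/verifier pair pays off, the back-of-the-envelope model below estimates savings from the share of queries the drafter handles on its own. It is a simplification, not the formula cascadeflow uses for `cost_saved`: it assumes both models process the same number of tokens, that the quality check itself is free, and that the baseline is sending every query to the verifier. The function names are hypothetical.

```python
def expected_cascade_cost(drafter_cost: float, verifier_cost: float,
                          accept_rate: float) -> float:
    """Expected cost per query: the drafter always runs, the verifier only on rejection."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    return drafter_cost + (1.0 - accept_rate) * verifier_cost


def savings_fraction(drafter_cost: float, verifier_cost: float,
                     accept_rate: float) -> float:
    """Savings relative to sending every query straight to the verifier."""
    cascade_cost = expected_cascade_cost(drafter_cost, verifier_cost, accept_rate)
    return 1.0 - cascade_cost / verifier_cost


# Cost values taken from the CascadeAgent example above.
DRAFTER, VERIFIER = 0.000375, 0.00625

for rate in (0.5, 0.7, 0.9):
    saved = savings_fraction(DRAFTER, VERIFIER, rate)
    print(f"accept_rate={rate:.0%}  savings={saved:.0%}")
```

Under these assumptions the savings reduce to `accept_rate - drafter_cost / verifier_cost`. With the example costs the ratio is 0.06, so acceptance rates of 50%, 70%, and 90% give roughly 44%, 64%, and 84% savings, and the cascade only loses money when the drafter is accepted less than 6% of the time.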

## Runtime intelligence — the harness

This is what makes cascadeflow different from a proxy or a model router. The harness runs **inside** the agent loop and decides per step.

### Three modes, safe rollout

- `off` — no instrumentation (default)
- `observe` — patches OpenAI + Anthropic SDKs, records cost/tokens/decisions, enforces nothing
- `enforce` — same, plus applies actions (see below)
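
The difference between the three modes can be pictured as a gate between the action a policy proposes and the action that actually takes effect. The sketch below is a conceptual model only; `apply_mode` and its log format are hypothetical and say nothing about how cascadeflow implements modes internally.

```python
from typing import List, Tuple

MODES = ("off", "observe", "enforce")


def apply_mode(mode: str, proposed: str, log: List[Tuple[str, str]]) -> str:
    """Return the action that takes effect under the given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "off":
        return "allow"                       # nothing recorded, nothing enforced
    effective = proposed if mode == "enforce" else "allow"
    log.append((proposed, effective))        # what the policy wanted vs. what happened
    return effective


log: List[Tuple[str, str]] = []
results = {mode: apply_mode(mode, "stop", log) for mode in MODES}
print(results)
print(log)
```

Running in `observe` first produces a record of what enforcement would have done without changing agent behavior, which is what makes a staged rollout from `off` to `observe` to `enforce` low-risk.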

### Per-step actions the harness can take

`allow` · `switch_model` · `deny_tool` · `stop`

Every LLM call, tool call, and sub-agent handoff is a decision point. The harness reads the current run state (cost so far, budget remaining, compliance flag, KPI weights) and