Skill · 894 repo stars · updated 2d ago

latent-briefing

Latent Briefing enables multi-agent systems to share the orchestrator's reasoning state with worker agents through KV cache compaction rather than text summarization. Use this skill when designing hierarchical agent systems that need to transfer prior context efficiently, implementing Attention Matching-based cache compaction for inference-time handoff, or addressing token explosion in recursive agent graphs without sacrificing information coverage.

Install in Claude Code
git clone --depth 1 https://github.com/guanyang/open-agent-hub /tmp/latent-briefing && cp -r /tmp/latent-briefing/skills/latent-briefing ~/.claude/skills/latent-briefing
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Latent Briefing and KV Cache Memory Sharing

Hierarchical multi-agent systems often pay for the same context twice. The orchestrator accumulates a long reasoning trajectory, but each worker usually receives only a narrow text handoff such as a subtask prompt plus raw document slices. Passing the full trajectory fixes coverage but drives token cost up on every worker call. Summarization introduces latency and information loss. Retrieval helps with document access but does not preserve the orchestrator's evolving reasoning state.

Latent Briefing addresses this by sharing memory at the **representation level** rather than the text level. The core idea is to compact the orchestrator trajectory in the worker model's KV cache, keeping positions that are most relevant to the **current worker task**. The method builds on **Attention Matching (AM)** KV cache compaction and adapts it for inference-time multi-agent handoff with task-guided queries, a shared token mask across heads, and robust thresholding.

## When to Activate

Activate this skill when:

- Designing orchestrator-worker or supervisor-specialist systems where workers need access to prior orchestrator state without replaying the full trajectory as text
- Evaluating alternatives to LLM summarization or RAG for cross-agent state transfer
- Implementing or studying **KV cache compaction** as a first-class inference primitive, not only prefix caching of identical prompts
- Debugging token explosion in recursive, hierarchical, or tool-heavy agent graphs
- Interpreting benchmarks that report worker-token savings, total-token savings, compaction overhead, and accuracy together

Do not activate this skill for adjacent work owned by other skills:

- API-only stacks where internal KV tensors are inaccessible: use `context-compression`, `memory-systems`, or `multi-agent-patterns`.
- Ordinary persistent memory, entity tracking, or graph retrieval: use `memory-systems`.
- General multi-agent topology without representation-level state sharing: use `multi-agent-patterns`.
- Prefix caching, masking, or budget policy that does not transform KV state: use `context-optimization`.

## Core Concepts

**The token explosion pattern.** In recursive or REPL-style systems, the orchestrator repeatedly calls a worker to inspect evidence, verify hypotheses, or answer subquestions. The orchestrator's trajectory grows with partial conclusions, dead ends, tool output, and prior worker responses. If that trajectory is passed in full on every worker call, cost compounds quickly.
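The compounding can be made concrete with a small cost model. This is a minimal sketch under stated assumptions: the function name, the linear growth model, and every number below are illustrative and do not come from the public write-up.

```python
def total_worker_input_tokens(initial_tokens, growth_per_call, num_calls, keep_ratio=1.0):
    """Sum the trajectory tokens the worker ingests across all calls.

    keep_ratio=1.0 models passing the full trajectory as text; a smaller
    value models a handoff that retains only that fraction of positions.
    """
    total = 0
    trajectory = initial_tokens
    for _ in range(num_calls):
        total += int(trajectory * keep_ratio)
        # Assumed growth: orchestrator notes plus the worker reply are appended.
        trajectory += growth_per_call
    return total


# Hypothetical workload: 2,000-token start, 1,500 new tokens per step, 20 worker calls.
full = total_worker_input_tokens(2_000, 1_500, 20)
compact = total_worker_input_tokens(2_000, 1_500, 20, keep_ratio=0.25)
print(full, compact)  # 325000 81250
```

Because the trajectory grows linearly per call, the total across calls grows quadratically in the number of calls, which is why per-call compaction matters more as recursion deepens.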

**Representation-level sharing.** Instead of summarizing the trajectory into natural language, the system operates on the worker model's **KV cache**. It retains the positions that the worker would attend to for the current task and drops the rest. This is more specific than ordinary prefix caching: prefix caching reuses identical prefixes, while Latent Briefing also performs **task-conditioned selective retention** inside the reused trajectory.

**Attention Matching as the compaction engine.** AM seeks a smaller cache whose attention outputs approximate those of the full cache. Latent Briefing adapts AM for multi-agent inference in three ways:

1. Use **task-guided query vectors** derived from the current worker prompt.
2. Aggregate scores into a **shared global mask** instead of per-head independent subsets.
3. Use a robust threshold such as `median + tau * MAD` rather than fixed top-k per head.
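The threshold in item 3 can be written out explicitly. The notation here is an assumption introduced for illustration: $s_i$ is the shared score of trajectory position $i$ after aggregation, and $\tau$ is the `tau` parameter.

$$
\mathrm{MAD}(s) = \operatorname{median}_i \bigl|\, s_i - \operatorname{median}(s) \,\bigr|,
\qquad
\text{keep position } i \iff s_i > \operatorname{median}(s) + \tau \cdot \mathrm{MAD}(s)
$$

Both the median and the MAD are insensitive to a few very large scores, so the cutoff tracks the bulk of the score distribution rather than its extremes. Unlike a fixed top-k, the number of retained positions is not fixed in advance.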

**Reference result shape.** The public write-up reports substantial worker-token reduction, material total-token savings, and compaction overhead of a few seconds on long-document QA workloads (claim-latent-briefing-public-results). Treat these numbers as workload-specific evidence, not a general guarantee.

## Detailed Topics

### Why Text-Only Mitigations Fall Short

| Approach | Primary weakness |
|----------|------------------|
| LLM summarization | High latency, lossy abstraction, and no guarantee the summary preserves what the next subtask needs |
| Retrieval / RAG | Depends on chunking and embeddings; can miss cross-chunk or cross-step dependencies |
| Pass full trajectory | Cost scales with every worker call and irrelevant context can degrade worker quality |

Latent Briefing is useful when the bottleneck is not document retrieval itself, but **how to transfer orchestrator state into a worker efficiently and precisely**.

### Recursive Orchestrator-Worker Shape

Frameworks such as **Recursive Language Models** treat long context as an environment and recurse over it: an orchestrator decomposes work and delegates to workers. Latent Briefing fits the gap where the orchestrator has already built task-specific state that should inform the worker, but re-serializing that state as text is too expensive or noisy.

In the ideal setup, the worker maintains a persistent KV state for the orchestrator trajectory. New trajectory tokens extend that state, then compaction runs just before generation for the current subtask.
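The extend-then-compact lifecycle can be sketched with a toy stand-in for the cache. Everything here is hypothetical: `TrajectoryCache` and its methods are not part of any real runtime, strings stand in for per-position key/value tensors, and leaving the persistent state untouched after compaction is an assumption rather than something the source specifies.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryCache:
    """Toy worker-side cache: one entry per orchestrator trajectory position."""
    entries: list = field(default_factory=list)

    def extend(self, new_entries):
        # New orchestrator tokens append to the persistent state;
        # earlier positions are not recomputed.
        self.entries.extend(new_entries)

    def compacted(self, keep_mask):
        # Build the view used for one worker call. The persistent state is
        # left intact (assumption) so a later subtask can pick other positions.
        if len(keep_mask) != len(self.entries):
            raise ValueError("mask must cover every cached position")
        return [e for e, keep in zip(self.entries, keep_mask) if keep]


cache = TrajectoryCache()
cache.extend(["plan", "dead-end", "finding-A"])
cache.extend(["finding-B"])  # trajectory grows between worker calls
view = cache.compacted([True, False, True, True])  # mask chosen per subtask
print(view)  # ['plan', 'finding-A', 'finding-B']
```

In a real runtime the entries would be per-layer key and value tensors and the mask would come from task-guided scoring; the sketch only shows the ordering of operations.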

### Three Inference-Time Modifications

1. **Task-guided query vectors.** Use queries from the current worker task prompt, not generic samples from the context. Run a forward pass over the trajectory plus the current task through the worker model, then score each trajectory position by how strongly the task tokens attend to it.

2. **Shared token selection.** Aggregate scores across layers and heads into one per-position score. One shared mask enables batched operations and avoids hundreds of incompatible per-head solves.

3. **MAD thresholding.** Keep positions above a robust outlier threshold such as `median + tau * MAD`. A higher `tau` raises the threshold, keeps fewer positions, and so compacts more aggressively. Optimal settings depend on task regime, trajectory quality, and document length.
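The three modifications can be sketched end to end on toy data. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the attention weights are made-up numbers standing in for what a forward pass of the task prompt over the trajectory would produce, and averaging over layers and heads is one assumed way to aggregate.

```python
from statistics import median


def shared_position_scores(attn):
    """attn[layer][head][position]: attention mass the task-prompt queries
    place on each trajectory position. Returns one score per position by
    averaging over all layers and heads (assumed aggregation)."""
    num_positions = len(attn[0][0])
    totals = [0.0] * num_positions
    count = 0
    for layer in attn:
        for head in layer:
            for i, weight in enumerate(head):
                totals[i] += weight
            count += 1
    return [t / count for t in totals]


def mad_keep_mask(scores, tau):
    """Keep positions scoring strictly above median + tau * MAD."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    threshold = med + tau * mad
    return [s > threshold for s in scores]


# Toy input: 2 layers x 2 heads x 6 trajectory positions.
attn = [
    [[0.1, 0.1, 0.5, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]],
    [[0.1, 0.1, 0.5, 0.1, 0.1, 0.1], [0.1, 0.2, 0.4, 0.1, 0.1, 0.1]],
]
scores = shared_position_scores(attn)
print(mad_keep_mask(scores, tau=2.0))   # positions 2 and 4 survive
print(mad_keep_mask(scores, tau=10.0))  # only position 2 survives
```

The heads disagree on their single strongest position (2 versus 4), yet the shared mask yields one set of positions for every layer and head, which is what makes the retained cache a single batched slice.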

### Infrastructure Preconditions

Latent Briefing is only practical when the system **controls the worker inference runtime** closely enough to inspect or transform KV state. It is a poor default for API-only stacks where internal KV tensors are