BOHM: Zero-Cost Hierarchical Attribution for AI Agent Systems
A new method posted to arXiv derives component attribution in compound AI systems from routing weights alone, with no subset evaluation and no access to closed APIs.
When you build an agent system with Claude Code—multiple subagents, MCP servers, and chained tools—and something breaks, the first question is: which component failed? SHAP, the standard attribution method based on Shapley value theory, answers that question by evaluating the system across thousands of component combinations. The problem is that many of those combinations are literally impossible to evaluate: third-party APIs you can't modify, opaque orchestrators, endpoints with no internal access. In that scenario, SHAP doesn't work.
This week a paper appeared on arXiv, BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems, that targets exactly this bottleneck. Instead of evaluating coalitions of components, it reads attribution directly off the routing weights the system already maintains as part of its own architecture.
What BOHM does and how it works
BOHM's central intuition is that a hierarchical orchestrator—the kind of system you build when chaining subagents or skills in Claude Code—already implicitly encodes how much it trusts each branch of its decision tree. Those routing weights are what determine which tool or subagent receives each task.
BOHM converts them into formal attribution in two ways:
- Leaf attribution: the product of the routing weights along the path from the root to that component. If the router sends 80% of traffic to a subagent, and that subagent forwards 60% of what it receives to a specific tool, the tool receives an attribution of 0.8 × 0.6 = 0.48.
- Level-k attribution: the distribution induced across all nodes at depth k. This lets you view the system at several resolutions at once, something flat methods like SHAP cannot offer without additional evaluation cost.
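The two definitions above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' reference code: the tree, the component names (`coder`, `test_runner`, and so on), and the function names are all hypothetical, and it assumes each node's outgoing weights sum to 1 and that every leaf sits at the same depth.

```python
# Hypothetical routing tree: each internal node maps child -> routing weight.
# The names and numbers are illustrative assumptions, not taken from the paper.
ROUTING = {
    "root": {"coder": 0.8, "researcher": 0.2},
    "coder": {"test_runner": 0.6, "linter": 0.4},
    "researcher": {"web_search": 1.0},
}


def node_attributions(routing, root="root"):
    """Map every node to (depth, product of weights on its path from the root)."""
    result = {root: (0, 1.0)}
    stack = [root]
    while stack:
        node = stack.pop()
        depth, mass = result[node]
        for child, weight in routing.get(node, {}).items():
            result[child] = (depth + 1, mass * weight)
            stack.append(child)
    return result


def leaf_attribution(routing, root="root"):
    """Attribution for leaves only, i.e. nodes with no outgoing routes."""
    nodes = node_attributions(routing, root)
    return {name: mass for name, (_, mass) in nodes.items() if name not in routing}


def level_attribution(routing, k, root="root"):
    """Attribution restricted to the nodes at depth k."""
    nodes = node_attributions(routing, root)
    return {name: mass for name, (depth, mass) in nodes.items() if depth == k}


leaves = leaf_attribution(ROUTING)
level_1 = level_attribution(ROUTING, 1)
print(leaves)   # test_runner is about 0.48, matching the 0.8 x 0.6 example
print(level_1)  # coder 0.8, researcher 0.2
```

Nothing in the sketch calls a component or evaluates a coalition: the only input is the weight table, which is the sense in which the attribution is "zero-cost".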
Why it matters for those working with Claude
In May 2026, agent systems with Claude are no longer experiments. It's routine to see orchestrators with three or four levels of subagents, specialized skills, and MCP servers connected to external services—databases, business APIs, code tools. Any team operating this type of architecture in production faces the same debugging problem: when output is incorrect or token costs spike, which branch of the tree is the source?
Until now, the options were to manually instrument each component or to rely on low-level logs. BOHM offers a structured alternative that doesn't require modifying components: capture the routing weights the orchestrator already generates. In Claude Code, lifecycle hooks such as `PreToolUse` and `PostToolUse` already expose which tool is invoked on each call. They do not report a routing weight as such, so the weights would have to be estimated, for example from how often each branch is chosen. With that caveat, integrating something similar to BOHM looks technically achievable without touching subagent logic.
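One way to approximate routing weights when the orchestrator does not expose them is to count observed dispatches. The sketch below is an assumption of ours, not something the paper describes: it treats empirical dispatch frequencies as a stand-in for routing weights, and the `(parent, child)` event format is hypothetical rather than the payload of any Claude Code hook.

```python
from collections import Counter, defaultdict


def estimate_routing_weights(events):
    """Turn observed dispatches into per-parent routing frequencies.

    events: iterable of (parent, child) pairs, one per observed dispatch.
    Returns {parent: {child: share of that parent's dispatches}}.
    """
    counts = defaultdict(Counter)
    for parent, child in events:
        counts[parent][child] += 1

    weights = {}
    for parent, children in counts.items():
        total = sum(children.values())
        weights[parent] = {child: n / total for child, n in children.items()}
    return weights


# Hypothetical dispatch log; in practice this would be collected by whatever
# logging the team already has around tool and subagent calls.
events = (
    [("root", "coder")] * 8
    + [("root", "researcher")] * 2
    + [("coder", "test_runner")] * 6
    + [("coder", "linter")] * 4
)

weights = estimate_routing_weights(events)
print(weights["root"])   # coder 0.8, researcher 0.2
print(weights["coder"])  # test_runner 0.6, linter 0.4
```

Frequencies estimated this way describe what the orchestrator did over the logged window, so they inherit any bias in that window and need enough events per branch to be stable.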
The method's limits
The authors themselves are explicit that BOHM and SHAP answer different questions. SHAP measures the causal contribution of each component by isolating its effect; BOHM measures how much the router trusts each branch. Both converge when the router is nearly optimal—that is, when it routes tasks well to the most capable components—but can diverge in poorly calibrated systems.
In their experiments, the authors tested 18 LLMs in a three-level hierarchy across 880 LiveCodeBench problems. The rank-correlation results (Kendall's tau) favor BOHM under concentrated routing, which is precisely the most common case in production: an orchestrator that channels most traffic to two or three main tools.
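For readers unfamiliar with the metric, Kendall's tau compares two rankings by counting the pairs of items they order the same way against the pairs they order differently. The sketch below implements the simple tau-a variant, which is an assumption, since this article does not say which variant the paper uses. The scores are made-up numbers for illustration, not results from the paper.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same components.

    Tied pairs count as neither concordant nor discordant.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both score sets must cover the same components")
    names = sorted(scores_a)
    if len(names) < 2:
        raise ValueError("need at least two components to compare rankings")

    concordant = discordant = 0
    for x, y in combinations(names, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up example: routing-based attribution vs. some reference attribution.
routing_based = {"test_runner": 0.48, "linter": 0.32, "web_search": 0.20}
reference = {"test_runner": 0.50, "linter": 0.15, "web_search": 0.35}

tau = kendall_tau_a(routing_based, reference)
print(round(tau, 3))  # 0.333: two pairs agree, one pair is swapped
```

A value of 1 means the two methods rank every component in the same order, and -1 means the orders are exactly reversed.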
The practical question the paper leaves open is how it behaves when routing is learned implicitly—as in systems where Claude decides which tool to invoke without an explicit routing layer with accessible weights. That's the most frequent case today with Claude Code in its default configuration.
---
From our perspective, what matters is not that BOHM is the definitive solution to agent interpretability, but that it charts a zero-cost path that fits well with the reality of production systems: opaque, chained, and with components you don't control. It's worth tracking when reference code becomes available.