ToolSense: How to Audit What an LLM Really Knows About Its Tools
A new diagnostic framework published on arXiv reveals that models retrieving tools parametrically can score well on standard metrics without actually understanding what each tool does.
There is a silent problem in agent systems that manage large tool catalogs: models can score well on standard retrieval benchmarks while understanding little or nothing about what most of those tools actually do. A paper published June 12 on arXiv, "ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs," demonstrates this systematically and proposes a diagnostic framework to detect it.
The work starts from a concrete observation: reference benchmarks like ToolBench use verbose, fully specified queries and apply constrained decoding that forces the model to choose between valid token paths. Under these conditions, it is easy to achieve high scores without the model truly comprehending the semantics of the tools. It is equivalent to passing a multiple-choice exam from memory without having actually understood the subject matter.
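To make the constrained-decoding point concrete, here is a minimal Python sketch. It is not ToolBench's or the paper's implementation: the toy catalog, the word-level tokens, and the `constrained_decode` helper are all hypothetical. It only illustrates that when each decoding step is restricted to token paths that spell a valid tool name, even a scorer with no knowledge of the tools always emits a valid tool.

```python
import random

# Hypothetical toy catalog: each tool name is a sequence of tokens.
# Assumption: no tool name is a prefix of another, so trie leaves mark complete names.
CATALOG = [
    ("weather", "get", "forecast"),
    ("weather", "get", "alerts"),
    ("finance", "get", "quote"),
]

def build_trie(names):
    """Build a nested-dict trie over the token sequences of valid tool names."""
    root = {}
    for name in names:
        node = root
        for token in name:
            node = node.setdefault(token, {})
    return root

def constrained_decode(score, trie):
    """Greedy decoding restricted to trie paths, so the output is always a valid name."""
    node, output = trie, []
    while node:  # an empty dict marks the end of a tool name
        allowed = sorted(node)
        best = max(allowed, key=lambda tok: score(tuple(output), tok))
        output.append(best)
        node = node[best]
    return tuple(output)

def clueless_score(prefix, token):
    """Stand-in for a model with no tool knowledge: every score is random."""
    return random.random()

trie = build_trie(CATALOG)
picked = constrained_decode(clueless_score, trie)
print(picked in CATALOG)  # prints True: the constraint guarantees validity, not understanding
```

The output is always a well-formed tool identifier, which is why a valid-looking answer under constrained decoding says little on its own about whether the model understands the tool.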
What ToolSense Proposes
ToolSense is an open-source framework that, given any tool catalog as input, automatically generates three distinct benchmarks:
- Realistic Retrieval Benchmark (RRB): queries organized across three ambiguity levels, from the most explicit to those mimicking how real users describe what they need—without directly mentioning the tool.
- MCQ probing benchmark: multiple-choice questions that probe whether the model truly distinguishes between similar tools.
- QA probing benchmark: open-ended questions that require describing the purpose and parameters of a specific tool.
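As an illustration of the MCQ idea, the sketch below builds one multiple-choice item whose distractors are the tools most similar to the target. This is not ToolSense's generation procedure, which the article does not describe; the catalog, the word-overlap similarity, and the `build_mcq` helper are assumptions made purely for the example.

```python
# Hypothetical mini-catalog mapping tool names to one-line descriptions.
CATALOG = {
    "get_forecast": "return the weather forecast for a city",
    "get_alerts": "return severe weather alerts for a city",
    "get_quote": "return the latest stock price for a ticker",
    "convert_currency": "convert an amount between two currencies",
}

def jaccard(a, b):
    """Word-level Jaccard similarity between two descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def build_mcq(target, catalog, n_distractors=2):
    """Build one MCQ item, using the most similar tools as distractors."""
    others = [name for name in catalog if name != target]
    # Most similar descriptions first, so the wrong options are hard to rule out.
    others.sort(key=lambda name: jaccard(catalog[target], catalog[name]), reverse=True)
    options = sorted([target] + others[:n_distractors])
    return {
        "question": f"Which tool matches this description? {catalog[target]}",
        "options": options,
        "answer": target,
    }

item = build_mcq("get_forecast", CATALOG)
print(item["options"])  # ['get_alerts', 'get_forecast', 'get_quote']
```

Choosing near neighbours as distractors is what makes such a question a probe of discrimination between similar tools rather than a simple recognition test.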
The authors apply ToolSense to ToolBench—approximately 47,000 tools—and evaluate five distinct parametric training configurations. The results, though not fully detailed in the abstract, point to a clear dissociation: configurations that perform well on the standard benchmark do not necessarily demonstrate real understanding when subjected to ambiguous queries or probing questions.
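One way to see such a dissociation in your own evaluation logs is to break accuracy down by ambiguity level instead of reporting a single number. The sketch below assumes a hypothetical record format of (ambiguity level, expected tool, predicted tool); the records are invented placeholders, not results from the paper.

```python
from collections import defaultdict

# Invented placeholder records, NOT results from the paper:
# (ambiguity_level, expected_tool, predicted_tool)
RESULTS = [
    (1, "get_forecast", "get_forecast"),
    (1, "get_quote", "get_quote"),
    (2, "get_forecast", "get_forecast"),
    (2, "get_alerts", "get_forecast"),
    (3, "get_alerts", "get_quote"),
    (3, "get_quote", "get_forecast"),
]

def accuracy_by_level(results):
    """Return retrieval accuracy for each ambiguity level."""
    hits, totals = defaultdict(int), defaultdict(int)
    for level, expected, predicted in results:
        totals[level] += 1
        hits[level] += int(expected == predicted)
    return {level: hits[level] / totals[level] for level in sorted(totals)}

per_level = accuracy_by_level(RESULTS)
overall = sum(expected == predicted for _, expected, predicted in RESULTS) / len(RESULTS)

print(per_level)  # {1: 1.0, 2: 0.5, 3: 0.0}
print(overall)    # 0.5
```

In this toy data the aggregate score hides the fact that every fully explicit query succeeds and every highly ambiguous one fails, which is the kind of gap a single headline metric cannot show.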
Why This Matters Beyond the Paper
For anyone working with Claude Code, MCP servers, or any agent architecture over broad tool catalogs, this research has direct consequences. The promise of parametric retrieval is attractive: instead of relying on an external encoder to find the right tool, the LLM itself carries that knowledge in its weights. But if that knowledge is fragile, working with well-formed queries but failing on imprecise natural language, the system breaks exactly when you need it most.
The problem is compounded with dynamic catalogs. In environments where tools are frequently added, modified, or versioned (common in active MCP integrations), what the model memorized during training can become outdated without conventional benchmarks detecting it.
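A simple complementary check, independent of ToolSense and not taken from the paper, is to fingerprint each tool specification at training time and compare it against the live catalog. The sketch below assumes tool specs are JSON-serializable dictionaries keyed by tool name; the `fingerprint` and `catalog_drift` helpers and the example specs are hypothetical.

```python
import hashlib
import json

def fingerprint(spec):
    """Stable SHA-256 hash of a JSON-serializable tool specification."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def catalog_drift(snapshot, current):
    """List tools added, removed, or changed since the training snapshot."""
    old = {name: fingerprint(spec) for name, spec in snapshot.items()}
    new = {name: fingerprint(spec) for name, spec in current.items()}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }

# Hypothetical catalog as it looked when the model was trained.
snapshot = {
    "get_forecast": {"description": "weather forecast", "params": ["city"]},
    "get_quote": {"description": "stock price", "params": ["ticker"]},
}
# Hypothetical live catalog: one tool changed, one removed, one added.
current = {
    "get_forecast": {"description": "weather forecast", "params": ["city", "days"]},
    "get_alerts": {"description": "weather alerts", "params": ["city"]},
}

drift = catalog_drift(snapshot, current)
print(drift)  # {'added': ['get_alerts'], 'removed': ['get_quote'], 'changed': ['get_forecast']}
```

Any tool listed under `added` or `changed` is one the model cannot have memorized in its current form, so it is a natural candidate for fresh diagnostic queries.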
ToolSense, by automating the generation of diagnostic tests from your actual catalog, offers something static benchmarks cannot: a mirror calibrated to the real tools your system uses, not those in a generic reference dataset. Being open-source and working with any catalog as input makes it especially practical for teams maintaining custom integrations.
Who Finds This Useful
The paper targets researchers in agent systems and tool retrieval, but it has practical relevance for:
- Teams running MCP servers with dozens or hundreds of tools who want to know if the model truly understands each one.
- Engineers evaluating retrieval strategies before choosing between classic embeddings and parametric approaches.
- Those building plugins or sub-agents on Claude Code that delegate tool selection to the model itself.
---
Editor's note: That someone has formalized the difference between "retrieving well" and "truly understanding" is useful and necessary. Static benchmarks have been too convenient a proxy for too long; ToolSense does not solve the underlying problem, but at least it makes it visible and measurable.