ToolSense: How to Audit What an LLM Really Knows About Its Tools

There is a silent problem in agent systems that manage large tool catalogs: models can score well on standard retrieval benchmarks while understanding little or nothing about what most of those tools actually do. A paper posted to arXiv on June 12, "ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs," demonstrates this systematically and proposes a way to detect it.

The work starts from a concrete observation: reference benchmarks like ToolBench use verbose, fully specified queries and apply constrained decoding that forces the model to choose among valid token paths. Under these conditions, a model can score highly without comprehending the semantics of the tools, much like passing a multiple-choice exam from memory without understanding the subject matter.
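A toy sketch can make the constrained-decoding effect concrete. The scores and token names below are invented for illustration and are not taken from ToolBench or the paper; the point is only that restricting the choice to valid tool tokens always yields a tool, even when the unconstrained model puts almost no probability on any of them.

```python
import math

def choose(logits, allowed=None):
    """Return the highest-scoring token, optionally restricted to `allowed`."""
    pool = logits if allowed is None else {t: s for t, s in logits.items() if t in allowed}
    return max(pool, key=pool.get)

def probability_mass(logits, tokens):
    """Share of softmax probability the unconstrained model gives to `tokens`."""
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = {t: math.exp(s - peak) for t, s in logits.items()}
    return sum(weights[t] for t in tokens) / sum(weights.values())

# Hypothetical next-token scores; the token names are invented for illustration.
logits = {"the": 3.1, "I": 2.7, "<tool_weather>": 0.4, "<tool_calendar>": 0.1}
tool_tokens = {"<tool_weather>", "<tool_calendar>"}

unconstrained = choose(logits)             # free choice lands on an ordinary word
constrained = choose(logits, tool_tokens)  # constrained choice is forced onto a tool
tool_mass = probability_mass(logits, tool_tokens)

print(unconstrained, constrained, round(tool_mass, 3))
```

In this made-up case the constrained pick looks like a confident retrieval, while the probability mass shows the model barely considered a tool at all.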

What ToolSense Proposes

ToolSense is an open-source framework that, given any tool catalog as input, automatically generates three distinct benchmarks:

Realistic Retrieval Benchmark (RRB): queries organized across three ambiguity levels, from the most explicit to those that mimic how real users describe what they need without directly mentioning the tool.
MCQ probing benchmark: multiple-choice questions that probe whether the model truly distinguishes between similar tools.
QA probing benchmark: open-ended questions that require describing the purpose and parameters of a specific tool.
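To give a feel for what an automatically generated probing item might look like, here is a minimal sketch of an MCQ builder. The `Tool` and `MCQItem` types, the `build_mcq` function, and the same-category distractor heuristic are all assumptions made for illustration; the article does not describe how ToolSense actually constructs its questions.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    category: str
    description: str

@dataclass(frozen=True)
class MCQItem:
    question: str
    options: tuple
    answer_index: int

def build_mcq(target, catalog, n_options=4, seed=0):
    """Build one multiple-choice item whose distractors favour similar tools."""
    rng = random.Random(seed)
    # Assumed heuristic: tools from the same category are the hardest distractors.
    same = [t for t in catalog if t.category == target.category and t.name != target.name]
    other = [t for t in catalog if t.category != target.category]
    rng.shuffle(same)
    rng.shuffle(other)
    options = [target] + (same + other)[: n_options - 1]
    rng.shuffle(options)
    return MCQItem(
        question=f"Which description matches the tool '{target.name}'?",
        options=tuple(t.description for t in options),
        answer_index=options.index(target),
    )

# Hypothetical catalog entries, invented for this example.
catalog = [
    Tool("get_forecast", "weather", "Returns a multi-day weather forecast for a city."),
    Tool("get_current_weather", "weather", "Returns current conditions for a city."),
    Tool("get_air_quality", "weather", "Returns an air quality index for a city."),
    Tool("send_email", "messaging", "Sends an email to one or more recipients."),
    Tool("create_event", "calendar", "Creates a calendar event."),
]

item = build_mcq(catalog[0], catalog)
print(item.question)
for i, option in enumerate(item.options):
    print(i, option)
```

Drawing distractors from near neighbours is what makes such an item a test of discrimination rather than of recall.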

The approach is designed for parametric retrieval, a technique that encodes each tool as a virtual token added to the LLM's vocabulary and trains it in two phases: memorization first, then supervised retrieval fine-tuning. It is an alternative to compact embedding encoders, which may not capture the specialized semantics of heterogeneous catalogs well.
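The idea of one virtual token per tool, trained in two phases, can be sketched in plain Python. Everything here is hypothetical: the `<tool:...>` token format, the function names, and the assumption that vocabulary ids are contiguous starting at 0. A real setup would extend a tokenizer and the model's embedding matrix; this only shows the bookkeeping and the shape of the training pairs.

```python
def add_tool_tokens(vocab, tool_names):
    """Return a copy of `vocab` extended with one new token id per tool.

    Assumes existing ids are contiguous from 0, so len(vocab) is the next free id.
    """
    extended = dict(vocab)
    tool_ids = {}
    for name in tool_names:
        token = f"<tool:{name}>"  # hypothetical token format
        extended.setdefault(token, len(extended))
        tool_ids[name] = extended[token]
    return extended, tool_ids

def memorization_pairs(catalog):
    """Phase 1: map each tool description to its virtual token."""
    return [(description, f"<tool:{name}>") for name, description in catalog.items()]

def retrieval_pairs(labelled_queries):
    """Phase 2: map user queries to the virtual token of the correct tool."""
    return [(query, f"<tool:{name}>") for query, name in labelled_queries]

# Invented vocabulary, catalog, and queries for illustration only.
vocab = {"hello": 0, "world": 1}
catalog = {
    "get_weather": "Returns current weather for a city.",
    "send_email": "Sends an email to a recipient.",
}
extended_vocab, tool_ids = add_tool_tokens(vocab, catalog)
phase_one = memorization_pairs(catalog)
phase_two = retrieval_pairs([("Will I need an umbrella in Oslo?", "get_weather")])

print(tool_ids)
print(phase_one[0])
print(phase_two[0])
```

The distinction between the two lists is what the probing benchmarks examine: a model can learn the second mapping well while retaining little from the first.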

The authors apply ToolSense to ToolBench (approximately 47,000 tools) and evaluate five distinct parametric training configurations. The results, though not fully detailed in the abstract, point to a clear dissociation: configurations that perform well on the standard benchmark do not necessarily demonstrate real understanding when subjected to ambiguous queries or probing questions.
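One way to look for this kind of dissociation in your own evaluation logs is to break accuracy down by ambiguity level rather than reporting a single number. The sketch below assumes a hypothetical record format of `(level, predicted_tool, gold_tool)`; the sample records are invented and are not results from the paper.

```python
from collections import defaultdict

def accuracy_by_level(records):
    """Compute retrieval accuracy separately for each ambiguity level."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, predicted, gold in records:
        totals[level] += 1
        hits[level] += int(predicted == gold)
    return {level: hits[level] / totals[level] for level in sorted(totals)}

# Invented records: level 1 = explicit query, level 3 = tool never mentioned.
records = [
    (1, "get_weather", "get_weather"),
    (1, "send_email", "send_email"),
    (3, "get_weather", "get_weather"),
    (3, "send_email", "create_event"),
]

per_level = accuracy_by_level(records)
print(per_level)
```

A large gap between the explicit and the ambiguous levels is the pattern the article describes; a single aggregate score would hide it.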

Why This Matters Beyond the Paper

For anyone working with Claude Code, MCP servers, or any agent architecture over broad tool catalogs, this research has direct consequences. The promise of parametric retrieval is attractive: instead of relying on an external encoder to find the right tool, the LLM itself carries that knowledge. But if that knowledge is fragile, working with well-formed queries but failing on imprecise natural language, the system breaks exactly when you need it most.

The problem is compounded with dynamic catalogs. In environments where tools are frequently added, modified, or versioned, as is common in active MCP integrations, what the model learned in the memorization phase can become outdated without conventional benchmarks detecting it.
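A simple guard against this kind of staleness, independent of ToolSense itself, is to fingerprint each tool specification at training time and compare against the live catalog later. The sketch below is an assumption-laden illustration: the spec format (a JSON-serializable dict per tool) and the function names are hypothetical.

```python
import hashlib
import json

def fingerprint(tool_spec):
    """Stable SHA-256 hash of a JSON-serializable tool specification."""
    canonical = json.dumps(tool_spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def diff_catalog(trained_fingerprints, current_catalog):
    """Report tools added, removed, or changed since the memorization phase."""
    current = {name: fingerprint(spec) for name, spec in current_catalog.items()}
    return {
        "added": sorted(set(current) - set(trained_fingerprints)),
        "removed": sorted(set(trained_fingerprints) - set(current)),
        "changed": sorted(
            name for name in current
            if name in trained_fingerprints and current[name] != trained_fingerprints[name]
        ),
    }

# Invented catalogs: a snapshot taken at training time and the live version.
trained_catalog = {
    "get_weather": {"description": "Current weather.", "params": ["city"]},
    "send_email": {"description": "Send an email.", "params": ["to", "body"]},
}
live_catalog = {
    "get_weather": {"description": "Current weather.", "params": ["city", "units"]},
    "create_event": {"description": "Create a calendar event.", "params": ["title"]},
}

snapshot = {name: fingerprint(spec) for name, spec in trained_catalog.items()}
drift = diff_catalog(snapshot, live_catalog)
print(drift)
```

Any tool listed under "added" or "changed" is one the model has never memorized in its current form, which makes it a natural candidate for regenerated diagnostic tests.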

ToolSense, by automating the generation of diagnostic tests from your actual catalog, offers something static benchmarks cannot: a mirror calibrated to the real tools your system uses, not those in a generic reference dataset. Being open-source and working with any catalog as input makes it especially practical for teams maintaining custom integrations.

Who Finds This Useful

The paper targets researchers in agent systems and tool retrieval, but it has practical relevance for:

Teams running MCP servers with dozens or hundreds of tools who want to know if the model truly understands each one.
Engineers evaluating retrieval strategies before choosing between classic embeddings and parametric approaches.
Those building plugins or sub-agents on Claude Code that delegate tool selection to the model itself.

The framework is available via the link in the paper; associated repositories typically appear in the code section of arXiv or in the full PDF.

---

Editor's note: That someone has formalized the difference between "retrieving well" and "truly understanding" is useful and necessary. Static benchmarks have been too convenient a proxy for too long; ToolSense does not solve the underlying problem, but at least it makes it visible and measurable.