PathoSage: Pathological Reasoning Without Context Contamination
A new agent framework for computational pathology separates evidence retrieval, collection, and adjudication to reduce hallucinations and tool conflicts.
Multimodal models applied to computational pathology have a concrete and well-documented problem: when asked to reason about tissue images at the patch level, they tend to hallucinate morphological features that aren't actually there. This is compounded by another less visible but equally dangerous issue: when an agentic system combines outputs from multiple tools and retrieved knowledge in the same context, a single incorrect piece of evidence can contaminate all subsequent reasoning.
That's precisely what PathoSage aims to solve. The paper, posted to arXiv on 9 June 2026 (2606.07549), proposes a three-stage framework that keeps knowledge retrieval, evidence collection, and final adjudication separate. The key is not the separation alone, but the fact that the final decision is made in a clean context, where earlier outputs cannot bias the judgment.
Three stages, a clean context for decision-making
PathoSage's architecture is built around what its authors call Structured Evidence Deliberation (SED). Unlike typical agentic systems, where tools, retrievals, and prior context accumulate in a single reasoning thread, SED evaluates each heterogeneous source of evidence independently, performs a conflict analysis between them, and only then generates the final judgment in a fresh context.
This design choice addresses a classic problem in chain-of-thought reasoning: anchoring bias, the tendency for a model to stay fixed on the first piece of information it receives, even when that information is wrong. By isolating the adjudication phase, PathoSage reduces the likelihood that an incorrect classifier output or a noisy retrieval will lead the entire conclusion astray.
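The flow described above can be sketched in a few lines. This is only an illustration of the pattern, not PathoSage's implementation: the `Evidence` record, the function names, and the `judge` callable (standing in for a model call that sees nothing but the prompt it is given) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Evidence:
    source: str        # hypothetical tool or retrieval name
    label: str         # what this source claims
    confidence: float  # the source's own confidence in [0, 1]


def assess_independently(raw_outputs: dict[str, tuple[str, float]]) -> list[Evidence]:
    """Turn each source's output into a record without looking at the others."""
    return [Evidence(src, label, conf)
            for src, (label, conf) in sorted(raw_outputs.items())]


def find_conflicts(evidence: list[Evidence]) -> list[tuple[str, str]]:
    """Return every pair of sources whose labels disagree."""
    conflicts = []
    for i, a in enumerate(evidence):
        for b in evidence[i + 1:]:
            if a.label != b.label:
                conflicts.append((a.source, b.source))
    return conflicts


def build_clean_context(question: str,
                        evidence: list[Evidence],
                        conflicts: list[tuple[str, str]]) -> str:
    """Build the adjudication prompt from structured records only.

    No tool transcripts or intermediate reasoning are carried over.
    """
    lines = [f"Question: {question}", "Evidence:"]
    lines += [f"- {e.source}: {e.label} (confidence {e.confidence:.2f})"
              for e in evidence]
    lines.append("Conflicts:")
    lines += [f"- {a} vs {b}" for a, b in conflicts] or ["- none"]
    return "\n".join(lines)


def adjudicate(question: str,
               raw_outputs: dict[str, tuple[str, float]],
               judge: Callable[[str], str]) -> str:
    """Collect, analyze conflicts, then decide in a fresh context."""
    evidence = assess_independently(raw_outputs)
    conflicts = find_conflicts(evidence)
    return judge(build_clean_context(question, evidence, conflicts))
```

The point of the sketch is the last function: `judge` receives a freshly built prompt, so nothing from the collection phase can reach it except through the structured evidence and the explicit conflict list.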
An experience system without additional training
The second notable component is the Beta-Bernoulli experience system with continuous credit assignment. Its function is to model the long-term reliability of each tool in the agent: if a tool has repeatedly failed in similar cases, the system lowers its weight in future reasoning by building priors weighted by case similarity.
What's relevant here is that this happens without retraining. There's no fine-tuning cycle to update the base model's weights; experience accumulates in the parameters of the Beta distributions assigned to each tool. It's a mechanism reminiscent of Bayesian bandit systems, transposed into the context of tool selection and weighting in a clinical workflow.
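As a rough illustration of how such a system could work, here is a minimal sketch of a similarity-weighted Beta-Bernoulli record for a single tool. The paper's exact update rule is not described here, so everything below is an assumption: the `ToolExperience` class, the use of cosine similarity over case embeddings, and the treatment of a credit in [0, 1] as fractional pseudo-counts (the textbook Beta-Bernoulli update generalized to partial successes).

```python
import math
from dataclasses import dataclass, field


def similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity clipped to [0, 1] (an assumed choice of metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return max(0.0, dot / norm) if norm else 0.0


@dataclass
class ToolExperience:
    """Illustrative per-tool experience record; not the paper's code."""
    prior_alpha: float = 1.0  # uniform Beta(1, 1) prior by default
    prior_beta: float = 1.0
    # Each entry is (case embedding, credit in [0, 1]).
    history: list[tuple[list[float], float]] = field(default_factory=list)

    def record(self, case: list[float], credit: float) -> None:
        """Store an outcome; credit 1.0 is a full success, 0.0 a full failure."""
        if not 0.0 <= credit <= 1.0:
            raise ValueError("credit must be in [0, 1]")
        self.history.append((list(case), credit))

    def reliability(self, case: list[float]) -> float:
        """Posterior mean of the Beta built from similarity-weighted history."""
        alpha, beta = self.prior_alpha, self.prior_beta
        for past_case, credit in self.history:
            w = similarity(case, past_case)
            alpha += w * credit          # weighted success pseudo-count
            beta += w * (1.0 - credit)   # weighted failure pseudo-count
        return alpha / (alpha + beta)
```

With this sketch, two full failures on cases identical to the current one move the estimate from 0.5 to 1 / (1 + 3) = 0.25, while the same failures leave the estimate at 0.5 for a case whose embedding is orthogonal to them. No model weights change at any point; only the stored counts do.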
Why it matters in pathology and beyond
Computational pathology is a domain where the consequences of a hallucination are not abstract: misclassifying tumor tissue can influence diagnostic decisions. Current end-to-end MLLM systems have improved greatly on general benchmarks, but patch-level reasoning remains a fragile point, partly because histological images contain patterns that are highly local and sometimes ambiguous even for human pathologists.
PathoSage differs from previous proposals in that it doesn't attempt to solve the problem by increasing the base model's capacity, but rather by redesigning the information flow between tools and decision. It's an architectural, not parametric, approach, which in principle makes it more transferable to other medical domains where multimodal evidence is heterogeneous and potentially contradictory: radiology, dermatology, biomarker analysis.
According to the reported experiments, the framework mitigates hallucinations in Visual Question Answering (VQA) tasks and reduces disagreement between classifiers. The paper has not yet undergone peer review, however, which is worth keeping in mind before extrapolating the results.
Who this is relevant to right now
In the short term, this work is of primary interest to applied research teams in medical AI and engineering groups building assisted diagnostic pipelines. The idea of an isolated adjudication context and the per-tool credit system are principles that, with necessary adaptations, can be applied to any agentic architecture where multiple specialized tools must combine reliably.
For those of us working with agent frameworks in high-stakes environments, PathoSage offers a concrete design pattern: don't mix collection with decision-making. It's a useful reminder that context contamination isn't just a matter of window size, but of workflow architecture.
---
From our perspective, this is a solid piece of work in its architectural approach, especially the credit mechanism without retraining. We'll need to wait for peer review and testing in real clinical settings to gauge how far the practical advantage extends over simpler agentic systems.