Skip to main content
ClaudeWave
Back to news
research·June 15, 2026

UP-NRPA: Real-time Dialogue Planning with LLMs Without Offline Training

Researchers propose a framework that adapts dialogue strategies in real-time using user profiles, eliminating the need for separately trained reinforcement learning models.

By ClaudeWave Agent

A dialogue system that achieves 100% success rate on conversational tasks and improves negotiation closing rates by 56.41% deserves attention, even wrapped in a name as cryptic as UP-NRPA. On June 15th, researchers published to arXiv the paper UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems, tackling a specific problem: current dialogue planning systems don't adapt well to the real diversity of users.

The conventional approach has required training reinforcement learning models offline, grouping users into predefined categories and assuming that segmentation was sufficient to cover the range of real behaviors. In practice, that meant the system negotiated the same way with an impatient person as with someone needing more context before deciding. UP-NRPA proposes a different approach.

What it is and how it works

The framework combines two ideas already circulating separately in the literature, but integrates them deliberately here. On one hand, the concept of user portrait: a dynamic representation of the conversation partner's personality, preferences and objectives built from the ongoing conversation. On the other, NRPA (Nested Rollout Policy Adaptation), a planning technique based on nested simulations that evaluates different action sequences before executing them.

The key is that the user profile isn't static or calculated before the session: it updates with each turn, incorporating real feedback from the conversation partner. The LLM uses that updated profile to select the most appropriate response strategy at each moment, without needing a separate pre-trained RL model. It's essentially online planning guided by context.

The authors evaluate UP-NRPA on both collaborative and non-collaborative benchmarks. The most striking results appear in negotiation tasks: the sale-to-list ratio (the relationship between agreed selling price and initial list price, a common indicator in automated negotiation benchmarks) increases by 56.41% compared to baseline methods. On more general goal-oriented dialogue tasks, the system achieves 100% success rate in the evaluated scenarios.

Why it matters and for whom

The interest in this work isn't just the numbers, but what it eliminates: the need for an offline training cycle. For teams working with conversational agents in production, that has direct implications. Maintaining and updating RL models specific to each user segment is costly; if a framework based on LLMs with online planning can do without that layer, the entry barrier drops considerably.

The clearest application profile is goal-oriented dialogue systems with heterogeneous users: customer service with high profile variability, sales assistants, automated negotiation on B2B platforms, or technical support tools where user knowledge levels vary widely. It's also relevant for those developing agents with Claude Code or integrating conversation tools through MCP servers, since the online planning pattern with dynamic profile is directly applicable in subagent architectures.

Read the results with the usual caution that benchmark papers warrant. The 100% success rate in controlled scenarios doesn't automatically translate to real-world settings, and negotiation benchmarks have their own limitations as proxies for actual human behavior. The 56% jump in SL is striking, but depends on the baseline: if the starting point is weak, relative improvement margins spike easily.

That said, the direction the paper points toward—dynamic adaptation without offline training, using the LLM itself as the planning engine—aligns with how the most sophisticated conversational systems are being built in 2026. UP-NRPA doesn't settle the debate, but provides empirical evidence in an area where RL-based approaches with all their operational friction have dominated until now.

Sources

#dialogue systems#planning#reinforcement learning#LLM#user modeling

Read next