DeepSlide: Generating AI Presentations Is More Than Making Pretty Slides
A new multi-agent system separates the visual quality of slides from actual presentation delivery: pacing, narrative flow, and speaker preparation.
Most AI slide generation tools compete to produce the most visually attractive deck. Almost none of them measure whether a speaker can actually deliver that material to a real audience within the allotted time. This week, a team of researchers published DeepSlide: From Artifacts to Presentation Delivery on arXiv, a paper that directly challenges this bias.
The distinction the paper draws is simple but has practical consequences: the artifact (the slides as a static object) is not the same thing as the delivery (the process of presenting them, with pacing, narrative flow, and speaker preparation). Optimizing only the first leaves half the problem unsolved.
What DeepSlide Is and How It Works
DeepSlide is a multi-agent system with human-in-the-loop intervention that covers the complete cycle of presentation preparation. According to the paper, the system integrates four main components:
- A controllable logical chain planner with a time budget per node, which lets you decide how many minutes each section of the narrative deserves before a single slide is generated.
- A content-tree retriever to ground content in concrete evidence, rather than generating plausible text without basis.
- Sequential Markov-style rendering with style inheritance between slides, so that the deck stays visually coherent instead of each slide being designed in isolation.
- Sandbox execution with minimal repair, to ensure the result is always renderable without extensive manual intervention.
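To make the first and third components more concrete, here is a minimal sketch of how per-node time budgets and style inheritance could fit together. It is not the authors' implementation: `ChainNode`, `allocate_time`, `render_sequentially`, the proportional-weight rule, and the `style_overrides` field are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChainNode:
    """One step of the narrative chain (hypothetical structure)."""
    title: str
    weight: float                      # relative importance set by the speaker
    minutes: float = 0.0               # filled in by allocate_time
    style_overrides: dict = field(default_factory=dict)


def allocate_time(nodes, total_minutes, min_minutes=0.5):
    """Split total_minutes across nodes in proportion to their weights.

    Every node first receives min_minutes; the rest is shared by weight.
    This rule is an assumption, not the paper's planner.
    """
    if not nodes:
        return nodes
    reserved = min_minutes * len(nodes)
    if reserved > total_minutes:
        raise ValueError("total_minutes is too small for the per-node minimum")
    total_weight = sum(n.weight for n in nodes)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    remaining = total_minutes - reserved
    for n in nodes:
        n.minutes = min_minutes + remaining * n.weight / total_weight
    return nodes


def render_sequentially(nodes, base_style):
    """Markov-style pass: each slide's style depends only on the previous one."""
    slides = []
    style = dict(base_style)
    for n in nodes:
        # Keys not overridden by this node are inherited from the previous slide.
        style = {**style, **n.style_overrides}
        slides.append({"title": n.title, "minutes": n.minutes, "style": dict(style)})
    return slides


talk = [
    ChainNode("Motivation", weight=1),
    ChainNode("Method", weight=2, style_overrides={"accent": "red"}),
    ChainNode("Results", weight=1),
]
allocate_time(talk, total_minutes=10, min_minutes=1)
deck = render_sequentially(talk, {"font": "Inter", "accent": "blue"})
```

In this example the third slide keeps the red accent introduced on the second one, because it inherits from its predecessor rather than from the base style. The time budgets are fixed before any slide exists, which mirrors the planning-first order described above.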
The Dual Scoreboard Benchmark
The benchmark is arguably the work's most interesting contribution from a methodological standpoint. The authors introduce a dual scoreboard that explicitly separates two dimensions:
1. Quality of the static artifact: design, visual structure, readability.
2. Excellence in dynamic delivery: narrative fluidity, pacing accuracy, coherence between script and slide.
Metrics currently used in the field either mix the two dimensions or focus almost exclusively on the first. According to the results reported in the paper, DeepSlide matches baseline systems on artifact quality but shows consistently larger gains on delivery metrics, including narrative flow and pacing accuracy, in evaluations spanning 20 domains and different audience profiles.
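The idea of keeping the two scoreboards apart can be sketched in a few lines. The article does not give the paper's metric definitions, so everything here is an assumption: `pacing_accuracy` is defined as one minus the mean relative deviation from the planned time, and `DualScoreboard` simply averages whatever scores it is given.

```python
from dataclasses import dataclass
from statistics import mean


def pacing_accuracy(planned, actual):
    """Return 1.0 when every section hits its budget, less as sections drift.

    Illustrative definition, not the paper's:
    1 - mean relative deviation from the plan, floored at 0.
    """
    if not planned or len(planned) != len(actual):
        raise ValueError("planned and actual must be non-empty and equal length")
    if any(p <= 0 for p in planned):
        raise ValueError("planned durations must be positive")
    deviations = [abs(a - p) / p for p, a in zip(planned, actual)]
    return max(0.0, 1.0 - mean(deviations))


@dataclass
class DualScoreboard:
    """Two score groups, each with values in [0, 1] (hypothetical structure)."""
    artifact: dict   # e.g. design, visual structure, readability
    delivery: dict   # e.g. narrative flow, pacing accuracy, script-slide coherence

    def summary(self):
        # The two averages are reported side by side and never merged.
        return {
            "artifact": mean(list(self.artifact.values())),
            "delivery": mean(list(self.delivery.values())),
        }


pacing = pacing_accuracy(planned=[2, 4, 2], actual=[2, 5, 2])
board = DualScoreboard(
    artifact={"design": 0.8, "readability": 0.6},
    delivery={"pacing": pacing, "flow": 0.5},
)
scores = board.summary()
```

Reporting `scores["artifact"]` and `scores["delivery"]` separately is the point of the design: a deck that looks excellent but runs over time cannot hide behind a single blended number.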
Who This Is Useful For
The work has immediate application in at least three contexts:
- Researchers and academics preparing scientific presentations with limited speaking time (conferences, thesis defenses, poster pitches). Explicit time planning for the narrative is especially useful here.
- Teams using AI agents for content production, where this type of multi-agent architecture with sandbox and minimal repair can serve as a design reference, regardless of domain.
- Developers of AI-powered productivity tools who want to integrate delivery metrics into their evaluation pipelines, something the proposed benchmark makes more accessible.
Our Take
The conceptual separation between artifact and delivery that DeepSlide proposes is more valuable than the architecture itself: it names a problem that anyone who has used these tools has felt but struggled to articulate. If the dual scoreboard benchmark is adopted by other teams, it could significantly improve how this category of tools is evaluated in the coming months.