DeepSlide: Generating AI Presentations Is More Than Making Pretty Slides

Most AI slide generation tools compete to produce the most visually attractive deck. Almost none of them measure whether the speaker can actually deliver that material to a real audience within the allotted time. This week, a team of researchers published "DeepSlide: From Artifacts to Presentation Delivery" on arXiv, a paper that directly challenges this bias.

The distinction the paper draws is simple but has practical consequences: the artifact (the slides as a static object) is not the same thing as the delivery (the process of presenting them, with pacing, narrative flow, and speaker preparation). Optimizing only the first leaves half the problem unsolved.

What DeepSlide Is and How It Works

DeepSlide is a multi-agent system with human-in-the-loop intervention that covers the complete cycle of presentation preparation. According to the paper, the system integrates four main components:

A controllable logical chain planner with time budgets per node, allowing you to allocate how many minutes each section of the narrative deserves before generating a single slide.
A content-tree retriever to ground content in concrete evidence, rather than generating plausible text without basis.
Sequential Markov-style rendering with style inheritance between slides, to maintain visual coherence without each slide being an isolated design.
Sandbox execution with minimal repair, to ensure the result is always renderable without extensive manual intervention.
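
To make the first and third components more concrete, here is a minimal Python sketch. It is not the paper's implementation: the Node and Style classes, the proportional weighting rule, and the function names are all assumptions introduced for illustration. It shows a time budget being split across narrative nodes before any slide exists, followed by a sequential pass in which each slide's style is derived only from the previous slide's style.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    """One step of the narrative chain (hypothetical structure)."""
    title: str
    weight: float  # relative importance chosen by the speaker


@dataclass(frozen=True)
class Style:
    """Visual state handed from one slide to the next (hypothetical)."""
    font: str = "Inter"
    accent: str = "#1f6feb"
    layout: str = "title+bullets"


def allocate_minutes(nodes, total_minutes):
    """Split the talk length across nodes in proportion to their weights."""
    total_weight = sum(n.weight for n in nodes)
    return {n.title: total_minutes * n.weight / total_weight for n in nodes}


def render_sequentially(nodes, initial_style, overrides):
    """Markov-style pass: each slide's style depends only on the previous one."""
    style = initial_style
    slides = []
    for node in nodes:
        # Apply this node's overrides on top of the inherited style.
        style = replace(style, **overrides.get(node.title, {}))
        slides.append((node.title, style))
    return slides


chain = [Node("Problem", 1), Node("Method", 2), Node("Results", 2)]
budget = allocate_minutes(chain, total_minutes=10)
slides = render_sequentially(chain, Style(), {"Method": {"layout": "diagram"}})

for title, style in slides:
    print(f"{title}: {budget[title]:.1f} min, layout={style.layout}")
```

In this sketch the "Results" slide keeps the diagram layout it inherits from "Method", which is the kind of visual continuity that style inheritance is meant to provide. A real system would presumably carry a much richer style state than three fields.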

Additionally, the system includes modules for requirements elicitation (what the presentation needs to achieve and for which audience), attention augmentation (aids that highlight what is critical on each slide), and active rehearsal support.

The Dual Scoreboard Benchmark

This is arguably the most interesting contribution of the work from a methodological standpoint. The authors introduce a dual scoreboard benchmark that explicitly separates two dimensions:

1. Quality of the static artifact: design, visual structure, readability.
2. Excellence in dynamic delivery: narrative fluidity, pacing accuracy, coherence between script and slide.

The metrics currently used in the field either mix the two dimensions or focus almost exclusively on the first. According to the results reported in the paper, DeepSlide matches baseline systems on artifact quality but shows consistently larger gains on delivery metrics, including narrative flow and pacing accuracy, in evaluations spanning 20 domains and different audience profiles.
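
As a rough illustration of what keeping the two dimensions apart can look like in an evaluation pipeline, here is a small Python sketch. The metric definition is an assumption, not the paper's formula: pacing accuracy is taken here as one minus the mean relative deviation between planned and rehearsed time per section, with each section's error capped at 1. The Scoreboard class and the example scores are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Scoreboard:
    """Keeps artifact and delivery scores apart instead of averaging them."""
    artifact: dict = field(default_factory=dict)  # e.g. design, readability
    delivery: dict = field(default_factory=dict)  # e.g. pacing, narrative flow


def pacing_accuracy(planned, actual):
    """Assumed metric: 1 - mean relative timing error, each error capped at 1."""
    if len(planned) != len(actual) or not planned:
        raise ValueError("planned and actual must be non-empty and equal length")
    errors = [min(abs(a - p) / p, 1.0) for p, a in zip(planned, actual)]
    return 1.0 - sum(errors) / len(errors)


planned_minutes = [2.0, 4.0, 4.0]  # time budget per section
actual_minutes = [3.0, 4.0, 5.0]   # timings measured during a rehearsal

board = Scoreboard()
board.artifact["readability"] = 0.9  # placeholder value, not a real result
board.delivery["pacing_accuracy"] = pacing_accuracy(planned_minutes, actual_minutes)

print(board)
```

Reporting the two dictionaries side by side, rather than collapsing them into one number, means a deck that looks polished but runs over time still shows up as a delivery problem.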

Who This Is Useful For

The work has immediate application in at least three contexts:

Researchers and academics preparing scientific presentations with limited speaking time (conferences, thesis defenses, poster pitches). Explicitly planning how much time each part of the narrative gets is especially useful here.
Teams using AI agents for content production, where this type of multi-agent architecture with sandbox and minimal repair can serve as a design reference, regardless of domain.
Developers of AI-powered productivity tools who want to integrate delivery metrics into their evaluation pipelines, something the proposed benchmark makes more accessible.

That said, the paper is academic research published this week, with no publicly available code and no integration with any known platform at the time of writing. The gap between an arXiv result and a usable tool is not always short.

Our Take

The conceptual separation between artifact and delivery that DeepSlide proposes is more valuable than the architecture itself: it names a problem that anyone who has used these tools has felt but struggled to articulate. If the dual scoreboard benchmark is adopted by other teams, it could significantly improve how this category of tools is evaluated in the coming months.
