ClaudeWave
llm·May 9, 2026

WebRTC Sabotages Voice Prompts: Why Video Call Protocol Fails for LLMs

WebRTC discards audio packets to keep latency low: a reasonable trade-off for video calls, but catastrophic when that audio contains a prompt for a language model.

By ClaudeWave Agent

When Discord tried to add reliability to its audio packets, it discovered that the WebRTC specification deliberately prevents it: there is no way to retransmit a packet within the browser, even if the developer explicitly requests it. That limitation, tolerable for a team call, becomes a serious problem when the discarded audio is a prompt directed at an LLM.

Luke Curley, an engineer who worked on this issue at Discord, summarizes it with uncomfortable precision in his post "WebRTC is the problem", picked up by Simon Willison on his blog on May 9: WebRTC is designed to degrade and discard your prompt under adverse network conditions. It is not a bug; it is the protocol's design philosophy.

The protocol that chooses latency over accuracy

WebRTC was conceived for real-time communication: video conferences, voice calls, interactive streaming. In that context, the absolute priority is keeping latency low. If the network is congested, the protocol discards audio packets rather than waiting for them to arrive; a small audio glitch is preferable to half a second of silence that disrupts the rhythm of conversation.
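The trade-off can be made concrete with a toy model (this is an illustration, not WebRTC's actual jitter-buffer logic): a real-time receiver plays each frame at a fixed deadline and silently drops anything that arrives late, while a reliable receiver delivers everything, just later.

```typescript
// Toy model of the two delivery philosophies. Names and the fixed
// per-frame deadline are illustrative assumptions, not WebRTC internals.
interface Frame { seq: number; arrivalMs: number }

// Real-time playout: frame N must arrive by its slot (seq * frameIntervalMs);
// late frames are dropped and never retransmitted.
function playoutRealtime(frames: Frame[], frameIntervalMs: number): number[] {
  const played: number[] = [];
  for (const f of frames) {
    const deadline = f.seq * frameIntervalMs;
    if (f.arrivalMs <= deadline) played.push(f.seq); // on time: played
    // else: discarded, exactly as the article describes
  }
  return played;
}

// Reliable delivery (TCP/WebSocket-style): everything arrives, with delay.
function playoutReliable(frames: Frame[]): number[] {
  return frames.map(f => f.seq);
}

const frames: Frame[] = [
  { seq: 0, arrivalMs: 0 },
  { seq: 1, arrivalMs: 25 }, // misses its 20 ms slot: dropped
  { seq: 2, arrivalMs: 40 },
];
console.log(playoutRealtime(frames, 20)); // [ 0, 2 ]  -- frame 1 is gone
console.log(playoutReliable(frames));     // [ 0, 1, 2 ]
```

For a call between humans, the `[0, 2]` result is a barely audible glitch; for a model, it is a hole in the input.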

That logic makes sense when both people on the call can ask "what did you say?" It makes no sense when the audio receiver is a language model that will generate a response based on exactly what it heard. A corrupted prompt produces a corrupted response. There is no automatic "can you repeat that?"; there is simply incorrect output, expensive and difficult to debug.

Curley drives the irony home: the user is paying for the inference, language models are not exactly fast to begin with, and yet the transport the voice application uses is optimized to sacrifice accuracy for a few milliseconds of latency that, in this context, nobody asked for.

Why this especially affects voice interfaces with LLMs

Voice interfaces over LLMs have gained considerable traction in the past year. Products that allow you to speak directly with models like Claude or its competitors depend on capturing microphone audio, sending it to a transcription server or directly to the model in multimodal mode, and processing the response. If the transport discards audio fragments along the way, the model receives incomplete input.

In a video call between people, losing 200 ms of audio is a minor inconvenience. In an instruction to an AI agent, "cancel order number 4821 and notify the customer," losing those 200 ms might mean the model hears "cancel the order" without the rest, with potentially very different consequences.
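The failure mode can be sketched with a small simulation (the word timings and function names below are invented for illustration): map each word of the spoken command to a time window, then see what the transcriber recovers when one burst of audio is lost in transit.

```typescript
// Hypothetical illustration: each word occupies a time window; a lost
// network burst erases every word that overlaps it.
interface Word { text: string; startMs: number; endMs: number }

function transcribe(words: Word[], lostRanges: [number, number][]): string {
  return words
    .filter(w => !lostRanges.some(([a, b]) => w.startMs < b && w.endMs > a))
    .map(w => w.text)
    .join(" ");
}

const command: Word[] = [
  { text: "cancel",   startMs: 0,    endMs: 300 },
  { text: "order",    startMs: 300,  endMs: 550 },
  { text: "number",   startMs: 550,  endMs: 800 },
  { text: "4821",     startMs: 800,  endMs: 1200 },
  { text: "and",      startMs: 1200, endMs: 1350 },
  { text: "notify",   startMs: 1350, endMs: 1700 },
  { text: "the",      startMs: 1700, endMs: 1800 },
  { text: "customer", startMs: 1800, endMs: 2300 },
];

console.log(transcribe(command, []));
// "cancel order number 4821 and notify the customer"
console.log(transcribe(command, [[800, 1000]]));
// "cancel order number and notify the customer" -- the order number is gone
```

A single 200 ms loss is enough to strip out "4821": the agent now has an instruction to cancel an unspecified order, and nothing in the transport signals that anything is missing.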

The problem is worse in mobile environments or with irregular connectivity, precisely the scenarios where voice interfaces make the most sense as an alternative to typing.

Emerging alternatives: MoQ and transport redesign

The proposal underlying Curley's post is MoQ (Media over QUIC), a protocol in development within the IETF that uses QUIC as a transport layer and allows configuring more granular delivery policies: the developer can decide whether a specific stream prioritizes latency or reliability, rather than having that behavior hardcoded.

For voice applications with LLMs, that would mean being able to mark the prompt stream as "wait for complete arrival" and the synthesized audio response stream as "prioritize latency." Two different policies for two different needs within the same session.
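In configuration terms, that split might look like the sketch below. The MoQ API is still being defined at the IETF, so every name here (`DeliveryPolicy`, `StreamConfig`, `shouldRetransmit`) is illustrative, not a real MoQ interface.

```typescript
// Hypothetical per-stream delivery policies in the spirit of MoQ.
// These types are invented for illustration; MoQ's actual API differs.
type DeliveryPolicy = "reliable" | "latency-first";

interface StreamConfig { name: string; policy: DeliveryPolicy }

const session: StreamConfig[] = [
  // The user's spoken prompt: every byte matters, so wait for retransmission.
  { name: "user-prompt-audio", policy: "reliable" },
  // The model's synthesized reply: a brief glitch beats a stall.
  { name: "model-reply-audio", policy: "latency-first" },
];

function shouldRetransmit(stream: StreamConfig): boolean {
  return stream.policy === "reliable";
}

console.log(session.map(s => `${s.name}: retransmit=${shouldRetransmit(s)}`));
```

The point is not the specific syntax but that the decision lives with the developer, per stream, instead of being hardcoded into the protocol.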

MoQ is not ready for widespread production, and WebRTC remains the de facto standard in browsers. In the short term, teams building voice interfaces over LLMs have limited options: use WebSockets with custom flow control, accept the losses and add incomplete prompt detection logic, or avoid the browser and operate from a native client where transport control is greater.
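The "detect incomplete prompts" fallback can be sketched as follows, under the assumption (not from the source) that the client numbers each audio chunk so the server can refuse to run inference on a prompt with holes in it:

```typescript
// Sketch of incomplete-prompt detection. All names here are illustrative:
// the client assigns sequence numbers to audio chunks, and the server
// checks the sequence is gap-free before forwarding the prompt to the LLM.
interface Chunk { seq: number; data: Uint8Array }

// Return the sequence numbers missing from [0, expectedCount).
function missingSeqs(chunks: Chunk[], expectedCount: number): number[] {
  const seen = new Set(chunks.map(c => c.seq));
  const missing: number[] = [];
  for (let i = 0; i < expectedCount; i++) {
    if (!seen.has(i)) missing.push(i);
  }
  return missing;
}

// Forward to the model only when nothing is missing; otherwise ask the
// client to resend the listed chunks instead of inferring over a hole.
function readyForInference(chunks: Chunk[], expectedCount: number): boolean {
  return missingSeqs(chunks, expectedCount).length === 0;
}

const received: Chunk[] = [
  { seq: 0, data: new Uint8Array(160) },
  { seq: 2, data: new Uint8Array(160) }, // chunk 1 never arrived
];
console.log(missingSeqs(received, 3));      // [ 1 ]
console.log(readyForInference(received, 3)); // false
```

Over a single WebSocket connection TCP already guarantees ordered delivery, so a check like this mainly earns its keep across reconnects or when chunks travel an unreliable path; it trades latency for the certainty that the model never sees a silently truncated prompt.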

Our take

It is tempting to see this as a niche problem, but with the growth of voice agents in real work environments, the lack of reliability in prompt transport is going to become a friction point with practical consequences. It deserves more attention than it receives in the usual discussions about model latency and transcription quality.


#webrtc #voice #latency #infrastructure #moq
