OCR, Not the LLM, Is the Bottleneck in Document Processing Pipelines

When engineering teams bring Document AI to production, the conventional wisdom is that the language model will be the expensive part: the component that consumes the GPU, takes the most time, and scales poorly. A paper published this week on arXiv challenges that intuition with data from a real pipeline processing thousands of multi-page documents per hour. Its most striking conclusion: OCR, not the LLM, dominates the total latency of the system.

The work, titled "Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production" (arXiv:2605.18818), fills a gap that academic literature usually ignores: the space between "we have a working model" and "the model processes real load without failing."

What the architecture proposes

The authors describe a microservices architecture that chains three stages: document classification, optical character recognition, and structured field extraction via LLM. Each stage is an independent service, allowing them to scale separately based on their computational nature.

The design decisions they detail are exactly the ones any engineering team ends up making when incidents hit production:

Separation of GPU inference from CPU orchestration. Workers that call the model live in different processes from those coordinating the workflow. Mixing them causes hard-to-diagnose resource contention.
Asynchronous processing for I/O operations. Most waits in a document pipeline are I/O: reading files, calling APIs, writing results. Blocking threads on those waits is the fastest path to a queue that grows without bound.
Independent horizontal scaling per stage. Each microservice scales replicas up or down based on its own load profile, rather than scaling the entire system as a single block.
Hybrid classification. A lightweight classifier quickly discards documents that do not require full processing, and heavier models handle the rest, so unnecessary work is cut before it reaches the LLM.
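
A minimal Python sketch of how these decisions fit together, under stated assumptions rather than as the authors' implementation. The names `cheap_classifier`, `gpu_stage`, `run_batch`, and the `GPU_SLOTS` value are hypothetical, the model calls are replaced by short sleeps, and OCR and the LLM are assumed to share one inference pool. Everything runs in a single process for brevity; in the architecture described above each stage would be its own service.

```python
import asyncio

GPU_SLOTS = 2  # assumed size of the shared GPU inference pool (hypothetical)


async def cheap_classifier(doc: dict) -> bool:
    """Hypothetical lightweight filter: True if the document needs full processing."""
    return doc["pages"] > 0 and doc["kind"] != "blank"


async def gpu_stage(name: str, doc: dict, gpu: asyncio.Semaphore) -> str:
    """Stand-in for a GPU-backed call (OCR or LLM), gated by the shared pool."""
    async with gpu:                # wait here without blocking a thread
        await asyncio.sleep(0.01)  # placeholder for the real inference request
    return f"{name}:{doc['id']}"


async def process(doc: dict, gpu: asyncio.Semaphore):
    # Discard cheap cases before they ever reach OCR or the LLM.
    if not await cheap_classifier(doc):
        return None
    text = await gpu_stage("ocr", doc, gpu)
    fields = await gpu_stage("llm", doc, gpu)
    return {"id": doc["id"], "ocr": text, "fields": fields}


async def run_batch(docs: list) -> list:
    gpu = asyncio.Semaphore(GPU_SLOTS)
    # Orchestration is concurrent; only the GPU-backed steps are rate-limited.
    results = await asyncio.gather(*(process(d, gpu) for d in docs))
    return [r for r in results if r is not None]
```

Calling `asyncio.run(run_batch(docs))` with a list of dictionaries carrying `id`, `kind`, and `pages` keys returns results only for documents that pass the filter. The semaphore makes the shared inference capacity explicit, so orchestration concurrency and GPU concurrency can be tuned separately.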

The two surprises from batch profiling

The authors applied batch profiling to the complete pipeline and report two findings that, by their own account, they did not expect, and that change how deployments should be planned.

The first: OCR accounts for most of the end-to-end latency. In multi-page documents with complex layouts, optical recognition time clearly exceeds subsequent LLM inference time. This has direct implications for where to invest in hardware and which stage to optimize first.
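
To check whether the same pattern holds in your own pipeline, per-stage timing is enough. The sketch below is one way to do it, not the paper's profiling tooling; `StageProfiler` and the stage names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per pipeline stage over a batch."""

    def __init__(self):
        self.totals = defaultdict(float)

    def record(self, stage: str, seconds: float) -> None:
        self.totals[stage] += seconds

    @contextmanager
    def timed(self, stage: str):
        # Wrap a stage call: `with profiler.timed("ocr"): ...`
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(stage, time.perf_counter() - start)

    def shares(self) -> dict:
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {stage: t / total for stage, t in self.totals.items()}

    def bottleneck(self) -> str:
        """Stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)
```

Wrapping each stage call in `profiler.timed("classify")`, `profiler.timed("ocr")`, and `profiler.timed("llm")` over a representative batch, then reading `profiler.shares()`, shows where end-to-end latency actually goes before any hardware is bought.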

The second: the system doesn't saturate by number of workers, but by shared GPU inference capacity. Adding more orchestration workers without increasing GPU capacity doesn't improve throughput; it only increases the wait queue. The actual saturation point is set by the GPU, not the parallelism of the processes calling it.
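
A back-of-the-envelope model illustrates why. It is a standard bottleneck bound combined with Little's law, not a result from the paper, and every number used with it here is an illustrative assumption. Each worker is assumed to handle one document at a time, spending `cpu_s` seconds on orchestration and `gpu_s` seconds holding one of `gpu_slots` shared GPU slots.

```python
def throughput_bound(workers: int, gpu_slots: int, cpu_s: float, gpu_s: float) -> float:
    """Optimistic upper bound on documents per second.

    Throughput is capped either by how fast workers can cycle with no
    waiting, or by how fast the shared GPU pool can serve requests.
    """
    worker_limit = workers / (cpu_s + gpu_s)
    gpu_limit = gpu_slots / gpu_s
    return min(worker_limit, gpu_limit)


def mean_queue_wait(workers: int, gpu_slots: int, cpu_s: float, gpu_s: float) -> float:
    """Average seconds a document waits for a GPU slot at that bound.

    Little's law: with `workers` documents in flight, the mean cycle time
    is workers / throughput. Whatever exceeds the service time is waiting.
    """
    x = throughput_bound(workers, gpu_slots, cpu_s, gpu_s)
    cycle = workers / x
    return max(0.0, cycle - (cpu_s + gpu_s))
```

With assumed values of `gpu_slots=2`, `cpu_s=0.5`, and `gpu_s=2.0`, the bound is 0.8 documents per second at 2 workers and 1.0 at both 4 and 8 workers, while the mean wait grows from 0 to 1.5 to 5.5 seconds. Past the GPU limit, extra workers only add queueing time.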

Who this is useful for

This paper isn't aimed at model researchers. It's aimed at teams that already have a working document extraction model in staging and need to bring it to production without it breaking within two weeks.

It's especially relevant for those building integrations with Claude or other LLMs on top of enterprise document workflows: invoices, contracts, medical records, forms. In those contexts, documents don't arrive one at a time; they arrive in bursts, and the architecture that works in a notebook rarely handles a thousand pages per hour.

Companies using Claude via API for structured field extraction will find here a solid reference framework for the wrapper surrounding that call: how to manage the queue, how to size OCR replicas, and how not to confuse GPU saturation with lack of workers.

What it doesn't cover

The paper focuses on execution architecture, not model quality. It doesn't evaluate which LLM extracts fields better or which OCR engine makes fewer errors. It also doesn't address result caching strategies or handling documents with highly irregular layouts that break classifier assumptions.

These are limitations that the authors themselves acknowledge, and they leave room for future work.

---

The paper's main contribution is modest in the best sense: it proposes nothing new and instead documents rigorously what works. In a field where the literature is rich in model benchmarks and poor in operational guides, that has more practical value than it might seem.
