Why AI infrastructure is nothing like classic cloud
Scaling a language model isn't the same as scaling a web app. The infrastructure differences between AI and traditional cloud are deeper than most teams anticipate.
Moving a microservices application to public cloud is backed by well over a decade of documented best practices: horizontal autoscaling, load balancers, managed databases, per-request billing. When a team starts building on language models, they usually assume the same playbook applies. According to an analysis published this week on Substack by Raman Sharma, that assumption is one of the costliest mistakes engineers make today.
The article, which sparked discussion on Hacker News, isn't hype: it's a technical explanation of why patterns that work well in conventional cloud break down when the core component is an inference model.
The GPU problem isn't just about cost
In classic cloud, the compute unit is fungible. A virtual CPU is interchangeable; instances are added or removed in seconds based on demand. AI infrastructure breaks this abstraction in several ways:
- GPUs are not fungible with each other. An H100 and an A10G have radically different performance profiles for inference. Choosing the wrong accelerator for a given model can double latency or triple costs, and standard cloud dashboards will not make the cause obvious.
- Model loading time is a first-order factor. In traditional web services, container cold starts are measured in seconds. Loading the weights of a large model can take minutes. Reactive autoscaling, which works well for stateless APIs, creates severe bottlenecks when each new instance needs that initialization time.
- Video memory (VRAM) is the real scarce resource. Not host RAM, not CPUs. A model that doesn't fit in a GPU's VRAM must be partitioned across multiple units, introducing communication overhead with no equivalent in microservices architectures.
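The VRAM and loading-time points above can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes 2 bytes per parameter (fp16/bf16 weights), reserves a hypothetical 20% of each GPU for the KV cache and activations, and uses nominal capacities of 80 GB and 24 GB for the two accelerators. The function names, the headroom figure, and the 1 GB/s read rate are assumptions, not values from Sharma's article.

```python
import math

GB = 1e9  # decimal gigabytes, matching how GPU memory is usually advertised


def weights_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param


def gpus_needed(num_params: float, vram_gb: float,
                bytes_per_param: int = 2, headroom: float = 0.2) -> int:
    """Minimum GPU count so the weights fit, keeping `headroom` of each GPU
    free for KV cache and activations (an assumed, not measured, fraction)."""
    usable = vram_gb * GB * (1 - headroom)
    return math.ceil(weights_bytes(num_params, bytes_per_param) / usable)


def load_seconds(num_params: float, read_gb_per_s: float,
                 bytes_per_param: int = 2) -> float:
    """Lower bound on cold-start time: bytes to read divided by read rate."""
    return weights_bytes(num_params, bytes_per_param) / (read_gb_per_s * GB)


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("70B", 70e9)]:
        print(name,
              "| 80 GB GPUs:", gpus_needed(params, 80),
              "| 24 GB GPUs:", gpus_needed(params, 24),
              "| load at 1 GB/s:", round(load_seconds(params, 1.0)), "s")
```

Under these assumptions a 7B-parameter model fits on a single card of either size, while a 70B-parameter model has to be split across several GPUs and needs more than two minutes just to read its weights at 1 GB/s, before any request is served. That delay is what makes purely reactive autoscaling a poor fit.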
Latency vs. throughput: a trade-off classic cloud doesn't present this way
In a conventional REST API, low latency and high throughput can be pursued in parallel with relatively standard techniques (caching, CDN, replication). In LLM inference, the two objectives conflict structurally.
Dynamic batching, grouping multiple requests in a single model pass, improves throughput but increases latency for each individual request. Inference systems like vLLM or TensorRT-LLM expose levers to manage this trade-off, but they require design decisions that must be made before deployment, not as a later adjustment.
For teams used to cloud providers abstracting these details, the learning curve is steep. An engineer configuring autoscaling on EC2 or Cloud Run rarely needs to reason about the internal request scheduler. In inference, that scheduler is critical.
Statefulness and context: another asymmetry
Classic cloud embraced statelessness as a design principle. Applications needing state externalize it to databases or caches. LLMs complicate this: conversation context (message history, attached documents, tool results) grows during the session and must be available for each model call.
In production, this means decisions about where and how to store that context, how to route each request to the instance that already holds the session's cached prefix (or to any instance, at the cost of recomputing it), and how to manage sessions that might last minutes or hours. There's no direct equivalent to a web app's session store: the size, structure, and semantics of context are entirely different.
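One way to keep a session's requests on the replica most likely to hold its cached prefix is to make routing a pure function of the session identifier. The sketch below uses rendezvous (highest-random-weight) hashing, a swapped-in general technique that Sharma's article does not prescribe. The replica names and the `pick_replica` helper are hypothetical.

```python
import hashlib


def pick_replica(session_id: str, replicas: list[str]) -> str:
    """Rendezvous hashing: score every replica against the session and
    pick the highest score. The same session always maps to the same
    replica, and removing a replica only remaps the sessions it owned."""
    if not replicas:
        raise ValueError("no replicas available")

    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{replica}|{session_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=score)


if __name__ == "__main__":
    replicas = ["gpu-a", "gpu-b", "gpu-c"]  # hypothetical instance names
    for session in ("session-1", "session-2", "session-3"):
        print(session, "->", pick_replica(session, replicas))
```

Affinity routing like this only addresses the placement question. It ignores load balancing and cache eviction, which a production router would also have to handle.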
Who this matters for
Sharma's article is especially relevant for three profiles:
1. Platform teams evaluating whether their existing cloud infrastructure can host inference workloads without structural changes (usual answer: no, or not efficiently).
2. Backend engineers taking on the ML engineer role without specific training in model infrastructure.
3. CTOs and architects making build vs. buy decisions about the inference layer: using managed APIs (like Anthropic's API) versus deploying and operating models themselves.
There isn't yet an equivalent to AWS's Well-Architected Framework for LLM inference workloads. Each team is building its own conventions, with high trial-and-error costs.
---
We've seen this pattern for months in ClaudeWave projects integrating Claude: the problems usually aren't in the model or the prompt, but in an infrastructure layer that nobody designed with inference in mind. Sharma's article doesn't offer definitive solutions, but it names the problem correctly, which is already valuable.