SMG: Why Separate CPU and GPU in LLM Serving
LightSeek proposes SMG, an architecture that separates memory-heavy work from GPU compute in language model serving to reduce costs and improve performance.
Serving a language model at scale has a structural problem that rarely surfaces outside infrastructure teams: the GPU performs two very different jobs, prefill and decode, with completely different memory and compute requirements, and forcing them to coexist on the same hardware carries a real cost. The LightSeek team has just published, on the official PyTorch blog, a detailed technical analysis of SMG (Separated Memory and GPU), an architecture that proposes decoupling the two phases so they can be served independently.
The article, available at pytorch.org/blog/lightseek-smg, arrives at a moment when inference cost remains one of the bottlenecks most often cited by teams deploying models in production, and it offers concrete engineering insight into how to reorganize the serving stack.
Prefill and Decode: Two Different Problems on the Same Hardware
When an LLM processes a request, the work splits into two well-defined phases. Prefill is compute-intensive: it processes the entire prompt in parallel and produces the first token. Decode is iterative and sequential: it generates one token at a time, under high memory pressure to maintain the KV cache. Mixing the two phases on the same GPUs creates what the literature calls GPU bubbles: periods in which the hardware sits underutilized because the access patterns of the two phases interfere with each other.
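To make the distinction concrete, here is a minimal sketch in PyTorch (not LightSeek's code): a toy single-head attention that runs prefill as one parallel pass over the prompt and decode as a token-by-token loop that keeps appending to the KV cache. Dimensions, sequence lengths, and weights are invented for illustration.

```python
import torch

torch.manual_seed(0)
d = 64                       # toy head dimension, illustrative only
Wq = torch.randn(d, d)
Wk = torch.randn(d, d)
Wv = torch.randn(d, d)

def attend(q, K, V):
    # Standard scaled dot-product attention over the cached keys/values.
    scores = (q @ K.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

# --- Prefill: the whole prompt is processed in one parallel, compute-heavy pass.
prompt = torch.randn(512, d)             # 512 prompt "tokens" as embeddings
K_cache = prompt @ Wk                     # keys for every prompt position at once
V_cache = prompt @ Wv
first_out = attend(prompt[-1] @ Wq, K_cache, V_cache)  # context for the first generated token

# --- Decode: one token per step; each step is small compute, but the KV cache keeps growing.
x = first_out
for _ in range(128):
    k = (x @ Wk).unsqueeze(0)
    v = (x @ Wv).unsqueeze(0)
    K_cache = torch.cat([K_cache, k])     # memory pressure: the cache grows every step
    V_cache = torch.cat([V_cache, v])
    x = attend(x @ Wq, K_cache, V_cache)

print(K_cache.shape)   # torch.Size([640, 64]): 512 prompt + 128 generated positions
```

The point of the sketch is the asymmetry: prefill is one large, parallel matrix-multiply workload, while decode is a long sequence of small steps whose dominant cost is reading and extending the cache.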
SMG tackles this by moving the decode phase, or at least its KV cache management, to CPUs and system memory, freeing the GPU to focus on the dense compute work of prefill. The intuition is not new, but LightSeek's proposal adds implementation detail on how to coordinate data transfer between CPU and GPU without introducing latency that negates the benefit.
What SMG Proposes Exactly
The SMG architecture decouples three key elements:
- KV cache management on CPU: the KV cache grows with context length. Storing it in system memory, cheaper and more abundant than VRAM, allows serving long contexts without scaling GPU count linearly.
- Phase-separated scheduling: the request scheduler differentiates between jobs in prefill phase and jobs in decode, assigning them to separate resources and avoiding contention.
- Asynchronous transfer: KV cache blocks move between CPU and GPU asynchronously during cycles when the GPU doesn't need them, minimizing the impact on generation latency (sketched below).
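The article does not include code for this mechanism, but the asynchronous-transfer idea maps onto standard PyTorch primitives: pinned host memory plus a dedicated CUDA stream lets KV blocks move to the GPU while the default stream keeps computing. Below is a minimal sketch under those assumptions; the block shape and the prefetch_block helper are invented for illustration, not part of SMG.

```python
import torch

assert torch.cuda.is_available(), "sketch requires a CUDA device"

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()        # side stream dedicated to cache transfers

# KV cache blocks parked in cheap, abundant system RAM (pinned so copies can be async).
# Block shape is made up: 256 tokens x 8 heads x 128 head_dim.
cpu_blocks = [torch.randn(256, 8, 128, pin_memory=True) for _ in range(16)]

def prefetch_block(idx: int) -> tuple[torch.Tensor, torch.cuda.Event]:
    """Start copying one KV block host->device on the side stream without waiting for it."""
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        gpu_block = cpu_blocks[idx].to(device, non_blocking=True)
        done.record(copy_stream)
    return gpu_block, done

# Kick off the transfer for the block the next decode step will need...
gpu_block, ready = prefetch_block(0)

# ...while the default stream stays busy with compute (stand-in for prefill/decode math).
a = torch.randn(4096, 4096, device=device)
b = a @ a

# Only when the block is actually consumed does the compute stream wait on the copy.
torch.cuda.current_stream().wait_event(ready)
print(gpu_block.shape, gpu_block.device)
```

The design choice that matters is the event: compute only waits on the copy at the moment the block is actually needed, so a well-timed prefetch hides the PCIe transfer behind ongoing GPU work.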
Who This Matters For
This proposal is relevant mainly to three profiles. First, teams deploying their own models on proprietary infrastructure with room to modify their serving stack; for them, SMG offers a path to squeeze more performance from existing hardware before scaling horizontally. Second, inference-as-a-service providers that charge per token, where every efficiency gain translates directly into margin. Third, teams working with long context windows, like those offered by Claude Opus 4.7 with its 1M token window, where KV cache pressure is especially severe and the cost of keeping it in VRAM becomes prohibitive.
For small projects or those using managed APIs, the impact is indirect: if providers adopt techniques like SMG, cost per token should decline over time.
Integration with the PyTorch Ecosystem
That the article appears on the official PyTorch blog is not a minor detail. It means LightSeek chose to publish within the reference ecosystem for Python model development, making it easier for the proposal to be evaluated and integrated by other projects already using PyTorch as a foundation. Serving frameworks like vLLM or SGLang, which run on PyTorch, are natural candidates to incorporate SMG ideas if results hold up under broader testing.
---
From our perspective, the proposal seems solid in its diagnosis: contention between prefill and decode is a real and documented problem, and addressing it at the architectural level has more runway than continuing to add VRAM. Whether the implementation details withstand community scrutiny will determine if SMG moves from interesting paper to adopted practice.