SubQ proposes sub-quadratic attention for LLMs: what's the catch?
SubQ is a project that promises to reduce the computational complexity of language models below the standard quadratic threshold. We analyze what we know and what remains unclear.
Quadratic attention has been the most notorious bottleneck in transformers for years: doubling the context length quadruples the computational cost. It's a well-documented problem that has spawned an entire industry of partial solutions (sliding windows, sparse attention, linear architectures like Mamba), yet none of them has displaced the standard transformer in general-purpose applications. This week, SubQ appeared on Hacker News with the promise of doing exactly that: sub-quadratic attention for LLMs.
The initial traction on HN has been modest—four points and no comments at publication time—which doesn't rule out a solid technical proposal, but does invite careful reading before drawing conclusions.
What SubQ claims
SubQ's website presents the project as an architecture for language models that reduces attention complexity below the quadratic threshold O(n²) that characterizes conventional transformers. The stated goal is to make models with long contexts computationally viable without the quality compromises that typically accompany approximation methods.
At the time of writing, public documentation is sparse: no linked paper, no reproducible benchmarks, and no public code repository. The proposal exists primarily as a landing page with a high-level description.
Why the problem is real and hard
The problem SubQ addresses is unquestionably real. Quadratic attention is the reason context scaling becomes prohibitively expensive in memory and compute time. Anthropic has opted for context windows of up to 1M tokens in Claude Opus 4.7, but keeping that much context active carries substantial inference costs that translate directly into per-token pricing.
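To make the quadratic cost concrete, here is a minimal sketch of standard scaled dot-product attention in NumPy (purely illustrative, and unrelated to whatever SubQ actually does): the score matrix has n × n entries, so memory and compute grow with the square of the context length.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K, V: arrays of shape (n, d). The score matrix below has shape
    (n, n), which is where the O(n^2) memory and compute come from.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                     # (n, n): n^2 entries
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d)

# Doubling the context quadruples the score matrix:
# n = 4,096 -> ~16.8M scores; n = 8,192 -> ~67.1M scores (per head, per layer).
```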
Existing approaches have achieved real improvements, but always with some trade-off. FlashAttention speeds up exact attention by reducing memory traffic yet remains quadratic in compute; linear attention and state-space models (SSMs) change the computation itself, at the cost of quality loss on tasks requiring global attention, greater implementation complexity, or restrictions on the types of dependencies the model can capture.
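As one example of how those trade-offs arise, linear attention variants replace the softmax with a kernel feature map so the matrix products can be reassociated. The sketch below is a generic formulation of that idea, not the method of any particular paper and certainly not SubQ's: the cost becomes linear in n, but the result is no longer exact softmax attention.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized (linear) attention: softmax(Q K^T) V is approximated
    by phi(Q) @ (phi(K).T @ V), with a row-wise normalizer.

    Reassociating the matmul avoids the (n, n) score matrix entirely:
    phi(K).T @ V is only (d, d), so cost grows linearly in n. The price
    is that this is an approximation, not exact softmax attention.
    """
    Qf, Kf = phi(Q), phi(K)                 # feature-mapped queries/keys
    kv = Kf.T @ V                           # (d, d): independent of n
    z = Kf.sum(axis=0)                      # (d,): normalizer terms
    return (Qf @ kv) / (Qf @ z)[:, None]    # (n, d)
```

That reassociation is exactly where the quality concerns come from: the model computes something cheaper, not something identical.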
If SubQ genuinely reduces complexity without those trade-offs, the impact would be significant. But that's precisely the part that cannot be verified yet.
Who would benefit
The practical implications of genuine sub-quadratic attention vary by use case:
- Infrastructure teams managing Claude Code deployments with heavy context loads would notice direct reductions in inference costs, especially in workflows with multiple subagents or long-running sessions.
- Researchers working on their own architectures, who would gain an up-to-date technical reference to incorporate into their experiments.
- Companies deploying models on-premise under hardware constraints, where the cost of quadratic attention matters more than in elastic cloud environments.
What's needed to evaluate the proposal
For SubQ to be considered more than a promise, we'd need to see at least three things:
- a preprint or technical paper with the mathematical formulation of the complexity reduction,
- comparative benchmarks against standard transformers and existing variants, preferably on standard tasks such as perplexity on known evaluation sets, and
- reproducible code.
Without these, any claim about performance or efficiency remains unsubstantiated.
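For context, perplexity, the metric mentioned above, is simply the exponential of the average per-token negative log-likelihood on an evaluation set. A minimal sketch of that computation, assuming per-token log-probabilities are already available from whatever model is being evaluated:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).
    Lower is better; an architecture change like the one SubQ claims
    would need to match a standard transformer baseline on this metric.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```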
This isn't a judgment on SubQ's team's intentions—the paper may be forthcoming or the public launch may simply be early—but an honest description of the current state of available information.
Broader context
SubQ emerges at a time when interest in alternatives to quadratic attention has resurfaced, driven partly by the race toward longer contexts and the operational costs that entails. Projects like RWKV, Mamba, and GLA have demonstrated that the space of alternative architectures has genuine room for exploration. The question is always the same: does it work as well in practice as in theory?
---
From our perspective, it's worth tracking SubQ if it publishes verifiable technical evidence in the coming weeks. For now, it's a name to add to the list of proposals to review, not a solution ready for production evaluation.