ZAYA1-8B: Competitive reasoning with under 1B active parameters
Zyphra releases ZAYA1-8B, a 700M active parameter MoE model that matches or exceeds DeepSeek-R1-0528 on mathematics and code benchmarks despite its compact size.
On May 8th, the Zyphra team published the arXiv technical report for ZAYA1-8B, a reasoning model built on a Mixture-of-Experts (MoE) architecture that keeps only 700M of its 8B total parameters active per token. What catches the eye is not the size alone (there are far larger models) but what the model achieves with it: matching or exceeding DeepSeek-R1-0528 on several demanding mathematics and code benchmarks, while remaining competitive with considerably larger open-weight models.
This sharpens a question that is gaining weight in the community: when does it stop making sense to scale total parameters if the active parameters per inference step can be a small fraction of that total?
What makes ZAYA1-8B different
The base architecture is MoE++, proprietary to Zyphra, but the standout design choice lies in training: reasoning was not bolted on through late-stage RLHF; instead, reasoning data was incorporated from pretraining onward. To keep the model from generating excessively long reasoning chains, Zyphra applied an answer-preserving trimming scheme that shortens reasoning traces while keeping the correct answer intact.
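The report does not spell out how the trimming is implemented, but the core contract is easy to sketch: shorten the chain of reasoning steps to a budget while guaranteeing the final answer survives. The step-splitting granularity, the budget, and the head-plus-tail policy below are illustrative assumptions, not Zyphra's method.

```python
# Hypothetical sketch of answer-preserving trimming. The policy here
# (keep the opening steps and the steps nearest the conclusion, drop the
# middle) is one simple assumption; only the invariant matters: the
# final answer is never trimmed away.

def trim_trace(reasoning_steps, final_answer, max_steps):
    """Keep at most `max_steps` reasoning steps, always appending the answer."""
    if len(reasoning_steps) <= max_steps:
        kept = reasoning_steps
    else:
        head = max_steps // 2          # steps kept from the start of the chain
        tail = max_steps - head        # steps kept from the end of the chain
        kept = reasoning_steps[:head] + reasoning_steps[-tail:]
    return kept + [final_answer]
```

Whatever the real policy, the training signal stays intact: the model still sees a (shorter) reasoning chain that terminates in the correct answer.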
Post-training follows a cascade of four reinforcement learning (RL) stages:
1. Reasoning warmup on mathematics and puzzles.
2. RLVE-Gym: a curriculum of 400 tasks.
3. Math and code RL with computation traces at inference time and synthetic code environments built from competitive programming references.
4. Behavioral RL focused on chat and instruction following.
Each stage is designed not just to teach the model how to solve problems, but to structure the solution process in a way that is useful at inference time.
Markovian RSA: a test-time compute gambit
Perhaps the paper's most novel methodological contribution is Markovian RSA (Recursive Sequential Aggregation), a test-time compute method that recursively aggregates multiple parallel reasoning traces. What sets it apart from similar approaches is the Markovian constraint: between aggregation rounds, the model carries only a bounded-length reasoning queue instead of accumulating the entire history. This reduces context cost without sacrificing coherence in chained reasoning.
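The paper's description reduces to a simple control loop. The sketch below is a minimal interpretation, not Zyphra's implementation: `generate_trace` and `aggregate` stand in for model calls, and the queue length and round counts are arbitrary. The essential point is the bounded `deque` that replaces full-history accumulation between aggregation rounds.

```python
from collections import deque

# Minimal sketch of a Markovian RSA-style loop, under the assumption that
# each round samples `width` parallel traces, aggregates them into an
# answer plus a short summary, and carries forward only a fixed-size
# queue of recent summaries (the Markovian constraint).

def markovian_rsa(prompt, generate_trace, aggregate,
                  rounds=3, width=4, queue_len=2):
    """Recursively aggregate parallel traces with bounded carried state."""
    queue = deque(maxlen=queue_len)  # bounded state, not the full history
    answer = None
    for _ in range(rounds):
        context = list(queue)  # condition only on the bounded queue
        traces = [generate_trace(prompt, context) for _ in range(width)]
        answer, summary = aggregate(traces)
        queue.append(summary)  # oldest summaries fall off automatically
    return answer
```

Because the carried context never grows past `queue_len` summaries, per-round context cost stays flat as rounds increase, which is where the savings over naive history accumulation would come from.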
For those working on deployments where per-token inference cost matters, which is virtually any production deployment at scale, this kind of technique has immediate practical relevance.
Why this matters beyond benchmarks
The entire training process (pretraining, midtraining, and SFT) was conducted on AMD infrastructure: compute, networking, and software. Zyphra makes no mention of NVIDIA GPUs anywhere in the report. This is no minor detail: if the results are independently replicated, it adds evidence to the thesis that the AMD ecosystem is maturing into a genuine alternative for training models at this scale, something that until recently was more promise than reality.
As for who finds this work most useful: teams needing serious reasoning capabilities in memory or latency-constrained environments, researchers studying RL methods for reasoning, and anyone evaluating open-weight alternatives to proprietary models for mathematics or code tasks.
ZAYA1-8B is neither the largest nor the most visible model at the moment, but its technical report ranks among the densest in justified design decisions we have read so far in 2026. It deserves careful reading before being dismissed for its size.