ZAYA1-8B: Competitive reasoning with under 1B active parameters
Zyphra releases ZAYA1-8B, a 700M active parameter MoE model that matches or exceeds DeepSeek-R1-0528 on mathematics and code benchmarks despite its compact size.
On May 8th, the Zyphra team published the arXiv technical report for ZAYA1-8B, a reasoning model built on a Mixture-of-Experts (MoE) architecture that keeps only 700M of its 8B total parameters active per token. What catches the eye is not the size alone (there are far larger models) but what the model achieves with it: matching or exceeding DeepSeek-R1-0528 on several demanding mathematics and code benchmarks, while remaining competitive with considerably larger open-weight models.
This sharpens a question that is gaining weight in the community: when does it stop making sense to scale total parameters if the active parameters per inference step can be a small fraction of that total?
What makes ZAYA1-8B different
The base architecture is MoE++, proprietary to Zyphra, but the standout design choice lies in training: reasoning was not bolted on through late-stage RLHF; instead, reasoning data was incorporated from pretraining onward. To keep the model from generating excessively long reasoning chains, Zyphra applied an answer-preserving trimming scheme that shortens reasoning traces while keeping the correct answer intact.
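The report does not spell out how the trimming is implemented, but the core contract is easy to sketch: shorten the chain of reasoning steps to a budget while guaranteeing the final answer survives. The step-splitting granularity, the budget, and the head-plus-tail policy below are illustrative assumptions, not Zyphra's method.

```python
# Hypothetical sketch of answer-preserving trimming. The policy here
# (keep the opening steps and the steps nearest the conclusion, drop the
# middle) is one simple assumption; only the invariant matters: the
# final answer is never trimmed away.

def trim_trace(reasoning_steps, final_answer, max_steps):
    """Keep at most `max_steps` reasoning steps, always appending the answer."""
    if len(reasoning_steps) <= max_steps:
        kept = reasoning_steps
    else:
        head = max_steps // 2          # steps kept from the start of the chain
        tail = max_steps - head        # steps kept from the end of the chain
        kept = reasoning_steps[:head] + reasoning_steps[-tail:]
    return kept + [final_answer]
```

Whatever the real policy, the training signal stays intact: the model still sees a (shorter) reasoning chain that terminates in the correct answer.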
Post-training follows a cascade of four reinforcement learning (RL) stages:
1. Reasoning warmup on mathematics and puzzles.
2. RLVE-Gym: a curriculum of 400 tasks.
3. Math and code RL with computation traces at inference time and synthetic code environments built from competitive programming references.
4. Behavioral RL focused on chat and instruction following.
Each stage is designed not just to teach the model how to solve problems, but to structure the solution process in a way that is useful at inference time.
Markovian RSA: a test-time compute gambit
Perhaps the paper's most novel methodological contribution is Markovian RSA (Recursive Sequential Aggregation), a test-time compute method that recursively aggregates multiple parallel reasoning traces. What sets it apart from similar approaches is the Markovian constraint: between aggregation rounds, the model carries only a bounded-length reasoning queue instead of accumulating the entire history. This reduces context cost without sacrificing coherence in chained reasoning.
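The paper's description reduces to a simple control loop. The sketch below is a minimal interpretation, not Zyphra's implementation: `generate_trace` and `aggregate` stand in for model calls, and the queue length and round counts are arbitrary. The essential point is the bounded `deque` that replaces full-history accumulation between aggregation rounds.

```python
from collections import deque

# Minimal sketch of a Markovian RSA-style loop, under the assumption that
# each round samples `width` parallel traces, aggregates them into an
# answer plus a short summary, and carries forward only a fixed-size
# queue of recent summaries (the Markovian constraint).

def markovian_rsa(prompt, generate_trace, aggregate,
                  rounds=3, width=4, queue_len=2):
    """Recursively aggregate parallel traces with bounded carried state."""
    queue = deque(maxlen=queue_len)  # bounded state, not the full history
    answer = None
    for _ in range(rounds):
        context = list(queue)  # condition only on the bounded queue
        traces = [generate_trace(prompt, context) for _ in range(width)]
        answer, summary = aggregate(traces)
        queue.append(summary)  # oldest summaries fall off automatically
    return answer
```

Because the carried context never grows past `queue_len` summaries, per-round context cost stays flat as rounds increase, which is where the savings over naive history accumulation would come from.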
For those working on deployments where per-token inference cost matters, which is virtually any production deployment at scale, this kind of technique has immediate practical relevance.
Why this matters beyond benchmarks
The entire training process (pretraining, midtraining, and SFT) was conducted on AMD infrastructure: compute, networking, and software. Zyphra makes no mention of NVIDIA GPUs anywhere in the report. This is no minor detail: if the results are independently replicated, it adds evidence to the thesis that the AMD ecosystem is maturing into a genuine alternative for training models at this scale, something that until recently was more promise than reality.
As for who finds this work most useful: teams needing serious reasoning capabilities in memory or latency-constrained environments, researchers studying RL methods for reasoning, and anyone evaluating open-weight alternatives to proprietary models for mathematics or code tasks.
ZAYA1-8B is neither the largest nor the most visible model at the moment, but its technical report ranks among the densest in justified design decisions we have read so far in 2026. It deserves careful reading before being dismissed for its size.