Quick answer. ZAYA1-8B is an Apache-2.0 Mixture-of-Experts reasoning model from Zyphra (released May 6, 2026) with 8.4B total and ~760M active parameters. Its headline fact: it was pretrained, midtrained, and fine-tuned end-to-end on 1,024 AMD Instinct MI300X GPUs with no NVIDIA hardware involved, while matching far larger models on math and coding.
For three years the unspoken rule of frontier model training has been: if the cluster is not NVIDIA, the model is not competitive. Every leading lab — OpenAI, Anthropic, Google DeepMind, Meta, Mistral — built its frontier runs on Hopper or Blackwell silicon. On May 6, 2026, Zyphra published ZAYA1-8B and broke that rule: a small, sharp, openly-licensed reasoning model trained start-to-finish on AMD compute, AMD networking, and the ROCm software stack.
The model itself is interesting on its own merits — 8.4B total parameters, under a billion active, competitive with models many times its size on hard math and code. But the story engineers are actually paying attention to is the training stack. This is the first major MoE foundation model with a public technical report that was built entirely without NVIDIA. This guide covers what ZAYA1-8B is, why "trained on AMD" matters, the architecture, the benchmarks (clearly labeled as vendor-reported), how to run it locally, who should care, and where the caveats are.
What is ZAYA1-8B?
ZAYA1-8B is an open-weight Mixture-of-Experts (MoE) language model built by Zyphra and tuned specifically for long-form reasoning, mathematics, and coding. The shape of the model:
| Attribute | Value |
|---|---|
| Total parameters | ~8.4B (technical report rounds to 8B) |
| Active parameters per token | ~760M (report rounds to 700M) |
| Architecture | Mixture-of-Experts (Zyphra MoE++ lineage) |
| Experts | 64 experts (sparse routing) |
| Attention | Compressed Convolutional Attention (CCA), ~8x KV-cache compression |
| Router | MLP-based expert router (not a linear router) |
| License | Apache 2.0 |
| Released | May 6, 2026 |
| Weights | Hugging Face: Zyphra/ZAYA1-8B |
The number that matters for cost is active parameters, not total. Because it is sparse MoE, only ~760M parameters fire per token even though 8.4B sit on disk. That gives it the inference economics of a sub-1B model with the knowledge capacity of an 8B one — which is exactly why Zyphra frames it as "maximum intelligence density per parameter."
Why does "trained on AMD" actually matter?
This is the viral hook, and it is a real one, not marketing spin. Per Zyphra's technical report and corroborating coverage from VentureBeat and AIwire, ZAYA1-8B was trained on:
- Compute: 1,024 AMD Instinct MI300X GPUs (192GB HBM each, 8 per node)
- Networking: AMD Pensando Pollara interconnect plus AMD InfinityFabric
- Software: the ROCm platform — AMD's CUDA equivalent
- Infrastructure: a custom cluster co-built with IBM, on IBM Cloud
Every layer of that stack has a NVIDIA analogue it replaces: MI300X instead of H100/H200, InfinityFabric instead of NVLink, RCCL instead of NCCL, ROCm instead of CUDA. The pipeline ran pretraining, context-extension midtraining, and supervised fine-tuning end-to-end on that stack. There was no NVIDIA hardware anywhere in the loop.
Why engineers care:
- The monopoly cracks. If a credible MoE reasoning model can be trained on AMD silicon and still trade blows with frontier models, the "NVIDIA-or-nothing" assumption that has shaped every infrastructure budget since 2023 stops being a hard constraint. That has real implications for GPU procurement, lead times, and pricing leverage.
- Supply diversification. Teams blocked on NVIDIA allocation now have a proof point that the AMD path is viable for serious training, not just inference.
- Cost. MI300X's 192GB of HBM per GPU changes the memory math for large-context training and reduces sharding overhead versus lower-memory NVIDIA parts.
One honest caveat: the "competitive with frontier models" framing comes from Zyphra (the vendor). The hardware claim — trained entirely on AMD — is independently reported by multiple neutral outlets and backed by the published technical report and AMD's own engineering blog, so treat the hardware story as well-established and the benchmark superiority as vendor-reported until independent evals land.
How is the architecture different from a normal MoE?
ZAYA1-8B is not just a parameter-count flex. Three architectural choices do the heavy lifting:
- Compressed Convolutional Attention (CCA). Standard attention's KV-cache grows with context and dominates memory in long reasoning chains. CCA performs sequence mixing in a compressed latent space, yielding roughly an 8x reduction in KV-cache size versus full multi-head attention. For a reasoning model that generates long chains of thought, that compression is what makes long-context inference cheap.
- MLP-based expert router. Most MoE models pick experts with a single linear layer. Zyphra replaced that with a multi-layer MLP router and stabilized it with a bias-balancing scheme inspired by PID controllers from control theory — addressing the routing instability that plagues MoE training.
- Learned residual scaling. A near-free mechanism that controls residual-norm growth through network depth at negligible parameter and FLOP cost.
On the training data side, Zyphra reports roughly 12 trillion tokens across three stages, including a reasoning-focused midtrain phase at 32K context length for ~1.2T tokens (RoPE base frequency 1M) to extend the context window for long chains of thought.
How good are the benchmarks?
All numbers below are vendor-reported by Zyphra in the model card and technical report. They have not yet been independently reproduced at the time of writing. Treat them as a directional claim, not a settled fact.
| Benchmark | ZAYA1-8B | Comparison | Comparison score |
|---|---|---|---|
| AIME 2025 | 89.1 (base) / 91.9 (Markovian RSA) | DeepSeek-R1-0528 | 87.5 |
| HMMT 2025 | 89.6 | Claude 4.5 Sonnet | 88.3 |
| AIME 2026 | 89.1 | Mistral-Small-4 (119B) | 86.4 |
| HMMT Feb 2026 | 71.6 | Mistral-Small-4 (119B) | 70.6 |
| LiveCodeBench v6 | ~63.8–65.8 | Mistral-Small-4 (119B) | 57.9 |
| GPQA-Diamond | 71.0 | Mistral-Small-4 (119B) | 77.2 |
| MMLU-Pro | 74.2 | Mistral-Small-4 (119B) | 81.6 |
Read this honestly. On math and competition reasoning (AIME, HMMT) ZAYA1-8B is genuinely punching above its weight — beating or matching models 15x larger and, per Zyphra, approaching frontier models like Gemini 2.5 Pro, DeepSeek-V3.2, and GPT-5-High on math with test-time compute. On broad knowledge (MMLU-Pro) and science QA (GPQA-Diamond) a 119B model still wins — an 8.4B model cannot store as many facts, and the numbers reflect that.
One important asterisk on the headline math scores: the top figures (91.9 on AIME'25) use Markovian RSA, a test-time compute method that recursively aggregates parallel reasoning traces while carrying forward only a bounded ~4K-token tail between rounds. It spends substantially more inference compute per problem than a single chain-of-thought call. The single-pass base score (89.1 on AIME'25) is the apples-to-apples number to compare against models running one shot.
Companion guide
For how ZAYA1-8B fits alongside the rest of the open-weight field — Llama, Qwen, DeepSeek, Gemma, and which to pick for which job — see our open-source LLMs landscape for 2026.
How do you run ZAYA1-8B locally?
This is the part that makes ZAYA1-8B a real local story rather than an API-only release. At 8.4B total / BF16, the weights are ~16-17GB, and with 4-bit quantization the active footprint is small enough to run on a single consumer GPU or a recent Apple Silicon Mac.
Important: ZAYA1-8B uses a custom architecture (CCA + MLP router), so it does not run on stock vLLM or stock transformers yet. You must install Zyphra's forks. As of the May 2026 release there is no official Ollama or llama.cpp GGUF from Zyphra — community quantizations (e.g. BNB and MXFP4 builds) have started appearing on Hugging Face but are unofficial. The supported paths are Zyphra's vLLM and transformers forks.
Running with the vLLM fork (recommended for serving)
# Install Zyphra's vLLM fork (custom-arch branch)
pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1-pr"
# Serve the model on a single GPU
vllm serve Zyphra/ZAYA1-8B --port 8010 \
--mamba-cache-dtype float32 --dtype bfloat16 \
--reasoning-parser qwen3 --enable-auto-tool-choice \
--tool-call-parser zaya_xml
# Multi-GPU (8-way data + expert parallel)
vllm serve Zyphra/ZAYA1-8B --port 8010 -dp 8 -ep \
--mamba-cache-dtype float32 --dtype bfloat16 \
--reasoning-parser qwen3 --enable-auto-tool-choice \
--tool-call-parser zaya_xmlThe endpoint is OpenAI-compatible, so any client that talks to /v1/chat/completions works against http://localhost:8010.
Running with the transformers fork (for scripting)
# Install Zyphra's transformers fork alongside the vLLM fork
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"From there, load Zyphra/ZAYA1-8B with the standard AutoModelForCausalLM / AutoTokenizer pattern. Zyphra's recommended sampling: temperature 1.0 for general use, 0.6 for agentic / code use cases.
What VRAM do you actually need?
| Setup | Approx. VRAM | Realistic on |
|---|---|---|
| BF16, short context | ~18–22GB | 1x MI300X / A6000 / 4090-class with offload |
| 4-bit quantized | ~6–9GB | Consumer 12GB+ GPU, Apple Silicon (16GB+) |
| Long reasoning (32K ctx) | + KV cache (mitigated 8x by CCA) | add headroom for long chains-of-thought |
Exact VRAM figures are not published by Zyphra; the table above is an engineering estimate from the BF16 weight size (~16-17GB) plus typical KV-cache overhead, with CCA's 8x compression keeping long-context cost low. Treat as a starting point and measure on your hardware.
Who should actually care about ZAYA1-8B?
- Teams running on AMD or evaluating it. This is the strongest public datapoint that serious training on MI300X works. If GPU allocation is your bottleneck, this changes your options.
- Anyone building local / on-device reasoning. Sub-1B active params plus an Apache-2.0 license is a rare combination for a model that is actually good at math and code. It is deployable where a 70B model is not.
- Cost-sensitive inference at scale. The active-parameter economics make per-token serving cheap; CCA makes long chains-of-thought cheap. Good for high-volume reasoning workloads.
- Researchers. The full technical report, open weights, and Apache 2.0 license make the architecture (CCA, MLP router, learned residual scaling) reusable, not just observable.
Who should not reach for it: teams that need broad world knowledge or strong general-domain QA. An 8.4B model with 760M active params trades factual breadth for reasoning density. For knowledge-heavy retrieval-augmented workloads a larger model (or RAG over a smaller one) is still the right call.
What are the limitations and open questions?
- Benchmarks are vendor-reported. Zyphra's numbers are strong but not yet independently reproduced. Run your own evals on your own tasks before betting on the comparison claims.
- Headline math scores use heavy test-time compute. The 91.9 AIME'25 figure uses Markovian RSA, which spends far more inference compute than a single pass. Compare on the base (~89.1) number for fair single-shot evaluation.
- Custom architecture, custom forks. No stock vLLM / transformers / Ollama / llama.cpp support at release. You depend on Zyphra's forks until the architecture is upstreamed. Community quantizations exist but are unofficial.
- Narrow strength profile. Excellent at math and code, weaker on broad knowledge and general QA versus much larger models. Pick it for what it is good at.
- New model, thin ecosystem. Tooling, fine-tuning recipes, and deployment guides are still maturing compared to Llama / Qwen / Mistral.
None of these are dealbreakers — they are the normal state of a model that is twelve days old at the time of writing. The hardware story and the reasoning-density story are both real; the rest is ecosystem catching up.
Hiring engineers who know this stack?
Standing up MoE serving on AMD ROCm, validating custom-architecture models, and building cost-efficient local inference is specialized work, and the talent pool for it is thin. If you are hiring vetted remote developers experienced with open-weight LLM deployment, AMD/ROCm infrastructure, or efficient inference engineering, codersera.com/hire matches you with engineers who have shipped this kind of system in production. We run a risk-free trial so you can validate technical fit before committing.
FAQ
Is ZAYA1-8B really trained without any NVIDIA GPUs?
Yes. Per Zyphra's technical report and independent reporting from VentureBeat and AIwire, the entire pipeline — pretraining, context-extension midtraining, and supervised fine-tuning — ran on 1,024 AMD Instinct MI300X GPUs with AMD Pensando Pollara networking and the ROCm software stack, on a cluster co-built with IBM. No NVIDIA hardware was involved at any stage.
How many parameters does ZAYA1-8B have?
About 8.4 billion total parameters with roughly 760 million active per token, because it is a sparse Mixture-of-Experts model with 64 experts. The arXiv technical report rounds these to "8B total / 700M active." The active count is what determines inference cost and speed.
What license is ZAYA1-8B released under?
Apache 2.0 — a permissive license that allows commercial use, modification, and redistribution. The weights are on the Hugging Face Hub at Zyphra/ZAYA1-8B.
Can I run ZAYA1-8B on a consumer GPU?
Yes, with caveats. In BF16 the weights are ~16-17GB (needs ~18-22GB VRAM with overhead); 4-bit quantized it drops to roughly 6-9GB, which fits a 12GB+ consumer GPU or a 16GB+ Apple Silicon Mac. You must use Zyphra's vLLM or transformers forks because the custom architecture is not yet supported by stock runtimes.
Does ZAYA1-8B beat Claude or GPT-5?
On specific math benchmarks, Zyphra reports it matches or slightly exceeds models like Claude 4.5 Sonnet (HMMT'25: 89.6 vs 88.3) and, with test-time compute, approaches frontier models on math. These are vendor-reported and not yet independently reproduced. On broad knowledge and general QA, larger frontier models still win — an 8.4B model trades factual breadth for reasoning density.
Why does training on AMD matter for the industry?
Since 2023, frontier training has effectively required NVIDIA hardware, giving NVIDIA enormous pricing and allocation leverage. ZAYA1-8B is the first major MoE foundation model with a public technical report trained entirely on AMD — proof that a competitive alternative training stack exists. That gives teams blocked on NVIDIA allocation a viable second source and weakens the single-vendor dependency that has shaped AI infrastructure budgets.
What is Markovian RSA and why does it inflate the scores?
Markovian RSA is Zyphra's test-time compute method: it runs parallel reasoning traces and recursively aggregates them while carrying forward only a bounded ~4K-token tail between rounds. It raises ZAYA1-8B's AIME'25 score from ~89.1 (single pass) to 91.9, but spends substantially more inference compute per problem. For a fair single-shot comparison against other models, use the base ~89.1 figure.