DiffusionGemma 26B-A4B: Google’s Text-Diffusion LLM

Quick answer. DiffusionGemma 26B-A4B is Google DeepMind's first open-weight text-diffusion language model, released June 10, 2026 under Apache 2.0. It is a 25.2B-parameter Mixture-of-Experts (3.8B active) built on the Gemma 4 backbone that generates text in parallel blocks rather than token-by-token, delivering up to 4x faster output — over 1,100 tokens per second on an H100.

Almost every large language model you have used generates text one token at a time, left to right. DiffusionGemma breaks that pattern. It is the first open-weight model from Google DeepMind to use discrete text diffusion, refining whole blocks of tokens in parallel the way image-diffusion models denoise pixels. The result is a genuinely different speed profile — and a set of trade-offs worth understanding before you reach for it.

What is DiffusionGemma 26B-A4B?

DiffusionGemma is a diffusion-based text generation model in Google's Gemma family, released on June 10, 2026 under the permissive Apache 2.0 license. It shares the Gemma 4 backbone — the same 26B-A4B Mixture-of-Experts architecture (25.2B total parameters, 3.8B active per forward pass, 128 experts with 8 activated plus one shared expert) — but swaps autoregressive decoding for a block-autoregressive diffusion sampler.

It is multimodal, accepting interleaved text, image, and video inputs (up to 60 seconds of video) and generating text outputs. The context window is 256K tokens, and the model ships with out-of-the-box support for 35+ languages after being pre-trained on 140+.

How is text diffusion different from autoregressive generation?

A standard LLM predicts the next token, appends it, then repeats — a strictly sequential loop. A diffusion LLM starts from a masked or noisy "canvas" of tokens and denoises the entire block over a handful of steps, filling in many positions at once.

DiffusionGemma uses a hybrid called Block Autoregressive Diffusion: it denoises a 256-token block in parallel (Google reports roughly 15-20 tokens committed per forward pass), then once that block is finalized it is written to the KV cache and a fresh canvas is conditioned on the history. This pairs the raw throughput of parallel decoding with the stability of sequential context — you get diffusion speed without losing coherence across long outputs.

What are DiffusionGemma's specs and benchmarks?

Attribute	DiffusionGemma 26B-A4B
Developer	Google DeepMind
Release date	June 10, 2026
Total parameters	25.2B (MoE)
Active parameters	3.8B
Experts	128 total, 8 active + 1 shared
Generation method	Discrete text diffusion (block-autoregressive)
Context window	256K tokens
Modalities	Text, image, video (in) → text (out)
Languages	35+ out of the box, 140+ pre-trained
Throughput	1,100+ tokens/sec (H100, FP8, low batch)
License	Apache 2.0

On Google's published instruction-tuned evaluations (using the Entropy Bound sampler), DiffusionGemma posts competitive reasoning and coding numbers for its active-parameter footprint:

Benchmark	Score
MMLU Pro	77.6%
GPQA Diamond	73.2%
MATH-Vision	70.5%
LiveCodeBench v6	69.1%
AIME 2026 (no tools)	69.1%

How fast is DiffusionGemma, really?

Speed is the headline. In low-batch settings, Google reports per-user generation exceeding 1,100 tokens per second on an H100 in FP8, and community testing puts an RTX 5090 in the 700+ tokens/sec range. The parallel block sampler is what makes this possible: instead of one token per forward pass, DiffusionGemma commits 15-20 at a time over a recommended 48 denoising steps. Google frames the overall speedup as up to 4x versus a comparable autoregressive model.

How do you run DiffusionGemma?

The weights live on Hugging Face as google/diffusiongemma-26B-A4B-it. Because diffusion decoding is unusual, tooling support matters more than usual — and DiffusionGemma is notable for being the first diffusion LLM natively supported in vLLM.

vLLM — vllm serve "google/diffusiongemma-26B-A4B-it" exposes an OpenAI-compatible endpoint.
Transformers — load via AutoProcessor.from_pretrained() and the DiffusionGemmaForBlockDiffusion class.
SGLang — python -m sglang.launch_server --model-path "google/diffusiongemma-26B-A4B-it".
Local runtimes — quantized builds are available for llama.cpp, Ollama, LM Studio, MLX, and Unsloth. Community 4-bit builds are reported to fit inside roughly 18 GB of VRAM, which puts local inference within reach of a single high-end consumer GPU.

What are the trade-offs?

DiffusionGemma optimizes for speed and parallel layout generation, and it is honest about the cost: Google notes that its overall output quality sits below standard autoregressive Gemma 4, and recommends the regular Gemma 4 line for maximum-quality production work. Think of DiffusionGemma as the right pick when throughput, latency, or on-device responsiveness matters more than squeezing out the last few points of quality — high-volume drafting, autocomplete-style UX, or agent loops where you regenerate often. For a broader view of where it sits among open models, see our open-source LLMs landscape for 2026 and the complete Gemma 4 guide.

Frequently asked questions

Is DiffusionGemma open source?

Yes. The weights are released under Apache 2.0, so you can use, modify, and deploy the model commercially. They are hosted on Hugging Face as google/diffusiongemma-26B-A4B-it.

How many parameters does DiffusionGemma have?

It has 25.2B total parameters in a Mixture-of-Experts layout, with 3.8B active on any given forward pass (128 experts, 8 activated plus one shared expert).

Is a diffusion LLM better than an autoregressive one?

Not strictly better — different. DiffusionGemma is substantially faster because it decodes blocks of tokens in parallel, but Google reports its output quality is lower than autoregressive Gemma 4. Choose it when speed and throughput matter most; choose standard Gemma 4 when you need peak quality.

What context window does DiffusionGemma support?

Up to 256K tokens, and it accepts interleaved text, image, and video inputs (video up to about 60 seconds) while producing text output.

Can DiffusionGemma run on a consumer GPU?

Yes, in quantized form. Community 4-bit builds are reported to fit within roughly 18 GB of VRAM, and testers have run it on cards like the RTX 5090 at 700+ tokens per second.

Building with fast open models?

New architectures like text diffusion reward teams that can evaluate, integrate, and ship quickly. Codersera connects you with vetted remote developers and ML engineers who can extend your team and cut hiring risk — so you can move on models like DiffusionGemma while they are still new. Hire vetted remote developers with Codersera.