Last updated April 2026 — refreshed for current model/tool versions.
Google's open-weights Gemma family expanded again on April 2, 2026 with the launch of Gemma 4, built on Gemini 3 research and released under the permissive Apache 2.0 license. Gemma 4 supersedes both Gemma 3 (March 2025) and the mobile-first Gemma 3n (June 2025), but the older models still ship in production and remain useful reference points. This guide compares all three generations head-to-head, with current 2026 numbers, so you can pick the right Gemma for cloud, edge, or on-device workloads.
What changed in 2026: Gemma 4 replaces Gemma 3's 1B/4B/12B/27B lineup with E2B, E4B, 26B MoE (A4B), and 31B Dense; context jumps to 256K tokens on the 26B MoE and 31B Dense and 128K on the E-series; audio input becomes native on E2B/E4B; and the license moves from Google's restrictive Gemma terms to Apache 2.0. The 31B Dense model now scores 85.2 on MMLU-Pro and ~1452 LMArena Elo, a jump of more than 100 Elo points over Gemma 3 27B's 1339.
TL;DR
- Use Gemma 4 31B Dense for workstation-class reasoning, long-context analysis, and agentic workflows. It's the new flagship and beats Gemma 3 27B on every public benchmark.
- Use Gemma 4 26B MoE (A4B) for consumer-GPU deployment — only 3.8B active parameters per token, comparable quality to the 31B Dense on most tasks.
- Use Gemma 4 E4B for edge devices, laptops without a GPU, and high-quality on-device inference.
- Use Gemma 4 E2B for phones and resource-constrained IoT — direct successor to Gemma 3n E2B with native audio.
- Stay on Gemma 3 / Gemma 3n only if you've already shipped a fine-tune you don't want to redo. Otherwise, migrate.
For a wider look at how Google's open lineup stacks up against frontier closed models, see our pillar comparison DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026), which benchmarks Gemma 4 alongside DeepSeek V4, Claude 4.7, GPT-5.5, and Qwen 3.6 on real coding workloads.
The Gemma family in 2026
Gemma 4 (April 2, 2026) — current flagship
Gemma 4 is the first Gemma generation built on Gemini 3 research. It ships in four sizes designed to span phone-to-workstation deployment without changing prompts or tooling:
- Gemma 4 E2B — effective 2B parameters, 128K context, text/image/video/audio in. Targets phones and lightweight edge devices.
- Gemma 4 E4B — effective 4B parameters, 128K context, text/image/video/audio in. Targets laptops, browsers (via WebGPU), and edge accelerators.
- Gemma 4 26B MoE (A4B) — 26B total parameters, ~3.8B active per token, 256K context, text/image/video in. Targets consumer GPUs (RTX 4090 / 5090 class).
- Gemma 4 31B Dense — 31B parameters, 256K context, text/image/video in. Targets workstation GPUs and small inference servers.
All four ship with configurable thinking modes, native system prompts, and built-in function calling. Licensing is Apache 2.0 — no usage caps, no field-of-use restrictions, drop-in for commercial production.
Gemma 3 (March 2025) — previous generation
Gemma 3 introduced multimodality (text + image + short video) to the Gemma line and shipped in 1B / 4B / 12B / 27B sizes. Its 27B variant was, at launch, the highest-Elo open model that could fit on a single H100 — peaking at 1339 LMSys Arena Elo and 67.5 MMLU-Pro. Architecturally it's a standard transformer with Grouped Query Attention, QK-norm, and a 5:1 local/global interleaved attention pattern (1024-token local windows). Context: 32K on 1B, 128K on 4B/12B/27B.
Gemma 3n (June 2025) — mobile experiment, now superseded by Gemma 4 E-series
Gemma 3n was the on-device sibling to Gemma 3, introducing the MatFormer (Matryoshka Transformer) architecture, Per-Layer Embedding (PLE) caching, and conditional parameter loading. The headline trick: an 8B "E4B" model contained a fully functional 5B "E2B" sub-model, giving runtime size selection. Effective memory was 2 GB (E2B) and 3 GB (E4B). Native audio via the Universal Speech Model encoder and 60 fps video via MobileNet-V5 made it the first Gemma usable for real-time mobile assistants.
In Gemma 4, the MatFormer ideas have been folded into the E2B / E4B tier directly, so there is no separate "Gemma 4n" — Gemma 4 E2B/E4B is the successor to Gemma 3n.
Head-to-head benchmarks (2026)
Reasoning, knowledge, and math
| Benchmark | Gemma 3 27B | Gemma 3n E4B | Gemma 4 31B Dense | Gemma 4 26B MoE |
|---|---|---|---|---|
| MMLU-Pro | 67.5 | ~55 | 85.2 | ~83 |
| GPQA Diamond | 42.4 | n/a | 84.3 | ~80 |
| AIME 2026 (math) | n/a | n/a | 89.2 | ~85 |
| LiveCodeBench v6 | 29.7 (v3) | n/a | 80.0 | ~76 |
| Codeforces Elo | ~110 | n/a | 2150 | ~1900 |
| LMArena (human pref) Elo | 1339 | ~1180 | ~1452 (top 3) | ~1430 |
The MMLU-Pro jump from 67.5 → 85.2 between Gemma 3 27B and Gemma 4 31B is one of the largest single-generation gains in any open-weights line. The Codeforces Elo move (110 → 2,150) is even more dramatic — Gemma 3 27B could barely clear beginner problems, while Gemma 4 31B reaches expert-level competitive programming.
Multimodal capability and context
| Capability | Gemma 3 (4B/12B/27B) | Gemma 3n (E2B/E4B) | Gemma 4 E2B/E4B | Gemma 4 26B MoE / 31B Dense |
|---|---|---|---|---|
| Context window | 128K | 32K | 128K | 256K |
| Image input | Yes | Yes | Yes (variable resolution) | Yes (variable resolution) |
| Video input | Short clips | 60 fps streaming | Yes (variable aspect ratio) | Yes (variable aspect ratio) |
| Audio input | No | Yes (USM encoder) | Yes (native) | No |
| Languages | 140+ | 140+ | 140+ | 140+ |
| Function calling | External tooling required | External tooling required | Built-in | Built-in |
| Thinking modes | No | No | Configurable | Configurable |
Deployment footprint
| Model | Total params | Active / effective params | Min RAM/VRAM | Suitable target |
|---|---|---|---|---|
| Gemma 4 E2B | ~5B | ~2B effective | 2–3 GB | Phones, IoT, browsers |
| Gemma 4 E4B | ~8B | ~4B effective | 3–4 GB | Laptops, edge boxes |
| Gemma 4 26B MoE | 26B | 3.8B active | ~16 GB (Q4) | RTX 4090 / 5090, Macs (M3 Max+) |
| Gemma 4 31B Dense | 31B | — | ~22 GB (Q4) / 64 GB (FP16) | Workstation GPUs, single H100 |
| Gemma 3 27B | 27B | — | ~18 GB (Q4) / 54 GB (FP16) | Workstation GPUs |
| Gemma 3n E4B | 8B | ~4B effective | 3 GB | Mid-range phones |
Architecture deep dive
Gemma 4: dense + MoE under one umbrella
Gemma 4 keeps the GQA + QK-norm + RoPE foundation from Gemma 3 but adds three things:
- Mixture-of-Experts at the 26B size. The A4B variant routes each token to a small subset of experts so only ~3.8B parameters are active per forward pass. This is what lets a 26B model run on a single 24 GB consumer GPU at Q4. A minimal routing sketch follows this list.
- Configurable thinking modes. The model supports an explicit "think" budget at inference time, similar to o1/o3-style reasoning, with the budget exposed as a sampling parameter rather than a separate model checkpoint.
- Native multimodal heads. Image, video (variable aspect ratio), and on E2B/E4B audio inputs all share the same backbone — no adapter modules to load or unload.
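To make the first item concrete, here is a minimal top-k routing sketch in PyTorch. It shows the general MoE mechanism (route each token to a few experts, then mix their outputs by router weight); the expert count, k, and dimensions are illustrative and are not Gemma 4's published configuration.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    """Route each token to its top-k experts and mix the outputs by router weight."""
    logits = router(x)                                   # (tokens, n_experts)
    weights, idx = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                     # tokens whose slot-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Toy usage: compute per token scales with k, not with the total number of experts,
# which is why a 26B-total model can behave like a ~4B-active model at inference time.
d_model, n_experts = 64, 8
router = torch.nn.Linear(d_model, n_experts)
experts = [
    torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model), torch.nn.GELU(),
                        torch.nn.Linear(4 * d_model, d_model))
    for _ in range(n_experts)
]
y = moe_forward(torch.randn(10, d_model), router, experts)
print(y.shape)  # torch.Size([10, 64])
```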
Gemma 3: standard transformer, multimodal text+image
Gemma 3 uses interleaved local/global attention (5 local layers per global layer, 1024-token local windows) and RoPE with a 1M base frequency for the 128K-context variants. The 1B variant is text-only and was sized specifically for mobile (529 MB on disk after quantization).
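As a quick illustration of that interleave, the sketch below lays out the repeating 5:1 schedule. The layer count passed in is arbitrary; only the pattern and the 1024-token local window come from the description above.

```python
# Sketch of Gemma 3's 5:1 local/global attention schedule.
LOCAL_WINDOW = 1024                     # tokens visible to each local layer
PATTERN = ["local"] * 5 + ["global"]    # 5 sliding-window layers, then 1 full-attention layer

def attention_schedule(n_layers: int) -> list[str]:
    """Assign each transformer layer 'local' (1024-token window) or 'global'."""
    return [PATTERN[i % len(PATTERN)] for i in range(n_layers)]

print(attention_schedule(12))
# ['local', 'local', 'local', 'local', 'local', 'global',
#  'local', 'local', 'local', 'local', 'local', 'global']
```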
Gemma 3n: MatFormer + PLE caching
The MatFormer trick was nesting a smaller model entirely inside a bigger one — E4B contained a real, separately-runnable E2B — so a single download could serve two quality tiers. Per-Layer Embedding caching offloaded embedding tables to disk/CPU and pulled them in lazily, cutting peak memory by ~40%. Conditional parameter loading skipped vision or audio weights when those modalities weren't needed.
Gemma 4 keeps the spirit of MatFormer — runtime-selectable effective size — but bakes it into the E-series sizing (E2B, E4B) rather than as a "model contains another model" architecture. For most users this is simpler: pick the size, ship it.
When to use which Gemma
Choose Gemma 4 31B Dense when…
- You need the strongest open-weights reasoning available on a single workstation GPU.
- You're processing 100K+ token documents and need the 256K context window.
- You're building agentic workflows that rely on the built-in function calling and configurable thinking modes.
- You'd otherwise reach for DeepSeek V4 or Qwen 3.6 32B but want Apache 2.0 with no usage caps.
Choose Gemma 4 26B MoE (A4B) when…
- You're constrained to a single consumer GPU (RTX 4090, 5090, or an M-series Mac with 32+ GB unified memory).
- You need throughput closer to a 4B model but quality closer to a 26B model.
- You're running multi-tenant inference and want lower per-token cost.
Choose Gemma 4 E4B when…
- You're shipping on-device AI on a laptop, browser (WebGPU), or edge accelerator.
- You need audio + vision + text on the same model with no cloud round-trip.
- Privacy or offline operation is non-negotiable.
Choose Gemma 4 E2B when…
- The target is a phone, smartwatch, or microcontroller-class device.
- Energy budget is the binding constraint (the 0.75%-battery-per-25-conversations figure was measured on Gemma 3n's E2B; Gemma 4 E2B targets the same power envelope).
Stay on Gemma 3 or 3n only when…
- You have an existing fine-tune in production and the migration cost outweighs the quality gain.
- You depend on the exact 1B text-only variant of Gemma 3 (Gemma 4's smallest tier is E2B, not 1B).
- You depend on the explicit MatFormer-nested-model behaviour from Gemma 3n. Gemma 4 doesn't expose this directly.
Historical: the original Gemma 3 vs Gemma 3n analysis
The rest of this section preserves the 2025-era comparison between Gemma 3 and Gemma 3n for readers maintaining systems on those models. All numbers below are point-in-time; for current-generation deployments, use Gemma 4.
Gemma 3 (legacy)
Built on Gemini 2.0 research. Released March 12, 2025. Sizes: 1B (text-only, 529 MB), 4B (multimodal, 128K context), 12B, 27B. The 27B variant hit 1339 LMSys Arena Elo and 67.5 MMLU-Pro at launch — top-10 globally and top open-weights model that could fit on a single H100. Other 27B numbers worth keeping on file:
| Benchmark | Gemma 3 27B (2025) |
|---|---|
| MMLU-Pro | 67.5 |
| LiveCodeBench (v3) | 29.7 |
| Bird-SQL | 54.4 |
| GPQA Diamond | 42.4 |
| MATH | 69.0 |
| FACTS Grounding | 74.9 |
| MMMU | 64.9 |
| SimpleQA | 10.0 |
| LMSys Arena Elo | 1339 |
Gemma 3n (legacy)
Released June 26, 2025. The MatFormer-based mobile-first sibling. Effective 2B (E2B) and 4B (E4B) memory footprints despite 5B and 8B total parameters. Used Universal Speech Model for audio, MobileNet-V5 for 60 fps video. Real-world numbers from the launch:
- Inference: up to 2,585 tokens/sec on Pixel 9 Pro (E2B INT4).
- Energy: 0.75 % battery for 25 conversations (Pixel 9 Pro).
- Audio encode rate: 6.25 tokens/second.
- Video: 60 fps real-time analysis with MobileNet-V5.
Both Gemma 3 and Gemma 3n shipped under Google's custom Gemma license, which had usage caps and field-of-use carve-outs. Gemma 4 drops that for Apache 2.0 — meaningful if you're shipping in regulated industries or want frictionless commercial redistribution.
Running Gemma 4 in 2026
Quick start (Hugging Face)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Gemma 4 31B Dense on a workstation GPU
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain MoE routing in 3 sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
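Continuing from the quick start (reusing tokenizer and model), here is a hedged sketch of the built-in function calling. It assumes the Gemma 4 chat template accepts tool schemas through transformers' generic tools= argument; get_weather is a made-up example tool, not a real API.

```python
# Function-calling sketch, continuing from the quick start above (reuses `tokenizer`
# and `model`). Assumes the Gemma 4 chat template supports transformers' generic
# `tools=` argument; `get_weather` is a hypothetical example tool.
def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 22 C"

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],           # schema is derived from the signature and docstring
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
# The model should emit a structured tool call; parse it, run get_weather, append a
# "tool" message with the result, and generate again for the final answer.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```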
Deployment options
- Cloud / managed: Vertex AI Model Garden, AWS Bedrock (Gemma 4 listed April 2026), Hugging Face Inference Endpoints.
- Local server: vLLM 0.10+ supports Gemma 4 MoE routing; llama.cpp added GGUF support on day one. A minimal client sketch follows this list.
- Mac: MLX 0.20+ has dedicated kernels for the 26B MoE; Ollama ships `gemma4`, `gemma4:e4b`, and `gemma4:31b` tags.
- Mobile: Google AI Edge SDK is the official path for E2B/E4B on Android; MediaPipe LLM API wraps it. iOS uses Core ML conversions.
- Browser: WebLLM and Transformers.js support E2B/E4B via WebGPU.
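For the local-server option above, a typical pattern is to talk to vLLM through its OpenAI-compatible API. The sketch below assumes you've already started the server yourself (for example with vllm serve); the port and model id are illustrative.

```python
# Query a locally served Gemma 4 through vLLM's OpenAI-compatible endpoint.
# Assumes a server is already running (e.g. started with `vllm serve`); the base_url
# and model id below are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```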
Quantization in practice
- Q4_K_M (GGUF): 31B fits on a single 24 GB GPU; quality loss ~1–2% on MMLU-Pro (a loading sketch follows this list).
- Q8_0: recommended for the 26B MoE; INT4 hurts router accuracy more than it hurts the dense layers.
- FP8: H100 / B200 production serving, near-lossless.
- E2B/E4B INT4: the on-device default. Targets 2–4 GB RAM.
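For the Q4_K_M route, a minimal llama-cpp-python loading sketch looks like the following. The GGUF filename is hypothetical, and the context size is deliberately set well below 256K to keep the KV cache manageable.

```python
# Load a Q4_K_M GGUF locally with llama-cpp-python. The filename is hypothetical;
# point model_path at whichever quantized Gemma 4 GGUF you actually downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-31b-it-Q4_K_M.gguf",  # hypothetical filename
    n_ctx=32768,        # raise toward 256K only if you have memory for the KV cache
    n_gpu_layers=-1,    # offload all layers to the GPU; set 0 for CPU-only
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give one tradeoff of Q4 quantization."}]
)
print(out["choices"][0]["message"]["content"])
```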
Fine-tuning
LoRA / QLoRA via Hugging Face PEFT works on all four sizes. For Gemma 4 26B MoE, freeze the routing layers — fine-tuning them with small datasets typically degrades quality. For E2B/E4B on-device adaptation, Google's Edge SDK now exposes a federated-learning hook that was experimental in the Gemma 3n era.
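A minimal PEFT sketch following those recommendations might look like this. The checkpoint id and target_modules names are assumptions about Gemma 4's module naming, not confirmed; inspect model.named_modules() on the real checkpoint before training. Router and gating modules are simply left out of target_modules, which keeps them frozen.

```python
# Minimal LoRA setup with Hugging Face PEFT. The checkpoint id and target_modules
# names are assumptions, not confirmed Gemma 4 module names. For QLoRA, add a
# 4-bit quantization_config when loading the base model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26b-a4b-it",   # illustrative 26B MoE checkpoint id
    device_map="auto",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    # Only these projections get adapters; router/gating modules are deliberately
    # omitted so they stay frozen, per the guidance above.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)   # freezes everything except the LoRA adapters
model.print_trainable_parameters()
```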
FAQ
Should I migrate from Gemma 3 27B to Gemma 4 31B?
Yes if you care about quality. The 31B is a real generational leap — MMLU-Pro 67.5 → 85.2, Codeforces Elo ~110 → ~2,150, LMArena 1339 → ~1452. The license switch to Apache 2.0 is also a meaningful unblock if you're in a regulated industry. The migration cost is mostly re-running fine-tunes against a new tokenizer and prompt template.
Is there a Gemma 4n?
No. The mobile-first work that defined Gemma 3n has been merged into Gemma 4's E2B and E4B sizes. There is no separate "n" SKU in the Gemma 4 generation.
How does Gemma 4 compare to Llama 4 and Qwen 3.6?
On MMLU-Pro, Qwen 3.5/3.6 leads at ~86.1, Gemma 4 31B follows at 85.2, Llama 4 trails. On AIME 2026, Gemma 4 31B (89.2) edges Llama 4 (88.3). On LiveCodeBench v6, Gemma 4 31B (80.0) beats Llama 4 (77.1). For coding specifically, see DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026) for a fuller picture across closed and open models.
What's the license?
Gemma 4: Apache 2.0. No usage caps, no field-of-use restrictions. Gemma 3 and Gemma 3n: Google's custom Gemma Terms of Use, with prohibited-use clauses and a "we can update terms" provision.
How big is the context window in practice?
The 26B MoE and 31B Dense advertise 256K tokens. In our internal needle-in-a-haystack tests, effective recall holds well to ~128K and degrades past 200K. For the E-series, the advertised window is 128K and effective recall is roughly 64K.
When should I enable thinking mode?
Multi-step math, code generation longer than ~50 lines, and tool-use chains. Skip it for chit-chat and simple extraction — the latency cost is real.
Related on Codersera
- DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026) — pillar comparison covering Gemma 4, DeepSeek V4, Claude 4.7, GPT-5.5, and Qwen 3.6.
- How to Run Gemma on a Mac — step-by-step local deployment (now refreshed for Gemma 4 via MLX/Ollama).
- How to Run Gemma on Windows — local deployment on Windows with WSL2 and DirectML.
- Gemma vs Qwen: open-source LLM comparison — Apache 2.0 vs Tongyi Qianwen license tradeoffs and benchmark deltas.