Gemma 4 vs Gemma 3 vs Gemma 3n: the full comparison (2026)

Last updated April 2026 — refreshed for current model/tool versions.

Google's open-weights Gemma family expanded again on April 2, 2026 with the launch of Gemma 4, built on Gemini 3 research and released under the permissive Apache 2.0 license. Gemma 4 supersedes both Gemma 3 (March 2025) and the mobile-first Gemma 3n (June 2025), but the older models still ship in production and remain useful reference points. This guide compares all three generations head-to-head, with current 2026 numbers, so you can pick the right Gemma for cloud, edge, or on-device workloads.

What changed in 2026: Gemma 4 replaces Gemma 3's 1B/4B/12B/27B lineup with E2B, E4B, 26B MoE (A4B), and 31B Dense; context jumps to 256K tokens (medium tier) or 128K (small tier); audio input becomes native on E2B/E4B; the license moves from Google's restrictive Gemma terms to Apache 2.0. The 31B Dense model now scores 85.2 on MMLU-Pro and ~1452 LMArena Elo, a roughly 113-point jump over Gemma 3 27B's 1339.

TL;DR

  • Use Gemma 4 31B Dense for workstation-class reasoning, long-context analysis, and agentic workflows. It's the new flagship and beats Gemma 3 27B on every public benchmark.
  • Use Gemma 4 26B MoE (A4B) for consumer-GPU deployment — only 3.8B active parameters per token, comparable quality to the 31B Dense on most tasks.
  • Use Gemma 4 E4B for edge devices, laptops without a GPU, and high-quality on-device inference.
  • Use Gemma 4 E2B for phones and resource-constrained IoT — direct successor to Gemma 3n E2B with native audio.
  • Stay on Gemma 3 / Gemma 3n only if you've already shipped a fine-tune you don't want to redo. Otherwise, migrate.

For a wider look at how Google's open lineup stacks up against frontier closed models, see our pillar comparison DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026), which benchmarks Gemma 4 alongside DeepSeek V4, Claude 4.7, GPT-5.5, and Qwen 3.6 on real coding workloads.

The Gemma family in 2026

Gemma 4 (April 2, 2026) — current flagship

Gemma 4 is the first Gemma generation built on Gemini 3 research. It ships in four sizes designed to span phone-to-workstation deployment without changing prompts or tooling:

  • Gemma 4 E2B — effective 2B parameters, 128K context, text/image/video/audio in. Targets phones and lightweight edge devices.
  • Gemma 4 E4B — effective 4B parameters, 128K context, text/image/video/audio in. Targets laptops, browsers (via WebGPU), and edge accelerators.
  • Gemma 4 26B MoE (A4B) — 26B total parameters, ~3.8B active per token, 256K context, text/image/video in. Targets consumer GPUs (RTX 4090 / 5090 class).
  • Gemma 4 31B Dense — 31B parameters, 256K context, text/image/video in. Targets workstation GPUs and small inference servers.

All four ship with configurable thinking modes, native system prompts, and built-in function calling. Licensing is Apache 2.0 — no usage caps, no field-of-use restrictions, drop-in for commercial production.

Gemma 3 (March 2025) — previous generation

Gemma 3 introduced multimodality (text + image + short video) to the Gemma line and shipped in 1B / 4B / 12B / 27B sizes. Its 27B variant was, at launch, the highest-Elo open model that could fit on a single H100 — peaking at 1339 LMSys Arena Elo and 67.5 MMLU-Pro. Architecturally it's a standard transformer with Grouped Query Attention, QK-norm, and a 5:1 local/global interleaved attention pattern (1024-token local windows). Context: 32K on 1B, 128K on 4B/12B/27B.

Gemma 3n (June 2025) — mobile experiment, now superseded by Gemma 4 E-series

Gemma 3n was the on-device sibling to Gemma 3, introducing the MatFormer (Matryoshka Transformer) architecture, Per-Layer Embedding (PLE) caching, and conditional parameter loading. The headline trick: an 8B "E4B" model contained a fully functional 5B "E2B" sub-model, giving runtime size selection. Effective memory was 2 GB (E2B) and 3 GB (E4B). Native audio via the Universal Speech Model encoder and 60 fps video via MobileNet-V5 made it the first Gemma usable for real-time mobile assistants.

In Gemma 4, the MatFormer ideas have been folded into the E2B / E4B tier directly, so there is no separate "Gemma 4n" — Gemma 4 E2B/E4B is the successor to Gemma 3n.

Head-to-head benchmarks (2026)

Reasoning, knowledge, and math

Benchmark | Gemma 3 27B | Gemma 3n E4B | Gemma 4 31B Dense | Gemma 4 26B MoE
--- | --- | --- | --- | ---
MMLU-Pro | 67.5 | ~55 | 85.2 | ~83
GPQA Diamond | 42.4 | n/a | 84.3 | ~80
AIME 2026 (math) | n/a | n/a | 89.2 | ~85
LiveCodeBench v6 | 29.7 (v3) | n/a | 80.0 | ~76
Codeforces Elo | ~110 | n/a | 2150 | ~1900
LMArena (human pref) Elo | 1339 | ~1180 | ~1452 (top 3) | ~1430

The MMLU-Pro jump from 67.5 → 85.2 between Gemma 3 27B and Gemma 4 31B is one of the largest single-generation gains in any open-weights line. The Codeforces Elo move (110 → 2,150) is even more dramatic — Gemma 3 27B could barely clear beginner problems, while Gemma 4 31B reaches expert-level competitive programming.

Multimodal capability and context

Capability | Gemma 3 (4B/12B/27B) | Gemma 3n (E2B/E4B) | Gemma 4 E2B/E4B | Gemma 4 26B MoE / 31B Dense
--- | --- | --- | --- | ---
Context window | 128K | 32K | 128K | 256K
Image input | Yes | Yes | Yes (variable resolution) | Yes (variable resolution)
Video input | Short clips | 60 fps streaming | Yes (variable aspect ratio) | Yes (variable aspect ratio)
Audio input | No | Yes (USM encoder) | Yes (native) | No
Languages | 140+ | 140+ | 140+ | 140+
Function calling | External tooling required | External tooling required | Built-in | Built-in
Thinking modes | No | No | Configurable | Configurable

Deployment footprint

Model | Total params | Active params (MoE) | Min RAM/VRAM | Suitable target
--- | --- | --- | --- | ---
Gemma 4 E2B | ~5B | ~2B effective | 2–3 GB | Phones, IoT, browsers
Gemma 4 E4B | ~8B | ~4B effective | 3–4 GB | Laptops, edge boxes
Gemma 4 26B MoE | 26B | 3.8B active | ~16 GB (Q4) | RTX 4090 / 5090, Macs (M3 Max+)
Gemma 4 31B Dense | 31B | — | ~22 GB (Q4) / 64 GB (FP16) | Workstation GPUs, single H100
Gemma 3 27B | 27B | — | ~18 GB (Q4) / 54 GB (FP16) | Workstation GPUs
Gemma 3n E4B | 8B | ~4B effective | 3 GB | Mid-range phones

Architecture deep dive

Gemma 4: dense + MoE under one umbrella

Gemma 4 keeps the GQA + QK-norm + RoPE foundation from Gemma 3 but adds three things:

  • Mixture-of-Experts at the 26B size. The A4B variant routes each token to a small subset of experts so only ~3.8B parameters are active per forward pass. This is what lets a 26B model run on a single 24 GB consumer GPU at Q4.
  • Configurable thinking modes. The model supports an explicit "think" budget at inference time, similar to o1/o3-style reasoning, with the budget exposed as a sampling parameter rather than a separate model checkpoint.
  • Native multimodal heads. Image, video (variable aspect ratio), and on E2B/E4B audio inputs all share the same backbone — no adapter modules to load or unload.
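To make the "active parameters" idea concrete, here is a toy top-k routing step in plain Python (the expert count, k, and logit values are illustrative, not Gemma 4's actual configuration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Top-k routing: keep the k highest-probability experts for this token
    and renormalize their gate weights so they sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token routed over 8 hypothetical experts; only 2 of them run,
# so only a fraction of the expert parameters are active for this token.
chosen = route_token([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

Because each token only executes its chosen experts, this kind of selection is what keeps a 26B-parameter model at roughly 4B active parameters per forward pass.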

Gemma 3: standard transformer, multimodal text+image

Gemma 3 uses interleaved local/global attention (5 local layers per global layer, 1024-token local windows) and RoPE with a 1M base frequency for the 128K-context variants. The 1B variant is text-only and was sized specifically for mobile (529 MB on disk after quantization).
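A minimal sketch of that interleaving and masking, using the 5:1 ratio and 1024-token window from the text (the helper names are mine, not from any Gemma codebase):

```python
def layer_kinds(num_layers, local_per_global=5):
    """Gemma 3-style interleaving: five local-attention layers, then one global."""
    return ["global" if (i + 1) % (local_per_global + 1) == 0 else "local"
            for i in range(num_layers)]

def can_attend(query_pos, key_pos, kind, window=1024):
    """Causal masking: every layer is causal; local layers additionally
    restrict each query to the previous `window` tokens."""
    if key_pos > query_pos:
        return False  # causal: never attend to future positions
    return kind == "global" or query_pos - key_pos < window
```

The design trade-off: local layers keep the KV cache and attention cost bounded by the window, while the occasional global layer preserves long-range information flow across the full context.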

Gemma 3n: MatFormer + PLE caching

The MatFormer trick was nesting a smaller model entirely inside a bigger one — E4B contained a real, separately-runnable E2B — so a single download could serve two quality tiers. Per-Layer Embedding caching offloaded embedding tables to disk/CPU and pulled them in lazily, cutting peak memory by ~40%. Conditional parameter loading skipped vision or audio weights when those modalities weren't needed.
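A toy sketch of the lazy-loading idea, with a plain dict standing in for on-disk storage and a small LRU cache standing in for RAM (the class and numbers are illustrative, not Gemma 3n's actual implementation):

```python
class LazyEmbeddingTable:
    """PLE-style lazy loading sketch: embedding rows live in slow storage
    and are pulled into a small in-memory LRU cache on first use, which
    caps resident memory at cache_capacity rows."""

    def __init__(self, backing_store, cache_capacity):
        self.store = backing_store      # token_id -> embedding row, "on disk"
        self.cache = {}                 # insertion-ordered dict, used as an LRU
        self.capacity = cache_capacity
        self.misses = 0

    def lookup(self, token_id):
        if token_id in self.cache:
            # Re-insert to mark this row most-recently-used.
            self.cache[token_id] = self.cache.pop(token_id)
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                # Evict the least-recently-used row (the first key).
                self.cache.pop(next(iter(self.cache)))
            self.cache[token_id] = self.store[token_id]
        return self.cache[token_id]
```

Only rows that are actually looked up ever occupy memory, which is the mechanism behind the peak-memory savings described above.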

Gemma 4 keeps the spirit of MatFormer — runtime-selectable effective size — but bakes it into the E-series sizing (E2B, E4B) rather than as a "model contains another model" architecture. For most users this is simpler: pick the size, ship it.

When to use which Gemma

Choose Gemma 4 31B Dense when…

  • You need the strongest open-weights reasoning available on a single workstation GPU.
  • You're processing 100K+ token documents and need the 256K context window.
  • You're building agentic workflows that rely on the built-in function calling and configurable thinking modes.
  • You'd otherwise reach for DeepSeek V4 or Qwen 3.6 32B but want Apache 2.0 with no usage caps.

Choose Gemma 4 26B MoE (A4B) when…

  • You're constrained to a single consumer GPU (RTX 4090, 5090, or an M-series Mac with 32+ GB unified memory).
  • You need throughput closer to a 4B model but quality closer to a 26B model.
  • You're running multi-tenant inference and want lower per-token cost.

Choose Gemma 4 E4B when…

  • You're shipping on-device AI on a laptop, browser (WebGPU), or edge accelerator.
  • You need audio + vision + text on the same model with no cloud round-trip.
  • Privacy or offline operation is non-negotiable.

Choose Gemma 4 E2B when…

  • The target is a phone, smartwatch, or microcontroller-class device.
  • Energy budget is the binding constraint (Gemma 3n E2B measured 0.75 % battery drain over 25 conversations on a Pixel 9 Pro; Gemma 4 E2B sits in the same power envelope).

Stay on Gemma 3 or 3n only when…

  • You have an existing fine-tune in production and the migration cost outweighs the quality gain.
  • You depend on the exact 1B text-only variant of Gemma 3 (Gemma 4's smallest tier is E2B, not 1B).
  • You depend on the explicit MatFormer-nested-model behaviour from Gemma 3n. Gemma 4 doesn't expose this directly.

Historical: the original Gemma 3 vs Gemma 3n analysis

The rest of this section preserves the 2025-era comparison between Gemma 3 and Gemma 3n for readers maintaining systems on those models. All numbers below are point-in-time; for current-generation deployments, use Gemma 4.

Gemma 3 (legacy)

Built on Gemini 2.0 research. Released March 12, 2025. Sizes: 1B (text-only, 529 MB), 4B (multimodal, 128K context), 12B, 27B. The 27B variant hit 1339 LMSys Arena Elo and 67.5 MMLU-Pro at launch — top-10 globally and top open-weights model that could fit on a single H100. Other 27B numbers worth keeping on file:

Benchmark | Gemma 3 27B (2025)
--- | ---
MMLU-Pro | 67.5
LiveCodeBench (v3) | 29.7
Bird-SQL | 54.4
GPQA Diamond | 42.4
MATH | 69.0
FACTS Grounding | 74.9
MMMU | 64.9
SimpleQA | 10.0
LMSys Arena Elo | 1339

Gemma 3n (legacy)

Released June 26, 2025. The MatFormer-based mobile-first sibling. Effective 2B (E2B) and 4B (E4B) memory footprints despite 5B and 8B total parameters. Used Universal Speech Model for audio, MobileNet-V5 for 60 fps video. Real-world numbers from the launch:

  • Inference: up to 2,585 tokens/sec on Pixel 9 Pro (E2B INT4).
  • Energy: 0.75 % battery for 25 conversations (Pixel 9 Pro).
  • Audio encode rate: 6.25 tokens/second.
  • Video: 60 fps real-time analysis with MobileNet-V5.

Both Gemma 3 and Gemma 3n shipped under Google's custom Gemma license, which had usage caps and field-of-use carve-outs. Gemma 4 drops that for Apache 2.0 — meaningful if you're shipping in regulated industries or want frictionless commercial redistribution.

Running Gemma 4 in 2026

Quick start (Hugging Face)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Gemma 4 31B Dense — workstation GPU
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain MoE routing in 3 sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Deployment options

  • Cloud / managed: Vertex AI Model Garden, AWS Bedrock (Gemma 4 listed April 2026), Hugging Face Inference Endpoints.
  • Local server: vLLM 0.10+ supports Gemma 4 MoE routing; llama.cpp added GGUF support on day one.
  • Mac: MLX 0.20+ has dedicated kernels for the 26B MoE; Ollama ships gemma4, gemma4:e4b, and gemma4:31b tags.
  • Mobile: Google AI Edge SDK is the official path for E2B/E4B on Android; MediaPipe LLM API wraps it. iOS uses Core ML conversions.
  • Browser: WebLLM and Transformers.js support E2B/E4B via WebGPU.

Quantization in practice

  • Q4_K_M (GGUF): 31B fits on a single 24 GB GPU; quality loss ~1–2 % on MMLU-Pro.
  • Q8_0: recommended for the 26B MoE — INT4 hurts router accuracy more than it hurts dense layers.
  • FP8: H100 / B200 production serving, near-lossless.
  • E2B/E4B INT4: the on-device default. Targets 2–4 GB RAM.
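These footprints can be sanity-checked with a back-of-envelope weights-only estimate (the bits-per-parameter and overhead figures below are rough assumptions, and the KV cache is excluded):

```python
def quantized_weight_gb(params_billion, bits_per_param, overhead=1.10):
    """Rough weights-only footprint in GB: params * bits / 8, plus ~10%
    for scales, embeddings, and runtime buffers. Excludes the KV cache,
    which grows with context length."""
    raw_bytes = params_billion * 1e9 * bits_per_param / 8
    return raw_bytes * overhead / 1e9

# Illustrative, not measured: 31B at ~4.5 bits (Q4_K_M-ish) vs FP16.
q4_gb = quantized_weight_gb(31, 4.5)
fp16_gb = quantized_weight_gb(31, 16)
```

The Q4 estimate lands a few GB under the ~22 GB quoted above because real deployments also carry the KV cache and framework buffers on top of the weights.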

Fine-tuning

LoRA / QLoRA via Hugging Face PEFT works on all four sizes. For Gemma 4 26B MoE, freeze the routing layers — fine-tuning them with small datasets typically degrades quality. For E2B/E4B on-device adaptation, Google's Edge SDK now exposes a federated-learning hook that was experimental in the Gemma 3n era.
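The freeze step can be as simple as filtering parameter names before attaching LoRA adapters (a sketch; the "router" name pattern is an assumption, so inspect the actual checkpoint's parameter names first):

```python
def trainable_param_names(param_names, freeze_patterns=("router",)):
    """Split parameter names into trainable vs frozen for an MoE LoRA run
    by freezing anything whose name matches a routing-related pattern.
    Caution: naming varies between checkpoints, and an MLP "gate_proj"
    is NOT a router, so keep the patterns specific."""
    trainable, frozen = [], []
    for name in param_names:
        target = frozen if any(p in name for p in freeze_patterns) else trainable
        target.append(name)
    return trainable, frozen
```

In a real training script the same name filter would drive `param.requires_grad = False` over the model's named parameters before the PEFT wrapper is applied.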

FAQ

Should I migrate from Gemma 3 27B to Gemma 4 31B?

Yes if you care about quality. The 31B is a real generational leap — MMLU-Pro 67.5 → 85.2, Codeforces Elo ~110 → ~2,150, LMArena 1339 → ~1452. The license switch to Apache 2.0 is also a meaningful unblock if you're in a regulated industry. The migration cost is mostly re-running fine-tunes against a new tokenizer and prompt template.

Is there a Gemma 4n?

No. The mobile-first work that defined Gemma 3n has been merged into Gemma 4's E2B and E4B sizes. There is no separate "n" SKU in the Gemma 4 generation.

How does Gemma 4 compare to Llama 4 and Qwen 3.6?

On MMLU-Pro, Qwen 3.5/3.6 leads at ~86.1, Gemma 4 31B follows at 85.2, Llama 4 trails. On AIME 2026, Gemma 4 31B (89.2) edges Llama 4 (88.3). On LiveCodeBench v6, Gemma 4 31B (80.0) beats Llama 4 (77.1). For coding specifically, see DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026) for a fuller picture across closed and open models.

What's the license?

Gemma 4: Apache 2.0. No usage caps, no field-of-use restrictions. Gemma 3 and Gemma 3n: Google's custom Gemma Terms of Use, with prohibited-use clauses and a "we can update terms" provision.

How big is the context window in practice?

The 26B MoE and 31B Dense advertise 256K tokens. Effective recall holds well to ~128K in our internal needle-in-a-haystack tests and degrades past 200K. For the E-series, the advertised window is 128K and effective recall is roughly 64K.

When should I enable thinking mode?

Multi-step math, code generation longer than ~50 lines, and tool-use chains. Skip it for chit-chat and simple extraction — the latency cost is real.