Last updated April 2026 — refreshed for current model/tool versions.
Google's open-weights Gemma family expanded again on April 2, 2026 with the launch of Gemma 4, built on Gemini 3 research and released under the permissive Apache 2.0 license. Gemma 4 supersedes both Gemma 3 (March 2025) and the mobile-first Gemma 3n (June 2025), but the older models still ship in production and remain useful reference points. This guide compares all three generations head-to-head, with current 2026 numbers, so you can pick the right Gemma for cloud, edge, or on-device workloads.
What changed in 2026: Gemma 4 replaces Gemma 3's 1B/4B/12B/27B lineup with E2B, E4B, 26B MoE (A4B), and 31B Dense; context jumps to 256K tokens on the 26B MoE and 31B Dense and 128K on the E-series; audio input becomes native on E2B/E4B; and the license moves from Google's restrictive Gemma terms to Apache 2.0. The 31B Dense model now scores 85.2 on MMLU-Pro and ~1452 LMArena Elo, a jump of more than 100 Elo points over Gemma 3 27B's 1339.
TL;DR
- Use Gemma 4 31B Dense for workstation-class reasoning, long-context analysis, and agentic workflows. It's the new flagship and beats Gemma 3 27B on every public benchmark.
- Use Gemma 4 26B MoE (A4B) for consumer-GPU deployment — only 3.8B active parameters per token, comparable quality to the 31B Dense on most tasks.
- Use Gemma 4 E4B for edge devices, laptops without a GPU, and high-quality on-device inference.
- Use Gemma 4 E2B for phones and resource-constrained IoT — direct successor to Gemma 3n E2B with native audio.
- Stay on Gemma 3 / Gemma 3n only if you've already shipped a fine-tune you don't want to redo. Otherwise, migrate.
For a wider look at how Google's open lineup stacks up against frontier closed models, see our pillar comparison DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026), which benchmarks Gemma 4 alongside DeepSeek V4, Claude 4.7, GPT-5.5, and Qwen 3.6 on real coding workloads.
The Gemma family in 2026
Gemma 4 (April 2, 2026) — current flagship
Gemma 4 is the first Gemma generation built on Gemini 3 research. It ships in four sizes designed to span phone-to-workstation deployment without changing prompts or tooling:
- Gemma 4 E2B — effective 2B parameters, 128K context, text/image/video/audio in. Targets phones and lightweight edge devices.
- Gemma 4 E4B — effective 4B parameters, 128K context, text/image/video/audio in. Targets laptops, browsers (via WebGPU), and edge accelerators.
- Gemma 4 26B MoE (A4B) — 26B total parameters, ~3.8B active per token, 256K context, text/image/video in. Targets consumer GPUs (RTX 4090 / 5090 class).
- Gemma 4 31B Dense — 31B parameters, 256K context, text/image/video in. Targets workstation GPUs and small inference servers.
All four ship with configurable thinking modes, native system prompts, and built-in function calling. Licensing is Apache 2.0 — no usage caps, no field-of-use restrictions, drop-in for commercial production.
Gemma 3 (March 2025) — previous generation
Gemma 3 introduced multimodality (text + image + short video) to the Gemma line and shipped in 1B / 4B / 12B / 27B sizes. Its 27B variant was, at launch, the highest-Elo open model that could fit on a single H100 — peaking at 1339 LMSys Arena Elo and 67.5 MMLU-Pro. Architecturally it's a standard transformer with Grouped Query Attention, QK-norm, and a 5:1 local/global interleaved attention pattern (1024-token local windows). Context: 32K on 1B, 128K on 4B/12B/27B.
Gemma 3n (June 2025) — mobile experiment, now superseded by Gemma 4 E-series
Gemma 3n was the on-device sibling to Gemma 3, introducing the MatFormer (Matryoshka Transformer) architecture, Per-Layer Embedding (PLE) caching, and conditional parameter loading. The headline trick: an 8B "E4B" model contained a fully functional 5B "E2B" sub-model, giving runtime size selection. Effective memory was 2 GB (E2B) and 3 GB (E4B). Native audio via the Universal Speech Model encoder and 60 fps video via MobileNet-V5 made it the first Gemma usable for real-time mobile assistants.
In Gemma 4, the MatFormer ideas have been folded into the E2B / E4B tier directly, so there is no separate "Gemma 4n" — Gemma 4 E2B/E4B is the successor to Gemma 3n.
Head-to-head benchmarks (2026)
Reasoning, knowledge, and math
| Benchmark | Gemma 3 27B | Gemma 3n E4B | Gemma 4 31B Dense | Gemma 4 26B MoE |
|---|---|---|---|---|
| MMLU-Pro | 67.5 | ~55 | 85.2 | ~83 |
| GPQA Diamond | 42.4 | n/a | 84.3 | ~80 |
| AIME 2026 (math) | n/a | n/a | 89.2 | ~85 |
| LiveCodeBench v6 | 29.7 (v3) | n/a | 80.0 | ~76 |
| Codeforces Elo | ~110 | n/a | 2150 | ~1900 |
| LMArena (human pref) Elo | 1339 | ~1180 | ~1452 (top 3) | ~1430 |
The MMLU-Pro jump from 67.5 → 85.2 between Gemma 3 27B and Gemma 4 31B is one of the largest single-generation gains in any open-weights line. The Codeforces Elo move (110 → 2,150) is even more dramatic — Gemma 3 27B could barely clear beginner problems, while Gemma 4 31B reaches expert-level competitive programming.
Multimodal capability and context
| Capability | Gemma 3 (4B/12B/27B) | Gemma 3n (E2B/E4B) | Gemma 4 E2B/E4B | Gemma 4 26B MoE / 31B Dense |
|---|---|---|---|---|
| Context window | 128K | 32K | 128K | 256K |
| Image input | Yes | Yes | Yes (variable resolution) | Yes (variable resolution) |
| Video input | Short clips | 60 fps streaming | Yes (variable aspect ratio) | Yes (variable aspect ratio) |
| Audio input | No | Yes (USM encoder) | Yes (native) | No |
| Languages | 140+ | 140+ | 140+ | 140+ |
| Function calling | External tooling required | External tooling required | Built-in | Built-in |
| Thinking modes | No | No | Configurable | Configurable |
Deployment footprint
| Model | Total params | Active / effective params | Min RAM/VRAM | Suitable target |
|---|---|---|---|---|
| Gemma 4 E2B | ~5B | ~2B effective | 2–3 GB | Phones, IoT, browsers |
| Gemma 4 E4B | ~8B | ~4B effective | 3–4 GB | Laptops, edge boxes |
| Gemma 4 26B MoE | 26B | 3.8B active | ~16 GB (Q4) | RTX 4090 / 5090, Macs (M3 Max+) |
| Gemma 4 31B Dense | 31B | — | ~22 GB (Q4) / 64 GB (FP16) | Workstation GPUs, single H100 |
| Gemma 3 27B | 27B | — | ~18 GB (Q4) / 54 GB (FP16) | Workstation GPUs |
| Gemma 3n E4B | 8B | ~4B effective | 3 GB | Mid-range phones |
Architecture deep dive
Gemma 4: dense + MoE under one umbrella
Gemma 4 keeps the GQA + QK-norm + RoPE foundation from Gemma 3 but adds three things:
- Mixture-of-Experts at the 26B size. The A4B variant routes each token to a small subset of experts so only ~3.8B parameters are active per forward pass. This is what lets a 26B model run on a single 24 GB consumer GPU at Q4. A minimal routing sketch follows this list.
- Configurable thinking modes. The model supports an explicit "think" budget at inference time, similar to o1/o3-style reasoning, with the budget exposed as a sampling parameter rather than a separate model checkpoint.
- Native multimodal heads. Image, video (variable aspect ratio), and on E2B/E4B audio inputs all share the same backbone — no adapter modules to load or unload.
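To make the first item concrete, here is a minimal top-k routing sketch in PyTorch. It shows the general MoE mechanism (route each token to a few experts, then mix their outputs by router weight); the expert count, k, and dimensions are illustrative and are not Gemma 4's published configuration.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    """Route each token to its top-k experts and mix the outputs by router weight."""
    logits = router(x)                                   # (tokens, n_experts)
    weights, idx = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                     # tokens whose slot-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Toy usage: compute per token scales with k, not with the total number of experts,
# which is why a 26B-total model can behave like a ~4B-active model at inference time.
d_model, n_experts = 64, 8
router = torch.nn.Linear(d_model, n_experts)
experts = [
    torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model), torch.nn.GELU(),
                        torch.nn.Linear(4 * d_model, d_model))
    for _ in range(n_experts)
]
y = moe_forward(torch.randn(10, d_model), router, experts)
print(y.shape)  # torch.Size([10, 64])
```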
Gemma 3: standard transformer, multimodal text+image
Gemma 3 uses interleaved local/global attention (5 local layers per global layer, 1024-token local windows) and RoPE with a 1M base frequency for the 128K-context variants. The 1B variant is text-only and was sized specifically for mobile (529 MB on disk after quantization).
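As a quick illustration of that interleave, the sketch below lays out the repeating 5:1 schedule. The layer count passed in is arbitrary; only the pattern and the 1024-token local window come from the description above.

```python
# Sketch of Gemma 3's 5:1 local/global attention schedule.
LOCAL_WINDOW = 1024                     # tokens visible to each local layer
PATTERN = ["local"] * 5 + ["global"]    # 5 sliding-window layers, then 1 full-attention layer

def attention_schedule(n_layers: int) -> list[str]:
    """Assign each transformer layer 'local' (1024-token window) or 'global'."""
    return [PATTERN[i % len(PATTERN)] for i in range(n_layers)]

print(attention_schedule(12))
# ['local', 'local', 'local', 'local', 'local', 'global',
#  'local', 'local', 'local', 'local', 'local', 'global']
```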
Gemma 3n: MatFormer + PLE caching
The MatFormer trick was nesting a smaller model entirely inside a bigger one — E4B contained a real, separately-runnable E2B — so a single download could serve two quality tiers. Per-Layer Embedding caching offloaded embedding tables to disk/CPU and pulled them in lazily, cutting peak memory by ~40%. Conditional parameter loading skipped vision or audio weights when those modalities weren't needed.
Gemma 4 keeps the spirit of MatFormer — runtime-selectable effective size — but bakes it into the E-series sizing (E2B, E4B) rather than as a "model contains another model" architecture. For most users this is simpler: pick the size, ship it.
When to use which Gemma
Choose Gemma 4 31B Dense when…
- You need the strongest open-weights reasoning available on a single workstation GPU.
- You're processing 100K+ token documents and need the 256K context window.
- You're building agentic workflows that rely on the built-in function calling and configurable thinking modes.
- You'd otherwise reach for DeepSeek V4 or Qwen 3.6 32B but want Apache 2.0 with no usage caps.
Choose Gemma 4 26B MoE (A4B) when…
- You're constrained to a single consumer GPU (RTX 4090, 5090, or an M-series Mac with 32+ GB unified memory).
- You need throughput closer to a 4B model but quality closer to a 26B model.
- You're running multi-tenant inference and want lower per-token cost.
Choose Gemma 4 E4B when…
- You're shipping on-device AI on a laptop, browser (WebGPU), or edge accelerator.
- You need audio + vision + text on the same model with no cloud round-trip.
- Privacy or offline operation is non-negotiable.
Choose Gemma 4 E2B when…
- The target is a phone, smartwatch, or microcontroller-class device.
- Energy budget is the binding constraint (the 0.75%-battery-per-25-conversations figure was measured on Gemma 3n's E2B; Gemma 4 E2B targets the same power envelope).
Stay on Gemma 3 or 3n only when…
- You have an existing fine-tune in production and the migration cost outweighs the quality gain.
- You depend on the exact 1B text-only variant of Gemma 3 (Gemma 4's smallest tier is E2B, not 1B).
- You depend on the explicit MatFormer-nested-model behaviour from Gemma 3n. Gemma 4 doesn't expose this directly.
Historical: the original Gemma 3 vs Gemma 3n analysis
The rest of this section preserves the 2025-era comparison between Gemma 3 and Gemma 3n for readers maintaining systems on those models. All numbers below are point-in-time; for current-generation deployments, use Gemma 4.
Gemma 3 (legacy)
Built on Gemini 2.0 research. Released March 12, 2025. Sizes: 1B (text-only, 529 MB), 4B (multimodal, 128K context), 12B, 27B. The 27B variant hit 1339 LMSys Arena Elo and 67.5 MMLU-Pro at launch — top-10 globally and top open-weights model that could fit on a single H100. Other 27B numbers worth keeping on file:
| Benchmark | Gemma 3 27B (2025) |
|---|---|
| MMLU-Pro | 67.5 |
| LiveCodeBench (v3) | 29.7 |
| Bird-SQL | 54.4 |
| GPQA Diamond | 42.4 |
| MATH | 69.0 |
| FACTS Grounding | 74.9 |
| MMMU | 64.9 |
| SimpleQA | 10.0 |
| LMSys Arena Elo | 1339 |
Gemma 3n (legacy)
Released June 26, 2025. The MatFormer-based mobile-first sibling. Effective 2B (E2B) and 4B (E4B) memory footprints despite 5B and 8B total parameters. Used Universal Speech Model for audio, MobileNet-V5 for 60 fps video. Real-world numbers from the launch:
- Inference: up to 2,585 tokens/sec on Pixel 9 Pro (E2B INT4).
- Energy: 0.75 % battery for 25 conversations (Pixel 9 Pro).
- Audio encode rate: 6.25 tokens/second.
- Video: 60 fps real-time analysis with MobileNet-V5.
Both Gemma 3 and Gemma 3n shipped under Google's custom Gemma license, which had usage caps and field-of-use carve-outs. Gemma 4 drops that for Apache 2.0 — meaningful if you're shipping in regulated industries or want frictionless commercial redistribution.
Running Gemma 4 in 2026
Quick start (Hugging Face)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Gemma 4 31B Dense on a workstation GPU
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain MoE routing in 3 sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
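Continuing from the quick start (reusing tokenizer and model), here is a hedged sketch of the built-in function calling. It assumes the Gemma 4 chat template accepts tool schemas through transformers' generic tools= argument; get_weather is a made-up example tool, not a real API.

```python
# Function-calling sketch, continuing from the quick start above (reuses `tokenizer`
# and `model`). Assumes the Gemma 4 chat template supports transformers' generic
# `tools=` argument; `get_weather` is a hypothetical example tool.
def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 22 C"

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],           # schema is derived from the signature and docstring
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
# The model should emit a structured tool call; parse it, run get_weather, append a
# "tool" message with the result, and generate again for the final answer.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```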
Deployment options
- Cloud / managed: Vertex AI Model Garden, AWS Bedrock (Gemma 4 listed April 2026), Hugging Face Inference Endpoints.
- Local server: vLLM 0.10+ supports Gemma 4 MoE routing; llama.cpp added GGUF support on day one. A minimal client sketch follows this list.
- Mac: MLX 0.20+ has dedicated kernels for the 26B MoE; Ollama ships `gemma4`, `gemma4:e4b`, and `gemma4:31b` tags.
- Mobile: Google AI Edge SDK is the official path for E2B/E4B on Android; MediaPipe LLM API wraps it. iOS uses Core ML conversions.
- Browser: WebLLM and Transformers.js support E2B/E4B via WebGPU.
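For the local-server option above, a typical pattern is to talk to vLLM through its OpenAI-compatible API. The sketch below assumes you've already started the server yourself (for example with vllm serve); the port and model id are illustrative.

```python
# Query a locally served Gemma 4 through vLLM's OpenAI-compatible endpoint.
# Assumes a server is already running (e.g. started with `vllm serve`); the base_url
# and model id below are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```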
Quantization in practice
- Q4_K_M (GGUF): 31B fits on a single 24 GB GPU; quality loss ~1–2% on MMLU-Pro (a loading sketch follows this list).
- Q8_0: recommended for the 26B MoE; INT4 hurts router accuracy more than it hurts the dense layers.
- FP8: H100 / B200 production serving, near-lossless.
- E2B/E4B INT4: the on-device default. Targets 2–4 GB RAM.
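For the Q4_K_M route, a minimal llama-cpp-python loading sketch looks like the following. The GGUF filename is hypothetical, and the context size is deliberately set well below 256K to keep the KV cache manageable.

```python
# Load a Q4_K_M GGUF locally with llama-cpp-python. The filename is hypothetical;
# point model_path at whichever quantized Gemma 4 GGUF you actually downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-31b-it-Q4_K_M.gguf",  # hypothetical filename
    n_ctx=32768,        # raise toward 256K only if you have memory for the KV cache
    n_gpu_layers=-1,    # offload all layers to the GPU; set 0 for CPU-only
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give one tradeoff of Q4 quantization."}]
)
print(out["choices"][0]["message"]["content"])
```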
Fine-tuning
LoRA / QLoRA via Hugging Face PEFT works on all four sizes. For Gemma 4 26B MoE, freeze the routing layers — fine-tuning them with small datasets typically degrades quality. For E2B/E4B on-device adaptation, Google's Edge SDK now exposes a federated-learning hook that was experimental in the Gemma 3n era.
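A minimal PEFT sketch following those recommendations might look like this. The checkpoint id and target_modules names are assumptions about Gemma 4's module naming, not confirmed; inspect model.named_modules() on the real checkpoint before training. Router and gating modules are simply left out of target_modules, which keeps them frozen.

```python
# Minimal LoRA setup with Hugging Face PEFT. The checkpoint id and target_modules
# names are assumptions, not confirmed Gemma 4 module names. For QLoRA, add a
# 4-bit quantization_config when loading the base model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26b-a4b-it",   # illustrative 26B MoE checkpoint id
    device_map="auto",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    # Only these projections get adapters; router/gating modules are deliberately
    # omitted so they stay frozen, per the guidance above.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)   # freezes everything except the LoRA adapters
model.print_trainable_parameters()
```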
FAQ
Should I migrate from Gemma 3 27B to Gemma 4 31B?
Yes if you care about quality. The 31B is a real generational leap — MMLU-Pro 67.5 → 85.2, Codeforces Elo ~110 → ~2,150, LMArena 1339 → ~1452. The license switch to Apache 2.0 is also a meaningful unblock if you're in a regulated industry. The migration cost is mostly re-running fine-tunes against a new tokenizer and prompt template.
Is there a Gemma 4n?
No. The mobile-first work that defined Gemma 3n has been merged into Gemma 4's E2B and E4B sizes. There is no separate "n" SKU in the Gemma 4 generation.
How does Gemma 4 compare to Llama 4 and Qwen 3.6?
On MMLU-Pro, Qwen 3.5/3.6 leads at ~86.1, Gemma 4 31B follows at 85.2, Llama 4 trails. On AIME 2026, Gemma 4 31B (89.2) edges Llama 4 (88.3). On LiveCodeBench v6, Gemma 4 31B (80.0) beats Llama 4 (77.1). For coding specifically, see DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026) for a fuller picture across closed and open models.
What's the license?
Gemma 4: Apache 2.0. No usage caps, no field-of-use restrictions. Gemma 3 and Gemma 3n: Google's custom Gemma Terms of Use, with prohibited-use clauses and a "we can update terms" provision.
How big is the context window in practice?
The 26B MoE and 31B Dense advertise 256K tokens. In our internal needle-in-a-haystack tests, effective recall holds well to ~128K and degrades past 200K. For the E-series, the advertised window is 128K and effective recall is roughly 64K.
When should I enable thinking mode?
Multi-step math, code generation longer than ~50 lines, and tool-use chains. Skip it for chit-chat and simple extraction — the latency cost is real.
Related on Codersera
- DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026) — pillar comparison covering Gemma 4, DeepSeek V4, Claude 4.7, GPT-5.5, and Qwen 3.6.
- How to Run Gemma on a Mac — step-by-step local deployment (now refreshed for Gemma 4 via MLX/Ollama).
- How to Run Gemma on Windows — local deployment on Windows with WSL2 and DirectML.
- Gemma vs Qwen: open-source LLM comparison — Apache 2.0 vs Tongyi Qianwen license tradeoffs and benchmark deltas.