Last updated: May 1, 2026
Gemma 4 is the most consequential open-weight model release of the year so far, and not just because of the benchmarks. Google shipped four model sizes, native multimodality, a 256K context window on the larger variants, and — for the first time in the Gemma line — a clean Apache 2.0 license. For engineering teams that have been waiting for an open-weight model good enough to actually replace a frontier API for a meaningful chunk of their workload, this is the first credible candidate from Google.
This guide is the long version: what the family looks like, what the architecture actually does, what the benchmark numbers mean in practice, how it stacks up against Llama 4, Qwen 3.5, DeepSeek V4 Flash, and its own predecessors, where to host it, and where it falls short. If you are evaluating Gemma 4 for production, this is the document to send your team.
TL;DR
- Released: April 2, 2026, by Google DeepMind.
- Family: four sizes — Gemma 4 E2B (~2.3B effective), E4B (~4.5B effective), 26B A4B (Mixture-of-Experts, 4B active), and 31B dense.
- License: Apache 2.0. This is new. Earlier Gemma generations shipped under the custom Gemma Terms of Use, which had usage carve-outs that made enterprise legal review painful. Gemma 4 dropped that.
- Context window: 128K tokens on E2B/E4B; 256K on 26B A4B and 31B.
- Multimodal: all sizes accept text + image; E2B and E4B also accept audio. Output is text-only.
- Strong points: reasoning, math (AIME 2026 ~89%), code generation (LiveCodeBench v6 ~80%), long-context recall, and on-device deployment via MediaPipe / LiteRT.
- Weak points: trails Qwen 3.5 27B on SWE-bench Verified, no native speech output, and Gemma is not Gemini — fine-tuning, weights, and serving are now your problem.
What Gemma 4 Is, And How It Differs From Gemini
Gemma is Google's open-weight model family. Gemini is Google's closed, hosted, frontier model family. They share research lineage — Gemma 4 is described by Google as "built from Gemini 3 research" — but the deployment story is different.
With Gemini you call an API, you pay per token, you do not get the weights, and you cannot fine-tune the underlying parameters (you get adapters at best). With Gemma 4 you download the weights from Hugging Face, Kaggle, or Ollama, you run them on your own hardware (or a cloud GPU you rent), you fine-tune fully, and your unit economics are GPU hours and electricity rather than per-token API spend.
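To make the contrast concrete: the whole weights-in-hand workflow fits in a few lines of transformers code. A minimal sketch, assuming a hypothetical repo id (check the actual model card for the published name):

```python
# Minimal local-inference sketch with Hugging Face transformers.
# "google/gemma-4-e4b-it" is an assumed repo id -- substitute the real one.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-e4b-it"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Summarise Apache 2.0 in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Once that from_pretrained call finishes, the weights are on your disk: quantise them, fine-tune them, or ship them, with no per-token meter running.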
The practical implication: Gemma 4 is the model you reach for when you need on-device inference, when you need to fine-tune on private data, when your token volume makes a hosted API uneconomical, or when you need an air-gapped deployment. Gemini is the model you reach for when you want zero-ops frontier intelligence and you are happy to pay for it.
For a deeper feature-level walkthrough, see our companion piece Google Gemma 4 review: benchmarks, features, and how to run it locally.
The Gemma 4 Family
Four sizes, two architectural patterns (dense and MoE), and a clear split between edge and server tiers.
| Variant | Architecture | Total / Active params | Context | Modalities in | Primary target |
|---|---|---|---|---|---|
| Gemma 4 E2B | Dense | ~2.3B effective | 128K | Text, image, audio | Phones, IoT, low-power laptops |
| Gemma 4 E4B | Dense | ~4.5B effective | 128K | Text, image, audio | High-end phones, edge servers, Raspberry Pi-class |
| Gemma 4 26B A4B | Mixture-of-Experts | 26B total / ~4B active per token | 256K | Text, image | Single high-end GPU server, cost-sensitive throughput |
| Gemma 4 31B | Dense | 30.7B | 256K | Text, image | Quality-first server inference, fine-tuning |
The "E" in E2B/E4B is for edge, not experts. These are dense models built for on-device. The 26B A4B is the MoE: 4 billion parameters fire on any given forward pass, so latency and cost behave like a 4B model, while quality benefits from the full 26B parameter pool. The 31B is the no-tricks dense model — slower than the MoE, but typically the highest-quality answer when you need the best response per query rather than the best response per dollar.
If you are deciding which one to actually pull, our breakdown Gemma 4 vs Gemma 3 vs Gemma 3n: which makes sense in 2026 walks the decision tree per workload.
Architecture, Context Window, And Tokenizer
Gemma 4 keeps the decoder-only transformer skeleton that has defined the family but tightens almost every component. Highlights worth knowing:
- Hybrid attention. Gemma 4 interleaves local sliding-window attention with full global attention, with the final layer always global. Smaller dense models use 512-token sliding windows; larger models use 1024. Because local layers cache keys and values only for their window, KV-cache memory grows with context length in the global layers alone, which is what makes 256K feasible (back-of-envelope maths after this list).
- RULER long-context recall. On RULER at 128K, Gemma 3 scored 13.5%. Gemma 4 scores 66.4% on the same test. The context window is not just nominal — it actually retrieves at depth.
- Vocabulary. 262,144-token vocabulary, BPE with byte fallback. Strong multilingual coverage — 140+ languages.
- Vision tokens. Variable visual budget (70, 140, 280, 560, or 1120 tokens per image), so you trade quality against context spend.
- Audio (E2B/E4B only). Native speech recognition and audio understanding, no separate ASR layer required for many use cases.
- Reasoning mode. Gemma 4 can produce 4,000+ tokens of explicit reasoning before committing to an answer, plus native function-calling and structured JSON output.
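A back-of-envelope KV-cache calculation shows why the interleave matters. Every hyperparameter below is an assumption for illustration (Gemma 4's published config may differ); the ratio between the two totals is the point:

```python
# KV-cache memory: local sliding-window layers cache only their window,
# global layers cache the full context. All hyperparameters are assumed.
bytes_per = 2                                  # fp16 K/V entries
n_layers, n_kv_heads, head_dim = 48, 8, 128    # assumed model shape
context, window = 256_000, 1024

def kv_bytes(effective_ctx):
    return 2 * effective_ctx * n_kv_heads * head_dim * bytes_per  # K + V, one layer

all_global = n_layers * kv_bytes(context)
# Assume a 5:1 local:global interleave -> 40 local layers, 8 global.
hybrid = 40 * kv_bytes(window) + 8 * kv_bytes(context)

print(f"all-global: {all_global / 1e9:.1f} GB")  # ~50.3 GB
print(f"hybrid:     {hybrid / 1e9:.1f} GB")      # ~8.6 GB
```

Same context length, roughly a 6x smaller cache: that is the difference between 256K fitting on one accelerator and not.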
The MoE in 26B A4B is the architectural story to internalise: it lets a single A100 80GB or two consumer GPUs serve a model that punches well above 4B in quality terms, at roughly 4B in cost terms. That is the new dominant design point for the open-weight server tier in 2026.
License: Apache 2.0, Finally
Read this section carefully if you have ever had Legal kill a Gemma rollout.
Earlier Gemma releases shipped under the Gemma Terms of Use, a custom license. It was more permissive than Llama 2's, but it included a Prohibited Use Policy with clauses around harm to minors, attacks on critical infrastructure, generation of CSAM, and other broad carve-outs. The clauses were defensible in spirit, but enterprise legal teams routinely flagged the language as ambiguous and asked for indemnification or scope-limiting before signing off. That friction kept Gemma out of plenty of production stacks.
Gemma 4 ships under Apache 2.0. No custom restrictions, no usage carve-outs, no monthly active user thresholds (as the Llama 4 Community License imposes). Apache 2.0 explicitly grants commercial use, modification, and redistribution, including of derivative works such as fine-tuned weights. There is one obvious constraint that still applies: Apache 2.0 does not grant trademark rights, so you cannot ship a product called "Gemma" or imply Google endorsement.
This is materially less restrictive than the previous Gemma Terms of Use, and noticeably less restrictive than Llama 4's Community License (which is free for organisations under 700M monthly active users but adds compliance language). For most engineering teams, this is the change that turns Gemma from "interesting" into "approvable."
Two caveats worth being honest about. First, Apache 2.0 governs the weights; it does not give you the training data or the training pipeline. Gemma 4 is open-weight, not open-source in the strict OSI sense applied to data. Second, Google can still publish acceptable-use guidelines separately; nothing about Apache 2.0 prevents that. Today, the license file in the repo is the controlling document — and that document is Apache 2.0.
Benchmarks That Actually Matter
Headline numbers for Gemma 4 31B (instruction-tuned), pulled from Google's model card and independent reproductions on the LM Studio and Hugging Face threads:
| Benchmark | Gemma 4 31B | Gemma 3 27B | Llama 4 Scout (109B) | Qwen 3.5 27B | DeepSeek V4 Flash |
|---|---|---|---|---|---|
| MMLU-Pro | 85.2 | ~67 | ~78 | 86.1 | ~84 |
| GPQA Diamond | 84.3 | 42.4 | ~70 | 85.5 | ~80 |
| LiveCodeBench v6 | 80.0 | 29.1 | ~55 | ~78 | ~74 |
| SWE-bench Verified | ~63 | ~22 | ~48 | 72.4 | ~64 |
| AIME 2026 (math) | 89.2 | 20.8 | ~55 | ~85 | ~82 |
| Codeforces Elo | 2,150 | 110 | ~1,500 | ~1,950 | ~1,800 |
Approximate values for non-Gemma rows are pulled from each project's own card or the Artificial Analysis index; treat them as directional. The story they tell is consistent:
- Gemma 4 31B is in the same neighbourhood as Qwen 3.5 27B on knowledge and reasoning. They trade leadership benchmark by benchmark.
- Gemma 4 has the upper hand on math and competitive programming.
- Qwen 3.5 27B still wins SWE-bench Verified — the benchmark that most closely tracks "can this model close a real GitHub issue." If your primary use case is autonomous code editing on real repos, evaluate Qwen 3.5 alongside Gemma 4 before you commit.
- Gemma 4's gain over Gemma 3 is enormous — multiple benchmarks improved 3–20×. Most teams running Gemma 3 in production should plan a migration window.
For the pairwise drilldowns: Gemma 4 vs Llama 4 for local deployment, Gemma 4 vs Gemma 3: what changed and should you switch, Gemma 4n vs Gemma 4, and our DeepSeek V4 complete guide.
Where To Run Gemma 4
You have three deployment surfaces: hosted, self-hosted server, and on-device.
Hosted
If you want zero ops, the model is a one-line call away on several providers (a minimal client sketch follows this list):
- Vertex AI (Model Garden). First-party. You can fine-tune on Vertex AI Training Clusters and serve through Model Garden endpoints. Pay for compute time on the underlying accelerator (A2/G2 family or TPUs).
- OpenRouter. Aggregates 11+ providers for the 26B A4B model at roughly $0.06 per million input tokens and $0.33 per million output. Useful for prototyping and price-sensitive batch work.
- Together AI, Fireworks, Groq, DeepInfra, Hugging Face Inference. All have Gemma 4 endpoints. Pricing varies but the open-weight competitive market keeps it low.
- Cloud Run with GPU. Google's serverless GPU runtime can host Gemma 4 with scale-to-zero, which is attractive for spiky workloads.
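As a concrete example of "one-line call away", here is the OpenRouter route through the standard OpenAI-compatible client. The model slug is an assumption; check openrouter.ai/models for the live id:

```python
# OpenRouter speaks the OpenAI API dialect, so the stock client works.
# The model slug below is an assumption -- look up the real one.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")
resp = client.chat.completions.create(
    model="google/gemma-4-26b-a4b",  # hypothetical slug
    messages=[{"role": "user", "content": "Give three uses for a 256K context window."}],
)
print(resp.choices[0].message.content)
```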
Self-hosted server
vLLM is the production default. It supports Gemma 4 on NVIDIA, AMD, and Google Cloud TPUs from day one. Approximate hardware floors (a launch sketch follows the table):
| Variant | Quant / format | VRAM floor | Notes |
|---|---|---|---|
| 26B A4B | AWQ INT4 | ~15 GB | RTX 4090 24 GB with KV-cache headroom |
| 26B A4B | GGUF Q4_K_M | ~16 GB | llama.cpp / Ollama dev box |
| 26B A4B | FP16 | ~52 GB | A100 80GB or H100; serves at full quality |
| 31B dense | FP16 | ~62 GB | A100 80GB or H100 single-GPU |
| 31B dense | INT4 | ~18 GB | RTX 4090 / 5090 — viable for single-user inference |
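A minimal vLLM launch for the quantised MoE might look like this; the repo id and quantisation setting are assumptions, so match them to the checkpoint you actually pull:

```python
# Offline batch inference with vLLM. Repo id and quantization are assumed --
# align both with the quantised checkpoint you downloaded.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26b-a4b-it",  # hypothetical repo id
    quantization="awq",                 # matches the ~15 GB AWQ INT4 row above
    max_model_len=32_768,               # cap the KV cache to fit 24 GB cards
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain MoE vs dense trade-offs in two sentences."], params)
print(outputs[0].outputs[0].text)
```

For serving rather than batch work, the same model id goes to vLLM's OpenAI-compatible server and your client code stays unchanged.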
Ollama covers the local-laptop use case for E2B, E4B, and the quantised 26B/31B. MLX with Metal acceleration runs all variants on Apple Silicon — an M3 Max or M4 Pro with 32–64 GB unified memory will run the 26B A4B comfortably. AMD has day-zero Gemma 4 support across ROCm and the Ryzen AI stack. NVIDIA NIM, NeMo, LM Studio, Unsloth, SGLang, and LiteRT-LM all have first-class support.
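On the laptop path, the Ollama Python client is about as short as it gets. The model tag is an assumption; ollama list shows what your install actually pulled:

```python
# Chat against a locally served quant via the ollama Python package.
# The tag "gemma4:26b-a4b" is assumed -- use whatever tag you pulled.
import ollama

resp = ollama.chat(
    model="gemma4:26b-a4b",  # hypothetical tag
    messages=[{"role": "user", "content": "What fits in 24 GB of VRAM?"}],
)
print(resp["message"]["content"])
```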
For a step-by-step Ollama setup, see How to run Gemma 4 with Ollama, and for a hardware-centric walkthrough, Run Gemma 4 on your PC and devices locally.
On-device with MediaPipe and LiteRT
The E2B and E4B variants are explicitly designed for phones and edge devices. The deployment stack is MediaPipe's LLM Inference API on top of LiteRT, which handles model loading, memory, and hardware acceleration (GPU or NPU) automatically. Approximate footprints, with the sizing arithmetic sketched after the list:
- E2B Q4_K_M: ~1.3 GB on disk, 2–3 GB RAM at runtime.
- E4B Q4_K_M: ~2.5 GB on disk, 4–5 GB RAM at runtime.
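Those figures follow from simple arithmetic: parameters times effective bits per weight. A sketch, assuming Q4_K_M averages roughly 4.6 bits per weight (the real figure varies by tensor mix):

```python
# Rough on-disk size of a Q4_K_M quant. The 4.6 bits/weight average is an
# assumption; K-quants mix 4-, 5-, and 6-bit blocks across tensors.
def q4_km_disk_gb(params_billions, bits_per_weight=4.6):
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"E2B: {q4_km_disk_gb(2.3):.1f} GB")  # ~1.3 GB
print(f"E4B: {q4_km_disk_gb(4.5):.1f} GB")  # ~2.6 GB
```

Runtime RAM lands higher than disk because the KV cache, activations, and the runtime itself sit on top of the weights.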
This is the path for "AI features that work without a network round-trip" — voice agents on Android, in-browser RAG over a user's local documents, and offline coding helpers. With audio input native to E2B/E4B, you can ship a meaningful voice-to-text-to-action loop without bundling a separate ASR model.
When To Choose Gemma 4 Over Alternatives
Reach for Gemma 4 when:
- You need an Apache 2.0 model. If Legal balked at Gemma 3's terms or Llama's Community License MAU clause, Gemma 4 is the cleanest option in this size class.
- You need on-device multimodality. The audio-capable E2B/E4B variants are the strongest open-weight option for phones today.
- Long context matters. 256K with credible RULER recall is competitive with hosted frontier models.
- Math, agentic reasoning, or competitive programming dominate your workload. Gemma 4 31B's AIME and Codeforces numbers are exceptional for an open-weight model in this size band.
Choose something else when:
- Your workload is autonomous repo editing. Qwen 3.5 27B's SWE-bench Verified lead is real. Pilot both before committing.
- You need streaming voice output. Gemma 4 has audio in but not out. Qwen 3.5-Omni handles real-time speech generation.
- You need a frontier model. If quality is the only metric, hosted Gemini 3 Pro or DeepSeek V4 Pro will outperform Gemma 4 31B on most benchmarks.
- Cost-per-token at huge scale. DeepSeek V4 Flash hosted is cheap enough that for many workloads the spend math beats running your own GPUs.
Known Issues And License Caveats
- SWE-bench Verified is not the strong suit. Real GitHub issue resolution still trails Qwen 3.5 27B by a meaningful margin.
- No native audio output. If you want a voice agent that talks back, you bolt on a separate TTS layer.
- 26B A4B throughput surprise. Despite only 4B active parameters, community benchmarks on consumer GPUs show ~11 tok/s on an RTX 4090 — slower than a comparable dense 4B model. The MoE routing overhead is real on consumer hardware. On A100/H100 the gap closes.
- Apache 2.0 ≠ open-source training data. The weights are open and commercially usable; the training corpus is not. If your compliance posture requires reproducibility from data, Gemma 4 does not satisfy that.
- Trademark. You cannot brand your product as "Gemma" or use Google trademarks. Apache 2.0 explicitly excludes trademark grants.
- Vision token budget tradeoff. The 70/140/280/560/1120 visual budgets are real — undersized budgets degrade OCR and chart reading noticeably. Pick deliberately.
- Native dependencies in adjacent tooling. If you self-host vLLM behind a Node service, better-sqlite3-style native modules can fail to fetch prebuilt binaries on locked-down installs; the failure mode is silent at install time and loud at runtime.
- Tokenizer drift from Gemma 3. The 262K-token vocabulary is not identical to Gemma 3's, so Gemma 3 fine-tunes and adapters are not directly weight-compatible. Plan a re-finetune; do not try to port adapters.
FAQ
Is Gemma 4 actually open-source?
It is open-weight under Apache 2.0. The weights, model card, and inference code are open and commercially usable. The training data and full pipeline are not released. By the OSI's strict definition, that is open-weight, not open-source — but for most commercial deployment purposes, Apache 2.0 is the cleanest license you will see in this size class.
Is the Gemma 4 license really Apache 2.0?
Yes. This is the change from earlier Gemma versions, which used the custom Gemma Terms of Use with usage carve-outs. Gemma 4's repository ships the standard Apache 2.0 license file. Anyone telling you Gemma 4 has restrictive terms is describing the previous generation.
What is the difference between Gemma 4 and Gemini?
Gemma 4 is open-weight and self-hostable; Gemini is a closed, hosted, frontier model. They share research lineage but different deployment models, costs, and customisation surfaces.
Which Gemma 4 model should I pick?
E2B for phones and tight memory budgets, E4B for high-end edge and small servers, 26B A4B for cost-efficient single-GPU server inference, 31B dense for highest-quality answers when you do not care about throughput.
What hardware do I need to run Gemma 4 31B?
FP16 needs roughly 62 GB VRAM — A100 80GB or H100. INT4 quantised drops that to about 18 GB, fitting an RTX 4090 or 5090 for single-user inference.
Does Gemma 4 support function calling?
Yes. Native function calling, structured JSON output, and system instructions are first-class.
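A hedged sketch of what that looks like in practice, via any OpenAI-compatible Gemma 4 endpoint (vLLM's server, OpenRouter, and most hosts speak this dialect); the URL and model id below are placeholders:

```python
# Function-calling sketch against an assumed local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical tool
        "description": "Current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gemma-4-26b-a4b",                        # placeholder id
    messages=[{"role": "user", "content": "Weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)           # the model's structured call, if any
```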
How does Gemma 4 compare to Llama 4?
Gemma 4 31B beats Llama 4 Scout (109B total parameters) on most reasoning benchmarks at well under a third of the total parameter count, and ships under a less restrictive license.
Is Gemma 4 better than Qwen 3.5?
It depends on the workload. Gemma 4 wins on math and competitive programming; Qwen 3.5 27B wins on MMLU-Pro, GPQA Diamond, and SWE-bench Verified. Both are Apache 2.0. Pilot both.
Is Gemma 4 multimodal?
All variants accept text and image. E2B and E4B also accept audio. Output is text-only on every variant.
What is the context window?
128K tokens on E2B/E4B; 256K on 26B A4B and 31B. RULER long-context recall at 128K is roughly 66.4% — a 5× improvement over Gemma 3.
Can Gemma 4 run on a phone?
Yes. E2B and E4B are designed for it. MediaPipe's LLM Inference API and LiteRT handle on-device inference with NPU and GPU acceleration on Android, and equivalent paths exist on iOS via Core ML / MLX.
What is Gemma 4n?
"Gemma 4n" is the community shorthand for the E2B / E4B edge variants — the on-device tier of the Gemma 4 family. Architecturally they are dense models tuned and quantised for phones and embedded devices. See Gemma 4n vs Gemma 4 for the side-by-side.
Is Gemma 4 safe for commercial production use?
Yes, under Apache 2.0, with the standard caveats: respect trademarks, do not redistribute the model under the Gemma name, and follow your own jurisdiction's AI usage law. There are no usage carve-outs, MAU thresholds, or industry restrictions in the license itself.
Should I migrate from Gemma 3 to Gemma 4?
If you are running Gemma 3 in production, yes. The benchmark deltas are large (3–20× on reasoning and code), the license is cleaner, the context window is bigger, and the deployment story is unchanged. Plan a re-finetune — adapter weights will not transfer cleanly.
Next Steps
Picking the right open-weight model is the easy half. The hard half is hiring engineers who can fine-tune it on your data, harden the inference path, and ship it without bricking your unit economics.
Hire a Codersera-vetted Python or ML engineer who has actually deployed Gemma-class models on vLLM, MediaPipe, and MLX. Vetted technical fit, remote-ready, and a risk-free trial so you only keep the engineers who deliver.