Last updated April 2026 — refreshed for current model/tool versions.
Google DeepMind released Gemma 4 on April 2, 2026 under the Apache 2.0 license, and the fastest path from a clean machine to a working local endpoint is still Ollama. This guide walks through the full Gemma 4 Ollama setup: pick the right variant for your hardware, install Ollama 0.22.x, pull the model, fix the silent 4K context default, and call the local REST API. Everything below has been verified against the official Hugging Face model card and the Ollama library page as of late April 2026.
What changed in 2026
- Gemma 4 launched April 2, 2026 with four variants (E2B, E4B, 26B-A4B MoE, and 31B Dense), all under Apache 2.0 with no MAU cap or acceptable-use restriction.
- Context jumped to 256K tokens on the 26B and 31B (128K on E2B/E4B). Ollama still defaults to 4K, so you must override num_ctx or you are leaving most of the model unused.
- The 26B is a Mixture-of-Experts with ~4B active parameters per token, so it runs faster than its 18 GB on-disk size suggests and fits a 24 GB GPU comfortably.
- Multimodal in, by default: all four variants take image input; E2B and E4B also take audio. Ollama exposes this through the standard images field on /api/generate.
- Inference-stack churn: in the first three weeks after launch, llama.cpp shipped 20+ bug fixes for Gemma 4 (tool calling, reasoning budget, sliding-window attention). Run Ollama 0.22.0 or newer; older builds have known correctness issues.
- Leaderboard placement: Gemma 4 31B sits at #3 on the LMArena open-model text leaderboard (~1452 ELO); the 26B MoE is #6 with only 4B active params.
Want the full picture? Read our continuously-updated Gemma 4 Complete Guide (2026) — small-footprint open weights, on-device deployment, and benchmarks.
TL;DR — which Gemma 4 tag to pull
| Variant | Ollama tag | Download | Min VRAM/RAM | Context | Best for |
|---|---|---|---|---|---|
| E2B | gemma4:e2b | ~7.2 GB | 4 GB | 128K | Phones, Pi 5, low-end laptops |
| E4B (default) | gemma4 / gemma4:e4b | ~9.6 GB | 6 GB | 128K | Most developer laptops; 16 GB M-series Macs |
| 26B MoE (A4B) | gemma4:26b | ~18 GB | 16 GB | 256K | RTX 3090/4080+, 32 GB M-series Macs |
| 31B Dense | gemma4:31b | ~20 GB | 20 GB | 256K | RTX 4090, M3/M4 Max with 64 GB |
If you are unsure, pull gemma4 (E4B). It is the variant Google and Hugging Face both flag as the best general-purpose starting point, and it fits the widest range of hardware without compromise. For more on running local agents on top of this endpoint, our OpenClaw + Ollama setup guide for running local AI agents is the pillar reference.
What is Gemma 4?
Gemma 4 is Google DeepMind's fourth generation of open-weight models, distilled from the same research that produced Gemini 3. Compared to Gemma 3 it is a substantially different family — different architecture (alternating local sliding-window and global full-context attention layers), different size lineup, and a much wider modality surface. All four checkpoints are released as both base and instruction-tuned (IT) variants on Hugging Face under google/gemma-4-*.
- E2B — 2.3B effective / 5.1B total parameters. Designed to run on phones.
- E4B — 4.5B effective / 8B total. The on-device flagship.
- 26B-A4B — 26B total / 4B activated per token. The MoE workhorse for consumer GPUs.
- 31B Dense — full 31B parameters active per token. Workstation-class quality.
Need help deciding between Gemma generations? Our Gemma 4 vs Gemma 3 vs Gemma 3n breakdown covers what changed and which is right for which use case, and the Gemma 4 vs Llama 4 deployment comparison covers cross-family choice.
Hardware requirements (verified)
The numbers below come from the Hugging Face model card and from independent local testing on r/LocalLLaMA and Anass Kartit's MacBook benchmark. Treat the VRAM minimums as floors with no headroom for the KV cache; add at least 25% if you plan to push context past 32K.
- E2B: 4 GB VRAM or 8 GB RAM. Runs on Raspberry Pi 5 (8 GB), iGPUs, M1 Air. ~95 tok/s on M-series.
- E4B: 6 GB VRAM or 12 GB unified RAM. ~57 tok/s on Ollama on an M4 Pro. The "just works" tier.
- 26B (MoE): 16 GB VRAM minimum at Q4_K_M. The MoE design means inference cost is closer to a 4B model — but the full 26B has to live in memory.
- 31B Dense: 20 GB VRAM at Q4_K_M; 32 GB+ if you want Q8 or long context. Reported ~7.5 tok/s eval rate at Q4 on a 64 GB M1 Max.
Apple Silicon: Unified memory means the VRAM figure equals the RAM figure. A 16 GB M2 runs E4B with room to keep an IDE open; 32 GB clears the 26B MoE; 64 GB+ is required for the 31B at any useful context. Ollama 0.22 ships with the MLX runner enabled by default, which adds fused top-P/top-K sampling for noticeably faster prompt processing on M-series chips.
AMD GPUs: ROCm 6.x with an RX 7900 XT/XTX or W7900 is the supported configuration. Day-0 support shipped for all four variants.
Step 1 — Install Ollama
You want Ollama 0.22.0 or later (April 2026). Earlier 0.21.x builds predate the llama.cpp Gemma 4 fixes and will produce subtly worse output, especially for tool calling.
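Once one of the installers below has run, the version floor can also be checked from a script rather than by eye. The following is a minimal sketch, not an official tool: `MIN_VERSION`, `parse_version`, `is_new_enough`, and `running_server_version` are names made up for this example, and the network helper assumes the server's `GET /api/version` endpoint returns a JSON object with a `version` field.

```python
import json
import re
import urllib.request

MIN_VERSION = (0, 22, 0)  # version floor taken from this guide


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_new_enough(text, minimum=MIN_VERSION):
    """True when the version found in `text` is at or above `minimum`."""
    return parse_version(text) >= minimum


def running_server_version(base_url="http://localhost:11434"):
    """Ask a running Ollama server for its version (assumes GET /api/version)."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return json.load(resp)["version"]
```

`is_new_enough` accepts either the raw output of `ollama --version` or the string returned by `running_server_version()`, so the same check works in a shell wrapper or a health probe.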
macOS
# Recommended: download the .dmg from ollama.com (ships the desktop app + MLX runner)
# Or via Homebrew:
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the .exe installer from ollama.com. The Ollama service starts automatically and listens on localhost:11434.
Confirm:
ollama --version
# ollama version is 0.22.0 (or newer)
Step 2 — Pull a Gemma 4 model
# Default tag — E4B (~9.6 GB)
ollama pull gemma4
# Specific variants:
ollama pull gemma4:e2b # ~7.2 GB
ollama pull gemma4:26b # ~18 GB — MoE
ollama pull gemma4:31b # ~20 GB — dense
Verify:
ollama list
Pulls resume from where they stopped if your connection drops, so you do not need to start over. Default quantization is Q4_K_M, which is a good speed/quality balance for all variants. Ollama also publishes Q8_0 and bf16 tags (e.g. gemma4:31b-q8_0) for users with VRAM headroom; expect a 2–3% benchmark improvement at 2× the memory cost.
Step 3 — Run it
ollama run gemma4
>>> Explain transformer attention in two sentences for a software engineer.
Exit with /bye. For one-shot scripted use:
ollama run gemma4 "Write a Python function that flattens a nested list."
The configuration step that actually matters: num_ctx
Ollama's default context window is 4096 tokens, regardless of what the model supports. If you do nothing, you are running a 256K-context model in a 4K box. Always override.
For the current session
>>> /set parameter num_ctx 32768
Common values: 16384 (16K), 32768 (32K), 131072 (128K, full E4B), 262144 (256K, full 31B/26B). KV-cache memory grows roughly linearly with num_ctx, so start at 32K and only push higher when your task genuinely needs it.
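The same override can be sent per request through the native API, which is handy when the caller is a script rather than the interactive prompt. This is a sketch under the assumption that `/api/generate` accepts an `options` object containing `num_ctx`; the helper names `build_generate_payload` and `generate` are invented for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(prompt, model="gemma4", num_ctx=32768):
    """Build a non-streaming request body that overrides the context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context override
    }


def generate(prompt, **kwargs):
    """POST the payload to a local Ollama server and return the response text."""
    payload = build_generate_payload(prompt, **kwargs)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["response"]
```

Calling `generate("Summarize this file", num_ctx=16384)` would apply the smaller window to that one call only. Setting the value per request keeps the default model tag untouched, at the cost of having to remember it in every caller.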
Permanently — Modelfile
FROM gemma4
PARAMETER num_ctx 32768
PARAMETER num_gpu 99
num_gpu 99 means "offload as many layers as possible to GPU." Drop it on CPU-only setups. Build:
ollama create gemma4-32k -f Modelfile
ollama run gemma4-32k
The new tag persists across sessions and shows up in ollama list.
Calling Gemma 4 from your code
Ollama exposes both a native API and an OpenAI-compatible shim at http://localhost:11434. The OpenAI shim is what makes Ollama drop-in compatible with most existing SDKs and agent frameworks.
Generate (native)
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"prompt": "Summarize why local AI inference matters for developer privacy.",
"stream": false
}'
Chat (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"messages": [
{"role": "user", "content": "What are three good uses of a 256K context window?"}
]
}'
Multimodal: send an image
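# Note: base64 -w0 is the GNU coreutils form; on macOS, base64 -i screenshot.png is the usual equivalent.
curl http://localhost:11434/api/generate \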
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"prompt": "What is in this screenshot?",
"images": ["'"$(base64 -w0 screenshot.png)"'"],
"stream": false
}'
Python
import requests
r = requests.post(
"http://localhost:11434/api/generate",
json={
"model": "gemma4",
"prompt": "List five practical uses for local LLM inference.",
"stream": False,
},
)
print(r.json()["response"])
Point any OpenAI SDK at http://localhost:11434/v1 with a dummy API key and most call sites work unchanged. For full agent stacks on top of the Ollama endpoint (function calling, tool routing, web search), the OpenClaw + Ollama setup guide for running local AI agents covers the integration end-to-end.
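To make the "dummy API key" point concrete without depending on any particular SDK, here is a standard-library sketch of the request an OpenAI-style client would send. The key value `"ollama"` is an arbitrary placeholder chosen for this example, and `build_chat_request` and `chat` are illustrative names, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"
DUMMY_API_KEY = "ollama"  # placeholder; assumed to be ignored by the local server


def build_chat_request(messages, model="gemma4"):
    """Build the POST request an OpenAI-style client would issue."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {DUMMY_API_KEY}",
        },
        method="POST",
    )


def chat(messages, model="gemma4"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages, model)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

An SDK does the same thing once its base URL is set to `BASE_URL`, so existing call sites usually only need the base URL and key swapped.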
How to choose: a decision tree
- Phone, Pi, or 8 GB laptop? → gemma4:e2b.
- 16 GB laptop or M1/M2 Mac? → gemma4 (E4B). Default for a reason.
- RTX 3090/4080 or 32 GB Mac? → gemma4:26b. MoE makes it the best quality-per-token-second tier.
- RTX 4090 or 64 GB+ Mac, and you need maximum quality? → gemma4:31b.
- Long-document RAG / large-codebase analysis? → 26B or 31B at num_ctx 131072+. The smaller variants top out at 128K and the KV cache will dominate.
- Vision-heavy workload? → All variants accept image input; pick by the text-quality tier you need.
- Audio in? → E2B or E4B only.
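The tree above can be approximated in a few lines of code. This is only a sketch: the memory floors and context limits are copied from the TL;DR table, the 25% headroom past 32K comes from the hardware section (which calls it a minimum), and `pick_tag` with its parameters is a hypothetical helper. It treats its input as GPU or unified memory, not system RAM.

```python
# Minimum VRAM (GB) and maximum context per tag, copied from this guide's table.
MIN_VRAM_GB = {"gemma4:e2b": 4, "gemma4:e4b": 6, "gemma4:26b": 16, "gemma4:31b": 20}
MAX_CONTEXT = {
    "gemma4:e2b": 131072,
    "gemma4:e4b": 131072,
    "gemma4:26b": 262144,
    "gemma4:31b": 262144,
}
AUDIO_TAGS = {"gemma4:e2b", "gemma4:e4b"}  # only these accept audio input


def required_vram_gb(tag, num_ctx):
    """Table floor, plus 25% headroom once context goes past 32K."""
    floor = MIN_VRAM_GB[tag]
    return floor * 1.25 if num_ctx > 32768 else float(floor)


def pick_tag(vram_gb, num_ctx=32768, need_audio=False, max_quality=False):
    """Return the largest tag that fits, or None if nothing does."""
    candidates = []
    for tag in MIN_VRAM_GB:
        if need_audio and tag not in AUDIO_TAGS:
            continue
        if num_ctx > MAX_CONTEXT[tag]:
            continue
        # The dense 31B is only considered when maximum quality is requested.
        if tag == "gemma4:31b" and not max_quality:
            continue
        if required_vram_gb(tag, num_ctx) <= vram_gb:
            candidates.append(tag)
    if not candidates:
        return None
    return max(candidates, key=MIN_VRAM_GB.get)
```

Preferring the 26B MoE unless `max_quality` is set mirrors the guide's advice that the MoE is the better throughput tier; flip that default if quality matters more than speed for your workload.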
Performance and benchmarks (April 2026)
Headline numbers from the official Hugging Face model card for the instruction-tuned variants:
| Benchmark | 31B Dense | 26B-A4B (MoE) | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 | 89.2% | 88.3% | 42.5% | 37.5% |
| LiveCodeBench | 80.0% | 77.1% | 52.0% | 44.0% |
| MMMU Pro (vision) | 76.9% | 73.8% | 52.6% | 44.2% |
| LMArena (text only, ELO) | ~1452 (#3 open) | ~1441 (#6 open) | — | — |
Real-world Ollama throughput, measured by independent testers on consumer hardware:
- E2B on M-series: ~95 tok/s.
- E4B on M4 Pro: ~57 tok/s.
- 31B Q4_K_M on M1 Max 64 GB: ~7.5 tok/s eval rate. Slow but usable for batch work.
- 26B MoE on RTX 4090: 35–45 tok/s in practice; significantly faster than the 31B Dense at comparable quality.
Coding-specific note: Gemma 4 31B's reported Codeforces ELO jumped from ~110 (Gemma 3) to ~2150 — a generational leap that makes the 31B a serious option for offline coding assistants if you have the hardware.
Common pitfalls and troubleshooting
You forgot to set num_ctx
Symptoms: model "forgets" anything earlier in the conversation, RAG retrieval looks broken, long-document answers are wrong. Fix: /set parameter num_ctx 32768 or bake it into a Modelfile.
GPU not detected, model fell back to CPU
ollama ps
If the PROCESSOR column reports CPU instead of GPU, add PARAMETER num_gpu 99 to your Modelfile and recreate the model. On Linux, confirm CUDA visibility with nvidia-smi; on Windows, confirm the latest NVIDIA driver is installed; on AMD, confirm ROCm 6.x is on PATH. There was a known Flash Attention misreporting issue in Ollama 0.21.0–0.21.2 that was fixed in 0.22.0.
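The same check can be scripted against the server's process listing. The sketch below assumes `GET /api/ps` returns a `models` list whose entries carry `size` and `size_vram` byte counts; verify those field names against your Ollama version. `gpu_fraction` and `loaded_models` are illustrative names.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Share of a loaded model that sits in GPU memory, from 0.0 to 1.0."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url="http://localhost:11434"):
    """Fetch the currently loaded models (assumes GET /api/ps)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])
```

A fraction of 0.0 for an entry from `loaded_models()` suggests a full CPU fallback, while a value between 0 and 1 points to partial offload.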
Out of memory
Either drop a tier (gemma4:e4b instead of :26b) or reduce context (num_ctx 8192). KV-cache for 256K context on the 31B can exceed 12 GB on its own.
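To see why context size dominates memory so quickly, a back-of-the-envelope KV-cache estimate helps. The formula below is the generic one for full attention (keys plus values, per layer, per KV head). The layer count, head count, and head dimension in the usage note are hypothetical placeholders, not Gemma 4's real architecture, and sliding-window layers cache less than this, so treat the result as an upper bound.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Upper-bound KV-cache size for full attention at a given context length.

    The leading 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 1024**3
```

With a made-up shape of 48 layers, 8 KV heads, and a head dimension of 128, `to_gib(kv_cache_bytes(32768, 48, 8, 128))` gives the cache cost at 32K, and doubling `num_ctx` doubles it. That linear scaling is the reason stepping the context down is usually the cheapest fix for an out-of-memory error.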
Tool calling produces malformed output
r/LocalLLaMA reported ~15% format error rates on Gemma 4 31B function calling at 4-bit in the first week. Most of these were llama.cpp bugs, not model bugs — they are fixed in Ollama 0.22.0. If you are still seeing them, upgrade and consider running Q8 instead of Q4 for tool-calling workloads.
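If you want to measure the format error rate on your own workload, a small validator over the returned tool calls is enough. This sketch assumes the assistant message carries a `tool_calls` list whose items have a `function` with `name` and `arguments`, and that arguments may arrive either as an object or as a JSON string depending on the endpoint. `malformed_tool_calls` and the shape of `known_tools` are invented for this example.

```python
import json


def malformed_tool_calls(message, known_tools):
    """Return human-readable problems found in message["tool_calls"].

    known_tools maps each tool name to the set of argument names it requires.
    """
    problems = []
    for index, call in enumerate(message.get("tool_calls") or []):
        function = call.get("function") or {}
        name = function.get("name")
        arguments = function.get("arguments")
        if name not in known_tools:
            problems.append(f"call {index}: unknown tool {name!r}")
            continue
        # Accept arguments as a JSON string or as an already-parsed object.
        if isinstance(arguments, str):
            try:
                arguments = json.loads(arguments)
            except json.JSONDecodeError:
                problems.append(f"call {index}: arguments are not valid JSON")
                continue
        if not isinstance(arguments, dict):
            problems.append(f"call {index}: arguments are not an object")
            continue
        missing = known_tools[name] - set(arguments)
        if missing:
            problems.append(f"call {index}: missing {sorted(missing)}")
    return problems
```

Running this over a batch of responses before and after an upgrade, or at Q4 versus Q8, gives a concrete error count to compare instead of an impression.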
Slow first token, fast subsequent tokens
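A slow first token usually has two causes: prompt processing, which scales with how much of the context window you actually fill, and model load time when the model was unloaded between calls. For the first, trim the prompt or lower num_ctx; for the second, set OLLAMA_KEEP_ALIVE=-1 to keep the model resident so the next call skips load time: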
# Linux (systemd):
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
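# macOS (desktop app): launchctl setenv OLLAMA_KEEP_ALIVE -1, then restart Ollama.
# Windows: add OLLAMA_KEEP_ALIVE as a user environment variable, then restart Ollama.
Ollama not responding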
ollama serve
curl http://localhost:11434
Performance tips that actually move the needle
- Right-size the context. 32K is enough for almost everything. 128K and 256K are for documents and codebases, not chat.
- Keep the model warm. OLLAMA_KEEP_ALIVE=-1 is the single biggest latency win for interactive use.
- Use the MoE. The 26B is faster than the 31B at ~95% of the quality. Default to it once you have the VRAM.
- Partial offload. If you are 1–2 GB short, num_gpu 30 (or whatever fits) keeps most layers on GPU and offloads the rest to CPU, which is far better than running fully on CPU.
- Stay current. Ollama is shipping weekly. Run ollama --version, then update if it is older than 0.22.0.
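The "keep the model warm" tip can also be applied per request instead of server-wide. The sketch below assumes the native API accepts a `keep_alive` field and that a request with no prompt simply loads the model; `build_warmup_payload` and `warm_up` are hypothetical helper names.

```python
import json
import urllib.request


def build_warmup_payload(model="gemma4", keep_alive=-1):
    """Request body with no prompt: asks the server to load the model only.

    keep_alive=-1 asks the server to keep the model resident indefinitely.
    """
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def warm_up(model="gemma4", base_url="http://localhost:11434"):
    """Preload the model so the first real request skips load time."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_warmup_payload(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `warm_up()` at application startup moves the load cost out of the first user-facing request, which is useful when you cannot change the server's environment variables.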
FAQ
Is Gemma 4 free for commercial use?
Yes. Gemma 4 ships under the Apache 2.0 license — no MAU caps, no acceptable-use restrictions, full commercial freedom. This is a meaningful change from earlier Gemma releases that used a custom Google license.
What's the real difference between the 26B MoE and the 31B Dense?
Quality is within 2–4 points on most benchmarks. Throughput is the deciding factor: the 26B activates ~4B params per token, the 31B activates all 31B. On the same GPU, the 26B is roughly 4–5× faster at inference. Pick the 31B only when the quality gap matters and you have the wall-clock budget.
Can I fine-tune Gemma 4 locally?
Yes. Unsloth ships QLoRA recipes that fit Gemma 4 E4B fine-tuning into ~12 GB VRAM and Gemma 4 31B into ~24 GB. Hugging Face transformers supports full fine-tuning for users with H100-class hardware.
Does Ollama support Gemma 4's vision and audio inputs?
Vision yes — pass base64 images via the images field on /api/generate. Audio support on E2B/E4B is being added across runtimes; check ollama.com/library/gemma4 for the current status of audio-capable tags.
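For the vision path, the only non-obvious step is the base64 encoding. A minimal sketch of building the request body in Python follows; `build_image_payload` is an illustrative name, and the payload shape follows the `images` field described above.

```python
import base64


def build_image_payload(image_bytes, prompt, model="gemma4"):
    """Build an /api/generate body with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "prompt": prompt,
        "images": [encoded],  # list of base64 strings, one per image
        "stream": False,
    }
```

Read the file with `open("screenshot.png", "rb").read()` and pass the bytes in; the resulting dictionary can be posted as JSON exactly like the text-only examples earlier in this guide.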
Why does my Gemma 4 output look different from the Hugging Face demo?
Almost always one of three things: (1) wrong context window — Ollama defaults to 4K; (2) Q4_K_M quantization vs the bf16 demo; (3) llama.cpp bug fixes that landed after your Ollama version was built. Update Ollama, set num_ctx, and consider Q8_0 for parity testing.
Should I use Gemma 4 or Llama 4 for local inference?
Different sweet spots. Llama 4's smallest variant is larger than Gemma 4 E4B, so on a 16 GB laptop Gemma 4 wins by default. Above 24 GB VRAM the comparison is closer — see our Gemma 4 vs Llama 4 local deployment comparison for benchmark detail.
Is the 4K context default ever going to change?
No public timeline. Ollama maintains the 4K default for safety on small machines. Treat the override as a required setup step.
Where this fits in your stack
A local Gemma 4 endpoint is a building block, not a finished product. Common paths from here:
- Wire it into an agent framework — see the OpenClaw + Ollama setup guide for running local AI agents (the pillar reference for this cluster).
- Read the full Gemma 4 PC and devices guide for non-Ollama options (LM Studio, llama.cpp directly, MLX).
- Compare Gemma against the Qwen family in our Gemma vs Qwen comparison.
References and further reading
- Gemma 4: Byte for byte, the most capable open models — Google blog (April 2, 2026)
- Gemma 4 — Google DeepMind model page
- Welcome Gemma 4: Frontier multimodal intelligence on device — Hugging Face announcement
- Gemma 4 model card — Google AI for Developers
- gemma4 — Ollama library page (tags, sizes, defaults)
- Ollama releases — GitHub (current 0.22.x)
- Gemma 4 — How to Run Locally (Unsloth Documentation)
- r/LocalLLaMA — Gemma 4 discussion threads