How to Run Gemma 3 and Gemma 4 on Ubuntu: 2026 Setup Guide
Last updated April 2026 — refreshed for Gemma 4 (released April 2, 2026), current Ollama releases, and the Gemma 3 lineage that still ships in production.
Google DeepMind released Gemma 4 on April 2, 2026 under Apache 2.0, superseding the Gemma 3 lineup released in March 2025. This guide is the practical reference for running both Gemma 3 and Gemma 4 on Ubuntu — which models fit which GPU, the exact ollama and Hugging Face commands that work today, the CUDA setup that the official Ollama Linux docs require, and the troubleshooting steps for the issues r/LocalLLaMA is actually hitting in April 2026 (KV-cache pressure, llama.cpp tokenizer bugs, function-calling format errors at 4-bit).
What changed in 2026
- Gemma 4 launched April 2, 2026 with four variants: E2B (~2.3B effective, edge), E4B (~4.5B effective, edge), 26B-A4B (Mixture-of-Experts, 4B active) and 31B Dense, all under Apache 2.0, a friendlier license than the Gemma 3 custom terms.
- Context windows doubled on the larger sizes: 26B and 31B Gemma 4 ship with 256K context; E2B/E4B keep 128K. Gemma 3 was 32K (1B) / 128K (4B–27B).
- Multimodality expanded. Gemma 4 E2B/E4B accept text, image, video and audio; the 26B/31B models accept text, image and silent video. Gemma 3's 1B was text-only; 4B/12B/27B handled text+image only.
- LMArena: Gemma 4 31B-IT scores ~1452 and 26B-A4B ~1441, putting the 31B at #3 among open models on the text leaderboard at launch.
- Ollama tag layout changed. gemma4:e2b, gemma4:e4b, gemma4:26b and gemma4:31b are the official tags; Gemma 3's gemma3:1b/4b/12b/27b tags remain available.
- Tooling caveats. Several llama.cpp tokenizer/KV-cache bugs against Gemma 4 landed and were patched in April 2026, so quantized weights uploaded before the patches will misbehave. Pull fresh GGUFs.
TL;DR — which Gemma to run
| Your GPU / RAM | Best pick (April 2026) | Command |
|---|---|---|
| CPU only / 8 GB RAM | Gemma 4 E2B | ollama run gemma4:e2b |
| RTX 3060 / 8–12 GB VRAM | Gemma 4 E4B or Gemma 3 4B | ollama run gemma4:e4b |
| RTX 4070 / 16 GB VRAM | Gemma 3 12B (Q4) | ollama run gemma3:12b |
| RTX 4090 / 24 GB VRAM | Gemma 4 26B-A4B (MoE) | ollama run gemma4:26b |
| RTX 5090 / A100 40 GB+ | Gemma 4 31B Dense | ollama run gemma4:31b |
| Mac M-series (covered separately) | MLX runtime — see the Mac guide | — |
If you intend to wire this into an agent loop (tool use, MCP, file ops), pair the model with the OpenClaw + Ollama setup guide for running local AI agents — that pillar covers the orchestration layer that this article deliberately leaves out of scope.
Gemma 3 vs Gemma 4 at a glance
| | Gemma 3 (Mar 2025) | Gemma 4 (Apr 2026) |
|---|---|---|
| Sizes | 1B, 4B, 12B, 27B (also a 270M) | E2B, E4B, 26B-A4B (MoE), 31B Dense |
| License | Gemma custom terms | Apache 2.0 |
| Context window | 32K (1B) / 128K (others) | 128K (E*) / 256K (26B, 31B) |
| Modalities in | Text; +image on 4B/12B/27B | Text+image all sizes; audio+video on E2B/E4B; silent video on 26B/31B |
| Architecture note | Dense | Alternating local/global attention, dual RoPE, per-layer embeddings, shared KV cache; 26B is MoE with 4B active |
| Languages | 140+ | 140+ |
| LMArena (text) | 27B-IT mid-tier open | 31B-IT ~1452 (#3 open at launch); 26B-A4B ~1441 |
Prerequisites
Hardware
- CPU-only works for E2B / 1B / 4B / E4B at small context — slow but functional.
- NVIDIA GPU (recommended). Approximate VRAM at 4-bit (Q4_K_M) inference, with room for 8K context:
  - Gemma 3 1B / Gemma 4 E2B: ~2 GB
  - Gemma 3 4B / Gemma 4 E4B: ~4–6 GB
  - Gemma 3 12B: ~9–10 GB
  - Gemma 4 26B-A4B: ~16–18 GB (MoE; only 4B params active per token)
  - Gemma 3 27B: ~18–20 GB
  - Gemma 4 31B Dense: ~20–24 GB; comfortable on 32–40 GB once you account for KV cache at 256K context
- AMD GPU: ROCm v7 works with Ollama's rocm tarball. Older amdgpu kernel modules from the upstream Ubuntu kernel may miss features; AMD's docs recommend their latest driver.
- Disk: ~50 GB free is plenty for a couple of variants. The gemma4:31b tag is ~20 GB on disk, gemma4:26b is ~18 GB, and the E* variants are 7–10 GB.
- RAM: 16 GB minimum, 32 GB strongly recommended once you cross the 12B mark, especially for long-context runs.
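The VRAM figures above can be turned into a quick lookup. The sketch below is only an illustration: it takes the upper end of each approximate range from the list, and the function name and the `headroom_gb` parameter are hypothetical. Real usage varies with context length and KV-cache settings.

```python
# Approximate 4-bit (Q4_K_M) VRAM needs in GB with room for ~8K context.
# Values are the upper end of the ranges quoted in the hardware list.
VRAM_NEEDED_GB = {
    "gemma3:1b": 2,
    "gemma4:e2b": 2,
    "gemma3:4b": 6,
    "gemma4:e4b": 6,
    "gemma3:12b": 10,
    "gemma4:26b": 18,
    "gemma3:27b": 20,
    "gemma4:31b": 24,
}


def models_that_fit(vram_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """Return Ollama tags whose estimated footprint fits, smallest first."""
    budget = vram_gb - headroom_gb
    fitting = [tag for tag, need in VRAM_NEEDED_GB.items() if need <= budget]
    return sorted(fitting, key=lambda tag: VRAM_NEEDED_GB[tag])


if __name__ == "__main__":
    for card_gb in (8, 16, 24):
        print(card_gb, "GB ->", models_that_fit(card_gb))
```

Passing a non-zero `headroom_gb` is a simple way to reserve space for a desktop session or a longer context window.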
Software
- Ubuntu 22.04 LTS, 24.04 LTS, or 26.04 LTS (64-bit). Ubuntu 20.04 is end-of-standard-support; upgrade if you can.
- NVIDIA driver + CUDA: the easiest path on 24.04/26.04 is sudo ubuntu-drivers install followed by a reboot. Confirm with nvidia-smi.
- Python ≥ 3.10 (Hugging Face Transformers ≥ 4.50 dropped 3.8/3.9).
- For the Hugging Face path: a token from huggingface.co with the Gemma terms accepted on the model card (Gemma 3) — Gemma 4 is Apache 2.0 and does not require gating.
Method 1 — Ollama (the path most people should take)
Ollama is the lowest-friction way to run either Gemma generation on Ubuntu. The current release (v0.22.x as of late April 2026) ships pre-quantized GGUFs for both gemma3 and gemma4 tags and handles the GPU runtime automatically.
1. Install Ollama
sudo apt update && sudo apt upgrade -y
sudo apt install -y pciutils lshw curl
curl -fsSL https://ollama.com/install.sh | sh
The installer detects your GPU, drops the binary in /usr/local/bin/ollama, creates an ollama system user, and registers /etc/systemd/system/ollama.service. On a vanilla Ubuntu 26.04 box the download is roughly 1.6 GB because both CPU and GPU runtimes ship together.
2. Verify the GPU is visible
nvidia-smi
systemctl status ollama
journalctl -u ollama -n 50 --no-pager | grep -iE 'gpu|cuda|rocm'
Look for a line containing "inference compute" followed by your GPU name. If Ollama instead logs "no compatible GPUs were discovered", the CUDA driver isn't loaded. Fix that before pulling a model so you don't waste 20 GB of bandwidth on a CPU-only run you'll redo.
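If you want to script this check, the same filter can be expressed in Python. This is a minimal sketch that mirrors the grep pattern above and looks for the failure message quoted in this step; the function names are hypothetical, and the exact log wording may differ between Ollama releases.

```python
import re
import subprocess

# Same pattern as the grep in the command above.
GPU_PATTERN = re.compile(r"gpu|cuda|rocm", re.IGNORECASE)
NO_GPU_MESSAGE = "no compatible gpus were discovered"


def gpu_lines(log_text: str) -> list[str]:
    """Return the log lines that mention a GPU runtime."""
    return [line for line in log_text.splitlines() if GPU_PATTERN.search(line)]


def no_gpu_detected(log_text: str) -> bool:
    """True if the log contains the 'no compatible GPUs' message."""
    return NO_GPU_MESSAGE in log_text.lower()


def read_ollama_journal(lines: int = 50) -> str:
    """Fetch recent Ollama service logs (requires systemd and journalctl)."""
    result = subprocess.run(
        ["journalctl", "-u", "ollama", "-n", str(lines), "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

On a machine with the service installed, `no_gpu_detected(read_ollama_journal())` returning `True` is the cue to fix the driver first.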
3. Pull and run a model
# Gemma 4 (April 2026)
ollama run gemma4:e2b # phone-class, ~7.2 GB on disk
ollama run gemma4:e4b # edge / laptop, ~9.6 GB
ollama run gemma4:26b # MoE, ~18 GB, fits a 24 GB GPU
ollama run gemma4:31b # dense, ~20 GB
# Gemma 3 (still useful, especially 12B for 16 GB GPUs)
ollama run gemma3:1b
ollama run gemma3:4b
ollama run gemma3:12b
ollama run gemma3:27b
Confirm what's installed with ollama list. Hit the local API on port 11434 from any other process:
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:e4b",
"prompt": "Summarise the differences between Gemma 3 and Gemma 4 in three bullets.",
"stream": false
}'
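The same request can be made from Python with only the standard library. This is a sketch, assuming the default local endpoint shown in the curl call and the non-streaming response shape, where the generated text comes back in a `response` field; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, url: str = OLLAMA_URL):
    """Build the POST request for a single non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request to a running Ollama server and return the text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["response"]
```

With the server running, `generate("gemma4:e4b", "Say hello in five words.")` returns the completion as a string. The generous timeout allows for the first call, which also loads the model.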
4. Useful systemd overrides
sudo systemctl edit ollama
Drop in environment knobs that matter for Gemma's long context:
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
OLLAMA_FLASH_ATTENTION=1 + OLLAMA_KV_CACHE_TYPE=q8_0 is the combination r/LocalLLaMA settled on to make the 31B fit in 24 GB at long context — without it, the KV cache for 256K context blows past most consumer cards.
Method 2 — Hugging Face Transformers (when you want to fine-tune or script around it)
Pick this path if you need PEFT/LoRA fine-tuning, multimodal pipelines, or programmatic control over generation that Ollama doesn't expose.
python3 -m venv ~/.venvs/gemma && source ~/.venvs/gemma/bin/activate
pip install -U "transformers>=4.50" accelerate torch torchvision pillow
# Optional, for quantization & LoRA
pip install bitsandbytes peft trl
huggingface-cli login # required for Gemma 3; Gemma 4 is Apache-2.0 and ungated
Minimal Gemma 4 inference (text + image, using the any-to-any pipeline introduced for multimodal Gemma 4):
from transformers import pipeline
pipe = pipeline("any-to-any", model="google/gemma-4-e4b-it")
messages = [{
"role": "user",
"content": [
{"type": "image", "image": "https://huggingface.co/datasets/sample/cat.jpg"},
{"type": "text", "text": "Describe this image in one sentence."},
],
}]
print(pipe(messages, max_new_tokens=80)[0]["generated_text"])
Gemma 3 27B-IT, text-only, with 4-bit quantization for a 24 GB card:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-3-27b-it",
device_map="auto",
quantization_config=bnb,
)
prompt = "Explain CUDA streams in three sentences."
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=200)[0], skip_special_tokens=True))
For LoRA fine-tuning, Unsloth has the most reliable Gemma 4 recipes today (single-GPU, 4-bit base, ~2× faster than vanilla TRL). For the 26B-A4B MoE, route through TRL's expert-aware trainer; the small-MoE LoRA story on Unsloth is still maturing as of late April 2026.
CUDA setup & troubleshooting on Ubuntu
This is the section the original 2025 post truncated. Run through it top to bottom when GPU acceleration isn't working.
Clean CUDA install on Ubuntu 24.04 / 26.04
# 1. Wipe any half-installed driver
sudo apt purge -y 'nvidia-*' 'libnvidia-*' 'cuda-*'
sudo apt autoremove -y
# 2. Use Ubuntu's recommended driver picker
sudo ubuntu-drivers install
sudo reboot
# 3. After reboot, verify
nvidia-smi
# expect: Driver Version, CUDA Version, your GPU listed
For Ollama you generally do not need the full CUDA Toolkit — the bundled runtime is enough. Install the toolkit only if you're compiling llama.cpp or running PyTorch from source. The Toolkit's .deb (network) from developer.nvidia.com/cuda-downloads is the path that doesn't conflict with Ubuntu's driver stack.
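To check the driver from a script rather than by eye, you can query nvidia-smi in CSV mode and parse the result. The sketch below assumes the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi, under which `memory.total` is reported in MiB; the function names are hypothetical.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(output: str) -> list[dict]:
    """Parse 'name, memory_mib, driver' rows into dictionaries."""
    gpus = []
    for line in output.strip().splitlines():
        name, memory_mib, driver = [field.strip() for field in line.split(",")]
        gpus.append({
            "name": name,
            "memory_gib": round(int(memory_mib) / 1024, 1),
            "driver": driver,
        })
    return gpus


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one entry per GPU (raises if it fails)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

An empty list, or an exception from `query_gpus()`, means the driver is not loaded and the reinstall steps above apply.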
Common Ubuntu + Gemma errors
| Symptom | Cause | Fix |
|---|---|---|
| nvidia-smi works but Ollama logs no compatible GPUs | Ollama daemon started before driver loaded, or it's running as the ollama user without GPU access. | sudo systemctl restart ollama; check id ollama belongs to video/render groups. |
| CUDA out of memory on 24 GB card with 31B | KV cache at 256K context. Each Gemma 4 31B token costs ~256 KB of KV memory. | Set OLLAMA_KV_CACHE_TYPE=q8_0 (or q4_0), or cap context with num_ctx to 32–64K. |
| Garbage tokens, repeating output, broken tool calls | Stale GGUF predates April 2026 llama.cpp tokenizer fixes. | ollama rm gemma4:31b && ollama pull gemma4:31b to grab the re-uploaded weights. |
| 15% of function calls return malformed JSON on Gemma 4 31B Q4 | Documented r/LocalLLaMA finding for 4-bit function-calling. | Use Q8_0 weights for tool-use workloads, or constrain output with grammar/JSON-schema sampling. |
| Could not load library libcuda.so.1 | NVIDIA driver mismatch after kernel upgrade. | Reinstall: sudo apt install --reinstall nvidia-driver-<version>; reboot. |
| Inference uses CPU even with a working GPU | Model file larger than free VRAM at load time. | Free VRAM (nvidia-smi), drop to a smaller variant, or quantize harder (Q4_K_S). |
| 403 Forbidden on huggingface-cli download google/gemma-3-... | You haven't accepted Gemma 3's license on the model card. | Open the HF model page in a browser, click "Acknowledge license", retry. |
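The out-of-memory row is easier to reason about with the arithmetic written out. This sketch takes the table's ~256 KB per token figure and assumes it describes an unquantized f16 cache, which the table does not state. The per-value sizes follow llama.cpp's block formats: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Bits stored per cached value for each KV cache type.
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

# Figure quoted in the table for Gemma 4 31B; assumed to be the f16 cost.
KV_KIB_PER_TOKEN_F16 = 256


def kv_cache_gib(num_ctx: int, cache_type: str = "f16") -> float:
    """Estimate KV cache size in GiB for a given context length."""
    scale = BITS_PER_VALUE[cache_type] / BITS_PER_VALUE["f16"]
    return num_ctx * KV_KIB_PER_TOKEN_F16 * scale / (1024 * 1024)


if __name__ == "__main__":
    for ctx in (32_768, 65_536, 262_144):
        sizes = {t: round(kv_cache_gib(ctx, t), 2) for t in BITS_PER_VALUE}
        print(ctx, sizes)
```

Under these assumptions a 32K window costs about 8 GiB at f16 and about 4.25 GiB at q8_0, while the full 256K window would need 64 GiB at f16. That gap is why capping num_ctx is usually the first lever to pull.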
Performance reference (April 2026)
- Gemma 4 E2B at Q8 on RTX 5090 laptop: 77+ tokens/s.
- Gemma 4 31B on llama.cpp across Q4_K_M / Q8_0 / FP16: Q4 is ~3–4× faster than FP16 and ~2× faster than Q8.
- Gemma 4 31B-IT LMArena (text-only): ~1452, #3 open model at launch; 26B-A4B at ~1441 with only 4B active params.
- Reported benchmarks from the model card: MMLU-Pro 85.2%, MMMU-Pro 76.9%, AIME 2026 89.2%, LiveCodeBench 80.0%, GPQA Diamond 84.3%, τ²-bench (agentic tools) 86.4%.
Treat single-thread laptop numbers as a ceiling, not a floor — your real throughput depends on context length, KV-cache quant, and concurrent requests.
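To compare your own box against these figures, you can compute decode throughput from the timing fields that Ollama returns with a non-streaming /api/generate response. This is a small sketch, assuming the `eval_count` field (generated tokens) and the `eval_duration` field (nanoseconds); the function name is illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput from an Ollama /api/generate response dict."""
    eval_count = response["eval_count"]            # tokens generated
    eval_duration_ns = response["eval_duration"]   # time spent generating, in ns
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)
```

Measure with the context length and KV-cache settings you plan to run in production, since both change the result.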
How to choose: a five-minute decision tree
- Are you fine-tuning? Use Hugging Face + Unsloth on Gemma 4 E4B or Gemma 3 4B — the small dense models are the sweet spot for LoRA on a single 24 GB GPU.
- Is the workload agentic / tool-using? Gemma 4 31B-IT at Q8 is the strongest open option; pair with the OpenClaw + Ollama setup guide for running local AI agents. Avoid Q4 for production tool calls until the format-error rate drops.
- Edge / privacy / on-device? Gemma 4 E2B or E4B. They're the only Gemma sizes with audio in.
- Long-context RAG (≥128K)? Gemma 4 26B-A4B fits 256K context on 24 GB with q8_0 KV cache and is the cheapest token-for-token option in this range.
- Stuck on Gemma 3? The 12B remains a great 16 GB-VRAM model, and your existing prompts will keep working — just don't expect parity with Gemma 4 on reasoning or tool-use.
Common pitfalls
- Trusting day-one quantizations. The first week of April 2026 saw four re-uploads of Gemma 4 GGUFs as llama.cpp landed tokenizer and KV-cache fixes. If you pulled before mid-April, re-pull.
- Over-quantizing for tool use. 4-bit looks fine on chit-chat and breaks JSON tool calls. Use Q8 (or constrained decoding) for agents.
- Forgetting Apache 2.0 ≠ unrestricted. Gemma 4's Apache 2.0 license is permissive, but Google still publishes a Prohibited Use Policy you should read before shipping commercially.
- Using ollama quantize from old guides. That subcommand isn't part of the current Ollama CLI; use llama.cpp's llama-quantize binary if you need a custom quant.
- Confusing Gemma 4 26B-A4B's footprint. "26B" describes total params; only ~4B activate per token. All 26B parameters still have to sit in memory, so size your VRAM for the total footprint, not the active compute.
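For the tool-use pitfall, constrained decoding plus a strict parse on the way back is a reasonable defence. The sketch below assumes Ollama's `format` request field accepts a JSON schema, as in its structured-output support. The schema, the tool name in the example, and the function names are all hypothetical.

```python
import json

# Hypothetical schema for a single tool call.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
}


def build_payload(model: str, prompt: str) -> dict:
    """Request body that asks the server to constrain output to the schema."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": TOOL_CALL_SCHEMA,
    }


def parse_tool_call(raw: str):
    """Return the tool call as a dict, or None if the text is not a valid call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    if not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return call
```

A `None` result is the signal to retry or fall back, rather than passing a malformed call on to the agent loop.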
FAQ
Is Gemma 4 free for commercial use?
Yes — Apache 2.0, subject to Google's Prohibited Use Policy. Gemma 3 is under the older Gemma terms, which allow commercial use but with extra acceptable-use restrictions.
Should I install Ollama via the install script or Snap?
The install script (curl -fsSL https://ollama.com/install.sh | sh) is the upstream-supported path and matches the official Linux docs. Snap works but lags releases by days to weeks.
Can I run Gemma 4 31B on a 16 GB GPU?
Only with aggressive quantization (Q4_K_S) and short context (≤8K). You'll lose enough quality and speed that gemma4:26b (MoE) or gemma3:12b are usually the better picks at that VRAM tier.
Does Gemma 4 support audio out?
No. Gemma 4 E2B/E4B accept audio in; output is text. For TTS you still want a separate model.
Why is my Gemma 4 model slower than my old Gemma 3 setup?
Most often it's KV-cache pressure from the 256K context window. Drop num_ctx, enable flash attention, and quantize the KV cache. Second most often: a stale GGUF from before the April 2026 llama.cpp fixes.
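Dropping num_ctx does not require a custom model file; it can be set per request. This is a minimal sketch, assuming the `options` object of the Ollama API accepts `num_ctx`; the function name and the 32K default are illustrative.

```python
def payload_with_context_cap(model: str, prompt: str, num_ctx: int = 32768) -> dict:
    """Build a /api/generate body that caps the context window."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
```

Send the result as the JSON body of a POST to the local API on port 11434, the same way as the curl example earlier in this guide.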
Where do I get reliable GGUFs?
Pull from ollama.com/library/gemma4 (and gemma3) for one-command setup, or from huggingface.co/unsloth for hand-tuned quants with patch notes per re-upload. Avoid random one-off uploads.
Where does this fit in a Codersera build?
If you're shipping a product feature on local Gemma — privacy-sensitive RAG, internal copilots, on-prem deployments — Codersera's vetted remote developers can extend your team for the integration and MLOps work without forcing you to staff up full-time. Pair this guide with the pillar OpenClaw walkthrough for the agent layer and you have an end-to-end stack.
Related Codersera guides
- OpenClaw + Ollama setup guide for running local AI agents — pillar for the agent orchestration layer.
- How to run Gemma on macOS (MLX path)
- How to run Gemma on Windows
- DeepSeek Janus-Pro 7B on Mac with ComfyUI
References & further reading
- Google — Gemma 4: Byte for byte, the most capable open models (official launch post)
- Google DeepMind — Gemma 4 model page
- Google AI for Developers — Gemma 4 model card
- Hugging Face — Welcome Gemma 4: Frontier multimodal intelligence on device
- Hugging Face — google/gemma-3-27b-it model card
- Ollama — Official Linux installation docs
- Ollama — GitHub releases
- Ollama library — gemma4 tags & sizes
- Unsloth — Run & fine-tune Gemma 4 locally
- r/LocalLLaMA — active April 2026 threads on Gemma 4 KV-cache, llama.cpp tokenizer fixes, and 4-bit function-calling format errors
- Hacker News — Unsloth on the four Gemma 4 re-uploads following 20 llama.cpp bug fixes