Run Gemma 4 on Windows: Step-by-Step Guide (2026)
Quick answer. To run Gemma 4 on Windows, install Ollama from ollama.com, open PowerShell, and run ollama pull gemma4:e4b followed by ollama run gemma4:e4b. The E4B (~9.6 GB) variant fits comfortably on 16 GB systems. For a GUI, install LM Studio, search “gemma 4”, download a Q4_K_M build, and load it in the chat tab. Larger variants (26B MoE, 31B dense) need 18–20 GB VRAM at INT4.
Updated 2026-05-23.
Gemma 4 is Google’s 2026 open-weights model family. It comes in four sizes — E2B and E4B for laptops and phones, plus 26B-A4B (mixture-of-experts) and 31B for workstations — and runs cleanly on Windows through both Ollama and LM Studio.
This guide is Windows-specific. It mirrors our Qwen 3 on Windows walkthrough: install steps, the right variant for your hardware, both Ollama (CLI) and LM Studio (GUI) paths, and a few real prompts to verify the install.
What is Gemma 4?
Gemma 4 is the fourth generation of Google’s open-weights model family. It is multimodal (text + image + audio on E2B/E4B), instruction-tuned, and released under the Gemma license — permissive enough for commercial use with attribution. The model card lives on Hugging Face under google/.
For a deeper architectural look, see our Gemma 4 complete guide. This page is focused on the Windows install.
Gemma 4 model variants and VRAM requirements
The first rule for local LLMs on Windows: the model has to fit in either VRAM (GPU) or system RAM (CPU offload). Pick the variant that comfortably fits, with headroom for context tokens and the operating system.
| Variant | Params | VRAM (INT4) | VRAM (FP16) | Best for |
|---|---|---|---|---|
| Gemma 4 E2B | ~2B | ~1.5 GB | ~4 GB | Old laptops, integrated GPUs, edge devices |
| Gemma 4 E4B | ~4B | ~3 GB | ~8 GB | Most consumer laptops; the recommended starting point |
| Gemma 4 26B-A4B (MoE) | 26B (4B active) | ~16–18 GB | ~52 GB | Workstations with 24 GB+ GPU; near-31B quality at MoE speed |
| Gemma 4 31B | 31B | ~18–20 GB | ~62 GB | Strongest single Gemma 4 model; needs a 24 GB+ GPU |
Quick sanity checks before you pull a model:
- 4–8 GB GPU: stick to E2B.
- 8–12 GB GPU or 16 GB system RAM, no GPU: E4B with Q4_K_M quantization.
- 16–24 GB GPU: 26B-A4B MoE at Q4 is the sweet spot.
- 24 GB+ GPU: 31B at Q4_K_M for the best Gemma 4 quality you can run locally.
If you only have a CPU, Ollama will still run E2B and E4B at acceptable speeds on a modern Ryzen / Intel chip with 16 GB RAM — expect 5–15 tokens/sec.
Windows prerequisites
- Windows 10 22H2 or Windows 11. Ollama requires Windows 10 64-bit or later.
- 16 GB RAM minimum for E4B; 32 GB recommended for 26B/31B if you do not have a 24 GB+ GPU.
- NVIDIA GPU (optional but recommended). CUDA 12.x drivers; Ollama detects the GPU automatically.
- ~20 GB free disk per model variant (more for FP16 builds).
- PowerShell (built into Windows) or Windows Terminal.
How to install Gemma 4 on Windows with Ollama
Ollama is the most reliable path for Windows. It is a single installer, runs as a background service, and exposes a local OpenAI-compatible API on http://localhost:11434.
Step 1. Install Ollama
- Go to ollama.com/download and grab the Windows installer (
OllamaSetup.exe). - Run the installer. It registers a background service and adds
ollamato your PATH. - Open PowerShell and verify:
ollama --versionStep 2. Pull Gemma 4
Pull the variant that fits your hardware. The default tag is the E4B build:
# Recommended starting point (~9.6 GB download)
ollama pull gemma4:e4b
# Tiny variant for old laptops
ollama pull gemma4:e2b
# 26B MoE for workstations
ollama pull gemma4:26b
# 31B dense flagship
ollama pull gemma4:31bDownloads pause and resume cleanly; if your connection drops, re-run the same pull.
Step 3. Run an interactive chat
ollama run gemma4:e4bThe first run loads weights into memory (10–30 seconds on E4B; longer for 26B/31B). After that, you get a prompt:
>>> Explain mixture-of-experts in two sentences.Type /bye to exit. The model stays warm in memory for ~5 minutes by default.
Step 4. Use the local API from your app
Ollama serves an OpenAI-compatible endpoint at http://localhost:11434. From PowerShell:
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:e4b",
"prompt": "Write a one-paragraph summary of attention.",
"stream": false
}'Or in Python:
import requests
r = requests.post("http://localhost:11434/api/generate", json={
"model": "gemma4:e4b",
"prompt": "List three benefits of MoE models.",
"stream": False,
})
print(r.json()["response"])How to install Gemma 4 on Windows with LM Studio
If you prefer a GUI, LM Studio is the cleanest option on Windows. It bundles llama.cpp, ships a model browser, and exposes the same OpenAI-compatible API on http://localhost:1234.
- Download the Windows installer from lmstudio.ai.
- Run the installer and launch LM Studio.
- Open the Discover tab (magnifier icon), search for gemma 4.
- Pick a build that matches your hardware. Start with a Q4_K_M GGUF of Gemma 4 E4B — ~3 GB download, fits in 6–8 GB VRAM with room for context.
- Click Download. Once finished, switch to the Chat tab, pick the model in the top dropdown, and send a message.
- To serve it as an API, open the Developer tab and click Start Server. Other apps can now hit
http://localhost:1234/v1/chat/completions.
LM Studio shows VRAM usage per model in the picker, which makes “will this fit?” obvious before you download.
Quantization: which GGUF version should I pick?
GGUF builds come in many flavours. The trade-off is download size and answer quality.
| Quant | Size (vs FP16) | Quality | When to pick |
|---|---|---|---|
| Q4_K_M | ~25% | Very good | Default for most users; best size/quality balance |
| Q5_K_M | ~31% | Excellent | You have spare VRAM and want sharper outputs |
| Q6_K | ~38% | Near-FP16 | Maximum quality without going FP16 |
| Q8_0 | ~50% | Essentially lossless | You need reproducibility against the original weights |
| Q2_K / Q3_K | <20% | Lossy | You have very little VRAM and just need it to fit |
Recommendation: start with Q4_K_M E4B. Only move up if outputs feel weak; only move down if it does not fit.
Benchmark it yourself
A 30-second smoke test once Gemma 4 is running:
- Reasoning: “A train leaves Chicago at 60 mph. Another leaves New York at 80 mph an hour later. They are 800 miles apart. When do they meet?”
- Code: “Write a Python function that returns the longest palindromic substring in a string. Include three test cases.”
- Summarization: Paste a 3-paragraph article and ask for a one-sentence summary.
If E4B feels weak on reasoning, step up to 26B-A4B if your VRAM allows — the MoE design means it runs at roughly E4B speed but answers like a much larger model.
Connect Gemma 4 to a chat UI
Running Gemma 4 in PowerShell works, but a desktop client is nicer for daily use. Two clean options on Windows:
- Cherry Studio — native Windows desktop, multi-model, RAG, MCP. Point it at
http://localhost:11434under Settings → Model Providers → Ollama. - Open WebUI — self-hosted web UI, runs in Docker. Point it at the same Ollama endpoint.
Troubleshooting
- “ollama is not recognized”. Close and reopen PowerShell after install; the PATH update needs a new shell.
- Model loads but generation is glacial. Either the model is spilling to system RAM (pick a smaller variant or lower quant), or the GPU is not detected (check
ollama ps, update NVIDIA drivers). - Out of memory on 26B/31B. Drop to Q3_K_M or use the 26B MoE variant, which keeps memory closer to 4B-active despite the 26B parameter count.
- Slow first-token latency. Normal — the first request after the model unloads has to reload weights. Keep the model warm with
OLLAMA_KEEP_ALIVE=1h. - Antivirus blocks Ollama. Whitelist
%LOCALAPPDATA%\Programs\Ollama. Defender occasionally flags new releases.
Security and privacy
Running Gemma 4 locally on Windows means prompts, completions and any RAG documents stay on your machine — nothing leaves over the network unless you point a client at a remote backend. For privacy-sensitive workflows (legal review, personal notes, internal docs), local Gemma 4 + a local embedding model is a complete offline stack. For more on the broader self-hosting picture see our self-hosting LLMs guide.
Related Codersera guides
- Gemma 4 complete guide (2026)
- Self-hosting LLMs complete guide (2026)
- Cherry Studio complete guide (2026)
- Running Qwen 3 8B on Windows
- How to run Gemma 4 with Ollama
FAQ
How do I install Gemma 4 on Windows?
Install Ollama from ollama.com/download, open PowerShell, and run ollama pull gemma4:e4b then ollama run gemma4:e4b. The E4B variant downloads ~9.6 GB and fits 16 GB systems. For a GUI, install LM Studio and search “gemma 4” in the Discover tab.
Which Gemma 4 variant should I run on Windows?
For most people: E4B with Q4_K_M quantization. It needs ~3 GB VRAM at INT4 and runs well on 8 GB GPUs or CPU-only with 16 GB RAM. Step up to 26B-A4B (MoE) on 16–24 GB GPUs, or 31B on 24 GB+ GPUs. Use E2B if your machine has less than 8 GB total memory headroom.
How much VRAM does Gemma 4 need?
At INT4 quantization: E2B ~1.5 GB, E4B ~3 GB, 26B MoE ~16–18 GB, 31B ~18–20 GB. At FP16: E2B ~4 GB, E4B ~8 GB, 26B ~52 GB, 31B ~62 GB. Add 1–2 GB for context and the OS.
Can I run Gemma 4 on Windows without a GPU?
Yes, for E2B and E4B. Ollama runs them on CPU at acceptable speeds (5–15 tokens/sec on a modern Ryzen / Intel chip with 16 GB RAM). The 26B and 31B variants are too slow on CPU for interactive use.
Ollama or LM Studio for Gemma 4 on Windows?
Ollama if you are comfortable in PowerShell and want a background service you can hit from any app via the local API. LM Studio if you want a polished GUI with a built-in model browser and per-model VRAM estimates. Both ship the same llama.cpp under the hood.
Where does Ollama store Gemma 4 weights on Windows?
Under %USERPROFILE%\.ollama\models. To move them to another drive, set the OLLAMA_MODELS environment variable before starting the Ollama service.
How do I free Gemma 4’s memory after I am done?
Run ollama stop gemma4:e4b, or wait for the default 5-minute keep-alive to expire. The model unloads automatically when idle.
Can Gemma 4 do image input on Windows?
Gemma 4 E2B and E4B are multimodal in the base weights, but multimodal input depends on the runtime. As of mid-2026 the GGUF builds in Ollama and LM Studio expose text only; for image input use the safetensors weights via Hugging Face Transformers, or wait for the runtimes to expose the vision projector.
Is Gemma 4 free for commercial use?
Yes, under the Gemma Terms of Use. Attribution is required (a notice that the model is Gemma-based) and there are use-case prohibitions in the license. Read the current Gemma Terms on Google’s site before shipping a commercial product on top of it.
Why is my first prompt slow but the next ones are fast?
The first request loads the model weights into memory (10–30 s for E4B; longer for 26B/31B). After that the model stays warm for 5 minutes by default. Set OLLAMA_KEEP_ALIVE=1h to keep it warm longer between sessions.
Closing
Gemma 4 on Windows is a 10-minute install once you know what to pull. Start with ollama pull gemma4:e4b, verify with a chat, then either point Cherry Studio or LM Studio at the local endpoint for a real workflow. From here you have a fully offline Gemma 4 setup that costs nothing to run and keeps your data on your machine.