Local LLMs

Run Gemma 4 on Windows: Step-by-Step Guide (2026)

Published 23 May 2026 • Updated 23 May 2026 • 7 min read

Quick answer. To run Gemma 4 on Windows, install Ollama from ollama.com, open PowerShell, and run ollama pull gemma4:e4b followed by ollama run gemma4:e4b. The E4B (~9.6 GB) variant fits comfortably on 16 GB systems. For a GUI, install LM Studio, search “gemma 4”, download a Q4_K_M build, and load it in the chat tab. Larger variants (26B MoE, 31B dense) need 18–20 GB VRAM at INT4.

Updated 2026-05-23.

Gemma 4 is Google’s 2026 open-weights model family. It comes in four sizes — E2B and E4B for laptops and phones, plus 26B-A4B (mixture-of-experts) and 31B for workstations — and runs cleanly on Windows through both Ollama and LM Studio.

This guide is Windows-specific. It mirrors our Qwen 3 on Windows walkthrough: install steps, the right variant for your hardware, both Ollama (CLI) and LM Studio (GUI) paths, and a few real prompts to verify the install.

What is Gemma 4?

Gemma 4 is the fourth generation of Google’s open-weights model family. It is multimodal (text + image + audio on E2B/E4B), instruction-tuned, and released under the Gemma license — permissive enough for commercial use with attribution. The model card lives on Hugging Face under google/.

For a deeper architectural look, see our Gemma 4 complete guide. This page is focused on the Windows install.

Gemma 4 model variants and VRAM requirements

The first rule for local LLMs on Windows: the model has to fit in either VRAM (GPU) or system RAM (CPU offload). Pick the variant that comfortably fits, with headroom for context tokens and the operating system.

Variant	Params	VRAM (INT4)	VRAM (FP16)	Best for
Gemma 4 E2B	~2B	~1.5 GB	~4 GB	Old laptops, integrated GPUs, edge devices
Gemma 4 E4B	~4B	~3 GB	~8 GB	Most consumer laptops; the recommended starting point
Gemma 4 26B-A4B (MoE)	26B (4B active)	~16–18 GB	~52 GB	Workstations with 24 GB+ GPU; near-31B quality at MoE speed
Gemma 4 31B	31B	~18–20 GB	~62 GB	Strongest single Gemma 4 model; needs a 24 GB+ GPU

Quick sanity checks before you pull a model:

4–8 GB GPU: stick to E2B.
8–12 GB GPU or 16 GB system RAM, no GPU: E4B with Q4_K_M quantization.
16–24 GB GPU: 26B-A4B MoE at Q4 is the sweet spot.
24 GB+ GPU: 31B at Q4_K_M for the best Gemma 4 quality you can run locally.

If you only have a CPU, Ollama will still run E2B and E4B at acceptable speeds on a modern Ryzen / Intel chip with 16 GB RAM — expect 5–15 tokens/sec.

Windows prerequisites

Windows 10 22H2 or Windows 11. Ollama requires Windows 10 64-bit or later.
16 GB RAM minimum for E4B; 32 GB recommended for 26B/31B if you do not have a 24 GB+ GPU.
NVIDIA GPU (optional but recommended). CUDA 12.x drivers; Ollama detects the GPU automatically.
~20 GB free disk per model variant (more for FP16 builds).
PowerShell (built into Windows) or Windows Terminal.

How to install Gemma 4 on Windows with Ollama

Ollama is the most reliable path for Windows. It is a single installer, runs as a background service, and exposes a local OpenAI-compatible API on http://localhost:11434.

Step 1. Install Ollama

Go to ollama.com/download and grab the Windows installer (OllamaSetup.exe).
Run the installer. It registers a background service and adds ollama to your PATH.
Open PowerShell and verify:

ollama --version

Step 2. Pull Gemma 4

Pull the variant that fits your hardware. The default tag is the E4B build:

# Recommended starting point (~9.6 GB download)
ollama pull gemma4:e4b

# Tiny variant for old laptops
ollama pull gemma4:e2b

# 26B MoE for workstations
ollama pull gemma4:26b

# 31B dense flagship
ollama pull gemma4:31b

Downloads pause and resume cleanly; if your connection drops, re-run the same pull.

Step 3. Run an interactive chat

ollama run gemma4:e4b

The first run loads weights into memory (10–30 seconds on E4B; longer for 26B/31B). After that, you get a prompt:

>>> Explain mixture-of-experts in two sentences.

Type /bye to exit. The model stays warm in memory for ~5 minutes by default.

Step 4. Use the local API from your app

Ollama serves an OpenAI-compatible endpoint at http://localhost:11434. From PowerShell:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e4b",
  "prompt": "Write a one-paragraph summary of attention.",
  "stream": false
}'

Or in Python:

import requests
r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma4:e4b",
    "prompt": "List three benefits of MoE models.",
    "stream": False,
})
print(r.json()["response"])

How to install Gemma 4 on Windows with LM Studio

If you prefer a GUI, LM Studio is the cleanest option on Windows. It bundles llama.cpp, ships a model browser, and exposes the same OpenAI-compatible API on http://localhost:1234.

Download the Windows installer from lmstudio.ai.
Run the installer and launch LM Studio.
Open the Discover tab (magnifier icon), search for gemma 4.
Pick a build that matches your hardware. Start with a Q4_K_M GGUF of Gemma 4 E4B — ~3 GB download, fits in 6–8 GB VRAM with room for context.
Click Download. Once finished, switch to the Chat tab, pick the model in the top dropdown, and send a message.
To serve it as an API, open the Developer tab and click Start Server. Other apps can now hit http://localhost:1234/v1/chat/completions.

LM Studio shows VRAM usage per model in the picker, which makes “will this fit?” obvious before you download.

Quantization: which GGUF version should I pick?

GGUF builds come in many flavours. The trade-off is download size and answer quality.

Quant	Size (vs FP16)	Quality	When to pick
Q4_K_M	~25%	Very good	Default for most users; best size/quality balance
Q5_K_M	~31%	Excellent	You have spare VRAM and want sharper outputs
Q6_K	~38%	Near-FP16	Maximum quality without going FP16
Q8_0	~50%	Essentially lossless	You need reproducibility against the original weights
Q2_K / Q3_K	<20%	Lossy	You have very little VRAM and just need it to fit

Recommendation: start with Q4_K_M E4B. Only move up if outputs feel weak; only move down if it does not fit.

Benchmark it yourself

A 30-second smoke test once Gemma 4 is running:

Reasoning: “A train leaves Chicago at 60 mph. Another leaves New York at 80 mph an hour later. They are 800 miles apart. When do they meet?”
Code: “Write a Python function that returns the longest palindromic substring in a string. Include three test cases.”
Summarization: Paste a 3-paragraph article and ask for a one-sentence summary.

If E4B feels weak on reasoning, step up to 26B-A4B if your VRAM allows — the MoE design means it runs at roughly E4B speed but answers like a much larger model.

Connect Gemma 4 to a chat UI

Running Gemma 4 in PowerShell works, but a desktop client is nicer for daily use. Two clean options on Windows:

Cherry Studio — native Windows desktop, multi-model, RAG, MCP. Point it at http://localhost:11434 under Settings → Model Providers → Ollama.
Open WebUI — self-hosted web UI, runs in Docker. Point it at the same Ollama endpoint.

Troubleshooting

“ollama is not recognized”. Close and reopen PowerShell after install; the PATH update needs a new shell.
Model loads but generation is glacial. Either the model is spilling to system RAM (pick a smaller variant or lower quant), or the GPU is not detected (check ollama ps, update NVIDIA drivers).
Out of memory on 26B/31B. Drop to Q3_K_M or use the 26B MoE variant, which keeps memory closer to 4B-active despite the 26B parameter count.
Slow first-token latency. Normal — the first request after the model unloads has to reload weights. Keep the model warm with OLLAMA_KEEP_ALIVE=1h.
Antivirus blocks Ollama. Whitelist %LOCALAPPDATA%\Programs\Ollama. Defender occasionally flags new releases.

Security and privacy

Running Gemma 4 locally on Windows means prompts, completions and any RAG documents stay on your machine — nothing leaves over the network unless you point a client at a remote backend. For privacy-sensitive workflows (legal review, personal notes, internal docs), local Gemma 4 + a local embedding model is a complete offline stack. For more on the broader self-hosting picture see our self-hosting LLMs guide.

FAQ

How do I install Gemma 4 on Windows?

Install Ollama from ollama.com/download, open PowerShell, and run ollama pull gemma4:e4b then ollama run gemma4:e4b. The E4B variant downloads ~9.6 GB and fits 16 GB systems. For a GUI, install LM Studio and search “gemma 4” in the Discover tab.

Which Gemma 4 variant should I run on Windows?

For most people: E4B with Q4_K_M quantization. It needs ~3 GB VRAM at INT4 and runs well on 8 GB GPUs or CPU-only with 16 GB RAM. Step up to 26B-A4B (MoE) on 16–24 GB GPUs, or 31B on 24 GB+ GPUs. Use E2B if your machine has less than 8 GB total memory headroom.

How much VRAM does Gemma 4 need?

At INT4 quantization: E2B ~1.5 GB, E4B ~3 GB, 26B MoE ~16–18 GB, 31B ~18–20 GB. At FP16: E2B ~4 GB, E4B ~8 GB, 26B ~52 GB, 31B ~62 GB. Add 1–2 GB for context and the OS.

Can I run Gemma 4 on Windows without a GPU?

Yes, for E2B and E4B. Ollama runs them on CPU at acceptable speeds (5–15 tokens/sec on a modern Ryzen / Intel chip with 16 GB RAM). The 26B and 31B variants are too slow on CPU for interactive use.

Ollama or LM Studio for Gemma 4 on Windows?

Ollama if you are comfortable in PowerShell and want a background service you can hit from any app via the local API. LM Studio if you want a polished GUI with a built-in model browser and per-model VRAM estimates. Both ship the same llama.cpp under the hood.

Where does Ollama store Gemma 4 weights on Windows?

Under %USERPROFILE%\.ollama\models. To move them to another drive, set the OLLAMA_MODELS environment variable before starting the Ollama service.

How do I free Gemma 4’s memory after I am done?

Run ollama stop gemma4:e4b, or wait for the default 5-minute keep-alive to expire. The model unloads automatically when idle.

Can Gemma 4 do image input on Windows?

Gemma 4 E2B and E4B are multimodal in the base weights, but multimodal input depends on the runtime. As of mid-2026 the GGUF builds in Ollama and LM Studio expose text only; for image input use the safetensors weights via Hugging Face Transformers, or wait for the runtimes to expose the vision projector.

Is Gemma 4 free for commercial use?

Yes, under the Gemma Terms of Use. Attribution is required (a notice that the model is Gemma-based) and there are use-case prohibitions in the license. Read the current Gemma Terms on Google’s site before shipping a commercial product on top of it.

Why is my first prompt slow but the next ones are fast?

The first request loads the model weights into memory (10–30 s for E4B; longer for 26B/31B). After that the model stays warm for 5 minutes by default. Set OLLAMA_KEEP_ALIVE=1h to keep it warm longer between sessions.

Closing

Gemma 4 on Windows is a 10-minute install once you know what to pull. Start with ollama pull gemma4:e4b, verify with a chat, then either point Cherry Studio or LM Studio at the local endpoint for a real workflow. From here you have a fully offline Gemma 4 setup that costs nothing to run and keeps your data on your machine.