Google's Gemma 4 is one of the best open-weight model families available in 2026 — and running Gemma 4 with Ollama is the fastest way to get it working locally. In this guide you'll complete the full Gemma 4 Ollama setup: install Ollama, pull the right model size for your hardware, configure context, and hit the local REST API. Everything works on Mac, Linux, and Windows.
If you want a full model overview before diving into setup, see our guide on running Gemma 4 on your PC and devices locally, which covers all available run methods including LM Studio and direct Python inference.
Gemma 4 is Google's fourth generation of open-weight language models. The family spans four sizes — two compact edge models (E2B, E4B) and two larger variants (26B MoE, 31B Dense) — all supporting vision input, native function calling, and a massive context window (128K–256K tokens). Ollama wraps local inference into a single command-line tool and a local REST server, making it the lowest-friction path to run Gemma 4 locally.
Deciding which model generation to use? The Gemma 4 vs Gemma 3 vs Gemma 3n comparison breaks down what changed across generations and which variant to pick for your use case.
Pick your variant based on available VRAM (GPU) or RAM (Apple Silicon / CPU-only):
Apple Silicon: M1/M2/M3/M4 Macs use unified memory, so the VRAM figures map to your RAM. A 16 GB M2 Mac runs E4B comfortably; 32 GB handles the 26B MoE.
AMD GPUs: AMD announced day-0 support for all Gemma 4 variants — ROCm 6.x with an RX 7900 or better is the recommended configuration.
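Not sure how much memory you have to work with? A quick check (assuming an NVIDIA card on Linux/Windows, or a Mac where unified RAM is what counts):
# NVIDIA: total VRAM per GPU
nvidia-smi --query-gpu=memory.total --format=csv
# macOS (Apple Silicon): total unified memory, in bytes
sysctl -n hw.memsize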
# macOS: download from ollama.com (recommended), or via Homebrew:
brew install ollama
# Linux: one-line install script
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the .exe installer from ollama.com and run it. The Ollama service starts automatically and listens on port 11434.
Confirm the install:
ollama --version
The default tag fetches E4B (the recommended starting point). Download sizes are large, so run this on a good connection:
# Default — downloads E4B (~9.6 GB)
ollama pull gemma4
# Specific variants:
ollama pull gemma4:e2b # ~7.2 GB — smallest, runs on almost anything
ollama pull gemma4:26b # ~18 GB — MoE, strong quality on mid-range GPU
ollama pull gemma4:31b # ~20 GB — dense, best quality, high hardware bar
Verify the download completed:
ollama list
The 26B uses a Mixture of Experts architecture — only 3.8B parameters activate per inference pass, making it faster than its 18 GB size suggests. It is the best choice if you have a 24 GB GPU and want maximum output quality.
Start an interactive session:
ollama run gemma4
At the prompt, test the model:
>>> Explain transformer attention in two sentences for a software engineer.
Exit the session:
>>> /bye
Run a single prompt non-interactively (useful for scripting):
ollama run gemma4 "Write a Python function that flattens a nested list."The most important configuration step most guides skip: Ollama's default context window is 4K tokens, not 128K. This significantly limits Gemma 4's real capability. Always set num_ctx explicitly.
>>> /set parameter num_ctx 32768
Common values: 16384 (16K), 32768 (32K), 131072 (128K — full E4B capability). Larger context consumes more VRAM for the KV cache, so start at 32K and increase if needed.
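Note that /set parameter only applies to the current interactive session; use the Modelfile approach below to make it permanent. If you call the REST API (covered later in this guide), the same setting can be passed per request through the options field. A minimal sketch against the default gemma4 tag:
curl http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{
  "model": "gemma4",
  "prompt": "Summarize the trade-offs of a larger context window.",
  "options": {"num_ctx": 32768},
  "stream": false
}'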
Create a file named Modelfile:
FROM gemma4
PARAMETER num_ctx 32768
PARAMETER num_gpu 99
num_gpu 99 tells Ollama to offload all model layers to GPU — 99 means "as many layers as are available." On CPU-only setups, omit this line.
Build and run your custom config:
ollama create gemma4-32k -f Modelfile
ollama run gemma4-32k
This persists across sessions — you will see gemma4-32k in ollama list alongside the base model.
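To double-check that the parameters were baked in, ollama show prints the model's configuration (the exact output varies by Ollama version):
ollama show gemma4-32k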
Ollama exposes a REST API at http://localhost:11434. Use this to integrate Gemma 4 into scripts, tools, and applications.
curl http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{
"model": "gemma4",
"prompt": "Summarize why local AI inference matters for developer privacy.",
"stream": false
}'
The same server also exposes an OpenAI-compatible chat endpoint:
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gemma4",
"messages": [
{"role": "user", "content": "What are the top 3 use cases for a 128K context window?"}
]
}'
From Python, the generate endpoint is a simple requests call:
import requests
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": "gemma4",
"prompt": "List five practical uses for local LLM inference.",
"stream": False,
},
)
print(response.json()["response"])
The /v1/chat/completions endpoint is OpenAI-compatible, meaning existing tools and libraries that target the OpenAI SDK work against your local Ollama instance with a base URL change. For building full agent workflows on top of Ollama, see our OpenClaw + Ollama agent setup guide.
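As a quick sketch of what that base URL change looks like with the official openai Python package (the model name assumes the gemma4 tag pulled earlier; the api_key value is required by the SDK but ignored by Ollama):
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "Name three ways to use a local LLM in a CI pipeline."}],
)
print(resp.choices[0].message.content)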
A known issue with some Gemma 4 builds causes Flash Attention to misreport GPU usage. Check whether the GPU is actually in use:
ollama psIf the GPU column is empty despite having a compatible GPU, add PARAMETER num_gpu 99 to your Modelfile and recreate the model. On Linux, verify CUDA is visible with nvidia-smi.
If you hit out-of-memory errors, you are targeting a model larger than your available VRAM. Switch to a smaller tag (gemma4:e4b instead of gemma4:26b), or reduce the context window: /set parameter num_ctx 8192.
If API calls to port 11434 are refused, the Ollama server may not be running.
# Start manually (Mac/Linux):
ollama serve
# Confirm it is listening:
curl http://localhost:11434
If generation feels slow on an M-series Mac, ensure you are using a recent Ollama release with Metal acceleration for Apple Silicon. Run ollama --version and update from ollama.com if you are on an older release.
If a download gets interrupted, re-run ollama pull gemma4 — Ollama automatically resumes interrupted downloads from where they stopped.
If the model keeps unloading between requests, set OLLAMA_KEEP_ALIVE=-1 to prevent Ollama from unloading the model after 5 minutes of inactivity. On Linux with systemd, add it via systemctl edit ollama as an environment override. On Mac and Windows, set it in your shell profile.
If the model does not fit entirely on the GPU, set num_gpu 20 (or any value less than the total layer count) in your Modelfile to offload part of the model to GPU and handle the rest on CPU.
Your Gemma 4 Ollama setup gives you a local inference endpoint ready to power tools, agents, and applications. Here is where to go next:
The E4B model at 32K context covers the majority of developer use cases. Bump to the 26B MoE when you need higher quality and have the hardware for it. And remember: the single configuration step that unlocks Gemma 4's real capability is setting num_ctx — do not leave it at the 4K default.