Last updated April 2026 — refreshed for current model/tool versions.
Google released Gemma 4 on April 2, 2026, superseding Gemma 3 as DeepMind's flagship open-weight family. This guide shows how to run both generations locally on a Mac (Apple Silicon M1–M4): Gemma 3 if you have older tooling or smaller hardware, and Gemma 4 if you want the current state of the art with native multimodal input and a 256K context window.
What changed in 2026
- Gemma 4 launched April 2, 2026 with four sizes: E2B (~2.3B), E4B (~4.5B), a 26B Mixture-of-Experts (~4B active), and a 31B dense flagship. Apache 2.0 license, 256K context, 140+ languages, native text+image (audio/video on smaller variants).
- Ollama on macOS now runs on MLX (March 30, 2026) — Gemma 4 inference on Apple Silicon is materially faster and uses less unified memory than the old Metal path.
- Hugging Face CLI install changed:huggingface-cliis no longer a Homebrew formula. Install viapip install -U "huggingface_hub[cli]". The modern command name ishf(legacyhuggingface-clistill ships as an alias).
- Gemma 3 still works and is a reasonable fallback if you're already on it; don't rip it out unless you need 4's multimodal or longer context.
Want the full picture? Read our continuously-updated Gemma 4 Complete Guide (2026) — small-footprint open weights, on-device deployment, and benchmarks.
Why run Gemma locally on a Mac
- Privacy: prompts and documents never leave the device.
- Latency: first-token latency on M3/M4 with MLX is sub-200 ms for E4B-class models.
- Cost: zero per-token cost after the one-time download.
- Offline: useful for travel, regulated environments, and CI runs without API keys.
- Agentic workflows: Gemma 4 ships with native function-calling, which pairs well with the OpenClaw + Ollama setup guide for running local AI agents if you want a fully local agent loop.
Model sizes — Gemma 3 vs Gemma 4
| Family | Variant | Params | Disk (Q4_K_M) | Min unified RAM | Notes |
|---|---|---|---|---|---|
| Gemma 3 | 1B | 1B | ~0.8 GB | 8 GB | Edge / draft model |
| Gemma 3 | 4B | 4B | ~2.5 GB | 8 GB | Good general-purpose |
| Gemma 3 | 12B | 12B | ~7 GB | 16 GB | Quality bump, fits M-series Pro |
| Gemma 3 | 27B | 27B | ~16 GB | 32 GB | Top of the Gemma 3 line |
| Gemma 4 | E2B | ~2.3B effective | ~1.5 GB | 8 GB | Mobile/edge, multimodal |
| Gemma 4 | E4B | ~4.5B effective | ~3 GB | 8 GB | Default for M1/M2 Air |
| Gemma 4 | 26B (MoE) | 26B total / ~4B active | ~14 GB | 24–32 GB | Fast inference, big knowledge |
| Gemma 4 | 31B dense | 30.7B | ~18 GB | 32 GB+ | Flagship; 1452 Arena AI, 89.2% AIME 2026, 80.0% LiveCodeBench |
If you're choosing between generations: pick Gemma 4 unless you have a specific reason (existing fine-tunes, tooling lock-in) to stay on Gemma 3. Gemma 4's E4B beats Gemma 3 4B on every published benchmark and adds vision input.
Prerequisites
Hardware
- Apple Silicon required for usable speed (M1, M2, M3, or M4 — including base, Pro, Max, and Ultra variants). Intel Macs run via CPU only and are not recommended.
- Unified memory: 8 GB handles E2B/E4B and Gemma 3 1B/4B; 16 GB handles Gemma 3 12B and Gemma 4 26B with care; 32 GB+ for Gemma 4 31B and Gemma 3 27B.
- Disk: budget ~25 GB free for the larger models plus overhead.
Software
- macOS: macOS Sonoma (14) or later. macOS Sequoia (15) and Tahoe (26) are recommended for the latest MLX kernels.
- Python: 3.11 or 3.12 (3.9 still works but is end-of-life as of October 2025).
- Tooling: Homebrew, Ollama 0.6+,
huggingface_hub[cli], optionallyuvor Conda for environments.
Install the toolchain
1. Install Ollama
Ollama 0.6+ on macOS uses MLX as the inference backend on Apple Silicon, which is what you want.
brew install ollama
ollama serve &
Or install the desktop app from ollama.com/download if you'd rather have a tray icon.
2. Create a Python environment
Use uv (fast) or Conda. uv example:
brew install uv
uv venv ~/.venvs/gemma
source ~/.venvs/gemma/bin/activate
Conda alternative:
brew install --cask miniconda
conda create -n gemma python=3.12 -y
conda activate gemma
3. Install the Hugging Face CLI
The previous version of this post used brew install huggingface-cli, which never worked — there is no Homebrew formula by that name. Install via pip instead:
pip install -U "huggingface_hub[cli]"
hf auth login # paste a read token from https://huggingface.co/settings/tokens
The modern entry point is hf. The legacy huggingface-cli name still works as a shim and you'll see it in older docs.
4. (Optional) Install PyTorch / Transformers for direct use
Only needed if you want to call Gemma from Python directly rather than through Ollama:
pip install -U torch torchvision transformers accelerate
PyTorch ships with MPS (Metal) support out of the box on macOS; no extra wheels needed.
Run Gemma 4 (recommended path)
Pull a model with Ollama
# default — Gemma 4 E4B, runs on any 8 GB+ Apple Silicon Mac
ollama pull gemma4
# 26B Mixture-of-Experts — best quality/speed tradeoff for 24–32 GB
ollama pull gemma4:26b
# 31B dense flagship — needs 32 GB+ unified memory
ollama pull gemma4:31b
Interactive chat
ollama run gemma4 "Summarize the differences between MoE and dense LLMs in 3 bullets."
Use the local OpenAI-compatible API
Ollama exposes an OpenAI-compatible HTTP server on http://localhost:11434/v1. Anything that speaks the OpenAI SDK can point at it:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
model="gemma4",
messages=[{"role": "user", "content": "What is unified memory?"}],
)
print(resp.choices[0].message.content)
Multimodal (vision) input
Gemma 4 accepts images natively. With Ollama:
ollama run gemma4 "Describe this screenshot." ./screenshot.png
Run Gemma 3 (legacy / fallback path)
If you specifically need Gemma 3 — for an existing fine-tune, a pinned dependency, or to compare generations — the install path is the same except for the pull tag:
ollama pull gemma3:1b # smallest
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b # largest
ollama run gemma3:4b "What is AI?"
Direct download from Hugging Face
hf download google/gemma-3-4b-it
You'll need to accept the Gemma terms on the model page once before download works.
Run directly from Python (transformers + MPS)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "google/gemma-3-4b-it"
device = "mps" if torch.backends.mps.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch.bfloat16
).to(device)
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
The same script works for Gemma 4 by swapping model_name to e.g. google/gemma-4-e4b-it.
Performance tuning on Apple Silicon
- Use Ollama 0.6+ for MLX: confirm with
ollama --version. The MLX backend roughly doubled tokens/sec for Gemma-class models versus the pre-March 2026 Metal path. - Match quantization to RAM: Ollama defaults to Q4_K_M, which is the right call. Drop to Q3 only if you're memory-bound; expect a small quality hit.
- Keep a single model loaded:
OLLAMA_KEEP_ALIVE=30m ollama serveavoids the cold-load penalty between requests. - Watch
powermetrics:sudo powermetrics --samplers gpu_powershows whether the GPU is actually pegged. If it idles, MPS is not engaged. - Batch when you can: for embedding or classification workloads, batching 8–32 inputs per call is a 3–5× throughput win.
Troubleshooting
brew install huggingface-clifails: expected — there's no such formula. Usepip install -U "huggingface_hub[cli]".- "Out of memory" on 16 GB Mac: drop to E4B (Gemma 4) or 4B (Gemma 3), close Chrome/Slack, and check Activity Monitor's "Memory Pressure" panel.
- MPS not detected:
python -c "import torch; print(torch.backends.mps.is_available())"should printTrue. IfFalse, you're either on Intel or on a too-old PyTorch (need 2.1+). - Ollama hangs on first pull: usually disk space;
df -hand clear the~/.ollama/modelsdirectory if needed. - 403 from Hugging Face: re-run
hf auth loginand accept the model's license on the web UI.
Advanced use
Fine-tuning on a Mac
For LoRA-scale fine-tunes, MLX's own training stack (mlx-lm) is now the fastest path on Apple Silicon and beats transformers + MPS by a wide margin. Full fine-tunes still want a CUDA box.
Local agents and tool use
Gemma 4's native function-calling makes it a credible local backend for coding agents. Pair it with Ollama and a runner like OpenClaw — the OpenClaw + Ollama setup guide walks through the full local agent loop.
Other 2026 local-LLM options on Mac
- DeepSeek V4 — strong reasoning model, larger memory footprint.
- Qwen 3.6 — multilingual leader; Apache 2.0.
- Llama 4 — Meta's flagship open-weight; 17B and 109B MoE variants.
- Phi-4 — Microsoft's small-model line, competitive on reasoning at <10B.
Conclusion
Running Gemma locally on a Mac is straightforward in April 2026: install Ollama, pull gemma4 (or gemma3:4b if you have a reason to stay on the older line), and you have a private, low-latency LLM with a 256K context window and native vision. The MLX-backed inference path on Apple Silicon makes the M-series the strongest commodity hardware for this size class, full stop.