gemma 3

How to Run Gemma 3 and Gemma 4 on a Mac: 2026 Guide

Published 21 Mar 2025 • Updated 22 May 2026 • 6 min read

Quick answer. Gemma 4 (April 2026) is the current path: `ollama run gemma4` for the 26B MoE (~9-12 GB), or `gemma4:31b` for the dense flagship. 8 GB Macs handle E2B/E4B; 32 GB+ recommended for 26B; 64 GB+ for 31B. Gemma 3 still works for older toolchains.

Last updated May 2026 — refreshed for current model/tool versions.

Google released Gemma 4 on April 2, 2026, superseding Gemma 3 as DeepMind's flagship open-weight family. This guide shows how to run both generations locally on a Mac (Apple Silicon M1–M4): Gemma 3 if you have older tooling or smaller hardware, and Gemma 4 if you want the current state of the art with native multimodal input and a 256K context window.

What changed in 2026
- Gemma 4 launched April 2, 2026 with four sizes: E2B (~2.3B), E4B (~4.5B), a 26B Mixture-of-Experts (~4B active), and a 31B dense flagship. Apache 2.0 license, 256K context, 140+ languages, native text+image (audio/video on smaller variants).
- Ollama on macOS now runs on MLX (March 30, 2026) — Gemma 4 inference on Apple Silicon is materially faster and uses less unified memory than the old Metal path.
- Hugging Face CLI install changed: huggingface-cli is no longer a Homebrew formula. Install via pip install -U "huggingface_hub[cli]". The modern command name is hf (legacy huggingface-cli still ships as an alias).
- Gemma 3 still works and is a reasonable fallback if you're already on it; don't rip it out unless you need 4's multimodal or longer context.

Want the full picture? Read our continuously-updated Gemma 4 Complete Guide (2026) — small-footprint open weights, on-device deployment, and benchmarks.

Why run Gemma locally on a Mac

Privacy: prompts and documents never leave the device.
Latency: first-token latency on M3/M4 with MLX is sub-200 ms for E4B-class models.
Cost: zero per-token cost after the one-time download.
Offline: useful for travel, regulated environments, and CI runs without API keys.
Agentic workflows: Gemma 4 ships with native function-calling, which pairs well with the OpenClaw + Ollama setup guide for running local AI agents if you want a fully local agent loop.

Model sizes — Gemma 3 vs Gemma 4

Family	Variant	Params	Disk (Q4_K_M)	Min unified RAM	Notes
Gemma 3	1B	1B	~0.8 GB	8 GB	Edge / draft model
Gemma 3	4B	4B	~2.5 GB	8 GB	Good general-purpose
Gemma 3	12B	12B	~7 GB	16 GB	Quality bump, fits M-series Pro
Gemma 3	27B	27B	~16 GB	32 GB	Top of the Gemma 3 line
Gemma 4	E2B	~2.3B effective	~1.5 GB	8 GB	Mobile/edge, multimodal
Gemma 4	E4B	~4.5B effective	~3 GB	8 GB	Default for M1/M2 Air
Gemma 4	26B (MoE)	26B total / ~4B active	~14 GB	24–32 GB	Fast inference, big knowledge
Gemma 4	31B dense	30.7B	~18 GB	32 GB+	Flagship; 1452 Arena AI, 89.2% AIME 2026, 80.0% LiveCodeBench

If you're choosing between generations: pick Gemma 4 unless you have a specific reason (existing fine-tunes, tooling lock-in) to stay on Gemma 3. Gemma 4's E4B beats Gemma 3 4B on every published benchmark and adds vision input.

Prerequisites

Hardware

Apple Silicon required for usable speed (M1, M2, M3, or M4 — including base, Pro, Max, and Ultra variants). Intel Macs run via CPU only and are not recommended.
Unified memory: 8 GB handles E2B/E4B and Gemma 3 1B/4B; 16 GB handles Gemma 3 12B and Gemma 4 26B with care; 32 GB+ for Gemma 4 31B and Gemma 3 27B.
Disk: budget ~25 GB free for the larger models plus overhead.

Software

macOS: macOS Sonoma (14) or later. macOS Sequoia (15) and Tahoe (26) are recommended for the latest MLX kernels.
Python: 3.11 or 3.12 (3.9 still works but is end-of-life as of October 2025).
Tooling: Homebrew, Ollama 0.6+, huggingface_hub[cli], optionally uv or Conda for environments.

Install the toolchain

1. Install Ollama

Ollama 0.6+ on macOS uses MLX as the inference backend on Apple Silicon, which is what you want.

brew install ollama
ollama serve &

Or install the desktop app from ollama.com/download if you'd rather have a tray icon.

2. Create a Python environment

Use uv (fast) or Conda. uv example:

brew install uv
uv venv ~/.venvs/gemma
source ~/.venvs/gemma/bin/activate

Conda alternative:

brew install --cask miniconda
conda create -n gemma python=3.12 -y
conda activate gemma

3. Install the Hugging Face CLI

The previous version of this post used brew install huggingface-cli, which never worked — there is no Homebrew formula by that name. Install via pip instead:

pip install -U "huggingface_hub[cli]"
hf auth login    # paste a read token from https://huggingface.co/settings/tokens

The modern entry point is hf. The legacy huggingface-cli name still works as a shim and you'll see it in older docs.

4. (Optional) Install PyTorch / Transformers for direct use

Only needed if you want to call Gemma from Python directly rather than through Ollama:

pip install -U torch torchvision transformers accelerate

PyTorch ships with MPS (Metal) support out of the box on macOS; no extra wheels needed.

Run Gemma 4 (recommended path)

Pull a model with Ollama

# default — Gemma 4 E4B, runs on any 8 GB+ Apple Silicon Mac
ollama pull gemma4

# 26B Mixture-of-Experts — best quality/speed tradeoff for 24–32 GB
ollama pull gemma4:26b

# 31B dense flagship — needs 32 GB+ unified memory
ollama pull gemma4:31b

Interactive chat

ollama run gemma4 "Summarize the differences between MoE and dense LLMs in 3 bullets."

Use the local OpenAI-compatible API

Ollama exposes an OpenAI-compatible HTTP server on http://localhost:11434/v1. Anything that speaks the OpenAI SDK can point at it:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "What is unified memory?"}],
)
print(resp.choices[0].message.content)

Multimodal (vision) input

Gemma 4 accepts images natively. With Ollama:

ollama run gemma4 "Describe this screenshot." ./screenshot.png

Run Gemma 3 (legacy / fallback path)

If you specifically need Gemma 3 — for an existing fine-tune, a pinned dependency, or to compare generations — the install path is the same except for the pull tag:

ollama pull gemma3:1b      # smallest
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b     # largest

ollama run gemma3:4b "What is AI?"

Direct download from Hugging Face

hf download google/gemma-3-4b-it

You'll need to accept the Gemma terms on the model page once before download works.

Run directly from Python (transformers + MPS)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-4b-it"
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).to(device)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

The same script works for Gemma 4 by swapping model_name to e.g. google/gemma-4-e4b-it.

Performance tuning on Apple Silicon

Use Ollama 0.6+ for MLX: confirm with ollama --version. The MLX backend roughly doubled tokens/sec for Gemma-class models versus the pre-March 2026 Metal path.
Match quantization to RAM: Ollama defaults to Q4_K_M, which is the right call. Drop to Q3 only if you're memory-bound; expect a small quality hit.
Keep a single model loaded: OLLAMA_KEEP_ALIVE=30m ollama serve avoids the cold-load penalty between requests.
Watch powermetrics: sudo powermetrics --samplers gpu_power shows whether the GPU is actually pegged. If it idles, MPS is not engaged.
Batch when you can: for embedding or classification workloads, batching 8–32 inputs per call is a 3–5× throughput win.

Troubleshooting

brew install huggingface-cli fails: expected — there's no such formula. Use pip install -U "huggingface_hub[cli]".
"Out of memory" on 16 GB Mac: drop to E4B (Gemma 4) or 4B (Gemma 3), close Chrome/Slack, and check Activity Monitor's "Memory Pressure" panel.
MPS not detected: python -c "import torch; print(torch.backends.mps.is_available())" should print True. If False, you're either on Intel or on a too-old PyTorch (need 2.1+).
Ollama hangs on first pull: usually disk space; df -h and clear the ~/.ollama/models directory if needed.
403 from Hugging Face: re-run hf auth login and accept the model's license on the web UI.

Advanced use

Fine-tuning on a Mac

For LoRA-scale fine-tunes, MLX's own training stack (mlx-lm) is now the fastest path on Apple Silicon and beats transformers + MPS by a wide margin. Full fine-tunes still want a CUDA box.

Local agents and tool use

Gemma 4's native function-calling makes it a credible local backend for coding agents. Pair it with Ollama and a runner like OpenClaw — the OpenClaw + Ollama setup guide walks through the full local agent loop.

Other 2026 local-LLM options on Mac

DeepSeek V4 — strong reasoning model, larger memory footprint.
Qwen 3.6 — multilingual leader; Apache 2.0.
Llama 4 — Meta's flagship open-weight; 17B and 109B MoE variants.
Phi-4 — Microsoft's small-model line, competitive on reasoning at <10B.

Conclusion

Running Gemma locally on a Mac is straightforward in May 2026: install Ollama, pull gemma4 (or gemma3:4b if you have a reason to stay on the older line), and you have a private, low-latency LLM with a 256K context window and native vision. The MLX-backed inference path on Apple Silicon makes the M-series the strongest commodity hardware for this size class, full stop.