Running Llama 4 on Mac: 2026 Installation Guide (Scout, Maverick, MLX & Ollama)
Last updated April 2026 — refreshed for current model/tool versions.
Meta's Llama 4 family — Scout (109B MoE / 17B active), Maverick (400B MoE / 17B active), and the still-internal Behemoth (~2T) — remains the most capable open-weight, natively multimodal stack you can run on a Mac in April 2026, even after the April 8, 2026 announcement of Llama 5. Scout in particular fits comfortably on Apple Silicon at 4-bit quantization and is the practical default for local inference on macOS.
This guide walks through the current toolchain (Ollama, llama.cpp, MLX), the actual RAM you need for each Llama 4 variant, the right quantization (Q4_K_M, Q5_K_M, FP8) for each Mac tier, and the exact commands that work today.
Want the full picture? Read our continuously-updated Llama 4 Complete Guide (2026) — Scout and Maverick variants, MoE architecture, and deployment patterns.
What changed in 2026
- Llama 5 shipped April 8, 2026. Flagship is ~600B parameters with up to 5M-token context. It is heavier than Llama 4 and, for most local Mac use cases, Llama 4 Scout is still the right pick. Llama 5 lives in a separate Ollama namespace (ollama pull llama5) once weights are mirrored.
- Llama 4 Behemoth is still not publicly released as of April 2026. Meta continues to use it as a teacher model. Do not plan a local run around it.
- Ollama tags consolidated. llama4:scout (alias of llama4:16x17b, ~67 GB at Q4_K_M) and llama4:maverick (alias of llama4:128x17b, ~245 GB at Q4_K_M) are the two real local options.
- MLX is now first-class on Apple Silicon. For Scout on M3 Ultra / M4 Max / M5 Max, MLX often beats llama.cpp Metal by 20–40% tokens/sec at the same quantization.
- llama.cpp build switched to CMake. The old make target was removed in late 2025; use cmake -B build.
- Quantization defaults moved. Q4_K_M is the new community baseline; FP8 KV-cache is supported on M4/M5-class chips and roughly halves KV memory.
Hardware requirements (April 2026)
Llama 4 is a Mixture-of-Experts model. Only 17B parameters are active per token, but you must hold all expert weights in unified memory. Plan for roughly 70–75% of your unified memory being usable for model weights — macOS, the runtime, and KV cache claim the rest.
| Model | Quantization | On-disk | Recommended unified RAM | Realistic Mac |
|---|---|---|---|---|
| Llama 4 Scout (16x17B) | Q4_K_M | ~67 GB | 96 GB | M3 Max 128 GB, M4 Max 128 GB, M3/M4 Ultra |
| Llama 4 Scout (16x17B) | Q5_K_M | ~78 GB | 128 GB | M3 Ultra 192 GB, M4 Ultra 256 GB |
| Llama 4 Scout (16x17B) | FP8 | ~109 GB | 192 GB | M3/M4 Ultra 192 GB+, M5 Ultra |
| Llama 4 Maverick (128x17B) | Q4_K_M | ~245 GB | 384 GB | M3 Ultra 512 GB Mac Studio, M4 Ultra 512 GB |
| Llama 4 Maverick (128x17B) | Q3_K_M | ~190 GB | 256 GB | M3 Ultra 256 GB+ (tight) |
| Llama 4 Behemoth | n/a | n/a | n/a (not publicly released) | None |
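The on-disk figures in the table follow from simple arithmetic: total parameters times bits per weight, divided by 8. The sketch below reproduces that estimate and the 75% usable-memory rule of thumb. The function names are illustrative, and the 0.75 default is the upper end of the 70–75% planning range above, not a value reported by macOS.

```python
def weight_size_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return total_params_billions * bits_per_weight / 8


def usable_memory_gb(unified_gb: float, usable_fraction: float = 0.75) -> float:
    """Memory assumed available for weights once macOS, the runtime, and KV cache take their share."""
    return unified_gb * usable_fraction


def fits_in_memory(model_gb: float, unified_gb: float, usable_fraction: float = 0.75) -> bool:
    """True when the weights fit inside the assumed usable share of unified memory."""
    return model_gb <= usable_memory_gb(unified_gb, usable_fraction)


if __name__ == "__main__":
    # Scout has 109B total parameters; an 8-bit format stores 8 bits per weight.
    print("Scout at 8 bpw (GB):", weight_size_gb(109, 8))
    # Sizes below are the on-disk figures from the table.
    print("Scout Q4_K_M on 96 GB:", fits_in_memory(67, 96))
    print("Maverick Q4_K_M on 256 GB:", fits_in_memory(245, 256))
```

Treat the result as a planning estimate only: a long context window adds KV cache on top of the weights.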
Other requirements
- Chip: Apple Silicon (M-series). Intel Macs are no longer practical for Llama 4 — neither MLX nor Ollama's Metal backend targets them well in 2026.
- Disk: 80 GB free for Scout Q4_K_M, 260 GB for Maverick Q4_K_M. Use the internal SSD; external Thunderbolt drives bottleneck cold-start mmap.
- macOS: 14.5 (Sonoma) or newer. macOS 15 (Sequoia) and macOS 26 (Tahoe) both work; Tahoe has the best Metal compute scheduler.
- Cloud fallback for Maverick: AWS Bedrock, Azure AI Foundry, or Together AI all host Llama 4 Maverick if your Mac can't. Behemoth is unreleased, so there is no hosted option for it.
Toolchain in 2026
- Ollama (v0.6+): the simplest path. One command to pull, serve, and prompt.
- llama.cpp: full control over quantization, GGUF conversion, and server flags. Build via CMake.
- MLX / mlx-lm: Apple's native ML framework. Best throughput on Apple Silicon for Scout-sized models.
- Homebrew: package manager for macOS dependencies.
- Xcode Command Line Tools: needed by llama.cpp and MLX wheels.
- Python 3.12+ (arm64): install via Homebrew or pyenv. Avoid Rosetta builds.
Step-by-step installation
1. Prepare your Mac
Confirm Terminal is running native arm64, not under Rosetta:
uname -m # should print arm64
arch # should print arm64
If it prints x86_64, right-click Terminal in Finder → Get Info → uncheck "Open using Rosetta", then reopen.
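If you would rather run the same check from Python (for example at the top of a setup script), a minimal sketch follows. It assumes that platform.machine() reports arm64 for a native process and that the sysctl key sysctl.proc_translated returns 1 for a process translated by Rosetta. The helper names are illustrative.

```python
import platform
import subprocess


def is_native_arm64(machine: str) -> bool:
    """True when the reported machine type is native Apple Silicon."""
    return machine == "arm64"


def running_under_rosetta() -> bool:
    """True only when macOS reports that this process is translated by Rosetta."""
    if platform.system() != "Darwin":
        return False
    result = subprocess.run(
        ["sysctl", "-n", "sysctl.proc_translated"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0 and result.stdout.strip() == "1"


if __name__ == "__main__":
    print("machine:", platform.machine())
    print("native arm64:", is_native_arm64(platform.machine()))
    print("under Rosetta:", running_under_rosetta())
```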
Install Xcode Command Line Tools:
xcode-select --install
2. Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
3. Install Python (arm64)
brew install python@3.12
python3.12 --version # expect 3.12.x
file -L "$(which python3.12)" # -L follows the Homebrew symlink; expect "Mach-O 64-bit executable arm64"
4. Install Ollama and pull Llama 4
brew install --cask ollama
ollama serve &
# Scout — fits on a 96 GB+ Mac
ollama pull llama4:scout
# Maverick — needs ~384 GB unified memory
ollama pull llama4:maverick
# Inspect tags and digests
ollama list
Run an interactive session:
ollama run llama4:scout
Or hit the OpenAI-compatible HTTP endpoint at http://localhost:11434/v1. For an end-to-end agent stack on top of this, see our OpenClaw + Ollama setup guide for running local AI agents — it walks through wiring Ollama into a tool-using agent loop with retrieval and local code execution.
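Before pointing a client at the endpoint, it can help to confirm the model is actually present. The sketch below uses only the standard library and Ollama's native /api/tags route, which lists locally pulled models. The function names are illustrative, and the default port 11434 is assumed.

```python
import json
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))


def has_model(names: list[str], wanted: str) -> bool:
    """True when the wanted tag appears in the list of local models."""
    return wanted in names


# Usage, with `ollama serve` running:
#   names = list_local_models()
#   print(has_model(names, "llama4:scout"))
```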
5. Build llama.cpp (Metal)
llama.cpp removed the legacy make target in late 2025. Use CMake with Metal:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
Binaries land in build/bin/. The current entrypoints are llama-cli (interactive), llama-server (OpenAI-compatible HTTP), and llama-quantize. The old ./main binary is gone.
6. Download Llama 4 weights
Two paths:
- Via Ollama (already done above): weights live under ~/.ollama/models/.
- Direct from Hugging Face for llama.cpp / MLX use. Accept the Llama 4 Community License at meta-llama/Llama-4-Scout-17B-16E-Instruct, then:
pip install -U "huggingface_hub[cli]"
huggingface-cli login
# GGUF quantizations published by community (e.g. unsloth, bartowski)
huggingface-cli download unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF \
--include "*Q4_K_M*" --local-dir ~/models/llama4-scout-q4
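Community GGUF repos often split large models into several shards, so the exact file name inside the download directory may not match what you expect. The helper below is a sketch that finds the file to pass to -m. It assumes shards follow the -00001-of-0000N.gguf naming that llama.cpp's split tool produces, and that llama.cpp loads the remaining shards when pointed at the first one. Check the repo's file list if the layout differs.

```python
import re
from pathlib import Path
from typing import Optional

# Matches shard suffixes such as "-00001-of-00002.gguf".
_SHARD_SUFFIX = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def _is_first_shard(path: Path) -> bool:
    match = _SHARD_SUFFIX.search(path.name)
    return match is not None and int(match.group(1)) == 1


def find_gguf_entrypoint(model_dir: Path) -> Optional[Path]:
    """Return the first shard if the model is split, else the first .gguf found."""
    files = sorted(model_dir.rglob("*.gguf"))
    if not files:
        return None
    first_shards = [f for f in files if _is_first_shard(f)]
    return first_shards[0] if first_shards else files[0]


if __name__ == "__main__":
    # Directory used by the download command in this guide.
    print(find_gguf_entrypoint(Path.home() / "models" / "llama4-scout-q4"))
```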
7. Run inference with llama.cpp
./build/bin/llama-cli \
-m ~/models/llama4-scout-q4/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
-ngl 999 \
-c 32768 \
-t 8 \
-p "Summarize the architectural differences between MoE and dense transformers."
-ngl 999 offloads all layers to Metal. -c 32768 sets context length — Scout supports up to 10M tokens but practical local context on 128 GB Macs is 32K–128K before KV cache pressure starts to matter.
For a persistent server:
./build/bin/llama-server \
-m ~/models/llama4-scout-q4/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
-ngl 999 -c 65536 --port 8080 \
--cache-type-k q8_0 --cache-type-v q8_0
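The --cache-type-k and --cache-type-v flags store the KV cache as q8_0, an 8-bit block-quantized integer format rather than true FP8. That cuts KV memory roughly in half relative to the default f16 cache, with minimal quality loss. A quantized V cache depends on flash attention, so if the server rejects the setting, check the flash-attention option in llama-server --help.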
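To see why context length drives memory pressure, you can estimate KV cache size directly: two tensors (K and V) per layer, each holding KV heads x head dimension x context tokens. The sketch below does that arithmetic. The Scout shape values (48 layers, 8 KV heads, head dimension 128) are assumptions to verify against the model's config.json. The result is an upper bound that treats every layer as full attention, and the q8_0 cost assumes 32 one-byte values plus a 2-byte scale per block.

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values plus one fp16 scale per block

# Assumed Scout attention shape; confirm against config.json before relying on it.
SCOUT_SHAPE = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: float) -> float:
    """Upper-bound KV cache size: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


def to_gib(num_bytes: float) -> float:
    return num_bytes / 2**30


if __name__ == "__main__":
    for ctx in (32768, 65536, 131072):
        f16 = to_gib(kv_cache_bytes(context_tokens=ctx, bytes_per_element=F16_BYTES, **SCOUT_SHAPE))
        q8 = to_gib(kv_cache_bytes(context_tokens=ctx, bytes_per_element=Q8_0_BYTES, **SCOUT_SHAPE))
        print(f"{ctx:>7} tokens: f16 ~{f16:.1f} GiB, q8_0 ~{q8:.1f} GiB")
```

Under these assumptions a 32K context costs about 6 GiB at f16, which is why the headroom left after loading 67 GB of weights disappears quickly on a 96 GB machine.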
8. (Optional) Run with MLX for best Apple Silicon throughput
pip install mlx-lm
mlx_lm.generate \
--model mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit \
--prompt "Explain mixture-of-experts routing." \
--max-tokens 512
On Scout, MLX often outperforms llama.cpp by roughly 20–40% in tokens/sec on M3 Ultra / M4 Max / M5 Max at equivalent quantization. The mlx-community organization on Hugging Face publishes 4-bit and 8-bit Llama 4 Scout conversions that are ready to pull.
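Throughput differences vary by chip, quantization, and prompt length, so it is worth measuring on your own machine. The sketch below is a backend-agnostic timing harness. The generate callable is hypothetical: you would wrap an MLX or llama.cpp call so that it runs one completion and returns the number of tokens produced.

```python
import time
from typing import Callable


def measure_tokens_per_second(generate: Callable[[], int], runs: int = 3) -> float:
    """Average tokens/sec over several runs; generate() returns tokens produced."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return total_tokens / elapsed


def speedup_percent(baseline_tps: float, candidate_tps: float) -> float:
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate_tps - baseline_tps) / baseline_tps * 100


# Usage sketch (both wrappers are placeholders you would write yourself):
#   llama_tps = measure_tokens_per_second(run_llama_cpp_once)
#   mlx_tps = measure_tokens_per_second(run_mlx_once)
#   print(f"MLX gain: {speedup_percent(llama_tps, mlx_tps):.0f}%")
```

Use the same prompt, output length, and quantization for both backends, and discard the first run if it includes model load time.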
Quantization cheatsheet
| Format | Quality | Bits per weight (approx.) | When to use |
|---|---|---|---|
| Q3_K_M | Noticeable degradation | ~3.9 bpw | Last resort to fit Maverick on 256 GB |
| Q4_K_M | Near-lossless for chat | ~4.9 bpw | Default for Scout/Maverick on Apple Silicon |
| Q5_K_M | Near-lossless for most tasks | ~5.7 bpw | Coding/math when you have 128 GB+ headroom |
| Q6_K | Very close to FP16 | ~6.6 bpw | Long-context evaluation runs |
| FP8 (E4M3) | Very close to BF16 | ~8 bpw | M4/M5-class chips, throughput-bound serving |
| BF16 | Reference | 16 bpw | Only on 256 GB+ Ultra; rarely worth it locally |
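As a quick way to apply the cheatsheet, the sketch below picks the largest Scout quantization that fits a given Mac. The sizes are the on-disk figures from the hardware table earlier in this guide. The 0.75 usable fraction and the function name are assumptions for illustration.

```python
from typing import Optional

# On-disk sizes in GB for Llama 4 Scout, taken from the hardware table in this guide.
SCOUT_SIZES_GB = {"Q4_K_M": 67, "Q5_K_M": 78, "FP8": 109}


def pick_scout_quant(unified_gb: float, usable_fraction: float = 0.75) -> Optional[str]:
    """Return the largest listed quantization that fits, or None if none does."""
    budget_gb = unified_gb * usable_fraction
    best = None
    for name, size_gb in sorted(SCOUT_SIZES_GB.items(), key=lambda item: item[1]):
        if size_gb <= budget_gb:
            best = name
    return best


if __name__ == "__main__":
    for ram in (64, 96, 128, 192):
        print(f"{ram} GB unified memory -> {pick_scout_quant(ram)}")
```

The picker ignores KV cache, so step down one level if you plan to run long contexts.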
Python integration
Talk to Ollama via its OpenAI-compatible API — no subprocess plumbing needed:
from openai import OpenAI

# Ollama ignores the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Write a Python function that LRU-caches an async call."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
For MLX:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit")
print(generate(model, tokenizer, prompt="Hello", max_tokens=200))
2026 alternatives to Llama 4 on Mac
If Llama 4 Scout is overkill or won't fit:
- Gemma 4 (Google, 2026 update of Gemma 3) — 9B and 27B dense variants run on 32–48 GB Macs.
- Qwen 3.5 / Qwen 3.6 (Alibaba) — 32B and 72B dense, strong at code and multilingual reasoning. The 32B fits Q4_K_M on a 32 GB MacBook Pro.
- DeepSeek V4 — successor to DeepSeek V3; MoE architecture similar to Llama 4 with 37B active parameters. Cloud-first, but Q4 GGUFs exist.
- Llama 5 — only sensible if you have an M4/M5 Ultra with 256 GB+. Otherwise stick to Scout.
- Hosted: Claude 4.7 via API for tasks that need stronger reasoning than any local Mac model can deliver, or GPT-5.5 via OpenAI for multimodal-heavy workloads.
Troubleshooting
Rosetta and architecture mismatches
If pip install pulls x86_64 wheels, your Python is running under Rosetta. Reinstall via Homebrew with brew install python@3.12 and rebuild your venv.
Ollama "out of memory" or constant swapping
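Ollama's default context for Llama 4 is 8192. Scout's Q4_K_M weights alone take ~67 GB, so bumping the context past 32K on a 96 GB machine can push you into swap. Either keep the context small (for example OLLAMA_CONTEXT_LENGTH=8192) or quantize the KV cache, which in Ollama also requires flash attention to be enabled:
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve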
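Context length can also be capped per request instead of server-wide. The sketch below builds a request for Ollama's native /api/chat route with options.num_ctx set explicitly. The helper names are illustrative, and the default port 11434 is assumed.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a non-streaming /api/chat request with an explicit context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def chat(payload: dict, base_url: str = "http://localhost:11434") -> str:
    """Send the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)["message"]["content"]


# Usage, with `ollama serve` running:
#   print(chat(build_chat_payload("llama4:scout", "Hello", num_ctx=8192)))
```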
Metal shader compile errors on first run
Update Xcode CLT (sudo rm -rf /Library/Developer/CommandLineTools && xcode-select --install) and rebuild llama.cpp from clean.
Gatekeeper / developer verification
If macOS quarantines an Ollama or llama.cpp binary, clear the attribute on that file specifically rather than disabling Gatekeeper system-wide:
xattr -d com.apple.quarantine /path/to/binary
Avoid spctl --master-disable; it has been deprecated in macOS 15 and is a blunt instrument.
Thread tuning
On Apple Silicon, set -t to the number of performance cores, not total cores. M3 Max and M4 Max have up to 12 P-cores and M3 Ultra has up to 24, depending on the configuration. Setting -t higher than the P-core count schedules work onto efficiency cores and can slow inference.
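Rather than memorizing core counts per chip, you can ask the machine. The sketch below reads the hw.perflevel0.physicalcpu sysctl key, which on Apple Silicon is assumed to report the performance-core count, and falls back to the total CPU count when the key is unavailable. The function names are illustrative.

```python
import os
import subprocess
from typing import Optional


def parse_core_count(sysctl_output: str) -> Optional[int]:
    """Parse a positive integer from sysctl output, or return None."""
    text = sysctl_output.strip()
    if text.isdigit() and int(text) > 0:
        return int(text)
    return None


def performance_core_count() -> int:
    """P-core count on Apple Silicon; total CPU count as a fallback elsewhere."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0:
            count = parse_core_count(result.stdout)
            if count is not None:
                return count
    except OSError:
        pass  # sysctl is missing on this system
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"suggested llama.cpp flag: -t {performance_core_count()}")
```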