Running Llama 4 on Mac: 2026 Installation Guide (Scout, Maverick, MLX & Ollama)

Published 08 Apr 2025 • Updated 31 May 2026 • 7 min read

Quick answer. Llama 4 Scout (109B MoE / 17B active) runs on Apple Silicon Macs with 96 GB+ unified memory via Ollama or MLX at Q4_K_M (~67 GB of weights). Maverick (400B / 17B active) needs roughly 384 GB for Q4_K_M (~245 GB of weights). On M3 Ultra / M4 Max / M5 Max, MLX often beats llama.cpp Metal by 20-40% tok/s at the same quant.

Last updated April 2026 — refreshed for current model/tool versions.

Meta's Llama 4 family — Scout (109B MoE / 17B active), Maverick (400B MoE / 17B active), and the still-internal Behemoth (~2T) — remains the most capable open-weight, natively multimodal stack you can run on a Mac in April 2026. Scout in particular fits comfortably on Apple Silicon at 4-bit quantization and is the practical default for local inference on macOS.

This guide walks through the current toolchain (Ollama, llama.cpp, MLX), the actual RAM you need for each Llama 4 variant, the right quantization (Q4_K_M, Q5_K_M, FP8) for each Mac tier, and the exact commands that work today.

What changed in 2026

Llama 4 Behemoth is still not publicly released as of April 2026. Meta continues to use it as a teacher model. Do not plan a local run around it.
Ollama tags consolidated. llama4:scout (alias of llama4:16x17b, ~67 GB at Q4_K_M) and llama4:maverick (alias of llama4:128x17b, ~245 GB at Q4_K_M) are the two real local options.
MLX is now first-class on Apple Silicon. For Scout on M3 Ultra / M4 Max / M5 Max, MLX often beats llama.cpp Metal by 20–40% tokens/sec at the same quantization.
llama.cpp build switched to CMake. The old make target was removed in late 2025; use cmake -B build.
Quantization defaults moved. Q4_K_M is the new community baseline; FP8 KV-cache is supported on M4/M5-class chips and roughly halves KV memory.

Hardware requirements (April 2026)

Llama 4 is a Mixture-of-Experts model. Only 17B parameters are active per token, but you must hold all expert weights in unified memory. Plan for roughly 70–75% of your unified memory being usable for model weights — macOS, the runtime, and KV cache claim the rest.

Model	Quantization	On-disk	Recommended unified RAM	Realistic Mac
Llama 4 Scout (16x17B)	Q4_K_M	~67 GB	96 GB	M3 Max 128 GB, M4 Max 128 GB, M3/M4 Ultra
Llama 4 Scout (16x17B)	Q5_K_M	~78 GB	128 GB	M3 Ultra 192 GB, M4 Ultra 256 GB
Llama 4 Scout (16x17B)	FP8	~109 GB	192 GB	M3/M4 Ultra 192 GB+, M5 Ultra
Llama 4 Maverick (128x17B)	Q4_K_M	~245 GB	384 GB	M3 Ultra 512 GB Mac Studio, M4 Ultra 512 GB
Llama 4 Maverick (128x17B)	Q3_K_M	~190 GB	256 GB	M3 Ultra 256 GB+ (tight)
Llama 4 Behemoth	—	—	not publicly released	none (not available to run or rent)
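The 70–75% rule of thumb can be turned into a quick fit check. The sketch below is illustrative only: `fits_in_unified_memory` and its `usable_fraction` default are hypothetical names chosen here, and the sizes are copied from the table above rather than measured.

```python
def fits_in_unified_memory(weights_gb: float, ram_gb: float,
                           usable_fraction: float = 0.75) -> tuple[bool, float]:
    """Return (fits, headroom_gb) for a given weight size and Mac RAM.

    usable_fraction is the share of unified memory assumed to be available
    for weights (0.70-0.75 per the rule of thumb); it is an estimate, not
    something macOS guarantees.
    """
    budget_gb = ram_gb * usable_fraction
    return weights_gb <= budget_gb, round(budget_gb - weights_gb, 1)


# On-disk sizes and RAM tiers taken from the hardware table.
for name, weights_gb, ram_gb in [
    ("Scout Q4_K_M", 67, 96),
    ("Scout Q4_K_M", 67, 64),
    ("Maverick Q3_K_M", 190, 256),
    ("Maverick Q4_K_M", 245, 256),
]:
    fits, headroom = fits_in_unified_memory(weights_gb, ram_gb)
    print(f"{name} on {ram_gb} GB: fits={fits}, headroom={headroom} GB")
```

A small positive headroom, as with Maverick Q3_K_M on 256 GB, matches the "tight" label in the table: the weights load, but little is left for KV cache.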

Other requirements

Chip: Apple Silicon (M-series). Intel Macs are no longer practical for Llama 4 — neither MLX nor Ollama's Metal backend targets them well in 2026.
Disk: 80 GB free for Scout Q4_K_M, 260 GB for Maverick Q4_K_M. Use the internal SSD; external Thunderbolt drives bottleneck cold-start mmap.
macOS: 14.5 (Sonoma) or newer. macOS 15 (Sequoia) and macOS 26 (Tahoe) both work; Tahoe has the best Metal compute scheduler.
Cloud fallback for Maverick: AWS Bedrock, Azure AI Foundry, or Together AI all host Llama 4 Maverick if your Mac can't run it. Behemoth has not been released, so no provider hosts it.

Toolchain in 2026

Ollama (v0.6+): the simplest path. One command to pull, serve, and prompt.
llama.cpp: full control over quantization, GGUF conversion, and server flags. Build via CMake.
MLX / mlx-lm: Apple's native ML framework. Best throughput on Apple Silicon for Scout-sized models.
Homebrew: package manager for macOS dependencies.
Xcode Command Line Tools: needed by llama.cpp and MLX wheels.
Python 3.12+ (arm64): install via Homebrew or pyenv. Avoid Rosetta builds.

Step-by-step installation

1. Prepare your Mac

Confirm Terminal is running native arm64, not under Rosetta:

uname -m   # should print arm64
arch       # should print arm64

If it prints x86_64, right-click Terminal in Finder → Get Info → uncheck "Open using Rosetta", then reopen.

Install Xcode Command Line Tools:

xcode-select --install

2. Install Homebrew

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

3. Install Python (arm64)

brew install python@3.12
python3 --version   # expect 3.12.x
file -L "$(which python3)"   # -L follows the Homebrew symlink; expect "Mach-O 64-bit executable arm64"
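The same check can be made from inside the interpreter, which is useful in a virtualenv where `which python3` may point somewhere unexpected. This is a minimal sketch; `is_native_arm64` is a hypothetical helper name, and it relies only on the standard library's `platform.machine()`, which reports `x86_64` when Python runs under Rosetta.

```python
import platform
import sys


def is_native_arm64() -> bool:
    """True when this interpreter is an arm64 build running on macOS."""
    return sys.platform == "darwin" and platform.machine() == "arm64"


print(f"platform={sys.platform} machine={platform.machine()}")
if not is_native_arm64():
    print("Not a native arm64 macOS interpreter; check for Rosetta.")
```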

4. Install Ollama and pull Llama 4

brew install --cask ollama
ollama serve &

# Scout — fits on a 96 GB+ Mac
ollama pull llama4:scout

# Maverick — needs ~384 GB unified memory
ollama pull llama4:maverick

# Inspect tags and digests
ollama list

Run an interactive session:

ollama run llama4:scout

Or hit the OpenAI-compatible HTTP endpoint at http://localhost:11434/v1. For an end-to-end agent stack on top of this, see our OpenClaw + Ollama setup guide for running local AI agents — it walks through wiring Ollama into a tool-using agent loop with retrieval and local code execution.
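Before pointing a client at the endpoint, it can help to confirm that the server is up and the tag has been pulled. The sketch below queries Ollama's `/api/tags` listing using only the standard library; the helper names are hypothetical, and it assumes the default port 11434.

```python
import json
import urllib.error
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default local port


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def ollama_has_model(tag: str, url: str = OLLAMA_TAGS_URL) -> bool:
    """Return True if the server is reachable and lists `tag`."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return False  # server down, unreachable, or unexpected body
    return tag in model_names(payload)


print("llama4:scout available:", ollama_has_model("llama4:scout"))
```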

5. Build llama.cpp (Metal)

llama.cpp removed the legacy make target in late 2025. Use CMake with Metal:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

Binaries land in build/bin/. The current entrypoints are llama-cli (interactive), llama-server (OpenAI-compatible HTTP), and llama-quantize. The old ./main binary is gone.

6. Download Llama 4 weights

Two paths:

Via Ollama (already done above) — weights live under ~/.ollama/models/.
Direct from Hugging Face for llama.cpp / MLX use. Accept the Llama 4 Community License at meta-llama/Llama-4-Scout-17B-16E-Instruct, then:

pip install -U "huggingface_hub[cli]"
huggingface-cli login

# GGUF quantizations published by community (e.g. unsloth, bartowski)
huggingface-cli download unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "*Q4_K_M*" --local-dir ~/models/llama4-scout-q4
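If you would rather script the download, `huggingface_hub` exposes the same filtering through `snapshot_download`. The following is a sketch under assumptions: the helper names are hypothetical, the repo and target directory simply mirror the CLI call above, and you must already be logged in with the license accepted.

```python
from pathlib import Path


def gguf_download_kwargs() -> dict:
    """Build snapshot_download arguments mirroring the CLI call."""
    return {
        "repo_id": "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
        "allow_patterns": ["*Q4_K_M*"],  # same filter as --include
        "local_dir": str(Path.home() / "models" / "llama4-scout-q4"),
    }


def download_scout_gguf() -> str:
    """Download the matching GGUF files and return the local directory."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**gguf_download_kwargs())


# Call download_scout_gguf() explicitly when ready; it fetches tens of GB.
print(gguf_download_kwargs())
```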

7. Run inference with llama.cpp

./build/bin/llama-cli \
  -m ~/models/llama4-scout-q4/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
  -ngl 999 \
  -c 32768 \
  -t 8 \
  -p "Summarize the architectural differences between MoE and dense transformers."

-ngl 999 offloads all layers to Metal. -c 32768 sets context length — Scout supports up to 10M tokens but practical local context on 128 GB Macs is 32K–128K before KV cache pressure starts to matter.

For a persistent server:

./build/bin/llama-server \
  -m ~/models/llama4-scout-q4/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
  -ngl 999 -c 65536 --port 8080 \
  --cache-type-k q8_0 --cache-type-v q8_0

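The --cache-type-k and --cache-type-v flags store the KV cache as q8_0, an 8-bit block-quantized integer format rather than true FP8. That cuts KV memory roughly in half relative to the FP16 default, with minimal quality loss. Quantizing the V cache requires flash attention, so enable it (the -fa / --flash-attn option) if your build does not turn it on automatically.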
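To see where "roughly half" comes from, and why context length dominates memory once the weights are loaded, you can estimate KV cache size directly. This is a back-of-envelope sketch: the layer count, KV head count, and head dimension below are assumed values for illustration, and the formula treats every layer as full attention, so it is an upper bound rather than what llama.cpp will actually allocate.

```python
# Bytes per cached element. q8_0 packs 32 int8 values plus one fp16 scale
# into a 34-byte block, so it costs 34/32 bytes per element.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 cache_type: str = "f16") -> float:
    """Upper-bound KV cache size in GiB, assuming full attention everywhere."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# ASSUMED shape for illustration (48 layers, 8 KV heads, head_dim 128);
# read the real values from the GGUF metadata llama.cpp prints at load time.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
for ctx in (32768, 65536, 131072):
    f16 = kv_cache_gib(ctx, **shape)
    q8 = kv_cache_gib(ctx, cache_type="q8_0", **shape)
    print(f"ctx={ctx}: f16={f16:.2f} GiB, q8_0={q8:.2f} GiB")
```

Under these assumptions q8_0 costs about 53% of FP16, and the cache grows linearly with context.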

8. (Optional) Run with MLX for best Apple Silicon throughput

pip install mlx-lm
mlx_lm.generate \
  --model mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit \
  --prompt "Explain mixture-of-experts routing." \
  --max-tokens 512

On M3 Ultra / M4 Max / M5 Max, MLX often outperforms llama.cpp on Scout by 20–40% tokens/sec at equivalent quantization. The mlx-community organization on Hugging Face publishes 4-bit and 8-bit Llama 4 Scout conversions that are ready to pull.
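Throughput differences like this depend on chip, context length, and build, so it is worth measuring on your own machine. The harness below is a hypothetical sketch: `tokens_per_second` times any callable that returns a token count, and `fake_backend` is a stand-in so the code runs anywhere. Swap in a function that calls your MLX or llama.cpp setup and returns the number of tokens it generated.

```python
import time
from typing import Callable


def tokens_per_second(generate_fn: Callable[[], int]) -> float:
    """Time one call; generate_fn must return the number of tokens produced."""
    start = time.perf_counter()
    n_tokens = generate_fn()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def speedup_percent(candidate_tps: float, baseline_tps: float) -> float:
    """Relative gain of candidate over baseline, in percent."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# Stand-in workload so the sketch runs anywhere; replace with a real backend.
def fake_backend() -> int:
    time.sleep(0.05)
    return 100


print(f"{tokens_per_second(fake_backend):.0f} tok/s (fake backend)")
print(f"{speedup_percent(26.0, 20.0):.0f}% faster (made-up example numbers)")
```

Use the same prompt, the same output length, and a warm model for both backends, and average several runs before comparing.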

Quantization cheatsheet

Format	Quality	Memory cost	When to use
Q3_K_M	Noticeable degradation	~3.9 bpw	Last resort to fit Maverick on 256 GB
Q4_K_M	Near-lossless for chat	~4.9 bpw	Default for Scout/Maverick on Apple Silicon
Q5_K_M	Lossless for most tasks	~5.7 bpw	Coding/math when you have 128 GB+ headroom
Q6_K	Effectively FP16-equivalent	~6.6 bpw	Long-context evaluation runs
FP8 (E4M3)	Indistinguishable from BF16	~8 bpw	M4/M5-class chips, throughput-bound serving
BF16	Reference	16 bpw	Only on 256 GB+ Ultra; rarely worth it locally
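Bits per weight (bpw) converts to file size as total parameters × bpw ÷ 8. The sketch below applies that formula in both directions, using the parameter counts and on-disk sizes quoted in the hardware table; the function names are hypothetical, and real GGUF files vary slightly because different tensors get different quant types.

```python
def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def implied_bpw(size_gb: float, total_params: float) -> float:
    """Invert the estimate: average bits per weight for a known file size."""
    return size_gb * 1e9 * 8 / total_params


SCOUT, MAVERICK = 109e9, 400e9  # total (not active) parameters

print(f"Scout at 8 bpw: {weights_gb(SCOUT, 8):.0f} GB")
print(f"Scout ~67 GB implies {implied_bpw(67, SCOUT):.1f} bpw")
print(f"Maverick ~245 GB implies {implied_bpw(245, MAVERICK):.1f} bpw")
print(f"Maverick ~190 GB implies {implied_bpw(190, MAVERICK):.1f} bpw")
```

The calculation uses total parameters, not the 17B active ones, because every expert has to be resident in memory.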

Python integration

Talk to Ollama via its OpenAI-compatible API — no subprocess plumbing needed:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Write a Python function that LRU-caches an async call."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)

For MLX:

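from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit")

# Instruct checkpoints expect their chat template, not a bare string.
messages = [{"role": "user", "content": "Hello"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))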

2026 alternatives to Llama 4 on Mac

If Llama 4 Scout is overkill or won't fit:

Gemma 4 (Google, 2026 update of Gemma 3) — 9B and 27B dense variants run on 32–48 GB Macs.
Qwen 3.5 / Qwen 3.6 (Alibaba) — 32B and 72B dense, strong at code and multilingual reasoning. The 32B fits Q4_K_M on a 32 GB MacBook Pro.
DeepSeek V4 — successor to DeepSeek V3; MoE architecture similar to Llama 4 with 37B active parameters. Cloud-first, but Q4 GGUFs exist.
Hosted: Claude 4.7 via API for tasks that need stronger reasoning than any local Mac model can deliver, or GPT-5.5 via OpenAI for multimodal-heavy workloads.

Troubleshooting

Rosetta and architecture mismatches

If pip install pulls x86_64 wheels, your Python is running under Rosetta. Reinstall via Homebrew with brew install python@3.12 and rebuild your venv.

Ollama "out of memory" or constant swapping

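Ollama's default context for Llama 4 is 8192. Bumping it past 32K on a 96 GB machine, where Scout's ~67 GB of Q4_K_M weights already take most of the usable memory, will swap. Either lower context (OLLAMA_CONTEXT_LENGTH=8192) or quantize the KV cache, which Ollama only honors when flash attention is enabled:

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve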
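Context can also be capped per request instead of server-wide, through the `options.num_ctx` field of Ollama's native `/api/chat` endpoint. The sketch below uses only the standard library; `chat_payload` and `chat` are hypothetical helper names, and the default port is assumed.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def chat_payload(prompt: str, model: str = "llama4:scout",
                 num_ctx: int = 8192) -> dict:
    """Build a non-streaming /api/chat body with an explicit context size."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,
    }


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the reply text (needs a running server)."""
    body = json.dumps(chat_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)["message"]["content"]


# Inspect the body without contacting the server:
print(json.dumps(chat_payload("ping", num_ctx=8192), indent=2))
```

Keeping `num_ctx` small for short prompts leaves more unified memory for weights and cuts down on swapping.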

Metal shader compile errors on first run

Update Xcode CLT (sudo rm -rf /Library/Developer/CommandLineTools && xcode-select --install) and rebuild llama.cpp from clean.

Gatekeeper / developer verification

If macOS quarantines an Ollama or llama.cpp binary, clear the attribute on that file specifically rather than disabling Gatekeeper system-wide:

xattr -d com.apple.quarantine /path/to/binary

Avoid spctl --master-disable; it has been deprecated in macOS 15 and is a blunt instrument.

Thread tuning

On Apple Silicon, set -t to the number of performance cores, not total cores. The top-bin M3 Max and M4 Max have 12 P-cores (lower bins have 10), and M3 Ultra has up to 24. Setting -t higher schedules work onto efficiency cores and slows inference.
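Rather than looking up the P-core count per chip, you can ask the machine. The sketch below reads the `hw.perflevel0.physicalcpu` sysctl key, which on Apple Silicon reports the performance-core count; treating that key as present is an assumption, so the hypothetical helper falls back to the logical CPU count elsewhere.

```python
import os
import subprocess


def performance_core_count() -> int:
    """Best-effort P-core count; falls back to all logical CPUs."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True, timeout=5,
        ).stdout.strip()
        return int(out)
    except (OSError, subprocess.SubprocessError, ValueError):
        # Not macOS on Apple Silicon, or the key is missing.
        return os.cpu_count() or 1


print(f"suggested -t value: {performance_core_count()}")
```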
