Run Qwen3-8B on Mac: 2026 Installation Guide (Ollama, MLX, llama.cpp)
Last updated April 2026.
Qwen3-8B is an 8.2-billion-parameter open-weight large language model from Alibaba's Qwen team that runs comfortably on Apple Silicon Macs with as little as 16 GB of unified memory. It is still actively maintained, ships with first-class support in Ollama, llama.cpp, and Apple's own MLX framework, and remains one of the best "fits in a MacBook" models for general chat, coding, and agentic workflows. This guide walks through three installation paths (Ollama, MLX, and llama.cpp), shows how to size quantization to your RAM, and covers what changed for Mac users in 2026.
What changed in 2026
- Ollama v0.22.0 shipped on April 28, 2026, with continued MLX runner improvements first introduced in earlier 0.21.x releases.
- Apple M5, M5 Pro, and M5 Max MacBook Pros launched on March 3, 2026, adding a per-core Neural Accelerator in the GPU. Apple claims up to 4x AI performance vs. M4-series and up to 8x vs. M1.
- Qwen3.5-27B / 35B-A3B (February 2026) and Qwen3.6-27B (April 2026) joined the family as larger successors. Qwen3-8B is still the recommended pick for 16-24 GB Macs.
- mlx-lm 0.24+ is the Qwen team's officially recommended path for native Apple Silicon inference.
Want the full picture? Read our continuously-updated Self-Hosting LLMs Complete Guide (2026) — hardware, Ollama and vLLM, cost-per-token, and when to self-host.
For the model family specifically, see our continuously-updated Qwen 3.5 complete guide — benchmarks, hardware requirements, and deployment patterns across the Qwen 3 family.
TL;DR
- Easiest: install Ollama, then run ollama run qwen3:8b. ~5.2 GB download; works on any Apple Silicon Mac with 16 GB+ RAM.
- Fastest on Apple Silicon: use mlx-lm with an MLX-quantized build of Qwen3-8B from the mlx-community org on Hugging Face.
- Most flexible: use llama.cpp with a GGUF quantization (Q4_K_M or Q5_K_M is the sweet spot for 16 GB Macs).
- Stay on Qwen3-8B for laptops; reach for Qwen3-14B / Qwen3-30B-A3B / Qwen3.6-27B only if you have 32 GB+ unified memory.
What are the system requirements?
- Chip: Apple Silicon (M1, M2, M3, M4, or M5 family). Intel Macs get no Metal acceleration from these runtimes and are impractically slow for 8B models.
- Unified memory: 16 GB minimum for a 4-bit Qwen3-8B, 24 GB+ recommended for headroom or longer contexts.
- Disk space: ~6-10 GB per quantization you keep cached.
- macOS: macOS 14 Sonoma or later (Ollama's current minimum, per the official Mac download page).
- Network: only needed for the initial weight download; inference is fully offline.
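If you want to script the checklist above, a tiny helper (our own, not part of any tool) captures the thresholds:

```python
import platform

def meets_requirements(machine: str, ram_gb: int, macos_major: int) -> bool:
    """Mirror the checklist above: Apple Silicon, 16 GB+ unified memory, macOS 14+."""
    return machine == "arm64" and ram_gb >= 16 and macos_major >= 14

# macOS reports "arm64" on Apple Silicon, "x86_64" on Intel:
print(platform.machine())
```

On the Mac itself you can fill in the RAM and OS values from `sysctl hw.memsize` and `sw_vers -productVersion`.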
Path 1: Install with Ollama (recommended for most people)
1. Install Ollama
The official one-liner from ollama.com/download/mac:
curl -fsSL https://ollama.com/install.sh | sh
Or download the .dmg from the same page and drag Ollama.app into /Applications. Verify the install:
ollama --version
2. Pull and run Qwen3-8B
The default qwen3:8b tag in the Ollama library is a 4-bit quantization weighing 5.2 GB with a 40K-token context window:
ollama run qwen3:8b
The first run downloads the weights; subsequent launches start instantly because the blob is cached under ~/.ollama/models.
3. Use the local API
Ollama exposes an OpenAI-compatible REST API on localhost:11434 while it is running:
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Explain quantum computing in three sentences."
}'
4. Manage models
ollama list # show installed models
ollama pull qwen3:14b # grab a bigger sibling
ollama rm qwen3:8b # free disk space
Path 2: Native MLX with mlx-lm (fastest on Apple Silicon)
The Qwen3 README explicitly recommends mlx-lm (version 0.24.0 or newer) for Apple Silicon, and points users to MLX-suffixed builds on Hugging Face.
pip install -U mlx-lm
# Run a 4-bit MLX build of Qwen3-8B from the mlx-community org
mlx_lm.generate \
--model mlx-community/Qwen3-8B-4bit \
--prompt "Write a haiku about unified memory."
For an interactive REPL:
mlx_lm.chat --model mlx-community/Qwen3-8B-4bit
Path 3: llama.cpp with GGUF (most knobs)
If you want full control over context length, RoPE scaling, KV-cache quantization, or batching, build llama.cpp and grab a GGUF:
brew install llama.cpp
# Download a community Qwen3-8B GGUF (e.g., Q4_K_M) into ./models
# Then run a chat session:
llama-cli -m ./models/Qwen3-8B-Q4_K_M.gguf -p "Hello"
llama.cpp on Apple Silicon uses Metal automatically; no extra flags required. Per the official Qwen3-8B model card, the model has a native 32,768-token context that can be stretched to 131,072 tokens via YaRN scaling.
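The YaRN extension above is just a 4x stretch over the native window. A quick sketch; the llama.cpp flag names in the comment are taken from recent builds and may differ in yours, so treat them as assumptions and confirm with llama-cli --help:

```python
NATIVE_CTX = 32_768   # Qwen3-8B native context, per the model card
TARGET_CTX = 131_072  # extended context via YaRN

scale = TARGET_CTX / NATIVE_CTX
print(scale)  # 4.0

# In recent llama.cpp builds this maps to flags roughly like:
#   llama-cli -m ./models/Qwen3-8B-Q4_K_M.gguf -c 131072 \
#     --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
# (flag names have changed across versions; check your build's --help)
```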
Quantization vs. RAM on Apple Silicon
Approximate on-disk and resident sizes for Qwen3-8B. Resident memory at runtime adds the KV-cache, which scales with context length.
| Quantization | Approx. weights size | Recommended unified memory | Notes |
|---|---|---|---|
| Q4_K_M / 4-bit MLX | ~5 GB | 16 GB | Default Ollama tag; best size/quality balance. |
| Q5_K_M / 5-bit MLX | ~6 GB | 16-24 GB | Noticeably better reasoning vs. Q4 with small RAM cost. |
| Q6_K / 6-bit MLX | ~7 GB | 24 GB+ | Near-FP16 quality; good for long contexts. |
| Q8_0 / 8-bit MLX | ~9 GB | 24-32 GB | Effectively lossless; pick this if you have RAM to spare. |
| FP16 | ~16 GB | 32 GB+ | Reference quality; only on Pro/Max/Ultra/Studio class machines. |
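Resident memory is roughly weights plus KV cache. Here is a rough estimator, assuming the attention shape commonly reported for Qwen3-8B (36 layers, 8 KV heads, head dim 128; verify against the model card) and an unquantized FP16 cache. Runtimes that quantize the KV cache will use less:

```python
def kv_cache_gib(tokens: int, layers: int = 36, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 K+V cache in GiB: 2 tensors (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token_bytes / 2**30

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.1f} GiB")
```

This is why a full 131K context is out of reach on a 16 GB machine even at Q4 weights: the cache alone can dwarf the model.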
Apple Silicon and local LLMs in 2026
Per Apple's March 3, 2026 newsroom announcement, the M5 Pro and M5 Max MacBook Pros add a Neural Accelerator inside every GPU core, with Apple claiming up to 4x AI performance vs. the previous generation and up to 4x faster LLM prompt processing on M5 Max vs. M4 Max. Unified memory tops out at 64 GB on M5 Pro and 128 GB on M5 Max, which means 8B-class models like Qwen3-8B leave headroom for IDE, browser, and other apps even at higher quantizations.
That said, you do not need an M5 to run this model. Qwen3-8B was designed with consumer hardware in mind and runs well on M1/M2/M3/M4 systems with 16 GB+ unified memory.
Picking a Qwen version in 2026
- Qwen3-8B - the focus of this guide. 8.2B params, 32K native / 131K extended context, fits 16 GB Macs at Q4. Confirmed in the official Hugging Face model card.
- Qwen3-14B - same family, better quality, needs ~10 GB at Q4. Practical on 24 GB+ Macs.
- Qwen3-30B-A3B - MoE variant with only ~3B active parameters per token. Heavier on disk (~19 GB in Ollama) but fast at inference; comfortable on 32 GB Macs.
- Qwen3.5-27B / Qwen3.5-35B-A3B - released February 2026 per huggingface.co/Qwen; heavier still, aimed at workstations.
- Qwen3.6-27B - released April 2026, dense 27B with vision encoder; reported scores on the model card include 77.2% on SWE-bench Verified, 86.2% on MMLU-Pro, and 87.8% on GPQA Diamond. Best run on Macs with 64 GB+ unified memory, even when quantized.
For most laptops, Qwen3-8B is still the right default. Move up only when you have measured a real bottleneck.
How do you troubleshoot common issues?
- Out of memory / hard reboot: drop one quantization tier (e.g., Q5_K_M to Q4_K_M), shrink the context window (inside an ollama run session, type /set parameter num_ctx 8192), or close memory hogs (Chrome, Docker, Xcode).
- "model not found" from Ollama: double-check the tag at ollama.com/library/qwen3 and confirm network access; corporate proxies often break the manifest fetch.
- Slow first token, then fast: normal; that delay is prompt processing. Larger contexts and bigger models magnify it; the M5 Pro/Max specifically target this with the per-core Neural Accelerator.
- MLX import error: ensure mlx-lm >= 0.24.0, as required by the Qwen3 README.
- Ollama refuses to install: verify macOS 14 Sonoma or later, per the official Mac download page.
Going further: agents and tool use
Once Qwen3-8B is running locally, you can wire it into an agent loop. For a Codersera walk-through that pairs Ollama with an open-source coding agent, see our OpenClaw + Ollama setup guide for running local AI agents.
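As a concrete starting point, here is a minimal sketch using only the Python standard library that does one round-trip against Ollama's local /api/generate endpoint; the build_payload and ask helpers are ours, not part of any SDK:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "qwen3:8b") -> dict:
    # stream=False asks Ollama for one JSON object instead of chunked output
    return {"model": model, "prompt": prompt, "stream": False}

def ask(prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires `ollama run qwen3:8b` (or the Ollama app) running locally.
    print(ask("Explain quantum computing in three sentences."))
```

A real agent loop would wrap ask() in a plan-act-observe cycle and parse tool calls out of the response; the linked guide covers that wiring.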
FAQ
Is Qwen3-8B still relevant in 2026 now that Qwen3.5 and Qwen3.6 exist?
Yes. Qwen3.5 and Qwen3.6 are higher-end models (27B-397B) that target workstations and servers. Qwen3-8B remains the best Qwen fit for 16-24 GB MacBooks.
Should I use Ollama, MLX, or llama.cpp?
Ollama is easiest. MLX is generally fastest on Apple Silicon because it uses Metal directly and is built by Apple's ML team. llama.cpp gives the most knobs (KV-cache quantization, custom RoPE, exotic quants).
Can I run Qwen3-8B without Internet?
After the first download, yes - inference runs entirely on-device, on the GPU via Metal.
What context length can I expect?
32,768 tokens natively, up to 131,072 via YaRN scaling, per the official Qwen3-8B model card.
Do I need an M5?
No. Any M1 or later with 16 GB+ unified memory will run Qwen3-8B at Q4. M5 Pro / M5 Max simply give faster prompt processing and headroom for larger models.
References
- Qwen3-8B model card on Hugging Face - parameters, context length, quantization variants.
- QwenLM/Qwen3 GitHub repository - official model sizes and mlx-lm guidance.
- Qwen organization on Hugging Face - current Qwen3, Qwen3.5, and Qwen3.6 family listings.
- Qwen3.6-27B model card - April 2026 release, benchmark scores.
- Ollama releases on GitHub - v0.22.0 released April 28, 2026.
- Ollama Qwen3 library page - tag list and download sizes.
- Ollama macOS download page - install methods and macOS 14 Sonoma requirement.
- Apple Newsroom: MacBook Pro with M5 Pro and M5 Max - March 3, 2026 announcement, AI performance claims.
- ml-explore/mlx-lm - official MLX language model toolkit.