Run Qwen3-Next-80B-A3B on macOS: Step-by-Step 2026 Guide

Quick answer. Run Qwen3-Next-80B-A3B on Apple Silicon by installing MLX-LM and loading the halley-ai 4-bit MLX build; you need roughly 42 GB free SSD and 48 GB unified RAM, with 64 GB recommended. For full-precision serving with multi-token prediction, use SGLang 0.5.2+ or vLLM 0.10.2+. Transformers must currently be installed from main to avoid a qwen3_next KeyError.

Last updated: May 1, 2026.

What changed in this 2026 refresh: Refreshed for current tool versions (MLX-LM, SGLang >=0.5.2, vLLM >=0.10.2, Ollama 0.19 MLX backend) and added Qwen3.5/3.6 successor context for May 2026.

This guide walks through running Qwen3-Next-80B-A3B locally on Apple Silicon Macs in 2026, using MLX-LM for the 4-bit quantized build and SGLang or vLLM for full-precision serving with Multi-Token Prediction (MTP). Every command, version pin, and config block has been verified against the model's Hugging Face card and the upstream serving frameworks as of May 2026.

What changed in 2026: Qwen3-Next-80B-A3B (released September 2025) is still production-current and not deprecated, but Alibaba shipped Qwen3.5 (Feb 2026) and Qwen3.6 (April 22, 2026) as successor families. SGLang now requires >=0.5.2, vLLM >=0.10.2, and Transformers must be installed from main until a tagged release lands — earlier builds throw KeyError: 'qwen3_next'. Ollama 0.19 (March 2026) added an experimental MLX backend on Apple Silicon, giving you another viable local-serving path. If you want a head-to-head local-agent setup that wires Ollama into a desktop UI, see our OpenClaw + Ollama setup guide for running local AI agents.

Want the full picture? Read our continuously-updated Qwen 3.5 Complete Guide (2026) — flavors, licensing, benchmarks, and on-device usage.

TL;DR

  • Model: Qwen3-Next-80B-A3B (Instruct or Thinking variant) — 80B total, ~3B active per token via ultra-sparse MoE.
  • Apple Silicon path: pip install mlx-lm, then load halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64. Needs ~42 GB free SSD and ≥48 GB unified RAM (64 GB+ recommended).
  • Server path: SGLang >=0.5.2 or vLLM >=0.10.2 with --speculative-algo NEXTN (SGLang) or qwen3_next_mtp (vLLM) for multi-token prediction.
  • Context: 262K native, extendable to ~1M via YaRN RoPE scaling.
  • 2026 alternatives if you want newer: Qwen3.6-27B (dense, lighter), Qwen3.5-35B-A3B (drop-in MoE upgrade), Qwen3.5-122B-A10B (heavier MoE).

Model overview: what Qwen3-Next-80B-A3B actually is

Qwen3-Next-80B-A3B is an ultra-sparse Mixture-of-Experts model from Alibaba's Qwen team, released on September 11, 2025. The "A3B" suffix signals that only ~3B parameters activate per token despite the 80B total — a roughly 3.75% activation ratio, much sparser than Mixtral-style MoE.

Architecture highlights

  • Hybrid attention: alternating Gated DeltaNet blocks and Gated Attention blocks. DeltaNet handles long-range linearly; gated attention preserves precision on local context.
  • Ultra-sparse MoE: 512 routed experts plus shared experts; top-10 routing keeps activated parameters near 3B.
  • Native Multi-Token Prediction (MTP) head: the model architecture itself emits speculative drafts, so MTP isn't just a serving trick — it's baked in. SGLang and vLLM both expose this directly.
  • Pre-training: 15T tokens, with reasoning, code, and multilingual emphasis (119 languages).
  • Context: 262,144 tokens native, extendable to ~1,010,000 via YaRN RoPE scaling.

Instruct vs. Thinking variants

  • Qwen3-Next-80B-A3B-Instruct — direct instruction-following, no chain-of-thought scaffolding. Best for chat, coding, agents, and most product workloads.
  • Qwen3-Next-80B-A3B-Thinking — emits an explicit reasoning trace before the final answer. Use when you want visible deliberation (math, multi-hop reasoning, evals).

Why run it locally on macOS

  • Privacy: prompts and outputs stay on-device — no third-party logs.
  • Latency: first-token latency on M3 Max / M4 Max sits well below typical hosted API round-trips for short prompts.
  • Cost: free after the hardware. Hosted Qwen3-Next-80B-A3B-Instruct on OpenRouter and similar still bills per token.
  • Unified memory: MLX maps the entire 4-bit weight set into unified RAM, so no CPU↔GPU transfer overhead — the bottleneck most x86 setups hit.

System requirements

ComponentMinimumRecommended (May 2026)
macOS13.5 VenturamacOS 15 Sequoia or 16 (current)
ChipApple Silicon M1M3 Max / M4 Max / M5 Pro+
Unified RAM48 GB64–128 GB
Free SSD~42 GB (4-bit MLX build)≥80 GB if you also keep the BF16 weights
Python3.103.11 or 3.12
MLX-LM0.20+Latest from PyPI (tracks main)
Transformersgit+https://github.com/huggingface/transformersSame — Qwen3-Next still requires main as of May 2026

Intel Macs: not supported for MLX. You can fall back to llama.cpp with a GGUF Qwen3-Next conversion, but throughput will be a fraction of MLX on Apple Silicon and the long-context path is currently unreliable.


Installation and setup

Step 1: prerequisites

brew update
brew install python@3.12 git
python3.12 -m pip install --upgrade pip setuptools wheel

Step 2: isolated environment

python3.12 -m venv ~/.qwen3_env
source ~/.qwen3_env/bin/activate

Step 3: install MLX-LM

pip install --upgrade mlx-lm

If you also want to experiment with the BF16 weights via Transformers, install from main — Qwen3-Next is still not in a tagged release as of May 2026:

pip install --upgrade "git+https://github.com/huggingface/transformers.git@main"

Step 4: load the 4-bit MLX build

The community-maintained 4-bit quantization (group size 64) is the recommended Apple Silicon build:

from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64")

output = generate(
    model,
    tokenizer,
    prompt="Explain the Chudnovsky algorithm for computing pi.",
    max_tokens=256,
)
print(output)

Step 5: run from the CLI

mlx_lm generate \
  --model halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64 \
  --prompt "What is the capital of France?" \
  --max-kv-size 4096 \
  --max-tokens 256

Server deployment: SGLang and vLLM

MLX is the simplest path on Mac, but SGLang and vLLM expose the model's native MTP head and OpenAI-compatible HTTP API — useful when wiring Qwen3-Next into agents, IDE plugins, or your own backend. Both require CUDA for full speed; on macOS they run CPU-only and are mostly useful for testing the API surface before moving the workload to a GPU box.

SGLang (≥ 0.5.2)

pip install 'sglang[all]>=0.5.2'

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --port 30000 \
  --context-length 262144 \
  --mem-fraction-static 0.8 \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

vLLM (≥ 0.10.2)

pip install 'vllm>=0.10.2'

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --port 8000 \
  --max-model-len 262144 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Note on MTP in 2026: a recent vLLM patch fixed an accuracy bug where Qwen3-Next-MTP produced incorrect outputs under batched inference. Make sure you're on the latest 0.10.x or newer — pin loosely (vllm>=0.10.2) and let pip pull the current patch release. For low-concurrency latency-sensitive setups, run with num_speculative_tokens: 1 and disable prefix caching; for higher concurrency, drop MTP entirely and rely on continuous batching.

Ollama with MLX backend (March 2026+)

Ollama 0.19 added an experimental MLX backend for Apple Silicon Macs with ≥32 GB unified memory. It bypasses llama.cpp and runs MLX directly. If a Qwen3-Next-80B-A3B MLX build is published under ollama.com/library/qwen3-next, you can try:

ollama pull qwen3-next:80b-a3b-mlx
ollama run qwen3-next:80b-a3b-mlx

Verify the tag exists before running — the MLX-tagged variants ship behind the standard tags and the catalog moves week-to-week.


Extending context to ~1M tokens with YaRN

262K is the native ceiling. To push beyond it, edit config.json in the local model directory and add a rope_scaling block:

{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  }
}

Restart the server with --max-model-len 1010000 (vLLM) or --context-length 1010000 (SGLang). Two practical caveats:

  • Quality degrades past ~600K tokens in our testing — needle-in-haystack recall drops noticeably above that.
  • Memory scales roughly linearly with KV cache size. 1M context with the BF16 weights needs a multi-GPU server; the 4-bit MLX build on a 128 GB M3/M4 Max can hold ~400–500K context comfortably.

Practical examples

Example 1: API documentation generation

from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64")

prompt = """You are an expert API documentation writer.
Generate Markdown docs for this Python function:

def process_image_batch(images, resize, enhance=False):
    \"\"\"Resize a batch of PIL images and optionally enhance them.\"\"\"
    ...

Include: description, parameters with types, return value, example usage.
"""

print(generate(model, tokenizer, prompt=prompt, max_tokens=350, temperature=0.2))

Example 2: local coding agent loop

Qwen3-Next is strong at tool-calling and multi-step planning, which makes it a credible local backbone for an agent loop. Pair it with a structured output parser (JSON Schema or Pydantic) and a small set of tools:

from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64")

system = """You are a coding agent. Respond ONLY with JSON of the form:
{"tool": "shell|read_file|write_file|done", "args": {...}, "rationale": "..."}"""

user = "Find every TODO in src/ and write a summary to TODOS.md."

prompt = f"<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"
print(generate(model, tokenizer, prompt=prompt, max_tokens=400, temperature=0.0))

For a fuller agent stack (UI, tool wiring, sandbox), the pattern in our OpenClaw + Ollama setup guide swaps cleanly between Qwen3-Next, Qwen3.6, and Llama-class backends.


2026 alternatives worth considering

ModelReleasedWhen to prefer it
Qwen3.6-27BApril 22, 2026Dense 27B with vision support; fits in ~24 GB at 4-bit. Faster on Mac, easier deploy. Use when 80B MoE is overkill.
Qwen3.5-35B-A3BFebruary 24, 2026Successor MoE with same 3B active footprint as Qwen3-Next-80B but smaller total. Slightly weaker on reasoning, much smaller on disk (~18 GB at 4-bit).
Qwen3.5-122B-A10BFebruary 24, 2026Heavier MoE, 10B active. Better reasoning ceiling than Qwen3-Next-80B-A3B but needs ~70 GB free SSD and 96 GB+ unified RAM at 4-bit.
Qwen3.5-397B-A17BFebruary 16, 2026Frontier-tier MoE. Not realistic on a single Mac; deploy on a multi-GPU server.
DeepSeek V42026Stronger code/reasoning at similar active size; consider for code-heavy agents. No first-class MLX build yet.
Gemma 42026Best in the small-dense category for on-device. Not a Qwen3-Next replacement — different size class.

Qwen3-Next-80B-A3B is still the best specific trade-off for "I want a frontier-quality MoE on a 64–128 GB Mac" — the newer Qwen3.5/3.6 dense models are smaller, and the larger Qwen3.5 MoEs need more RAM than most Macs ship with. If you're starting fresh in mid-2026 and want the latest architecture, Qwen3.6-27B is the easiest on-ramp.


Troubleshooting

  • KeyError: 'qwen3_next' on import — your Transformers is too old. Reinstall from main: pip install --upgrade "git+https://github.com/huggingface/transformers.git@main".
  • OOM during load — close other apps, then verify Python actually has access to your unified RAM (sysctl hw.memsize). On 32 GB Macs the 4-bit build is borderline; expect frequent swap.
  • Garbled output under MTP — known fixed bug; upgrade vLLM to the latest 0.10.x. If it persists, drop num_speculative_tokens to 1 or disable MTP entirely.
  • Long-context crashes (>200K) — reduce --max-kv-size, lower YaRN factor, or check that rope_scaling was actually written to config.json (some loaders silently ignore it).
  • Slow first token — first call compiles Metal kernels; second call onward should be 5–10× faster. Don't benchmark on a cold load.