Qwen

How to Run Qwen 3.6 Locally: 27B Dense vs 35B MoE (2026 Guide)

Run Qwen 3.6 locally: 27B dense vs 35B-A3B MoE explained, VRAM tables per quant, and copy-paste Ollama, llama.cpp, vLLM, and MLX commands.

Published 18 May 2026 • Updated 18 May 2026 • 10 min read

Quick answer. Run Qwen3.6-35B-A3B (MoE, ~3B active) if you want speed on a 24GB GPU or Apple Silicon — community reports ~120 tok/s on an RTX 4090 at Q4. Run Qwen3.6-27B (dense, all 27B active) if you want maximum coding quality and can spend the throughput. Both are Apache 2.0; Ollama, llama.cpp, vLLM, and MLX all support them.

Qwen 3.6 is the most-discussed open-weight release on r/LocalLLaMA this month, and almost every thread asks the same question: do I run the 27B dense model or the 35B-A3B MoE one? They are from the same family, both fit a 24GB consumer GPU at 4-bit, and both ship under Apache 2.0 — but they behave very differently at inference time. This guide answers the variant question first, then gives you copy-paste setup for Ollama, llama.cpp, vLLM, and MLX on Apple Silicon, with VRAM tables and the throughput numbers people are actually reporting.

All hardware and speed figures below are community-reported or vendor-published and labelled as such. Quantized file sizes vary by builder (Unsloth dynamic quants differ from stock GGUF), so treat every GB figure as a planning estimate, not a guarantee.

What Qwen 3.6 variants exist, and which do you run?

Qwen 3.6 shipped as a family. The two open-weight variants that matter for local use are the 27B dense model and the 35B-A3B MoE model. Both are multimodal hybrid-thinking models with a 256K-class context window and Apache 2.0 licensing.

Variant	Architecture	Total params	Active per token	Best for
Qwen3.6-35B-A3B	Mixture-of-Experts	~35B	~3B	Fast local inference on 24GB GPU / Apple Silicon
Qwen3.6-27B	Dense	~27B	All ~27B	Maximum coding/agentic quality per VRAM

The headline result Qwen published for the 27B dense model: it surpasses the previous-generation open-source flagship Qwen3.5-397B-A17B (397B total / 17B active) across major coding benchmarks — a 27B dense model beating a 397B MoE one (Qwen reports SWE-bench Verified 77.2 vs 76.2, SWE-bench Pro 53.5 vs 50.9, and Terminal-Bench 2.0 59.3 vs 52.5 in its Qwen3.6-27B release blog). That is a vendor-reported claim; treat benchmark wins as directional and validate on your own tasks. For how Qwen 3.6 stacks up against the rest of the field, see our roundup of the best open-source LLMs in 2026 and the head-to-head Gemma 4 vs Qwen3.6 comparison.

Decision rule:

You have a 24GB GPU (RTX 3090/4090) or a 32GB+ Mac and want responsiveness → 35B-A3B. With only ~3B active params it generates tokens roughly 3–5x faster than the 27B dense on identical hardware (community-reported).
You want the strongest coding/agentic output and accept lower tok/s → 27B dense. Every parameter fires on every token, which is slower but is where the flagship coding numbers come from. (If coding is your primary use, the dedicated Qwen3-Coder-Next checkpoint is also worth a look.)
You are CPU-only or RAM-constrained → 35B-A3B. MoE's small active footprint makes CPU/partial-offload far more tolerable than a fully dense model.

How does dense differ from MoE in practice?

A dense model runs every weight for every token. Qwen3.6-27B activates all ~27B parameters on each forward pass — predictable, consistently high quality, but compute scales with the full parameter count.

A Mixture-of-Experts model has many expert sub-networks and a router that picks a few per token. Qwen3.6-35B-A3B holds ~35B total parameters but only activates ~3B per token. The practical consequences for local hosting:

Dimension	27B Dense	35B-A3B MoE
VRAM to hold weights	Lower (27B < 35B)	Higher (must hold all 35B)
Compute per token	High (27B active)	Low (~3B active)
Token generation speed	Slower	~3–5x faster (community-reported)
CPU / partial offload	Painful	Tolerable
Quality ceiling	Flagship coding (vendor claim)	Very strong, slightly behind dense on hardest tasks

The counterintuitive part: the MoE model needs more VRAM to load (all 35B weights must be resident) but runs faster (only ~3B compute per token). The dense model is the opposite — smaller to load, slower to run. Pick based on whether your bottleneck is VRAM capacity or tokens-per-second.

How much VRAM do you need per quant?

These are community-reported figures aggregated from Unsloth GGUF builds and independent VRAM trackers. They cover weights at a modest (4K–32K) context. Long context inflates the KV cache substantially — budget extra headroom if you push toward the full 256K/1M window.

Qwen3.6-27B (dense) — community-reported:

Quant	Approx VRAM	Fits on
Q4_K_M	~17 GB	RTX 4080 16GB (tight), 3090/4090 24GB, M-series 24GB+
Q5_K_M	~19–20 GB	RTX 3090/4090 24GB
Q6_K	~22–23 GB	RTX 3090/4090 24GB (tight), 5090 32GB
Q8_0	~28–29 GB	RTX 5090 32GB, Mac 36GB+
BF16 (full)	~55–56 GB	48GB+ class / multi-GPU / 64GB Mac

Qwen3.6-35B-A3B (MoE) — community-reported:

Quant	Approx VRAM	Fits on
Q4_K_M	~19–22 GB	RTX 3090/4090 24GB
Q5_K_M	~25–27 GB	RTX 5090 32GB, Mac 32GB+
Q6_K	~27–32 GB	RTX 5090 32GB, Mac 36GB+
Q8_0	~37–39 GB	48GB class, Mac 64GB+
BF16 (full)	~69–70 GB	2x 48GB / 80GB / 96GB Mac

Rule of thumb: a single 24GB GPU comfortably runs either model at Q4. A 32GB GPU or 36GB+ Mac opens Q6/Q8 on the 27B and Q5/Q6 on the 35B-A3B. For RAM-only or partial-offload setups, the 35B-A3B is the only one of the two that stays usable.

How do you run Qwen 3.6 with Ollama?

Ollama is the fastest path if you just want a working text endpoint. Both variants have library tags:

# 35B-A3B MoE — recommended default (fast)
ollama pull qwen3.6:35b
ollama run qwen3.6:35b

# 27B dense — higher quality, slower
ollama pull qwen3.6:27b
ollama run qwen3.6:27b

# Pin a specific quant tag
ollama run qwen3.6:27b-q4_K_M

One real caveat to know before you start: at time of writing, the vision (multimodal) path is broken in Ollama for Qwen 3.6 because the model ships its vision projector (mmproj) as a separate file that Ollama's GGUF flow does not wire up. Text generation works fine; if you need image input, use llama.cpp or MLX/MLX-VLM instead. This is a current-state limitation reported by multiple builders and may change as tooling catches up — verify against the Ollama library page for your version.

To call it from code once it is running:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.6:35b",
  "prompt": "Explain MoE vs dense in two sentences.",
  "stream": false
}'

How do you run Qwen 3.6 with llama.cpp?

llama.cpp gives you the most control over quant, context, and offload — and it handles the separate vision projector that breaks Ollama. Build it current first:

# Build with CUDA (set -DGGML_CUDA=OFF for CPU-only or Apple Metal)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --target llama-cli llama-server

Run the MoE model directly from a Hugging Face GGUF repo (Unsloth dynamic quants shown; --jinja is required so the Qwen chat template is applied):

# 35B-A3B MoE, 4-bit, full GPU offload, 32K context
./llama.cpp/build/bin/llama-cli \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
  --jinja -ngl 99 --ctx-size 32768 \
  --temp 0.7 --top-p 0.8 --top-k 20

# 27B dense, 4-bit, full GPU offload
./llama.cpp/build/bin/llama-cli \
  -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
  --jinja -ngl 99 --ctx-size 32768

For an OpenAI-compatible HTTP server instead of an interactive CLI:

./llama.cpp/build/bin/llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
  --jinja -ngl 99 --ctx-size 32768 --port 8080

Notes: drop -ngl (or lower it) to keep some layers on CPU if you are VRAM-short — the MoE model tolerates this far better than the dense one. Repo tag names like UD-Q4_K_XL vary by builder; check the model card on Hugging Face for the exact quant tags available.

How do you serve Qwen 3.6 with vLLM?

vLLM is the right choice for a multi-user or production-style endpoint with high throughput batching. It serves the full-precision weights (or an AWQ/FP8 quant) and exposes an OpenAI-compatible API:

pip install -U vllm

# Single high-VRAM GPU, MoE model
vllm serve Qwen/Qwen3.6-35B-A3B \
  --max-model-len 262144 \
  --reasoning-parser qwen3

# Multi-GPU tensor parallel (e.g. 8 GPUs)
vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

To enable agentic tool calling — useful if you want to drive it as a free local coding agent — add the parser flags:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 8 --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder

Full BF16 weights are large (~55GB for 27B, ~69GB for 35B-A3B), so a single-GPU vLLM deploy of the full model needs an 80GB-class card; otherwise use tensor parallelism or a quantized checkpoint. vLLM is overkill for a single developer on one workstation — use Ollama or llama.cpp there. Our vLLM vs Ollama vs LM Studio production benchmark quantifies exactly where each backend wins.

Companion guide

For the full picture on the Qwen family — benchmarks, variant history, fine-tuning, and where it fits against other open models — see our Qwen complete guide for 2026.

How do you run Qwen 3.6 on Apple Silicon with MLX?

On an M-series Mac, MLX is the fastest backend — it uses unified memory natively, so a 32GB+ Mac handles either model at 4-bit. Use mlx-lm for text and mlx-vlm for vision (MLX handles the separate vision projector, unlike Ollama).

pip install -U mlx-lm

# One-shot generation, 35B-A3B 4-bit
mlx_lm.generate \
  --model mlx-community/Qwen3.6-35B-A3B-4bit \
  --prompt "Explain the difference between MoE and dense models." \
  --max-tokens 512

# 27B dense 4-bit
mlx_lm.generate \
  --model mlx-community/Qwen3.6-27B-4bit \
  --prompt "Write a Python LRU cache." \
  --max-tokens 512

For an interactive chat session or a local server:

# Interactive chat
mlx_lm.chat --model mlx-community/Qwen3.6-35B-A3B-4bit

# OpenAI-compatible server
mlx_lm.server --model mlx-community/Qwen3.6-35B-A3B-4bit --port 8080

Look for repos under the mlx-community org on Hugging Face ending in -4bit / -6bit / -8bit. On Apple Silicon, the 35B-A3B MoE is the more comfortable pick — the small active footprint keeps token generation snappy even on a base M-series Pro.

What performance should you expect?

These are community-reported numbers from individual benchmarks (Medium write-ups, Hugging Face discussions, independent blogs) — not vendor figures, and highly config-dependent. Treat them as ranges, not guarantees.

Variant / hardware	Reported throughput	Notes
35B-A3B Q4 on RTX 4090	~120+ tok/s	MoE's small active footprint; community-reported
27B Q4 on RTX 4090 (baseline)	~43 tok/s	Stock config, full dense compute
27B Q4 on RTX 4090 (tuned)	~122–154 tok/s	With speculative decoding / batching; single-report claims
35B-A3B on 64GB M-series Mac	Usable interactive	MLX 4-bit; community-reported

The pattern is consistent across reports: the MoE model is dramatically faster out of the box, while the dense model needs tuning (speculative decoding, draft models, batching) to close the gap. If you have not invested in an optimized inference config, expect the 35B-A3B to feel noticeably more responsive day to day.

How do you troubleshoot common Qwen 3.6 issues?

Vision/image input fails in Ollama. Expected current-state behaviour — the mmproj projector ships separately and Ollama does not wire it up yet. Use llama.cpp (load the projector explicitly) or MLX-VLM for multimodal.
Out-of-memory at load. The 35B-A3B must hold all 35B weights resident; if Q4 OOMs on a 24GB card, lower context size first, then drop a quant level, then offload layers to CPU with a lower -ngl in llama.cpp.
OOM only at long context. The KV cache grows with context length and can add tens of GB at the full 256K/1M window. Reduce --ctx-size / --max-model-len to what you actually need.
Garbled or broken chat output in llama.cpp. You almost certainly forgot --jinja. The Qwen chat template must be applied or the model receives malformed prompts.
Slower than expected on the 27B dense. That is inherent — all 27B params fire per token. Either switch to 35B-A3B for speed or invest in speculative decoding / a draft model to recover throughput.
Quant tag not found. Builder tag names differ (stock Q4_K_M vs Unsloth UD-Q4_K_XL). Open the specific Hugging Face GGUF repo and read the available tags before pulling.

Is Qwen 3.6 free for commercial use?

The open-weight variants — Qwen3.6-27B and Qwen3.6-35B-A3B — are released under Apache 2.0, which permits commercial use, modification, and redistribution with attribution and the standard patent/notice terms. Note that Qwen 3.6 also introduced some proprietary (API-only, non-open) variants in the family; those are not Apache 2.0. For self-hosting, stick to the two open-weight checkpoints above and verify the LICENSE file on the specific Hugging Face repo you download — license terms are the one thing you should never take from a blog (including this one) without confirming at the source.

Who can help you deploy local LLMs in production?

Getting Qwen 3.6 running on a laptop is an afternoon. Getting a quantized MoE model serving a team reliably — with the right vLLM/llama.cpp config, KV-cache budgeting, autoscaling, and observability — is real infrastructure work. If you are hiring vetted remote developers experienced with local LLM deployment, inference optimization, and self-hosted model infrastructure, codersera.com/hire matches you with engineers who have shipped exactly this in production, with a risk-free trial so you can validate technical fit before committing.

FAQ

Should I run Qwen3.6-27B or 35B-A3B locally?

Default to 35B-A3B (MoE) for a 24GB GPU or Apple Silicon — it generates tokens roughly 3–5x faster because only ~3B of its parameters are active per token (community-reported). Choose the 27B dense model when you want the strongest coding/agentic output and can accept lower throughput, since every parameter fires on every token.

Why does the MoE model need more VRAM but run faster?

All 35B MoE weights must be resident in memory even though only ~3B are used per token, so it needs more VRAM to load than the 27B dense model. But compute per token scales with the ~3B active params, not the 35B total — so it runs faster. The dense model is the inverse: smaller to load, heavier to run.

Can I run Qwen 3.6 on a 24GB GPU?

Yes. An RTX 3090 or 4090 (24GB) comfortably runs either variant at Q4: ~17GB for the 27B and ~19–22GB for the 35B-A3B at modest context (community-reported). Long context inflates the KV cache and can push you over 24GB, so keep --ctx-size to what you need or drop a quant level.

Does Qwen 3.6 vision work in Ollama?

Not reliably at time of writing. Qwen 3.6 ships its vision projector (mmproj) as a separate file, and Ollama's GGUF flow does not wire it up, so multimodal input fails while text generation works fine. Use llama.cpp (with the projector loaded explicitly) or MLX-VLM on Apple Silicon for image input. Re-check the Ollama library page, as tooling support changes.

What context length does Qwen 3.6 support?

The open-weight Qwen 3.6 models have a 256K-class native context (Qwen3.6-27B is documented at 262,144 tokens natively, extensible toward ~1M with scaling techniques). In practice, the full window costs a large KV cache — budget tens of GB of extra memory and only raise the context limit to what your workload actually requires.

Is Qwen 3.6 good for coding compared to larger models?

Qwen states the 27B dense model surpasses its previous open-source flagship Qwen3.5-397B-A17B (397B total / 17B active) across major coding benchmarks — a 27B model beating a 397B one. That is a vendor-reported claim; benchmark wins are directional, so validate on your own codebase and tasks before standardizing on it.

Which backend is fastest for Qwen 3.6?

For a single developer: Ollama is the simplest, llama.cpp gives the most control (and handles vision). For Apple Silicon, MLX is fastest because it uses unified memory natively. For a multi-user or production endpoint, vLLM delivers the highest aggregate throughput via batching and tensor parallelism. Match the backend to whether you are optimizing for one user's latency or many users' throughput.