Running Qwen3-8B on Windows in 2026: The Complete Ollama and llama.cpp Guide

Quick answer. Qwen3-8B remains the best local LLM for Windows machines with 8-16 GB of VRAM in 2026: 8.2B parameters, 32K context (131K with YaRN), Apache 2.0, and a thinking-mode toggle. Use Ollama 0.22 for one-command setup (ollama run qwen3:8b, 5.2 GB Q4_K_M) or llama.cpp build b8981 if you need fine control over quantization and sampling.

Last updated April 2026 — refreshed for current model/tool versions.

Qwen3-8B is still one of the most capable open-weight 8-billion-parameter LLMs you can run locally on a Windows PC: 8.2B parameters, 32K native context (extendable to 131K with YaRN), Apache 2.0 license, and a built-in thinking/non-thinking mode toggle. This guide gives you the exact 2026-current install path on Windows 10/11 — Ollama for the simple route, llama.cpp for the advanced one — with the VRAM math, real benchmark numbers, and the gotchas you only learn after you break something.

If you are evaluating Qwen3-8B as part of a larger local-agent stack, the OpenClaw + Ollama setup guide for running local AI agents is the companion pillar — it covers the orchestration layer that actually calls a model like this one.

What changed in 2026

  • Newer Qwen models exist, but 8B is still the right pick for ~8–16 GB VRAM. Qwen3.5-27B (Feb 2026) and Qwen3.6-27B (Apr 2026) are dense 27B vision-language models — they need a 24 GB card minimum and bigger ones for full context. Qwen3-8B remains the practical default for laptop/desktop GPUs. Note: there is no Qwen3.5-9B release from Alibaba; rumours of a 9B variant in this family are misinformation. The 3.5/3.6 family starts at 27B.
  • Ollama 0.22.0 (Apr 28, 2026) is the current stable on Windows. The installer no longer needs Administrator rights and installs to %LOCALAPPDATA%\Programs\Ollama. Native NVIDIA CUDA and AMD ROCm builds ship in the same installer.
  • llama.cpp build b8981 (Apr 29, 2026) is the current head. Qwen3 has been first-class since b5092; recent work has been on KV-cache quantization, FlashAttention robustness on WebGPU, and reasoning-budget samplers.
  • The Ollama image for qwen3:8b is 5.2 GB on disk (Q4_K_M, the default), with an 8.19B reported parameter count. ollama run qwen3:8b just works on a 6–8 GB card.
  • Known issue: Qwen3.5/3.6 GGUFs are not usable inside Ollama yet because the loader doesn't handle the mmproj sidecar for vision. If you need 3.5/3.6 today, run llama.cpp directly. Qwen3-8B (text-only) is unaffected.
  • Reference list rebuilt — the original post had dead numeric reference markers. All citations are now linked.

Want the full picture? Read our continuously-updated Qwen 3.5 Complete Guide (2026) — flavors, licensing, benchmarks, and on-device usage.

TL;DR

| Question | Answer |
| --- | --- |
| Is Qwen3-8B still relevant in April 2026? | Yes. Best open 8B-class model for general reasoning, coding, and 100+ languages on consumer hardware. |
| Easiest way to run it on Windows? | ollama run qwen3:8b after installing Ollama 0.22.0. |
| Minimum VRAM (Q4_K_M)? | ~6 GB. Fits an 8 GB RTX 3050/4060. |
| Native context window? | 32,768 tokens; up to 131,072 with YaRN. |
| Throughput on RTX 4090? | ~104 tok/s decode at Q4_K_M (community llama.cpp benches). |
| Should I jump to Qwen3.5/3.6 instead? | Only if you have 24 GB+ VRAM and want vision-language. Otherwise stay on Qwen3-8B. |

Overview of Qwen3-8B

Qwen3-8B is a dense, decoder-only causal language model in the Qwen3 series from the Alibaba Qwen team, released under Apache 2.0. Specs from the official Hugging Face model card:

  • Parameters: 8.2B total, 6.95B non-embedding
  • Layers: 36
  • Attention: Grouped-query, 32 Q heads / 8 KV heads
  • Native context: 32,768 tokens
  • Extended context: 131,072 tokens with YaRN scaling
  • Modes: Switchable thinking (chain-of-thought wrapped in <think>…</think>) and non-thinking, controllable per-turn with /think or /no_think in the prompt
  • Languages: 100+ via multilingual pretraining
  • Tool use: Native function-calling, MCP-friendly

The thinking-mode toggle is the headline feature. Same weights, two behaviours: deep reasoning when you need it, fast direct answers when you don't.

System requirements

Hardware

| Quantization | VRAM (model only) | Practical card | Notes |
| --- | --- | --- | --- |
| FP16 (full) | ~16 GB | RTX 4090 / 5090 / A6000 | Plus headroom for KV cache at long context. |
| Q8_0 | ~10.6 GB | RTX 4070 Ti (12 GB) | Near-lossless quality. |
| Q4_K_M (default) | ~6 GB | RTX 3060 / 4060 (8 GB) | Sweet spot — what Ollama ships. |
| Q3_K_S / Q2_K | ~4 GB | RTX 3050 / mobile 4050 | Quality drops noticeably; only if you must. |

Numbers are for model weights only. Long-context inference adds KV-cache cost that grows linearly with the number of cached tokens (and with layer count, KV-head count, and head dimension): at 32K expect another 1–2 GB in typical use with a quantized or partially filled cache, and roughly 4–5 GB if the full 32K window fills at fp16. For 131K-token YaRN runs, plan on 24 GB+ even at Q4_K_M.
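To sanity-check the cache math for your own context length, here is a back-of-the-envelope sketch. The 36-layer / 8-KV-head figures come from the model card above; the head dimension of 128 and the per-element byte counts for quantized caches are stated assumptions, not measured values:

layers, kv_heads, head_dim = 36, 8, 128   # Qwen3-8B GQA config (head_dim assumed 128)
bytes_per_elem = 2                        # fp16 cache; roughly 1 for q8_0, 0.5 for q4_0

def kv_cache_gb(tokens: int) -> float:
    # K and V are both cached, hence the factor of 2
    return layers * kv_heads * head_dim * 2 * bytes_per_elem * tokens / 1e9

for n in (8_192, 32_768, 131_072):
    print(f"{n:>7} tokens -> ~{kv_cache_gb(n):.1f} GB")
# ~1.2 GB at 8K, ~4.8 GB at 32K, ~19.3 GB at 131K (fp16)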

CPU-only: works at Q4_K_M with 16 GB RAM; expect ~3–6 tok/s on a modern Ryzen 7 / Core i7. Fine for chat, painful for code generation.

Software

  • Windows 10 22H2 or Windows 11, 64-bit (Home or Pro both fine)
  • NVIDIA driver 452.39 or newer for CUDA, or current AMD driver for ROCm
  • Ollama 0.22.0+ (recommended) or a recent llama.cpp build (b5092+ for Qwen3 support; b8981 is current as of April 2026)
  • ~10 GB free disk for the Q4_K_M GGUF and Ollama runtime
  • Optional: Docker Desktop if you want a web UI like Open WebUI

Route 1 — Ollama (the easy path)

This is the right path if you want to be talking to Qwen3-8B in under five minutes.

1. Install Ollama on Windows

  1. Download OllamaSetup.exe from https://ollama.com/download/windows.
  2. Run the installer. No Administrator prompt — it installs per-user under %LOCALAPPDATA%\Programs\Ollama and adds itself to PATH.
  3. Open a new PowerShell window (so the updated PATH is picked up) and verify:
ollama --version

Optional but recommended: relocate model storage off the system drive. Models can be 5–60 GB each, and the default location is your user profile.

  1. Open Settings → System → About → Advanced system settings → Environment Variables.
  2. Add a user variable OLLAMA_MODELS pointing at e.g. D:\ollama-models.
  3. Restart Ollama (right-click the tray icon → Quit → re-launch).
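If you prefer the terminal, the same user-scoped variable can be set from PowerShell; the D:\ollama-models path below is just an example, point it wherever you like:

# Persist a user-level OLLAMA_MODELS variable (no admin rights needed)
[Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama-models", "User")

New Ollama processes started afterwards pick up the variable; the quit/re-launch from step 3 still applies to the already-running tray app.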

2. Pull and run Qwen3-8B

ollama run qwen3:8b

First run downloads ~5.2 GB (Q4_K_M). Subsequent runs start instantly. You'll get an interactive REPL.

Quick smoke test of the thinking-mode toggle:

echo "Estimate the number of grains of sand on Earth. /think" | ollama run qwen3:8b
echo "Estimate the number of grains of sand on Earth. /no_think" | ollama run qwen3:8b

The first emits a <think>…</think> block before answering. The second goes straight to a number.

3. Use it from your own code

Ollama listens on http://localhost:11434 and exposes both its native /api routes and an OpenAI-compatible /v1 endpoint. From Python, using the native chat route:

import requests
r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Write a SQL CTE for cumulative weekly revenue."}],
        "stream": False,
    },
    timeout=300,
)
print(r.json()["message"]["content"])

Any OpenAI-SDK-compatible client also works by pointing base_url at http://localhost:11434/v1.
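For example, with the official openai Python package; the api_key value is a placeholder, since Ollama ignores it for local requests:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally
resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Summarise the CAP theorem in two sentences."}],
)
print(resp.choices[0].message.content)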

Route 2 — llama.cpp (the advanced path)

Use this when you want CPU-only inference, custom quantization, fine-grained KV-cache control, or speculative decoding. Qwen3 has been supported since build b5092; b8981 (April 29, 2026) is current.

Option A: prebuilt Windows binary

The llama.cpp release page publishes Windows builds with CUDA, Vulkan, and CPU variants. Grab llama-bXXXX-bin-win-cuda-x64.zip (or the vulkan one for AMD/Intel GPUs), unzip, and you have llama-cli.exe, llama-server.exe, and friends ready to go.

Option B: build from source

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

You'll need the CUDA Toolkit (12.x recommended) and Visual Studio 2022 Build Tools. Replace -DGGML_CUDA=ON with -DGGML_VULKAN=ON for AMD/Intel/Vulkan, or omit for CPU-only.
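With the default Visual Studio generator the binaries land under build\bin\Release. A quick sanity check that the build completed (flag behaviour as of recent builds; adjust if yours differs):

build\bin\Release\llama-cli.exe --version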

Download a GGUF

The official Qwen GGUFs live at Qwen/Qwen3-8B-GGUF on Hugging Face. Unsloth's quants (unsloth/Qwen3-8B-GGUF) are also widely used — their dynamic Q4_K_XL often beats vanilla Q4_K_M at similar size.

pip install huggingface_hub hf_transfer
$env:HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download Qwen/Qwen3-8B-GGUF Qwen3-8B-Q4_K_M.gguf --local-dir .

Run it

llama-cli.exe -m Qwen3-8B-Q4_K_M.gguf ^
  --ctx-size 32768 ^
  --n-gpu-layers 99 ^
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 ^
  --jinja ^
  -p "Explain how a B-tree index speeds up lookups."

Recommended sampling parameters from the model card: temp=0.6, top_p=0.95, top_k=20, min_p=0 for thinking mode; temp=0.7, top_p=0.8 for non-thinking. Do not use greedy decoding — the model card calls this out explicitly.
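If you end up serving the same model through Ollama and want these defaults baked in rather than passed per request, a minimal Modelfile sketch looks like this (the qwen3-tuned name below is arbitrary, and min_p support depends on your Ollama version):

FROM qwen3:8b
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0

Register it with ollama create qwen3-tuned -f Modelfile, then ollama run qwen3-tuned as usual.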

To force non-thinking globally:

llama-cli.exe -m Qwen3-8B-Q4_K_M.gguf --jinja --chat-template-kwargs "{\"enable_thinking\": false}"

For a server with an OpenAI-compatible API:

llama-server.exe -m Qwen3-8B-Q4_K_M.gguf -c 32768 -ngl 99 --host 0.0.0.0 --port 8080
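Once it's up, any OpenAI-style client can talk to it. A minimal Python smoke test against the endpoint started above (llama-server serves a single model, so the model field is mostly informational):

import requests

r = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "qwen3-8b",
        "messages": [{"role": "user", "content": "Say hello in three languages."}],
    },
    timeout=120,
)
print(r.json()["choices"][0]["message"]["content"])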

Optional: Open WebUI on Docker

If you want a ChatGPT-style browser interface, Open WebUI talks to Ollama natively:

docker run -d -p 3000:8080 ^
  --add-host=host.docker.internal:host-gateway ^
  -v open-webui:/app/backend/data ^
  --name open-webui --restart always ^
  ghcr.io/open-webui/open-webui:main

Browse to http://localhost:3000, complete the one-time admin setup, and Qwen3-8B appears in the model dropdown automatically.

Performance — what you can expect on Windows

Numbers below are Q4_K_M, llama.cpp / Ollama, single-user decode (text generation, not prompt prefill). Sources are community llama.cpp benchmarks aggregated through awesomeagents.ai's home-GPU leaderboard and Hardware Corner runs; verify against your own setup before quoting them as gospel.

| GPU | VRAM | Qwen3-8B Q4_K_M tok/s | Notes |
| --- | --- | --- | --- |
| RTX 5090 | 32 GB | ~140–160 | Headroom for full-precision and 131K context. |
| RTX 4090 | 24 GB | ~104 | Reference high-end consumer card. |
| RTX 4070 Ti / 4070 Super | 12–16 GB | ~52–70 | Best value for this model class. |
| RTX 4060 Ti 16 GB | 16 GB | ~30–40 | Bandwidth-limited, not VRAM-limited. |
| RTX 3060 12 GB | 12 GB | ~25–35 | Cheapest "comfortable" option. |
| Ryzen 7 7700X (CPU only) | — | ~3–6 | Usable for chat, slow for code. |

Quality-wise, Qwen3-8B holds up well against same-size peers. In non-thinking mode it is roughly on par with or ahead of Llama 3.1 8B Instruct on MMLU and MMLU-Pro, and in thinking mode it is clearly ahead on math (e.g. AIME-25 at 81.5 per the official model card). For pure code (HumanEval), Llama-3.1-8B is competitive, but Qwen3-8B's tool-calling and multilingual skills typically tip the scales for agentic work.

How to choose: Qwen3-8B or something else?

  • You have 6–8 GB VRAM. Qwen3-8B at Q4_K_M. Don't bother with the 3.5/3.6 family — they don't have an 8B-class member.
  • You have 12–16 GB VRAM. Run Qwen3-8B at Q8_0 for near-lossless output, or step up to Qwen3-14B at Q4_K_M.
  • You have 24 GB+ VRAM and want vision. Qwen3.5-27B (Feb 2026) or Qwen3.6-27B (Apr 2026). Both are dense 27B vision-language models with 262K native context, far ahead of 8B on multimodal tasks. Caveat: Ollama's GGUF loader doesn't yet handle the mmproj sidecar — use llama.cpp directly.
  • You want pure code. Qwen3-Coder-Next (released as a coder-specialised checkpoint) outperforms general Qwen3-8B on coding benchmarks at similar size.
  • You're CPU-only on a ThinkPad. Stay at Q4_K_M and don't expect more than ~5 tok/s. Phi-3-Mini or Qwen3-4B are faster if quality at that size is acceptable.

Long context (YaRN to 131K)

32K is enough for most documents. For larger inputs, enable YaRN scaling. With llama.cpp:

llama-cli.exe -m Qwen3-8B-Q4_K_M.gguf ^
  --rope-scaling yarn ^
  --rope-scale 4.0 ^
  --yarn-orig-ctx 32768 ^
  --ctx-size 131072 ^
  -ngl 99

Two warnings: (1) KV cache at 131K balloons memory: roughly 19 GB at fp16, and still on the order of 5–10 GB with q4_0–q8_0 cache quantization, so budget hardware accordingly. (2) Quality on tasks shorter than 32K can degrade slightly when YaRN is enabled, so don't turn it on globally; enable it only when you actually need long context.
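One way to claw back memory on long-context runs is to quantize the KV cache. A sketch combining the YaRN flags above with q8_0 cache types (flag names as of recent llama.cpp builds; -fa enables FlashAttention, which quantizing the V cache requires, and on the newest builds it may already default on or take an on/off argument):

llama-cli.exe -m Qwen3-8B-Q4_K_M.gguf ^
  --rope-scaling yarn --rope-scale 4.0 --yarn-orig-ctx 32768 ^
  --ctx-size 131072 -ngl 99 -fa ^
  --cache-type-k q8_0 --cache-type-v q8_0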

Fine-tuning locally

Unsloth has the most painless Windows-friendly path, with documented Qwen3-8B recipes that run inside WSL2 or native Windows Python. The headline figures: a Q4 LoRA fine-tune on Qwen3-8B fits in ~12 GB VRAM at sequence length 4096, and trains roughly 2x faster than vanilla HF Transformers because of Unsloth's hand-tuned kernels.

Workflow in brief (a minimal code sketch follows the list):

  1. pip install unsloth (Python 3.10–3.12, CUDA 12.x).
  2. Load unsloth/Qwen3-8B-unsloth-bnb-4bit as your base.
  3. Wrap with FastLanguageModel.get_peft_model and an SFT or DPO trainer.
  4. Export back to GGUF for Ollama / llama.cpp consumption.
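A compressed sketch of steps 1–4 under the stated assumptions: the LoRA rank and target modules below are common defaults rather than anything Unsloth mandates, and the trainer step is elided (see Unsloth's Qwen3 notebooks for a full recipe):

from unsloth import FastLanguageModel

# 1. Load the 4-bit base (fits in ~12 GB VRAM at seq len 4096 per Unsloth's figures)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# 2. Wrap with LoRA adapters (rank and target modules are placeholder defaults)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# 3. Train with an SFT or DPO trainer from TRL (omitted here)

# 4. Export back to GGUF for Ollama / llama.cpp
model.save_pretrained_gguf("qwen3-8b-finetuned", tokenizer, quantization_method="q4_k_m")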

For real production fine-tunes that need supervision and dataset curation, hiring a dedicated ML engineer who knows the Qwen / Unsloth stack typically beats a DIY effort — Codersera regularly places vetted remote developers with this exact skill set.

Common pitfalls and troubleshooting

  • "Out of memory" on a card that should fit. KV cache at 32K is the silent killer. Drop --ctx-size to 8192 first; if that frees space you've identified the culprit. Then enable KV-cache quantization with --cache-type-k q8_0 --cache-type-v q8_0.
  • Output starts with <think> when you didn't want it. You're in thinking mode by default. Add /no_think to the prompt or pass --chat-template-kwargs '{"enable_thinking": false}'.
  • Slow generation despite a beefy GPU. Confirm -ngl 99 (all layers on GPU) and check nvidia-smi shows the process consuming VRAM. If layers are spilling to CPU you'll see ~10x slowdown.
  • Ollama "model not found" after pull. Almost always an OLLAMA_MODELS env var that points at a path the tray app can't read. Run the tray app as the same user who set the env var.
  • Garbage output / repeating tokens. You're using greedy decoding. The model card explicitly warns against this. Use the recommended sampling above.
  • Qwen3.5/3.6 GGUF won't load in Ollama. Known limitation: Ollama's loader doesn't yet handle the mmproj vision sidecar. Run those models with llama.cpp directly until support lands.
  • Windows Defender flags llama-cli.exe. Common false positive on freshly built binaries. Whitelist the build directory.

What was removed and why

  • "Qwen 3.5-9B" references. No such model exists. The Qwen3.5 family starts at 27B (dense vision-language). If a third-party guide tells you to ollama pull qwen3.5:9b, that's referring to a community-quantized Qwen3-9B-style mirror or is simply wrong.
  • Hand-edited reference numerals (the 1, 2, 3… in the original). They pointed at non-resolving anchors. Replaced with proper inline links and a consolidated reference list at the end.
  • The vague "install llama.cpp" step. Replaced with a concrete cmake -DGGML_CUDA=ON path and a pointer to the prebuilt Windows binaries on the GitHub releases page.

FAQ

Is Qwen3-8B good enough for production agent work?

For internal tools, RAG over your own documents, and tool-calling agents — yes. For customer-facing assistants where every answer must be defensible, you'll want a frontier model (GPT-5.5 / Claude 4.7 / Gemini 2.5 Pro) as a fallback for hard cases and Qwen3-8B as the cheap default.

How does Qwen3-8B compare to Qwen3.5/3.6 in April 2026?

The 3.5 and 3.6 families are 27B+ dense vision-language models. They beat Qwen3-8B substantially on multimodal benchmarks and on hard reasoning, but they need a 24 GB GPU and don't have an 8B-class sibling. For 8 GB / 16 GB cards, Qwen3-8B is still the right answer.

Can I run it on a Mac?

Yes — Ollama works identically on Apple Silicon and the M-series GPUs are excellent at Q4. See our companion guide on running Qwen3-8B on Mac.

What about AMD GPUs on Windows?

Ollama 0.22.0 ships ROCm support for RDNA3 (RX 7900 XT/XTX) and newer. For older cards or Intel Arc, build llama.cpp with -DGGML_VULKAN=ON — the Vulkan backend is mature and routinely within 10–20% of CUDA on equivalent silicon.

Does it support function calling / MCP?

Yes. Qwen3-8B has native tool-calling and works cleanly with MCP servers via Ollama or llama.cpp's OpenAI-compatible endpoint. Many users wire it to a local MCP host (filesystem, browser, shell) for agent workflows.
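As a sketch, here is what a tool definition looks like against Ollama's native chat route; the get_weather tool is a made-up example, and a real MCP host would generate these schemas for you:

import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": tools,
        "stream": False,
    },
    timeout=120,
)
# If the model decides to use the tool, the call appears under message.tool_calls
print(r.json()["message"].get("tool_calls"))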

Will my private data leak?

No. Once the GGUF is on disk, no network calls are made for inference. You can yank your Ethernet cable. This is one of the strongest reasons to run locally instead of through an API.

Should I use the GGUF from Qwen or from Unsloth?

Unsloth's dynamic Q4_K_XL quants are typically a hair sharper than vanilla Q4_K_M at the same file size. If you don't have a strong opinion, take Unsloth's; if you want exact reproducibility against the official model card, take the Qwen org's quants.

How do I keep up with newer models?

The pillar guide on OpenClaw + Ollama for local AI agents tracks the broader ecosystem and is updated more frequently than per-model walkthroughs.

References & further reading

  1. Qwen/Qwen3-8B — official Hugging Face model card (parameter counts, recommended sampling, thinking-mode docs).
  2. Qwen/Qwen3-8B-GGUF — official GGUF quants for llama.cpp / Ollama.
  3. QwenLM/Qwen3 GitHub repo — reference code and inference examples.
  4. Ollama releases — current Windows installer (v0.22.0, Apr 28 2026).
  5. Ollama Windows install guide — official docs for OS / driver / env vars.
  6. ggml-org/llama.cpp releases — prebuilt Windows CUDA / Vulkan binaries (b8981, Apr 29 2026).
  7. Unsloth — Qwen3 run and fine-tune guide — Q4_K_XL quants and LoRA workflow.
  8. Qwen blog — "Qwen3: Think Deeper, Act Faster" — official launch post and benchmark tables.
  9. r/LocalLLaMA — community benchmarks, hardware threads, and bug reports (search "Qwen3 8B Windows" for current threads).
  10. Qwen/Qwen3.6-27B — for readers comparing against the newer 27B family.

If you're building an internal agent stack on top of Qwen3-8B (or any local model) and need engineers who have actually shipped this in production, Codersera matches teams with vetted remote developers across ML, infra, and full-stack — usually within a week.