Run Qwen3-8B on Ubuntu: 2026 Setup Guide (Ollama, vLLM, llama.cpp)
Last updated April 2026 — refreshed for current model and tool versions, including Qwen3-8B's hybrid thinking mode, the Qwen3-2507 update line, and Qwen 3.5 / 3.6 as newer alternatives.
Qwen3-8B is Alibaba's 8.2B-parameter dense LLM with hybrid thinking / non-thinking modes, a 32,768-token native context (extensible to 131,072 with YaRN), and full open weights on Hugging Face under Apache-2.0. This guide walks through getting it running on Ubuntu 22.04 / 24.04 with Ollama, vLLM, llama.cpp, and Hugging Face Transformers, with the exact commands, hardware floors, and 2026-current alternatives so you don't waste a download.
What changed since the original 2025 post
- Since Qwen3-8B's release on 29 Apr 2025, Qwen3-Instruct-2507 and Qwen3-Thinking-2507 refresh checkpoints (Jul–Aug 2025) have shipped for the 4B / 30B-A3B / 235B sibling sizes. The 8B itself remains the original April 2025 weights, but the deployment tooling around it has matured.
- Hybrid reasoning is now standard: a single Qwen3-8B checkpoint switches between <think>...</think> chain-of-thought output and a fast non-thinking mode via enable_thinking or per-turn /think and /no_think prompt tags.
- Newer generations exist: Qwen 3.5 (16 Feb 2026, 27B and up) and Qwen 3.6 (Apr 2026, 27B and 35B-A3B MoE) have surpassed Qwen3 on most leaderboards. There is no Qwen 3.5 8B: if your hardware is sized for 8B, you stay on Qwen3-8B or move sideways to Qwen3-4B-Thinking-2507.
- Tooling minimums have moved: transformers ≥ 4.51.0, vllm ≥ 0.9.0 (with the dedicated qwen3 reasoning parser), sglang ≥ 0.4.6.post1, llama.cpp ≥ b5401, ollama ≥ 0.9.0.
- Greedy decoding is broken in thinking mode: use the recommended sampling block (temperature = 0.6, top_p = 0.95, top_k = 20, min_p = 0) or the model loops on itself.
TL;DR
| Question | Answer |
|---|---|
| Smallest GPU that runs Qwen3-8B comfortably? | 16 GB VRAM (RTX 4060 Ti 16 GB, RTX 5060 Ti 16 GB, RTX 3090, A4000) is the floor for BF16 at batch 1 with short contexts, since the weights alone are ~16 GB; Q4_K_M GGUF runs on 8–10 GB VRAM. |
| CPU-only? | Yes — llama.cpp + Q4_K_M GGUF on a 16 GB RAM box. Expect 5–15 tok/s depending on RAM bandwidth. |
| Easiest setup? | curl -fsSL https://ollama.com/install.sh \| sh → ollama run qwen3:8b. |
| Highest-throughput serving? | vLLM 0.9+ with --reasoning-parser qwen3, OpenAI-compatible on port 8000. |
| Should I use Qwen3-8B or Qwen 3.5? | Qwen 3.5 starts at 27B. If you have < 24 GB VRAM, Qwen3-8B is the right tier. With 24–48 GB, look at Qwen 3.5 27B or Qwen 3.6 35B-A3B MoE. |
What Qwen3-8B actually is
Qwen3-8B is a decoder-only transformer with 8.2B total parameters (6.95B non-embedding), 36 layers, Grouped Query Attention (32 query heads, 8 key/value heads), qk-layernorm for training stability, and a 32,768-token native context window. Released alongside the rest of the Qwen3 family on 29 April 2025, it was the first open-weight 8B class model to ship with a true hybrid-reasoning chat template — the same checkpoint generates <think>...</think> chain-of-thought when enable_thinking=True and skips it when False. Multilingual coverage is over 100 languages.
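Because the chain-of-thought and the final answer arrive in one decoded string, downstream code usually has to separate them. The following is a minimal Python sketch, assuming the decoded text contains at most one <think>...</think> block at the start; the helper name split_thinking is hypothetical and not part of any Qwen tooling.

```python
def split_thinking(decoded: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoded Qwen3 completion.

    Assumes at most one <think>...</think> block at the start of the text.
    If no closing tag is found, the whole text is treated as the answer.
    """
    close_tag = "</think>"
    head, sep, tail = decoded.partition(close_tag)
    if not sep:
        # Non-thinking mode (or a truncated generation): nothing to strip.
        return "", decoded.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking(
    "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
)
print(answer)  # The answer is 4.
```

A string split like this is enough for logging or display; servers that already separate the two fields make it unnecessary.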
Reported scores (from the Qwen3 technical report, arXiv 2505.09388):
| Benchmark | Qwen3-8B (thinking, on-policy distillation) |
|---|---|
| MMLU-Redux | 88.3 |
| GPQA-Diamond | 63.3 |
| AIME 2024 | 74.4 (pass@64: 93.3) |
| AIME 2025 | 65.5 (pass@64: 86.7) |
| LiveCodeBench v5 | 60.3 |
That puts an 8B-class open model in striking range of much larger 2024 closed models on math and reasoning — which is why Qwen3 broadly dethroned Llama as the default starting point on r/LocalLLaMA through 2025.
Ubuntu hardware floor (2026 reality check)
- OS: Ubuntu 22.04 LTS or 24.04 LTS. Older 20.04 still works but pre-built CUDA 12.x wheels for PyTorch 2.5+ assume a newer glibc.
- CPU: Any x86_64 with AVX2; AVX-512 helps llama.cpp meaningfully.
- RAM: 16 GB minimum, 32 GB recommended (you'll want headroom for OS + tokenizer + KV cache).
- GPU (optional but worth it):
  - BF16 / FP16 full precision: 16 GB VRAM minimum (RTX 4060 Ti 16 GB, RTX 5060 Ti 16 GB, RTX 3090 24 GB, RTX 4090 24 GB, RTX 5090 32 GB, A4000/A5000/A6000). The weights alone are ~16 GB, so 16 GB cards leave very little room for KV cache.
  - FP8 / INT8 quantized: 10–12 GB VRAM.
  - Q4_K_M / Q5_K_M GGUF: 8–10 GB VRAM, or CPU-only with 16 GB RAM.
- Disk: The full BF16 checkpoint is ~16 GB on Hugging Face; Ollama's default qwen3:8b tag is 5.2 GB (Q4_K_M); GGUF Q8_0 is ~8.7 GB. Budget 25 GB free for headroom.
- NVIDIA driver / CUDA: Driver 550+ and CUDA 12.4+ for the latest PyTorch / vLLM wheels.
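The VRAM and disk figures above follow from simple arithmetic: parameter count times bits stored per weight. Here is a rough Python sketch of that estimate, assuming the 8.2B parameter count quoted earlier; the function name weights_gb is hypothetical, and the result covers weights only, so KV cache and activations come on top.

```python
PARAMS = 8.2e9  # total parameter count quoted for Qwen3-8B

# Bits stored per weight. BF16 and FP8 are exact; Q8_0 packs 32 int8 weights
# plus one 16-bit scale per block, which works out to 8.5 bits per weight.
BITS_PER_WEIGHT = {"BF16": 16.0, "FP8": 8.0, "Q8_0": 8.5}


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weights_gb(PARAMS, bits):.1f} GB")
```

The BF16 and Q8_0 results land close to the ~16 GB and ~8.7 GB file sizes listed in the disk bullet.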
Method 1 — Ollama (5 minutes, beginner-friendly)
Ollama 0.9+ ships native Qwen3 support with the right chat template and reasoning-tag handling.
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama pull qwen3:8b
ollama run qwen3:8b
The pull command lands a 5.2 GB Q4_K_M quantization. If you want less compression, pull a heavier tag:
ollama pull qwen3:8b-q8_0 # ~8.7 GB, near-lossless
ollama pull qwen3:8b-fp16 # ~16 GB, full precision
Per-turn reasoning control with the soft switch:
/think Write a proof that there are infinitely many primes.
/no_think What time is it in Tokyo?
Ollama exposes an OpenAI-compatible API on http://localhost:11434/v1/; point any OpenAI SDK at that base URL with model name qwen3:8b.
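For scripts that should not depend on an SDK, the same endpoint can be called with the Python standard library. This is a minimal sketch, assuming Ollama is listening on its default port and that the chat-completions path sits under the /v1/ base URL mentioned above; build_request and ask are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_request(prompt: str, thinking: bool = True) -> urllib.request.Request:
    """Build a chat-completions request for a local Ollama server."""
    # The soft switch is plain text in the user turn, so it travels over any API.
    tag = "/think" if thinking else "/no_think"
    payload = {
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": f"{tag} {prompt}"}],
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, thinking: bool = True) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_request(prompt, thinking)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (needs a running Ollama server with qwen3:8b pulled):
# print(ask("What is the capital of Japan?", thinking=False))
```

Keeping the request builder separate from the network call makes the payload easy to inspect or unit-test without a server.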
Method 2 — vLLM (production serving)
vLLM 0.9+ added a dedicated qwen3 reasoning parser. This is the right choice when you need batch throughput, paged-attention KV reuse, multi-GPU tensor parallelism, or OpenAI-compatible streaming for many concurrent users.
python3 -m venv ~/.venvs/qwen3 && source ~/.venvs/qwen3/bin/activate
pip install -U "vllm>=0.9.0"
vllm serve Qwen/Qwen3-8B \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
For long-context jobs, enable YaRN extrapolation to 131,072 tokens by passing the rope-scaling block in --rope-scaling or editing config.json. Do not turn YaRN on for short contexts, because it degrades quality below the 32K native window.
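The rope-scaling block is a small JSON object, and the extended window is just the native context multiplied by the YaRN factor. The Python sketch below assembles both, assuming the rope_type / factor / original_max_position_embeddings key names from the Qwen3 model card convention and the factor of 4.0 recommended later in this guide. The helper names are hypothetical, and the exact flag spelling should be checked against your vLLM version.

```python
import json
import shlex

NATIVE_CONTEXT = 32768  # Qwen3-8B native window
YARN_FACTOR = 4.0       # recommended scaling factor


def yarn_rope_scaling(factor: float = YARN_FACTOR) -> dict:
    """Rope-scaling block for YaRN (key names assumed from the model card)."""
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }


def vllm_command(factor: float = YARN_FACTOR) -> list[str]:
    """Argument list for a long-context vLLM server."""
    max_len = int(NATIVE_CONTEXT * factor)  # 32768 * 4.0 = 131072
    return [
        "vllm", "serve", "Qwen/Qwen3-8B",
        "--rope-scaling", json.dumps(yarn_rope_scaling(factor)),
        "--max-model-len", str(max_len),
    ]


# Print a copy-pasteable, correctly quoted shell command.
print(shlex.join(vllm_command()))
```

Generating the command this way avoids hand-quoting JSON in the shell, which is the most common source of errors with this flag.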
Multi-GPU tensor parallel example:
# 65536 is above the 32K native window, so pair this with the YaRN rope-scaling settings described above
vllm serve Qwen/Qwen3-8B \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --quantization fp8 \
  --max-model-len 65536
Sample call:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"messages": [{"role":"user","content":"Plan a 3-step refactor of a Django ORM query."}],
"temperature": 0.6, "top_p": 0.95, "top_k": 20
}'
Method 3 — Hugging Face Transformers (research / fine-tune)
Use this when you need full control of generation, custom tool-calling logic, or you're doing LoRA / QLoRA fine-tuning. Requires transformers ≥ 4.51.0.
sudo apt update && sudo apt install -y python3 python3-pip python3-venv git
python3 -m venv ~/.venvs/qwen3-hf && source ~/.venvs/qwen3-hf/bin/activate
pip install -U torch --index-url https://download.pytorch.org/whl/cu124
pip install -U "transformers>=4.51.0" accelerate bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
messages = [{"role": "user", "content": "Explain GQA vs MQA in two sentences."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # toggle to False for fast mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=32768,
    temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))

For 16 GB VRAM, load 4-bit quantized:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# NF4 weights with BF16 compute; model_name is defined in the snippet above
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="bfloat16",
                         bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb,
                                             device_map="auto")
Method 4 — llama.cpp (CPU, Apple Silicon, edge)
llama.cpp from build b5401 onwards has a Qwen3-aware Jinja chat template and parses <think> blocks correctly. The Qwen team publishes official GGUFs at Qwen/Qwen3-8B-GGUF in Q2_K through Q8_0.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # or -DGGML_VULKAN=ON / -DGGML_METAL=ON
cmake --build build --config Release -j
# Pull the Q8_0 GGUF straight from the official repo
./build/bin/llama-cli \
-hf Qwen/Qwen3-8B-GGUF:Q8_0 \
--jinja --color -ngl 99 -fa \
  -p "Summarize the differences between vLLM and SGLang."
Flags worth knowing: -ngl 99 offloads all layers to GPU, -fa turns on FlashAttention (some recent builds expect an explicit value, as in -fa on), --jinja activates the bundled Qwen3 chat template (mandatory: without it the model rambles), and -c 32768 sets the context length.
How to choose: deployment-method decision tree
| Method | Ease | Throughput | Flexibility | Pick when |
|---|---|---|---|---|
| Ollama 0.9+ | ★★★★★ | ★★★★☆ | ★★★☆☆ | You want a working chat in 5 minutes, single user. |
| vLLM 0.9+ | ★★★☆☆ | ★★★★★ | ★★★★☆ | Multi-user serving, batching, OpenAI API parity. |
| Transformers | ★★★☆☆ | ★★☆☆☆ | ★★★★★ | Fine-tuning, custom decoding, tool-calling experiments. |
| llama.cpp / GGUF | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | CPU-only, Apple Silicon, AMD via Vulkan, edge devices. |
| SGLang ≥ 0.4.6.post1 | ★★★☆☆ | ★★★★★ | ★★★★☆ | Structured output, RadixAttention prefix sharing. |
If you're standing up a local-AI agent stack rather than just a chat box, the recommendation pattern in our OpenClaw + Ollama setup guide for running local AI agents uses Ollama as the inference layer with Qwen3-8B as the default reasoning model — it's the cleanest path for tool-using agents on a single workstation.
Performance numbers you can expect (2026 hardware)
Indicative single-stream tokens-per-second from community llama.cpp / vLLM benchmarks on Qwen3-8B; numbers vary with prompt length and quantization, so treat as ballparks rather than promises.
| Hardware | Backend | Quant | Tok/s (decode) |
|---|---|---|---|
| RTX 5090 32 GB | vLLM | BF16 | ~150–180 |
| RTX 4090 24 GB | vLLM | BF16 | ~110–140 |
| RTX 3090 24 GB | vLLM | BF16 | ~70–95 |
| RTX 4060 Ti 16 GB | llama.cpp | Q4_K_M | ~55–70 |
| Apple M4 Max | llama.cpp Metal | Q4_K_M | ~45–55 |
| Ryzen 9 7950X CPU only | llama.cpp AVX-512 | Q4_K_M | ~10–14 |
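Production batching with vLLM at --max-num-seqs 64 typically multiplies aggregate throughput 3–6× over single-stream, because continuous batching decodes many sequences per forward pass and paged attention keeps their KV caches tightly packed in memory.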
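To turn the decode rates in the table into wall-clock expectations, divide the number of generated tokens by tokens per second. The Python sketch below does that with the RTX 4090 / vLLM / BF16 range from the table (~110–140 tok/s). The token budgets are illustrative assumptions, decode_seconds is a hypothetical helper, and prompt processing time is ignored.

```python
def decode_seconds(new_tokens: int, tokens_per_second: float) -> float:
    """Time to generate new_tokens at a steady single-stream decode rate."""
    return new_tokens / tokens_per_second


# Illustrative output budgets; 32768 matches max_new_tokens used in this guide.
budgets = [("short answer", 512), ("long reasoning", 8192), ("max budget", 32768)]

for label, tokens in budgets:
    fast = decode_seconds(tokens, 140)  # upper end of the table range
    slow = decode_seconds(tokens, 110)  # lower end of the table range
    print(f"{label} ({tokens} tokens): {fast:.0f}-{slow:.0f} s")
```

At these rates a generation that uses the full 32,768-token budget takes roughly four to five minutes, which is worth knowing before choosing client timeouts.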
Common pitfalls and troubleshooting
- Greedy decoding ruins thinking mode. If outputs loop ("the the the…") or never close </think>, you forgot to set sampling. Use temperature=0.6, top_p=0.95, top_k=20, min_p=0 in thinking mode and temperature=0.7, top_p=0.8, top_k=20, min_p=0 in non-thinking mode.
- Old transformers. KeyError: 'qwen3' means you're below 4.51.0; upgrade.
- vLLM streams the reasoning into content. Pass --reasoning-parser qwen3 so the chain-of-thought lands in reasoning_content and the final answer in content. On vLLM 0.9+ the parser flag is enough; --enable-reasoning is only needed on older releases.
- OOM at long context. The KV cache for a single 32K-token sequence at BF16 is ~4.5 GiB by itself (36 layers × 8 KV heads × 128 head dim × 2 for keys and values × 2 bytes × 32,768 tokens). Drop to FP8 (--quantization fp8), or reduce --max-model-len, or use --cpu-offload-gb.
- YaRN hurts short-context quality. Only enable rope-scaling when you actually need > 32K input. The recommended factor is 4.0 with original_max_position_embeddings=32768.
- llama.cpp without --jinja. The model will ignore the system prompt and the soft-switch tags. Always pass --jinja.
- Driver or wheel mismatch. RuntimeError: CUDA error: no kernel image is available after a PyTorch upgrade usually means the installed wheel was not compiled for your GPU's compute capability, which is typical on very new or very old cards. Install a PyTorch build that targets your GPU, and keep the NVIDIA driver at 550+ rather than the 535 series.
- Ollama silently falls back to CPU. Run ollama ps and confirm the model shows 100% GPU; if not, your CUDA install isn't visible to the Ollama systemd unit. Run sudo systemctl edit ollama and add Environment="CUDA_VISIBLE_DEVICES=0".
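The long-context OOM case is easier to reason about with the KV-cache arithmetic written out. The Python sketch below estimates cache size per sequence from the architecture numbers given earlier (36 layers, 8 key/value heads). The head dimension of 128 is an assumption taken from the published model config rather than from this article, and kv_cache_bytes is a hypothetical helper.

```python
def kv_cache_bytes(tokens: int, layers: int = 36, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence.

    The leading factor of 2 covers the key and the value vector that each
    KV head stores per token. head_dim=128 is an assumed config value.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for ctx in (8192, 32768, 131072):
    gib = kv_cache_bytes(ctx) / 2**30
    # An 8-bit cache stores one byte per value, so it halves the footprint.
    print(f"{ctx:>6} tokens: {gib:.1f} GiB at BF16, {gib / 2:.1f} GiB at 8-bit")
```

With these values a single 32,768-token sequence needs about 4.5 GiB at BF16, and the cost scales linearly with both context length and the number of concurrent sequences.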
When to pick something other than Qwen3-8B in 2026
- Qwen3-4B-Thinking-2507 — Half the VRAM, surprisingly close on math/code; the right pick if you have 8–10 GB of VRAM and want full BF16.
- Qwen 3.5 27B (Feb 2026) — If you have 24–48 GB of VRAM, this is a meaningful step up across reasoning, multilingual, and multimodal tasks. Note: there is no Qwen 3.5 8B; the family starts at 27B dense.
- Qwen 3.6 35B-A3B MoE (Apr 2026) — Roughly Qwen 3.5 27B-class quality at lower active-parameter cost; great for repository-level coding workflows.
- Llama 4 8B — Closest direct rival in the 8B tier; still trails Qwen3-8B on math and Chinese, leads on some English-only QA.
- DeepSeek V4 / R-series — When you need pure reasoning and you have 80 GB+ of VRAM.
Serving Qwen3-8B as a real API (production notes)
Both Ollama and vLLM expose OpenAI-compatible endpoints, but the production deployment usually wants:
- A reverse proxy (Nginx or Caddy) terminating TLS and rate-limiting.
- An auth layer: neither tool ships one. Use a token-validating sidecar or front the endpoint with a small FastAPI wrapper.
- Observability: vLLM exports Prometheus metrics out of the box; scrape /metrics.
- A queue for long thinking-mode generations (max 32K tokens of CoT can take 60+ seconds).
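The token check at the heart of such an auth layer is small. Here is a minimal Python sketch of it, assuming clients send a standard Authorization: Bearer header and that a single shared token is acceptable for your deployment. The function name is_authorized is hypothetical, and wiring it into a proxy sidecar or web framework is left out.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header value."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the token were correct.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())


print(is_authorized("Bearer s3cret", "s3cret"))  # True
print(is_authorized("Bearer guess", "s3cret"))   # False
```

Reject the request with a 401 before it reaches the inference server, so unauthenticated traffic never occupies a generation slot.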
FAQ
Is Qwen3-8B free for commercial use?
Yes. Qwen3-8B is released under Apache 2.0 — both research and commercial use are permitted, including fine-tuning and redistribution.
Does Qwen3-8B run on AMD or Intel GPUs?
Yes. llama.cpp's Vulkan backend runs Qwen3-8B on RX 7000 / 9000-series Radeon and Arc B-series cards, and its ROCm backend covers supported Radeon GPUs. vLLM on AMD requires a ROCm 6.x build. Performance trails NVIDIA at the same TFLOPs class but is competitive on tokens-per-dollar.
What's the difference between Qwen3-8B and Qwen3-8B-Base?
The plain Qwen3-8B repo is the post-trained chat model with hybrid reasoning. Qwen3-8B-Base is the pre-trained-only checkpoint — useful as a fine-tuning starting point but not a usable chat model out of the box.
Can I run Qwen3-8B without internet after the first download?
Yes. Once the weights are cached (under ~/.cache/huggingface/hub/ for Transformers or ~/.ollama/models/ for Ollama) the model loads fully offline. Air-gapped deployments work — pre-download the GGUF or safetensors on a connected box and rsync them in.
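For Transformers-based loading, offline behaviour can be made explicit so that a missing file fails fast instead of triggering a network lookup. This is a small Python sketch, assuming the HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE environment variables honoured by the Hugging Face libraries; offline_env is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def offline_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with Hugging Face offline mode enabled."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: never contact the Hub
    env["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use cached files only
    return env


# Apply to the current process. Do this before importing transformers so the
# libraries read the flags at import time.
os.environ.update(offline_env())
```

The returned dictionary can also be passed as env= to subprocess calls that launch a separate inference process.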
Is Qwen3-8B safer than running a cloud API for sensitive data?
Local inference removes the data-egress risk entirely — prompts never leave your hardware. You're still responsible for the standard hygiene: disk encryption, OS hardening, network isolation, and not logging prompts to plaintext files.
How do I disable thinking mode for production latency?
Set enable_thinking=False in the chat template (Transformers / vLLM / SGLang) or send /no_think at the start of the user message in Ollama and llama.cpp. Non-thinking mode is roughly 3–5× faster end-to-end because there's no chain-of-thought to generate.
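On an OpenAI-compatible vLLM endpoint, the toggle and the matching sampling preset can travel in the same request body. The Python sketch below builds that body, assuming the server accepts a chat_template_kwargs field alongside the standard parameters and using the sampling values recommended in this guide; chat_payload is a hypothetical helper.

```python
import json

# Recommended sampling presets, keyed by whether thinking mode is on.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def chat_payload(prompt: str, thinking: bool) -> dict:
    """Request body for POST /v1/chat/completions with the mode set explicitly."""
    return {
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING[thinking],
        # Forwarded to the chat template (assumed server-side support).
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


print(json.dumps(chat_payload("Summarize GQA in one sentence.", thinking=False), indent=2))
```

Tying the sampling preset to the mode in one place avoids the common mistake of switching thinking off while keeping thinking-mode sampling values.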
Should I quantize?
For inference on consumer GPUs, yes — Q5_K_M or Q8_0 GGUF lose < 1% on most benchmarks versus BF16. For fine-tuning, stay in BF16 / FP16 and only quantize for serving.
References & further reading
- Qwen/Qwen3-8B model card on Hugging Face: official spec, sampling parameters, chat template.
- Qwen/Qwen3-8B-GGUF: official quantized GGUF releases.
- QwenLM/Qwen3 GitHub repo: deployment recipes, version requirements, examples.
- Qwen3: Think Deeper, Act Faster. Official launch blog (29 Apr 2025).
- Qwen3 Technical Report (arXiv 2505.09388): full architecture and benchmark detail.
- vLLM qwen3_reasoning_parser docs: required for correct <think> handling.
- Ollama releases: version-by-version Qwen3 support history.
- r/LocalLLaMA: practitioner discussion threads on Qwen3 deployment, quantization tradeoffs, and 2026 alternatives.