Run Qwen3-8B on Ubuntu: 2026 Setup Guide (Ollama, vLLM, llama.cpp)

Published 29 Apr 2025 • Updated 11 May 2026 • 10 min read

Quick answer. Run Qwen3-8B on Ubuntu via Ollama for a 5-minute setup, vLLM 0.9+ for production serving, or llama.cpp for GGUF flexibility. Hardware floor: 16 GB RAM and a GPU with 8 GB or more of VRAM (RTX 3060 or better). 4-bit quants cut VRAM to roughly 5-6 GB while keeping near-FP16 quality.

Last updated April 2026 — refreshed for current model and tool versions, including Qwen3-8B's hybrid thinking mode, the Qwen3-2507 update line, and Qwen 3.5 / 3.6 as newer alternatives.

Qwen3-8B is Alibaba's 8.2B-parameter dense LLM with hybrid thinking / non-thinking modes, a 32,768-token native context (extensible to 131,072 with YaRN), and full open weights on Hugging Face under Apache-2.0. This guide walks through getting it running on Ubuntu 22.04 / 24.04 with Ollama, vLLM, llama.cpp, and Hugging Face Transformers, with the exact commands, hardware floors, and 2026-current alternatives so you don't waste a download.

What changed since the original 2025 post

Qwen3-8B (released 29 Apr 2025) now has Qwen3-Instruct-2507 and Qwen3-Thinking-2507 refresh checkpoints (Jul–Aug 2025) for the 4B / 30B-A3B / 235B sibling sizes. The 8B itself remains the original April 2025 weights, but the deployment tooling around it has matured.
Hybrid reasoning is now standard: a single Qwen3-8B checkpoint switches between <think>...</think> chain-of-thought output and a fast non-thinking mode via enable_thinking or per-turn /think and /no_think prompt tags.
Newer generations exist: Qwen 3.5 (16 Feb 2026, 27B and up) and Qwen 3.6 (Apr 2026, 27B and 35B-A3B MoE) have surpassed Qwen3 on most leaderboards. There is no Qwen 3.5 8B; if your hardware is sized for 8B, you stay on Qwen3-8B or move sideways to Qwen3-4B-Thinking-2507.
Tooling minimums have moved: transformers ≥ 4.51.0, vllm ≥ 0.9.0 (with the dedicated qwen3 reasoning parser), sglang ≥ 0.4.6.post1, llama.cpp ≥ b5401, ollama ≥ 0.9.0.
Greedy decoding is broken in thinking mode: use the recommended sampling block (temperature = 0.6, top_p = 0.95, top_k = 20, min_p = 0) or the model loops on itself.

Want the full picture? Read our continuously-updated Qwen 3.5 Complete Guide (2026) — flavors, licensing, benchmarks, and on-device usage.

TL;DR

Question	Answer
Smallest GPU that runs Qwen3-8B comfortably?	16 GB VRAM (RTX 4060 Ti 16 GB, RTX 5060 Ti 16 GB, RTX 3090, A4000) at BF16 with batch 1; Q4_K_M GGUF runs on 8–10 GB VRAM.
CPU-only?	Yes — llama.cpp + Q4_K_M GGUF on a 16 GB RAM box. Expect 5–15 tok/s depending on RAM bandwidth.
Easiest setup?	`curl -fsSL https://ollama.com/install.sh | sh` → `ollama run qwen3:8b`.
Highest-throughput serving?	vLLM 0.9+ with `--reasoning-parser qwen3`, OpenAI-compatible on port 8000.
Should I use Qwen3-8B or Qwen 3.5?	Qwen 3.5 starts at 27B. If you have < 24 GB VRAM, Qwen3-8B is the right tier. With 24–48 GB, look at Qwen 3.5 27B or Qwen 3.6 35B-A3B MoE.

What Qwen3-8B actually is

Qwen3-8B is a decoder-only transformer with 8.2B total parameters (6.95B non-embedding), 36 layers, Grouped Query Attention (32 query heads, 8 key/value heads), qk-layernorm for training stability, and a 32,768-token native context window. Released alongside the rest of the Qwen3 family on 29 April 2025, it was the first open-weight 8B class model to ship with a true hybrid-reasoning chat template — the same checkpoint generates <think>...</think> chain-of-thought when enable_thinking=True and skips it when False. Multilingual coverage is over 100 languages.

Reported scores (from the Qwen3 technical report, arXiv 2505.09388):

Benchmark	Qwen3-8B (thinking, on-policy distillation)
MMLU-Redux	88.3
GPQA-Diamond	63.3
AIME 2024	74.4 (pass@64: 93.3)
AIME 2025	65.5 (pass@64: 86.7)
LiveCodeBench v5	60.3

That puts an 8B-class open model in striking range of much larger 2024 closed models on math and reasoning — which is why Qwen3 broadly dethroned Llama as the default starting point on r/LocalLLaMA through 2025.

Ubuntu hardware floor (2026 reality check)

OS: Ubuntu 22.04 LTS or 24.04 LTS. Older 20.04 still works but pre-built CUDA 12.x wheels for PyTorch 2.5+ assume a newer glibc.
CPU: Any x86_64 with AVX2; AVX-512 helps llama.cpp meaningfully.
RAM: 16 GB minimum, 32 GB recommended (you'll want headroom for OS + tokenizer + KV cache).
GPU (optional but worth it):
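- BF16 / FP16 full precision: 16 GB VRAM is the bare minimum, since the weights alone take about 15.3 GiB; 24 GB or more leaves real room for the KV cache (RTX 4060 Ti 16 GB, RTX 5060 Ti 16 GB, RTX 3090 24 GB, RTX 4090 24 GB, RTX 5090 32 GB, A4000/A5000/A6000).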
- FP8 / INT8 quantized: 10–12 GB VRAM.
- Q4_K_M / Q5_K_M GGUF: 8–10 GB VRAM, or CPU-only with 16 GB RAM.
Disk: The full BF16 checkpoint is ~16 GB on Hugging Face; Ollama's default qwen3:8b tag is 5.2 GB (Q4_K_M); GGUF Q8_0 is ~8.7 GB. Budget 25 GB free for headroom.
NVIDIA driver / CUDA: Driver 550+ and CUDA 12.4+ for the latest PyTorch / vLLM wheels.
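
The VRAM and disk figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below assumes the 8.2B parameter count stated in this guide; the bits-per-weight values for the GGUF formats (8.5 for Q8_0, roughly 4.85 for Q4_K_M) are approximations, and real files differ slightly because some tensors are kept at higher precision. It counts weights only; KV cache and runtime overhead come on top.

```python
# Rough weight-memory estimate: parameters x bits per weight.
# PARAMS comes from the guide; the bits-per-weight values are approximations.
PARAMS = 8.2e9

BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,     # assumed: 8-bit values plus per-block scales
    "Q4_K_M": 4.85,  # assumed average for this mixed-precision format
}


def weight_gb(params: float, bits: float) -> float:
    """Return the weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:7s} ~{weight_gb(PARAMS, bits):.1f} GB")
```

The BF16 result lines up with the ~16 GB checkpoint and the Q8_0 result with the ~8.7 GB figure quoted above; the Q4_K_M estimate lands a little under the 5.2 GB Ollama tag, as expected for an approximation.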

Method 1 — Ollama (5 minutes, beginner-friendly)

Ollama 0.9+ ships native Qwen3 support with the right chat template and reasoning-tag handling.

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama pull qwen3:8b
ollama run qwen3:8b

The ollama pull command lands a 5.2 GB Q4_K_M quantization. If you want less compression, pull a heavier tag:

ollama pull qwen3:8b-q8_0   # ~8.7 GB, near-lossless
ollama pull qwen3:8b-fp16   # ~16 GB, full precision

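Per-turn reasoning control with the soft switch. Put the tag at the end of the message, because the interactive prompt treats a line that starts with / as one of its own commands:

Write a proof that there are infinitely many primes. /think
What time is it in Tokyo? /no_think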

Ollama exposes an OpenAI-compatible API on http://localhost:11434/v1/; point any OpenAI SDK at that base URL with model name qwen3:8b.
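
If you would rather not install an SDK, the same endpoint can be reached with the Python standard library. This is a minimal sketch, not an official client: build_chat_request and chat are hypothetical helper names, the soft-switch tag is appended to the user message, and only temperature and top_p are sent because those are the sampling fields assumed to be accepted here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

# Sampling presets recommended in this guide, keyed by thinking mode.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95},   # thinking mode
    False: {"temperature": 0.7, "top_p": 0.8},   # non-thinking mode
}


def build_chat_request(prompt: str, thinking: bool = True,
                       model: str = "qwen3:8b") -> urllib.request.Request:
    """Build (but do not send) a chat-completions request."""
    tag = "/think" if thinking else "/no_think"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": f"{prompt} {tag}"}],
        **SAMPLING[thinking],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, thinking: bool = True) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, thinking)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("What time is it in Tokyo?", thinking=False) requires the Ollama service from the install step to be running; build_chat_request alone can be inspected offline.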

Method 2 — vLLM (production serving)

vLLM 0.9+ added a dedicated qwen3 reasoning parser. This is the right choice when you need batch throughput, paged-attention KV reuse, multi-GPU tensor parallelism, or OpenAI-compatible streaming for many concurrent users.

python3 -m venv ~/.venvs/qwen3 && source ~/.venvs/qwen3/bin/activate
pip install -U "vllm>=0.9.0"
vllm serve Qwen/Qwen3-8B \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

For long-context jobs, enable YaRN extrapolation to 131,072 tokens by passing the rope-scaling block in --rope-scaling or editing config.json. Do not turn YaRN on for short contexts: the static implementation applies the same scaling factor regardless of input length, so it can degrade quality on inputs that fit inside the native 32K window.
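
A small helper can decide whether YaRN is needed at all and produce the block to pass along. This is a sketch under assumptions: the key names follow the rope-scaling block published on the Qwen3 model card, the factor is taken as target length divided by the native window (a rule of thumb, not a tuned value), and yarn_rope_scaling is a hypothetical name.

```python
import json
from typing import Optional

NATIVE_CONTEXT = 32768  # native window stated in this guide
MAX_FACTOR = 4.0        # 4.0 x 32,768 = 131,072 tokens


def yarn_rope_scaling(target_len: int) -> Optional[dict]:
    """Return a rope-scaling block for target_len, or None if YaRN is not needed."""
    if target_len <= NATIVE_CONTEXT:
        return None  # short contexts: leave YaRN off
    factor = target_len / NATIVE_CONTEXT
    if factor > MAX_FACTOR:
        raise ValueError(f"{target_len} tokens exceeds the 131,072-token limit")
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }


if __name__ == "__main__":
    # Print a JSON string that can be pasted as the rope-scaling value.
    print(json.dumps(yarn_rope_scaling(131072)))
```

Returning None for anything at or below 32,768 tokens keeps the default configuration untouched, which matches the advice to leave YaRN off for short contexts.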

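Multi-GPU tensor parallel example (65,536 tokens exceeds the native window, so this one also enables YaRN with factor 2.0):

vllm serve Qwen/Qwen3-8B \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --quantization fp8 \
  --rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":32768}' \
  --max-model-len 65536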

Sample call:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role":"user","content":"Plan a 3-step refactor of a Django ORM query."}],
    "temperature": 0.6, "top_p": 0.95, "top_k": 20
  }'
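
With the reasoning parser active, the response carries the chain-of-thought and the final answer in separate fields. The sketch below shows one way to pull them apart on the client side. It assumes the reasoning_content field name described in this guide; split_reply is a hypothetical helper, and the sample dictionary is hand-written for illustration rather than captured server output.

```python
from typing import Tuple


def split_reply(response: dict) -> Tuple[str, str]:
    """Return (reasoning, answer) from a chat-completions response dict."""
    message = response["choices"][0]["message"]
    # reasoning_content can be missing or null when thinking is disabled.
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


# Hand-written sample for illustration, not real server output.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The user wants three steps.\n",
                "content": "\n1. Profile the query.",
            }
        }
    ]
}

if __name__ == "__main__":
    reasoning, answer = split_reply(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Keeping the two fields apart lets you log or discard the reasoning without touching the text shown to users.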

Method 3 — Hugging Face Transformers (research / fine-tune)

Use this when you need full control of generation, custom tool-calling logic, or you're doing LoRA / QLoRA fine-tuning. Requires transformers ≥ 4.51.0.

sudo apt update && sudo apt install -y python3 python3-pip python3-venv git
python3 -m venv ~/.venvs/qwen3-hf && source ~/.venvs/qwen3-hf/bin/activate
pip install -U torch --index-url https://download.pytorch.org/whl/cu124
pip install -U "transformers>=4.51.0" accelerate bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" keeps the checkpoint dtype; device_map="auto" places layers on available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Render the conversation with the Qwen3 chat template
messages = [{"role": "user", "content": "Explain GQA vs MQA in two sentences."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,           # toggle to False for fast mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample explicitly: greedy decoding loops in thinking mode
out = model.generate(
    **inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))

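For GPUs with 16 GB of VRAM or less, where the ~16 GB of BF16 weights will not fit alongside the KV cache, load the model 4-bit quantized:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with BF16 compute; model_name is defined in the snippet above
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                         bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb,
                                             device_map="auto")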
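
The generate call above returns the chain-of-thought and the final answer in one decoded string. A minimal sketch of separating them in plain Python follows. It assumes the decoded text keeps the <think>...</think> markers, and it tolerates a missing opening tag; split_think is a hypothetical helper name, not part of Transformers.

```python
from typing import Tuple


def split_think(text: str) -> Tuple[str, str]:
    """Split decoded model output into (reasoning, answer)."""
    head, closing, tail = text.partition("</think>")
    if not closing:
        # No closing tag: treat everything as the answer (non-thinking mode).
        return "", text.strip()
    # Drop the opening tag if the output includes one.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    demo = "<think>\nCompare the head counts.\n</think>\n\nGQA shares KV heads."
    reasoning, answer = split_think(demo)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

String splitting is the simplest approach; splitting on token ids before decoding is an alternative when you need exact token boundaries.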

Method 4 — llama.cpp (CPU, Apple Silicon, edge)

llama.cpp from build b5401 onwards has a Qwen3-aware Jinja chat template and parses <think> blocks correctly. The Qwen team publishes official GGUFs at Qwen/Qwen3-8B-GGUF in Q2_K through Q8_0.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON       # or -DGGML_VULKAN=ON / -DGGML_METAL=ON
cmake --build build --config Release -j

# Pull the Q8_0 GGUF straight from the official repo
./build/bin/llama-cli \
  -hf Qwen/Qwen3-8B-GGUF:Q8_0 \
  --jinja --color -ngl 99 -fa \
  -p "Summarize the differences between vLLM and SGLang."

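Flags worth knowing: -ngl 99 offloads all layers to the GPU, -fa turns on FlashAttention (newer builds expect an explicit value, as in -fa on), --jinja activates the bundled Qwen3 chat template (mandatory; without it the model rambles), and -c 32768 sets the context size, which the command above leaves at its default.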

How to choose: deployment-method decision tree

Method	Ease	Throughput	Flexibility	Pick when
Ollama 0.9+	★★★★★	★★★★☆	★★★☆☆	You want a working chat in 5 minutes, single user.
vLLM 0.9+	★★★☆☆	★★★★★	★★★★☆	Multi-user serving, batching, OpenAI API parity.
Transformers	★★★☆☆	★★☆☆☆	★★★★★	Fine-tuning, custom decoding, tool-calling experiments.
llama.cpp / GGUF	★★★★☆	★★★☆☆	★★★☆☆	CPU-only, Apple Silicon, AMD via Vulkan, edge devices.
SGLang ≥ 0.4.6.post1	★★★☆☆	★★★★★	★★★★☆	Structured output, RadixAttention prefix sharing.

If you're standing up a local-AI agent stack rather than just a chat box, the recommendation pattern in our OpenClaw + Ollama setup guide for running local AI agents uses Ollama as the inference layer with Qwen3-8B as the default reasoning model — it's the cleanest path for tool-using agents on a single workstation.

Performance numbers you can expect (2026 hardware)

Indicative single-stream tokens-per-second from community llama.cpp / vLLM benchmarks on Qwen3-8B; numbers vary with prompt length and quantization, so treat as ballparks rather than promises.

Hardware	Backend	Quant	Tok/s (decode)
RTX 5090 32 GB	vLLM	BF16	~150–180
RTX 4090 24 GB	vLLM	BF16	~110–140
RTX 3090 24 GB	vLLM	BF16	~70–95
RTX 4060 Ti 16 GB	llama.cpp	Q4_K_M	~55–70
Apple M4 Max	llama.cpp Metal	Q4_K_M	~45–55
Ryzen 9 7950X CPU only	llama.cpp AVX-512	Q4_K_M	~10–14

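Production batching with vLLM at --max-num-seqs 64 typically multiplies aggregate throughput 3–6× over single-stream, because continuous batching shares each weight read across many sequences and PagedAttention packs their KV caches without fragmentation.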

Common pitfalls and troubleshooting

Greedy decoding ruins thinking mode. If outputs loop ("the the the…") or never close </think>, sampling is probably disabled or left at greedy defaults. Use temperature=0.6, top_p=0.95, top_k=20, min_p=0 in thinking mode and temperature=0.7, top_p=0.8, top_k=20, min_p=0 in non-thinking mode.
Old transformers. KeyError: 'qwen3' means you're below 4.51.0 — upgrade.
vLLM streams the reasoning into content. Pass --reasoning-parser qwen3 so the chain-of-thought lands in reasoning_content and the final answer in content. On vLLM 0.9+ the parser flag alone enables this; the older --enable-reasoning flag is deprecated.
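OOM at long context. The KV cache for a single 32K-token sequence at BF16 is about 4.5 GiB by itself (2 × 36 layers × 8 KV heads × 128 head dim × 2 bytes ≈ 144 KiB per token). Quantize the weights to FP8 (--quantization fp8), store the KV cache in FP8 (--kv-cache-dtype fp8), reduce --max-model-len, or use --cpu-offload-gb.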
YaRN hurts short-context quality. Only enable rope-scaling when you actually need > 32K input. The recommended factor is 4.0 with original_max_position_embeddings=32768.
llama.cpp without --jinja. The model will ignore the system prompt and the soft-switch tags. Always pass --jinja.
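Driver or wheel mismatch. RuntimeError: CUDA error: no kernel image is available for execution on the device means the installed PyTorch build ships no kernels for your GPU's compute capability, which is common when a new card meets an older wheel; install a wheel built against a newer CUDA release. If the error instead says the CUDA driver version is insufficient for the CUDA runtime version, you are on an older driver such as the 535 series. Move to 550+.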
Ollama silently falls back to CPU. Run ollama ps and confirm the model shows 100% GPU; if not, your CUDA install isn't visible to the Ollama systemd unit — sudo systemctl edit ollama and add Environment="CUDA_VISIBLE_DEVICES=0".
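
The long-context memory pitfall above can be sanity-checked with a few lines of arithmetic. The sketch below uses the layer and KV-head counts stated in this guide; the head dimension of 128 is an assumption taken from the published model configuration, and kv_cache_bytes is a hypothetical helper. It covers one sequence only, so multiply by the number of concurrent sequences when serving.

```python
def kv_cache_bytes(tokens: int, layers: int = 36, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence of `tokens` tokens.

    head_dim=128 is an assumption; bytes_per_value is 2 for BF16/FP16, 1 for FP8.
    """
    # Two tensors (K and V) per layer, each kv_heads x head_dim values per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


if __name__ == "__main__":
    gib = 2 ** 30
    print(f"per token : {kv_cache_bytes(1) / 1024:.0f} KiB")
    print(f"32K BF16  : {kv_cache_bytes(32768) / gib:.2f} GiB")
    print(f"32K FP8   : {kv_cache_bytes(32768, bytes_per_value=1) / gib:.2f} GiB")
```

Under these assumptions a full 32K-token sequence needs about 4.5 GiB at BF16 and half that with an FP8 cache, which is why trimming the context length or the cache precision is usually the first lever to pull.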

When to pick something other than Qwen3-8B in 2026

Qwen3-4B-Thinking-2507 — Half the VRAM, surprisingly close on math/code; the right pick if you have 8–10 GB of VRAM and want full BF16.
Qwen 3.5 27B (Feb 2026) — If you have 24–48 GB of VRAM, this is a meaningful step up across reasoning, multilingual, and multimodal tasks. Note: there is no Qwen 3.5 8B; the family starts at 27B dense.
Qwen 3.6 35B-A3B MoE (Apr 2026) — Roughly Qwen 3.5 27B-class quality at lower active-parameter cost; great for repository-level coding workflows.
Llama 4 8B — Closest direct rival in the 8B tier; still trails Qwen3-8B on math and Chinese, leads on some English-only QA.
DeepSeek V4 / R-series — When you need pure reasoning and you have 80 GB+ of VRAM.

Serving Qwen3-8B as a real API (production notes)

Both Ollama and vLLM expose OpenAI-compatible endpoints, but the production deployment usually wants:

A reverse proxy (Nginx or Caddy) terminating TLS and rate-limiting.
An auth layer — neither tool ships one. Use a token-validating sidecar or front the endpoint with a small FastAPI wrapper.
Observability — vLLM exports Prometheus metrics out of the box; scrape /metrics.
A queue for long thinking-mode generations (a full 32K tokens of CoT takes roughly 4–5 minutes at the 110–140 tok/s listed above for an RTX 4090).
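
The queueing item in the list above can be as small as a concurrency cap in front of the model server. This is a sketch using asyncio from the standard library, not a production queue: run_bounded and fake_generate are hypothetical names, and fake_generate stands in for a real HTTP call to the inference endpoint.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def run_bounded(prompts: Iterable[str],
                      worker: Callable[[str], Awaitable[str]],
                      max_concurrent: int = 4) -> List[str]:
    """Run worker over prompts with at most max_concurrent calls in flight."""
    gate = asyncio.Semaphore(max_concurrent)

    async def guarded(prompt: str) -> str:
        async with gate:  # wait here while the server is saturated
            return await worker(prompt)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_generate(prompt: str) -> str:
    """Stand-in for a real call to the model server (hypothetical)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    results = asyncio.run(run_bounded(["a", "b", "c"], fake_generate, max_concurrent=2))
    print(results)
```

A semaphore keeps excess requests waiting in the client process instead of piling up on the GPU; a durable job queue is the next step if requests must survive restarts.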

If your team is heading down the local-AI-agent path and you'd rather hire someone who has done this before than build it from scratch, Codersera's network of vetted AI / ML engineers covers exactly this stack — Ubuntu deployment, vLLM tuning, agent orchestration, and the SRE work around model serving.

FAQ

Is Qwen3-8B free for commercial use?

Yes. Qwen3-8B is released under Apache 2.0 — both research and commercial use are permitted, including fine-tuning and redistribution.

Does Qwen3-8B run on AMD or Intel GPUs?

Yes. llama.cpp's Vulkan backend runs Qwen3-8B on RX 7000 / 9000-series Radeon and Arc B-series cards, and its ROCm backend covers the Radeon cards. vLLM on AMD requires a ROCm 6.x build. Performance trails NVIDIA at the same TFLOPs class but is competitive on tokens-per-dollar.

What's the difference between Qwen3-8B and Qwen3-8B-Base?

The plain Qwen3-8B repo is the post-trained chat model with hybrid reasoning. Qwen3-8B-Base is the pre-trained-only checkpoint — useful as a fine-tuning starting point but not a usable chat model out of the box.

Can I run Qwen3-8B without internet after the first download?

Yes. Once the weights are cached (under ~/.cache/huggingface/hub/ for Transformers or ~/.ollama/models/ for Ollama) the model loads fully offline. Air-gapped deployments work — pre-download the GGUF or safetensors on a connected box and rsync them in.

Is Qwen3-8B safer than running a cloud API for sensitive data?

Local inference removes the data-egress risk entirely — prompts never leave your hardware. You're still responsible for the standard hygiene: disk encryption, OS hardening, network isolation, and not logging prompts to plaintext files.

How do I disable thinking mode for production latency?

Set enable_thinking=False in the chat template (Transformers / vLLM / SGLang) or append /no_think to the user message in Ollama and llama.cpp. Non-thinking mode is roughly 3–5× faster end-to-end because there's no chain-of-thought to generate.

Should I quantize?

For inference on consumer GPUs, yes — Q5_K_M or Q8_0 GGUF lose < 1% on most benchmarks versus BF16. For fine-tuning, stay in BF16 / FP16 and only quantize for serving.

References & further reading

Qwen/Qwen3-8B model card on Hugging Face — official spec, sampling parameters, chat template.
Qwen/Qwen3-8B-GGUF — official quantized GGUF releases.
QwenLM/Qwen3 GitHub repo — deployment recipes, version requirements, examples.
Qwen3: Think Deeper, Act Faster — official launch blog (29 Apr 2025).
Qwen3 Technical Report (arXiv 2505.09388) — full architecture and benchmark detail.
vLLM qwen3_reasoning_parser docs — required for correct <think> handling.
Ollama releases — version-by-version Qwen3 support history.
r/LocalLLaMA — practitioner discussion threads on Qwen3 deployment, quantization tradeoffs, and 2026 alternatives.