VibeThinker-3B: The Complete Guide (2026)

VibeThinker-3B is WeiboAI's MIT-licensed 3B reasoning model built on Qwen2.5-Coder-3B. We unpack the viral 'Opus 4.5 performance' claim with the actual HF benchmarks.

Published 16 Jun 2026 • Updated 06 Jul 2026 • 9 min read

Published: June 16, 2026. We refresh this guide whenever WeiboAI ships a new VibeThinker checkpoint, the recommended runtime changes, or independent benchmark numbers drop.

Quick answer. VibeThinker-3B is a WeiboAI fine-tune of Qwen2.5-Coder-3B, MIT-licensed, tuned hard for math, code, and STEM reasoning. The viral "3B with Opus 4.5 performance" framing is shorthand — WeiboAI's own claim is parity with top-tier reasoning models on verifiable benchmarks (IMO-AnswerBench 76.4, LeetCode 96.1%), not general-purpose Opus parity. Source: huggingface.co.

Update — July 2026: the local-runtime ecosystem has filled in — concrete quants now exist rather than "check HF later": GGUF (mradermacher), imatrix GGUF, and an MLX port for Apple Silicon. On the benchmark debate, VentureBeat's June 17 piece is the best single summary — including the strongest counter-evidence to the "benchmaxxing" critique: a 96.1% pass rate on LeetCode contests published after the training cutoff (Apr 25–May 31), under identical conditions where GPT-5.2 and Opus 4.6 scored lower.

VibeThinker-3B is the model that re-opened a question much of the field thought was closed: can a 3-billion-parameter open-weights model land in the same scoring range as 600B+ frontier reasoners on hard math and code? WeiboAI (yes, the AI division of Weibo) says it can, but only inside a narrow window: tasks with clear, verifiable answers. The checkpoint reached 118 likes on Hugging Face within 24 hours of release, is MIT-licensed for unrestricted commercial use, and shipped with a technical report on arXiv (2606.16140). This is the page we point engineering teams to when they're evaluating whether VibeThinker-3B is the right small model to drop into a math/code agent loop, an edge deployment, or an on-device reasoning workflow.

TL;DR — Should you care?

If you ship coding agents: Yes — 96.1% first-attempt acceptance on unseen LeetCode weekly contests (April–May 2026) is a real number for a 3B model, and the inner loop is fast and cheap.
If you run math/reasoning workflows: Yes — IMO-AnswerBench 76.4 (80.6 with test-time scaling) puts it in DeepSeek V3.2 / GLM-5 / Kimi K2.5 territory on that specific benchmark, at ~0.5% of the parameters.
If you deploy to edge or local-first hardware: Yes. A quantized GGUF fits in 2–3 GB and runs on a Mac mini, a Jetson, or, with patience, a high-end phone.
If you want a general chat model: No. WeiboAI is explicit: "For broad open-domain knowledge tasks, larger general-purpose models may still be more suitable." Don't put this in a customer-facing chatbot.

The "3B with Opus 4.5 performance" claim

The viral framing came from this post on X:

"3B model with Opus 4.5 performance — VibeThinker 3B (based on Qwen 2.5)"

— @TheAhmadOsman, June 2026 (679 likes / 47 retweets)

We need to be careful here. WeiboAI itself does not claim Opus 4.5 parity anywhere on the model card. The actual claim, in their words, is that VibeThinker-3B "reaches the performance range of top-tier frontier reasoning models, including Qwen3.6 Plus, Gemini 3 Pro, GLM-5, and Kimi K2.5, on verifiable reasoning benchmarks." Two qualifiers carry the load: performance range (not victory) and verifiable reasoning benchmarks (not general intelligence).

The specific numbers WeiboAI publishes on the HF card:

IMO-AnswerBench: 76.4 base, 80.6 with Claim-Level Reliability Assessment (CLR, a test-time scaling trick). Compared on the same benchmark: DeepSeek V3.2 (671B) scores 78.3, GLM-5 (744B) scores 82.5, Kimi K2.5 (1T) scores 81.8.
LeetCode weekly + biweekly contests (Apr 25 – May 31, 2026), Python: 123 of 128 first-attempt submissions passed = 96.1% acceptance.
AIME, HMMT, LiveCodeBench: WeiboAI says "strong" — the exact numbers are in the technical report, not on the card.

What this means in plain English: if your problem reduces to a verifier-checkable answer (a math result, a passing unit test, a multiple-choice STEM question), VibeThinker-3B is genuinely in the same neighborhood as models roughly 220–330× its size (671B to 1T parameters against 3B). If your problem is "summarize this 40-page contract" or "write a polite reply to an angry customer," it isn't, and WeiboAI doesn't claim it is.
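
The size ratios and the acceptance rate quoted above can be rechecked from the figures in the list. A minimal sketch; the parameter counts are the ones this guide quotes, and the helper names are ours, not WeiboAI's:

```python
# Figures quoted in this guide; parameter counts are in billions.
VIBETHINKER_PARAMS_B = 3
COMPARED_MODELS_B = {"DeepSeek V3.2": 671, "GLM-5": 744, "Kimi K2.5": 1000}


def size_ratios(small_b, others_b):
    """Return how many times larger each compared model is than the small one."""
    return {name: size / small_b for name, size in others_b.items()}


def acceptance_rate(passed, total):
    """First-attempt acceptance as a percentage."""
    return 100.0 * passed / total


ratios = size_ratios(VIBETHINKER_PARAMS_B, COMPARED_MODELS_B)
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x larger")

# 123 of 128 first-attempt submissions passed.
print(f"LeetCode acceptance: {acceptance_rate(123, 128):.1f}%")
```

The ratio only compares parameter counts; it says nothing about training compute or inference cost per token.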

The TheAhmadOsman tweet is a useful pointer but the headline is shorthand. The model card never benchmarks against Anthropic's Opus line.

What VibeThinker-3B actually is

Owner: WeiboAI, the AI research arm of Weibo (Sina).
Base model: Qwen/Qwen2.5-Coder-3B — note Coder-3B, not the generic Qwen2.5-3B. That choice matters: the base already has strong code priors before WeiboAI's reasoning post-training kicks in.
License: MIT. Commercial use, modification, redistribution all unrestricted.
Architecture: Qwen2ForCausalLM (decoder-only transformer). 3B parameters. 64K long-context training window.
HF tags: math, code, reasoning, gpqa, instruction-following.
Training pipeline: Curriculum two-stage SFT → multi-domain reasoning RL (MaxEnt-Guided Policy Optimization on math, code, STEM sequentially) → offline self-distillation → final instruction RL with rule-based and rubric-based validators.
Hypothesis WeiboAI is testing: the Parametric Compression-Coverage Hypothesis. Their argument is that verifiable reasoning is a "compressible, parameter-dense" capability — meaning small models can in principle reach near-frontier performance on it, while open-domain knowledge inherently requires scale.
Technical report: arXiv 2606.16140.

How to run it

The HF model card recommends three runtimes. We'll cover all three plus the GGUF route for laptops.

1. vLLM (recommended for evaluation / serving)

This is what WeiboAI used to produce the benchmark numbers. Pinned versions matter — newer vLLM builds sometimes regress on Qwen2 sampling.

pip install "vllm==0.10.1" "transformers>=4.54.0"

vllm serve WeiboAI/VibeThinker-3B \
  --max-model-len 65536 \
  --dtype bfloat16

Then hit it with OpenAI-compatible HTTP. Use the recommended sampling: temperature=1.0, top_p=0.95, top_k=-1. Lower temperatures degrade reasoning quality — counterintuitive, but documented.
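
As a rough illustration of that request, here is a standard-library sketch. It assumes the server is listening on vLLM's default port 8000 and that top_k is accepted as an extra field in the request body (it is a vLLM extension, not part of the OpenAI schema). The helper names and the 8192-token cap are our own choices, not something the model card prescribes.

```python
import json
import urllib.request

# Sampling values recommended in this guide.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": -1}


def build_chat_request(prompt, model="WeiboAI/VibeThinker-3B", max_tokens=8192):
    """Build a chat-completions payload that carries the recommended sampling."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # assumed cap; reasoning traces can be long
    }
    payload.update(RECOMMENDED_SAMPLING)
    return payload


def send_chat_request(payload, base_url="http://localhost:8000/v1", timeout=600):
    """POST the payload to an OpenAI-compatible server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (needs a running server):
# print(send_chat_request(build_chat_request("Prove there are infinitely many primes.")))
```

Keeping the payload builder separate from the network call makes it easy to confirm the sampling settings in a unit test without a GPU.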

2. SGLang (recommended for production)

pip install "sglang[all]>=0.4.9.post6"
python -m sglang.launch_server \
  --model-path WeiboAI/VibeThinker-3B \
  --port 30000

3. HF transformers (simplest, no server)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in bf16; device_map="auto" needs the accelerate package installed.
model = AutoModelForCausalLM.from_pretrained(
    "WeiboAI/VibeThinker-3B",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B", trust_remote_code=True)

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Prove there are infinitely many primes."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Recommended sampling; top_k=0 switches off top-k filtering in transformers,
# which corresponds to top_k=-1 in vLLM.
out = model.generate(**inputs, max_new_tokens=8192, do_sample=True,
                     temperature=1.0, top_p=0.95, top_k=0)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

4. llama.cpp / Ollama (laptop-friendly, quantized)

Community GGUF quants are now on HF (see the July 2026 update near the top of this page); search VibeThinker-3B GGUF. A Q4_K_M quant is about 2 GB on disk:

# Ollama (replace <author> with the publisher of the tag you choose)
ollama pull <author>/vibethinker-3b
ollama run <author>/vibethinker-3b

# llama.cpp directly (--top-k 0 disables top-k, matching top_k=-1 in vLLM)
./llama-server -m vibethinker-3b-Q4_K_M.gguf \
  --port 8080 --temp 1.0 --top-p 0.95 --top-k 0

If no Ollama tag exists yet, build the quant yourself: convert the safetensors checkpoint to GGUF with llama.cpp's convert_hf_to_gguf.py, then quantize the result to Q4_K_M with llama-quantize.

Hardware requirements

Full bf16: ~6 GB VRAM for the weights alone (3B parameters × 2 bytes). Fits a GPU with 8 GB or more, an M-series Mac, or a Jetson Orin with enough memory.
Q4_K_M GGUF: ~2 GB. Runs on CPU-only laptops, Raspberry Pi 5 (slowly), top-end phones with patience.
Long-context (64K): KV cache balloons — budget another 4–6 GB depending on quantization. For most tasks, 8K–16K is plenty.
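
The memory figures above come down to two small formulas, sketched here. The architecture values (36 layers, 2 KV heads, head dimension 128) are an assumption based on the Qwen2.5 3B family, and 4.8 bits per weight is an assumed average for Q4_K_M; read the real values from the checkpoint's config.json and the GGUF file before relying on the output.

```python
GB = 1e9


def weight_bytes(n_params, bits_per_weight):
    """Memory for the weights alone at a given average precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_value=2, batch_size=1):
    """KV cache size: one key and one value vector per layer, per token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_value * batch_size)


# ASSUMED architecture values (not taken from the model card).
ASSUMED_ARCH = {"n_layers": 36, "n_kv_heads": 2, "head_dim": 128}

print(f"bf16 weights: {weight_bytes(3e9, 16) / GB:.1f} GB")
print(f"Q4_K_M weights (assumed 4.8 bits/weight): {weight_bytes(3e9, 4.8) / GB:.1f} GB")
for context_len in (8_192, 16_384, 65_536):
    cache = kv_cache_bytes(context_len=context_len, **ASSUMED_ARCH)
    print(f"KV cache at {context_len} tokens, 16-bit: {cache / GB:.2f} GB")
```

Treat the KV result as a floor for a single sequence. Serving runtimes preallocate cache for concurrent requests and add activation and framework overhead on top, so real budgets are larger.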

Where this fits in the open-source landscape

VibeThinker-3B isn't a replacement for a general-purpose model — it's a specialized inner-loop reasoner. The natural home for it is on the same disk as a bigger generalist that handles routing and chat, with VibeThinker pulled in for the math/code/STEM steps.

For the broader map of where this checkpoint sits among DeepSeek V4, Qwen 3.6/3.7, Llama 4, Kimi K2.6, and the rest, see our Open-Source LLMs Landscape (2026) pillar. If you're picking a model for a specific machine, our Local AI Model Picker takes hardware in and returns ranked recommendations — VibeThinker-3B becomes a Mac Quick Pick the moment a community Q4 GGUF lands.

For self-hosting plumbing (vLLM vs SGLang vs llama.cpp tradeoffs, KV cache tuning, batching), our Self-Hosting LLMs (2026) guide covers it. On Apple Silicon specifically — where this model genuinely shines for offline use — our Apple Silicon LLMs (2026) pillar has the MLX path.

Use cases where a 3B reasoning model wins

Privacy-sensitive math/code work: medical statistics, financial modeling, internal codebase reasoning — runs entirely on-device, no API calls, no data egress.
Batch processing of verifiable problems: grading 10,000 student math submissions, validating 50,000 LeetCode-style screening attempts, generating proofs across a problem set. The cost-per-query collapses vs API frontier models.
Agent inner loops: the orchestrator can be a frontier model (Claude Opus, GPT-5.5), but the inner "check this math" / "is this code correct" sub-calls go to a local VibeThinker. That removes the network round trip and the per-call API cost from every sub-call; time per turn then depends mainly on how many reasoning tokens the model generates.
Offline / edge deployments: field engineers in low-connectivity environments, on-device tutoring apps, embedded reasoning in industrial controllers.
Fleets of parallel agents: running 50 specialized math agents on a single A100 becomes plausible when each one is 3B not 70B.

When NOT to use VibeThinker-3B

General customer-facing chat: the model is post-trained for verifiable reasoning, not warmth or conversational nuance. It will sound abrupt and miss context that a Qwen 3.6 Plus or Claude Sonnet wouldn't.
Multi-language work: the card lists language: en. Other languages may work via the Qwen2 base but are not target use cases.
Vision, audio, tool-use orchestration: text-only, no native multimodal support, and no agentic tool-use post-training in the published recipe (the Qwen2 chat template does support tool calls, but the model wasn't RL-tuned on them).
Open-domain knowledge questions: "who won the 2024 Champions League" type queries — small models cover the long tail of facts poorly by design. WeiboAI calls this out in their hypothesis statement.
Very long context (>64K): training window is 64K. Beyond that you're in extrapolation land.
Safety-critical reasoning without verification: WeiboAI's whole pitch is that small reasoners shine where there's a verifier. If you don't have a verifier (test runner, math checker, rubric), the safety margin shrinks.
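
The verifier point can be made concrete with a sample-and-check loop. This is a generic sketch, not WeiboAI's pipeline: generate stands for any call into the model (for example an HTTP request to a local server) and verify for whatever checker you have (a test runner, a math checker, a rubric). Both names are hypothetical.

```python
from typing import Callable, Optional


def first_verified(generate: Callable[[str], str],
                   verify: Callable[[str], bool],
                   prompt: str,
                   max_attempts: int = 4) -> Optional[str]:
    """Sample up to max_attempts candidates and return the first that verifies.

    Returns None when no candidate passes, so the caller can escalate to a
    larger model or a human instead of trusting an unchecked answer.
    """
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if verify(candidate):
            return candidate
    return None


# Toy stand-ins so the sketch runs without a model.
canned_answers = iter(["41", "42", "43"])
answer = first_verified(
    generate=lambda prompt: next(canned_answers),
    verify=lambda text: text.strip() == "42",
    prompt="What is 6 * 7?",
)
print(answer)
```

Because the recommended sampling uses temperature 1.0, repeated calls produce different candidates, which is what makes retrying against a verifier worthwhile.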

FAQ

What is VibeThinker-3B?

A 3-billion-parameter open-weights language model from WeiboAI (Weibo's AI division), fine-tuned from Qwen2.5-Coder-3B, MIT-licensed, post-trained specifically for math, code, and STEM reasoning. Released on Hugging Face in June 2026.

Is the "3B = Opus 4.5" claim real?

Not as stated. The claim came from a community tweet by @TheAhmadOsman. WeiboAI's own claim is that the model reaches the performance range of frontier reasoning models (the card's IMO-AnswerBench comparison lists DeepSeek V3.2, GLM-5, and Kimi K2.5), and only on verifiable reasoning benchmarks such as IMO-AnswerBench, LeetCode contests, and AIME. They make no claim about Claude Opus or about general intelligence. Treat the headline as marketing shorthand; treat the actual benchmarks as real.

Can it run on a Mac M1?

Yes. The full bf16 model is ~6 GB and runs on any M1 with ≥16 GB unified memory via the transformers library or MLX. A Q4_K_M GGUF is ~2 GB and runs comfortably even on an M1 with 8 GB RAM via llama.cpp.

Can it run on a phone?

Yes, with quantization. Q4_K_M (~2 GB) runs on high-end iPhones and flagship Android devices via llama.cpp or MLC LLM. Expect 5–15 tokens/second on current flagships. Sustained heavy generation will drain battery fast.

What runtime should I use?

For benchmark-matching production inference, use vLLM 0.10.1 or SGLang ≥0.4.9.post6 — those are what WeiboAI recommends and tested. For laptop use, llama.cpp with a GGUF quant. For quick experiments, plain HF transformers.

Can I quantize it?

Yes, freely. The MIT license places no restriction on quantization or redistribution. Community GGUF quants (standard and imatrix) and an MLX port are already on HF. For evaluation work, stick to bf16 to match the published benchmark numbers; for laptop use, Q4_K_M is the sweet spot.

How does it compare to Qwen2.5-Coder-7B?

Qwen2.5-Coder-7B is a stronger general coder; VibeThinker-3B is a stronger math/STEM reasoner per parameter. If your workload is pure code generation, the 7B may serve you better at roughly 2.3× the size. If your workload is math-heavy reasoning or LeetCode-style algorithmic problems, VibeThinker-3B is the better fit at less than half the memory footprint.

What sampling settings should I use?

temperature=1.0, top_p=0.95, top_k=-1. Lower temperatures degrade reasoning quality on this model — counter to most generalist models. The recipe matches what WeiboAI used to produce the published benchmark scores.

Does it support tool use / function calling?

The Qwen2 chat template supports the standard tool_call / tool_response XML pattern, so it will format tool calls correctly. But WeiboAI's RL pipeline did not specifically optimize for agentic tool use, so reliability on real tool-use traces is unverified. For agent loops, treat VibeThinker as the reasoning sub-component, not the orchestrator.
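
If you do experiment with tool calls, the orchestrator has to parse them out of the generated text. A minimal sketch, assuming the output follows the Qwen2.5 convention of a JSON object with name and arguments keys wrapped in <tool_call> tags; the function name is ours, and malformed calls are skipped rather than repaired:

```python
import json
import re

# Matches the text between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the well-formed tool calls found in a model response."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            parsed = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # unreliable output: drop it instead of guessing
        if isinstance(parsed, dict) and "name" in parsed:
            calls.append(parsed)
    return calls


sample = (
    "I will run the tests first.\n"
    '<tool_call>\n{"name": "run_tests", "arguments": {"path": "tests/"}}\n</tool_call>\n'
    "<tool_call>\nnot json\n</tool_call>"
)
print(extract_tool_calls(sample))
```

Counting how many calls get dropped as malformed is a cheap way to measure tool-use reliability on your own traces before trusting the model in that role.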

Is there a 1.5B version?

Yes — VibeThinker-1.5B preceded the 3B and is what introduced the Spectrum-to-Signal Principle (SSP) training pipeline. The 3B is a scale-up exploring how far the same recipe goes.

Where do I read the technical report?

arXiv 2606.16140 — "VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models" by Sen Xu et al., 2026.

Want the full picture? Read our Open-Source LLMs Landscape (2026) — the canonical guide to the open-weights ecosystem with every major model in this space ranked, compared, and updated quarterly. VibeThinker-3B is one of several specialist small reasoners reshaping the bottom of the parameter curve.
