Qwen 3.5: The Complete Developer Guide (2026)

Why Qwen 3.5 has been quietly winning on cost-per-quality — variants, real benchmarks, hosted pricing, and self-hosting tradeoffs vs DeepSeek V4 and Llama 4.

Last updated: May 1, 2026.

Qwen 3.5 is the most quietly important open-weight release of the year. While the discourse has fixated on whether GPT-5.2 or Claude Opus 4.6 wins this week's Arena slot, Alibaba's Qwen team has shipped a full, eight-tier model family under Apache 2.0, paired with a hybrid Gated DeltaNet architecture, and undercut almost every closed-source competitor on cost-per-token by an order of magnitude. For engineering teams that care about cost discipline, self-hosting, and avoiding lock-in, Qwen 3.5 is the model family to understand in 2026.

This guide is the long version. We cover the full Qwen 3.5 family from the 0.8B edge model up to the 397B-A17B flagship and the Omni-Plus multimodal variant; the architecture and license; real benchmark scores against DeepSeek V4, Llama 4, and Gemma 4; pricing across DashScope and the major hosted providers; hardware requirements for self-hosting; how Qwen 3.5 plugs into Claude Code as a coding agent; and the limitations you need to know before staking production on it.

TL;DR

  • The family spans Qwen3.5-0.8B, 2B, 4B, 9B, 27B (dense), 35B-A3B and 122B-A10B (MoE), and the Qwen3.5-397B-A17B flagship, plus the Qwen3.5-Omni-Plus multimodal variant. Everything from 0.8B through 397B-A17B is open-weight under Apache 2.0; Omni-Plus and the Plus/Max tiers are hosted-only.
  • Architecture is a hybrid: roughly 75% Gated DeltaNet linear-attention layers and 25% full softmax attention with GQA/RoPE, plus sparse MoE on the larger tiers. This is what gets you 256K native context with sane KV cache cost.
  • Benchmarks: the 397B-A17B flagship hits 88.4 GPQA Diamond, 91.3 AIME 2026, 83.6 LiveCodeBench v6, and 86.7 Tau2-Bench (agents). The 9B model scores 81.7 on GPQA Diamond on a laptop.
  • Pricing on DeepInfra runs $0.01 / $0.05 per 1M tokens for the 0.8B, scaling to $0.54 / $3.40 per 1M for the 397B-A17B. DashScope offers an Anthropic-API-compatible endpoint that drops directly into Claude Code.
  • Cost-per-quality is the headline. Qwen has been quietly winning in production because the 35B-A3B and 9B tiers hit 80-90% of frontier-model accuracy at 1-5% of the cost, with permissive licensing and self-hosting on commodity GPUs.
  • The catch: ecosystem maturity around fine-tuning and agent tooling lags Llama, the Plus and Max tiers are not open-weight, and the Omni-Plus real-time speech features still have rough edges in non-English languages.

The Qwen 3.5 family at a glance

Qwen 3.5 was released in three waves between February 16 and March 2, 2026. Unlike Qwen 3, which split text models from VL (vision-language) models into separate trees, Qwen 3.5 unifies them: a single backbone trained with early fusion of text and multimodal tokens. In Alibaba's own evaluations, this unified approach matches or beats the separate Qwen3-VL line on visual benchmarks while keeping text performance intact.

| Model | Total params | Active params | Architecture | Context | License | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-0.8B | 0.8B | 0.8B | Dense | 256K | Apache 2.0 | Edge, CPU inference, on-device classification |
| Qwen3.5-2B | 2B | 2B | Dense | 256K | Apache 2.0 | Mobile, embedded agents, sub-2GB GPU |
| Qwen3.5-4B | 4B | 4B | Dense | 256K | Apache 2.0 | Local coding agent, 6-8GB VRAM |
| Qwen3.5-9B | 9B | 9B | Dense | 256K | Apache 2.0 | Reasoning on 12-16GB VRAM laptops |
| Qwen3.5-27B | 27B | 27B | Dense | 256K | Apache 2.0 | Single-GPU production inference |
| Qwen3.5-35B-A3B | 35B | 3B | MoE + DeltaNet | 256K | Apache 2.0 | Sweet spot for cost-per-quality |
| Qwen3.5-122B-A10B | 122B | 10B | MoE + DeltaNet | 256K | Apache 2.0 | 2x H100 production deployments |
| Qwen3.5-397B-A17B | 397B | 17B | MoE + DeltaNet | 256K | Apache 2.0 | Flagship reasoning, agentic workflows |
| Qwen3.5-Omni-Plus | ~100B* | | Thinker-Talker MoE | 256K | Hosted (DashScope) | Multimodal: text, image, audio, video |
| Qwen3.5-Plus / Max | | | Hosted | 256K-1M | Hosted (DashScope) | Closed production tier |

*Omni-Plus active-parameter count has not been published; the unified Thinker-Talker MoE serves text, image, audio, and video through one backbone.

The MoE tiers (35B-A3B, 122B-A10B, 397B-A17B) follow the now-standard pattern: a large total parameter pool with sparse routing so only a fraction of weights activate per token. The naming convention is {total}-A{active}; 35B-A3B activates 3B parameters per forward pass out of 35B total, which is what makes it run faster and cheaper than a dense 9B while scoring closer to a dense 35B on quality.
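To make the active-parameter point concrete, here is a rough comparison using the common ~2 * N_active FLOPs-per-token approximation for decoder-only models. This is a back-of-envelope estimate that ignores attention and routing overhead, not a profiler-grade number:

```python
# Rough decode-cost comparison: per-token compute scales with ACTIVE params.
# Uses the common ~2 * N_active FLOPs/token approximation for decoder-only
# models; real throughput also depends on attention, batching, and kernels.

def decode_flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_9b = decode_flops_per_token(9e9)   # dense 9B: all weights fire
moe_35b = decode_flops_per_token(3e9)    # 35B-A3B: only 3B fire per token

print(f"dense 9B: {dense_9b:.1e} FLOPs/token")
print(f"35B-A3B : {moe_35b:.1e} FLOPs/token ({dense_9b / moe_35b:.0f}x cheaper)")
```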

If you want a hands-on tour of the smallest tier, our walkthrough on running and benchmarking Qwen3.5-0.8B covers latency, RAM footprint, and where the 0.8B actually wins in production.

Architecture: the hybrid that changes the cost curve

The single biggest reason Qwen 3.5 is cheaper to serve than its competitors is that most of its layers are not standard softmax attention. Qwen 3.5 interleaves two layer types: full attention with grouped-query attention (GQA) and RoPE, and linear attention layers built on Gated DeltaNet. By default, every fourth layer is full attention. The other three are linear.

The trade-off matters. Full softmax attention is O(n²) in sequence length and requires a KV cache that grows linearly with context. Gated DeltaNet is O(n) and does not grow the KV cache at all - it maintains a fixed-size state vector that gets updated with a delta rule plus exponential gating. Stack three DeltaNet layers per softmax layer and you get most of the recall capability of a full-attention model with a fraction of the inference memory.
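For intuition, here is a toy single-head version of a gated delta-rule recurrence in NumPy. This is a simplified sketch in the spirit of Gated DeltaNet with arbitrary dimensions and random gates; the production kernels are chunked and fused, and Qwen's exact parameterization may differ:

```python
import numpy as np

# Toy single-head gated delta-rule recurrence (simplified illustration).
# The state S stays a fixed (d x d) matrix no matter how long the sequence
# gets -- which is why these layers add nothing to the growing KV cache.
d, T = 64, 1024
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) / np.sqrt(d) for _ in range(3))
alpha = rng.uniform(0.9, 1.0, T)  # exponential gate (per-step decay)
beta = rng.uniform(0.0, 1.0, T)   # delta-rule write strength

S = np.zeros((d, d))              # fixed-size state, not a per-token cache
outputs = []
for t in range(T):
    # Decay the old state, erase the stale association along k_t, write v_t k_t^T.
    S = alpha[t] * (S - beta[t] * np.outer(S @ k[t], k[t])) + beta[t] * np.outer(v[t], k[t])
    outputs.append(S @ q[t])      # read-out for position t
```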

That is why Qwen 3.5 can offer a native 256K context window on a 9B model that fits in 12GB of VRAM. A pure-softmax model with the same context would need several times more memory just for the KV cache.
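The KV-cache arithmetic behind that claim looks like the sketch below; the layer and head counts are illustrative assumptions, not Qwen's published config:

```python
# Back-of-envelope KV-cache sizing at 256K context in FP16 (2 bytes/elem).
# layers / kv_heads / head_dim are ASSUMED for illustration only.
layers, kv_heads, head_dim, seq_len, bytes_per = 36, 8, 128, 256_000, 2

def kv_cache_gb(attention_layers: int) -> float:
    # 2x for the K and V tensors
    return 2 * attention_layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

full_softmax = kv_cache_gb(layers)     # every layer keeps a KV cache
hybrid = kv_cache_gb(layers // 4)      # only 1 in 4 layers is softmax attention
print(f"pure softmax: {full_softmax:.1f} GB")  # ~37.7 GB with these numbers
print(f"75/25 hybrid: {hybrid:.1f} GB")        # ~9.4 GB; DeltaNet state is fixed-size
```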

On top of this hybrid attention, the larger tiers add sparse Mixture-of-Experts. The 397B-A17B model has 397B total weights but only routes 17B of them per token. Combined with the linear-attention layers, this is what produces Alibaba's claim of 8-19x faster decoding than the prior Qwen3-Max generation at roughly 60% lower cost.

Benchmarks: where Qwen 3.5 actually lands

The flagship Qwen3.5-397B-A17B competes with frontier closed models on reasoning and agentic benchmarks. The mid-tier 35B-A3B and 27B compete with the best 30-70B models. And the small-tier models - the 4B and 9B - punch above their weight class because of the unified vision-language training.

| Benchmark | Qwen3.5-397B-A17B | DeepSeek V4-Pro | Llama 4 Maverick | Gemma 4 31B |
| --- | --- | --- | --- | --- |
| GPQA Diamond | 88.4 | 87.9 | 78.2 | 83.6 |
| AIME 2026 | 91.3 | 99.4 | 72.0 | 89.2 |
| LiveCodeBench v6 | 83.6 | ~85 | 62.4 | 71.0 |
| SWE-bench Verified | ~78 (flagship), 73.4 (35B-A3B) | 83.7 | ~55 | ~58 |
| MMLU-Pro | 88.0 | 92.8 | 80.5 | 84.2 |
| Tau2-Bench (agents) | 86.7 | ~80 | ~70 | |
| MMMU (multimodal) | 85.0 | | 73.4 | 76.8 |
| ArtificialAnalysis Intelligence Index v4.0 | 45 | ~48 | ~35 | ~40 |

Some takeaways that don't show in the headline numbers:

  • DeepSeek V4 still leads on raw coding (SWE-bench, AIME 2026 math) because it was trained explicitly for those workloads with the Manifold-Constrained Hyper-Connections architecture. If your use case is whole-codebase refactoring, DeepSeek V4 is the right pick. Our DeepSeek V4 complete guide covers that side in depth.
  • Qwen 3.5 leads on agentic workflows. Tau2-Bench at 86.7 is second only to Claude Opus 4.6 (91.6). The Gated DeltaNet memory persistence is what shows up here - long multi-step tool use with consistent state is exactly what linear attention with delta-rule updates is good at.
  • The small tier is the surprise. The 9B model scoring 81.7 on GPQA Diamond is genuinely new. As of early 2026 there was no other model under 30B that broke 80 on Diamond. ArtificialAnalysis confirms the 9B at index 32 and the 4B at 27, both step changes over the Qwen 3 generation.

For a head-to-head with the next-most-popular open-weight family at a comparable size, our Gemma 3 vs Qwen 3 comparison remains a useful baseline; the 3.5 generation widens Qwen's lead on multilingual and long-context, narrows the gap on math, and stays roughly level on general reasoning.

Pricing: DashScope, DeepInfra, Together, Fireworks

Qwen 3.5 is available on essentially every meaningful inference provider. The headline numbers vary by 2-3x depending on host, with DeepInfra typically the cheapest and DashScope the most feature-complete.

| Model | DeepInfra (in / out per 1M) | Together (blended) | DashScope | Notes |
| --- | --- | --- | --- | --- |
| Qwen3.5-0.8B | $0.01 / $0.05 | | tiered | DeepInfra is the only major host |
| Qwen3.5-2B | $0.02 / $0.10 | | tiered | |
| Qwen3.5-9B | ~$0.04 / $0.18 | $0.11/M blended | tiered | Together adds OpenAI-compat layer |
| Qwen3.5-35B-A3B | $0.10 / $0.45 | ~$0.20 blended | tiered | The cost-per-quality sweet spot |
| Qwen3.5-122B-A10B | $0.30 / $1.50 | ~$0.60 blended | tiered | |
| Qwen3.5-397B-A17B | $0.54 / $3.40 | ~$1.20 blended | tiered | Roughly 1/10th of GPT-5.2 pricing |
| Qwen3.5-Omni-Plus | | | Hosted-only | Audio + video tokens billed separately |
DashScope's pricing is tiered by request size rather than flat per-token. Small requests (under 32K tokens) get the lowest rate; long-context calls cost progressively more. This penalizes naive RAG pipelines that stuff context, but rewards well-designed prompts. The DashScope international endpoint is at dashscope-intl.aliyuncs.com; the China-mainland endpoint is at dashscope.aliyuncs.com. They are not interchangeable - keys are scoped to one region.

For teams running real workloads, the practical cost story is: a 50,000-document daily classification job that costs roughly $4,000/month on a frontier closed model lands at around $200/month on Qwen3.5-35B-A3B via DeepInfra with comparable accuracy. That is the cost-per-quality story driving Qwen adoption in 2026.
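That estimate is easy to reproduce. The per-document token counts below are assumptions for illustration, so substitute your own:

```python
# Monthly cost sketch for the 50K-docs/day classification job above.
# Per-document token counts are ASSUMPTIONS -- substitute your own.
docs_per_day = 50_000
in_tok, out_tok = 1_000, 100              # assumed tokens per document
price_in, price_out = 0.10, 0.45          # Qwen3.5-35B-A3B on DeepInfra, $/1M

m_in = docs_per_day * in_tok * 30 / 1e6   # millions of input tokens / month
m_out = docs_per_day * out_tok * 30 / 1e6 # millions of output tokens / month
print(f"~${m_in * price_in + m_out * price_out:,.0f}/month")  # ~$218 here
```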

Self-hosting: hardware requirements

One of the main benefits of Apache 2.0 weights is that you don't have to use a hosted provider at all. Qwen 3.5 is friendly to self-hosting because the linear-attention layers keep the KV cache small.

| Model | FP16 / BF16 | Q4_K_M (GGUF) | Realistic GPU |
| --- | --- | --- | --- |
| 0.8B | ~2 GB | ~0.6 GB | Any laptop, CPU-only OK |
| 2B | ~5 GB | ~1.5 GB | RTX 3060 / M1 |
| 4B | ~9 GB | ~3 GB | RTX 3060 12GB / M2 |
| 9B | ~18 GB | ~5.5 GB | RTX 4060 Ti 16GB / M2 Pro |
| 27B | ~54 GB | ~15 GB | RTX 4090 24GB (Q4) / 2x 3090 |
| 35B-A3B | ~70 GB | ~22 GB | RTX 5090 32GB / Mac Studio 64GB |
| 122B-A10B | ~245 GB (FP8: ~123 GB) | ~70 GB | 2x H100 80GB (FP8) |
| 397B-A17B | ~795 GB (FP8: ~397 GB) | ~220 GB | 4-8x H100 / 2-4x H200 |

Three tools cover almost all self-hosting scenarios:

  • Ollama for the simplest path. ollama pull qwen3.5:9b and you are running. Best for laptops and dev boxes.
  • llama.cpp for maximum throughput on CPU or Apple Silicon, and for GGUF-quantized MoE deployments where the experts can be offloaded selectively. Our CPU-only setup guide for the 0.8B with Ollama shows what's possible without any GPU at all.
  • vLLM for production inference servers with continuous batching, paged attention, and prefix caching. Native support for Qwen 3.5's hybrid architecture landed in vLLM 0.7+; see the sketch just below.
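
As a minimal sketch of the vLLM path via its offline Python API - note that the Hugging Face model ID here is an assumed name, so check the actual model card for the published identifier:

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model ID is an ASSUMED Hugging Face repo name -- check the real
# model card for the published identifier before running this.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3.5-9B-Instruct")  # assumed repo name
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain MoE routing in two sentences."], params)
print(outputs[0].outputs[0].text)
```

For a production server, you would run the same model behind vLLM's OpenAI-compatible HTTP endpoint instead of the offline API.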

Qwen 3.5 as a coding agent (with Claude Code)

The most underappreciated angle on Qwen 3.5 is that it speaks the Anthropic Messages API. Alibaba ships a Claude-Code-compatible endpoint at https://dashscope-intl.aliyuncs.com/apps/anthropic, which means you can point Claude Code, Cline, or any Anthropic-SDK client at Qwen and run agentic coding workflows without writing any glue.

The setup is short: set ANTHROPIC_BASE_URL to the DashScope Anthropic-compat URL, set ANTHROPIC_API_KEY to your DashScope key, and pick a model name like qwen3.5-coder-plus. Claude Code then starts its session against Qwen, and you get the same TUI, tool-use, and file-edit primitives - just with Qwen doing the thinking.
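It's worth a quick smoke test from the Anthropic Python SDK before wiring up Claude Code; a minimal sketch, assuming your DashScope key is exported in an environment variable of your choosing:

```python
# Smoke-test DashScope's Anthropic-compatible endpoint with the official
# Anthropic SDK before pointing Claude Code at it.
import os
import anthropic

client = anthropic.Anthropic(
    base_url="https://dashscope-intl.aliyuncs.com/apps/anthropic",
    api_key=os.environ["DASHSCOPE_API_KEY"],  # your DashScope key
)
msg = client.messages.create(
    model="qwen3.5-coder-plus",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
)
print(msg.content[0].text)
```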

For local-only setups, the same trick works against an OpenAI-compatible endpoint served by llama.cpp or vLLM. Our walkthrough on running Qwen 3.5 with Claude Code as a free local coding agent covers both the API-compat path and the fully-offline path with a llama.cpp endpoint, including the prompt-template gotchas that come up when Claude Code's tool-call format meets Qwen's chat template.
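The local flavor of the same smoke test goes through the OpenAI SDK; the port and registered model name depend entirely on how you launched your llama.cpp or vLLM server:

```python
# Talk to a local llama.cpp or vLLM server via its OpenAI-compatible API.
# Port and model name depend on how you launched the server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen3.5-9b",  # whatever name your server registered
    messages=[{"role": "user", "content": "Add type hints to: def add(a, b): return a + b"}],
)
print(resp.choices[0].message.content)
```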

The practical answer to "is Qwen 3.5 good enough for agentic coding?" is: at the 35B-A3B tier and above, yes for most everyday work; at the 4B-9B tier, yes for tightly-scoped tasks like single-file refactors, test generation, and codemod work where you don't need deep cross-file reasoning.

Qwen3.5-Omni-Plus and the VL story

The Omni-Plus variant is Alibaba's answer to GPT-4o-realtime and Gemini 3.1 Pro: a single model that natively handles text, image, audio, and video, with streaming speech generation through a Thinker-Talker architecture. The Thinker is a Hybrid-Attention MoE that processes all modalities through a unified backbone with a vision encoder for images and an audio tokenizer for sound. The Talker generates contextual speech in a streaming pipeline for real-time interaction.

Notable specs: 256K context that can hold over 10 hours of continuous audio or 400 seconds of 720p video at 1 fps with audio; 113 supported languages and dialects, including 39 Chinese dialects; and SOTA results across 215 audio and audio-visual benchmarks per Alibaba's own evaluation. The Realtime API supports semantic interruption detection, native tool use, and voice cloning.

For a head-to-head with the closed multimodal field, our Omni-Plus vs GPT-4o vs Gemini 3.1 Pro comparison walks through the cases where Omni-Plus wins (Chinese audio understanding, long-form video analysis, cost-per-minute of audio) and where it still trails (English real-time conversation latency, tool-use chaining over voice).

If you specifically need a vision-language deployment without the audio pipeline, the older Qwen3-VL line is still in active use; our Qwen3-VL-30B-A3B-Thinking deployment guide remains accurate for teams that want the document-understanding and OCR capability without paying for the Omni stack.

Known limitations

The marketing rounds usually skip these. They matter.

  • Plus and Max tiers are not open-weight. Qwen3.5-Plus and the implied Qwen3.5-Max sit on DashScope only. If your strategy assumes open weights, the open ceiling is 397B-A17B, not Max.
  • Hybrid attention has rough edges in long-context recall. The 75/25 DeltaNet/softmax mix is excellent on average, but on adversarial needle-in-a-haystack tests at 128K+ it underperforms pure-softmax models of comparable size by a few points. Production RAG pipelines that depend on exact verbatim retrieval over very long contexts should benchmark on real data first.
  • Tool-use formatting is finicky outside DashScope. When you self-host, the exact chat template and the way you serialize tool calls matters. Mismatches between Qwen's expected XML-style tool-call format and Claude Code's expected schema have caused production-blocking bugs that took weeks to diagnose. Pin your inference engine version.
  • The fine-tuning ecosystem is still maturing. Llama has years of LoRA recipes, axolotl configs, and proven SFT corpora. Qwen 3.5 has fewer of those because it's recent. Expect to do more of your own scaffolding for training.
  • Omni-Plus's non-English real-time speech still has rough edges in Indic languages, Latin American Spanish, and several African languages, despite the headline 113-language count.
  • SWE-bench Verified at the flagship tier still trails DeepSeek V4. If raw coding benchmark numbers drive your selection, DeepSeek V4 is the sharper pick for that single dimension.
  • Data residency and governance. DashScope endpoints route to Alibaba Cloud regions in Singapore, Hong Kong, or mainland China. Regulated industries with EU or US-only data residency requirements should self-host or use a Western inference provider rather than DashScope directly.

FAQ

Is Qwen 3.5 free to use commercially?

Yes for the open-weight tiers (0.8B through 397B-A17B), under Apache 2.0 - no MAU limits and no acceptable-use restrictions; the license's standard attribution-and-notice requirements apply only when you redistribute. The Plus and Max hosted tiers are paid and governed by DashScope's commercial terms.

What is the difference between Qwen 3.5 and Qwen 3?

Qwen 3.5 unifies text and vision-language into a single backbone (Qwen 3 had separate VL models), introduces the hybrid Gated DeltaNet + softmax architecture, extends native context to 256K, and ships an MoE flagship at 397B total / 17B active. Small-tier intelligence index gains are 9-15 points over the Qwen 3 equivalents.

How does Qwen 3.5 compare to DeepSeek V4?

DeepSeek V4 leads on raw coding and pure-math benchmarks. Qwen 3.5 leads on agentic tasks, multilingual workloads, and the small-tier (under 10B) intelligence frontier. Qwen also ships a wider family with smaller open-weight options.

Can I run Qwen 3.5 on a laptop?

Yes. The 9B model at Q4 quantization needs about 5.5 GB of VRAM. A 16 GB MacBook Pro or any 8 GB-VRAM discrete GPU runs it comfortably with room for context.

Does Qwen 3.5 support function calling?

Yes. All instruct-tuned variants from 4B upward support function calling and tool use. The 397B-A17B and 35B-A3B are the strongest at multi-turn agentic tool sequences.

Can Qwen 3.5 replace Claude Code's underlying model?

Yes. Claude Code accepts an ANTHROPIC_BASE_URL override, and DashScope ships an Anthropic-Messages-compatible endpoint. You can also point Claude Code at a local llama.cpp or vLLM endpoint serving Qwen via an OpenAI-compatible shim.

Is Qwen 3.5 multilingual?

Yes - 201 languages on the text models. Omni-Plus's audio side covers 113 languages and dialects: roughly 74 spoken languages plus 39 Chinese dialects.

What is the context window?

256K tokens native across the entire 3.5 family. The hybrid linear attention is what makes that affordable.

Why does Qwen call its big model "397B-A17B" instead of just 397B?

The "A17B" denotes active parameters - the count that actually fires per token. Out of 397B total weights, MoE routing activates 17B. Inference cost scales with active parameters, not total, which is why a 397B model can be served at a fraction of a 397B-dense-model price.

Should I use the Plus/Max hosted tier or the open-weight 397B-A17B?

If you need the absolute best Alibaba can ship and are happy with a hosted model, use Plus/Max. If you need open weights, on-prem deployment, fine-tuning rights, or data residency control, use the 397B-A17B open-weight model.

How much does it cost to run Qwen 3.5 in production?

Hosted: roughly $0.10-$3.40 per 1M output tokens depending on tier. Self-hosted on a 2x H100 box for the 122B-A10B: roughly $4,000/month bare-metal, which becomes cost-effective above approximately 200M tokens/day of usage.

Does Qwen 3.5 have a "thinking" or reasoning mode?

Yes. Most tiers ship Instruct and Thinking variants. The Thinking variants emit chain-of-thought traces and score higher on reasoning benchmarks at the cost of longer generation latency.

Is Qwen 3.5 safe to deploy in regulated environments?

The weights are open and inspectable - that's a positive for compliance. But DashScope hosted endpoints route through Alibaba Cloud regions, which often does not satisfy EU / US-only data residency requirements. Self-hosting or Western providers like DeepInfra, Together, and Fireworks are the standard path for regulated workloads.

What's the practical cost difference vs GPT-5.2 or Claude Opus 4.6?

Order-of-magnitude. The Qwen3.5-397B-A17B at $0.54 / $3.40 per 1M tokens is roughly 1/10th to 1/15th the price of the leading frontier closed models for comparable agentic and reasoning quality.

Next steps

Picking and deploying an open-weight model is the easy part. The hard part is the engineering around it: data pipelines, fine-tuning recipes, agent scaffolding, eval harnesses, and the production observability to know when the model regresses on your real workload. This is the kind of work that bottlenecks teams for months when done part-time alongside everything else.

If you're scaling Qwen 3.5, DeepSeek V4, or any open-weight model into production and you need senior engineers who already know the stack - vLLM, llama.cpp, LoRA fine-tuning, eval design, and the agent-tooling glue - Hire a Codersera-vetted Python or ML engineer. Codersera-vetted developers are remote-ready, technically screened, and available on a risk-free trial so you can validate fit before committing.