DeepSeek V4

Qwen 3.7 vs DeepSeek V4: Best Open Coding Model in 2026?

DeepSeek V4 and Qwen 3.7 post near-identical coding benchmarks, but only one is actually open. A specifics-first comparison of architecture, local-run feasibility, API pricing, and license for developers choosing a coding model in 2026.

Published 30 Jun 2026 • Updated 30 Jun 2026 • 12 min read

DeepSeek V4 and Qwen 3.7 post near-identical vendor coding scores (SWE-bench Verified 80.6% vs 80.4%), but they are not both open. DeepSeek V4 ships open weights under the MIT license; Qwen 3.7 is closed and API-only. For an open, self-hostable coding model in 2026, DeepSeek V4 wins by default, and it costs roughly 9x less per token.

Two Chinese labs shipped flagship coding models inside a four-week window in spring 2026, and both immediately posted top-tier coding scores on their own vendor benchmarks. DeepSeek V4 arrived on April 24, 2026 in two open-weight variants, DeepSeek-V4-Pro and DeepSeek-V4-Flash, released under the MIT license with weights on Hugging Face. About a month later, around May 19, 2026, Alibaba released Qwen 3.7-Max (alongside a smaller qwen3.7-plus) through Alibaba Cloud Model Studio.

The framing that gets shared most often, "two open heavyweights for coding," is half wrong, and the correction is the whole story. DeepSeek V4 is genuinely open: download the weights, run them, fine-tune them, ship them in a product. Qwen 3.7 is not. Alibaba has not released the 3.7 weights, and there is no public sign the 3.7 line will be opened. So this is not a fight between two open models. It is a fight between an open model (DeepSeek V4) and a closed, API-only one (Qwen 3.7), and that distinction drives almost every decision below.

This comparison covers what actually shipped, the architecture split inside DeepSeek V4 that most posts gloss over, the head-to-head benchmark numbers (with the caveat that matters), local-run feasibility and VRAM, API pricing across regions, and what developers are reporting in real projects rather than on leaderboards.

What shipped, and when?

DeepSeek released V4 as a family, not a single model, and the two variants are very different beasts. DeepSeek-V4-Pro is a 1.6-trillion-parameter Mixture-of-Experts (MoE) model that activates roughly 49B parameters per token, drawn from 384 routed experts plus one shared expert (6 experts active per token). DeepSeek-V4-Flash is a much smaller 284B-parameter MoE that activates around 13B parameters per token. Both carry a 1M-token context window and a 384K maximum output, both are MIT-licensed, and both have weights published on Hugging Face. DeepSeek also offers both through its production API, so you can use them hosted or self-host them.

Qwen 3.7-Max is the proprietary flagship of Alibaba's Qwen 3.7 generation. It also offers a 1M-token context window, but its maximum output is 65,536 tokens, and crucially, Alibaba has not disclosed its architecture or parameter count. Whether it is dense or MoE, and how big it is, is unknown. You reach it only through the API as qwen3.7-max (or the cheaper qwen3.7-plus) on Alibaba Cloud Model Studio.

If you want the deeper single-model write-ups, we have dedicated guides for each: the DeepSeek V4 complete guide and the Qwen 3.7-Max launch guide. This piece is the cross-family comparison that sits alongside them.

Is Qwen 3.7 actually "open"?

No, and this is the single most important fact to get right before you choose one. The phrase "open coding model" gets attached to Qwen reflexively because earlier Qwen lines were genuinely open-weight and ran beautifully on local hardware. Qwen 3.7 broke that pattern.

Per a detailed status writeup from YottaLabs, "Qwen 3.7-Max is not open source and not open-weight." There are no weights on Hugging Face, no GGUF quants, and no way to load it into Ollama, LM Studio, or vLLM. Earlier Qwen lines were genuinely open-weight and ran well on local hardware; the 3.7 generation broke that pattern and stays API-only.

This has a direct consequence for at least one piece of content floating around (including, candidly, an older post on this very blog titled "how to run Qwen 3.7 locally"): you cannot run Qwen 3.7 locally, because the weights do not exist outside Alibaba's servers. If you want a local Qwen for coding, you are looking at an earlier open release, not 3.7. By contrast, DeepSeek V4 is open in the way the word is supposed to mean: MIT license, public weights, no usage gate.

So when this article asks "best open coding model," the honest answer for the open requirement is a one-horse race. Qwen 3.7 only competes if you drop the "open" constraint and compare it as a hosted API.

How do the architectures compare?

Three models, three very different hardware and cost profiles. The most common mistake is treating "DeepSeek V4" as one thing; the Pro and Flash variants sit at opposite ends of the deployment spectrum.

Spec	DeepSeek V4-Pro	DeepSeek V4-Flash	Qwen 3.7-Max
Released	Apr 24, 2026	Apr 24, 2026	~May 19, 2026
Architecture	MoE	MoE	Undisclosed
Total parameters	1.6T	284B	Undisclosed
Active params / token	~49B (6 of 384 + 1 experts)	~13B	Undisclosed
Context window	1M tokens	1M tokens	1M tokens
Max output	384K tokens	384K tokens	65,536 tokens
License	MIT (open weights)	MIT (open weights)	Proprietary (closed)
Weights available	Yes (Hugging Face)	Yes (Hugging Face)	No
Self-host	Multi-GPU only	High-end consumer (quantized)	Not possible

A few things to read out of that table. DeepSeek leaned hard into MoE for both variants, which is why the active-parameter count (the part that determines inference cost and speed) is so much smaller than the total. V4-Pro touches only ~49B of its 1.6T parameters per token; V4-Flash touches ~13B of 284B. That sparsity is what makes a 1.6T model affordable to serve at all and what makes the 284B Flash variant realistic to self-host once quantized, unlike Pro.

For Qwen 3.7-Max, the honest entry in every architecture cell is "undisclosed." Do not let anyone tell you it is a dense X-billion-parameter model or an MoE with Y experts; Alibaba has not said, and inventing a number would be guessing. What we can compare is behavior: context window (tied at 1M), output ceiling (DeepSeek's 384K is much larger than Qwen's ~64K), and openness (DeepSeek open, Qwen closed).

If MoE-versus-dense tradeoffs and the broader 2026 open-weight field are what you care about, the open-source LLMs landscape guide maps the whole field, and the self-hosting LLMs guide covers the serving side.

How do they score on coding benchmarks?

On paper, the two are effectively tied for coding. The catch, and it is a big one, is that every number below is each vendor's own published figure, not a single neutral evaluation run by a third party. Treat them as directional, not as a clean apples-to-apples scoreboard.

Benchmark	DeepSeek V4-Pro	Qwen 3.7-Max
SWE-bench Verified	80.6%	80.4%
SWE-bench Pro	55.4%	60.6%
LiveCodeBench	93.5%	91.6%
SWE-Multilingual	not published	78.3%
Terminal-Bench 2.0	not published	69.7%
Codeforces rating	3206	not published

The picture: DeepSeek V4-Pro edges Qwen on the two most-cited boards, SWE-bench Verified (80.6% vs 80.4%) and LiveCodeBench (93.5% vs 91.6%). Qwen 3.7-Max edges DeepSeek on the harder SWE-bench Pro (60.6% vs 55.4%) and publishes strong agentic and multilingual numbers (Terminal-Bench 2.0 69.7%, SWE-Multilingual 78.3%) that DeepSeek did not report in the same form. A 0.2-point gap on SWE-bench Verified is statistical noise; nobody should pick a model on it.

There is a sharper caveat worth internalizing. A US government evaluation (CAISI) assessed DeepSeek V4 across many domains and placed it roughly eight months behind the US frontier, around GPT-5 level overall, despite those top-of-leaderboard coding scores. An r/LocalLLaMA thread with 114 upvotes framed the tension well: coding benchmarks are "a narrow slice everyone optimizes against hardest." In other words, a model can saturate SWE-bench Verified and still trail on the messier, less-gamed work. High coding scores are necessary, not sufficient. We dig into the leaderboard-versus-reality gap further in the DeepSeek V4-Pro benchmarks review.

Can you run either model locally?

This is where the open-versus-closed split stops being philosophical and starts being a hardware spreadsheet. One model you can run on your own metal; the other you cannot run at all.

Model	Download size	Approx VRAM	Realistic hardware
DeepSeek V4-Pro	~865 GB	~400 GB (FP4)	Multi-GPU server, ~8x H100
DeepSeek V4-Flash	~160 GB	Varies with quantization	High-end consumer hardware (quantized)
Qwen 3.7-Max	Not available	n/a	No local path (weights not released)

DeepSeek V4-Pro is a data-center model, full stop. At roughly 865 GB to download and about 400 GB of VRAM even in FP4 quantization, you are looking at something like eight H100s. This is not a "run it on your laptop" model; it is a "run it in your own cluster instead of paying a vendor" model.

DeepSeek V4-Flash is the one that makes local interesting. At ~160 GB and ~13B active parameters, it is reportedly runnable on high-end consumer hardware once you quantize it (a high-memory Apple Silicon machine or a multi-GPU workstation). That is genuinely within reach for a well-funded indie or a small team that wants its coding model on-prem for privacy or cost reasons. If you are weighing the runners (Ollama, LM Studio, vLLM, llama.cpp, MLX), the Apple Silicon LLMs guide is the right companion for the Flash-on-a-Mac path.

Qwen 3.7-Max has no local path at all. No weights, no quants, no GGUF, no Ollama pull. If your requirement is "runs on hardware I control," Qwen 3.7 is disqualified before the conversation starts, and DeepSeek V4-Flash is the only one of these three that a typical team could realistically self-host.

How does API pricing compare?

If you are calling these models hosted, price is where the gap becomes a chasm. Below are the official per-million-token rates. DeepSeek's come straight from its published pricing; Qwen's reflect the current promotional rate.

Model	Input / 1M	Output / 1M	Cache hit / 1M
DeepSeek V4-Flash	$0.14	$0.28	$0.0028
DeepSeek V4-Pro	$0.435	$0.87	$0.003625
Qwen 3.7-Max (promo)	$1.25	$3.75	not published
Qwen 3.7-Max (list)	$2.50	$7.50	not published

Read it carefully. DeepSeek V4-Flash at $0.14 input / $0.28 output is roughly 9x cheaper than Qwen 3.7-Max's promotional $1.25 / $3.75, and the gap widens to almost 18x against Qwen's list price of $2.50 / $7.50. Even DeepSeek V4-Pro, the 1.6T flagship, undercuts Qwen 3.7-Max at $0.435 / $0.87. DeepSeek's cache-hit pricing ($0.0028 for Flash) is the kind of number that makes long-context, repeated-prompt agent loops nearly free on the input side.

Two caveats keep this honest. First, Qwen's $1.25 / $3.75 is a 50%-off promotion; the list price is $2.50 / $7.50, so budget against the list if you are planning past the promo window. Second, regional endpoints differ, Alibaba runs separate international (Singapore) and mainland-China endpoints, so where you call from can change the bill. DeepSeek's prices, by contrast, are flat and already at the floor of this market.

For integration, DeepSeek exposes an OpenAI-format base URL at https://api.deepseek.com and an Anthropic-format endpoint at https://api.deepseek.com/anthropic, with model IDs deepseek-v4-flash and deepseek-v4-pro (the legacy deepseek-chat / deepseek-reasoner aliases deprecate on 2026-07-24). Qwen 3.7 is reached as qwen3.7-max or qwen3.7-plus through Alibaba Cloud Model Studio.

What do real developers say in practice?

Benchmarks are the marketing layer; the interesting signal is what people report after wiring these into actual workflows. The picture is more nuanced than the leaderboards suggest, and it cuts both ways.

Cost stories favor DeepSeek, heavily. An r/DeepSeek post that picked up 100 upvotes documented a setup its author called "DeepClaude", running the full Claude Code agent loop on DeepSeek V4-Pro at roughly $0.44/M input and $0.87/M output, which the author pegged at "~95% cheaper than Anthropic" ($3 / $15). That is the headline use case for an open, cheap, capable model: keep the agent harness you like, swap the expensive model underneath. The AI coding agents guide covers how those harnesses route to different backends.

Qwen's edge is in agentic and multilingual work. Where Qwen 3.7-Max separates from DeepSeek on the published numbers is the harder, more agentic boards: its vendor-reported Terminal-Bench 2.0 (69.7%) and SWE-Multilingual (78.3%) scores are genuinely strong, and they pair with its lead on SWE-bench Pro (60.6% vs 55.4%). If your workload is long, tool-heavy agentic loops or multilingual codebases, those are the numbers that argue for Qwen, provided you are content on a closed, pricier API.

But cheaper is not automatically better output. The most useful counterweight came from an r/DeepSeek head-to-head (156 upvotes) building a karting game: DeepSeek V4-Pro came in roughly 4.3x cheaper than GPT-5.5, yet the author concluded "GPT-5.5 clearly made the better game... DeepSeek V4 Pro still felt far behind GPT-5.5." That is the CAISI "eight months behind the frontier" finding showing up in a concrete build. Both of these models are excellent value; neither is yet the model you reach for when output quality matters more than the invoice and you have a frontier option available.

Which should you pick for coding in 2026?

Strip away the leaderboard noise and it comes down to your hard constraints.

You need open weights, self-hosting, privacy, or fine-tuning. Between these two, DeepSeek V4 is the only real answer here, because Qwen 3.7 cannot be self-hosted at all. Pick V4-Flash if you want to run on your own high-end hardware; pick V4-Pro if you have multi-GPU capacity and want the bigger model. This is the core reason the "best open coding model" title resolves to DeepSeek.
You are cost-sensitive on a hosted API. DeepSeek again, and it is not close, V4-Flash is roughly 9x cheaper than Qwen 3.7-Max even before counting DeepSeek's near-free cache-hit pricing. The DeepClaude pattern (cheap model under a familiar agent harness) is the canonical move.
You run long agentic loops and want strong multilingual or terminal performance, and you are fine staying on a closed API. Qwen 3.7-Max has a real case here. Its SWE-bench Pro, Terminal-Bench 2.0, and SWE-Multilingual numbers are genuinely good. You just give up openness and pay materially more per token.
Output quality is the only thing that matters and budget is secondary. Be honest that, per independent evaluation and real builds, both of these still trail the Western frontier on the hardest tasks. Use them where their price-performance wins, not as a drop-in for a top-tier model on your most demanding work.

One more thing to watch: there is unsettled chatter (including a verified June 29, 2026 tweet) about a "bigger" or "official" DeepSeek V4 flagship arriving around mid-July 2026 after a preview period. Treat that as unconfirmed. What is confirmed and shipping today is V4-Pro and V4-Flash in DeepSeek's production API with MIT weights on Hugging Face, so do not let "a bigger one is coming" stop you from using what is already live.

If you want the adjacent matchup, we also compare these against the other strong open contender in GLM-5.2 vs DeepSeek V4 for coding.

The Codersera take

For most teams the practical answer in 2026 is a two-model stack rather than a single pick: DeepSeek V4 (Flash for cost, Pro for capacity) as the open, cheap workhorse for the bulk of coding and agent traffic, with a frontier model held in reserve for the hardest output-quality work where the benchmark gap actually bites. Qwen 3.7-Max earns a slot only if its long-loop agentic strength matches your workload and you are content staying on a closed API.

If you are standing up that kind of model stack, or hiring engineers who can wire open models into a real production pipeline rather than just running benchmarks, Codersera connects you with vetted remote developers who have shipped this work. That is the only pitch here; the rest of this post is for picking the right model.

FAQ

Is Qwen 3.7 open source?

No. Qwen 3.7-Max is closed and API-only. Alibaba has not released the weights, so there is no way to download, self-host, fine-tune, or run it locally, and there are no community quantizations or forks. Earlier Qwen lines were open-weight, but the 3.7 generation is not. DeepSeek V4, by contrast, is MIT-licensed with weights on Hugging Face.

Which is better for coding, DeepSeek V4 or Qwen 3.7?

On each vendor's own published benchmarks they are effectively tied: DeepSeek V4-Pro leads on SWE-bench Verified (80.6% vs 80.4%) and LiveCodeBench (93.5% vs 91.6%), while Qwen 3.7-Max leads on SWE-bench Pro (60.6% vs 55.4%). Because those are not from a single neutral evaluation, neither is a clear "winner" on the numbers. DeepSeek wins decisively on openness and price; Qwen has the edge in some long-agentic-loop reports.

Can I run DeepSeek V4 on my own hardware?

V4-Flash, realistically yes. At ~160 GB and ~13B active parameters, it can run on high-end consumer hardware once quantized (a high-memory Apple Silicon machine or a multi-GPU workstation). V4-Pro is data-center only, roughly 865 GB to download and ~400 GB of VRAM in FP4, which means about eight H100s. Qwen 3.7 cannot be run locally at all because no weights exist.

How much cheaper is DeepSeek V4 than Qwen 3.7 on the API?

DeepSeek V4-Flash costs $0.14 input / $0.28 output per million tokens, roughly 9x cheaper than Qwen 3.7-Max's promotional $1.25 / $3.75 and nearly 18x cheaper than Qwen's $2.50 / $7.50 list price. Even DeepSeek V4-Pro ($0.435 / $0.87) undercuts Qwen. Note Qwen's lower rate is a temporary 50%-off promo and its China endpoint is cheaper than Singapore.

What is the difference between DeepSeek V4-Pro and V4-Flash?

Both are MoE models with 1M-token context and MIT-licensed open weights. V4-Pro is the 1.6T-parameter flagship (~49B active per token), needs multi-GPU hardware, and posts the higher benchmark scores. V4-Flash is a 284B-parameter model (~13B active), cheap to serve, and the only variant a typical team can realistically self-host. Choose Flash for cost and local runs, Pro for maximum capability.

Are these models as good as GPT-5.5 or Claude Opus for coding?

For price-to-performance, often yes; for raw output quality on the hardest tasks, not consistently. An independent US evaluation placed DeepSeek V4 around eight months behind the frontier, and a real-world karting-game build found GPT-5.5 produced clearly better output than V4-Pro despite costing 4.3x more. Use DeepSeek V4 and Qwen 3.7 where their cost advantage wins, and keep a frontier model for work where quality outranks the invoice.