Qwen 3.7 vs Kimi K2.7: Best Open Agentic Coder in 2026?
Two flagship "agent-era" coders landed within weeks of each other in mid-2026. Alibaba unveiled Qwen3.7-Max at its Cloud Summit around May 20, 2026, branding it a "flagship for the Agent Era." Moonshot AI answered on June 12, 2026 with Kimi K2.7 Code, an open-weights agentic specialist. Both are roughly trillion-parameter mixture-of-experts (MoE) models that converge on the same 32B active-parameter sweet spot, and both are pitched squarely at long-horizon, tool-using autonomous coding rather than one-shot snippet generation.
The headline question — "which is the best open agentic coder?" — has a cleaner answer than the marketing implies, because only one of these models is actually open. That single fact reshapes the entire comparison, so we'll deal with it head-on before getting into benchmarks, pricing, local feasibility, and which one belongs in your agent loop.
Why does the "open" framing change the whole comparison?
Start with the correction most write-ups skip: Qwen3.7-Max is closed-source. It is API-only, served through Alibaba Cloud Model Studio, and there are no Qwen 3.7 open weights as of late May 2026. The open Qwen line is still Qwen 3.6 — specifically Qwen3.6-35B-A3B and Qwen3.6-27B, both Apache 2.0. So if your requirement is genuinely "open weights I can download and self-host," Qwen 3.7 does not qualify; the open Qwen you'd actually run is the previous generation.
That gap matters in practice: developers hunting for downloadable Qwen 3.7 weights won't find any — the model lives only behind Alibaba's API, per the official Qwen 3.7 announcement. If "open" is a hard requirement, Qwen 3.7 is off the table no matter how it benchmarks.
Kimi K2.7 Code, by contrast, is open-sourced under a Modified MIT license, with the full ~1.06T-parameter MoE weights published on Hugging Face. That makes the honest framing for this post not "open A vs open B" but closed flagship API (Qwen) vs open-weights agentic specialist (Kimi). On openness alone, Kimi wins by default — but "you can download it" and "you can realistically run it" are very different claims for a trillion-parameter model, which we'll get to.
A quick note on version naming, because these names get mixed up constantly and it leads to wrong conclusions. There are four distinct releases in play: Qwen 3.6 (open, Apache 2.0 — the current open Qwen line), Qwen 3.7-Max (closed flagship, the model in this post), Kimi K2.6 (the prior Kimi generation), and Kimi K2.7 Code (June 12, 2026). When someone says "Qwen 3.7 is open" they almost certainly mean Qwen 3.6; when you see a "Kimi" benchmark, check whether it's K2.6 or K2.7. Getting the version wrong is the single most common way these comparisons go sideways.
What is Qwen3.7-Max?
Qwen3.7-Max is Alibaba's proprietary flagship, positioned explicitly for autonomous, long-horizon agent work rather than chat. Key facts:
- Access: closed, API-only via Alibaba Cloud Model Studio (also surfaced through third-party aggregators like OpenRouter and Together).
- Context: 1M tokens — four times Kimi's window, which matters for whole-repo reasoning and very long agent traces.
- Modality: text-only input and output.
- Scaffold-agnostic: Alibaba says it runs under Claude Code, OpenClaw, and Qwen Code, so you aren't locked into one harness.
The autonomy claims are the centerpiece of Alibaba's pitch. The team (via @Alibaba_Qwen and @ziwenxu_) reports Qwen3.7-Max ran 35 hours solo on a memory-bound attention-kernel optimization on undocumented hardware, made 1,158 tool calls with zero hand-holding, and landed a ~10x speedup. In a year-long "YC-bench" startup simulation it generated $2.08M in simulated revenue. Read these as directional demos of stamina, not reproducible benchmarks — but they do speak to where Qwen is aimed: loops that run for hours, not turns that run for seconds.
For a deeper setup and access walkthrough, see our Qwen 3.7-Max launch guide; if you specifically want open Qwen weights you can self-host today, that's the 3.6 generation covered in how to run Qwen 3.6 locally.
What is Kimi K2.7 Code?
Kimi K2.7 Code is Moonshot AI's open agentic-coding release, dated June 12, 2026. It's the successor to Kimi K2.6 and is explicitly tuned for the agent loop. The spec sheet:
- Architecture: ~1.06 trillion total parameters, 32B active, 61 layers, 384 experts (8 routed + 1 shared) — a sparse MoE in the same weight class as Qwen but with a quarter of the context.
- Context: 256K tokens (262,144).
- License: Modified MIT, weights on Hugging Face.
- Modality: adds native image and video input over K2.6.
- API: Anthropic-compatible — you point an Anthropic SDK at Moonshot's endpoint and existing Claude-shaped tooling mostly just works.
Moonshot reports meaningful generational gains: hallucination rate down to roughly 39% from K2.6's ~65%, and about 30% fewer reasoning tokens per task — a real efficiency win if it holds, since reasoning tokens are pure cost. There's also a HighSpeed (6x) mode: an independent test from @gmi_cloud (333 likes, ~37K views) found HighSpeed ran 2.5x-4.2x faster than standard K2.7 Code at comparable quality, and 4.5x-15x faster than GLM 5.2 at roughly half the price. That speed/cost envelope is arguably Kimi's strongest practical card.
How do Qwen3.7-Max and Kimi K2.7 Code compare on specs?
Here's the side-by-side. The single biggest divergences are access model (closed vs open), context window (1M vs 256K), and modality.
| Attribute | Qwen3.7-Max | Kimi K2.7 Code |
|---|---|---|
| Vendor | Alibaba | Moonshot AI |
| Released | ~May 20, 2026 | June 12, 2026 |
| Access | Closed, API-only (Alibaba Cloud Model Studio) | Open weights (Hugging Face) |
| License | Proprietary | Modified MIT |
| Architecture | ~1T-class MoE, 32B active | ~1.06T MoE, 32B active, 61 layers, 384 experts (8+1 shared) |
| Context window | 1M tokens | 256K (262,144) tokens |
| Modality | Text in / text out | Text + native image & video input |
| API shape | Alibaba Cloud / OpenRouter / Together | Anthropic-compatible |
| Self-hostable? | No | Yes (but ~577GB VRAM at INT4 — see below) |
Note the symmetry: both are ~1T-class MoE models with the same 32B active-parameter budget. They were architected for the same job. The decision between them is rarely about the transformer math — it's about access, context, price, and how each behaves once it's actually running your agent loop.
What do the benchmarks actually say?
Read this section with one rule front of mind: almost every headline number on both sides is vendor-reported on proprietary suites. Kimi's figures come from Moonshot's own Kimi Code Bench v2, Program Bench, MLS Bench, and MCP-Mark. Qwen's SWE-Pro and related figures are Alibaba's own. kimik2ai.com is explicit that no independent SWE-bench Verified, Terminal-Bench, or LiveCodeBench results for K2.7 exist yet — "read them as vendor-reported and directional." Treat the tables as marketing-adjacent, not referee-grade.
Qwen3.7-Max's published numbers:
| Benchmark (Qwen3.7-Max, vendor-reported) | Score |
|---|---|
| SWE-Pro | 60.6 |
| SWE-bench Verified | 80.4 |
| SWE-Multilingual | 78.3 |
| Terminal-Bench 2.0-Terminus | 69.7 |
| GPQA Diamond | 92.4 |
| HLE (Humanity's Last Exam) | 41.4 |
| MCP-Mark (agentic tool-use) | 60.8 |
| MCP-Atlas | 76.4 |
| SciCode | 53.5 |
Kimi K2.7 Code's published numbers are reported mostly as deltas over K2.6, which is useful for seeing the generational jump but means you can't line them up against Qwen's suites directly:
| Benchmark (Kimi K2.7 Code, vendor-reported) | K2.7 | K2.6 | Delta |
|---|---|---|---|
| Kimi Code Bench v2 | 62.0 | 50.9 | +21.8% |
| Program Bench | 53.6 | 48.3 | +11.0% |
| MLS Bench Lite | 35.1 | 26.7 | +31.5% |
| MCP-Mark Verified | 81.1 | — | beats Claude Opus 4.8's 76.4 |
| MCP Atlas | 76.0 | — | — |
| SWE-bench Pro (uncertain, secondary) | ~58.6 | — | directional only |
The cleanest near-apples-to-apples comparison is MCP Atlas, where both vendors report a number on a similarly-named suite: Qwen 76.4 vs Kimi 76.0 — effectively a tie on agentic tool-orchestration. The MCP-Mark figures look more lopsided (Qwen 60.8 vs Kimi's 81.1) but Kimi's is the "Verified" variant, which may not be the same test, so don't over-read the gap. The headline takeaway: Qwen posts the higher and broader raw scores (GPQA 92.4 and SWE-bench Verified 80.4 are genuinely strong), while Kimi's story is steep generational improvement plus a standout MCP-Mark Verified result. Neither has independent confirmation yet.
The SWE-bench Pro ~58.6 figure for Kimi that circulates in secondary write-ups deserves a specific warning: kimik2ai.com itself says no independent SWE-bench Verified exists for K2.7. So that number is doubly soft — secondary and vendor-directional. Don't anchor a purchasing decision on it.
How does pricing compare — and what does it actually cost?
List price and real cost diverge sharply here, so we'll do both. There's no single "the price" on either side; it's tier- and provider-dependent.
| Pricing (per 1M tokens) | Qwen3.7-Max | Kimi K2.7 Code |
|---|---|---|
| Input (official tier) | ~$2.50 | $0.95 (cache-miss) |
| Input (cached) | — | $0.19 |
| Output | ~$7.50 | $4.00 |
| Cheaper third-party | Novita ~$1.25 / $3.75 | via OpenRouter aggregator |
| Subscription | — | Kimi Code from $19/mo (a ~$39 higher-limit plan also exists) |
On official list pricing, Kimi is roughly 3-5x cheaper than Qwen's official tier on input, and its $0.19 cached-input rate is especially low — a big deal for agent workflows that re-send the same large system prompt and repo context on every step. The gap narrows on output and on Qwen's discounted third-party routes: Qwen's official ~$2.50/$7.50 drops to roughly $1.25/$3.75 on providers like Novita, so discounted-Qwen output (~$3.75) can actually undercut Kimi's $4.00 list output. Where Kimi pulls clearly ahead is input — especially cached input — and, as we'll see next, on effective cost once Qwen's verbosity is in the mix.
Then there's the verbosity tax. DataCamp's evaluation flagged Qwen3.7-Max as notably verbose — one eval generated ~97M tokens against a ~24M median. Output is the expensive side of the meter, so a model that emits 4x the tokens at $7.50/1M can cost far more in practice than its list price suggests. Kimi's reported ~30% reduction in reasoning tokens vs K2.6 pushes the other way. So the effective-cost gap between the two is likely wider than the sticker gap. Always measure on your own traces before committing a budget.
One note on the subscription side: the Kimi Code plans carry weekly usage limits — the entry tier starts at $19/mo, with a higher ~$39 weekly-limit plan above it — so they're not "unlimited" buckets for heavy agent use. Past a certain volume, metered API billing may work out better.
Can you actually run Kimi K2.7 Code locally?
This is where "open" needs a reality check. Yes, the weights are downloadable. No, a consumer GPU will not run them. The VRAM math, per Unsloth's guidance:
| Precision | Approx. memory | Rough hardware |
|---|---|---|
| FP16 | ~2,300GB | Data-center cluster |
| INT8 | ~1,150GB | Large multi-GPU node |
| INT4 | ~577GB | ≈8x A100 80GB |
| Unsloth UD-Q2_K_XL (dynamic 2-bit) | ~345GB combined RAM+VRAM | Practical "at home" floor — single-digit tok/s, reduced quality |
So the honest statement is: Kimi K2.7 is self-hostable only if "local" means a serious workstation or a small GPU cluster. Even INT4 needs around 577GB of VRAM — roughly eight A100 80GBs. The Unsloth dynamic 2-bit build (UD-Q2_K_XL, ~345GB combined RAM+VRAM) is the realistic floor for an enthusiast with a maxed-out box, and you should expect single-digit tokens per second with measurable quality loss at that quant. A 24GB consumer card runs none of this. If you want a genuinely runnable open model on modest hardware, the right move is a smaller architecture entirely — see our self-hosting LLMs guide for what fits where.
For people who do have the hardware, the practical commands are:
# Easiest path — hosted API (Anthropic-compatible)
# Kimi K2.7 Code exposes an Anthropic-compatible API, so you point an
# Anthropic-style SDK at Moonshot's endpoint using your Kimi API key.
# Use the exact base URL from Moonshot's current API docs.
# Local, balanced quant (Unsloth UD-Q2_K_XL, ~345GB combined RAM+VRAM):
# download the GGUF build and load it in your preferred GGUF runtime.
# Rule of thumb: RAM + VRAM needed ~= the quant file size.
# Full INT4 (~577GB VRAM, ~8x A100 80GB):
# run it on a multi-GPU inference server.
The takeaway: Kimi's openness is real and valuable for organizations that need on-prem control, data residency, or freedom from vendor rate limits — but it is not a "run it on your laptop" story. For most teams, the Anthropic-compatible hosted API is how you'll actually use it.
Which is better for agentic tool-use and long-horizon autonomy?
This is the category both models were built to win, so it deserves more than a benchmark glance.
Qwen3.7-Max leans on stamina. The 35-hour, 1,158-tool-call solo run and the YC-bench simulation are stamina demos — the model is engineered to stay coherent across very long autonomous loops, and the 1M context window directly supports that (more room for accumulated tool output and history before truncation bites). If your use case is "set it running and let it grind on a hard problem for hours," that's Qwen's claimed strength — bearing in mind these are vendor demos, not reproducible benchmarks.
Kimi K2.7 leans on speed and tool-orchestration economics. Its MCP-Mark Verified 81.1 (reportedly beating Claude Opus 4.8's 76.4) is the standout agentic number, and the HighSpeed mode's 2.5x-4.2x speedup at comparable quality changes the iteration loop — faster cycles plus ~3-5x lower token cost means you can afford more attempts. If your agent does lots of MCP tool calls and you care about throughput and budget, Kimi's profile fits. Pair it with a solid tool layer from our best MCP servers for Claude Code and Cursor roundup.
Now the calibrated skepticism, because the benchmarks don't tell the whole story. Every headline number here is vendor-reported, and as kimik2ai.com itself notes, no independent SWE-bench Verified, Terminal-Bench, or LiveCodeBench results for K2.7 exist yet. A model two weeks post-release simply hasn't built the real-world track record that turns lab scores into trust — so the strong benchmark sheet reads as a promise, not a proven result. Treat both models' "agent era" branding as a hypothesis to test on your own code, not a verdict to bank on.
And this is why running your own eval matters more than the scoreboard. The agentic gap between models shows up most in the unglamorous parts of a long run — recovering from a failed tool call, not re-litigating a decision it already made, knowing when to stop. Those behaviors barely register on a single benchmark score but dominate the experience of letting a model run unsupervised for an afternoon. The only reliable test is to hand each model the same real task in the same harness and watch where it gets stuck.
If you're choosing an agent harness and model together, our AI coding agents guide covers how scaffold choice interacts with model behavior — and both Qwen3.7-Max and Kimi K2.7 are deliberately scaffold-agnostic, so you can A/B them in the same harness.
Which one should you choose?
Decision rules, by what you actually care about:
- You need open weights / on-prem / data residency: Kimi K2.7 Code, full stop — it's the only open option, and Qwen 3.7 has no weights. (If your hardware is modest, run an open Qwen 3.6 or a smaller model instead and revisit when you can host 345GB+.)
- You want the lowest token cost and fast iteration: Kimi. ~3-5x cheaper per token, a $0.19 cached-input tier, and a HighSpeed mode. Qwen's verbosity makes its effective cost worse than its list price.
- You want the highest raw benchmark scores and the longest context: Qwen3.7-Max. GPQA 92.4, SWE-bench Verified 80.4, and a 1M window beat Kimi's published figures and 256K — accepting closed/API-only and higher cost.
- You run multi-hour autonomous loops: Qwen's stamina demos and 1M context tilt this way, though you'll pay for it.
- You do heavy MCP tool-calling: roughly a tie on MCP Atlas (76.4 vs 76.0); Kimi's MCP-Mark Verified edge plus lower cost makes it the value pick.
- You can't tolerate vendor lock-in or surprise rate limits: Kimi, because you can ultimately self-host.
The blunt version: for the literal question "best open agentic coder," Kimi K2.7 Code wins because it's the only open one. For "best agentic coder regardless of openness," Qwen3.7-Max has the stronger benchmark sheet and longer context but is closed, pricier, and verbose — and both still ride on vendor-reported numbers that no independent suite has confirmed. Run your own eval on your own repo before you standardize on either.
If your bottleneck isn't the model but the people steering it, that's a different problem. Codersera helps teams extend their engineering team with vetted remote developers who already work fluently with agentic coding tools — useful when you've picked a model and need hands that can put it to work.
FAQ
Is Qwen 3.7 open source?
No. Qwen3.7-Max is closed and API-only, served through Alibaba Cloud Model Studio, with no open weights released as of late May 2026. The open Qwen line is still Qwen 3.6 (Qwen3.6-35B-A3B and Qwen3.6-27B, Apache 2.0). If you need downloadable Qwen weights today, you run the 3.6 generation, not 3.7.
Is Kimi K2.7 Code open source?
Yes. Moonshot AI released Kimi K2.7 Code on June 12, 2026 under a Modified MIT license, with the full ~1.06T-parameter MoE weights (32B active) published on Hugging Face. It's the only genuinely open model in this comparison.
Can I run Kimi K2.7 on a normal GPU?
Not realistically. INT4 needs roughly 577GB of VRAM (about 8x A100 80GB). The practical "at home" floor is Unsloth's dynamic 2-bit build (~345GB combined RAM+VRAM), which runs at single-digit tokens per second with reduced quality. A 24GB consumer card cannot run it — most people should use the hosted Anthropic-compatible API instead.
Which is cheaper, Qwen3.7-Max or Kimi K2.7 Code?
Kimi, by roughly 3-5x. Kimi K2.7 Code is about $0.95 cache-miss input / $0.19 cached input / $4.00 output per 1M tokens. Qwen3.7-Max's official tier is around $2.50 input / $7.50 output (cheaper on some third parties). Qwen's verbosity — one eval hit ~97M tokens vs a ~24M median — widens the effective-cost gap further.
Which has the bigger context window?
Qwen3.7-Max, at 1M tokens versus Kimi K2.7's 256K (262,144). The larger window helps Qwen on whole-repo reasoning and very long agent traces, which aligns with its long-horizon autonomy pitch.
Are these benchmark numbers trustworthy?
Treat them as directional. Nearly every headline figure on both sides is vendor-reported on proprietary suites (Kimi Code Bench v2, MLS Bench, MCP-Mark; Qwen's own SWE-Pro). No independent SWE-bench Verified or Terminal-Bench results exist for Kimi K2.7 yet, and early hands-on reports are mixed. Run your own evaluation before standardizing on either model.