Quick answer. No model wins outright. On neutral testing (Artificial Analysis Intelligence Index v4.0), Kimi K2.6 leads open weights at 54, DeepSeek V4-Pro follows at 52, and GLM-5.1 at 51. Pick Kimi K2.6 for the top composite intelligence score, DeepSeek V4-Flash for the cheapest competent coding, and GLM-5.1 for a clean MIT license and frontier SWE-bench Pro scores.
By April 2026 the open-weights coding race stopped being a story about catching closed models and became a story about which open model to standardise on. Three names dominate the shortlist: Moonshot's Kimi K2.6, DeepSeek V4 (shipped as V4-Pro and V4-Flash), and Z.ai's GLM-5.1. All three are downloadable, all three are credible against the closed frontier on coding, and each picks a different hill to win on.
The honest problem: there is no clean three-way canonical leaderboard. Independent harnesses (Artificial Analysis, LMArena) cover the models unevenly, and the vendors each publish numbers on the benchmarks where they look best. This piece separates neutral, independently measured data from vendor-reported data, says explicitly where neutral data is missing, and ends with a decision block you can act on. The sourced three-column table below is the citable asset, so read it first.
What are Kimi K2.6, DeepSeek V4 and GLM-5.1?
All three are large Mixture-of-Experts models released within a three-week window in April 2026, all with open weights on Hugging Face.
- Kimi K2.6 (Moonshot AI, released 2026-04-20): 1T total parameters, ~32B active per token, 256K context, configurable thinking/instant modes. Released under a Modified MIT license (permissive, with a visible-attribution clause for very large deployments).
- DeepSeek V4 (DeepSeek, public preview 2026-04-24): two hosted variants from one family. V4-Pro is 1.6T total / ~49B active; V4-Flash is 284B total / ~13B active. Both expose a 1M-token context. Weights and repo are MIT licensed.
- GLM-5.1 (Z.ai, formerly Zhipu, released 2026-04-07): 744B total / ~40B active, 200K context, tuned for long-horizon agentic engineering. MIT licensed.
For background on each model individually, see our Kimi K2.6 complete guide and our DeepSeek V4 complete guide; this article is the head-to-head.
How do they compare on the master benchmark table?
This is the load-bearing table. Every benchmark row is labelled (neutral) if it comes from an independent harness (Artificial Analysis or LMArena) or (vendor) if it is self-reported by the model maker. Where a neutral number does not exist, the cell says so explicitly rather than substituting a vendor number silently.
| Dimension | Kimi K2.6 | DeepSeek V4-Pro (+Flash) | GLM-5.1 |
|---|---|---|---|
| AA Intelligence Index v4.0 (neutral) | 54 (top open-weights) | 52 (Pro, Max effort); 47 (Flash, Max) | 51 |
| AA GDPval-AA, agentic real-world tasks (neutral) | 1484 | 1554 (Pro, leads open weights) | 1535 |
| LMArena Code Arena Elo (neutral) | not independently reported | not independently reported | 1530 |
| SWE-bench Verified (vendor) | 80.2% | 80.6% Pro / 79.0% Flash | 77.8% |
| SWE-bench Pro (vendor) | 58.6% (vendor claims #1) | close behind, exact figure not cleanly reported | 58.4% (vendor claims #1) |
| Terminal-Bench 2.0 (vendor) | not cleanly reported | 67.9% (Pro) | strong but exact figure not cleanly reported |
| LiveCodeBench Pass@1 (vendor) | not cleanly reported | 93.5% Pro / 91.6% Flash | not cleanly reported |
| Context window | 256K | 1M (both variants) | 200K |
| List price, $ / 1M in→out | $0.95 → $4.00 (AA-verified) | Pro $1.74 → $3.48; Flash $0.14 → $0.28 (vendor list) | $1.40 → $4.40 (vendor list) |
| Architecture | 1T MoE / ~32B active | Pro 1.6T / ~49B; Flash 284B / ~13B | 744B MoE / ~40B active |
| License | Modified MIT | MIT | MIT |
How to read this honestly. The rows you can trust as apples-to-apples are the two Artificial Analysis rows, because Artificial Analysis runs the same harness against every model. There, Kimi K2.6 leads on the composite Intelligence Index (54), but DeepSeek V4-Pro wins the GDPval-AA agentic real-world benchmark (1554). The third neutral row, LMArena Code Arena, has a number for only one model: GLM-5.1 holds the only independent Code Arena Elo on record (1530). The vendor SWE-bench / Terminal-Bench / LiveCodeBench rows are useful but not directly comparable across vendors because harness configs, scaffolds, and effort settings differ; treat them as each vendor's best case, not a ranking.
What does the neutral data actually say?
Strip out the vendor numbers and the picture narrows to three independent signals:
- Artificial Analysis Intelligence Index v4.0 (a composite over GDPval-AA, Terminal-Bench Hard, SciCode, AA-LCR, GPQA Diamond and others): Kimi K2.6 54, DeepSeek V4-Pro (Max effort) 52, GLM-5.1 51, DeepSeek V4-Flash (Max) 47. The top three sit within 3 points of each other, close enough to treat as a tie at the composite level, with Kimi nominally first.
- GDPval-AA (Artificial Analysis's agentic real-world work benchmark): here the order flips — DeepSeek V4-Pro 1554, GLM-5.1 1535, Kimi K2.6 1484. If your workload is long-horizon agentic engineering rather than single-shot reasoning, V4-Pro is the neutral leader.
- LMArena Code Arena: GLM-5.1 has an independently confirmed 1530 Elo. As of this writing there is no comparable independent Code Arena number on record for Kimi K2.6 or DeepSeek V4 — so GLM-5.1 is the only one of the three with a public, independent human-preference coding signal. That is a gap in the neutral data, not a point against the other two.
Net: on the strongest neutral evidence available, this is not a blowout. Kimi K2.6 edges the composite, DeepSeek V4-Pro edges the agentic-work benchmark, and GLM-5.1 owns the only independent coding-Elo signal.
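The order flip described above is easy to check by sorting the same three models on each neutral metric. This is a minimal sketch using only the scores quoted in this article; the dictionary layout and the function names `rank` and `missing` are illustrative choices, and `None` stands for "no independent number on record".

```python
# Neutral scores copied from the table above; None marks missing independent data.
NEUTRAL = {
    "Kimi K2.6":       {"aa_index": 54, "gdpval_aa": 1484, "code_arena_elo": None},
    "DeepSeek V4-Pro": {"aa_index": 52, "gdpval_aa": 1554, "code_arena_elo": None},
    "GLM-5.1":         {"aa_index": 51, "gdpval_aa": 1535, "code_arena_elo": 1530},
}


def rank(metric):
    """Return the models that have a reported score for `metric`, best first."""
    scored = {m: s[metric] for m, s in NEUTRAL.items() if s[metric] is not None}
    return sorted(scored, key=scored.get, reverse=True)


def missing(metric):
    """Return the models with no independent number for `metric`."""
    return [m for m, s in NEUTRAL.items() if s[metric] is None]


for metric in ("aa_index", "gdpval_aa", "code_arena_elo"):
    print(metric, rank(metric), "missing:", missing(metric))
```

Keeping missing values as `None` instead of `0` matters here: a model without a Code Arena number is dropped from that ranking rather than sorted last, which matches how the table treats gaps.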
Which is cheapest per coding task?
Per-token list price is not cost-per-task — verbosity matters. Artificial Analysis measured how many output tokens each model burns to run its own Intelligence Index, and the DeepSeek V4 family is notably token-hungry: V4-Pro used ~190M output tokens, V4-Flash ~240M, for the same benchmark suite. A cheap per-token price multiplied by high verbosity can erase the headline advantage.
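To put the verbosity figures in dollar terms, the sketch below multiplies each reported output-token count by the vendor list output price. It assumes the whole run is billed at list price and ignores input tokens; the function name `output_spend` is illustrative. The article gives no token counts for Kimi K2.6 or GLM-5.1, so only the two DeepSeek variants appear.

```python
# Output tokens (in millions) reported for running the Intelligence Index,
# and vendor list output prices in $ per 1M tokens, both taken from this article.
INDEX_RUN = {
    "DeepSeek V4-Pro":   {"output_tokens_m": 190, "out_price": 3.48},
    "DeepSeek V4-Flash": {"output_tokens_m": 240, "out_price": 0.28},
}


def output_spend(tokens_m, price_per_m):
    """Dollars spent on output: millions of tokens times $ per million tokens."""
    return round(tokens_m * price_per_m, 2)


for model, run in INDEX_RUN.items():
    print(model, output_spend(run["output_tokens_m"], run["out_price"]))
```

Under these assumptions V4-Flash emits about a quarter more output tokens than V4-Pro yet spends roughly a tenth as much on them, so verbosity narrows the gap inside the DeepSeek family without closing it.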
Working the list prices into a rough cost-per-task model (assume a typical agentic coding task is input-heavy, ~5:1 input:output):
- DeepSeek V4-Flash ($0.14 in / $0.28 out, vendor list) is the runaway cheapest competent option — roughly an order of magnitude below the others per token, and it lands within ~1.6 points of V4-Pro on vendor SWE-bench Verified (79.0% vs 80.6%). For high-volume CI agents, batch refactors, and test generation, it is the default.
- Kimi K2.6 ($0.95 in / $4.00 out, AA-verified) is mid-priced with an 83% cache-hit discount on input ($0.16 cached) — strong for cache-heavy agent loops that re-send a stable system prompt and repo map every turn.
- DeepSeek V4-Pro ($1.74 in / $3.48 out, vendor list) is competitive per-token but its high output verbosity pushes real per-task cost up; budget for it. Some aggregators quote off-peak/discounted V4-Pro pricing near $0.43/$0.87 — that is a time-window discount, not the standing list rate, so model your budget on the list price and treat the discount as upside.
- GLM-5.1 ($1.40 in / $4.40 out, vendor list) is the most expensive on output of the three, offset by a 200K context (cheaper to keep full than a 1M window you actually fill) and the strongest license story.
Rule of thumb: if cost dominates your decision, V4-Flash wins by a wide margin. If you are paying for the top of the neutral intelligence band, the Kimi/V4-Pro/GLM premium is real but small relative to closed-frontier pricing.
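The 5:1 assumption above can be turned into a rough per-task estimate. This is a sketch, not a billing model: the prices are the list prices quoted in this article, the 100K-input / 20K-output task size is a hypothetical example, and `task_cost` and `cache_hit_rate` are illustrative names. Every model is given the same token counts, which deliberately leaves out the verbosity differences discussed earlier.

```python
# List prices in $ per 1M tokens, copied from this article.
PRICES = {
    "Kimi K2.6":         {"in": 0.95, "out": 4.00, "cached_in": 0.16},
    "DeepSeek V4-Pro":   {"in": 1.74, "out": 3.48},
    "DeepSeek V4-Flash": {"in": 0.14, "out": 0.28},
    "GLM-5.1":           {"in": 1.40, "out": 4.40},
}


def task_cost(model, input_tokens, output_tokens, cache_hit_rate=0.0):
    """Estimate dollars for one task at list price.

    cache_hit_rate is the fraction of input tokens billed at the cached rate.
    It only changes the result for models with a "cached_in" price above.
    """
    p = PRICES[model]
    cached_price = p.get("cached_in", p["in"])
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    dollars = (fresh * p["in"] + cached * cached_price
               + output_tokens * p["out"]) / 1_000_000
    return round(dollars, 4)


# Hypothetical task: 100K input and 20K output tokens (5:1 input:output).
for name in PRICES:
    print(name, task_cost(name, 100_000, 20_000))

# Same task on Kimi K2.6 with 80% of the input served from cache.
print(task_cost("Kimi K2.6", 100_000, 20_000, cache_hit_rate=0.8))
```

With these assumed token counts the sketch gives about $0.02 for V4-Flash, $0.18 for Kimi K2.6, $0.23 for GLM-5.1, and $0.24 for V4-Pro, and an 80% cache-hit rate brings Kimi K2.6 down to about $0.11. Scale `output_tokens` per model if you have measured how verbose each one is on your own workload.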
How do they compare on self-host and license?
All three are genuinely self-hostable, but the GPU bill and license terms differ.
| | Kimi K2.6 | DeepSeek V4-Pro / Flash | GLM-5.1 |
|---|---|---|---|
| License | Modified MIT (attribution clause at very large scale) | MIT (cleanest) | MIT (cleanest) |
| Inference engines | vLLM, SGLang, KTransformers | vLLM, SGLang | vLLM, SGLang, llama.cpp, Unsloth |
| Practical full-precision hardware | ~8× H200-class; 4× H100 viable with native INT4 (QAT) at reduced context | Pro is the heaviest (1.6T); Flash (284B) is the easiest of the four variants to self-host | ~860GB VRAM for the FP8 checkpoint, which is more than the 640GB on 8× H100 80GB; plan for 8× H200-class or a multi-node H100 setup |
| Self-host sweet spot | Teams wanting top intelligence and willing to pin vLLM/SGLang versions + QAT INT4 | V4-Flash: smallest credible model, 1M context, MIT | Sovereignty/compliance teams that need a clean MIT license and a 200K context |
If "clean license, no asterisks" is a hard requirement (legal, regulated, or resale scenarios), GLM-5.1 and DeepSeek V4 are unencumbered MIT; Kimi K2.6's Modified MIT only adds friction at very large deployment scale, but it is an asterisk a lawyer will flag. If "smallest model I can actually run on my own GPUs" is the constraint, DeepSeek V4-Flash at 284B total is the clear answer and still ships the 1M context window.
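A quick way to sanity-check any hardware claim is to compute the memory the weights alone need. The sketch below assumes standard sizes of 2, 1, and 0.5 bytes per parameter for BF16, FP8, and INT4, counts total parameters (a Mixture-of-Experts model keeps every expert resident even though only some are active per token), and ignores KV cache, activations, and runtime overhead. The function names are illustrative, and the result is a floor, not a sizing recommendation.

```python
import math

# Bytes per parameter for common weight precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_billion, precision):
    """Approximate GB for weights alone (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives gigabytes directly.
    KV cache, activations, and framework overhead are not included.
    """
    return total_params_billion * BYTES_PER_PARAM[precision]


def min_gpus(total_params_billion, precision, gpu_memory_gb):
    """Smallest GPU count whose combined memory holds the weights."""
    needed = weight_memory_gb(total_params_billion, precision)
    return math.ceil(needed / gpu_memory_gb)


# GLM-5.1 is 744B total parameters and this article cites an FP8 checkpoint.
print(weight_memory_gb(744, "fp8"))   # weights-only floor in GB
print(min_gpus(744, "fp8", 80))       # 80 GB cards needed for weights alone
```

For GLM-5.1 at FP8 this gives 744 GB for weights alone, which is already above the 640 GB available on eight 80 GB cards. Real deployments need headroom on top of this floor, and more still for long contexts.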
Companion guide
For the full landscape (every major open model, how the licenses really differ, and where each one fits in a production stack), see our open-source LLMs landscape for 2026.
Which model should you pick?
The decision block. Match each pick below to your real constraint, not to the highest leaderboard number.
- Pick Kimi K2.6 if you want the highest neutral composite intelligence in open weights (AA Index 54) and your agent loop is cache-heavy enough to exploit the 83% input-cache discount. Best general-purpose open coder if budget is secondary to capability.
- Pick DeepSeek V4-Flash if cost or self-host footprint is the binding constraint. It is roughly an order of magnitude cheaper per token than the field, within ~1.6 SWE-bench points of V4-Pro, ships a 1M context, and is the smallest credible model here (284B). Default for high-volume CI agents and batch work.
- Pick DeepSeek V4-Pro if your workload is long-horizon agentic engineering and you want the neutral GDPval-AA leader (1554) plus a 1M context — and you can absorb its higher output verbosity in the budget.
- Pick GLM-5.1 if you need an unencumbered MIT license, a real independent coding signal (1530 Code Arena Elo), and you do not need more than a 200K context. The sovereignty/compliance pick.
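The picks above amount to filtering on hard constraints first and then ordering by the one thing you care about most. The sketch below encodes that idea with figures from this article's tables; the field names, the `shortlist` function, and the choice to express the 1M context as 1000K are illustrative assumptions, and `None` marks a value this article does not report.

```python
# Figures copied from this article; context is in thousands of tokens.
MODELS = [
    {"name": "Kimi K2.6", "context_k": 256, "clean_mit": False,
     "aa_index": 54, "gdpval_aa": 1484, "out_price": 4.00},
    {"name": "DeepSeek V4-Pro", "context_k": 1000, "clean_mit": True,
     "aa_index": 52, "gdpval_aa": 1554, "out_price": 3.48},
    {"name": "DeepSeek V4-Flash", "context_k": 1000, "clean_mit": True,
     "aa_index": 47, "gdpval_aa": None, "out_price": 0.28},
    {"name": "GLM-5.1", "context_k": 200, "clean_mit": True,
     "aa_index": 51, "gdpval_aa": 1535, "out_price": 4.40},
]


def shortlist(min_context_k=0, require_clean_mit=False, priority="aa_index"):
    """Drop models that fail a hard constraint, then order by one priority.

    Scores sort highest first; "out_price" sorts cheapest first.
    Models with no reported value for the priority are left out.
    """
    fits = [
        m for m in MODELS
        if m["context_k"] >= min_context_k
        and (m["clean_mit"] or not require_clean_mit)
        and m[priority] is not None
    ]
    cheapest_first = priority == "out_price"
    ordered = sorted(fits, key=lambda m: m[priority], reverse=not cheapest_first)
    return [m["name"] for m in ordered]


print(shortlist(priority="aa_index"))
print(shortlist(priority="out_price"))
print(shortlist(require_clean_mit=True, priority="gdpval_aa"))
print(shortlist(min_context_k=500))
```

Treating license and context as filters and everything else as a sort key keeps a high benchmark score from overriding a constraint your legal or infrastructure team cannot relax.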
What is the honest caveat about these numbers?
Be skeptical of any clean ranking — including the headline of this article. Three caveats matter:
- Vendor SWE-bench / Terminal-Bench / LiveCodeBench numbers are not cross-comparable. Each vendor runs its own scaffold, retry budget, and effort setting. Kimi and GLM both claim #1 on SWE-bench Pro within 0.2 points of each other — that gap is inside the noise of differing harnesses. Use vendor numbers to confirm a model is in the frontier band, not to rank within it.
- Neutral data is incomplete. Artificial Analysis covers all three on the Intelligence Index and GDPval-AA, but LMArena's Code Arena has a public independent Elo only for GLM-5.1 at this time. "No independent number" is not "a low number" — it is missing data, and we have marked it as such rather than backfilling with a vendor figure.
- Pricing moves and varies by route. List prices here are the standing rates at time of writing; DeepSeek in particular publishes time-window discounts, and third-party hosts re-price all three. Re-check the vendor and Artificial Analysis pages before you commit a budget.
The defensible conclusion is the modest one in the Quick Answer: at the top of the open-weights band these three are close enough that the right pick is decided by your license, cost, context, and self-host constraints — not by a single benchmark number.
FAQ
Is Kimi K2.6 better than DeepSeek V4 for coding?
On neutral data it depends on the task. Kimi K2.6 leads the Artificial Analysis Intelligence Index composite (54 vs DeepSeek V4-Pro's 52), but DeepSeek V4-Pro wins Artificial Analysis's GDPval-AA agentic real-world benchmark (1554 vs Kimi's 1484). For single-shot reasoning Kimi edges it; for long-horizon agentic engineering DeepSeek V4-Pro edges it. Vendor SWE-bench Verified numbers (80.2% vs 80.6%) are effectively tied.
Which is the cheapest of the three?
DeepSeek V4-Flash by a wide margin: vendor list price is $0.14 per 1M input and $0.28 per 1M output tokens, roughly an order of magnitude below Kimi K2.6 ($0.95/$4.00, AA-verified) and GLM-5.1 ($1.40/$4.40, vendor list). It still scores within about 1.6 points of V4-Pro on vendor SWE-bench Verified, which makes it the value pick for high-volume coding workloads.
Do all three have truly open licenses?
DeepSeek V4 and GLM-5.1 are released under a clean, unmodified MIT license — commercial use, modification, and redistribution with no usage restrictions. Kimi K2.6 uses a Modified MIT license that adds a visible-attribution requirement for very large deployments. All three publish full weights on Hugging Face, so all three are genuinely self-hostable; the Kimi clause is the only license asterisk among them.
Which has the largest context window?
DeepSeek V4 — both V4-Pro and V4-Flash expose a 1M-token context window. Kimi K2.6 offers 256K and GLM-5.1 offers 200K. For whole-repository agentic tasks the DeepSeek 1M window is a real advantage, but remember that filling a 1M window costs input tokens every turn, so a smaller window you do not over-fill can be cheaper in practice.
Why do vendor benchmarks disagree with neutral ones?
Vendors run benchmarks with their own scaffolds, retry budgets, and effort/thinking settings, each tuned to show the model at its best. Independent harnesses like Artificial Analysis apply one fixed methodology to every model, which is why their numbers are comparable across models and vendor numbers are not. Trust vendor numbers to confirm a model is frontier-class; trust neutral numbers to rank models against each other.
Which is easiest to self-host?
DeepSeek V4-Flash, because at 284B total parameters it is by far the smallest of the four model variants while still shipping a 1M context and a clean MIT license. GLM-5.1's FP8 checkpoint needs roughly 860GB of VRAM, which is more than 8x H100 80GB (640GB) provides; Kimi K2.6 wants 8x H200-class at full precision (though native INT4 quantization-aware training lets it run on 4x H100 at reduced context). All three support vLLM and SGLang.
Which one should a team standardise on in 2026?
There is no single default; pick by constraint: DeepSeek V4-Flash for cost-sensitive high-volume work, Kimi K2.6 if you want the top neutral intelligence and can pay for it, GLM-5.1 if a clean MIT license and independent coding-Elo evidence matter for compliance. Many teams run two, a cheap model (V4-Flash) for bulk agent work and a stronger one (Kimi K2.6 or V4-Pro) for hard tasks, rather than forcing a single choice.