Kimi K2.6 vs DeepSeek V4 vs GLM-5.1: The Open-Weights Coding Verdict (2026)

Kimi K2.6 vs DeepSeek V4 vs GLM-5.1 for coding in 2026 — sourced benchmarks, real cost-per-task, self-host and license comparison, plus a clear pick-X-if decision block.

Published 18 May 2026 • Updated 06 Jul 2026 • 11 min read

Quick answer. No model wins outright. On neutral testing (Artificial Analysis Intelligence Index v4.0) Kimi K2.6 leads open weights at 54, DeepSeek V4-Pro follows at 52, and GLM-5.1 sits at 51. Pick Kimi K2.6 for the top neutral composite intelligence score, DeepSeek V4-Pro for the best agentic real-world score and now-permanent $0.435/$0.87-per-1M pricing, GLM-5.1 for a clean MIT license and frontier SWE-bench Pro scores.

Heads-up: a newer generation of this matchup exists. Since this comparison was written, Moonshot shipped Kimi K2.7-Code (June 11) and Z.ai shipped GLM-5.2 (June 13). The analysis below still holds for K2.6/GLM-5.1, but for the current-gen verdict read: Kimi K2.7 vs DeepSeek V4, Kimi K2.7 vs GLM 5.2, and the Kimi K2.7 complete guide.

Updated 2026-05-23: DeepSeek V4-Pro pricing flipped to permanent $0.435 / $0.87 per 1M input/output (cache-hit $0.003625/M) on 2026-05-22, per api-docs.deepseek.com. Table and cost-per-task section updated; rest of analysis (benchmarks, positioning) unchanged.

Updated 2026-06-14: GLM 5.2 has shipped. Z.ai launched GLM 5.2 on June 13, 2026 with a 1M-token context window and MIT-licensed open weights arriving the week after launch. Z.ai has not published benchmarks at launch, so the GLM-5.1 numbers below remain the published baseline for this comparison. For head-to-head on the new model: GLM 5.2 vs DeepSeek V4, GLM 5.2 vs Claude Opus 4.8, GLM 5.2 vs GPT-5.5.

By April 2026 the open-weights coding race stopped being a story about catching closed models and became a story about which open model to standardise on. Three names dominate the shortlist: Moonshot's Kimi K2.6, DeepSeek V4 (shipped as V4-Pro and V4-Flash), and Z.ai's GLM-5.1. All three are downloadable, all three are credible against the closed frontier on coding, and all three pick a different hill to win on.

The honest problem: there is no clean three-way canonical leaderboard. Independent harnesses (Artificial Analysis, LMArena) cover the models unevenly, and the vendors each publish numbers on the benchmarks where they look best. This piece separates neutral, independently-measured data from vendor-reported data, says explicitly where neutral data is missing, and ends with a decision block you can act on. The single sourced 3-column table below is the citable asset — read it first.

What are Kimi K2.6, DeepSeek V4 and GLM-5.1?

All three are large Mixture-of-Experts models released within a three-week window in April 2026, all with open weights on Hugging Face.

Kimi K2.6 (Moonshot AI, released 2026-04-20): 1T total parameters, ~32B active per token, 256K context, configurable thinking/instant modes. Released under a Modified MIT license (permissive, with a visible-attribution clause for very large deployments).
DeepSeek V4 (DeepSeek, public preview 2026-04-24): two hosted variants from one family. V4-Pro is 1.6T total / ~49B active; V4-Flash is 284B total / ~13B active. Both expose a 1M-token context. Weights and repo are MIT licensed.
GLM-5.1 (Z.ai, formerly Zhipu, released 2026-04-07): 744B total / ~40B active, 200K context, tuned for long-horizon agentic engineering. MIT licensed.

For background on each model individually, see our Kimi K2.6 complete guide and our DeepSeek V4 complete guide; this article is the head-to-head.

How do they compare on the master benchmark table?

This is the load-bearing table. Every benchmark row is labelled (neutral) if it comes from an independent harness (Artificial Analysis or LMArena) or (vendor) if it is self-reported by the model maker. Where a neutral number does not exist, the cell says so explicitly rather than substituting a vendor number silently.

Dimension	Kimi K2.6	DeepSeek V4-Pro (+Flash)	GLM-5.1
AA Intelligence Index v4.0 (neutral)	54 (top open-weights)	52 (Pro, Max effort); 47 (Flash, Max)	51
AA GDPval-AA, agentic real-world tasks (neutral)	1484	1554 (Pro, leads open weights)	1535
LMArena Code Arena Elo (neutral)	not independently reported	not independently reported	1530
SWE-bench Verified (vendor)	80.2%	80.6% Pro / 79.0% Flash	77.8%
SWE-bench Pro (vendor)	58.6% (vendor claims #1)	close behind, exact figure not cleanly reported	58.4% (vendor claims #1)
Terminal-Bench 2.0 (vendor)	not cleanly reported	67.9% (Pro)	strong but exact figure not cleanly reported
LiveCodeBench Pass@1 (vendor)	not cleanly reported	93.5% Pro / 91.6% Flash	not cleanly reported
Context window	256K	1M (both variants)	200K
List price, $ / 1M in→out	$0.95 → $4.00 (AA-verified)	Pro $0.435 → $0.87 (cache-hit $0.003625/M); Flash $0.14 → $0.28 (vendor list, permanent as of 2026-05-22)	$1.40 → $4.40 (vendor list)
Architecture	1T MoE / ~32B active	Pro 1.6T / ~49B; Flash 284B / ~13B	744B MoE / ~40B active
License	Modified MIT	MIT	MIT

How to read this honestly. The two rows you can trust as apples-to-apples are the Artificial Analysis ones (Intelligence Index and GDPval-AA), because Artificial Analysis runs the same harness against every model. There, Kimi K2.6 leads on the composite Intelligence Index (54), but DeepSeek V4-Pro wins the GDPval-AA agentic real-world benchmark (1554). The third neutral row, LMArena Code Arena, has a figure only for GLM-5.1 (1530 Elo), so it cannot rank the three against each other. The vendor SWE-bench / Terminal-Bench / LiveCodeBench rows are useful but not directly comparable across vendors because harness configs, scaffolds, and effort settings differ; treat them as each vendor's best case, not a ranking.

What does the neutral data actually say?

Strip out the vendor numbers and the picture narrows to three independent signals:

Artificial Analysis Intelligence Index v4.0 (a composite over GDPval-AA, Terminal-Bench Hard, SciCode, AA-LCR, GPQA Diamond and others): Kimi K2.6 54, DeepSeek V4-Pro (Max effort) 52, GLM-5.1 51, DeepSeek V4-Flash (Max) 47. The top three are within 3 points — effectively a statistical tie at the composite level, with Kimi nominally first.
GDPval-AA (Artificial Analysis's agentic real-world work benchmark): here the order flips — DeepSeek V4-Pro 1554, GLM-5.1 1535, Kimi K2.6 1484. If your workload is long-horizon agentic engineering rather than single-shot reasoning, V4-Pro is the neutral leader.
LMArena Code Arena: GLM-5.1 has an independently confirmed 1530 Elo. As of this writing there is no comparable independent Code Arena number on record for Kimi K2.6 or DeepSeek V4 — so GLM-5.1 is the only one of the three with a public, independent human-preference coding signal. That is a gap in the neutral data, not a point against the other two.

Net: on the strongest neutral evidence available, this is not a blowout. Kimi K2.6 edges the composite, DeepSeek V4-Pro edges the agentic-work benchmark, and GLM-5.1 owns the only independent coding-Elo signal.
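
The flip in ordering between the two Artificial Analysis rows is easy to check by sorting the table values. This is a minimal sketch: the scores are copied from the table above, while the dictionary layout and the `rank_by` helper are illustrative names, not anything Artificial Analysis publishes.

```python
# Neutral scores copied from the master table; higher is better on both metrics.
NEUTRAL_SCORES = {
    "Kimi K2.6": {"aa_index": 54, "gdpval_aa": 1484},
    "DeepSeek V4-Pro": {"aa_index": 52, "gdpval_aa": 1554},
    "GLM-5.1": {"aa_index": 51, "gdpval_aa": 1535},
}


def rank_by(metric: str) -> list[str]:
    """Return model names sorted best-first on one neutral metric."""
    return sorted(
        NEUTRAL_SCORES,
        key=lambda model: NEUTRAL_SCORES[model][metric],
        reverse=True,
    )


if __name__ == "__main__":
    for metric in ("aa_index", "gdpval_aa"):
        print(metric, rank_by(metric))
```

Kimi K2.6 is first on `aa_index` and last on `gdpval_aa`, which is the reason neither row alone settles the ranking.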

Which is cheapest per coding task?

Per-token list price is not cost-per-task — verbosity matters. Artificial Analysis measured how many output tokens each model burns to run its own Intelligence Index, and the DeepSeek V4 family is notably token-hungry: V4-Pro used ~190M output tokens, V4-Flash ~240M, for the same benchmark suite. A cheap per-token price multiplied by high verbosity can erase the headline advantage.
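
To make the verbosity point concrete, the sketch below multiplies the output-token counts quoted above by the vendor list output prices from the master table. It is a rough illustration only: it ignores input tokens and caching, and the `output_spend` helper is a hypothetical name.

```python
# Output tokens used to run the AA Intelligence Index, in millions (quoted above).
OUTPUT_TOKENS_M = {"V4-Pro": 190, "V4-Flash": 240}

# Vendor list output prices in $ per 1M tokens (from the master table).
OUTPUT_PRICE_PER_M = {"V4-Pro": 0.87, "V4-Flash": 0.28}


def output_spend(model: str) -> float:
    """Output-only spend in dollars for the benchmark run (input cost ignored)."""
    return round(OUTPUT_TOKENS_M[model] * OUTPUT_PRICE_PER_M[model], 2)


if __name__ == "__main__":
    for model in OUTPUT_TOKENS_M:
        print(f"{model}: ${output_spend(model):.2f}")
```

On these figures V4-Pro's output spend is about $165 and V4-Flash's about $67. The per-token price gap of roughly 3.1× shrinks to roughly 2.5× once V4-Flash's extra output tokens are counted.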

Working the list prices into a rough cost-per-task model (assume a typical agentic coding task is input-heavy, ~5:1 input:output):

DeepSeek V4-Flash ($0.14 in / $0.28 out, vendor list) is still the cheapest competent option — about 3× cheaper than V4-Pro per token and an order of magnitude below Kimi K2.6 and GLM-5.1. It lands within ~1.6 points of V4-Pro on vendor SWE-bench Verified (79.0% vs 80.6%), which makes it the default for high-volume CI agents, batch refactors, and test generation.
Kimi K2.6 ($0.95 in / $4.00 out, AA-verified) is mid-priced with an 83% cache-hit discount on input ($0.16 cached) — strong for cache-heavy agent loops that re-send a stable system prompt and repo map every turn.
DeepSeek V4-Pro ($0.435 in / $0.87 out, vendor list; permanent as of 2026-05-22, per api-docs.deepseek.com) is now dramatically cheaper than it was at launch. With the 75% price cut made permanent and a cache-hit price of $0.003625/M (more than 99% below the uncached input rate), V4-Pro is roughly 3× the price of V4-Flash, about 2× cheaper than Kimi K2.6 on input, and about 4.6× cheaper on output ($0.87 vs $4.00), while still being the GDPval-AA neutral leader. Output verbosity remains the real cost driver, but the post-2026-05-22 economics make V4-Pro a much more comfortable default for long-horizon agentic work than it was a week earlier.
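GLM-5.1 ($1.40 in / $4.40 out, vendor list) is now the most expensive of the three on both input and output: about 47% above Kimi K2.6 on input, 10% above it on output, and roughly 3× to 5× V4-Pro's rates. That is offset by a 200K context (at list prices a full 200K window costs $0.28 of input per turn, versus $0.435 for a full 1M window on V4-Pro) and the strongest license story.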
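
The list prices above can be worked into a rough per-task figure. This is a sketch under stated assumptions: the 500K-input / 100K-output task size is hypothetical (chosen only to match the ~5:1 ratio), every model is assumed to emit the same number of output tokens (which the verbosity data suggests is not true in practice), and cached-input prices are filled in only where this article quotes them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    input_per_m: float                          # $ per 1M uncached input tokens
    output_per_m: float                         # $ per 1M output tokens
    cached_input_per_m: Optional[float] = None  # None: not quoted in this article


PRICES = {
    "Kimi K2.6": Price(0.95, 4.00, 0.16),
    "DeepSeek V4-Pro": Price(0.435, 0.87, 0.003625),
    "DeepSeek V4-Flash": Price(0.14, 0.28),
    "GLM-5.1": Price(1.40, 4.40),
}


def task_cost(model: str, input_tokens: int, output_tokens: int,
              cache_hit_rate: float = 0.0) -> float:
    """Dollar cost of one task; cache_hit_rate is the cached share of input."""
    p = PRICES[model]
    # With no quoted cache price, fall back to the full input rate.
    cached_price = p.input_per_m if p.cached_input_per_m is None else p.cached_input_per_m
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    total = fresh * p.input_per_m + cached * cached_price + output_tokens * p.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical task: 500K input tokens, 100K output tokens (5:1).
    for model in PRICES:
        print(f"{model}: ${task_cost(model, 500_000, 100_000):.4f}")
```

With no caching, this hypothetical task comes to about $0.10 on V4-Flash, $0.30 on V4-Pro, $0.88 on Kimi K2.6 and $1.14 on GLM-5.1. Raising `cache_hit_rate` to 0.8 brings Kimi K2.6 down to about $0.56, which is the cache-heavy agent-loop case described above.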

Rule of thumb (post-pricing-flip): if cost dominates your decision, V4-Flash still wins by a wide margin. But the V4-Pro permanent price cut narrows the gap to V4-Flash from "order of magnitude" to roughly 3×, while keeping V4-Pro's GDPval-AA lead — so for agentic work that genuinely benefits from the bigger model, V4-Pro is now defensible where it was a budget stretch.

How do they compare on self-host and license?

All three are genuinely self-hostable, but the GPU bill and license terms differ.

	Kimi K2.6	DeepSeek V4-Pro / Flash	GLM-5.1
License	Modified MIT (attribution clause at very large scale)	MIT (cleanest)	MIT (cleanest)
Inference engines	vLLM, SGLang, KTransformers (Moonshot's own)	vLLM, SGLang	vLLM, SGLang, llama.cpp, Unsloth
Practical full-precision hardware	~8× H200-class; 4× H100 viable with native INT4 (QAT) at reduced context	Pro is the heaviest (1.6T); Flash (284B) is the easiest of all three to self-host	~8× H100 80GB / ~860GB VRAM for the FP8 checkpoint
Self-host sweet spot	Teams wanting top intelligence and willing to pin vLLM/SGLang versions + QAT INT4	V4-Flash: smallest credible model, 1M context, MIT	Sovereignty/compliance teams that need a clean MIT license and a 200K context

If "clean license, no asterisks" is a hard requirement (legal, regulated, or resale scenarios), GLM-5.1 and DeepSeek V4 are unencumbered MIT; Kimi K2.6's Modified MIT only adds friction at very large deployment scale, but it is an asterisk a lawyer will flag. If "smallest model I can actually run on my own GPUs" is the constraint, DeepSeek V4-Flash at 284B total is the clear answer and still ships the 1M context window.

Companion guide

For the full landscape — every major open model, how the licenses really differ, and where each one fits in a production stack, see our open-source LLMs landscape for 2026.

Which model should you pick?

The decision block. Match each pick to your real constraint, not to the highest leaderboard number.

Pick Kimi K2.6 if you want the highest neutral composite intelligence in open weights (AA Index 54) and your agent loop is cache-heavy enough to exploit the 83% input-cache discount. Best general-purpose open coder if budget is secondary to capability.
Pick DeepSeek V4-Flash if cost or self-host footprint is the binding constraint. It is roughly 3× cheaper per token than V4-Pro and an order of magnitude below Kimi K2.6 / GLM-5.1, within ~1.6 SWE-bench points of V4-Pro, ships a 1M context, and is the smallest credible model here (284B). Default for high-volume CI agents and batch work.
Pick DeepSeek V4-Pro if your workload is long-horizon agentic engineering and you want the neutral GDPval-AA leader (1554) plus a 1M context. The May 2026 permanent price flip ($0.435/$0.87 per 1M) made the per-token cost case substantially easier than at launch; output verbosity is still the cost driver, but V4-Pro is now defensible as a daily-driver, not just a hard-task model.
Pick GLM-5.1 if you need an unencumbered MIT license, a real independent coding signal (1530 Code Arena Elo), and you do not need more than a 200K context. The sovereignty/compliance pick.
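
The four rules above can be encoded as a small lookup. This is one possible reading, not a definitive policy: the `pick_model` function, its parameter names, and the order in which constraints are checked are all assumptions made for illustration; only the context limits and the model names come from the table.

```python
def pick_model(*, cost_bound: bool = False, needs_clean_mit: bool = False,
               long_horizon_agentic: bool = False, context_tokens: int = 0) -> str:
    """Map constraints to a pick. The precedence order is an assumption."""
    if cost_bound:
        # Cheapest per token, smallest footprint, 1M context.
        return "DeepSeek V4-Flash"
    if context_tokens > 256_000:
        # Only the DeepSeek V4 family exposes more than 256K context.
        return "DeepSeek V4-Pro"
    if long_horizon_agentic:
        # Neutral GDPval-AA leader.
        return "DeepSeek V4-Pro"
    if needs_clean_mit:
        # GLM-5.1 is unmodified MIT but tops out at 200K context;
        # beyond that, DeepSeek V4 is the other unmodified-MIT option.
        return "GLM-5.1" if context_tokens <= 200_000 else "DeepSeek V4-Pro"
    # Default: highest neutral composite score (AA Index 54).
    return "Kimi K2.6"


if __name__ == "__main__":
    print(pick_model(needs_clean_mit=True, context_tokens=150_000))
```

Reordering the checks changes the outcome when constraints collide (for example, a cost-bound team that also needs a clean license), so treat the precedence as something to adapt.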

What is the honest caveat about these numbers?

Be skeptical of any clean ranking — including the headline of this article. Three caveats matter:

Vendor SWE-bench / Terminal-Bench / LiveCodeBench numbers are not cross-comparable. Each vendor runs its own scaffold, retry budget, and effort setting. Kimi and GLM both claim #1 on SWE-bench Pro within 0.2 points of each other — that gap is inside the noise of differing harnesses. Use vendor numbers to confirm a model is in the frontier band, not to rank within it.
Neutral data is incomplete. Artificial Analysis covers all three on the Intelligence Index and GDPval-AA, but LMArena's Code Arena has a public independent Elo only for GLM-5.1 at this time. "No independent number" is not "a low number" — it is missing data, and we have marked it as such rather than backfilling with a vendor figure.
Pricing moves and varies by route. DeepSeek V4-Pro's launch pricing was a time-limited 75% promotional discount; on 2026-05-22 the company made it permanent at $0.435/$0.87 per 1M (cache-hit $0.003625/M). Kimi K2.6 and GLM-5.1 list prices have held steady. Third-party hosts still re-price all three, and DeepSeek may run further promos on top of the new base, so re-check the vendor and Artificial Analysis pages before you commit a budget.

The defensible conclusion is the modest one in the Quick Answer: at the top of the open-weights band these three are close enough that the right pick is decided by your license, cost, context, and self-host constraints — not by a single benchmark number.

FAQ

Is Kimi K2.6 better than DeepSeek V4 for coding?

On neutral data it depends on the task. Kimi K2.6 leads the Artificial Analysis Intelligence Index composite (54 vs DeepSeek V4-Pro's 52), but DeepSeek V4-Pro wins Artificial Analysis's GDPval-AA agentic real-world benchmark (1554 vs Kimi's 1484). For single-shot reasoning Kimi has the edge; for long-horizon agentic engineering DeepSeek V4-Pro does. Vendor SWE-bench Verified numbers (80.2% vs 80.6%) are effectively tied.

Which is the cheapest of the three?

DeepSeek V4-Flash by a wide margin: vendor list price is $0.14 per 1M input and $0.28 per 1M output tokens, roughly an order of magnitude below Kimi K2.6 ($0.95/$4.00, AA-verified) and GLM-5.1 ($1.40/$4.40, vendor list), and about 3× cheaper than V4-Pro at its new permanent $0.435/$0.87 rate. It still scores within about 1.6 points of V4-Pro on vendor SWE-bench Verified, which makes it the value pick for high-volume coding workloads.

Do all three have truly open licenses?

DeepSeek V4 and GLM-5.1 are released under a clean, unmodified MIT license — commercial use, modification, and redistribution with no usage restrictions. Kimi K2.6 uses a Modified MIT license that adds a visible-attribution requirement for very large deployments. All three publish full weights on Hugging Face, so all three are genuinely self-hostable; the Kimi clause is the only license asterisk among them.

Which has the largest context window?

DeepSeek V4 — both V4-Pro and V4-Flash expose a 1M-token context window. Kimi K2.6 offers 256K and GLM-5.1 offers 200K. For whole-repository agentic tasks the DeepSeek 1M window is a real advantage, but remember that filling a 1M window costs input tokens every turn, so a smaller window you do not over-fill can be cheaper in practice.
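
The per-turn cost of a filled window is simple arithmetic on the list input prices. The sketch below assumes uncached input, treats "K" and "M" as decimal thousands and millions (the exact token limits may differ), and uses a hypothetical `turn_input_cost` helper.

```python
# (context window in tokens, list input price in $ per 1M tokens)
# Window sizes assume decimal K/M; prices are from the master table.
WINDOW_AND_PRICE = {
    "Kimi K2.6": (256_000, 0.95),
    "DeepSeek V4-Pro": (1_000_000, 0.435),
    "DeepSeek V4-Flash": (1_000_000, 0.14),
    "GLM-5.1": (200_000, 1.40),
}


def turn_input_cost(model: str, tokens_sent: int) -> float:
    """Uncached input cost in dollars for one turn of tokens_sent tokens."""
    window, price_per_m = WINDOW_AND_PRICE[model]
    if tokens_sent > window:
        raise ValueError(f"{tokens_sent} tokens exceeds {model}'s {window}-token window")
    return tokens_sent * price_per_m / 1_000_000


if __name__ == "__main__":
    for model, (window, _) in WINDOW_AND_PRICE.items():
        print(f"{model}: full window costs ${turn_input_cost(model, window):.4f} per turn")
```

On V4-Pro, sending 200K tokens costs about $0.087 per turn against $0.435 for the full 1M, so the large window only becomes expensive when it is actually filled. Cache hits would lower these figures further where a cached-input price applies.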

Why do vendor benchmarks disagree with neutral ones?

Vendors run benchmarks with their own scaffolds, retry budgets, and effort/thinking settings, each tuned to show the model at its best. Independent harnesses like Artificial Analysis apply one fixed methodology to every model, which is why their numbers are comparable across models and vendor numbers are not. Trust vendor numbers to confirm a model is frontier-class; trust neutral numbers to rank models against each other.

Which is easiest to self-host?

DeepSeek V4-Flash, because at 284B total parameters it is by far the smallest of the four model variants while still shipping a 1M context and a clean MIT license. GLM-5.1's FP8 checkpoint needs roughly 8× H100 80GB; Kimi K2.6 wants 8× H200-class at full precision (though native INT4 quantization-aware training lets it run on 4× H100 at reduced context). All three support vLLM and SGLang.

Which one should a team standardise on in 2026?

If you need one default: DeepSeek V4-Flash for cost-sensitive high-volume work, Kimi K2.6 if you want the top neutral intelligence and can pay for it, GLM-5.1 if a clean MIT license and independent coding-Elo evidence matter for compliance. Many teams run two — a cheap model (V4-Flash) for bulk agent work and a stronger one (Kimi K2.6 or V4-Pro, now meaningfully cheaper since the 2026-05-22 pricing flip) for hard tasks — rather than forcing a single choice.

Does GLM 5.2 change this comparison?

Not immediately. GLM 5.2 shipped on June 13, 2026 with a 1M-token context window and MIT-licensed open weights (the week after launch), but Z.ai has not published benchmarks for 5.2 yet, so on independently verifiable numbers GLM-5.1 remains the model in this table. If 5.2 holds 5.1's SWE-bench Pro and Terminal-Bench 2.0 results while adding the 1M window, it would match DeepSeek V4 on context length under the same MIT license. We'll update this comparison once independent third-party numbers land.