The most interesting open-weights coding duel in mid-2026 isn't between an open and a closed model. It's between two open ones. Moonshot's Kimi K2.7 Code and DeepSeek's V4 are both genuinely usable, both permissive on weights, and both pushing the envelope on a different axis. K2.7 bets on agent-loop quality and MCP tool use. V4 bets on raw per-token cost and battle-tested independent benchmarks. Here's how they compare where it matters.
Kimi K2.7 vs DeepSeek V4: at a glance
| Dimension | Kimi K2.7 Code | DeepSeek V4 |
|---|---|---|
| Maker | Moonshot AI (China) | DeepSeek (China) |
| Released | June 2026 | Q1 2026 |
| License | Modified MIT (open weights) | DeepSeek License v2 (commercial-friendly) |
| Architecture | 1T-param MoE (32B active), 384 experts | Large MoE (671B total / 37B active class) |
| Context window | 256K | 128K - 256K depending on tier |
| API pricing (input / output, USD per M tokens) | $0.95 / $4.00 (cache-hit input $0.19) | V4-Flash: $0.14 / $0.28; V4-Pro: $1.74 / $3.48 |
| Multi-modal | Vision (MoonViT 400M) + code | Text + code + vision (V4-Pro) |
| Coding positioning | Agentic + MCP-first | General-purpose with strong coding |
How do the coding benchmarks actually compare?
Honest answer: K2.7 has Moonshot's own numbers but no independent ones yet. V4 has both vendor and third-party numbers across most of the standard suites.
What we know about Kimi K2.7 (Moonshot benchmarks):
- Kimi Code Bench v2: 62.0 (up from 50.9 on K2.6)
- Program Bench: 53.6 (up from 48.3)
- MLS Bench Lite: 35.1 (up from 26.7)
- MCP Atlas: 76.0 (up from 69.4)
- MCP Mark Verified: 81.1 (up from 72.8)
- ~30% fewer thinking tokens vs K2.6 on equivalent tasks
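To put the vendor-reported jumps on a common scale, the short sketch below converts each K2.6 to K2.7 score pair listed above into a relative gain. The scores are copied from the list; the `SCORES` table and the `relative_gain` helper are illustrative names, not part of any Moonshot tooling.

```python
# Vendor-reported scores from the list above: (K2.6, K2.7).
SCORES = {
    "Kimi Code Bench v2": (50.9, 62.0),
    "Program Bench": (48.3, 53.6),
    "MLS Bench Lite": (26.7, 35.1),
    "MCP Atlas": (69.4, 76.0),
    "MCP Mark Verified": (72.8, 81.1),
}


def relative_gain(old: float, new: float) -> float:
    """Return the improvement of `new` over `old` as a percentage of `old`."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for bench, (k26, k27) in SCORES.items():
        print(f"{bench}: +{k27 - k26:.1f} points ({relative_gain(k26, k27):.1f}% relative)")
```

Relative gains make it easier to see which suites moved most, but they are still vendor numbers and say nothing about how K2.7 compares with V4 on the same suites.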
There are no independent K2.7 numbers at the time of writing for SWE-bench Verified, SWE-bench Pro, LiveCodeBench, Terminal-Bench, or AIDER Polyglot. Expect them 2-4 weeks post-release.
What we know about DeepSeek V4 (vendor + independent):
- SWE-bench Verified: ~88% on the favourable scaffold, ~76% on the standard one
- LiveCodeBench: ~85% Pass@1
- HumanEval: 95%+
- Strong on multi-language and a notable lead on Python repo refactors
The pattern: DeepSeek V4 looks stronger on the classic single-shot code-generation benches. K2.7 shows clear gains over K2.6 on long-horizon agentic + MCP benches, but those are vendor numbers with no comparable independent runs and no V4 results on the same suites. For “produce a 200-line patch given a complete spec” V4 has more public mileage. For an MCP-orchestrated agent loop, K2.7 is the better-aimed model.
How different is the context window?
Both nominally support 256K, but the realities differ.
K2.7 fully supports 256K as its standard context. The MoE attention layout (61 layers with Multi-head Latent Attention) was tuned for the full window.
V4's standard context is 128K, extended to 256K on V4-Pro. Real-world agents typically cap context lower to control cost. V4-Flash's economics make 200K-class agentic runs cost-effective; on V4-Pro, the larger window is available but more expensive.
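For repo-scale agents that need to read 300+ files (~600K tokens), neither model is the right choice; that's GLM 5.2's territory. A focused 50-100 file slice fits comfortably in 256K at the same density. Toward 150 files (~300K tokens at roughly 2K tokens per file), expect to prune or summarize.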
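A rough way to sanity-check whether a file slice fits a given window is sketched below. It assumes about 2K tokens per file, which is simply the ratio implied by the 300 files ≈ 600K tokens figure above, plus a hypothetical 16K-token reserve for the system prompt and the model's output. Both values are assumptions to replace with measurements from your own repository and tokenizer.

```python
# Assumption: ~2K tokens per file, derived from "300 files ~ 600K tokens".
TOKENS_PER_FILE = 2_000
# Assumption: headroom for system prompt, tool schemas, and generated output.
RESERVE_TOKENS = 16_000


def fits_context(
    file_count: int,
    window_tokens: int,
    tokens_per_file: int = TOKENS_PER_FILE,
    reserve_tokens: int = RESERVE_TOKENS,
) -> bool:
    """Return True if the estimated prompt size fits inside the window."""
    needed = file_count * tokens_per_file + reserve_tokens
    return needed <= window_tokens


if __name__ == "__main__":
    for files in (50, 100, 300):
        for window in (128_000, 256_000):
            verdict = "fits" if fits_context(files, window) else "does not fit"
            print(f"{files} files in a {window // 1000}K window: {verdict}")
```

A real check would count tokens with the model's own tokenizer rather than a per-file average, since file sizes in most repositories are heavily skewed.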
What do the token economics look like?
DeepSeek V4-Flash at $0.14 input / $0.28 output is the cheapest serious coding API on the market today: about 36× cheaper than GPT-5.5 on input and over 100× cheaper on output. V4-Pro at $1.74 / $3.48 is still less than half the price of Claude Sonnet on output.
Kimi K2.7 at $0.95 / $4.00 native (or $0.75 / $3.50 via OpenRouter) is in a different price band than V4-Flash. Against V4-Pro it is cheaper on input ($0.95 vs $1.74) but more expensive on output ($4.00 vs $3.48). The killer feature is cache-hit pricing at $0.19 per M input tokens: for agents that re-traverse the same codebase (a very common pattern), cached input costs 80% less than the list rate.
The pricing logic: V4-Flash dominates if your workload is “many short single-shot completions.” K2.7 dominates if your workload is “long-horizon agent loops over a known codebase with prompt caching.” The crossover is task-shape dependent. Run both on 100 of your real tasks and compute per-task cost — neither headline price tells the full story.
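The per-task arithmetic can be sketched as follows. The prices are the list prices quoted in this article; the token counts and the 90% cache-hit rate in the usage example are made-up illustrative values, and the sketch assumes no cache discount for either V4 tier because none is quoted here. `Price` and `task_cost` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    """List prices in USD per million tokens."""
    input: float
    output: float
    cached_input: Optional[float] = None  # None: no cache discount assumed


PRICES = {
    "kimi-k2.7": Price(0.95, 4.00, cached_input=0.19),
    "deepseek-v4-flash": Price(0.14, 0.28),  # assumption: no cache pricing
    "deepseek-v4-pro": Price(1.74, 3.48),    # assumption: no cache pricing
}


def task_cost(price: Price, input_tokens: int, output_tokens: int,
              cache_hit_rate: float = 0.0) -> float:
    """Return the USD cost of one task, splitting input into cached and fresh."""
    cached_rate = price.cached_input if price.cached_input is not None else price.input
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    total = fresh * price.input + cached * cached_rate + output_tokens * price.output
    return total / 1_000_000


if __name__ == "__main__":
    # Illustrative agent loop: 2M input tokens, 50K output tokens, 90% cache hits.
    for name, price in PRICES.items():
        cost = task_cost(price, 2_000_000, 50_000, cache_hit_rate=0.9)
        print(f"{name}: ${cost:.3f} per task")
```

With these illustrative inputs, K2.7 lands far below V4-Pro but still above V4-Flash on token price alone. The comparison only becomes meaningful once you plug in the token counts, cache-hit rates, and retry counts each model actually produces on your tasks.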
Self-hosting: the two paths compared
Both ship open weights, both are self-hostable, both are MoE in roughly the same size class.
DeepSeek V4 has been in the wild for months. The vLLM, TensorRT-LLM, and SGLang teams have shipped multiple rounds of optimizations specifically for V4's MoE shape. Hosted inference is available from every major provider, often at sub-$0.50 per million tokens for V4-Flash equivalents. Quantized variants (4-bit, FP8) are well-tested and ship with usable quality.
K2.7 weights are available with the release. Inference-engine optimization typically follows by 1-2 weeks for the major engines; hosted endpoints arrive on similar timelines. Plan for an extra month of maturation before treating K2.7 self-hosting as production-grade. If you need open weights today, DeepSeek V4 is the safer call. If your timeline is “Q3 2026 or later,” K2.7 is fully in scope.
Does multi-modal tip the decision?
Both support image inputs. K2.7 ships the MoonViT vision encoder (400M parameters); V4-Pro supports vision input natively. Both handle the common image-to-code workflows (screenshots, mockups, design specs). Neither has audio.
For pure-text agentic coding (the most common case), it's a wash.
Who should pick Kimi K2.7?
- MCP-orchestrated agent stacks. The 76.0 MCP Atlas and 81.1 MCP Mark Verified scores reflect real workflow improvements; the model was tuned for this.
- Workloads with high prompt-cache reuse. $0.19 per M cache-hit input tokens changes the cost shape for agents that re-traverse the same codebase repeatedly.
- Thinking-token-bound runs. Roughly 30% fewer thinking tokens vs K2.6 on equivalent tasks compounds over a fleet of agentic runs.
- Teams investing in the open-weights long game. Modified-MIT means full fine-tuning, distillation, and custom RLHF on internal code.
Who should pick DeepSeek V4?
- High-volume API workloads. V4-Flash's $0.14 / $0.28 pricing is the lowest serious coding API rate available. You can burst a million bug-triage runs through it for a small fraction of what other APIs would charge.
- Single-shot code generation. Public SWE-bench Verified scores of ~76% on the standard scaffold (~88% on the favourable one) reflect mature behaviour for “produce a patch from a spec.”
- Production agents that need the model today. The hosted inference and self-hosting ecosystem is mature; K2.7's hosted endpoints take 1-2 weeks to catch up.
- Mixed text + image workloads where V4-Pro fits the budget. The vision integration has months of mileage.
The decision in a line
If your agent is MCP-orchestrated or cache-hit-heavy → K2.7. If your workload is high-volume single-shot completions or you need the model in production this week → V4. If your task shape is in between, run both side-by-side on 100 representative tasks this week; the per-task cost gap and per-task quality gap both reveal themselves quickly.
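One way to read the results of such a side-by-side run is sketched below: record each attempt as a pass/fail plus its cost, then compare pass rate, mean cost, and cost per solved task. The `Run` record and `summarize` function are hypothetical, and the sketch assumes you already have your own harness that executes tasks and judges whether each one passed.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Run:
    """One model attempt at one task, as recorded by your own harness."""
    model: str
    task_id: str
    passed: bool
    cost_usd: float


def summarize(runs: Iterable[Run]) -> Dict[str, Dict[str, float]]:
    """Aggregate pass rate, mean cost, and cost per solved task per model."""
    by_model: Dict[str, List[Run]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)

    summary: Dict[str, Dict[str, float]] = {}
    for model, model_runs in by_model.items():
        solved = sum(1 for r in model_runs if r.passed)
        total_cost = sum(r.cost_usd for r in model_runs)
        summary[model] = {
            "pass_rate": solved / len(model_runs),
            "mean_cost_usd": total_cost / len(model_runs),
            # Infinite when nothing passed, so the model sorts last on this metric.
            "cost_per_solved_usd": total_cost / solved if solved else float("inf"),
        }
    return summary
```

Cost per solved task is the number that folds the quality gap and the price gap into one figure: a cheaper model that fails more often can end up costing more per successful patch.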
FAQ
Is Kimi K2.7 better than DeepSeek V4 for coding?
On MCP-orchestrated multi-tool agent loops, K2.7's 76.0 MCP Atlas + 81.1 MCP Mark Verified scores reflect real gains. On single-shot SWE-bench-style code generation, V4 has more public mileage and proven independent numbers. “Better” depends on workload shape.
Which is cheaper, Kimi K2.7 or DeepSeek V4?
V4-Flash at $0.14 / $0.28 per M tokens is currently the cheapest serious coding API. K2.7 at $0.95 / $4.00 is materially more expensive on flat token cost, but cache-hit input at $0.19 cuts input cost by 80% for agents that re-traverse the same codebase. Compute per-task cost on real workloads; neither headline price tells the full story.
Can I self-host both models?
Yes. V4 has months of inference-engine optimization. K2.7 weights ship with release; engine support is typically 1-2 weeks behind. Both need an 8×H100 node for usable serving at full context.
Do both models support 256K context?
K2.7 supports 256K as standard. V4 standard is 128K, extended to 256K on V4-Pro.
Do both models support image inputs?
Yes. K2.7 has the MoonViT vision encoder; V4-Pro supports vision natively. Both handle common image-to-code workflows.