Open Source

Best Open-Source LLM (2026): Kimi K2.6 vs DeepSeek V4 Pro vs GLM-5.1 vs Llama 4 vs Qwen 3.6

Five frontier-class open-weight LLMs shipped in 30 days. Real benchmarks, licenses, hosting costs, and a decision matrix for CTOs picking their 2026 stack.

Published 03 May 2026 • Updated 23 May 2026 • 12 min read

Updated May 23, 2026

Quick answer. As of May 2026, Kimi K2.6 (Moonshot AI) is the best open-source LLM overall — it tops the Artificial Analysis Intelligence Index among all open-weight models (54), ranking #4 of every model behind only Anthropic, Google, and OpenAI. DeepSeek V4 Pro leads agentic work and ties closed frontier on SWE-Bench; GLM-5.1 is the cleanest-license pick.

The open-weights frontier moves monthly, and the answer to "what's the best open source LLM in 2026?" is no longer the model that was best in March. Between early April and mid-May 2026, Moonshot shipped Kimi K2.6, Z.ai shipped GLM-5.1, DeepSeek shipped V4 Pro and V4 Flash, Xiaomi shipped MiMo-V2.5-Pro, MiniMax open-sourced M2.7, Google released Gemma 4, Alibaba released Qwen 3.6, and Ant Group's inclusionAI released Ring-2.6-1T. The buyer's question flipped from "can we get away with open?" to "which open-weight model do we standardize on for the next 12 months?"

Update — May 22, 2026: Alibaba announced Qwen3.7-Max at the Apsara Summit on May 20, with public pricing landing on OpenRouter on May 21 at $2.50 / $7.50 per 1M tokens. It is currently closed-weight (no Hugging Face release, no local-deployment path), so it sits outside this guide's open-weights scope — Qwen 3.6 remains Alibaba's latest open-weights flagship and the right comparison point in the ranking below. If you're evaluating the closed Max tier specifically, see the Qwen 3.7 release tracker and 3.7 vs 3.6 comparison.

This guide ranks the current leaders using neutral benchmarks where they exist — primarily the Artificial Analysis Intelligence Index (a composite of 10 independent evaluations) and the GDPval-AA agentic leaderboard — and clearly labels vendor-reported numbers as such. We do not invent scores. If a number isn't published or independently verified, we say so.

If you're a CTO or infra lead locking down the LLM layer of your stack, this is your decision guide for May 2026.

What is the best open source LLM in 2026?

By the broadest neutral measure — the Artificial Analysis Intelligence Index, which aggregates GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt — Kimi K2.6 is the leading open-weights model. It scores 54 on that index, landing at #4 across all models (open or closed), behind only the latest from Anthropic, Google, and OpenAI (all 57). That is the smallest gap between best-open and best-closed in the history of the category.

But "best" is workload-dependent. DeepSeek V4 Pro is the strongest open model for agentic real-world work (it leads the GDPval-AA leaderboard among open weights), and GLM-5.1 is the pick when license cleanliness drives the decision. Here is the full ranked picture.

Which open-source LLMs rank highest right now?

One row per model. The "Neutral score" column is the Artificial Analysis Intelligence Index (independent composite); a dash means AA has not published an index score for that exact variant. SWE-Bench figures are vendor-reported unless noted; we label them.

Rank	Model	Params (active / total)	License	Neutral score (AA Index)	Best for
1	Kimi K2.6 (Moonshot)	32B / 1T (MoE)	Modified MIT	54 (#1 open, #4 overall)	Best all-round + agentic coding
1=	MiMo-V2.5-Pro (Xiaomi)	42B / 1T (MoE)	Apache-2.0	54	Tied #1 by index; 1M context
3	DeepSeek V4 Pro (Max)	49B / 1.6T (MoE)	MIT	52	#1 agentic (GDPval-AA), best coding
4	GLM-5.1 (Z.ai)	40B / 744B (MoE)	MIT	51	Cleanest license, cheapest self-host
5	MiniMax-M2.7	10B / 230B (MoE)	Apache-2.0	50	Cheapest frontier-class inference
6	DeepSeek V4 Flash (Max)	13B / 284B (MoE)	MIT	47	Best $/intelligence, single-host
7	Llama 4 (Meta)	~600B+ class	Llama 4 Community License	— (not on AA index)	Ultra-long context (vendor: ~5M)
8	Qwen3.6-27B (Alibaba)	27B dense	Apache-2.0	—	Best small dense coder, OSI license
9	Gemma 4 (Google)	2B–31B (dense + 26B MoE)	Apache-2.0	—	On-device / laptop-class
10	Ring-2.6-1T (inclusionAI)	~63B / 1T (MoE)	MIT	— (vendor numbers only)	Adaptive-effort reasoning (unverified)

Index scores: Artificial Analysis open-source leaderboard, snapshot mid-May 2026. Where two DeepSeek V4 Pro effort modes are listed, this table uses Max Effort (52); High Effort scores 50.

Why is Kimi K2.6 the number-one open source LLM?

Moonshot AI released Kimi K2.6 on April 20, 2026: a Mixture-of-Experts model with 1 trillion total parameters, 32B active per token, a 256K context window, and native image + video input. Per Artificial Analysis, it scores 54 on the Intelligence Index — #4 of every model on Earth, behind only the current Anthropic, Google, and OpenAI flagships (57). No open-weight model has ever been this close to the closed frontier.

What moved the needle was agentic capability. Artificial Analysis measured a GDPval-AA Elo of 1520 for K2.6 (up from K2.5's 1309), a 96% score on τ²-Bench Telecom tool-use, and a hallucination rate that dropped from 65% (K2.5) to 39%. On the harder SWE-Bench Pro benchmark, Moonshot reports 58.6%, which independent coverage notes beats GPT-5.4 (57.7%) and Claude Opus 4.6 (53.4%) on that specific benchmark.

The license is Modified MIT: genuinely open weights, with a visible-attribution clause that only triggers for very large deployments (roughly 100M+ MAU or $20M+/month revenue). For nearly every company, it is effectively unrestricted.

Pick Kimi K2.6 if: you want the best all-round open model and the strongest agentic coding, you can absorb the operational cost of serving a 1T-parameter MoE (multi-host), and Modified MIT is acceptable to legal. Available self-hosted, via Moonshot's API, or through Novita, Baseten, Fireworks, and Parasail.

Skip if: you can't justify a multi-host serving cluster — GLM-5.1, MiniMax-M2.7, or DeepSeek V4 Flash run far cheaper on smaller footprints.

Is MiMo-V2.5-Pro really tied for first?

Yes — and it's the model most "2026 best open LLM" lists still omit. Xiaomi's MiMo-V2.5-Pro is a 1T-total / 42B-active MoE with a 1M-token context, Apache-2.0 licensed, and it ties Kimi K2.6 at 54 on the Artificial Analysis Intelligence Index. The smaller MiMo-V2.5 (310B total / 15B active, 1M context) scores 49.

Two things make MiMo notable: it matches the index leader while shipping a true Apache-2.0 license (no usage clauses at all), and the 1M context window is genuinely useful for codebase-scale retrieval. The trade-off is ecosystem maturity — fewer hosted providers and less battle-tested inference tooling than DeepSeek or Qwen.

Pick MiMo-V2.5-Pro if: you want index-leading capability under a clean Apache-2.0 license and need a long (1M) context, and you're comfortable being an early adopter on serving tooling.

Which open model is best for coding and agentic work?

DeepSeek released V4 Pro and V4 Flash on April 24, 2026, both MIT-licensed with a 1M-token context. Pro is a 1.6T-total / 49B-active MoE; Flash is 284B-total / 13B-active.

On the neutral index, DeepSeek V4 Pro (Max Effort) scores 52 — #2 among open weights, just behind Kimi K2.6 and MiMo-V2.5-Pro. But on the GDPval-AA agentic leaderboard it is #1 among all open-weight models: Artificial Analysis measured V4 Pro (Max) at 1554 Elo, ahead of GLM-5.1 (1535), MiniMax-M2.7 (1514), and Kimi K2.6 (1484) on real-world agentic work tasks. Its vendor-reported SWE-Bench Verified is 80.6%, which independent reporting notes ties Gemini and lands within margin of Claude Opus 4.6's 80.8% — the first time an open model has matched the closed coding frontier.

The economics are the other story. On May 22, 2026 DeepSeek made the prior 75% V4 Pro discount permanent (permanent since 2026-05-22), so V4 Pro now sits at $0.435 / $0.87 per 1M input/output tokens and V4 Flash at $0.14 / $0.28 — Flash performs comparably to Claude Sonnet 4.6 at a fraction of the cost. Source: api-docs.deepseek.com.

Pick DeepSeek V4 if: agentic coding is your primary workload, MIT is a hard requirement, or you're cost-sensitive at scale. Use Pro for the capability ceiling, Flash for almost everything else.

The NIST CAISI evaluation remains the most measured outside read: V4 Pro is competitive on math (96–97% across multiple tests) and cost, but still trails the closed frontier on some agentic-reasoning benchmarks. For internal/B2B engineering work this gap rarely bites; for public consumer products, add a safety layer regardless of model.

Companion guide

For the full continuously-updated landscape — every notable open-weights model, license, hosting cost, and deployment pattern — see our Open-Source LLMs Landscape (2026).

What about GLM-5.1 and the cleanest license?

GLM-5.1 from Z.ai (formerly Zhipu AI) is a 744B-total / 40B-active MoE with a 200K context, released under a plain MIT license — no usage-based clauses, no field-of-use restrictions, the cleanest open-weight license in the frontier tier. It scores 51 on the Artificial Analysis Intelligence Index and ranks second only to DeepSeek V4 Pro on the GDPval-AA agentic leaderboard (1535 Elo).

On Arena's Code Arena WebDev leaderboard, GLM-5.1 (1,534 Elo) actually edged out Kimi K2.6 (1,529) as of late April 2026. Its self-reported SWE-Bench Pro of 58.4% is essentially tied with Kimi K2.6's 58.6% — independent third-party confirmation across every GLM-5.1 sub-benchmark is still incomplete, so treat the vendor coding numbers as Z.ai-reported until more leaderboards verify them. Notably, GLM-5.1 was trained entirely on Huawei Ascend hardware, with no Nvidia chips.

Pick GLM-5.1 if: license cleanliness drives the decision (plain MIT, no clauses to read), you want the cheapest fully-loaded self-hosted option at sustained throughput, and SWE-Bench Pro performance is on your scorecard.

Where do MiniMax, Llama, Qwen, and Gemma land now?

MiniMax-M2.7 (open-weighted on Hugging Face, Apache-2.0) is the efficiency surprise: only 10B active / 230B total, yet it scores 50 on the Intelligence Index and reports 56.22% SWE-Pro and 57.0% Terminal-Bench 2. With the smallest active-parameter count of any frontier-class model here, it is the cheapest to serve at sustained throughput. Pick it when inference cost dominates and you can accept a slightly lower ceiling than Kimi/DeepSeek.

Llama 4 (Meta, 2025) is the long-context play: Llama 4 Scout offers a 10M context window at 109B / 17B active, and Llama 4 Maverick offers 1M context at 400B / 17B active. It is not currently on the Artificial Analysis Intelligence Index, so we can't give it a neutral composite score — its headline capability numbers are vendor-reported. It still ships under a Llama Community License (source-available, commercial use with a large-MAU carve-out), not an OSI-approved license. Pick Llama 4 if ultra-long context for RAG-on-everything is the requirement and the community license clears legal.

Qwen 3.6 is a strategy story. Alibaba's flagship Qwen3.6-Max-Preview went closed-weight (April 20, 2026) — its first proprietary flagship ever. But the open variants are strong: Qwen3.6-27B (dense, Apache-2.0) reports 77.2 SWE-Bench Verified and 59.3 Terminal-Bench 2.0 (matching Claude 4.5 Opus on that benchmark, per Alibaba), beating the much larger Qwen3.5-397B MoE on agentic coding. If you need a small dense coder under a true OSI license, this is the pick — just know the flagship is no longer open.

Gemma 4 (Google, April 2, 2026) is now Apache-2.0 (a meaningful upgrade from prior Gemma terms) and ships in 2B/4B effective sizes, a 31B dense, and a 26B MoE (3.8B active). Google reports 85.2 MMLU-Pro and 80.0 LiveCodeBench v6 for the larger sizes. Gemma's category of one remains on-device: a model that actually runs on a laptop or single consumer GPU, with int4 quantizations for desktop/mobile/air-gapped deployment.

Ring-2.6-1T (inclusionAI / Ant Group, ~May 8, 2026) is a ~1T-total / ~63B-active MoE under MIT with adaptive reasoning-effort modes. Its AIME 2026 (95.83) and GPQA Diamond (88.27) figures are vendor-reported only — no neutral third party has published verified Ring-2.6-1T results yet, so we list it but exclude it from any ranked claim until independent benchmarks land.

How do the licenses compare for commercial use?

Model	License	OSI-approved?	Gotchas
DeepSeek V4 (Pro + Flash)	MIT	Yes	None
GLM-5.1 / GLM-5	MIT	Yes	None — cleanest in the frontier tier
MiMo-V2.5 / V2.5-Pro	Apache-2.0	Yes	None
MiniMax-M2.7	Apache-2.0	Yes	None
Qwen3.6-27B	Apache-2.0	Yes	Open; the Max flagship is closed
Gemma 4	Apache-2.0	Yes	None (upgraded from prior Gemma terms)
Ring-2.6-1T	MIT	Yes	Benchmarks unverified by third parties
Kimi K2.6	Modified MIT	No (close)	Visible-attribution clause at ~100M+ MAU / $20M+/mo revenue
Llama 4	Llama 4 Community License	No	Large-MAU carve-out; attribution; use restrictions

If your legal team mandates OSI-approved licenses, your frontier shortlist is now large: DeepSeek V4 (MIT), GLM-5.1 (MIT), MiMo-V2.5-Pro (Apache-2.0), MiniMax-M2.7 (Apache-2.0), Qwen3.6-27B (Apache-2.0), and Gemma 4 (Apache-2.0). That's a real change from a year ago — clean licensing is no longer a reason to stay closed.

Which open source LLM should you pick for your workload?

Best all-round open model? Kimi K2.6 (AA Index 54, #1 open / #4 overall) or MiMo-V2.5-Pro (tied at 54, cleaner license).
Best agentic / real-world work? DeepSeek V4 Pro (#1 open on GDPval-AA, 1554 Elo).
Best coding (SWE-Bench Verified)? DeepSeek V4 Pro (80.6%, vendor-reported, ties closed frontier). On SWE-Bench Pro: Kimi K2.6 (58.6%) and GLM-5.1 (58.4%) lead.
Cleanest license (no clauses to read)? GLM-5.1 (MIT) or any Apache-2.0 model (MiMo, MiniMax, Qwen3.6-27B, Gemma 4).
Cheapest frontier-class inference? MiniMax-M2.7 (10B active) or DeepSeek V4 Flash ($0.14/$0.28 per M).
Ultra-long context for codebase-scale RAG? Llama 4 Scout (~10M, vendor) or DeepSeek V4 Pro / MiMo-V2.5-Pro (1M, neutral-scored).
On-device / laptop / air-gapped? Gemma 4.
Strict OSI-approved license only? DeepSeek V4, GLM-5.1, MiMo-V2.5-Pro, MiniMax-M2.7, Qwen3.6-27B, Gemma 4.
Best small dense coder? Qwen3.6-27B (77.2 SWE-Bench Verified, Apache-2.0).

For deeper, model-specific dives, see our Kimi K2.6 Complete Guide, DeepSeek V4 Complete Guide, and AI Coding Agents Complete Guide (2026).

How much does it cost to self-host these models?

The threshold math still holds, with cheaper models pushing it lower. A single H100 runs ~$2–4/hr on-demand; H200 ~$3–5/hr. Self-hosting beats hosted API once you cross roughly:

~5M tokens/day sustained for a single-host model (DeepSeek V4 Flash, MiniMax-M2.7, Gemma 4, Qwen3.6-27B).
~30–50M tokens/day for a multi-host frontier MoE (Kimi K2.6, DeepSeek V4 Pro, GLM-5.1, MiMo-V2.5-Pro). GLM-5.1's plain MIT license and 40B-active footprint make it the cheapest fully-loaded self-hosted option at sustained scale.

Below those thresholds, hosted APIs win on total cost once you price in ops time. For a deeper dive into the tradeoffs, see our self-hosting LLMs guide.

FAQ

What is the best open source LLM right now?

By the neutral Artificial Analysis Intelligence Index (May 2026), Kimi K2.6 leads all open-weight models at 54, ranking #4 of every model on Earth behind only the latest Anthropic, Google, and OpenAI flagships. Xiaomi's MiMo-V2.5-Pro is tied at 54.

Which open model is best for coding?

DeepSeek V4 Pro reports the highest SWE-Bench Verified (80.6%, vendor-reported, ties the closed frontier) and leads the GDPval-AA agentic leaderboard among open models. On the harder SWE-Bench Pro, Kimi K2.6 (58.6%) and GLM-5.1 (58.4%) are essentially tied at the top.

Is Kimi K2.6 really close to GPT and Claude?

On the Artificial Analysis Intelligence Index it scores 54 versus 57 for the top closed models from Anthropic, Google, and OpenAI — a three-point gap and the closest any open-weight model has come. It still trails on some long-horizon agentic-reasoning benchmarks.

Which open LLM has the cleanest license for commercial use?

GLM-5.1 (plain MIT, no clauses) and any Apache-2.0 model — MiMo-V2.5-Pro, MiniMax-M2.7, Qwen3.6-27B, and Gemma 4. DeepSeek V4 is also MIT. Kimi K2.6 is Modified MIT (an attribution clause only at very large scale); Llama 4 is a community license, not OSI-approved.

Can I run any of these on a single GPU?

Yes. Gemma 4 and Qwen3.6-27B run on a single consumer/datacenter GPU; DeepSeek V4 Flash and MiniMax-M2.7 run on a single host with quantization. Kimi K2.6, DeepSeek V4 Pro, GLM-5.1, and MiMo-V2.5-Pro are multi-host frontier MoEs.

Did Qwen stop being open source?

Partially. Alibaba's flagship Qwen3.6-Max-Preview went closed-weight (its first proprietary flagship). But the smaller open variants — Qwen3.6-27B and Qwen3.6-35B-A3B — remain fully open under Apache-2.0 and are strong agentic coders.

Should I trust vendor-reported benchmarks?

Treat them as a directional claim, not ground truth. This guide leans on the independent Artificial Analysis Intelligence Index and GDPval-AA where they exist, and explicitly labels vendor-only numbers (e.g., Ring-2.6-1T, Llama 4 long-context claims) so you can weight them accordingly before committing budget.

How fast is this list changing?

Very. Nine frontier-class open-weight models shipped in roughly six weeks (April–mid-May 2026). Re-validate your shortlist against the Artificial Analysis open-source leaderboard quarterly — the leader has changed more than once this year.

Need engineers who can actually run these models in production?

Picking a model is a one-week decision; operating a self-hosted LLM stack is a year-long engineering investment — multi-node serving, FP4/FP8 quantization, expert-routing tuning, eval pipelines against your real workload, and 24/7 on-call. If you're hiring vetted remote developers experienced with open-weight LLM infrastructure, inference optimization, and agentic coding harnesses, codersera.com/hire matches you with engineers who have shipped exactly this before, with a risk-free trial so you can validate fit before committing.