The Open-Source LLM Landscape in 2026

The full open-weight LLM landscape in 2026 — DeepSeek V4, Llama 4, Qwen 3.5, Gemma 4, Mistral, Phi-4 — with real benchmarks, license analysis, and a decision framework.

Last updated: May 1, 2026. The open-weight model landscape has changed more in the last 12 months than in the prior three years. DeepSeek pushed past 80% on SWE-bench Verified with weights anyone can download. Google shipped Gemma 4 under Apache 2.0. Alibaba's Qwen line went from "credible challenger" to "best-in-class on graduate-level reasoning." Meta's Llama 4 brought a 10M-token context window into the open. If you are still picking models based on a 2024 mental model, you are leaving capability, cost savings, and deployment flexibility on the table.

This guide is the definitive landscape overview for engineering and product leaders evaluating open-weight LLMs in 2026. We cover the model families that matter, the architectures behind them, their license fine print, hardware tiers, and a decision framework you can actually use. Numbers throughout are from official model cards, the ArtificialAnalysis and LiveCodeBench leaderboards, and lab release blogs as of late April 2026.

TL;DR

  • The frontier is open. DeepSeek V4-Pro hits 80.6% on SWE-bench Verified — within 0.2 points of Claude Opus 4.6 — under an MIT license. The capability gap between open and closed is now measured in single benchmark points, not generations.
  • MoE is the default at scale. Almost every flagship open model in 2026 is a sparse Mixture-of-Experts: DeepSeek V4-Pro (1.6T total / 49B active), Llama 4 Maverick (400B / 17B), Qwen 3.5 (397B / 17B), Mistral Large 3 (675B / 41B). Total parameters set the VRAM floor; active parameters set the inference cost.
  • Apache 2.0 has won the license war for permissive labs. Gemma 4, Qwen 3.5, Mistral Large 3, and Yi all ship under Apache 2.0. DeepSeek V4 ships MIT. Llama 4 keeps the Meta custom license with the 700M MAU clause.
  • Context windows are no longer a gate. Llama 4 Scout offers 10M tokens. DeepSeek V4 supports 1M. Gemma 4 medium offers 256K. The bottleneck is now retrieval quality and inference cost, not raw window size.
  • Specialization beats raw size on benchmarks. Qwen 3.5 leads open weights on GPQA Diamond at 88.4%. Gemma 4 31B hits 80% on LiveCodeBench from a dense model a small fraction of the total size of the frontier MoEs. Phi-4 14B beats 70B+ models on math reasoning.
  • Hardware tiers are now well-defined. 7B-14B runs on a 16GB GPU. 27B-32B needs an RTX 5090 (32GB) at Q4. 70B dense needs 48-64GB or dual cards. Frontier MoE (400B+) needs an 8-GPU H200/B300 box.
  • The interesting question shifted from "which model is best?" to "which model is best for this task at this latency budget under this license?"

Why Open-Source Matters in 2026

The argument for open-weight models stopped being ideological a year ago. The drivers now are concrete:

Cost. Hosted DeepSeek V4-Flash inference runs at a fraction of frontier closed-API pricing, and a self-hosted Gemma 4 4B can serve millions of internal requests for the price of a single GPU. For high-volume workloads — RAG over internal docs, code review bots, transcript summarization — the unit economics of open weights are not close.

Privacy and data residency. Regulated workloads (health, finance, legal, defense) cannot send raw inputs to third-party APIs. Open weights run inside your VPC, on your hardware, against your data, with logs you control. This is the single biggest reason enterprise CTOs are funding GPU clusters in 2026.

Fine-tuning. You cannot fine-tune Claude or GPT-5 on your proprietary corpus. You can fine-tune Qwen 3.5, Gemma 4, or Llama 4 in an afternoon with LoRA on a single H100. The closed labs offer fine-tuning APIs, but with strict data and policy constraints.
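
To make "an afternoon with LoRA" concrete, here is a minimal sketch using the Hugging Face transformers and peft libraries. The repo id is a placeholder, and the hyperparameters are illustrative defaults, not a tuned recipe:

```python
# Minimal LoRA fine-tune setup with Hugging Face transformers + peft.
# The repo id below is a placeholder: substitute whichever open-weight
# checkpoint (Qwen, Gemma, Llama) you are actually licensed to tune.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "your-org/open-weights-base"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto"
)

lora = LoraConfig(
    r=16,                                  # adapter rank: capacity vs. VRAM trade-off
    lora_alpha=32,                         # scaling factor, conventionally 2x rank
    target_modules=["q_proj", "v_proj"],   # attention projections are the usual targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of base weights

# ...train with your Trainer of choice, then persist only the adapter:
# model.save_pretrained("adapter/")
```

Because only the adapter weights train, the VRAM and time budget is a fraction of a full fine-tune, which is what makes the single-H100 workflow realistic.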

Vendor lock-in. Anthropic, OpenAI, and Google have all changed pricing, deprecated models, and tightened policies on 30 days' notice in the last year. Open weights insulate your roadmap from another lab's product decisions.

For a deeper look at the cost story specifically, see our DeepSeek V4 Pro pricing review and the broader DeepSeek V4 complete guide.

The Family Overview Matrix

Here is the landscape in one table. "Active" means parameters used per forward pass on MoE models — what your inference cost actually scales with. "Total" sets the VRAM floor.

| Family | Flagship (2026) | Total Params | Active | Architecture | Context | License | Released |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 | V4-Pro | 1.6T | 49B | MoE | 1M | MIT | Apr 2026 |
| DeepSeek V4 | V4-Flash | 284B | 13B | MoE | 1M | MIT | Apr 2026 |
| Llama 4 (Meta) | Maverick | 400B | 17B | MoE (128 experts) | 1M | Llama 4 Community | Apr 2025 |
| Llama 4 (Meta) | Scout | 109B | 17B | MoE (16 experts) | 10M | Llama 4 Community | Apr 2025 |
| Qwen 3.5 (Alibaba) | Qwen3.5-397B-A17B | 397B | 17B | MoE | 256K | Apache 2.0 | Feb 2026 |
| Qwen 3.6 (Alibaba) | Qwen3.6-35B-A3B | 35B | 3B | MoE | 256K | Apache 2.0 | Apr 2026 |
| Gemma 4 (Google) | Gemma 4 31B | 31B | 31B | Dense | 256K | Apache 2.0 | Apr 2026 |
| Gemma 4 (Google) | Gemma 4 27B A4B | 26B | ~4B | MoE | 256K | Apache 2.0 | Apr 2026 |
| Mistral Large 3 | Large 3 (2512) | 675B | 41B | MoE | 256K | Apache 2.0 | Dec 2025 |
| Phi-4 (Microsoft) | Phi-4-reasoning-plus | 14B | 14B | Dense | 32K | MIT | 2025 |
| Hunyuan (Tencent) | Hy3 Preview | 295B | 21B | MoE | 256K | Tencent Hy Community | Apr 2026 |
| Yi (01.AI) | Yi 1.5 / Yi-Coder 9B | up to 34B | up to 34B | Dense | 200K+ | Apache 2.0 | 2024-2025 |
| Falcon (TII) | Falcon H1R 7B | 7B | 7B | Dense (hybrid) | n/a | Falcon LLM 1.0 (Apache-derived) | Jan 2026 |
| Cohere Command R+ | C4AI Command R+ | 104B | 104B | Dense | 128K | CC-BY-NC (research) | 2024 (still relevant) |
| Stable LM 2 | Stable LM 2 12B | 12B | 12B | Dense | n/a | Stability Community | 2024 |
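
As a sanity check on the Total/Active distinction, here is the back-of-envelope VRAM arithmetic behind the matrix. It is a sketch that counts weights only, ignoring KV cache and activation overhead, so treat the results as a floor:

```python
# Weight footprint = total params x bytes per param under a given quantization.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_vram_gb(total_params_billions: float, quant: str = "q4") -> float:
    """Weight footprint only; KV cache and activations come on top."""
    return total_params_billions * BYTES_PER_PARAM[quant]

for name, total_b in [("Gemma 4 31B", 31), ("Llama 4 Scout", 109), ("Mistral Large 3", 675)]:
    print(f"{name}: ~{weight_vram_gb(total_b):.0f} GB at Q4, "
          f"~{weight_vram_gb(total_b, 'fp16'):.0f} GB at FP16")
# Gemma 4 31B: ~16 GB at Q4, ~62 GB at FP16   -> fits a 24-32 GB card
# Llama 4 Scout: ~55 GB at Q4, ~218 GB at FP16
# Mistral Large 3: ~338 GB at Q4, ~1350 GB at FP16 -> multi-GPU territory
```

Active parameters, by contrast, drive per-token compute and therefore latency, which is why a 13B-active MoE like V4-Flash serves far cheaper than a dense 70B of comparable quality.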

The Benchmark Comparison

Real numbers, not vibes. SWE-bench Verified is the truest signal for "can this model do real engineering work." LiveCodeBench is contamination-resistant competitive coding. GPQA Diamond is graduate-level science reasoning. MMLU-Pro is the harder, ten-option successor to MMLU.

| Model | SWE-bench Verified | LiveCodeBench | GPQA Diamond | MMLU-Pro |
|---|---|---|---|---|
| DeepSeek V4-Pro | 80.6% | ~93% (V4 Max variant) | ~80% | ~87% |
| DeepSeek V4-Flash | 79.0% | ~88% | ~78% | ~85% |
| Llama 4 Maverick | n/a | n/a | ~70% | 80.5% |
| Llama 4 Scout | n/a | n/a | ~64% | ~74% |
| Qwen 3.5 (397B-A17B) | 76.4% | 83.6% | 88.4% | 87.8% |
| Qwen 3.6-35B-A3B | 73.4% | ~80% | ~82% | ~84% |
| Gemma 4 31B | ~68% | 80% | ~75% | 85.2% |
| Mistral Large 3 | ~65% | ~70% | ~70% | ~78% |
| Phi-4-reasoning-plus (14B) | n/a | 56.1% | ~74% | n/a |
| Hunyuan Hy3 Preview | 74.4% | ~78% | ~75% | ~83% |
| Falcon H1R 7B | n/a | n/a | n/a | n/a |

(n/a: not reported on that benchmark as of this writing.)

A few observations from the table:

  • DeepSeek V4-Pro is the strongest open coder, closing in on Claude Opus 4.6.
  • Qwen 3.5 is the strongest open scientific reasoner, beating most closed models on GPQA Diamond.
  • Gemma 4 31B punches well above its weight class on coding — 80% LiveCodeBench from a 31B dense model is exceptional.
  • Llama 4 Maverick's MMLU-Pro of 80.5% beats GPT-4o, but on coding benchmarks it has fallen behind the Chinese labs.
  • Phi-4 stays useful as a small reasoning specialist, not a generalist.

For a focused head-to-head on the two most discussed open models of April 2026, see Gemma 4 vs Llama 4 for local deployment. For a deep dive on DeepSeek V4-Flash specifically, see the V4-Flash deep dive, and for a Gemma 4 standalone review including local-deployment specifics see our Gemma 4 review.

One nuance worth flagging: benchmark scores are reported under different harnesses and prompt templates. A 2-3 point delta between the same model on two leaderboards is normal and usually reflects scaffold differences, not model quality. When you evaluate, fix the harness and the prompt and only then compare. The teams shipping the most impressive open-model deployments in 2026 are the ones who built rigorous internal evals before they touched production traffic.
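
A minimal version of "fix the harness and the prompt" in practice might look like the following. The endpoints and model ids are placeholders; any OpenAI-compatible server (vLLM, or the hosted providers discussed later) exposes this interface:

```python
# Fixed-harness eval sketch: identical prompts, decoding params, and scoring
# function for every candidate, so score deltas reflect the model, not the
# scaffold. Endpoints and model ids are hypothetical.
from openai import OpenAI

CANDIDATES = {  # hypothetical deployments
    "qwen-3.5": ("http://qwen-host:8000/v1", "qwen-3.5-397b-a17b"),
    "gemma-4":  ("http://gemma-host:8000/v1", "gemma-4-31b"),
}

def run_eval(prompts: list[str], grade) -> dict[str, float]:
    """grade(prompt, completion) -> bool, your task-specific checker."""
    scores = {}
    for name, (base_url, model) in CANDIDATES.items():
        client = OpenAI(base_url=base_url, api_key="EMPTY")
        passed = 0
        for p in prompts:
            r = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": p}],
                temperature=0.0,   # fixed decoding across all candidates
                max_tokens=1024,
            )
            passed += grade(p, r.choices[0].message.content)
        scores[name] = passed / len(prompts)
    return scores
```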

License and Commercial-Use Matrix

This is where engineering teams get burned. "Open weights" is not the same as "open source," and "Apache 2.0" is not the same as "Llama 4 Community." Read the license before you ship.

| Model | License | Commercial Use | Redistribution | Fine-tune & Derivatives | Notable Restrictions |
|---|---|---|---|---|---|
| DeepSeek V4 Pro / Flash | MIT | Yes, unrestricted | Yes | Yes | None of substance |
| Qwen 3.5 / 3.6 | Apache 2.0 | Yes, unrestricted | Yes | Yes | Standard attribution |
| Gemma 4 | Apache 2.0 | Yes, unrestricted | Yes | Yes | Standard attribution |
| Mistral Large 3 | Apache 2.0 | Yes, unrestricted | Yes | Yes | Standard attribution |
| Phi-4 | MIT | Yes, unrestricted | Yes | Yes | None of substance |
| Yi 1.5 / Yi-Coder | Apache 2.0 | Yes, unrestricted | Yes | Yes | Standard attribution |
| Llama 4 | Llama 4 Community License | Yes, with carve-outs | Yes | Yes | Companies with 700M+ MAU need a separate Meta license; products built on Llama must include "Built with Llama" attribution |
| Hunyuan Hy3 | Tencent Hy Community License | Conditional | Conditional | Yes | Custom; review before deployment |
| Falcon H1R | Falcon LLM 1.0 (Apache-derived) | Yes, royalty-free | Yes | Yes | No-litigate clause; attribution required |
| Cohere Command R+ | CC-BY-NC 4.0 | No (research only) | Yes (non-commercial) | Yes (non-commercial) | Commercial use requires Cohere license |
| Stable LM 2 12B | Stability Community | Requires Stability membership | Restricted | Yes | Membership tier gates commercial deployment |

Bottom line for legal review:

  • If you want zero ambiguity: DeepSeek V4 (MIT), Qwen 3.5 (Apache 2.0), Gemma 4 (Apache 2.0), Mistral Large 3 (Apache 2.0), Phi-4 (MIT). These are clean for any commercial deployment.
  • Llama 4 is fine unless you are at FAANG scale (700M+ MAU), but factor in the attribution requirement.
  • Cohere Command R+ is a research artifact. Do not ship it to production without a Cohere commercial agreement.
  • Tencent and Stability have custom licenses worth a 30-minute legal read before integration.

How to Choose: A Decision Framework

Walk down this list, in order. The first answer that matches your constraint picks the model. (A code sketch of the same gates follows the list.)

  1. License must be permissive (legal/compliance gate). Eliminate Cohere Command R+, Stability community-license models, and (if you are above 700M MAU) Llama 4. You are left with the DeepSeek/Qwen/Gemma/Mistral/Phi/Yi pool.
  2. Hardware budget.
    • Single 16GB consumer GPU: Phi-4 14B, Gemma 4 4B, Yi 9B, Falcon H1R 7B.
    • Single 32GB GPU (RTX 5090, A6000): Gemma 4 31B at Q4, Qwen 3.6-35B-A3B at Q4.
    • Dual 24GB or 48GB single card: Llama 3.3 70B class, dense 70B variants at Q4.
    • 8x H100/H200/B300 server: DeepSeek V4-Pro, Llama 4 Maverick, Mistral Large 3, Qwen 3.5-397B.
  3. Primary task.
    • Code generation / agent loops: DeepSeek V4-Pro > DeepSeek V4-Flash > Qwen 3.5 > Gemma 4 31B.
    • Scientific / graduate reasoning: Qwen 3.5 > DeepSeek V4-Pro > Gemma 4 31B.
    • Long-document RAG: Llama 4 Scout (10M ctx) > DeepSeek V4 (1M) > Gemma 4 (256K).
    • Tool-use / function calling: Qwen 3.5, Mistral Large 3, Llama 4 Maverick (all strong; pick on license).
    • On-device / edge inference: Phi-4 14B, Gemma 4 E2B/E4B, Falcon H1R 7B.
  4. Latency budget. MoE models with smaller active-parameter counts (V4-Flash 13B active, Qwen3.6-35B-A3B 3B active) are dramatically faster than dense 70B models at similar quality. Active params, not total params, drive latency.
  5. Multilingual / regional needs. Qwen and Hunyuan dominate Chinese. Mistral Large 3 is strongest on European languages. Llama 4 covers ~30 languages well. Yi is bilingual EN/ZH first.

For a worked example of comparing two specific options, see Gemma 3 vs Qwen 3 (much of the methodology carries forward to the 4/3.5 generation), and the Hunyuan vs Qwen 3 comparison. If you are already on DeepSeek and weighing alternatives, our DeepSeek V4 alternatives piece walks through the trade-offs.

Architecture Notes Worth Knowing

MoE is the dominant pattern at scale. Every flagship open model above ~100B parameters in 2026 is a Mixture-of-Experts. The reason is straightforward: at training and serving scale, MoE lets you grow total capacity (knowledge, multilingual coverage, niche skills) without proportionally growing inference cost. DeepSeek V4-Pro with 1.6T total / 49B active runs faster than a hypothetical 70B dense model of similar quality.

Dense still wins under ~32B. Below the MoE break-even, dense architectures are simpler to fine-tune, simpler to serve, and easier to quantize. Gemma 4 31B (dense) is the strongest sub-32B coder. Phi-4 14B (dense) is the strongest small-model reasoner.

Long-context tricks have matured. DeepSeek V4 introduced Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), cutting KV cache to 10% of V3.2's at 1M tokens. Llama 4 Scout uses interleaved local/global attention for its 10M window. These are not free — recall quality at the tail of long contexts is still the main failure mode — but the engineering has stopped being the bottleneck.
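
To see why KV-cache compression matters, here is the standard cache-size arithmetic. The dimensions are illustrative (a GQA model with 8 KV heads), not any specific model's config:

```python
# KV cache grows linearly with sequence length:
# 2 (keys + values) x layers x KV heads x head dim x bytes x positions.
def kv_cache_gb(seq_len: int, n_layers: int = 60, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * seq_len / 1e9

print(kv_cache_gb(128_000))    # ~31 GB  -- already most of a GPU at 128K
print(kv_cache_gb(1_000_000))  # ~246 GB -- why compressed attention matters
```

Cut that cache to 10%, as the V4 release claims, and a 1M-token session drops from hundreds of GB to a few tens.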

Reasoning specialization is now a separate axis. Phi-4-reasoning-plus, DeepSeek-R1-derived variants, and the Qwen 3.6 reasoning lines are tuned with RL on chain-of-thought traces. They cost more tokens per answer but produce dramatically better math/science/code results. Treat reasoning models as a distinct product category from general chat models.

Hybrid attention and state-space ideas are creeping in. Falcon H1R uses a hybrid Mamba/Transformer block, and several smaller labs are shipping pure SSM (state-space model) variants for very-long-context inference. None of these have yet displaced the Transformer-MoE flagship pattern at the top of the leaderboards, but they are showing up in the long tail and may matter more in 2027 if the trend continues.

Quantization is now part of the model release. Most labs ship official Q4, Q5, and Q8 GGUF or AWQ quantizations alongside the FP16 weights. The days of waiting two weeks for the community to quantize a release are over — for production planning, assume you can deploy at Q4 the day a model drops.
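
In practice that means you can pull just the official Q4 artifact instead of the full FP16 checkpoint. A sketch with huggingface_hub, where the repo id and filename pattern are placeholders:

```python
# Download only the official Q4 quantization from a release, skipping the
# FP16 shards entirely. Repo id and filename pattern are hypothetical.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/model-GGUF",     # hypothetical quantized-weights repo
    allow_patterns=["*Q4_K_M.gguf"],   # fetch the Q4_K_M artifact only
    local_dir="models/",
)
```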

Hardware Tiers Cheat Sheet

| Tier | VRAM | Example Hardware | What It Runs (Q4) | Approx. Cost |
|---|---|---|---|---|
| Edge / laptop | 8-16 GB | RTX 4060 Ti 16GB, M3 Pro 18GB | Phi-4 14B, Gemma 4 4B, Yi 9B | ~$500-2,000 |
| Power workstation | 24-32 GB | RTX 5090, RTX 4090 | Gemma 4 31B, Qwen3.6-35B-A3B | ~$2,000 |
| Pro workstation | 48-96 GB | RTX PRO 6000 Blackwell, dual RTX 5090 | 70B dense at Q4, 100B+ MoE at Q4 | ~$4,000-8,500 |
| Single server | 80-141 GB | 1x SXM5 H100/H200 | DeepSeek V4-Flash (full-precision activations), Mistral Large 3 quantized | ~$25K-40K |
| Frontier cluster | ~1.1-2.3 TB | 8x H200 / B300 server (up to NVL72-class racks) | DeepSeek V4-Pro, Llama 4 Maverick, Qwen 3.5-397B at production speed | $200K+ |

Quantization is the lever that moves you down a tier. Q4_K_M cuts VRAM ~75% versus FP16 with minimal quality loss for most use cases. Below Q4 (Q3, Q2) quality starts dropping noticeably; only use it if you have nowhere else to go.

What's Overhyped, What's Underrated

Overhyped:

  • 10M context windows. Llama 4 Scout's 10M token window is technically real, but practical recall past ~500K tokens is still shaky on every model we have tested. For real RAG workloads, retrieval quality + a 128K window beats raw 10M every time.
  • Total parameter count as a marketing number. "1.6 trillion parameters" sounds impressive; what matters is the 49B active and how cleanly the router picks experts. Compare on benchmarks, not headline counts.
  • Reasoning-only specialist models for general workloads. If you put Phi-4-reasoning-plus in a chat app, users will hate the latency. Use reasoning models where the user expects to wait — code agents, math/science assistants, planning loops.
  • Falcon's "best 7B globally" claims. Falcon H1R is good, but its headline numbers come from an AIME-heavy benchmark suite that cherry-picks math reasoning. On coding and open-ended tasks it does not hold up as cleanly as similarly sized Phi-4 or Gemma 4 4B.

Underrated:

  • Gemma 4 31B. A dense 31B model from Google under Apache 2.0 with 80% LiveCodeBench and 85.2% MMLU-Pro is the most practical "single GPU" deployment of 2026. Most teams should default to this and only move up the stack if they hit a real ceiling.
  • DeepSeek V4-Flash. Within 1.6 points of V4-Pro on SWE-bench Verified at roughly a quarter of the active parameters (13B vs 49B). For agent loops where you spend 30+ tool calls per task, the latency math overwhelmingly favors Flash.
  • Qwen 3.6-35B-A3B. 3B active parameters, 73.4% on SWE-bench Verified. The economics for self-hosted code review and PR triage are absurdly good.
  • Yi-Coder 9B. Quietly one of the best small coding models, 85% HumanEval at a size you can run on a Mac.
  • Mistral Large 3 for European compliance. French-headquartered, Apache 2.0, EU data residency story that DeepSeek and Qwen cannot match for some buyers. Often the only acceptable option for regulated EU customers.

FAQ

Is DeepSeek V4 actually open source, or just open weights?

It is MIT-licensed, which is open source by the OSI definition for the released artifacts (weights, model code, inference code). The training data and training code are not released, so it is not "open" in the strictest reproducibility sense, but for downstream commercial use and modification, MIT is as permissive as it gets.

Can I use Llama 4 in a startup?

Yes. The Llama 4 Community License allows commercial use; the 700M MAU restriction only kicks in at hyperscale. You do need to add a "Built with Llama" attribution to your product and follow Meta's acceptable-use policy. Most startups can ship Llama 4 without legal friction.

What is the cheapest model that is "good enough" for production code review?

Qwen 3.6-35B-A3B or Gemma 4 31B. Both run on a single RTX 5090 at Q4, both clear 70%+ on SWE-bench Verified, both ship under Apache 2.0. Pick Qwen for stronger raw coding, Gemma for cleaner instruction-following.

Does fine-tuning still matter, or is base capability good enough?

Fine-tuning matters for domain language (legal, medical, internal jargon), tool schemas, and house style. It rarely improves raw reasoning. Use LoRA/QLoRA on a base open-weight model rather than full fine-tunes.

What about multimodal models?

Llama 4 is natively multimodal (text+image+video). Gemma 4 has vision-capable variants. Qwen ships dedicated VL models. For text-only workloads, ignore multimodal capability; it adds VRAM cost and complexity.

Are Chinese-origin models safe for US enterprises?

The model weights themselves contain no call-home or telemetry mechanism; they are static files. Self-hosted Qwen, DeepSeek, or Hunyuan does not phone home. The risk surface is the API-hosted versions (where requests go to Chinese infrastructure). Your security/legal team should weigh in on training-data provenance and supply-chain considerations, but for self-hosted inference there is no technical leak path.

How do I evaluate a new open model honestly?

Run it on three things: (1) your real production prompts with held-out outputs, (2) a contamination-resistant benchmark like SWE-bench Verified or LiveCodeBench, and (3) a latency/cost test at your real concurrency. Public benchmarks are a filter, not a verdict.

What is the difference between SWE-bench and SWE-bench Verified?

SWE-bench Verified is a 500-task hand-curated subset of SWE-bench where each task has been confirmed to be solvable and to have correct test cases. It is the version everyone reports in 2026; raw SWE-bench numbers from 2024 are no longer comparable.

Should I run inference myself or use a hosted open-weight provider?

Hosted (Together, Fireworks, DeepInfra, Groq) wins for spiky workloads where you cannot keep a GPU warm. Self-hosted wins past ~10M tokens/day of steady traffic, where amortizing GPU cost beats per-token pricing. Run the math on your actual usage; do not assume.
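
The math is a one-liner. The prices below are placeholders; plug in your own GPU quote and hosted rate:

```python
# Hosted vs self-hosted break-even: the daily traffic above which a
# dedicated GPU beats per-token pricing. All prices are illustrative.
def breakeven_tokens_per_day(gpu_cost_per_hour: float, hosted_price_per_mtok: float) -> float:
    return gpu_cost_per_hour * 24 / hosted_price_per_mtok * 1_000_000

# e.g. a $2/hr dedicated GPU vs $5 per million hosted tokens:
print(f"{breakeven_tokens_per_day(2.00, 5.00):,.0f}")  # 9,600,000 tokens/day
```

Under those made-up prices the crossover lands near the ~10M tokens/day rule of thumb; your actual quotes will move it substantially in either direction.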

What about open-source reasoning/agent frameworks?

Models are only half the story. Pair an open model with frameworks like vLLM (serving), SGLang (structured outputs), and LangGraph or smolagents (orchestration). The model choice and the harness choice are separate decisions.
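
A minimal vLLM sketch using its offline batch API follows; the model id is a placeholder, and for production you would run the same model behind vLLM's OpenAI-compatible HTTP server instead:

```python
# Minimal vLLM offline inference sketch. The repo id is hypothetical;
# point it at whichever open-weight instruct model you deploy.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/open-weights-instruct")  # hypothetical repo id
params = SamplingParams(temperature=0.2, max_tokens=512)

outputs = llm.generate(["Summarize the Q3 incident report:"], params)
print(outputs[0].outputs[0].text)
```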

How fast is the open-source field moving in 2026?

A new flagship-tier open model has dropped roughly every 4-6 weeks for the past year. Build your evaluation pipeline so swapping models is a config change, not a port. The team that can re-benchmark and ship the new model in a day captures the gains; the team that can't pays the closed-API premium another quarter.

What's the right way to think about open vs closed in 2026?

Closed (Claude, GPT-5, Gemini 3) still leads at the absolute frontier by 1-3 benchmark points and ships richer tool ecosystems. Open is "good enough" for the vast majority of production work, dramatically cheaper at scale, and the only option for privacy-sensitive deployments. Most serious teams now run a hybrid: open weights for high-volume internal workloads, closed API for the highest-stakes external user-facing surface.

Where can I track this landscape day-to-day?

The ArtificialAnalysis leaderboard, the LiveCodeBench leaderboard, Hugging Face's Open LLM Leaderboard v2, and the LMSYS / Arena rankings are the four dashboards worth bookmarking. Treat any single one as a single signal, not a verdict.

Next Steps

The fastest way to operationalize this landscape is to put it in the hands of someone who has shipped open-weight models in production — fine-tuned them, served them at scale, and made the latency/cost trade-offs in real systems. Models change every six weeks; the engineer who knows how to evaluate, deploy, and re-deploy them is the durable asset.

Hire a Codersera-vetted ML engineer who has deployed open-weight models in production. Our developers have shipped DeepSeek, Qwen, Llama, and Gemma into live workloads — RAG systems, agent loops, on-prem inference clusters, fine-tuning pipelines — for teams that needed the speed and the privacy that closed APIs cannot give them. Skip the months of evaluation cycles and bring in the operator who has already done it.