Llama 4: The Complete Developer Guide (2026)

Definitive 2026 guide to Meta's Llama 4: variants, real benchmarks, license restrictions, hosted-provider pricing, self-hosting, and competition vs DeepSeek V4, Qwen 3.5, and Gemma 4.

Last updated: May 1, 2026

Llama 4 is the open-weight model family that forced every other lab to publish a serious mixture-of-experts checkpoint. It is also the model family that came with the loudest licensing footnote, the most public benchmark controversy of the cycle, and a 10-million-token context window that nobody else has matched. If you are planning to ship a product on top of Llama 4 in 2026, the practical questions are not "is it good?" but "which variant, which provider, under which license, against which alternative."

This guide answers those questions for engineering leaders, ML platform teams, and developers who have to deploy, fine-tune, or pay for Llama 4 in production. It pulls together Meta's own model cards, the Llama 4 Community License, current hosted-provider pricing, public benchmarks (including the ones that look bad for Llama 4), self-hosting hardware tiers, and a frank read on when Llama 4 is the right choice and when it is not.

TL;DR

  • The herd: Three variants, all MoE. Scout (109B total / 17B active / 16 experts / 10M context). Maverick (≈400B total / 17B active / 128 experts / 1M context). Behemoth (≈2T total / 288B active / 16 experts) was previewed but never publicly released.
  • Architecture: Native multimodal (text + image, early fusion), interleaved attention with NoPE layers ("iRoPE") to push context length, MoE routing for efficient inference at frontier scale.
  • License: Llama 4 Community License — commercial use allowed below 700M MAU, with EU multimodal restrictions and an Acceptable Use Policy. Not OSI-open, not Apache, not MIT.
  • Reality check: Public benchmarks and community testing put Llama 4 Maverick behind DeepSeek V4 and Qwen 3.5/3.6 on coding and hard reasoning. The LMArena "experimental" submission controversy (April 2025) was real and changed how the model is perceived.
  • Where it wins: Long-context retrieval (Scout), multimodal vision in an open-weight model, hosted price-performance on Groq, and a proven self-hosting story with vLLM and SGLang.
  • Where it loses: SWE-bench-style end-to-end coding agents, regulated EU deployments needing the vision modality, and any project that wants a clean MIT/Apache license.

1. The Llama 4 family at a glance

Meta released Llama 4 on April 5, 2025. Unlike Llama 2 and Llama 3, the entire Llama 4 family is mixture-of-experts. That is the single biggest architectural change. Instead of running every parameter on every token, an MoE model routes each token to a small subset of "expert" sub-networks. Total parameter count goes up; active parameter count per token stays small.

The practical effect: Llama 4 Maverick has roughly 400B parameters on disk but only ~17B active per forward pass, so its inference cost behaves like a 17B model while its capacity behaves like a much larger one. That is the trade Meta made.
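To make the routing concrete, here is a minimal sketch of a token-choice MoE layer in NumPy. It is illustrative only: the gate function, top-k value, and shared-expert details are assumptions in the spirit of Meta's published description, not the released implementation.

```python
import numpy as np

def moe_layer(x, gate_w, experts, shared_expert, k=1):
    """Route each token to its top-k routed experts plus a shared expert.

    x:        (tokens, d_model) activations
    gate_w:   (d_model, n_experts) router weights
    experts:  list of per-expert feed-forward callables
    """
    logits = x @ gate_w                          # (tokens, n_experts) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k highest-scoring experts
    out = shared_expert(x)                       # every token passes through the shared expert
    for t in range(x.shape[0]):
        for e in topk[t]:
            gate = 1.0 / (1.0 + np.exp(-logits[t, e]))  # sigmoid gate value (assumed)
            out[t] += gate * experts[e](x[t])           # only k experts run per token
    return out

# Tiny demo: 3 tokens, 4 experts, d_model = 8.
rng = np.random.default_rng(0)
d, n_exp = 8, 4
expert_ws = [rng.normal(size=(d, d)) for _ in range(n_exp)]
shared_w = rng.normal(size=(d, d))
experts = [lambda v, W=W: v @ W for W in expert_ws]
out = moe_layer(rng.normal(size=(3, d)), rng.normal(size=(d, n_exp)),
                experts, lambda v: v @ shared_w)
print(out.shape)  # (3, 8)
```

Only k experts execute per token, which is why compute tracks active parameters while memory tracks total parameters.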

| Variant | Total params | Active params | Experts | Context | Multimodal | Status |
|---|---|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | 10M tokens | Text + image | Released |
| Llama 4 Maverick | ≈400B | 17B | 128 | 1M tokens | Text + image | Released |
| Llama 4 Behemoth | ≈2T | 288B | 16 | n/a | Text + image | Preview only, not released |

The released variants were pretrained on mixed text, image, and video data (roughly 40 trillion tokens for Scout and 22 trillion for Maverick), with native multimodality from day one (early fusion of text and vision tokens into a unified backbone, rather than a vision encoder bolted onto a frozen LLM).

2. Architecture: MoE, iRoPE, and the 10M context claim

Three architectural choices matter for engineers deciding whether Llama 4 fits:

Mixture-of-Experts routing

Scout uses a "full" MoE pattern across its layers. Maverick uses an alternating dense/MoE layout — experts are applied in roughly half the layers, with dense layers in between. This matters for inference frameworks: vLLM and SGLang support both layouts, but the active-parameter advertising is misleading if you are sizing GPU memory. You still need to load all expert weights into VRAM (or stream them, which is slow).
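A quick sanity check you can run before ordering hardware: estimate the weight footprint from total parameters and serving precision. This sketch counts weights only; KV cache, activations, and framework overhead come on top, often 10-30% or more.

```python
def weight_footprint_gb(total_params_billion: float, bits_per_param: int) -> float:
    """Weight-only memory estimate in GB (1 GB = 1e9 bytes here)."""
    return total_params_billion * bits_per_param / 8

for name, params in [("Scout", 109), ("Maverick", 400)]:
    for label, bits in [("FP16", 16), ("FP8", 8), ("Int4", 4)]:
        print(f"{name:8s} {label}: ~{weight_footprint_gb(params, bits):.0f} GB")
# Scout    FP16: ~218 GB   FP8: ~109 GB   Int4: ~55 GB
# Maverick FP16: ~800 GB   FP8: ~400 GB   Int4: ~200 GB
```

These numbers drive the hardware tiers in section 6.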

iRoPE for long context

Every fourth layer in Llama 4 is a NoPE layer: no positional encoding, with full causal attention over the entire context. The three layers in between use RoPE with chunked attention, and attention scores get inference-time temperature scaling. Meta calls this scheme iRoPE. The result is the 10M-token context on Scout, with what Meta reports as perfect needle-in-the-haystack retrieval across that range.
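As a mental model, the layer schedule looks like the sketch below. The one-in-four interleave follows Meta's description; chunk size and the temperature-scaling schedule are implementation details in the released code.

```python
def layer_kind(layer_idx: int) -> str:
    """iRoPE interleave: every fourth layer drops positional encoding and
    attends over the full context; the rest use RoPE with chunked attention."""
    return "NoPE / full attention" if (layer_idx + 1) % 4 == 0 else "RoPE / chunked attention"

for i in range(8):
    print(i, layer_kind(i))
# 0-2: RoPE / chunked, 3: NoPE / full, 4-6: RoPE / chunked, 7: NoPE / full ...
```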

Independent testing (Andri.ai, dev.to community runs) confirms strong retrieval but flags a separate issue: precision degrades for tasks that require reasoning over the long context, not just retrieval. A 10M-token retrieval window is not the same as a 10M-token reasoning window. If you are pushing a full monorepo into the prompt to ask a refactoring question, expect uneven results.

Native multimodality

The vision encoder is a Meta-trained variant of MetaCLIP, trained in conjunction with a frozen Llama backbone so the encoder produces tokens the LLM can natively consume. Llama 4 has been validated for up to five input images per prompt. It is competitive with GPT-4o and Gemini 2.0 Flash on standard vision benchmarks, which is the main reason teams keep Llama 4 in the running for multimodal use cases despite the licensing friction.
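In practice you reach the vision path through any OpenAI-compatible endpoint. A minimal sketch; the base URL and model ID below are placeholders, so check your provider's catalog for the exact names:

```python
from openai import OpenAI

# Placeholder endpoint and model ID -- substitute your provider's values.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick",  # naming varies by provider
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What changed between these two dashboards?"},
            # Llama 4 is validated for up to five images per prompt.
            {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```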

3. License: read this before you ship

Llama 4 ships under the Llama 4 Community License Agreement. Commercial use is permitted, but it is not OSI-approved open source. The clauses that actually matter:

  • 700M MAU threshold. If your product or service had more than 700 million monthly active users in the calendar month before Llama 4's release (April 2025), you must request a separate license from Meta, granted at Meta's "sole discretion." This is the same clause that has been in every Llama license since Llama 2 and is aimed squarely at hyperscaler competitors.
  • EU restriction on multimodal. The Llama 4 multimodal models cannot be used by, or distributed to, individuals or companies "domiciled in" the EU. This is Meta's response to AI Act ambiguity. The text-only paths are not blocked, but the vision capability — the headline feature — is off-limits for EU-based deployments without bespoke arrangements.
  • Acceptable Use Policy. Standard restrictions on illegal use, weapons development, CSAM, election interference, and so on. Read it; it is short.
  • Attribution. Distributions of Llama 4 or fine-tunes must include "Built with Llama" attribution and a copy of the license.

If your legal team requires an Apache 2.0 / MIT model, Llama 4 is out. DeepSeek V4 is MIT-licensed and is the most direct frontier-class alternative in that case.

4. Benchmarks: what is real, what is marketing

Meta's launch deck claimed Maverick beats GPT-4o and Gemini 2.0 Flash on a broad range of benchmarks. The community spent April 2025 stress-testing those claims. The summary: Llama 4 is a competent frontier model, but on the benchmarks engineers actually care about — coding, hard reasoning, agentic tool use — it is not the leader.

| Benchmark | Llama 4 Scout | Llama 4 Maverick | DeepSeek V4 | Qwen 3.6 | Notes |
|---|---|---|---|---|---|
| MMLU-Pro | ~74 | ~80 | 92.8 | ~88 | Knowledge + reasoning |
| GPQA Diamond | ~57 | ~70 | ~82 | 86.0 | PhD-level science |
| LiveCodeBench | ~32 | ~43 | ~62 | ~55 | Contamination-resistant coding |
| SWE-Bench Verified | ~14 | ~24 | ~55 | ~49 (Pro) | End-to-end repo bug fixes |
| Needle-in-Haystack (10M) | ≈100% | n/a (1M) | n/a | n/a | Long-context retrieval |

Numbers are approximate and rounded from public reports (Meta model card, ArtificialAnalysis Intelligence Index, llm-stats, Composio, Spheron benchmarks). Treat them as directionally correct, not exact. Where a model has both reasoning and non-reasoning modes, scores reflect the best non-reasoning single-pass mode for fairness.

The pattern is consistent: Llama 4 holds its own on knowledge and long-context retrieval, but trails DeepSeek V4 and Qwen 3.5/3.6 on coding and hard reasoning. Comparisons with Claude 3.7 Sonnet and with GPT-4.5 tell the same story for closed-source frontier models.

The LMArena episode

Worth noting because it shapes how the market reads any Llama 4 benchmark today. In April 2025, Meta submitted "Llama-4-Maverick-03-26-Experimental" to LMArena — a variant tuned for human-preference voting, distinct from the public release weights. It topped the leaderboard. LMSYS later acknowledged the variant was not labeled clearly enough, and the public release of Maverick performs noticeably worse on the same arena.

Meta's VP of GenAI denied training on test sets. The community read it as benchmark gaming regardless. Practical impact: discount any Llama 4 chart that cites a single LMArena number, and look at code- and reasoning-specific benchmarks instead.

5. Hosted pricing: where to actually run it

Llama 4 is broadly available across the hosted-inference ecosystem. The price spread is significant — Bedrock and Azure cost roughly 3-5x what specialty inference shops charge. Numbers below are public list prices as of late April 2026, per million tokens.

| Provider | Scout (in / out) | Maverick (in / out) | Notes |
|---|---|---|---|
| Groq | $0.11 / $0.34 | $0.50 / $0.77 | Fastest tokens/sec; LPU hardware |
| Together AI | $0.18 / $0.59 | $0.27 / $0.85 | Mature, broad model catalog |
| DeepInfra | ~$0.08 / $0.30 | ~$0.20 / $0.60 | Cheapest blended price |
| Fireworks AI | ~$0.15 / $0.60 | ~$0.22 / $0.88 | Strong fine-tune hosting |
| AWS Bedrock | ~$0.17 / ~$0.66 | ~$0.35 / $0.80 | IAM-native, expensive at scale |
| Vertex AI (GCP) | Listed via Model Garden | Listed via Model Garden | Pricing tracks Bedrock |
| Azure AI Foundry | Listed | Listed | 15–40% over direct API pricing |

If you only care about price, DeepInfra and Groq are the floor. If you care about latency on Maverick, Groq's LPU is unmatched for short prompts. If you need an enterprise audit trail, Bedrock or Azure is what your procurement team will accept. If you need fine-tune hosting, Fireworks and Together are the practical choices.
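A back-of-envelope comparison using the Maverick list prices above. A sketch only; rate cards change, so recheck before committing volume.

```python
# $ per 1M tokens (input, output) for Maverick, from the table above.
MAVERICK_PRICES = {
    "Groq":        (0.50, 0.77),
    "Together AI": (0.27, 0.85),
    "DeepInfra":   (0.20, 0.60),
    "Fireworks":   (0.22, 0.88),
    "AWS Bedrock": (0.35, 0.80),
}

def monthly_cost(input_mtok: float, output_mtok: float) -> dict:
    """Monthly spend for a workload measured in millions of tokens."""
    return {provider: round(i * input_mtok + o * output_mtok, 2)
            for provider, (i, o) in MAVERICK_PRICES.items()}

# Example: 2B input tokens and 400M output tokens per month.
print(monthly_cost(2_000, 400))
# {'Groq': 1308.0, 'Together AI': 880.0, 'DeepInfra': 640.0,
#  'Fireworks': 792.0, 'AWS Bedrock': 1020.0}
```

Note that Groq is the priciest per token in this example yet wins on latency, which is exactly the trade described above.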

6. Self-hosting: hardware tiers and serving frameworks

You self-host Llama 4 for one of three reasons: data residency, per-token cost at very high volume, or fine-tune deployment without sending weights to a vendor. Hardware sizing depends on quantization.

Hardware tiers

  • Scout, Q4 quantized: Single 24-48GB GPU (RTX 4090 / 6000 Ada / A6000) for usable throughput, with the remainder of the ~55GB of Q4 weights offloaded to system RAM; 8GB VRAM is theoretically possible at the most aggressive quants but not production-grade. Budget system RAM for the offloaded experts, well above a 16GB floor.
  • Scout, Int4/FP8/FP16: Meta's "single H100" claim is technically true at Int4 (~55GB of weights). FP8 weights are ~109GB (2x H100 80GB) and FP16 ~218GB (4x 80GB-class GPUs).
  • Maverick, Q4 quantized: 2-4x H100 80GB or equivalent. The full expert weights still have to live in memory.
  • Maverick, FP8: 8x H100 node (640GB HBM total) holds the ~400GB of weights with room for KV cache. This is the production target; full FP16 (~800GB) needs multiple nodes.

Detailed install walkthroughs by OS: Ubuntu, macOS, Windows. For comparison with the smaller-footprint open model, see Gemma 4 vs Llama 4 local deployment.

Serving framework choice

  • vLLM — the default. Mature MoE support, broad community, easy OpenAI-compatible server. Use this unless you have a reason not to; a minimal launch sketch follows this list.
  • SGLang — better for shared-context workloads (chat, RAG, agents) thanks to RadixAttention; community reports up to ~29% throughput gains over vLLM in shared-prefix scenarios.
  • TGI (Hugging Face) — now in maintenance mode; HF themselves point new users at vLLM or SGLang. Avoid for new deployments.
  • Ollama — for laptops, dev machines, and small-team prototypes. Not a production serving stack.
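The promised vLLM launch sketch for Scout. The Hugging Face model ID below is an assumption (verify the exact repo name on the hub), and tensor parallelism should match the sizing estimates from section 2.

```python
from vllm import LLM, SamplingParams

# Model ID is an assumption -- verify the exact repo name on the Hugging Face hub.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=4,   # e.g. 4x 80GB-class GPUs for FP16 weights (~218GB)
    max_model_len=131072,     # don't allocate 10M tokens of KV cache unless you need it
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Summarize the licensing constraints on Llama 4."], params)
print(outputs[0].outputs[0].text)
```

For an HTTP deployment, the `vllm serve` CLI exposes the same engine behind an OpenAI-compatible API, which is what most teams put behind a load balancer.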

7. Fine-tuning Llama 4

Three viable paths:

  • Unsloth — currently the only stack with working 4-bit QLoRA for Llama 4 Scout, with claimed ~1.5x speedup and ~50% VRAM savings versus Flash Attention 2 baselines. The right choice if you are tuning on a single 80GB card; see the sketch at the end of this section.
  • torchtune — Meta's first-party PyTorch library. Supports full fine-tunes, LoRA, QLoRA, and RLHF/RLVR. Best when you want minimal abstraction and you trust your own infra.
  • Axolotl / LlamaFactory — config-driven, multi-GPU friendly, broad model coverage. Good for teams running tunes across many model families.

For Maverick, expect to need at least an 8x H100 node for any meaningful tune — even with QLoRA, the expert tensors are large. Most teams who fine-tune Llama 4 are tuning Scout and serving Maverick stock.
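For the common case (tuning Scout with Unsloth), the setup is a few lines. A sketch under assumptions: the pre-quantized repo name below is hypothetical, and the hyperparameters are starting points, not recommendations.

```python
from unsloth import FastLanguageModel

# Hypothetical repo name -- check Unsloth's hub page for the current 4-bit Scout variant.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-4-Scout-17B-16E-Instruct-bnb-4bit",
    max_seq_length=8192,
    load_in_4bit=True,        # QLoRA: 4-bit base weights, trainable LoRA adapters
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank; raise for more adapter capacity
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# From here, hand `model` and `tokenizer` to a TRL SFTTrainer as usual.
```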

8. How Llama 4 stacks up against the rest of the open-weight field

| Model | License | Coding | Reasoning | Long context | Multimodal | Best for |
|---|---|---|---|---|---|---|
| Llama 4 Maverick | Llama 4 Community | Mid | Mid | 1M | Yes | Multimodal apps, broad ecosystem support |
| Llama 4 Scout | Llama 4 Community | Mid | Mid | 10M | Yes | Long-document retrieval, single-GPU serving |
| DeepSeek V4 | MIT | Top tier | Top tier | 128K | Limited | Coding, math, frontier reasoning |
| Qwen 3.6 (35B-A3B) | Apache 2.0 | Top tier (sub-40B) | Top tier (sub-40B) | 1M | Yes | Best price/perf in the small-MoE class |
| Gemma 4 31B | Gemma terms | Mid | Above Scout on GPQA | 128K | Yes | On-device, edge, single-GPU inference |
| Mistral Medium 3.5 | Mistral commercial | Mid | Mid | 128K | Limited | EU-friendly hosted serving |

Detailed pairwise reads: Llama 4 vs Mistral 7B, Gemma 4 vs Llama 4, DeepSeek V4 complete guide.

9. Known limitations

  • Coding gap. Maverick trails DeepSeek V4 and Qwen 3.6 on SWE-bench Verified and LiveCodeBench by a meaningful margin. If your primary workload is code generation, agentic refactoring, or repo-level bug fixing, Llama 4 is not the strongest open-weight choice in 2026.
  • Long-context reasoning, not just retrieval. Scout's 10M context is real for needle-in-haystack tasks but degrades on tasks that require chained reasoning over the full window. Test on your actual workload before committing.
  • EU multimodal restriction. Vision is unavailable for EU-domiciled licensees. This is a hard block for many EU-based products.
  • 700M MAU clause. Not a problem for most companies. A serious problem if you are a hyperscaler or social platform.
  • License is not OSI-open. No "Open Source" claim, no Apache/MIT permissiveness, attribution requirements on derivatives.
  • LMArena trust deficit. The April 2025 "experimental" submission episode means leaderboard scores for Llama 4 are read with extra skepticism. Use task-specific benchmarks instead.
  • MoE memory tax. "17B active" is not the same as "17B model." You still need to load all experts. Plan VRAM for total parameters, not active.
  • Tool-use and function-calling reliability. Community testing reports inconsistent JSON adherence and tool-call formatting compared with Claude 3.7 and DeepSeek V4. Heavy agent stacks may need extra guardrails.

10. When to choose Llama 4

Choose Llama 4 if at least two of the following are true:

  • You need open-weight multimodal with text + image in the same model and you are not EU-domiciled.
  • You need a context window beyond 1M tokens, and your task is retrieval-shaped (find this clause in this contract) rather than reasoning-shaped (synthesize an argument across the entire contract).
  • Your inference budget is tight and you want Groq-class latency or DeepInfra-class price-per-token without operating your own GPUs.
  • Your stack is already on Meta's ecosystem (Llama Stack, torchtune, Meta-trained MetaCLIP) and you want continuity.
  • You want a battle-tested open-weight base for fine-tuning, with strong tooling (Unsloth, torchtune, Axolotl, LlamaFactory) and broad hosted-fine-tune support.

Skip Llama 4 if your top priority is frontier coding (DeepSeek V4 or Qwen 3.6 win), if you require Apache/MIT licensing (DeepSeek V4, Qwen), or if you are an EU-domiciled team that needs the vision modality.

FAQ

Is Llama 4 free to use commercially?

Yes, with conditions. The Llama 4 Community License permits commercial use for licensees with fewer than 700M MAU as of April 2025. Above that threshold you must request a separate license from Meta. Attribution ("Built with Llama") is required on derivatives.

Can I use Llama 4 in the European Union?

The text-only paths are usable, but the multimodal (vision) capabilities are excluded for EU-domiciled licensees under the current license terms. If you need image input in the EU, look at Qwen 3.5/3.6 VL, Mistral, or hosted closed-source models instead.

What is the difference between Llama 4 Scout and Maverick?

Both have 17B active parameters per token. Scout is 109B total with 16 experts and a 10M-token context. Maverick is ≈400B total with 128 experts and a 1M-token context. Maverick is the higher-capacity model; Scout is the long-context specialist.

Was Llama 4 Behemoth ever released?

No. Behemoth (≈2T parameters, 288B active, 16 experts) was previewed at the April 2025 launch as still in training. As of May 2026, public weights have not shipped.

How does Llama 4 Maverick compare to GPT-4o?

On Meta's launch benchmarks, Maverick edged GPT-4o on several multimodal and long-context tasks. On independent code and reasoning benchmarks (LiveCodeBench, SWE-Bench Verified), GPT-4o and successor closed models remain ahead. See the Llama 4 vs GPT-4.5 comparison for the head-to-head.

How does Llama 4 compare to DeepSeek V4?

DeepSeek V4 wins on coding (LiveCodeBench, SWE-Bench Verified), hard reasoning (GPQA, MMLU-Pro), and licensing (MIT vs Llama Community). Llama 4 wins on multimodality, ecosystem maturity, and Scout's 10M context. Pick DeepSeek for code agents; pick Llama for multimodal apps.

What hardware do I need to run Llama 4 Scout locally?

A single 24GB consumer GPU runs Scout at Q4 quantization with expert offload to system RAM. A single H100 80GB runs it at Int4; FP8 needs roughly 109GB of HBM (2x H100), and full FP16 roughly 218GB (4x 80GB-class GPUs). See the OS-specific guides for Ubuntu, macOS, and Windows.

Which hosted provider has the cheapest Llama 4?

DeepInfra typically has the lowest blended per-token price. Groq has the lowest latency. Together AI is the best balance of price, latency, and feature coverage. Bedrock and Azure are 3-5x the dedicated-inference shops but are the realistic choices when you need IAM, VPC, or enterprise audit.

Is the 10M context window real?

Real for retrieval tasks (needle-in-haystack) — Scout achieves near-perfect retrieval across 10M tokens. Less reliable for tasks that require chained reasoning over that full window. Validate on your workload before committing to the 10M context as a product feature.

Should I fine-tune Scout or Maverick?

Fine-tune Scout for domain adaptation, instruction tuning, and downstream tasks where 17B active and 109B total is sufficient. Use Maverick stock — fine-tuning it requires multi-node H100 infrastructure that most teams will not invest in. Unsloth is the most accessible Scout fine-tune path; torchtune is the best path if you want first-party tooling.

What was the LMArena controversy?

Meta submitted a variant called "Llama-4-Maverick-03-26-Experimental" — tuned for human-preference voting and distinct from the public release weights — to LMArena. It topped the leaderboard. LMSYS later acknowledged the labeling was not sufficiently clear, and the released Maverick performs noticeably worse on the same arena. Treat single-number leaderboard claims for Llama 4 with skepticism.

Does Llama 4 support function calling and tool use?

Yes, both Scout and Maverick support tool calling, and the model card documents the prompt format. Community reports note inconsistent JSON adherence compared with Claude 3.7 or DeepSeek V4, so production agent stacks should add validation and retry logic.
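A minimal guardrail pattern for the JSON-adherence issue: validate tool-call arguments and retry on failure. Sketched with the OpenAI-compatible client; the endpoint and model ID are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def call_with_validation(messages, tools, retries=2):
    """Retry the request when the model emits malformed tool-call arguments."""
    for _ in range(retries + 1):
        resp = client.chat.completions.create(
            model="meta-llama/llama-4-maverick",  # placeholder ID; varies by provider
            messages=messages,
            tools=tools,
        )
        calls = resp.choices[0].message.tool_calls
        if not calls:
            return resp                           # plain-text answer, nothing to validate
        try:
            for c in calls:
                json.loads(c.function.arguments)  # arguments must parse as JSON
            return resp
        except json.JSONDecodeError:
            messages = messages + [{
                "role": "user",
                "content": "Your last tool call had invalid JSON arguments. Call the tool again.",
            }]
    raise RuntimeError("model kept producing malformed tool calls")
```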

Can I run Llama 4 on a Mac?

Scout runs on Apple Silicon Macs at Q4 quantization via Ollama or llama.cpp; the Q4 weights are ~55GB, so plan on 64GB+ unified memory (more aggressive quants can fit smaller configurations). Maverick is impractical on consumer Apple hardware due to total parameter count. See Running Llama 4 on Mac.

Is Llama 4 a good choice for a coding agent?

Not the best choice in 2026. DeepSeek V4 and Qwen 3.6 lead on SWE-bench Verified and LiveCodeBench. Llama 4 is competent for general code completion and explanation but trails on agentic, repo-level tasks. If coding is the primary workload, choose accordingly.

Next steps

Llama 4 in production is a real engineering project: license review, provider selection, hardware sizing if you self-host, fine-tuning if you have proprietary data, and evaluation on your actual workload rather than launch-deck benchmarks. Most teams underestimate the evaluation work and overestimate how much the headline numbers transfer to their domain.

If you need senior Python and ML engineers who have shipped Llama, DeepSeek, or Qwen workloads in production — including vLLM/SGLang serving, QLoRA fine-tunes, and multimodal pipelines — Codersera matches you with vetted, remote-ready developers in days, not months. Hire a Codersera-vetted Python or ML engineer and extend your team with someone who already knows the trade-offs in this guide.