Running Llama 4 on Windows: Step-by-Step Install Guide (2026)

Running Llama 4 on Windows: Step-by-Step Install Guide (2026)

Last updated April 2026 — refreshed for current model/tool versions.

Meta's Llama 4 family (Scout, Maverick, and the still-internal Behemoth) is a Mixture-of-Experts (MoE) lineup with 17B active parameters and a native 10M-token context on Scout. Running it on Windows in 2026 is no longer exotic, but it is also not "double-click an .exe and you're done." This guide gives you the shortest verified path from a clean Windows 11 install to a working Llama 4 chat — Ollama for the simple route, llama.cpp for the fast/granular route, and WSL2 for everything in between.

What changed in 2026Llama 4 Scout (109B total / 17B active, 16 experts, 10M ctx) and Maverick (400B total / 17B active, 128 experts, 1M ctx) are GA with open weights on Hugging Face since April 5, 2025. Behemoth (~2T) remains unreleased as of April 2026.Ollama 0.22.0 (released April 28, 2026) ships native Windows binaries, built-in web search, and tool-calling for Gemma 4 and Llama 4 — ollama run llama4 just works.llama.cpp now publishes pre-built Windows CUDA binaries (CUDA 12.4 and 13.1, Apr 29 2026 builds). You no longer need to compile from source unless you want bleeding-edge kernels.VRAM math has shifted: Scout at Q4_K_M is ~55 GB and needs 2x consumer GPUs or a 64 GB+ workstation. Unsloth's 1.78-bit dynamic quant runs Scout in 24 GB at ~20 tok/s.WSL2 is now the recommended path for serious Linux-tooling parity — Microsoft ships GPU passthrough for NVIDIA + recent AMD/Intel by default in Windows 11 24H2.The original April 2025 LMArena ranking (ELO 1417) was inflated by an "Experimental" submission; the public Maverick model ranks 32nd with style-control on. Use MMLU/MMLU-Pro for sober comparisons.

Want the full picture? Read our continuously-updated Llama 4 Complete Guide (2026) — Scout and Maverick variants, MoE architecture, and deployment patterns.

TL;DR — pick your path

You haveBest routeRealistic modelSpeed
16 GB GPU (RTX 4060 Ti / 5060 Ti)Ollama, Q4 dynamic quantScout 1.78-bit (Unsloth)~20 tok/s
24 GB GPU (RTX 3090 / 4090 / 5090)Ollama or llama.cppScout Q4_K_M with CPU offload, or Scout 1.78-bit fully on GPU20–35 tok/s
2x 24 GB or 1x 48 GB+ (A6000, H100, RTX 6000 Ada)llama.cpp + tensor-parallelScout Q4_K_M / Q5_K_M fully on GPU40–80 tok/s
CPU-only, 64 GB RAMllama.cpp CPU buildScout Q4_K_M (slow)2–6 tok/s
Maverick (400B)Workstation with 192 GB+ unified RAM or 4x H100Maverick Q4_K_Mworkstation-class only

If your goal is "kick the tyres on Llama 4," install Ollama on Windows and run ollama run llama4. Everything below is for the people who need control.

System requirements (verified, April 2026)

  • OS: Windows 11 (22H2 or newer recommended; 24H2 if you want WSL2 GPU passthrough out of the box). Windows 10 still works for Ollama but is past mainstream support.
  • CPU: any modern x86_64 with AVX2; AVX-512 helps for CPU inference.
  • RAM: 32 GB minimum for Scout-class workloads; 64 GB if you plan to offload layers to CPU.
  • GPU: NVIDIA Ampere or newer (RTX 30/40/50 series, A-series, H-series). 16 GB VRAM is the practical floor for Scout. AMD ROCm on Windows via llama.cpp Vulkan backend works but is slower; Intel Arc via SYCL is supported in llama.cpp head.
  • Drivers: NVIDIA 555+ for CUDA 12.4 prebuilds, 575+ for CUDA 13.x.
  • Disk: 80 GB free for Scout weights + cache; 260 GB if you also pull Maverick.
  • Software: Python 3.11 or 3.12 (3.8 is EOL — do not use it in 2026), Git for Windows, optional Visual Studio 2022 Build Tools if compiling from source.

Path 1 — Ollama (the boring, reliable route)

Ollama 0.22.0 is the path of least resistance. It bundles the runtime, manages the model store at C:\Users\<you>\.ollama\models, and keeps your binary in sync without nuking weights on update.

Install

  1. Download OllamaSetup.exe from ollama.com/download/windows and run it. Or, from PowerShell: winget install Ollama.Ollama.

Pull and run Scout (the default llama4 tag is Scout, 16x17B, ~67 GB on disk):

ollama run llama4

For Maverick (245 GB on disk, requires very serious hardware): ollama run llama4:128x17b.

Open a fresh PowerShell and verify:

ollama --version

Useful environment variables

  • OLLAMA_NUM_GPU=99 — push as many layers as possible to GPU.
  • OLLAMA_KEEP_ALIVE=30m — keep the model resident to avoid reload latency.
  • OLLAMA_FLASH_ATTENTION=1 — enable Flash Attention 2 kernels (default-on in 0.22 for Llama 4, but verify).
  • OLLAMA_CONTEXT_LENGTH=131072 — Llama 4 Scout supports 10M tokens natively, but most consumer setups can't hold that KV cache. 128k is a comfortable ceiling on a 24 GB card.

If you're orchestrating Ollama as part of an agent stack, see our OpenClaw + Ollama setup guide for running local AI agents — it walks through wiring Ollama's new built-in web search and tool-calling endpoints into a working agent loop.

Path 2 — llama.cpp (control over every kernel)

Use llama.cpp when you want raw throughput, when you need to mix CPU/GPU offload precisely, or when Ollama's defaults don't expose the flag you need (sampler stacks, speculative decoding, custom rope scaling, etc.).

Option A — pre-built Windows CUDA binaries

  1. Go to github.com/ggml-org/llama.cpp/releases. Pick the latest cudart-llama-bin-win-cuda-12.4-x64.zip (or 13.1 if your driver is current).
  2. Extract to e.g. C:\llama.cpp\ and add it to PATH.

Run:

llama-cli ^
  --model C:\models\llama4-scout\Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf ^
  --ctx-size 131072 ^
  --n-gpu-layers 99 ^
  --flash-attn ^
  --temp 0.6 ^
  -p "Explain MoE routing in two paragraphs."

Pull a GGUF — Unsloth's quants are the de-facto standard for Llama 4:

huggingface-cli download unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf --local-dir C:\models\llama4-scout

Option B — build from source

Only do this if a feature you need landed in master after the most recent release. With Visual Studio 2022 Build Tools and CUDA Toolkit 12.4+ installed:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

Note: CUDA Toolkit 13.2 is required if you're on MSVC 2026; expect a CMP194 policy warning during configure (issue #20311 in the upstream repo) — it's safe to ignore.

Quantization choices for Scout

QuantVRAM (full)Quality vs BF16Notes
BF16~220 GBbaselinerequires multi-GPU server
Q8_0~110 GB≈ identicalused 3090 stack or H100
Q5_K_M~75 GB<1% drop2x 48 GB GPUs sweet spot
Q4_K_M~55 GB~1–2% drop2x 24 GB or workstation
IQ2_XXS~32 GBnoticeable dropfits a single 32 GB card
Unsloth 1.78-bit dynamic~22 GB~3–5% dropfits a single 24 GB card; recommended for solo devs

Path 3 — WSL2 (closest thing to a Linux box)

If your tooling assumes Linux paths (vLLM, SGLang, Unsloth fine-tuning, bitsandbytes), use WSL2. Windows 11 24H2 ships wsl --install -d Ubuntu-24.04 with NVIDIA GPU passthrough by default.

wsl --install -d Ubuntu-24.04
wsl --update
# inside the Ubuntu shell:
nvidia-smi              # confirm GPU is visible
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama4

WSL2 gets you the same Ollama or llama.cpp, plus working Unsloth/PEFT for fine-tuning Scout (~71 GB VRAM, fits a single H100). For inference-only workloads, native Windows is faster — there's no I/O overhead crossing the WSL boundary.

How to choose between the three paths

  • Just want to chat / build a small RAG app: Ollama. Stop reading.
  • Need top tok/s, custom samplers, speculative decoding, or to host a llama-server REST endpoint: llama.cpp.
  • Need to fine-tune or run vLLM/SGLang: WSL2.
  • Need 10M-token context inference: llama.cpp with KV-cache offloading and a workstation; consumer hardware can't hold a 10M ctx KV cache regardless of model size.

Performance and benchmarks (verified, 2026)

  • MMLU: Maverick scores ~91.8%, putting it within 0.5 pp of GPT-5.4 (92.3%) and ahead of Claude Sonnet 4 (90.5%). Scout sits around 86%, comparable to Llama 3.3 70B at a quarter of the active-parameter cost.
  • LMArena: Maverick's public weights rank 32nd with style-control enabled. Meta's original April-2025 "Experimental" submission scored 1417 ELO and ranked 2nd; that submission was a different finetune and the discrepancy is well-documented. Treat any "Llama-4-Maverick-03-26-Experimental" number with skepticism.
  • Long-context: independent testing (The Decoder, April 2025) found both models degrade noticeably past ~32k tokens on retrieval tasks. The 10M ctx is real on paper; useful precision past 256k is not.
  • Local throughput: Scout 1.78-bit on a single RTX 4090 lands ~20 tok/s; Q4_K_M on a pair of 4090s with NVLink lands ~45 tok/s with Flash Attention 2 enabled (community numbers from r/LocalLLaMA, April 2026).

Common pitfalls and troubleshooting

  • "CUDA out of memory" on Scout Q4_K_M with 24 GB: expected — Q4_K_M is 55 GB. Switch to Unsloth's 1.78-bit dynamic quant or use --n-gpu-layers below 99 to spill to CPU.
  • Ollama hangs on first run: first launch downloads ~67 GB. Watch %LOCALAPPDATA%\Ollama\logs\server.log; "model is loading" can take 60–90 s on cold start even after download.
  • llama-cli prints garbage tokens: mismatched tokenizer. Re-pull the GGUF; some early March-2025 quants had bad chat templates and were re-uploaded. Use the Unsloth or bartowski repos.
  • "command not recognized": add the llama.cpp folder to PATH or call binaries by absolute path. PowerShell sessions started before the PATH change won't see the update.
  • Windows Defender slows download: exclude C:\Users\<you>\.ollama\models and C:\models\ from real-time scanning.
  • Slow inference on a beefy GPU: verify --flash-attn is on, and verify no second process is holding VRAM. nvidia-smi -l 1 while the model is loading should show one process climbing to your target VRAM.
  • WSL2 GPU not visible: wsl --update, ensure NVIDIA driver 555+, and verify nvidia-smi works inside the Ubuntu shell. If it doesn't, fall back to native Windows.

What was removed and why

  • Python 3.8 — EOL since October 2024; many 2026 tooling chains require 3.11+.
  • The 16384 default ctx-size example — Scout's native ctx is 10M tokens. Defaulting to 16k gives the wrong impression of the model's capability and is no longer the helpful conservative choice it was in 2025; 128k is the sensible local default.
  • Generic "BF16/Q4_K_M" advice without VRAM numbers — readers were quantizing without realizing Q4_K_M is still 55 GB on Scout. Replaced with the explicit table above.
  • Naive git clone meta-llama/llama.cpp — the canonical upstream is ggml-org/llama.cpp. Meta's fork was archived; using the wrong remote is a common time sink.

FAQ

Can I run Llama 4 on a laptop?

Practically: only Scout, only at 1.78-bit dynamic quant, only on a laptop with a 16 GB+ discrete GPU (RTX 4090 mobile, 5090 mobile). Integrated graphics will not work for inference at usable speeds.

What's the difference between Scout and Maverick for local use?

Scout (16x17B, 109B total) is the only one most people should consider running locally. Maverick (128x17B, 400B total) needs ~245 GB of disk and a workstation-class memory pool. Behemoth (~2T) is not publicly released as of April 2026.

Is Llama 5 out?

As of April 2026, Llama 5 has not been confirmed released by Meta on the official llama.com or AI@Meta channels. Several speculative posts circulate; verify the model card on huggingface.co/meta-llama before believing any "Llama 5" tutorial.

Ollama vs llama.cpp — which is faster?

llama.cpp is marginally faster (5–15%) on identical hardware once you've tuned flags, because Ollama defaults are conservative. For most people the gap isn't worth the operational overhead.

Why does Scout need 55 GB at Q4 if it only "uses" 17B parameters?

MoE routes each token through a subset of experts, but every expert weight has to be resident in memory. The 17B "active" figure is a compute number, not a memory one. All 109B parameters must load.

Can I fine-tune Llama 4 on Windows?

Use WSL2 + Unsloth. Scout fine-tunes on a single H100 (80 GB) in ~71 GB VRAM with Unsloth's optimizations. On consumer GPUs, LoRA-only fine-tunes of a few experts are feasible at 24 GB.

Does Ollama support tool calling and image inputs for Llama 4?

Yes. Ollama 0.22.0 supports Llama 4's multimodal text-and-image inputs and the model's native tool-calling JSON format. The endpoint matches the OpenAI-compatible /v1/chat/completions shape so most agent frameworks work without changes.

Is there a commercial-use restriction?

The Llama 4 Community License allows commercial use up to 700 million monthly active users; above that you need a separate Meta agreement. Most teams are well under that ceiling — but read the license, because the terms also restrict using Llama outputs to train competing models.


References and further reading