Run Microsoft Phi-4 on Windows: Complete 2026 Installation Guide (All Variants)
Last updated April 2026 — refreshed for current model/tool versions.
Microsoft's Phi-4 family has grown substantially since this guide was first published. What started as a single 14B text-only model is now a suite of specialized small language models covering reasoning, multimodal input, and edge deployment — each installable on Windows in under 30 minutes. This guide covers every variant, every installation method, and the concrete hardware requirements you need before you start downloading.
What changed in 2026 — key updates from 2025:

- Phi-4 family expanded: Microsoft released Phi-4-mini (3.8B, Feb 2025), Phi-4-multimodal-instruct (5.6B, Feb 2025), Phi-4-reasoning and Phi-4-reasoning-plus (14B, Apr 2025), Phi-4-mini-reasoning (Apr 2025), and Phi-4-reasoning-vision-15B (Mar 2026).
- Microsoft Foundry Local is now the easiest Windows install path: `winget install Microsoft.FoundryLocal` plus `foundry model run phi-4-mini` downloads, optimizes for your GPU/CPU, and runs in one step — no Python or CUDA required.
- Ollama 0.22.0 (Apr 2026): official `phi4` and `phi4-mini` tags are available at ollama.com/library/phi4; 9.1 GB download for the 14B model.
- LM Studio 0.4.12 fully supports Phi-4 and Phi-4-mini in GGUF format (8.30 GB and 2.10 GB respectively).
- CUDA requirements updated: the CUDA 12.2 minimum from the original post is superseded — current PyTorch wheels ship CUDA 12.6 binaries by default; cuDNN 9.x is bundled automatically via pip.
- Phi-4-reasoning-plus on AIME 2025: 82.5% accuracy — comparable to the full DeepSeek-R1 (671B) despite being 14B parameters.
The Phi-4 Family: Which Model Do You Actually Need?
| Model | Parameters | Context | Modalities | VRAM (Q4) | Best for |
|---|---|---|---|---|---|
| Phi-4 | 14B | 16K | Text | ~10 GB | General reasoning, coding, math |
| Phi-4-mini | 3.8B | 128K | Text + function calling | ~3 GB | Low-RAM laptops, agents, function calling |
| Phi-4-multimodal | 5.6B | 128K | Text + vision + audio | ~5 GB | Image QA, speech transcription |
| Phi-4-reasoning | 14B | 32K | Text | ~10 GB | Step-by-step math/logic chains |
| Phi-4-reasoning-plus | 14B | 32K | Text | ~10 GB | Competition-level math (AIME, Omni-Math) |
| Phi-4-mini-reasoning | 3.8B | 128K | Text | ~3 GB | Reasoning on low-VRAM hardware |
All models are MIT-licensed. For most Windows users without a dedicated workstation GPU, start with Phi-4-mini (Foundry Local or Ollama). If you have an RTX 3060 12 GB or better, run Phi-4 (14B). For math-intensive workflows, reach for Phi-4-reasoning-plus.
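As a rough rule of thumb, the selection logic in the table above can be sketched in a few lines of Python. The thresholds are the approximate Q4 VRAM figures from the table; the function name and its defaults are illustrative, not an official tool:

```python
def pick_phi4_variant(vram_gb: float, needs_vision_or_audio: bool = False,
                      math_heavy: bool = False) -> str:
    """Rule of thumb based on the approximate Q4 VRAM figures in the table."""
    if needs_vision_or_audio:
        return "Phi-4-multimodal"  # only variant with vision + audio, ~5 GB at Q4
    if vram_gb >= 10:
        # 14B models fit: pick the reasoning fine-tune for math-heavy work
        return "Phi-4-reasoning-plus" if math_heavy else "Phi-4"
    # 3.8B models, ~3 GB at Q4, for constrained hardware
    return "Phi-4-mini-reasoning" if math_heavy else "Phi-4-mini"

print(pick_phi4_variant(12))  # RTX 3060 12 GB -> Phi-4
print(pick_phi4_variant(4))   # low-VRAM laptop -> Phi-4-mini
```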
System Requirements
Hardware
| Component | Minimum (Phi-4-mini) | Recommended (Phi-4 14B) | Power user (Phi-4 14B full precision) |
|---|---|---|---|
| GPU | Integrated / any 6 GB VRAM | RTX 3060 12 GB | RTX 3090 / 4090 24 GB |
| RAM | 16 GB DDR4 | 32 GB DDR4 | 64 GB DDR5 |
| Disk | 10 GB NVMe | 25 GB NVMe | 50 GB NVMe |
| OS | Windows 10 64-bit | Windows 11 64-bit | Windows 11 64-bit |
Software prerequisites (GPU path)
- NVIDIA CUDA Toolkit 12.4+ (12.6 preferred — current PyTorch pip wheels target 12.6 by default)
- cuDNN 9.x — bundled automatically when you `pip install torch` via the PyTorch index URL; no manual install needed
- Python 3.10–3.12 (3.12 is the current recommended release)
- Git 2.43+
- Visual Studio Build Tools 2022 (required only if building `flash-attn` from source)
CPU-only / no-GPU path: Microsoft Foundry Local handles all of this automatically, including NPU acceleration on Copilot+ PCs. No CUDA install needed.
Installation Methods
Method 1: Microsoft Foundry Local (Easiest — No Python Required)
Microsoft Foundry Local was released in 2025 and is the lowest-friction way to run any Phi-4 model locally on Windows. It installs via winget, detects your GPU/NPU automatically, and provides an OpenAI-compatible API endpoint — no Python, no CUDA install, no virtual environments.
Install Foundry Local:
```shell
winget install Microsoft.FoundryLocal
```

Run Phi-4-mini (3.8B, 3.72 GB GPU download):

```shell
foundry model run phi-4-mini
```

Run Phi-4 14B:

```shell
foundry model run phi-4
```

Foundry Local automatically selects the GPU-optimized variant if you have a compatible NVIDIA GPU, or falls back to the CPU-only ONNX model (4.80 GB for phi-4-mini). The interactive prompt starts immediately after the download. Type `/bye` to exit.
System requirements: Windows 10/11, 16 GB RAM minimum (32 GB recommended). No Azure subscription required.
Method 2: Ollama (Best for API access and scripts)
Ollama version 0.22.0 (April 2026) ships official phi4 and phi4-mini tags. It exposes a local REST API compatible with the OpenAI client library, making it easy to integrate Phi-4 into your own applications. If you are exploring different local AI setups, our OpenClaw + Ollama setup guide for running local AI agents covers advanced agent configurations beyond basic model serving.
Install Ollama on Windows:
- Download the Windows installer from ollama.com/download (no administrator rights required; installs to your home directory).
- Run the installer and follow the prompts.
- Verify: open a new terminal and run `ollama --version`.
Pull and run Phi-4 (14B, 9.1 GB download):

```shell
ollama pull phi4
ollama run phi4
```

Pull and run Phi-4-mini (3.8B, ~2.5 GB download):

```shell
ollama pull phi4-mini
ollama run phi4-mini
```

Use as an OpenAI-compatible API:

```shell
ollama serve
```

Then call http://localhost:11434/v1/chat/completions with any OpenAI client. Ollama handles GPU detection automatically on Windows.
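For example, once the server is running you can hit that endpoint with nothing but the standard library. This is a minimal sketch: the `phi4` model must already be pulled, and the request falls through gracefully if the server is down:

```python
import json
from urllib import request, error

# Request body in the OpenAI chat-completions format that Ollama accepts.
payload = {
    "model": "phi4",
    "messages": [{"role": "user", "content": "Summarize the chain rule in one sentence."}],
    "stream": False,
}
req = request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=60) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
except (error.URLError, OSError):
    print("Ollama is not running; start it with `ollama serve` first.")
```

The same payload works unchanged with the official `openai` Python client pointed at `base_url="http://localhost:11434/v1"`.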
Method 3: LM Studio (Best for non-developers with a GUI)
LM Studio 0.4.12 fully supports Phi-4 and Phi-4-mini on Windows 10 and 11. It provides a chat interface, model comparison, and a local server mode with a GUI — no command line required.
- Download LM Studio from lmstudio.ai (auto-detects Windows).
- In the Models tab, search for phi-4 or phi-4-mini.
- Choose a GGUF quantization:
- Q4_K_M — best balance of size and quality (8.30 GB for Phi-4 14B, 2.10 GB for Phi-4-mini).
- Q5_K_M — higher quality, ~10 GB for Phi-4 14B.
- Q8_0 — near-lossless, needs 14+ GB VRAM for Phi-4 14B.
- Click Download.
- Select the model in the chat interface and click Run.
Enable GPU acceleration under Settings → GPU. Allocate 80–90% of VRAM for best throughput. The context window defaults to 4096 tokens; raise it to 16384 under Model Config if your VRAM allows.
Method 4: Python + HuggingFace Transformers (Most flexible)
Use this path if you need fine-grained control, are building an application, or need the Phi-4-multimodal vision/audio capabilities.
Step 1: Create a virtual environment
```shell
mkdir phi4-project
cd phi4-project
python -m venv venv
venv\Scripts\activate
```

Step 2: Install PyTorch with CUDA 12.6 support

```shell
# GPU (NVIDIA, CUDA 12.6 — includes cuDNN 9 automatically)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# CPU-only (slower but works on any machine)
pip install torch torchvision torchaudio
```

Step 3: Install model dependencies

```shell
pip install "transformers>=4.48.2" accelerate huggingface-hub
```

Step 4: Download and run Phi-4 (14B text model)
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="microsoft/phi-4", local_dir="./phi4-model")
```

```python
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="./phi4-model",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the chain rule in calculus with an example."},
]
output = pipeline(messages, max_new_tokens=512)
print(output[0]["generated_text"][-1]["content"])
```

Method 5: Phi-4-multimodal (Vision + Audio on Windows)
Phi-4-multimodal-instruct (5.6B, released February 2025) supports text, vision, and audio in a single model. It ranks #1 on the HuggingFace OpenASR leaderboard (6.02% WER as of early 2026). Install dependencies beyond the base transformers stack:
```shell
pip install "transformers>=4.48.2" accelerate pillow soundfile scipy peft
# flash_attn optional but recommended for RTX 40-series cards
pip install flash-attn --no-build-isolation
```

Download the model:

```shell
huggingface-cli download microsoft/Phi-4-multimodal-instruct --local-dir ./phi4-mm
```

Image analysis example:
```python
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

model_path = "./phi4-mm"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
).eval()
generation_config = GenerationConfig.from_pretrained(model_path)

image = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
prompt = "<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=512, generation_config=generation_config)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```

Audio transcription example:
```python
import io
from urllib.request import urlopen

import soundfile as sf

audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4.flac"
audio, rate = sf.read(io.BytesIO(urlopen(audio_url).read()))
prompt = "<|user|><|audio_1|>Transcribe the audio.<|end|><|assistant|>"
inputs = processor(text=prompt, audios=[(audio, rate)], return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=1000, generation_config=generation_config)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```

Method 6: Docker with Ollama (Isolated environment)
Useful when you want to keep the model runtime isolated from your main Windows environment, or when deploying to Windows Server 2025.
```shell
# Requires Docker Desktop with WSL2 backend and NVIDIA Container Toolkit
docker run -d --gpus all -p 11434:11434 --name ollama ollama/ollama:latest
docker exec ollama ollama pull phi4
docker exec -it ollama ollama run phi4
```

How to Choose Your Installation Method
- No Python experience, just want to chat: → LM Studio 0.4.12
- Quickest path, any hardware, Windows 10/11: → Microsoft Foundry Local (`winget install Microsoft.FoundryLocal`)
- Need an OpenAI-compatible API for your app: → Ollama + `ollama serve`
- Need vision or audio input: → Python + Phi-4-multimodal-instruct (Method 5)
- Constrained laptop, 16 GB RAM, no discrete GPU: → Phi-4-mini via Foundry Local or Ollama
- Competition math, step-by-step reasoning chains: → Phi-4-reasoning-plus via Python transformers
- Production deployment / isolated environment: → Docker + Ollama
Performance and Benchmarks
Benchmark scores (official Microsoft / Hugging Face)
| Benchmark | Phi-4 14B | Phi-4-mini 3.8B | GPT-4o-mini | Llama 3.3 70B |
|---|---|---|---|---|
| MMLU | 84.8% | 73.0% | 81.8% | ~83% |
| MATH | 80.4% | 62.0% | 73.0% | — |
| GPQA (Science) | 56.1% | — | 40.9% | — |
| HumanEval (Code) | 82.6% | — | 86.2% | — |
| AIME 2025 | — | — | — | — |
Phi-4-reasoning-plus achieves 82.5% on AIME 2025 (an average-of-5 run) — comparable to full DeepSeek-R1 (671B) despite having 14B parameters. Note that AIME 2025 has only 30 problems; run-to-run variance of 5–10 percentage points is expected, as Microsoft acknowledges in the technical report.
Windows inference throughput (measured on consumer hardware, GGUF Q4_K_M via Ollama)
| GPU | Model | Tokens/sec | VRAM used |
|---|---|---|---|
| RTX 3060 12 GB | Phi-4 14B Q4_K_M | ~18–22 | ~10 GB |
| RTX 3090 24 GB | Phi-4 14B Q4_K_M | ~40–48 | ~10 GB |
| RTX 4090 24 GB | Phi-4 14B Q4_K_M | ~65–80 | ~10 GB |
| CPU only (Core i9) | Phi-4-mini Q4_K_M | ~6–10 | — (RAM) |
Throughput ranges reflect community-reported values from r/LocalLLaMA and local benchmarking blogs; verify on your hardware — exact speeds vary by driver version, system RAM speed, and thermal state.
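To reproduce these figures on your own machine, Ollama's native `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), which divide out to tokens per second. A small helper (the sample figures below are illustrative, not a fresh measurement):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Compute generation throughput from Ollama's /api/generate metrics:
    eval_count is tokens generated, eval_duration is in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

# Example: 512 tokens generated in 25.6 s is 20 tok/s,
# in line with the RTX 3060 row above.
print(round(tokens_per_second(512, 25_600_000_000), 1))  # 20.0
```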
Optimization and Performance Tuning
Quantization selection guide (GGUF)
For Phi-4 14B in GGUF format (available via LM Studio and Ollama):
- Q3_K_M — 6.5 GB, fits in 8 GB VRAM with headroom. Visible quality drop on complex reasoning.
- Q4_K_M — 8.3 GB, best balance. Recommended for most users. Retains ~95% of full-precision quality on reasoning tasks.
- Q5_K_M — 9.8 GB, higher fidelity math and code output. Use if you have 12 GB VRAM.
- Q8_0 — 14.7 GB, near-lossless. Needs an RTX 3090/4090 for full GPU inference.
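These sizes follow directly from bits-per-weight arithmetic. A back-of-envelope estimator (the ~4.8 bits/weight average for Q4_K_M is an approximation; real files add metadata and a mix of tensor precisions, so expect some deviation):

```python
def approx_gguf_gb(params_billion: float, bits_per_weight: float) -> float:
    """Back-of-envelope GGUF size: parameters * bits-per-weight / 8.
    Ignores metadata and the mixed-precision tensors real quants use."""
    return params_billion * bits_per_weight / 8

print(round(approx_gguf_gb(14, 4.8), 1))   # ~8.4, close to the 8.3 GB above
print(round(approx_gguf_gb(3.8, 4.8), 1))  # ~2.3 for Phi-4-mini
```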
Flash Attention 2 (transformers path, RTX 30/40-series)
```python
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```

Flash Attention 2 requires a GPU with compute capability 8.0+ (Ampere or newer). On older cards (RTX 2080, V100), omit this argument or use `attn_implementation="eager"`.
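If you want to pick the argument programmatically, the capability rule is easy to encode. This is a sketch; on a live system you would feed in the tuple returned by `torch.cuda.get_device_capability()`:

```python
def choose_attn_implementation(capability: tuple) -> str:
    """Flash Attention 2 needs compute capability 8.0+ (Ampere or newer);
    older cards fall back to the default eager implementation."""
    major, minor = capability
    return "flash_attention_2" if (major, minor) >= (8, 0) else "eager"

print(choose_attn_implementation((8, 6)))  # RTX 3090, sm_86 -> flash_attention_2
print(choose_attn_implementation((7, 5)))  # RTX 2080, sm_75 -> eager
```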
Batch processing for throughput
```python
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    batch_size=4,
    max_new_tokens=512,
)
```

Common Issues and Troubleshooting
CUDA out of memory (OOM)
- Switch to a lower GGUF quantization (Q4_K_M instead of Q8_0).
- Use `device_map="auto"` with `max_memory={0: "10GiB", "cpu": "20GiB"}` to offload layers to RAM.
- In Ollama, set the `OLLAMA_GPU_LAYERS=20` environment variable to offload only some layers to the GPU and the rest to the CPU.
- Phi-4-mini (Q4_K_M) requires only ~3 GB VRAM — switch models if 14B doesn't fit.
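For the transformers path, the `max_memory` offload mentioned above is just a kwargs dict you splat into `from_pretrained`. The budgets here are examples for a 12 GB card; tune them to your hardware:

```python
# kwargs for AutoModelForCausalLM.from_pretrained: cap GPU 0 at 10 GiB and
# spill the remaining layers into up to 20 GiB of system RAM.
offload_kwargs = {
    "device_map": "auto",
    "max_memory": {0: "10GiB", "cpu": "20GiB"},
    "torch_dtype": "auto",
}
# model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4", **offload_kwargs)
print(offload_kwargs["max_memory"])
```

CPU-offloaded layers run far slower than GPU layers, so treat this as a way to avoid OOM crashes rather than a performance feature.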
DLL load errors on Windows
- Install the latest Visual C++ Redistributable 2022.
- If using Ollama, reinstall via the .exe installer (not winget) to ensure proper PATH configuration.
- Ensure Python and CUDA bin directories are in your `%PATH%`.
Slow inference (CPU fallback)
- Run `ollama run phi4 --verbose` — the output shows whether GPU layers are being used. If all layers show as CPU, your GPU is not being detected.
- In LM Studio: Settings → GPU → toggle GPU Acceleration on and restart the model.
- Foundry Local: run `foundry device list` to confirm GPU detection.
- Confirm CUDA installation: `nvidia-smi` should show your GPU and driver version.
Foundry Local: winget package not found
- Update winget: open Microsoft Store → Library → Get Updates.
- Alternatively, download the installer directly from Microsoft Learn: Get started with Foundry Local.
Hugging Face authentication errors
- Phi-4 and all Phi-4 variants are MIT-licensed and publicly accessible. If you see a 401 error, run `huggingface-cli login` with a free HF account token.
transformers version incompatibility
- Phi-4-multimodal requires `transformers>=4.48.2`. Run `pip install --upgrade transformers` if you see errors like `AttributeError: 'Phi4MMForCausalLM' object has no attribute...`.
Use Case Examples
Local coding assistant with Phi-4
```python
# Reuses the `pipeline` object created in Method 4.
messages = [
    {"role": "system", "content": "You are an expert Python developer. Write clean, documented code."},
    {"role": "user", "content": "Write a Python function that parses a JWT token without external libraries."},
]
output = pipeline(messages, max_new_tokens=1024)
print(output[0]["generated_text"][-1]["content"])
```

Step-by-step math reasoning (Phi-4-reasoning-plus)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning-plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Find all integer solutions to x^2 - 5x + 6 = 0 and explain each step."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=2048, temperature=0.8, do_sample=True)
print(tokenizer.decode(ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

Building with Local AI? Get Help from Remote Developers Who've Done It
If you are integrating a local Phi-4 deployment into a production application — whether for private data processing, a reasoning pipeline, or an embedded AI feature — Codersera can connect you with vetted remote developers who have hands-on experience with local LLM integration on Windows and Linux.
FAQ
Can Phi-4 run on a laptop without a GPU?
Yes. Use Microsoft Foundry Local or Ollama with Phi-4-mini (3.8B). On a modern Intel/AMD laptop with 16 GB RAM, CPU-only inference produces roughly 5–10 tokens/second — slow but functional for development and testing.
Is Phi-4 free to use commercially?
All Phi-4 variants are released under the MIT License, which allows commercial use. Download them from the Microsoft organization on Hugging Face without any subscription or usage fee.
What is the difference between Phi-4 and Phi-4-reasoning-plus?
Phi-4 (14B) is a general-purpose model trained on synthetic and curated data for reasoning, coding, and math. Phi-4-reasoning-plus is the same architecture fine-tuned via outcome-based reinforcement learning to generate extended reasoning chains — similar to how OpenAI o-series models work. Phi-4-reasoning-plus achieves 82.5% on AIME 2025 vs. Phi-4's weaker performance on competition math, at the cost of significantly longer outputs and slower inference.
Does Phi-4 support languages other than English?
Phi-4 (14B text model) is primarily English-focused. Phi-4-mini has a 200,000-token vocabulary with improved multilingual support across 23 languages for text. Phi-4-multimodal supports audio input in English, Chinese, German, French, Italian, Japanese, Spanish, and Portuguese.
How does Phi-4 compare to Llama 4 and Qwen 3 for local use?
At 14B parameters, Phi-4 beats Llama 3.3-70B on several reasoning benchmarks (GPQA: 56.1% vs. ~46%). For general-purpose chat and coding, Qwen 3 and Llama 4 Scout offer competitive quality, but Phi-4's MIT license and smaller footprint make it the default choice when VRAM is constrained. Phi-4-reasoning-plus specifically excels on structured math — Llama 4 and Qwen 3 do not yet match it on AIME benchmarks at comparable parameter counts.
Can I run Phi-4-multimodal on Windows with just a CPU?
Technically yes, but it is slow — 5.6B parameters in full precision requires significant RAM, and audio/vision preprocessing adds overhead. For CPU-only multimodal use, target Phi-4-multimodal in Q4 quantization via a compatible llama.cpp build (llama.cpp added vision support in late 2024). For practical performance, a 6 GB VRAM GPU is recommended.
Does LM Studio support Phi-4-multimodal?
As of LM Studio 0.4.12, multimodal support (vision input) is available for compatible GGUF models. Check the LM Studio changelog at lmstudio.ai/blog for the current state — the team ships updates frequently. For guaranteed multimodal capability, use the Python transformers path (Method 5 above).
What happened to the vanilj/Phi-4 Ollama model from earlier guides?
The vanilj/Phi-4 community model was a third-party GGUF upload used before Microsoft published official model weights on Ollama's library. Microsoft now maintains the official phi4 and phi4-mini tags at ollama.com/library/phi4. Use those instead — they are verified, updated, and maintained by Microsoft.
Related guides on Codersera
- Run Microsoft Phi-4 on Mac: Installation Guide
- Run Microsoft Phi-4 on Ubuntu: A Comprehensive Guide
- Run DeepSeek Janus-Pro 7B on Windows: A Complete Installation Guide
References and Further Reading
- microsoft/phi-4 on Hugging Face — official model card with benchmark scores and usage examples
- microsoft/Phi-4-multimodal-instruct on Hugging Face — 5.6B multimodal model card, released February 2025
- Phi-4-reasoning Technical Report (arXiv 2504.21318) — Microsoft Research, April 2025
- Running Phi-4 Locally with Microsoft Foundry Local — Microsoft Tech Community
- Get started with Foundry Local — Microsoft Learn official docs
- phi4 on Ollama Library — official pull commands and model tags
- microsoft/phi-4 on LM Studio — GGUF download sizes and model catalog
- Welcome to the new Phi-4 models — Phi-4-mini & Phi-4-multimodal — Microsoft Tech Community, February 2025