Run Microsoft Phi-4 on Windows: Complete 2026 Installation Guide (All Variants)

Last updated April 2026 — refreshed for current model/tool versions.

Microsoft's Phi-4 family has grown substantially since this guide was first published. What started as a single 14B text-only model is now a suite of specialized small language models covering reasoning, multimodal input, and edge deployment — each installable on Windows in under 30 minutes. This guide covers every variant, every installation method, and the concrete hardware requirements you need before you start downloading.

What changed in 2026 (key updates from 2025):

  • Phi-4 family expanded: Microsoft released Phi-4-mini (3.8B, Feb 2025), Phi-4-multimodal-instruct (5.6B, Feb 2025), Phi-4-reasoning and Phi-4-reasoning-plus (14B, Apr 2025), Phi-4-mini-reasoning (Apr 2025), and Phi-4-reasoning-vision-15B (Mar 2026).
  • Microsoft Foundry Local is now the easiest Windows install path: winget install Microsoft.FoundryLocal followed by foundry model run phi-4-mini downloads, optimizes for your GPU/CPU, and runs in one step; no Python or CUDA required.
  • Ollama 0.22.0 (Apr 2026): official phi4 and phi4-mini tags available at ollama.com/library/phi4; 9.1 GB download for the 14B model.
  • LM Studio 0.4.12 fully supports Phi-4 and Phi-4-mini in GGUF format (8.30 GB and 2.10 GB respectively).
  • CUDA requirements updated: the CUDA 12.2 minimum from the original post is superseded; current PyTorch wheels ship CUDA 12.6 binaries by default, and cuDNN 9.x is bundled automatically via pip.
  • Phi-4-reasoning-plus on AIME 2025: 82.5% accuracy, comparable to full DeepSeek-R1 (671B) despite being 14B parameters.

The Phi-4 Family: Which Model Do You Actually Need?

Model | Parameters | Context | Modalities | VRAM (Q4) | Best for
Phi-4 | 14B | 16K | Text | ~10 GB | General reasoning, coding, math
Phi-4-mini | 3.8B | 128K | Text + function calling | ~3 GB | Low-RAM laptops, agents, function calling
Phi-4-multimodal | 5.6B | 128K | Text + vision + audio | ~5 GB | Image QA, speech transcription
Phi-4-reasoning | 14B | 32K | Text | ~10 GB | Step-by-step math/logic chains
Phi-4-reasoning-plus | 14B | 32K | Text | ~10 GB | Competition-level math (AIME, Omni-Math)
Phi-4-mini-reasoning | 3.8B | 128K | Text | ~3 GB | Reasoning on low-VRAM hardware

All models are MIT-licensed. For most Windows users without a dedicated workstation GPU, start with Phi-4-mini (Foundry Local or Ollama). If you have an RTX 3060 12 GB or better, run Phi-4 (14B). For math-intensive workflows, reach for Phi-4-reasoning-plus.

System Requirements

Hardware

Component | Minimum (Phi-4-mini) | Recommended (Phi-4 14B) | Power user (Phi-4 14B full precision)
GPU | Integrated / any 6 GB VRAM | RTX 3060 12 GB | RTX 3090 / 4090 24 GB
RAM | 16 GB DDR4 | 32 GB DDR4 | 64 GB DDR5
Disk | 10 GB NVMe | 25 GB NVMe | 50 GB NVMe
OS | Windows 10 64-bit | Windows 11 64-bit | Windows 11 64-bit

Software prerequisites (GPU path)

  • NVIDIA CUDA Toolkit 12.4+ (12.6 preferred — current PyTorch pip wheels target 12.6 by default)
  • cuDNN 9.x — bundled automatically when you pip install torch via the PyTorch index URL; no manual install needed
  • Python 3.10–3.12 (3.12 is the current recommended release)
  • Git 2.43+
  • Visual Studio Build Tools 2022 (required only if building flash-attn from source)

CPU-only / no-GPU path: Microsoft Foundry Local handles all of this automatically, including NPU acceleration on Copilot+ PCs. No CUDA install needed.

Installation Methods

Method 1: Microsoft Foundry Local (Easiest — No Python Required)

Microsoft Foundry Local was released in 2025 and is the lowest-friction way to run any Phi-4 model locally on Windows. It installs via winget, detects your GPU/NPU automatically, and provides an OpenAI-compatible API endpoint — no Python, no CUDA install, no virtual environments.

Install Foundry Local:

winget install Microsoft.FoundryLocal

Run Phi-4-mini (3.8B, 3.72 GB GPU download):

foundry model run phi-4-mini

Run Phi-4 14B:

foundry model run phi-4

Foundry Local automatically selects the GPU-optimized variant if you have a compatible NVIDIA GPU, or falls back to the CPU-only ONNX model (4.80 GB for phi-4-mini). The interactive prompt starts immediately after the download. Type /bye to exit.

System requirements: Windows 10/11, 16 GB RAM minimum (32 GB recommended). No Azure subscription required.
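
Foundry Local also exposes an OpenAI-compatible endpoint while a model is running, so you can script against it from Python. The snippet below is a minimal sketch, assuming the openai package is installed; the base URL and model alias are placeholders, so substitute the endpoint and alias your own Foundry Local service reports (check foundry service status and foundry model list).

from openai import OpenAI

# Minimal sketch. Assumptions: `pip install openai`; Foundry Local is already running a model.
# The base_url below is a placeholder -- use the endpoint your Foundry Local service reports.
client = OpenAI(
    base_url="http://localhost:5273/v1",   # placeholder port; confirm with `foundry service status`
    api_key="not-needed-for-local",        # the local endpoint ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="phi-4-mini",                    # use the alias shown by `foundry model list`
    messages=[{"role": "user", "content": "Summarize the Pythagorean theorem in one sentence."}],
)
print(response.choices[0].message.content)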

Method 2: Ollama (Best for API access and scripts)

Ollama version 0.22.0 (April 2026) ships official phi4 and phi4-mini tags. It exposes a local REST API compatible with the OpenAI client library, making it easy to integrate Phi-4 into your own applications. If you are exploring different local AI setups, our OpenClaw + Ollama setup guide for running local AI agents covers advanced agent configurations beyond basic model serving.

Install Ollama on Windows:

  1. Download the Windows installer from ollama.com/download (no administrator rights required; installs to your home directory).
  2. Run the installer and follow the prompts.
  3. Verify: open a new terminal and run ollama --version.

Pull and run Phi-4 (14B, 9.1 GB download):

ollama pull phi4
ollama run phi4

Pull and run Phi-4-mini (3.8B, ~2.5 GB download):

ollama pull phi4-mini
ollama run phi4-mini

Use as an OpenAI-compatible API:

ollama serve

Then call http://localhost:11434/v1/chat/completions with any OpenAI client. Ollama handles GPU detection automatically on Windows.
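
For example, a minimal Python call against the local Ollama endpoint (assumes pip install openai; Ollama accepts any placeholder API key):

from openai import OpenAI

# Ollama's OpenAI-compatible endpoint listens on port 11434 by default.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="phi4",  # or "phi4-mini"
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a haiku about local LLMs."},
    ],
)
print(response.choices[0].message.content)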

Method 3: LM Studio (Best for non-developers with a GUI)

LM Studio 0.4.12 fully supports Phi-4 and Phi-4-mini on Windows 10 and 11. It provides a chat interface, model comparison, and a local server mode with a GUI — no command line required.

  1. Download LM Studio from lmstudio.ai (auto-detects Windows).
  2. In the Models tab, search for phi-4 or phi-4-mini.
  3. Choose a GGUF quantization:
    • Q4_K_M — best balance of size and quality (8.30 GB for Phi-4 14B, 2.10 GB for Phi-4-mini).
    • Q5_K_M — higher quality, ~10 GB for Phi-4 14B.
    • Q8_0 — near-lossless, needs 14+ GB VRAM for Phi-4 14B.
  4. Click Download.
  5. Select the model in the chat interface and click Run.

Enable GPU acceleration under Settings → GPU. Allocate 80–90% of VRAM for best throughput. The context window defaults to 4096 tokens; raise it to 16384 under Model Config if your VRAM allows.
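
LM Studio's server mode also speaks the OpenAI API. Once you start the local server from the app (port 1234 by default), a short Python sketch like the following can talk to it; it assumes the openai package is installed and that the model identifier matches what LM Studio shows in its server panel.

from openai import OpenAI

# LM Studio's local server defaults to http://localhost:1234/v1 (adjust if you changed the port).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="phi-4",  # use the model identifier shown in LM Studio's server panel
    messages=[{"role": "user", "content": "Explain GGUF quantization in two sentences."}],
)
print(response.choices[0].message.content)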

Method 4: Python + HuggingFace Transformers (Most flexible)

Use this path if you need fine-grained control, are building an application, or need the Phi-4-multimodal vision/audio capabilities.

Step 1: Create a virtual environment

mkdir phi4-project
cd phi4-project
python -m venv venv
venv\Scripts\activate

Step 2: Install PyTorch with CUDA 12.6 support

# GPU (NVIDIA, CUDA 12.6 — includes cuDNN 9 automatically)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# CPU-only (slower but works on any machine)
pip install torch torchvision torchaudio

Step 3: Install model dependencies

pip install "transformers>=4.48.2" accelerate huggingface-hub
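
Before downloading any weights, it is worth confirming that PyTorch can actually see your GPU. A quick check, assuming the install from Step 2 succeeded:

import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))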

Step 4: Download and run Phi-4 (14B text model)

import transformers
from huggingface_hub import snapshot_download

# Download the weights once; later runs reuse the local copy in ./phi4-model
snapshot_download(repo_id="microsoft/phi-4", local_dir="./phi4-model")

pipeline = transformers.pipeline(
    "text-generation",
    model="./phi4-model",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the chain rule in calculus with an example."},
]

output = pipeline(messages, max_new_tokens=512)
print(output[0]["generated_text"][-1]["content"])

Method 5: Phi-4-multimodal (Vision + Audio on Windows)

Phi-4-multimodal-instruct (5.6B, released February 2025) supports text, vision, and audio in a single model. It ranks #1 on the HuggingFace OpenASR leaderboard (6.02% WER as of early 2026). Install dependencies beyond the base transformers stack:

pip install "transformers>=4.48.2" accelerate pillow soundfile scipy peft
# flash_attn optional but recommended for RTX 40-series cards
pip install flash-attn --no-build-isolation

Download the model:

huggingface-cli download microsoft/Phi-4-multimodal-instruct --local-dir ./phi4-mm

Image analysis example:

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

model_path = "./phi4-mm"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
).eval()

generation_config = GenerationConfig.from_pretrained(model_path)

image = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
prompt = "<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=512, generation_config=generation_config)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])

Audio transcription example:

import soundfile as sf
import io
from urllib.request import urlopen

audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4.flac"
audio, rate = sf.read(io.BytesIO(urlopen(audio_url).read()))

prompt = "<|user|><|audio_1|>Transcribe the audio.<|end|><|assistant|>"
inputs = processor(text=prompt, audios=[(audio, rate)], return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=1000, generation_config=generation_config)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])

Method 6: Docker with Ollama (Isolated environment)

Useful when you want to keep the model runtime isolated from your main Windows environment, or when deploying to Windows Server 2025.

# Requires Docker Desktop with WSL2 backend and NVIDIA Container Toolkit
docker run -d --gpus all -p 11434:11434 --name ollama ollama/ollama:latest
docker exec ollama ollama pull phi4
docker exec -it ollama ollama run phi4
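
To confirm the containerized server is reachable from the host before wiring it into an application, a small Python check against Ollama's tags endpoint works (assumes the requests package; /api/tags lists the models the server has pulled):

import requests

# Lists the models available on the Ollama server running inside the container.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])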

How to Choose Your Installation Method

  • No Python experience, just want to chat: → LM Studio 0.4.12
  • Quickest path, any hardware, Windows 10/11: → Microsoft Foundry Local (winget install Microsoft.FoundryLocal)
  • Need an OpenAI-compatible API for your app: → Ollama + ollama serve
  • Need vision or audio input: → Python + Phi-4-multimodal-instruct (Method 5)
  • Constrained laptop, 16 GB RAM, no discrete GPU: → Phi-4-mini via Foundry Local or Ollama
  • Competition math, step-by-step reasoning chains: → Phi-4-reasoning-plus via Python transformers
  • Production deployment / isolated environment: → Docker + Ollama

Performance and Benchmarks

Benchmark scores (official Microsoft / Hugging Face)

Benchmark | Phi-4 14B | Phi-4-mini 3.8B | GPT-4o-mini | Llama 3.3 70B
MMLU | 84.8% | 73.0% | 81.8% | ~83%
MATH | 80.4% | 62.0% | 73.0% | —
GPQA (Science) | 56.1% | — | 40.9% | —
HumanEval (Code) | 82.6% | — | 86.2% | —

Phi-4-reasoning-plus achieves 82.5% on AIME 2025 (an average-of-5 run) — comparable to full DeepSeek-R1 (671B) despite having 14B parameters. Note that AIME 2025 has only 30 problems; run-to-run variance of 5–10 percentage points is expected, as Microsoft acknowledges in the technical report.

Windows inference throughput (measured on consumer hardware, GGUF Q4_K_M via Ollama)

GPU | Model | Tokens/sec | VRAM used
RTX 3060 12 GB | Phi-4 14B Q4_K_M | ~18–22 | ~10 GB
RTX 3090 24 GB | Phi-4 14B Q4_K_M | ~40–48 | ~10 GB
RTX 4090 24 GB | Phi-4 14B Q4_K_M | ~65–80 | ~10 GB
CPU only (Core i9) | Phi-4-mini Q4_K_M | ~6–10 | — (system RAM)

Throughput ranges reflect community-reported values from r/LocalLLaMA and local benchmarking blogs; verify on your hardware — exact speeds vary by driver version, system RAM speed, and thermal state.

Optimization and Performance Tuning

Quantization selection guide (GGUF)

For Phi-4 14B in GGUF format (available via LM Studio and Ollama):

  • Q3_K_M — 6.5 GB, fits in 8 GB VRAM with headroom. Visible quality drop on complex reasoning.
  • Q4_K_M — 8.3 GB, best balance. Recommended for most users. Retains ~95% of full-precision quality on reasoning tasks.
  • Q5_K_M — 9.8 GB, higher fidelity math and code output. Use if you have 12 GB VRAM.
  • Q8_0 — 14.7 GB, near-lossless. Needs an RTX 3090/4090 for full GPU inference.

Flash Attention 2 (transformers path, RTX 30/40-series)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

Flash Attention 2 requires a GPU with compute capability 8.0+ (Ampere or newer). On older cards (RTX 2080, V100), omit this argument or use attn_implementation="eager".
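
If you are not sure which attention backend your card supports, you can pick it at runtime. A small sketch that keys off the device's compute capability (8.0 or higher means Ampere or newer):

import torch
from transformers import AutoModelForCausalLM

# Use flash_attention_2 only on compute capability 8.0+ (Ampere/Ada); otherwise fall back to eager.
if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8:
    attn_impl = "flash_attention_2"   # also requires `pip install flash-attn`
else:
    attn_impl = "eager"

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation=attn_impl,
)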

Batch processing for throughput

# Reuse the model and tokenizer objects loaded earlier (AutoModelForCausalLM / AutoTokenizer).
# Batched generation requires a pad token; if the tokenizer lacks one, reuse the EOS token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    batch_size=4,        # process 4 prompts per forward pass
    max_new_tokens=512,
)

Common Issues and Troubleshooting

CUDA out of memory (OOM)

  • Switch to a lower GGUF quantization (Q4_K_M instead of Q8_0).
  • Use device_map="auto" with max_memory={0: "10GiB", "cpu": "20GiB"} to offload layers to RAM (see the sketch after this list).
  • In Ollama, reduce the number of GPU-offloaded layers (for example, /set parameter num_gpu 20 inside an interactive session, or PARAMETER num_gpu 20 in a Modelfile) so the remaining layers run on CPU.
  • Phi-4-mini (Q4_K_M) requires only ~3 GB VRAM; switch models if 14B doesn't fit.
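
As referenced above, here is a minimal sketch of CPU offloading with the max_memory argument from transformers/accelerate; the 10 GiB / 20 GiB split is only an example, so size it to your own GPU and system RAM.

from transformers import AutoModelForCausalLM

# Cap GPU 0 at ~10 GiB and spill the remaining layers to system RAM.
# Offloaded layers run on CPU, so generation is slower but will not OOM.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    torch_dtype="auto",
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "20GiB"},
)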

DLL load errors on Windows

  • Install the latest Visual C++ Redistributable 2022.
  • If using Ollama, reinstall via the .exe installer (not winget) to ensure proper PATH configuration.
  • Ensure Python and CUDA bin directories are in your %PATH%.

Slow inference (CPU fallback)

  • Run ollama ps while the model is loaded; the PROCESSOR column shows how much of the model sits on GPU versus CPU. If it reports 100% CPU, your GPU is not being detected. ollama run phi4 --verbose additionally prints token throughput so you can compare runs.
  • In LM Studio: Settings → GPU → toggle GPU Acceleration on and restart the model.
  • Foundry Local: run foundry device list to confirm GPU detection.
  • Confirm CUDA installation: nvidia-smi should show your GPU and driver version.

Foundry Local: winget package not found

  • Run winget source update to refresh the package sources, make sure App Installer (which provides winget) is current in the Microsoft Store, then retry winget install Microsoft.FoundryLocal.
  • If the package still does not appear, follow the install instructions on the Microsoft Learn "Get started with Foundry Local" page (see References below).

Hugging Face authentication errors

  • Phi-4 and all Phi-4 variants are MIT-licensed and publicly accessible. If you see a 401 error, run huggingface-cli login with a free HF account token.

transformers version incompatibility

  • Phi-4-multimodal requires transformers>=4.48.2. Run pip install --upgrade transformers if you see errors like AttributeError: 'Phi4MMForCausalLM' object has no attribute....

Use Case Examples

Local coding assistant with Phi-4

messages = [
    {"role": "system", "content": "You are an expert Python developer. Write clean, documented code."},
    {"role": "user", "content": "Write a Python function that parses a JWT token without external libraries."},
]
output = pipeline(messages, max_new_tokens=1024)
print(output[0]["generated_text"][-1]["content"])

Step-by-step math reasoning (Phi-4-reasoning-plus)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning-plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Find all integer solutions to x^2 - 5x + 6 = 0 and explain each step."}],
    tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=2048, temperature=0.8, do_sample=True)
print(tokenizer.decode(ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Building with Local AI? Get Help from Remote Developers Who've Done It

If you are integrating a local Phi-4 deployment into a production application — whether for private data processing, a reasoning pipeline, or an embedded AI feature — Codersera can connect you with vetted remote developers who have hands-on experience with local LLM integration on Windows and Linux.

FAQ

Can Phi-4 run on a laptop without a GPU?

Yes. Use Microsoft Foundry Local or Ollama with Phi-4-mini (3.8B). On a modern Intel/AMD laptop with 16 GB RAM, CPU-only inference produces roughly 5–10 tokens/second — slow but functional for development and testing.

Is Phi-4 free to use commercially?

All Phi-4 variants are released under the MIT License, which allows commercial use. Download them from the Microsoft organization on Hugging Face without any subscription or usage fee.

What is the difference between Phi-4 and Phi-4-reasoning-plus?

Phi-4 (14B) is a general-purpose model trained on synthetic and curated data for reasoning, coding, and math. Phi-4-reasoning-plus is the same architecture fine-tuned via outcome-based reinforcement learning to generate extended reasoning chains — similar to how OpenAI o-series models work. Phi-4-reasoning-plus achieves 82.5% on AIME 2025 vs. Phi-4's weaker performance on competition math, at the cost of significantly longer outputs and slower inference.

Does Phi-4 support languages other than English?

Phi-4 (14B text model) is primarily English-focused. Phi-4-mini has a 200,000-token vocabulary with improved multilingual support across 23 languages for text. Phi-4-multimodal supports audio input in English, Chinese, German, French, Italian, Japanese, Spanish, and Portuguese.

How does Phi-4 compare to Llama 4 and Qwen 3 for local use?

At 14B parameters, Phi-4 beats Llama 3.3-70B on several reasoning benchmarks (GPQA: 56.1% vs. ~46%). For general-purpose chat and coding, Qwen 3 and Llama 4 Scout offer competitive quality, but Phi-4's MIT license and smaller footprint make it the default choice when VRAM is constrained. Phi-4-reasoning-plus specifically excels on structured math — Llama 4 and Qwen 3 do not yet match it on AIME benchmarks at comparable parameter counts.

Can I run Phi-4-multimodal on Windows with just a CPU?

Technically yes, but it is slow: 5.6B parameters in full precision requires significant RAM, and audio/vision preprocessing adds overhead. For CPU-only multimodal use, target Phi-4-multimodal in Q4 quantization via a compatible llama.cpp build (recent llama.cpp releases include multimodal/vision support). For practical performance, a 6 GB VRAM GPU is recommended.

Does LM Studio support Phi-4-multimodal?

As of LM Studio 0.4.12, multimodal support (vision input) is available for compatible GGUF models. Check the LM Studio changelog at lmstudio.ai/blog for the current state — the team ships updates frequently. For guaranteed multimodal capability, use the Python transformers path (Method 5 above).

What happened to the vanilj/Phi-4 Ollama model from earlier guides?

The vanilj/Phi-4 community model was a third-party GGUF upload used before Microsoft published official model weights on Ollama's library. Microsoft now maintains the official phi4 and phi4-mini tags at ollama.com/library/phi4. Use those instead — they are verified, updated, and maintained by Microsoft.

References and Further Reading

  1. microsoft/phi-4 on Hugging Face — official model card with benchmark scores and usage examples
  2. microsoft/Phi-4-multimodal-instruct on Hugging Face — 5.6B multimodal model card, released February 2025
  3. Phi-4-reasoning Technical Report (arXiv 2504.21318) — Microsoft Research, April 2025
  4. Running Phi-4 Locally with Microsoft Foundry Local — Microsoft Tech Community
  5. Get started with Foundry Local — Microsoft Learn official docs
  6. phi4 on Ollama Library — official pull commands and model tags
  7. microsoft/phi-4 on LM Studio — GGUF download sizes and model catalog
  8. Welcome to the new Phi-4 models — Phi-4-mini & Phi-4-multimodal — Microsoft Tech Community, February 2025