Run Microsoft Phi-4 on Ubuntu: Complete 2026 Guide (All 6 Models)
Last updated April 2026 — refreshed for current model/tool versions.
Microsoft's Phi-4 family has grown from a single 14B text model into a six-model ecosystem covering text, vision, audio, and multi-step reasoning — all under the MIT license. This guide covers every variant, gives you current hardware targets, and walks through three proven installation paths on Ubuntu 20.04 LTS and later.
What changed since early 2025
- Six models, not one. The Phi-4 family now includes Phi-4 (14B text), Phi-4-mini (3.8B text), Phi-4-multimodal (5.6B text+vision+audio), Phi-4-reasoning (14B reasoning-tuned), Phi-4-reasoning-plus (14B, RL-enhanced), and Phi-4-reasoning-vision-15B (released March 4, 2026).
- Transformers ≥ 4.49.0 is required for the Phi-4 base and multimodal variants; Phi-4-mini-reasoning needs ≥ 4.51.3.
- Official Ollama tags exist for the text models: phi4, phi4-mini, phi4-reasoning, and phi4-mini-reasoning. Third-party community tags are no longer needed.
- Phi-4-reasoning-vision-15B adds multimodal reasoning (charts, GUI, math diagrams) as of March 2026, built on the SigLIP-2 vision encoder.
- Hardware floor dropped. Phi-4-mini Q4_K_M runs in about 3 GB VRAM; Phi-4 14B Q4_K_M fits on a 12 GB GPU.
- Phi-4-multimodal reached #1 on the Hugging Face OpenASR leaderboard (6.14% WER) in March 2025, beating Whisper v3.
The Phi-4 Model Family at a Glance
| Model | Parameters | Modalities | Released | Best for | VRAM (Q4) |
|---|---|---|---|---|---|
| Phi-4 | 14B | Text | Dec 2024 | Reasoning, math, code | ~10 GB |
| Phi-4-mini | 3.8B | Text | Feb 2025 | Edge, function calling, multilingual | ~3 GB |
| Phi-4-multimodal | 5.6B | Text + Vision + Audio | Feb 2025 | Image QA, speech ASR, document OCR | ~5 GB |
| Phi-4-reasoning | 14B | Text | Apr 30, 2025 | STEM, olympiad math, code | ~10 GB |
| Phi-4-reasoning-plus | 14B | Text | Apr 30, 2025 | Harder math, RL-enhanced traces | ~10 GB |
| Phi-4-reasoning-vision-15B | 15B | Text + Vision | Mar 4, 2026 | Chart/GUI understanding, math diagrams | ~12 GB |
All models are MIT-licensed and available on Hugging Face, Azure AI Foundry, GitHub Models, and — except for the reasoning-vision model — Ollama.
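The VRAM (Q4) column can be sanity-checked with simple arithmetic: quantized weight size is roughly parameters × bits per weight ÷ 8, plus some headroom for the KV cache and runtime buffers. The sketch below assumes about 4.85 bits per weight for Q4_K_M and a flat 1.5 GB overhead. Both numbers are illustrative assumptions, not figures from the model cards, so treat the output as a rough guide only.

```python
# Rough VRAM estimate for a quantized model.
# ASSUMPTIONS (not from the model cards): ~4.85 bits per weight for Q4_K_M
# and a flat 1.5 GB allowance for KV cache and runtime buffers.

def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Return an approximate VRAM requirement in GB, rounded to one decimal."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of bytes ~ GB
    return round(weight_gb + overhead_gb, 1)


if __name__ == "__main__":
    family = {
        "Phi-4": 14.0,
        "Phi-4-mini": 3.8,
        "Phi-4-multimodal": 5.6,
        "Phi-4-reasoning-vision-15B": 15.0,
    }
    for name, size in family.items():
        print(f"{name}: ~{estimate_vram_gb(size, 4.85)} GB at Q4_K_M")
```

The same function with bits_per_weight=16 gives an FP16 estimate, which is useful when deciding whether an unquantized model fits on a given card.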
System Requirements
Ubuntu 20.04 LTS (Focal) and later are supported. Ubuntu 22.04 LTS or 24.04 LTS are recommended for CUDA 12.x driver compatibility.
Hardware Targets by Use Case
| Scenario | GPU | VRAM | System RAM | Model recommendation |
|---|---|---|---|---|
| Laptop / budget desktop | RTX 3060, RTX 4060 | 8–12 GB | 16 GB | Phi-4-mini Q4_K_M (8 GB cards) or Phi-4 Q4_K_M (12 GB cards) |
| Workstation | RTX 3090, RTX 4090, A6000 | 24–48 GB | 32 GB | Phi-4 or Phi-4-reasoning at Q8_0 (24 GB) or FP16 (48 GB) |
| Research / multimodal | A100-80G, H100 | 80 GB | 64 GB | Phi-4-multimodal or Phi-4-reasoning-vision-15B |
| CPU-only (slow) | — | — | 32 GB+ | Phi-4-mini Q4_K_M via llama.cpp |
- Python: 3.10 recommended (3.9 minimum; the pinned torch 2.6.0 and transformers 4.49.0 releases do not support 3.8)
- CUDA: 12.x (for Flash Attention 2 and torch 2.6.0)
- Disk: 10–30 GB free depending on model
Preparing Your Ubuntu Environment
Update System Packages
sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip python3-venv git curl build-essential -y
Verify GPU and CUDA
nvidia-smi # shows driver version and GPU name
nvcc --version # shows CUDA toolkit version
If nvidia-smi fails, install the NVIDIA driver for your GPU from developer.nvidia.com/cuda-downloads. CUDA 12.4 or later is recommended for full compatibility with PyTorch 2.6.0 and Flash Attention 2.7.
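For scripted setups, the same GPU check can be done from Python. The following is a minimal sketch that shells out to nvidia-smi with its CSV query flags and parses the result. The helper names (parse_gpu_csv, query_gpus) are hypothetical, and the parser assumes every field is numeric, which may not hold on every driver.

```python
import shutil
import subprocess

# nvidia-smi query flags; with "nounits" the memory columns are plain MiB numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list:
    """Parse 'name, total, free' CSV lines into a list of dicts (MiB values)."""
    gpus = []
    for line in text.strip().splitlines():
        # rsplit keeps any commas inside the GPU name intact.
        name, total, free = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "free_mib": int(free)})
    return gpus


def query_gpus() -> list:
    """Return parsed GPU info, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)
```

Calling query_gpus() on a machine without a working driver returns an empty list, which a setup script can treat as "fall back to CPU or fix the driver first".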
Installation Methods Overview
| Method | Best for | Setup difficulty | Flexibility | GPU required |
|---|---|---|---|---|
| Ollama | Quick chat, all Phi-4 text models | Easiest | Moderate | Recommended, not mandatory |
| Python / Transformers | Custom scripts, multimodal, fine-tuning | Moderate | Highest | Yes (Flash Attention 2) |
| vLLM | High-throughput API serving | Advanced | High | Yes (NVIDIA only) |
Method 1: Ollama (Fastest Path)
Ollama now ships official Microsoft Phi-4 images for every text model in the family. This is the fastest way to get any Phi-4 variant running. If you are also setting up a full local AI agent stack, see the OpenClaw + Ollama setup guide for running local AI agents which covers agent orchestration on top of Ollama.
Step 1: Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Verify with ollama --version. You need at least 0.5.13 for Phi-4-mini and the reasoning models.
Step 2: Start the Ollama Server
ollama serve
This runs in the foreground. Open a second terminal for model commands, or run it in the background with ollama serve &.
Step 3: Choose and Pull a Model
# Base model, 14B, general-purpose
ollama pull phi4
# Mini model, 3.8B, fits on 8 GB VRAM
ollama pull phi4-mini
# Reasoning-tuned 14B for math, science, code
ollama pull phi4-reasoning
# Mini reasoning, 3.8B, STEM focus
ollama pull phi4-mini-reasoning
Each model download is 3–10 GB depending on the model and quantization. Ollama stores models in ~/.ollama/models by default.
Step 4: Start an Interactive Chat
ollama run phi4-reasoning
For the reasoning models, Ollama parses the chain-of-thought output automatically. You will see a <think>...</think> block followed by the final answer.
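When calling Ollama from a script rather than the interactive prompt, the reasoning trace usually needs to be separated from the final answer. Here is a minimal sketch using only the standard library. It assumes the server listens on Ollama's default port 11434 and that the trace arrives inline inside <think> tags as described above. The function names are hypothetical.

```python
import json
import re
import urllib.request

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def chat_once(prompt: str, model: str = "phi4-reasoning",
              host: str = "http://localhost:11434") -> str:
    """Send one non-streaming chat request to a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With ollama serve running, reasoning, answer = split_reasoning(chat_once("What is 17 * 23?")) gives the two parts separately, so an application can log the trace and show only the answer.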
Optional: Web UI with Open WebUI
docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    --name open-webui \
    ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser and connect it to your local Ollama instance.
Method 2: Python + Hugging Face Transformers
This method gives you full control over inference parameters, multimodal inputs, and custom pipelines. It is the only option for Phi-4-multimodal and Phi-4-reasoning-vision-15B.
Step 1: Create a Virtual Environment
python3 -m venv phi4env
source phi4env/bin/activate
Step 2: Install Dependencies
For Phi-4, Phi-4-reasoning, and Phi-4-multimodal, use these tested versions:
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.49.0 accelerate==1.3.0 peft==0.13.2
pip install flash_attn==2.7.4.post1 # requires CUDA 12.x
pip install "huggingface_hub[cli]"
For Phi-4-multimodal only, also install:
pip install soundfile==0.13.1 pillow==11.1.0 scipy==1.15.2 backoff==2.2.1
For Phi-4-mini-reasoning (uses a newer tokenizer):
pip install transformers==4.51.3
Step 3: Download Model Weights
mkdir -p ~/models/phi4
# Phi-4 base (14B text)
huggingface-cli download microsoft/phi-4 --local-dir ~/models/phi4
# Or Phi-4-reasoning
huggingface-cli download microsoft/Phi-4-reasoning --local-dir ~/models/phi4-reasoning
# Or Phi-4-multimodal
huggingface-cli download microsoft/Phi-4-multimodal-instruct --local-dir ~/models/phi4-multimodal
You may need to authenticate for gated models: huggingface-cli login
Step 4a: Text Inference (Phi-4 / Phi-4-reasoning)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "microsoft/Phi-4-reasoning" # or "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system", "content": "You are Phi, a helpful AI assistant."},
    {"role": "user", "content": "Solve: x^2 - 5x + 6 = 0, showing every step."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=4096,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Note for Phi-4-reasoning: Set max_new_tokens=32768 for complex problems; the chain-of-thought output is long by design.
Step 4b: Multimodal Inference (Phi-4-multimodal)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2", # use "eager" for older GPUs (V100 or earlier)
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path)

# Image + text prompt
image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
prompt = "<|user|><|image_1|>What insect is this, and what distinguishing features do you see?<|end|><|assistant|>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, max_new_tokens=500, generation_config=generation_config)
# Drop the prompt tokens so only the newly generated answer is decoded
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)
Method 3: vLLM (High-Throughput API Serving)
Use vLLM when you need to serve Phi-4 as an OpenAI-compatible REST API, handle concurrent requests, or run continuous batching for production workloads.
Step 1: Install vLLM
pip install vllm
Step 2a: Serve Phi-4 (Text)
vllm serve microsoft/phi-4 \
    --host 0.0.0.0 \
    --port 7000 \
    --dtype auto
Step 2b: Serve Phi-4-reasoning
The reasoning model requires additional flags to parse the chain-of-thought output:
vllm serve microsoft/Phi-4-reasoning \
    --host 0.0.0.0 \
    --port 7000 \
    --enable-reasoning \
    --reasoning-parser deepseek_r1
Step 2c: Serve a GGUF-Quantized Model
# Download Phi-4 Q4_K_M GGUF (fits 12 GB VRAM)
huggingface-cli download bartowski/phi-4-GGUF \
    phi-4-Q4_K_M.gguf --local-dir ./models
vllm serve ./models/phi-4-Q4_K_M.gguf \
    --tokenizer microsoft/phi-4 \
    --host 0.0.0.0 \
    --port 7000
Step 3: Test the API
curl http://localhost:7000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "microsoft/phi-4",
        "messages": [{"role": "user", "content": "Explain gradient descent in one paragraph."}]
    }'
How to Choose the Right Phi-4 Variant
Use this decision tree to pick the right model for your workload:
- Need to run on 8 GB VRAM or less? → Phi-4-mini (Q4_K_M, ~3 GB). Phi-4 (Q4_K_M, ~10 GB) needs a 12 GB card.
- Need math, science, or code reasoning? → Phi-4-reasoning or Phi-4-reasoning-plus (14B, MIT). These beat DeepSeek-R1-Distill-Llama-70B — a model 5× larger — on most math benchmarks.
- Need image or audio understanding? → Phi-4-multimodal (5.6B, handles images + audio)
- Need chart, GUI, or diagram analysis with structured reasoning? → Phi-4-reasoning-vision-15B (March 2026, 15B)
- Need function calling on a small model? → Phi-4-mini (built-in function calling support)
- Need edge deployment (mobile, IoT)? → Phi-4-mini or Phi-4-mini-reasoning
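The decision tree above can be sketched as a small function. The VRAM thresholds come from the Q4 column of the model table; the order in which requirements are checked is an assumption of this sketch, and pick_variant is a hypothetical helper, not part of any Phi-4 tooling.

```python
# Approximate Q4 VRAM needs in GB, taken from the tables in this guide.
Q4_VRAM_GB = {
    "Phi-4-mini": 3,
    "Phi-4-mini-reasoning": 3,
    "Phi-4-multimodal": 5,
    "Phi-4": 10,
    "Phi-4-reasoning": 10,
    "Phi-4-reasoning-vision-15B": 12,
}


def pick_variant(vram_gb: float, *, vision: bool = False, audio: bool = False,
                 reasoning: bool = False, function_calling: bool = False) -> str:
    """Pick a Phi-4 variant; the precedence of the checks is an assumption."""
    if audio:
        choice = "Phi-4-multimodal"
    elif vision:
        choice = "Phi-4-reasoning-vision-15B" if reasoning else "Phi-4-multimodal"
    elif function_calling:
        choice = "Phi-4-mini"
    elif reasoning:
        choice = "Phi-4-reasoning" if vram_gb >= 10 else "Phi-4-mini-reasoning"
    else:
        choice = "Phi-4" if vram_gb >= 10 else "Phi-4-mini"
    if vram_gb < Q4_VRAM_GB[choice]:
        raise ValueError(f"{choice} needs about {Q4_VRAM_GB[choice]} GB at Q4")
    return choice
```

For example, pick_variant(8, reasoning=True) falls back to the mini reasoning model, while the same call with 12 GB selects the 14B reasoning model.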
Performance & Benchmarks
Phi-4-reasoning (14B) vs. Larger Models
| Benchmark | Phi-4-reasoning | DeepSeek-R1 | o1-mini |
|---|---|---|---|
| AIME 2024 | 75.3% | 78.7% | 63.6% |
| AIME 2025 | 62.9% | 70.4% | 54.8% |
| GPQA-Diamond | 65.8% | 73.0% | 60.0% |
| HumanEval+ | 92.9% | — | — |
| MMLU-Pro | 74.3% | — | — |
Source: Phi-4-reasoning model card on Hugging Face, April 2025.
Phi-4 Base (14B) General Benchmarks
| Benchmark | Phi-4 | Notes |
|---|---|---|
| MMLU | 84.8% | vs. Phi-3's 77.9% |
| HumanEval (code) | 82.6% | — |
| MATH (competition) | 56.1% | — |
Phi-4-multimodal (5.6B) Benchmark Highlights
| Task | Score | Notes |
|---|---|---|
| OpenASR WER | 6.14% | #1 on Hugging Face OpenASR leaderboard (Mar 2025) |
| MMMU | 55.1% | Competitive with larger vision models |
| MMBench | 86.7% | — |
| DocVQA | 93.2% | — |
Phi-4-reasoning-vision-15B Benchmark Highlights
| Benchmark | Score |
|---|---|
| AI2D_TEST (diagram) | 84.8% |
| ChartQA_TEST | 83.3% |
| MathVista_MINI | 75.2% |
| ScreenSpot_v2 (GUI) | 88.2% |
| MMMU_VAL | 54.3% |
Source: Microsoft Foundry announcement, March 2026.
Practical Performance Notes
- Phi-4-reasoning at Q4_K_M on an RTX 4060 Ti (16 GB VRAM) generates 10–15 tokens/sec. The chain-of-thought traces can be very long — budget extra wait time for hard math problems.
- Phi-4-mini runs at 50–80 tokens/sec on the same hardware (Q4_K_M), making it more suitable for interactive applications.
- CPU-only inference with Phi-4 Q4_K_M via llama.cpp is around 3–5 tokens/sec on a modern 16-core CPU — usable for batch jobs, not chat.
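To turn these throughput figures into wait times, divide the expected output length by tokens per second. The sketch below does that arithmetic for a range of speeds. The 8192-token trace length is an illustrative assumption, not a measured value.

```python
def generation_seconds(tokens: int, tps_low: float, tps_high: float) -> tuple:
    """Return (best_case, worst_case) wall-clock seconds for a given output length."""
    return tokens / tps_high, tokens / tps_low


if __name__ == "__main__":
    # Phi-4-reasoning at 10-15 tokens/sec with an assumed 8192-token trace.
    best, worst = generation_seconds(8192, 10, 15)
    print(f"Phi-4-reasoning: {best / 60:.1f} to {worst / 60:.1f} minutes")

    # Phi-4-mini at 50-80 tokens/sec with an assumed 500-token reply.
    best, worst = generation_seconds(500, 50, 80)
    print(f"Phi-4-mini: {best:.1f} to {worst:.1f} seconds")
```

Under these assumptions a long reasoning trace takes on the order of ten minutes, while a short Phi-4-mini reply finishes in seconds, which is why the mini model suits interactive use better.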
LM Studio: GUI Alternative to Ollama
If you prefer a graphical interface without using Docker, LM Studio supports all Phi-4 GGUF models. Download the Linux AppImage, search for "phi-4" or "phi-4-reasoning" in the built-in model browser, and download the Q4_K_M quantization for your VRAM target. LM Studio also exposes a local OpenAI-compatible API on port 1234 that integrates directly with tools like Continue (VS Code extension) and Cursor.
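Because LM Studio and vLLM both expose an OpenAI-compatible endpoint, one small client can talk to either. The following standard-library sketch assumes the /v1/chat/completions route and the ports mentioned in this guide (1234 for LM Studio, 7000 for the vLLM examples). The helper names are hypothetical, and the model identifier passed in must match whatever the server reports.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body: dict) -> str:
    """Pull the assistant text out of a chat completions response body."""
    return body["choices"][0]["message"]["content"]


def ask(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(base_url, model, prompt)) as resp:
        return extract_reply(json.load(resp))
```

With a server running, ask("http://localhost:7000", "microsoft/phi-4", "Explain gradient descent in one paragraph.") mirrors the curl test from the vLLM section; pointing base_url at http://localhost:1234 targets LM Studio instead.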
Common Pitfalls & Troubleshooting
- CUDA out of memory: Drop to a lower quantization. Phi-4 (14B) at Q8_0 needs ~16 GB VRAM; at Q4_K_M it needs ~10 GB. Use nvidia-smi to check free VRAM before loading.
- Flash Attention 2 install fails: This package compiles from source and requires g++, ninja, and CUDA headers. Install with pip install flash_attn --no-build-isolation after verifying nvcc --version matches your PyTorch CUDA build.
- Older GPU (V100 or earlier): Flash Attention 2 does not support sm70 GPUs. Use _attn_implementation="eager" instead. Expect 20–40% slower inference.
- Phi-4-reasoning extremely slow: This is a known community observation. The reasoning model generates very long internal traces. Set max_new_tokens to a reasonable ceiling (8192–16384) to avoid runaway generation. For speed-critical use cases, use the base Phi-4 or Phi-4-mini instead.
- vLLM engine process failed: Ensure vLLM version is compatible with your PyTorch version. Run pip show vllm and check the vLLM changelog for your version's PyTorch dependency. A mismatch between flash-attn and PyTorch CUDA versions is the most common cause.
- LoRA adapter errors with Phi-4-multimodal: Ensure peft==0.13.2 is installed and that the adapter config points to the correct base model path.
- HuggingFace download 401 errors: Some Phi-4 variants require accepting Microsoft's license on the model card page. Run huggingface-cli login with your token after accepting the license on the Hugging Face website.
- Ollama model not found: Third-party community tags like vanilj/Phi-4 are now superseded by official Microsoft tags. Use phi4, phi4-mini, phi4-reasoning, or phi4-mini-reasoning.
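The older-GPU item above can be automated by choosing the attention backend from the GPU's compute capability. This sketch assumes Flash Attention 2 needs compute capability 8.0 (Ampere) or newer, which matches the sm70 limitation noted above but should be checked against the flash-attn release you install. choose_attn_implementation is a hypothetical helper.

```python
def choose_attn_implementation(major: int, minor: int = 0) -> str:
    """Map a CUDA compute capability to an _attn_implementation value.

    ASSUMPTION: Flash Attention 2 requires compute capability >= 8.0;
    anything older (for example V100 at 7.0) falls back to eager attention.
    """
    return "flash_attention_2" if (major, minor) >= (8, 0) else "eager"


if __name__ == "__main__":
    for capability in [(7, 0), (7, 5), (8, 6), (9, 0)]:
        print(capability, "->", choose_attn_implementation(*capability))
```

In a real script the capability tuple can come from torch.cuda.get_device_capability(), and the result is passed as the _attn_implementation argument shown in the multimodal example.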
Best Practices for Production and Research
- Always use a virtual environment to isolate Phi-4 dependencies from system Python packages.
- Monitor GPU temperature during long reasoning runs: watch -n 2 nvidia-smi. The reasoning models generate more tokens per query, producing more heat.
- Use SSH key authentication for remote servers. Never use password auth on hosts that expose Ollama or vLLM APIs, even on private networks.
- Bind vLLM to localhost by default (--host 127.0.0.1). Only expose externally behind a reverse proxy with authentication.
- Pin dependency versions in your requirements.txt. The Phi-4 ecosystem moves fast, and a transformers upgrade can break multimodal processor compatibility.
- Use device_map="auto" when you have multiple GPUs. Transformers will shard the model automatically across available VRAM.
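Pinned versions only help if the environment actually matches them. The sketch below compares installed packages against the pins from the installation step using importlib.metadata. The find_mismatches helper is hypothetical, and it ignores local version suffixes such as +cu124 on the assumption that they do not matter for this check.

```python
from importlib import metadata

# Pins taken from the Transformers installation step in this guide.
PINS = {
    "torch": "2.6.0",
    "transformers": "4.49.0",
    "accelerate": "1.3.0",
    "peft": "0.13.2",
}


def find_mismatches(pins: dict, get_version=metadata.version) -> dict:
    """Return {package: (wanted, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for package, wanted in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Strip local build tags like "+cu124" before comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

Running this inside the virtual environment before loading a model gives an early warning when an unrelated pip install has silently upgraded transformers or torch.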
FAQ
Is Phi-4 better than Llama 4?
They target different use cases. Phi-4-reasoning (14B) beats Llama 3-based distillations in math benchmarks but is narrower in general knowledge. Meta's Llama 4 Scout and Llama 4 Maverick (released April 2025) are larger multi-expert models better suited for broad general-purpose tasks. For reasoning and math on constrained hardware, Phi-4-reasoning is the better choice. For general RAG or chat assistants, Llama 4 is more versatile.
Should I use Phi-4-multimodal or Whisper for speech recognition?
Phi-4-multimodal held the #1 position on the Hugging Face OpenASR leaderboard with a 6.14% WER as of March 2025, ahead of Whisper v3's score. If you need ASR as part of a multimodal pipeline that also does image or text tasks, Phi-4-multimodal eliminates the need for a separate ASR model. If you need only ASR and want a smaller, specialized model, Whisper large-v3 remains simpler to deploy.
Does Phi-4-mini support function calling?
Yes. Phi-4-mini has built-in function calling support, making it suitable for agent frameworks and tool-use scenarios on constrained hardware. The Phi-4 base model and the reasoning variants do not have native function calling baked in — you need prompt engineering or a framework like LangChain for that.
Can I run Phi-4 without a GPU?
Yes, via llama.cpp or Ollama on CPU, but it is slow. Phi-4 14B Q4_K_M manages around 3–5 tokens/sec on a modern 16-core CPU, which is usable for batch tasks but not for interactive chat. If you need GPU-free inference at a more usable speed, consider Phi-4-mini or Phi-4-mini-reasoning, which fit in ~3 GB RAM in quantized form and run faster on CPU than the 14B models.
What is the difference between Phi-4 and Phi-4-reasoning?
Phi-4 is a general-purpose 14B model trained primarily on synthetic data for complex reasoning tasks. Phi-4-reasoning is the same base model fine-tuned on 1.4 million STEM and coding chain-of-thought traces; Phi-4-reasoning-plus adds a reinforcement-learning stage on top to further improve math and science accuracy. Phi-4-reasoning consistently scores 10–15 points higher than Phi-4 on olympiad math benchmarks but generates much longer output and is slower per query. Use Phi-4 for chat and code; use Phi-4-reasoning when accuracy on hard math or logic problems is the priority.
Is Phi-4 free for commercial use?
Yes. All current Phi-4 variants — including Phi-4, Phi-4-mini, Phi-4-multimodal, Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-reasoning-vision-15B — are released under the MIT license, which permits commercial use, modification, and redistribution.
Which Ubuntu version should I use?
Ubuntu 22.04 LTS (Jammy Jellyfish) is the safest choice as of 2026. It ships with Python 3.10, has long-term support through 2027 (standard) and 2032 (extended), and all NVIDIA CUDA 12.x drivers have tested Ubuntu 22.04 packages. Ubuntu 24.04 LTS (Noble Numbat) also works but is newer, and some CUDA driver packages lag by a few weeks for new CUDA versions. Ubuntu 20.04 LTS (Focal) can still run Phi-4, but it reached the end of standard support in May 2025, so an upgrade is recommended.
References & Further Reading
- microsoft/phi-4 — Hugging Face Model Card
- microsoft/Phi-4-multimodal-instruct — Hugging Face Model Card
- microsoft/Phi-4-reasoning — Hugging Face Model Card
- microsoft/Phi-4-reasoning-vision-15B — Hugging Face Model Card
- Introducing Phi-4-Reasoning-Vision to Microsoft Foundry (March 2026)
- Empowering innovation: The next generation of the Phi family — Microsoft Azure Blog
- phi4-reasoning — Ollama Library
- Phi-4-reasoning-vision-15B Technical Report — arXiv:2603.03975