Run Microsoft Phi-4 on Ubuntu: Comprehensive Setup Guide

Last updated April 2026 — refreshed for current model/tool versions.

Microsoft's Phi-4 family has grown from a single 14B text model into a six-model ecosystem covering text, vision, audio, and multi-step reasoning — all under the MIT license. This guide covers every variant, gives you current hardware targets, and walks through three proven installation paths on Ubuntu 20.04 LTS and later.

What changed since early 2025

Six models, not one. The Phi-4 family now includes Phi-4 (14B text), Phi-4-mini (3.8B text), Phi-4-multimodal (5.6B text+vision+audio), Phi-4-reasoning (14B reasoning-tuned), Phi-4-reasoning-plus (14B, RL-enhanced), and Phi-4-reasoning-vision-15B (released March 4, 2026).
Transformers ≥ 4.49.0 required for the Phi-4 base and multimodal variants; Phi-4-mini-reasoning needs ≥ 4.51.3.
Official Ollama tags exist for every text model: phi4, phi4-mini, phi4-reasoning, phi4-mini-reasoning. Third-party community tags are no longer needed.
Phi-4-reasoning-vision-15B adds multimodal reasoning (charts, GUI, math diagrams) as of March 2026, built on the SigLIP-2 vision encoder.
Hardware floor dropped. Phi-4-mini Q4_K_M runs in about 3 GB VRAM; Phi-4 14B Q4_K_M fits on a 12 GB GPU.
Phi-4-multimodal reached #1 on the Hugging Face OpenASR leaderboard (6.14% WER) as of March 2025, ahead of Whisper v3.

The Phi-4 Model Family at a Glance

Model	Parameters	Modalities	Released	Best for	VRAM (Q4)
Phi-4	14B	Text	Dec 2024	Reasoning, math, code	~10 GB
Phi-4-mini	3.8B	Text	Feb 2025	Edge, function calling, multilingual	~3 GB
Phi-4-multimodal	5.6B	Text + Vision + Audio	Feb 2025	Image QA, speech ASR, document OCR	~5 GB
Phi-4-reasoning	14B	Text	Apr 30, 2025	STEM, olympiad math, code	~10 GB
Phi-4-reasoning-plus	14B	Text	Apr 30, 2025	Harder math, RL-enhanced traces	~10 GB
Phi-4-reasoning-vision-15B	15B	Text + Vision	Mar 4, 2026	Chart/GUI understanding, math diagrams	~12 GB

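All models are MIT-licensed and available on Hugging Face, Azure AI Foundry, and GitHub Models. The text models (Phi-4, Phi-4-mini, and the reasoning variants) also have official Ollama tags; Phi-4-multimodal and Phi-4-reasoning-vision-15B do not.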

System Requirements

Ubuntu 20.04 LTS (Focal) and later are supported. Ubuntu 22.04 LTS or 24.04 LTS is recommended for CUDA 12.x driver compatibility.

Hardware Targets by Use Case

Scenario	GPU	VRAM	System RAM	Model recommendation
Laptop / budget desktop	RTX 3060, RTX 4060	8–12 GB	16 GB	Phi-4-mini Q4_K_M or Phi-4 Q4_K_M
Workstation	RTX 3090, RTX 4090, A6000	24–48 GB	32 GB	Phi-4 or Phi-4-reasoning FP16
Research / multimodal	A100-80G, H100	80 GB	64 GB	Phi-4-multimodal or Phi-4-reasoning-vision-15B
CPU-only (slow)	—	—	32 GB+	Phi-4-mini Q4_K_M via llama.cpp

Python: 3.10 recommended (3.9 minimum; the pinned torch 2.6.0 wheels do not support Python 3.8, which is the default on Ubuntu 20.04)
CUDA: 12.x (for Flash Attention 2 and torch 2.6.0)
Disk: 10–30 GB free depending on model
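
The VRAM figures in the tables above can be sanity-checked with rough arithmetic: weight memory is approximately parameter count times bits per weight, divided by 8. The sketch below is only an estimate. The bits-per-weight values and the flat 1.5 GB allowance for KV cache and runtime buffers are assumptions for illustration, not numbers from Microsoft or from this guide.

```python
# Assumed average bits per weight for each format. Real GGUF files vary
# with the tensor mix, so treat these as ballpark values only.
ASSUMED_BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Weights plus a flat, assumed allowance for KV cache and buffers."""
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb


if __name__ == "__main__":
    for name, params in [("Phi-4", 14.0), ("Phi-4-mini", 3.8)]:
        for fmt, bits in ASSUMED_BITS_PER_WEIGHT.items():
            print(f"{name:12s} {fmt:7s} ~{estimate_vram_gb(params, bits):.1f} GB")
```

Long contexts grow the KV cache well beyond a flat allowance, so leave extra headroom when you plan to use the reasoning models with large max_new_tokens values.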

Preparing Your Ubuntu Environment

Update System Packages

sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip python3-venv git curl build-essential -y

Verify GPU and CUDA

nvidia-smi          # shows driver version and GPU name
nvcc --version      # shows CUDA toolkit version

If nvidia-smi fails, install the NVIDIA driver for your GPU from developer.nvidia.com/cuda-downloads. CUDA 12.4 or later is recommended for full compatibility with PyTorch 2.6.0 and Flash Attention 2.7.
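
If you want to run the same GPU check from a script (for example, before loading a model), one possible approach is to call nvidia-smi with its CSV query flags and parse the result. This is a minimal sketch; the function names are hypothetical, and it simply returns an empty list when nvidia-smi is missing or fails.

```python
import shutil
import subprocess

# nvidia-smi's CSV query mode: one line per GPU, values in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list:
    """Parse 'name, total_mib, free_mib' lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name is harmless.
        parts = [field.strip() for field in line.rsplit(",", 2)]
        if len(parts) != 3:
            continue
        name, total, free = parts
        try:
            gpus.append({"name": name, "total_mib": int(total), "free_mib": int(free)})
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
    return gpus


def query_gpus() -> list:
    """Return GPU info, or an empty list if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(f"{gpu['name']}: {gpu['free_mib']} MiB free of {gpu['total_mib']} MiB")
```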

Installation Methods Overview

Method	Best for	Setup difficulty	Flexibility	GPU required
Ollama	Quick chat, all Phi-4 text models	Easiest	Moderate	Recommended, not mandatory
Python / Transformers	Custom scripts, multimodal, fine-tuning	Moderate	Highest	Yes (Flash Attention 2)
vLLM	High-throughput API serving	Advanced	High	Yes (NVIDIA only)

Method 1: Ollama (Fastest Path)

Ollama now ships official Microsoft Phi-4 images for every text model in the family. This is the fastest way to get any Phi-4 variant running. If you are also setting up a full local AI agent stack, see the OpenClaw + Ollama setup guide for running local AI agents which covers agent orchestration on top of Ollama.

Step 1: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Verify: ollama --version — you need at least 0.5.13 for Phi-4-mini and the reasoning models.

Step 2: Start the Ollama Server

ollama serve

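This runs in the foreground. Open a second terminal for model commands, or run it in the background with ollama serve &. On Linux the install script normally registers Ollama as a systemd service, so the server may already be running; in that case ollama serve reports that the address is already in use and you can skip this step.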
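
To confirm the server is reachable before pulling models, you can query Ollama's local REST API. The sketch below assumes the default address http://127.0.0.1:11434 and uses the /api/tags endpoint, which lists models already pulled. The helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_model_names(tags_json: str) -> list:
    """Extract model names from the JSON body returned by /api/tags."""
    data = json.loads(tags_json)
    return [model["name"] for model in data.get("models", [])]


def list_local_models(base_url: str = "http://127.0.0.1:11434", timeout: float = 3.0):
    """Return the pulled model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None


# Example use (requires a running server):
#   models = list_local_models()
#   print("server not reachable" if models is None else models)
```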

Step 3: Choose and Pull a Model

# Base model, 14B, general-purpose
ollama pull phi4

# Mini model, 3.8B — fits on 8 GB VRAM
ollama pull phi4-mini

# Reasoning-tuned 14B — math, science, code
ollama pull phi4-reasoning

# Mini reasoning — 3.8B, STEM focus
ollama pull phi4-mini-reasoning

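Each model download is 3–10 GB depending on the quantization. Ollama stores models in ~/.ollama/models by default; when it runs as the systemd service on Linux, the models typically live under the ollama service user's home directory instead.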

Step 4: Start an Interactive Chat

ollama run phi4-reasoning

For the reasoning models, Ollama parses the chain-of-thought output automatically. You will see a <think>...</think> block followed by the final answer.
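
If you capture the model output in a script, you will often want the final answer without the reasoning trace. A minimal sketch, assuming the output contains at most one <think>...</think> block as described above (the function name is hypothetical):

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str):
    """Return (reasoning, answer). Reasoning is empty if no block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is treated as the final answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>Factor: (x - 2)(x - 3) = 0</think>\nx = 2 or x = 3"
    reasoning, answer = split_reasoning(raw)
    print("Answer:", answer)
```

If generation is cut off by a token limit before the closing tag appears, this sketch returns the whole text as the answer, so check for an unmatched <think> if that matters for your use case.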

Optional: Web UI with Open WebUI

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser and connect it to your local Ollama instance.

Method 2: Python + Hugging Face Transformers

This method gives you full control over inference parameters, multimodal inputs, and custom pipelines. Of the three methods in this guide, it is the one to use for Phi-4-multimodal and Phi-4-reasoning-vision-15B.

Step 1: Create a Virtual Environment

python3 -m venv phi4env
source phi4env/bin/activate

Step 2: Install Dependencies

For Phi-4, Phi-4-reasoning, and Phi-4-multimodal, use these tested versions:

pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.49.0 accelerate==1.3.0 peft==0.13.2
pip install flash_attn==2.7.4.post1 --no-build-isolation  # builds against the installed torch; requires CUDA 12.x
pip install "huggingface_hub[cli]"

For Phi-4-multimodal only, also install:

pip install soundfile==0.13.1 pillow==11.1.0 scipy==1.15.2 backoff==2.2.1

For Phi-4-mini-reasoning (uses a newer tokenizer):

pip install transformers==4.51.3
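
After installing, it can help to confirm that the environment meets the version floors mentioned in this guide. The sketch below reads installed versions with importlib.metadata and compares only the leading numeric components, which is a simplification (it ignores pre-release and local version tags). The MINIMUMS table and function names are illustrative assumptions; raise the transformers floor to 4.51.3 if you plan to run Phi-4-mini-reasoning.

```python
import re
from importlib import metadata

# Version floors taken from this guide; adjust for the variant you run.
MINIMUMS = {"transformers": "4.49.0"}


def version_tuple(version: str) -> tuple:
    """Leading numeric components only: '2.7.4.post1' -> (2, 7, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def check_minimums(minimums: dict) -> dict:
    """Map each package to a short status string."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[package] = installed + " (ok)"
        else:
            report[package] = installed + " (below " + minimum + ")"
    return report


if __name__ == "__main__":
    for package, status in check_minimums(MINIMUMS).items():
        print(f"{package}: {status}")
```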

Step 3: Download Model Weights

mkdir -p ~/models/phi4

# Phi-4 base (14B text)
huggingface-cli download microsoft/phi-4 --local-dir ~/models/phi4

# Or Phi-4-reasoning
huggingface-cli download microsoft/Phi-4-reasoning --local-dir ~/models/phi4-reasoning

# Or Phi-4-multimodal
huggingface-cli download microsoft/Phi-4-multimodal-instruct --local-dir ~/models/phi4-multimodal

You may need to authenticate for gated models: huggingface-cli login

Step 4a: Text Inference (Phi-4 / Phi-4-reasoning)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "microsoft/Phi-4-reasoning"  # or "microsoft/phi-4"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system", "content": "You are Phi, a helpful AI assistant."},
    {"role": "user", "content": "Solve: x^2 - 5x + 6 = 0, showing every step."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=4096,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)
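# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs.shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))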

Note for Phi-4-reasoning: Set max_new_tokens=32768 for complex problems — the chain-of-thought output is long by design.

Step 4b: Multimodal Inference (Phi-4-multimodal)

import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

model_path = "microsoft/Phi-4-multimodal-instruct"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # use "eager" for older GPUs (V100 or earlier)
).cuda()

generation_config = GenerationConfig.from_pretrained(model_path)

# Image + text prompt
image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

prompt = "<|user|><|image_1|>What insect is this, and what distinguishing features do you see?<|end|><|assistant|>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

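generate_ids = model.generate(**inputs, max_new_tokens=500, generation_config=generation_config)
# Drop the prompt tokens so only the model's answer is decoded
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)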
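
The prompt string in the example above follows a fixed pattern: a user tag, numbered media placeholders, the question, then the end and assistant tags. A small helper can build it for you. This is a sketch; the numbering of additional images and the <|audio_N|> placeholder are assumptions based on the single-image pattern shown here, so verify them against the model card before relying on them.

```python
def build_prompt(question: str, num_images: int = 0, num_audio: int = 0) -> str:
    """Build a single-turn prompt with numbered media placeholders.

    Assumption: placeholders are numbered from 1 and placed before the text,
    mirroring the <|image_1|> usage in the example above.
    """
    image_tags = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    audio_tags = "".join(f"<|audio_{i}|>" for i in range(1, num_audio + 1))
    return f"<|user|>{image_tags}{audio_tags}{question}<|end|><|assistant|>"


if __name__ == "__main__":
    print(build_prompt("What insect is this?", num_images=1))
```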

Method 3: vLLM (High-Throughput API Serving)

Use vLLM when you need to serve Phi-4 as an OpenAI-compatible REST API, handle concurrent requests, or run continuous batching for production workloads.

Step 1: Install vLLM

pip install vllm

Step 2a: Serve Phi-4 (Text)

vllm serve microsoft/phi-4 \
  --host 0.0.0.0 \
  --port 7000 \
  --dtype auto

Step 2b: Serve Phi-4-reasoning

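The reasoning model needs a reasoning parser so that vLLM can separate the chain-of-thought from the final answer. Flag names have shifted between vLLM releases (newer versions may accept --reasoning-parser on its own and no longer need --enable-reasoning), so check vllm serve --help for your installed version: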

vllm serve microsoft/Phi-4-reasoning \
  --host 0.0.0.0 \
  --port 7000 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1

Step 2c: Serve a GGUF-Quantized Model

# Download Phi-4 Q4_K_M GGUF (fits 12 GB VRAM)
huggingface-cli download bartowski/phi-4-GGUF \
  phi-4-Q4_K_M.gguf --local-dir ./models

vllm serve ./models/phi-4-Q4_K_M.gguf \
  --tokenizer microsoft/phi-4 \
  --host 0.0.0.0 \
  --port 7000

Step 3: Test the API

curl http://localhost:7000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/phi-4",
    "messages": [{"role": "user", "content": "Explain gradient descent in one paragraph."}]
  }'
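
The same request can be sent from Python using only the standard library. This sketch assumes the server from Step 2a is listening on localhost:7000 and that the response follows the OpenAI chat completions shape (choices[0].message.content). The function names and the max_tokens default are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "microsoft/phi-4",
                       base_url: str = "http://localhost:7000",
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of a chat completions response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the reply (needs a running server)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(resp.read().decode("utf-8"))


# Example use (requires the vLLM server to be running):
#   print(chat("Explain gradient descent in one paragraph."))
```

The model name in the payload must match the name the server was started with, so pass model="microsoft/Phi-4-reasoning" when testing the Step 2b server.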

How to Choose the Right Phi-4 Variant

Use this decision tree to pick the right model for your workload:

Need to run on 8 GB VRAM or less? → Phi-4-mini (Q4_K_M, ~3 GB). Phi-4 at Q4_K_M needs ~10 GB, so plan on a 12 GB GPU for the 14B models.
Need math, science, or code reasoning? → Phi-4-reasoning or Phi-4-reasoning-plus (14B, MIT). These beat DeepSeek-R1-Distill-Llama-70B — a model 5× larger — on most math benchmarks.
Need image or audio understanding? → Phi-4-multimodal (5.6B, handles images + audio)
Need chart, GUI, or diagram analysis with structured reasoning? → Phi-4-reasoning-vision-15B (March 2026, 15B)
Need function calling on a small model? → Phi-4-mini (built-in function calling support)
Need edge deployment (mobile, IoT)? → Phi-4-mini or Phi-4-mini-reasoning
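
The list above can be sketched as a small selection function. The order in which the checks are applied and the 8 GB threshold for "small hardware" are assumptions made for this illustration; the guide itself does not rank the criteria.

```python
def choose_variant(*, needs_visual_reasoning: bool = False,
                   needs_images: bool = False,
                   needs_audio: bool = False,
                   needs_function_calling: bool = False,
                   needs_hard_reasoning: bool = False,
                   vram_gb=None) -> str:
    """Pick a Phi-4 variant name from a few yes/no requirements."""
    # Modality requirements come first because only some variants accept them.
    if needs_visual_reasoning:
        return "Phi-4-reasoning-vision-15B"
    if needs_images or needs_audio:
        return "Phi-4-multimodal"

    # Assumed cut-off: 8 GB or less is treated as constrained hardware.
    small_hardware = vram_gb is not None and vram_gb <= 8

    if needs_function_calling:
        return "Phi-4-mini"
    if needs_hard_reasoning:
        return "Phi-4-mini-reasoning" if small_hardware else "Phi-4-reasoning"
    if small_hardware:
        return "Phi-4-mini"
    return "Phi-4"


if __name__ == "__main__":
    print(choose_variant(needs_hard_reasoning=True, vram_gb=24))
    print(choose_variant(vram_gb=8))
```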

Performance & Benchmarks

Phi-4-reasoning (14B) vs. Larger Models

Benchmark	Phi-4-reasoning	DeepSeek-R1	o1-mini
AIME 2024	75.3%	78.7%	63.6%
AIME 2025	62.9%	70.4%	54.8%
GPQA-Diamond	65.8%	73.0%	60.0%
HumanEval+	92.9%	—	—
MMLU-Pro	74.3%	—	—

Source: Phi-4-reasoning model card on Hugging Face, April 2025.

Phi-4 Base (14B) General Benchmarks

Benchmark	Phi-4	Notes
MMLU	84.8%	vs. Phi-3's 77.9%
HumanEval (code)	82.6%	—
MATH (competition)	56.1%	—

Phi-4-multimodal (5.6B) Benchmark Highlights

Task	Score	Notes
OpenASR WER	6.14%	#1 on Hugging Face OpenASR leaderboard (Mar 2025)
MMMU	55.1%	Competitive with larger vision models
MMBench	86.7%	—
DocVQA	93.2%	—

Phi-4-reasoning-vision-15B Benchmark Highlights

Benchmark	Score
AI2D_TEST (diagram)	84.8%
ChartQA_TEST	83.3%
MathVista_MINI	75.2%
ScreenSpot_v2 (GUI)	88.2%
MMMU_VAL	54.3%

Source: Microsoft Foundry announcement, March 2026.

Practical Performance Notes

Phi-4-reasoning at Q4_K_M on an RTX 4060 Ti (16 GB VRAM) generates 10–15 tokens/sec. The chain-of-thought traces can be very long — budget extra wait time for hard math problems.
Phi-4-mini runs at 50–80 tokens/sec on the same hardware (Q4_K_M), making it more suitable for interactive applications.
CPU-only inference with Phi-4 Q4_K_M via llama.cpp is around 3–5 tokens/sec on a modern 16-core CPU — usable for batch jobs, not chat.
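
To turn these throughput figures into wait times, divide the number of generated tokens by tokens per second. As a rough illustration using the numbers above, an 8192-token reasoning trace at 10–15 tokens/sec takes roughly 9 to 14 minutes. The helper names below are hypothetical.

```python
def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


def wait_range_seconds(num_tokens: int, slow_tps: float, fast_tps: float):
    """Return (best_case, worst_case) wait in seconds for a throughput range."""
    return generation_seconds(num_tokens, fast_tps), generation_seconds(num_tokens, slow_tps)


if __name__ == "__main__":
    best, worst = wait_range_seconds(8192, slow_tps=10, fast_tps=15)
    print(f"8192 tokens: {best / 60:.1f} to {worst / 60:.1f} minutes")
```

This ignores prompt processing time and any slowdown as the context grows, so real waits tend to be somewhat longer.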

LM Studio: GUI Alternative to Ollama

If you prefer a graphical interface without using Docker, LM Studio supports all Phi-4 GGUF models. Download the Linux AppImage, search for "phi-4" or "phi-4-reasoning" in the built-in model browser, and download the Q4_K_M quantization for your VRAM target. LM Studio also exposes a local OpenAI-compatible API on port 1234 that integrates directly with tools like Continue (VS Code extension) and Cursor.

Common Pitfalls & Troubleshooting

CUDA out of memory: Drop to a lower quantization. Phi-4 (14B) at Q8_0 needs ~16 GB VRAM; at Q4_K_M it needs ~10 GB. Use nvidia-smi to check free VRAM before loading.
Flash Attention 2 install fails: This package compiles from source and requires g++, ninja, and CUDA headers. Install with pip install flash_attn --no-build-isolation after verifying nvcc --version matches your PyTorch CUDA build.
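Older GPU (V100 or earlier): Flash Attention 2 targets Ampere and newer architectures and does not support sm70 GPUs; Turing cards such as the T4 and RTX 20-series are also not covered by the 2.x releases. Use _attn_implementation="eager" instead. Expect 20–40% slower inference.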
Phi-4-reasoning extremely slow: This is a known community observation — the reasoning model generates very long internal traces. Set max_new_tokens to a reasonable ceiling (8192–16384) to avoid runaway generation. For speed-critical use cases, use the base Phi-4 or Phi-4-mini instead.
vLLM engine process failed: Ensure vLLM version is compatible with your PyTorch version. Run pip show vllm and check the vLLM changelog for your version's PyTorch dependency. A mismatch between flash-attn and PyTorch CUDA versions is the most common cause.
LoRA adapter errors with Phi-4-multimodal: Ensure peft==0.13.2 is installed and that the adapter config points to the correct base model path.
HuggingFace download 401 errors: Some Phi-4 variants require accepting Microsoft's license on the model card page. Run huggingface-cli login with your token after accepting the license on the Hugging Face website.
Ollama model not found: Third-party community tags like vanilj/Phi-4 are now superseded by official Microsoft tags. Use phi4, phi4-mini, phi4-reasoning, or phi4-mini-reasoning.

Best Practices for Production and Research

Always use a virtual environment to isolate Phi-4 dependencies from system Python packages.
Monitor GPU temperature during long reasoning runs: watch -n 2 nvidia-smi. The reasoning models generate more tokens per query, producing more heat.
Use SSH key authentication for remote servers. Never rely on password auth when exposing Ollama or vLLM APIs, even on private networks.
Bind vLLM to localhost by default (--host 127.0.0.1) rather than the 0.0.0.0 used in the serving examples above. Only expose it externally behind a reverse proxy with authentication.
Pin dependency versions in your requirements.txt. The Phi-4 ecosystem moves fast — a transformers upgrade can break multimodal processor compatibility.
Use device_map="auto" when you have multiple GPUs. Transformers will shard the model automatically across available VRAM.

FAQ

Is Phi-4 better than Llama 4?

They target different use cases. Phi-4-reasoning (14B) beats Llama 3-based distillations in math benchmarks but is narrower in general knowledge. Meta's Llama 4 Scout and Llama 4 Maverick (released April 2025) are larger multi-expert models better suited for broad general-purpose tasks. For reasoning and math on constrained hardware, Phi-4-reasoning is the better choice. For general RAG or chat assistants, Llama 4 is more versatile.

Should I use Phi-4-multimodal or Whisper for speech recognition?

Phi-4-multimodal held the #1 position on the Hugging Face OpenASR leaderboard with a 6.14% WER as of March 2025, ahead of Whisper v3's score. If you need ASR as part of a multimodal pipeline that also does image or text tasks, Phi-4-multimodal eliminates the need for a separate ASR model. If you need only ASR and want a smaller, specialized model, Whisper large-v3 remains simpler to deploy.

Does Phi-4-mini support function calling?

Yes. Phi-4-mini has built-in function calling support, making it suitable for agent frameworks and tool-use scenarios on constrained hardware. The Phi-4 base model and the reasoning variants do not have native function calling baked in — you need prompt engineering or a framework like LangChain for that.

Can I run Phi-4 without a GPU?

Yes, via llama.cpp or Ollama on CPU, but it is slow. Phi-4-mini Q4_K_M at 3–5 tokens/sec on a modern CPU is usable for batch tasks. Phi-4 14B on CPU is practically unusable for interactive chat. If you need GPU-free inference at speed, consider Phi-4-mini-reasoning, which fits in ~3 GB RAM in its quantized form and runs faster on CPU than the 14B models.

What is the difference between Phi-4 and Phi-4-reasoning?

Phi-4 is a general-purpose 14B model trained primarily on synthetic data for complex reasoning tasks. Phi-4-reasoning is the same base model fine-tuned on 1.4 million STEM and coding chain-of-thought traces; Phi-4-reasoning-plus adds a reinforcement-learning stage on top to further improve math and science accuracy. Phi-4-reasoning consistently scores 10–15 points higher on olympiad math benchmarks but generates much longer output and is slower per query. Use Phi-4 for chat and code; use Phi-4-reasoning when accuracy on hard math or logic problems is the priority.

Is Phi-4 free for commercial use?

Yes. All current Phi-4 variants — including Phi-4, Phi-4-mini, Phi-4-multimodal, Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-reasoning-vision-15B — are released under the MIT license, which permits commercial use, modification, and redistribution.

Which Ubuntu version should I use?

Ubuntu 22.04 LTS (Jammy Jellyfish) is the safest choice as of 2026. It ships with Python 3.10, has long-term support through 2027 (standard) and 2032 (extended), and all NVIDIA CUDA 12.x drivers have tested Ubuntu 22.04 packages. Ubuntu 24.04 LTS (Noble Numbat) also works but is newer, and some CUDA driver packages lag by a few weeks for new CUDA versions. Ubuntu 20.04 LTS (Focal) still runs Phi-4 but reached the end of standard support in 2025, so an upgrade is recommended.