Run Microsoft Phi-4 Mini on Linux: Complete 2026 Guide (Ollama, Transformers, llama.cpp)
Last updated April 2026 — refreshed for current model/tool versions.
Microsoft's Phi-4 Mini is a 3.8-billion-parameter language model that punches well above its weight class: it matches Llama 3.1 8B on MMLU, beats it on MATH (64% vs 52%), and runs comfortably on a consumer GPU with just 4 GB of VRAM. This guide covers every practical path to running it on Linux (Ollama, HuggingFace Transformers, llama.cpp with GGUF, vLLM, and LM Studio), including the 2025/2026 reasoning variants that didn't exist when the original post was written.
What changed since early 2025
- Phi-4-mini-reasoning (April 2025): A math-specialist fine-tune that raises AIME accuracy from 10.0 to 57.5 and MATH-500 from 71.8 to 94.6. It is a separate model, not an update to the base instruct model.
- Phi-4-mini-flash-reasoning (July 2025): Hybrid architecture that delivers up to 10× higher decode throughput vs. the reasoning variant while scoring 92.45% on MATH-500. Its context window is 64K instead of 128K.
- Ollama minimum version: The phi4-mini Ollama tag now requires Ollama 0.5.13 or later. Older installs will fail silently.
- Flash attention default changed: The HuggingFace model now enables flash attention by default. NVIDIA V100 and earlier GPUs must explicitly pass attn_implementation="eager".
- ROCm parity (March 2026): ROCm 7.2 achieves out-of-the-box parity with CUDA for Ollama, llama.cpp, and vLLM, so AMD GPU users no longer need custom patches.
| Attribute | Value |
|---|---|
| Parameters | 3.8 B |
| Architecture | Dense decoder-only Transformer (32 layers, LongRoPE) |
| Context length | 128K tokens |
| Vocabulary | 200,064 tokens (multilingual) |
| License | MIT (commercial use OK) |
| Release date | February 2025 |
| VRAM (Q4_K_M GGUF) | ~2.5 GB |
| VRAM (full BF16) | ~8 GB |
| MMLU (5-shot) | 67.3% |
| HumanEval (0-shot) | 74.4% |
| MATH benchmark | 64.0% |
The Phi-4 Family in 2026: Which Model Do You Need?
The "Phi-4 Mini" name now covers three distinct models. Picking the wrong one is the most common mistake new users make.
| Model | Best for | Context | MATH-500 | Ollama tag |
|---|---|---|---|---|
| Phi-4-mini-instruct | General chat, code, instruction following | 128K | 71.8% | phi4-mini |
| Phi-4-mini-reasoning | Complex multi-step math and logic | 128K | 94.6% | phi4-mini-reasoning |
| Phi-4-mini-flash-reasoning | Math at edge/mobile speed (10× faster than reasoning) | 64K | 92.45% | not yet on Ollama; HF only |
Unless you need heavy mathematical reasoning, start with phi4-mini (the instruct variant). If you are building on a local AI agent stack, the OpenClaw + Ollama setup guide for running local AI agents covers integrating Phi-4 Mini into a complete agent pipeline with tool use and memory.
Prerequisites and Hardware Requirements
Hardware
- GPU (recommended): Any NVIDIA GPU with ≥4 GB VRAM (Turing/RTX 20-series or newer) for full-speed inference. On an RTX 4090, expect ~300 tokens/second at Q4_K_M quantization.
- GPU flash attention support: A100, A6000, H100. These use the default flash_attention_2 backend. V100 and older must pass attn_implementation="eager".
- AMD GPU: ROCm 7.2 (March 2026) provides full parity with CUDA for Ollama, llama.cpp, and vLLM.
- CPU-only: Supported via llama.cpp. A modern 8-core CPU generates ~10–20 tokens/second at Q4_K_M — usable for batch workloads, slow for interactive chat.
- RAM: Minimum 8 GB system RAM. 16 GB recommended if running the full BF16 model through HuggingFace Transformers.
- Disk: ~2.5 GB for the Q4_K_M GGUF; ~8 GB for the full BF16 HuggingFace weights.
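The disk and VRAM figures above follow from simple arithmetic on the parameter count. The sketch below is a rough estimate only: it assumes the weights dominate memory and ignores the KV cache and runtime overhead, and the function name is illustrative.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes) at a given precision."""
    return params_billions * bits_per_weight / 8


# BF16 stores each weight in 16 bits: 3.8 B parameters is about 7.6 GB,
# consistent with the ~8 GB figure once some overhead is added.
bf16_gb = weights_gb(3.8, 16)

# Working backwards from the 2.49 GB Q4_K_M file gives the effective
# bits per weight. It lands above 4 because K-quants store scales and
# keep some tensors at higher precision.
q4_effective_bits = 2.49 * 8 / 3.8

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"Q4_K_M effective bits per weight: {q4_effective_bits:.2f}")
```

The same function gives a quick first guess for any other precision, for example `weights_gb(3.8, 8)` for an 8-bit load.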
Software
- Linux (Ubuntu 22.04+, Debian 12+, Fedora 39+, Arch — all work)
- Python 3.10 or later
- NVIDIA driver ≥ 525 and CUDA 12.x (for GPU paths)
- Ollama 0.5.13 or later (for the Ollama path)
Method 1: Ollama (Fastest Path, Recommended for Most Users)
Ollama wraps model download, quantization selection, and serving into a single command. It is the fastest way to get Phi-4 Mini running interactively.
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama --version # confirm 0.5.13 or later
After installation, Ollama runs as a systemd service on port 11434. Verify it is up:
systemctl status ollama
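If you script your setup, the minimum-version check can be automated. This is a minimal sketch that assumes `ollama --version` prints a dotted version number somewhere in its output; the helper names are hypothetical. Comparing integer tuples matters here, because a plain string comparison would rank 0.5.7 above 0.5.13.

```python
import re
import subprocess


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted version number (e.g. 0.5.13) from text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_output: str, minimum: tuple[int, ...] = (0, 5, 13)) -> bool:
    """Return True if the reported version is at least the minimum."""
    return parse_version(version_output) >= minimum


def installed_ollama_ok() -> bool:
    """Run `ollama --version` and check it against the 0.5.13 minimum."""
    result = subprocess.run(["ollama", "--version"], capture_output=True, text=True)
    return meets_minimum(result.stdout + result.stderr)
```

Call `installed_ollama_ok()` before pulling the model and re-run the installer if it returns False.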
Pull and Run Phi-4 Mini
# Instruct model (general use) — 2.5 GB download
ollama pull phi4-mini
# Start an interactive session
ollama run phi4-mini
For the math-reasoning variant:
ollama pull phi4-mini-reasoning
ollama run phi4-mini-reasoning
Call via REST API
curl http://localhost:11434/api/chat -d '{
"model": "phi4-mini",
"messages": [
{"role": "user", "content": "Write a Python function to binary search a sorted list."}
]
}'
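The same endpoint can be called from Python with only the standard library. This sketch assumes the Ollama server is listening on the default port and sets `"stream": False` so the reply arrives as a single JSON object rather than a stream of chunks; the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt: str, model: str = "phi4-mini") -> dict:
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }


def chat(prompt: str, model: str = "phi4-mini") -> str:
    """Send the prompt to a local Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["message"]["content"]
```

With the server running, `print(chat("Write a Python function to binary search a sorted list."))` mirrors the curl call above.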
Run Ollama in Docker
docker run -d \
--name ollama \
--gpus all \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama
docker exec -it ollama ollama run phi4-mini
Method 2: HuggingFace Transformers (Full Control)
Use this path when you need function calling, custom system prompts, pipeline integration, or programmatic control over generation parameters.
Install Dependencies
pip install torch==2.5.1 \
transformers==4.49.0 \
accelerate==1.3.0 \
flash_attn==2.7.4.post1
If flash_attn fails to install (common on older GPUs), omit it — you will pass attn_implementation="eager" instead.
Basic Inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_id = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto", # spreads across available GPUs/CPU
torch_dtype="auto", # bfloat16 on modern GPUs
trust_remote_code=True,
# attn_implementation="eager" # uncomment for V100 or older GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain what LongRoPE positional encoding does."},
]
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe(messages, max_new_tokens=500, temperature=0.0, do_sample=False)
print(output[0]["generated_text"][-1]["content"])
Function Calling (New in Phi-4 Mini)
Phi-4 Mini adds native function-calling support that Phi-3.5 Mini lacked. Use the <|tool|> block in the system prompt:
system = """You are a helpful assistant with some tools.<|tool|>[
{
"name": "get_weather",
"description": "Return current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"}
},
"required": ["city"]
}
}
]<|/tool|>"""
messages = [
{"role": "system", "content": system},
{"role": "user", "content": "What is the weather in Berlin?"},
]
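Hand-writing the tool block invites delimiter and JSON typos. One way to avoid them is to generate the system prompt from ordinary Python dictionaries, as sketched below. The helper name is hypothetical, and the `<|tool|>` ... `<|/tool|>` delimiters are assumed to follow the format on the model card.

```python
import json


def build_tool_system_prompt(instructions: str, tools: list[dict]) -> str:
    """Embed tool definitions as a JSON list between <|tool|> and <|/tool|>."""
    return f"{instructions}<|tool|>{json.dumps(tools)}<|/tool|>"


weather_tool = {
    "name": "get_weather",
    "description": "Return current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

system = build_tool_system_prompt(
    "You are a helpful assistant with some tools.", [weather_tool]
)
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "What is the weather in Berlin?"},
]
```

Because `json.dumps` produces the tool list, the embedded definitions are always valid JSON, and adding a second tool is a one-line change.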
4-Bit Quantization via bitsandbytes (Low-VRAM GPU)
# Requires the bitsandbytes package: pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-4-mini-instruct",
quantization_config=quant_config,
device_map="auto",
trust_remote_code=True,
)
Method 3: llama.cpp with GGUF (CPU-Friendly, Maximum Portability)
llama.cpp is ideal when you want CPU-only inference, the smallest possible binary, or maximum compatibility across hardware.
Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# GPU build (NVIDIA CUDA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# CPU-only build
cmake -B build
cmake --build build --config Release -j $(nproc)
Download a GGUF Quantization
The Q4_K_M quantization is the best default for most users: 2.49 GB, minimal quality loss. For very low VRAM, Q3_K_M at 2.12 GB is the next step down.
pip install -U "huggingface_hub[cli]"
# Download Q4_K_M (recommended default)
huggingface-cli download bartowski/microsoft_Phi-4-mini-instruct-GGUF \
--include "microsoft_Phi-4-mini-instruct-Q4_K_M.gguf" \
--local-dir ./phi4-mini-gguf/
Quantization Guide
| Quantization | Size | Quality | Recommended when |
|---|---|---|---|
| Q8_0 | 4.08 GB | Near-lossless | You have ≥6 GB VRAM and want maximum fidelity |
| Q6_K | 3.16 GB | Very high | 5 GB VRAM, near-perfect quality |
| Q5_K_M | 2.85 GB | High | 4–5 GB VRAM, excellent quality/size tradeoff |
| Q4_K_M | 2.49 GB | Good (default) | Most users — 4 GB VRAM or 8 GB RAM (CPU) |
| Q3_K_M | 2.12 GB | Medium-low | Very tight RAM; expect some degradation |
| Q2_K | 1.68 GB | Low but usable | Emergency RAM constraint only |
Run Inference
# Interactive chat (CPU)
./build/bin/llama-cli \
-m ./phi4-mini-gguf/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf \
-p "<|system|>You are a helpful AI assistant.<|end|><|user|>Explain quantization.<|end|><|assistant|>" \
-n 512
# GPU-accelerated (offload all layers to CUDA)
./build/bin/llama-cli \
-m ./phi4-mini-gguf/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf \
-ngl 999 \
-p "<|system|>You are a helpful assistant.<|end|><|user|>Write a Rust function to parse JSON.<|end|><|assistant|>" \
-n 512
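When launching llama-cli from a script, passing the arguments as a list avoids shell quoting problems with the `<|...|>` template tokens. The sketch below uses only flags that appear in this guide (`-m`, `-p`, `-n`, `-ngl`, `-t`); the helper name and the default binary path are illustrative.

```python
import os


def llama_cli_command(model_path, prompt, max_tokens=512, gpu_layers=0, threads=None):
    """Assemble a llama-cli argument list for a CPU or GPU run."""
    args = ["./build/bin/llama-cli", "-m", model_path, "-p", prompt, "-n", str(max_tokens)]
    if gpu_layers > 0:
        # GPU run: offload the requested number of layers
        args += ["-ngl", str(gpu_layers)]
    else:
        # CPU-only run: set the thread count, defaulting to all logical cores
        args += ["-t", str(threads or os.cpu_count() or 1)]
    return args
```

Pass the result to `subprocess.run(args, check=True)`; no shell is involved, so the prompt needs no escaping.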
Method 4: vLLM (High-Throughput Production Serving)
vLLM is the right choice when you need to serve multiple concurrent users, need OpenAI-compatible API endpoints, or need maximum throughput from a GPU server. It uses PagedAttention for efficient KV-cache management.
pip install vllm
from vllm import LLM, SamplingParams
llm = LLM(model="microsoft/Phi-4-mini-instruct", trust_remote_code=True)
sampling_params = SamplingParams(max_tokens=500, temperature=0.0)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the main use cases for small language models?"},
]
output = llm.chat(messages=messages, sampling_params=sampling_params)
print(output[0].outputs[0].text)
Serve an OpenAI-compatible endpoint:
vllm serve microsoft/Phi-4-mini-instruct \
--trust-remote-code \
--port 8000
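Once the server is up, any OpenAI-style client can talk to it. The following standard-library sketch assumes the server started above is reachable at `http://localhost:8000/v1` and that the chat completions route is enabled; the function names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    base_url="http://localhost:8000/v1",
    model="microsoft/Phi-4-mini-instruct",
):
    """Build a POST request for the OpenAI-style chat completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 500,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def complete(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]
```

Changing `base_url` is enough to point the same client at another OpenAI-compatible server.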
Method 5: LM Studio (GUI, No Terminal Required)
LM Studio provides a desktop GUI for Linux (AppImage), Windows, and macOS. It downloads GGUF models directly from Hugging Face.
- Download the LM Studio AppImage from lmstudio.ai.
- Search for Phi-4-mini-instruct in the model catalog.
- Select Q4_K_M and click Download.
- Click Run to start the local server.
LM Studio exposes an OpenAI-compatible API on localhost:1234, so any tool that speaks OpenAI can point to it.
Chat Prompt Format
If you are using llama.cpp directly, the template is:
<|system|>{system message}<|end|><|user|>{user message}<|end|><|assistant|>
If you are using HuggingFace Transformers, tokenizer.apply_chat_template(messages) handles this automatically.
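For llama.cpp and other raw-prompt tools, the template is easy to render by hand. This is a minimal sketch of the format shown above, not a replacement for the tokenizer's own chat template; the function name is illustrative.

```python
def format_phi4_prompt(messages: list[dict]) -> str:
    """Render chat messages into the <|role|>content<|end|> template."""
    rendered = "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)
    return rendered + "<|assistant|>"  # cue the model to start its reply


prompt = format_phi4_prompt(
    [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantization."},
    ]
)
print(prompt)
```

The printed string is the same prompt passed to `-p` in the llama.cpp CPU example earlier in this guide.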
Performance and Benchmarks
The following numbers are from the official Hugging Face model card (February 2025) and the Phi-4 technical reports. Do not treat these as absolute rankings — scores vary by prompt format and evaluation harness.
| Benchmark | Phi-4-mini-instruct (3.8B) | Llama 3.2 3B | Qwen2.5-7B | GPT-4o-mini |
|---|---|---|---|---|
| MMLU (5-shot) | 67.3% | ~58% | ~74% | 82% |
| HumanEval (0-shot) | 74.4% | ~55% | ~72% | 87% |
| MATH | 64.0% | ~48% | ~69% | ~80% |
| GSM8K | 88.6% | ~77% | ~91% | ~97% |
| BigBench Hard | 70.4% | — | — | — |
| ARC Challenge | 83.7% | ~78% | ~85% | ~96% |
Inference speed (RTX 4090, Q4_K_M GGUF): approximately 300 tokens/second — roughly 2× faster than running a 7–8B model at similar quantization on the same hardware.
Reasoning Variants: Benchmark Comparison
| Model | AIME 2024 | MATH-500 | GPQA Diamond |
|---|---|---|---|
| Phi-4-mini-instruct | 10.0% | 71.8% | 36.9% |
| Phi-4-mini-reasoning | 57.5% | 94.6% | 52.0% |
| Phi-4-mini-flash-reasoning | ~52% | 92.45% | — |
| o1-mini (reference) | 63.6% | 90.0% | 60.0% |
How to Choose Your Deployment Method
- Just want to try it, interactive chat: Ollama (ollama run phi4-mini).
- Need programmatic control, function calling, or pipeline integration: HuggingFace Transformers.
- Running on CPU only or targeting very low VRAM: llama.cpp with Q4_K_M GGUF.
- Serving multiple concurrent users in production: vLLM with OpenAI-compatible endpoint.
- No terminal experience, GUI preferred: LM Studio (AppImage on Linux).
- Need heavy math/reasoning: Swap phi4-mini for phi4-mini-reasoning in any of the above paths.
Real-World Use Cases
IDE Code Completion via Continue.dev
Phi-4 Mini is well-suited as a local code-completion backend for the Continue.dev VS Code extension. In your ~/.continue/config.json:
{
"models": [{
"title": "Phi-4 Mini (Local)",
"provider": "ollama",
"model": "phi4-mini",
"apiBase": "http://localhost:11434"
}]
}
Text Summarization
messages = [
{"role": "system", "content": "Summarize the following text in two sentences."},
{"role": "user", "content": "Artificial intelligence is transforming..."},
]
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(messages, max_new_tokens=100, temperature=0.0, do_sample=False)
print(result[0]["generated_text"][-1]["content"])
Retrieval-Augmented Generation (RAG)
Phi-4 Mini's factual knowledge is limited by its training cutoff (June 2024) and 3.8B parameter count. For production use cases that require accurate retrieval of domain knowledge, pair it with a RAG pipeline. The model's 128K context window accommodates large retrieved chunks without truncation.
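The prompt-assembly half of a RAG pipeline can be sketched in a few lines. The example below assumes retrieval has already happened and that chunks arrive best match first. The character budget is a crude stand-in for a token budget, and the function name and instruction wording are illustrative.

```python
def build_rag_messages(question: str, chunks: list[str], max_chars: int = 8000) -> list[dict]:
    """Pack retrieved chunks into the system prompt until the budget is spent."""
    selected, used = [], 0
    for chunk in chunks:  # chunks are assumed to arrive best match first
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(selected))
    system = (
        "Answer using only the numbered context passages. "
        "If the answer is not in them, say so.\n\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list can be passed to the same `pipe(...)` call used in the summarization example.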
Common Pitfalls and Troubleshooting
Flash Attention Fails on Older GPUs
Symptom: RuntimeError: FlashAttention only supports Ampere GPUs or newer or similar.
Fix: Pass attn_implementation="eager" to from_pretrained(). This is required for the NVIDIA V100 and every other GPU generation older than Ampere (Ampere covers the A100 and the RTX 30-series).
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-4-mini-instruct",
attn_implementation="eager",
device_map="auto",
torch_dtype="auto",
trust_remote_code=True,
)
CUDA Out of Memory
Symptom: RuntimeError: CUDA out of memory. Tried to allocate X GiB.
Fixes in order of impact:
- Switch to a GGUF quantization via llama.cpp or Ollama. Q4_K_M fits in 2.5 GB VRAM.
- Enable 4-bit loading via bitsandbytes (load_in_4bit=True), which reduces HF Transformers VRAM to ~2.5 GB.
- Set OLLAMA_NUM_PARALLEL=1 if using Ollama with concurrent requests.
- Free GPU memory held by other processes: nvidia-smi to identify them, then kill <pid>.
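Before loading a model, it can help to check how much GPU memory is actually free. This sketch shells out to `nvidia-smi` with its CSV query flags and parses the result. It assumes an NVIDIA driver is installed, and the helper names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def free_mib_per_gpu() -> list[int]:
    """Run nvidia-smi and return free memory in MiB for each GPU."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [total - used for used, total in parse_gpu_memory(output)]
```

If `free_mib_per_gpu()` reports less than the model needs, pick a smaller quantization or stop the processes holding the memory.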
Ollama Version Too Old
Symptom: Model fails to pull or runs with incorrect outputs.
Fix: The phi4-mini tag requires Ollama 0.5.13 or later. Update with curl -fsSL https://ollama.com/install.sh | sh — the installer upgrades in-place.
Slow CPU Inference
Symptom: 1–3 tokens/second on CPU, making interactive use painful.
Fixes:
- Use a smaller quantization: Q3_K_M (2.12 GB) or Q2_K (1.68 GB).
- Increase CPU thread count in llama.cpp with -t $(nproc).
- For sustained CPU workloads, consider numactl --cpunodebind=0 --membind=0 to keep memory access local on multi-socket machines.
Model Repeats Itself in Long Conversations
Symptom: After 10+ turns, responses become repetitive or contradictory.
Fix: This is a known limitation of the 3.8B model size acknowledged in the official model card. Trim older messages from the context window or start a fresh session for long tasks.
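Trimming can be as simple as keeping the system prompt plus a sliding window of recent messages. The sketch below shows one such policy; the window size of 8 is an arbitrary assumption and the function name is illustrative.

```python
def trim_history(messages: list[dict], max_recent: int = 8) -> list[dict]:
    """Keep system messages plus the most recent max_recent other messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    recent = dialogue[-max_recent:] if max_recent > 0 else []
    return system + recent
```

Call it on the message list before each generation. An even `max_recent` keeps user and assistant turns paired, so the window starts on a user message.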
FAQ
How much VRAM do I need to run Phi-4 Mini?
As little as 2.5 GB for the Q4_K_M GGUF via Ollama or llama.cpp. For the full BF16 HuggingFace model, you need approximately 8 GB. If you have 4–6 GB VRAM, Q4_K_M or Q5_K_M GGUF is your sweet spot.
What is the difference between Phi-4-mini-instruct and Phi-4-mini-reasoning?
They share the same 3.8B parameter base but differ in fine-tuning. The instruct variant is a general-purpose model. The reasoning variant was trained on 30B tokens of synthetic math problems from DeepSeek-R1, raising MATH-500 from 71.8% to 94.6% — at the cost of reduced performance on non-math tasks. Use reasoning only if your workload is primarily mathematical or logical.
What is Phi-4-mini-flash-reasoning?
Released July 2025, it uses a hybrid architecture that comes within about two points of Phi-4-mini-reasoning on MATH-500 (92.45% vs 94.6%) while delivering up to 10× higher decode throughput and lower latency. The context window is 64K instead of 128K. It is currently available only via HuggingFace (not Ollama). It targets edge and mobile inference scenarios.
Can I use Phi-4 Mini commercially?
Yes. All three Phi-4 Mini variants are released under the MIT license, which permits commercial use as long as the copyright and license notice are retained.
Can I run it without a GPU?
Yes, via llama.cpp with GGUF quantization. CPU-only inference at Q4_K_M generates roughly 10–20 tokens/second on a modern 8-core CPU — functional for batch processing and developer testing, but slow for interactive chat.
How does Phi-4 Mini compare to the full Phi-4 (14B)?
Phi-4 (14B) scores higher across all benchmarks — roughly 78% MMLU vs 67.3% for Mini. The trade-off is 4–5× higher VRAM and slower inference. Phi-4 Mini is the right choice when you need to run locally on consumer hardware; Phi-4 14B is better suited for a dedicated GPU server.
Does the 128K context window work in practice?
For Ollama and HuggingFace Transformers, yes, but long contexts increase VRAM and reduce throughput significantly. Ollama also defaults to a much shorter context window, so raise num_ctx if you need long inputs. At 128K tokens, a full BF16 load requires 16–24 GB VRAM. For practical work, 8–32K token contexts are more common and much faster.
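The memory cost of long contexts comes mostly from the KV cache, which grows linearly with the number of tokens. The estimate below uses the 32 layers from the spec table. The 8 key/value heads and head dimension of 128 are assumptions about the model configuration, so check them against the model's config.json before relying on the numbers.

```python
def kv_cache_gib(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size in GiB: one key and one value vector per layer per token."""
    # kv_heads and head_dim are assumed values; bytes_per_value=2 means BF16/FP16
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 2**30


for context in (8_192, 32_768, 131_072):
    print(f"{context:>7} tokens -> {kv_cache_gib(context):.1f} GiB of KV cache")
```

Under these assumptions the cache alone reaches 16 GiB at 128K tokens. Adding roughly 7.6 GB of BF16 weights puts the total at the upper end of the 16–24 GB range.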
Does it run on AMD GPUs?
Yes, as of ROCm 7.2 (March 2026). Install ROCm 7.2, then use Ollama or llama.cpp with the ROCm backend — no custom patches needed. vLLM also supports AMD via ROCm 7.2 as of vLLM v0.16.0 (February 2026).
References and Further Reading
- microsoft/Phi-4-mini-instruct — Official model card, specs, and benchmark scores (Hugging Face)
- microsoft/Phi-4-mini-reasoning — Reasoning variant model card (Hugging Face)
- microsoft/Phi-4-mini-flash-reasoning — Flash-reasoning variant model card (Hugging Face)
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs (arXiv 2503.01743)
- Welcome to the new Phi-4 models — Microsoft Tech Community blog post
- bartowski/microsoft_Phi-4-mini-instruct-GGUF — GGUF quantization variants with size and quality comparisons (Hugging Face)
- phi4-mini — Ollama library page with install commands
- Ollama Linux installation documentation