Run Microsoft Phi-4 Mini on Linux: Step-by-Step Guide 2026

Last updated April 2026 — refreshed for current model/tool versions.

Microsoft's Phi-4 Mini is a 3.8-billion-parameter language model that punches well above its weight class: it matches Llama 3.1 8B on MMLU, beats it on MATH (64% vs 52%), and runs comfortably on a consumer GPU with just 4 GB of VRAM. This guide covers every practical path to running it on Linux (Ollama, HuggingFace Transformers, llama.cpp with GGUF, vLLM, and LM Studio), including the 2025/2026 reasoning variants that didn't exist when the original post was written.

What changed since early 2025

Phi-4-mini-reasoning (April 2025): A math-specialist fine-tune that raises AIME accuracy from 10.0 to 57.5 and MATH-500 from 71.8 to 94.6. It is a separate model, not an update to the base instruct model.
Phi-4-mini-flash-reasoning (July 2025): Hybrid architecture that delivers up to 10× higher decode throughput vs. the reasoning variant while scoring 92.45% on MATH-500. It runs on 64K context instead of 128K.
Ollama minimum version: The phi4-mini Ollama tag now requires Ollama 0.5.13 or later. Older installs will fail silently.
Flash attention default changed: The HuggingFace model now enables flash attention by default. NVIDIA V100 and earlier GPUs must explicitly pass attn_implementation="eager".
ROCm parity (March 2026): ROCm 7.2 achieves out-of-the-box parity with CUDA for Ollama, llama.cpp, and vLLM, so AMD GPU users no longer need custom patches.

Attribute	Value
Parameters	3.8 B
Architecture	Dense decoder-only Transformer (32 layers, LongRoPE)
Context length	128K tokens
Vocabulary	200,064 tokens (multilingual)
License	MIT (commercial use OK)
Release date	February 2025
VRAM (Q4_K_M GGUF)	~2.5 GB
VRAM (full BF16)	~8 GB
MMLU (5-shot)	67.3%
HumanEval (0-shot)	74.4%
MATH benchmark	64.0%


The Phi-4 Family in 2026: Which Model Do You Need?

The "Phi-4 Mini" name now covers three distinct models. Picking the wrong one is the most common mistake new users make.

Model	Best for	Context	MATH-500	Ollama tag
Phi-4-mini-instruct	General chat, code, instruction following	128K	71.8%	`phi4-mini`
Phi-4-mini-reasoning	Complex multi-step math and logic	128K	94.6%	`phi4-mini-reasoning`
Phi-4-mini-flash-reasoning	Math at edge/mobile speed (10× faster than reasoning)	64K	92.45%	not yet on Ollama; HF only

Unless you need heavy mathematical reasoning, start with phi4-mini (the instruct variant). If you are building on a local AI agent stack, the OpenClaw + Ollama setup guide for running local AI agents covers integrating Phi-4 Mini into a complete agent pipeline with tool use and memory.

Prerequisites and Hardware Requirements

Hardware

GPU (recommended): Any NVIDIA GPU with ≥4 GB VRAM (Turing/RTX 20-series or newer) for full-speed inference. On an RTX 4090, expect ~300 tokens/second at Q4_K_M quantization.
GPU flash attention support: Ampere or newer GPUs (for example A100, A6000, H100) use the default flash_attention_2 backend. V100 and other pre-Ampere GPUs must pass attn_implementation="eager".
AMD GPU: ROCm 7.2 (March 2026) provides full parity with CUDA for Ollama, llama.cpp, and vLLM.
CPU-only: Supported via llama.cpp. A modern 8-core CPU generates ~10–20 tokens/second at Q4_K_M — usable for batch workloads, slow for interactive chat.
RAM: Minimum 8 GB system RAM. 16 GB recommended if running the full BF16 model through HuggingFace Transformers.
Disk: ~2.5 GB for the Q4_K_M GGUF; ~8 GB for the full BF16 HuggingFace weights.

Software

Linux (Ubuntu 22.04+, Debian 12+, Fedora 39+, Arch — all work)
Python 3.10 or later
NVIDIA driver ≥ 525 and CUDA 12.x (for GPU paths)
Ollama 0.5.13 or later (for the Ollama path)

Method 1: Ollama (Fastest Path, Recommended for Most Users)

Ollama wraps model download, quantization selection, and serving into a single command. It is the fastest way to get Phi-4 Mini running interactively.

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama --version   # confirm 0.5.13 or later

After installation, Ollama runs as a systemd service on port 11434. Verify it is up:

systemctl status ollama

Pull and Run Phi-4 Mini

# Instruct model (general use) — 2.5 GB download
ollama pull phi4-mini

# Start an interactive session
ollama run phi4-mini

For the math-reasoning variant:

ollama pull phi4-mini-reasoning
ollama run phi4-mini-reasoning

Call via REST API

curl http://localhost:11434/api/chat -d '{
  "model": "phi4-mini",
  "messages": [
    {"role": "user", "content": "Write a Python function to binary search a sorted list."}
  ]
}'
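
The same call can be made from Python with only the standard library. The sketch below assumes the default Ollama address used in this guide and sets "stream": false so the server returns a single JSON object rather than a stream of chunks. The helper names build_chat_request and chat are illustrative, not part of any Ollama client library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama port used in this guide


def build_chat_request(prompt, model="phi4-mini", url=OLLAMA_URL):
    """Build (but do not send) a non-streaming chat request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of newline-delimited chunks
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="phi4-mini", timeout=120):
    """Send the request to a running Ollama server and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the Ollama service running, chat("Write a Python function to binary search a sorted list.") should return the model's answer as a string.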

Run Ollama in Docker

docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

docker exec -it ollama ollama run phi4-mini

Method 2: HuggingFace Transformers (Full Control)

Use this path when you need function calling, custom system prompts, pipeline integration, or programmatic control over generation parameters.

Install Dependencies

pip install torch==2.5.1 \
            transformers==4.49.0 \
            accelerate==1.3.0 \
            flash_attn==2.7.4.post1

If flash_attn fails to install (common on older GPUs), omit it — you will pass attn_implementation="eager" instead.

Basic Inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-4-mini-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # spreads across available GPUs/CPU
    torch_dtype="auto",         # bfloat16 on modern GPUs
    trust_remote_code=True,
    # attn_implementation="eager"  # uncomment for V100 or older GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain what LongRoPE positional encoding does."},
]

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe(messages, max_new_tokens=500, temperature=0.0, do_sample=False)
print(output[0]["generated_text"][-1]["content"])

Function Calling (New in Phi-4 Mini)

Phi-4 Mini adds native function-calling support that Phi-3.5 Mini lacked. Use the <|tool|> block in the system prompt:

system = """You are a helpful assistant with some tools.<|tool|>[
  {
    "name": "get_weather",
    "description": "Return current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "City name"}
      },
      "required": ["city"]
    }
  }
]<|/tool|>"""

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "What is the weather in Berlin?"},
]
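
Hand-writing the tool list inside a string makes it easy to mistype a delimiter. As a rough sketch, the system prompt can instead be generated from ordinary Python dictionaries, assuming the tool list sits between <|tool|> and <|/tool|> markers. The helper name build_tool_system_prompt is hypothetical.

```python
import json


def build_tool_system_prompt(instruction, tools):
    """Embed a JSON tool list between the <|tool|> and <|/tool|> markers."""
    return f"{instruction}<|tool|>{json.dumps(tools)}<|/tool|>"


# Same tool definition as in the hand-written prompt, expressed as a dict.
get_weather = {
    "name": "get_weather",
    "description": "Return current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

messages = [
    {
        "role": "system",
        "content": build_tool_system_prompt(
            "You are a helpful assistant with some tools.", [get_weather]
        ),
    },
    {"role": "user", "content": "What is the weather in Berlin?"},
]
```

Serializing with json.dumps keeps the tool schema valid JSON even as more tools are added to the list.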

4-Bit Quantization via bitsandbytes (Low-VRAM GPU)

# Requires: pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in FP16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)

Method 3: llama.cpp with GGUF (CPU-Friendly, Maximum Portability)

llama.cpp is ideal when you want CPU-only inference, the smallest possible binary, or maximum compatibility across hardware.

Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Option A: GPU build (NVIDIA CUDA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

# Option B: CPU-only build (run instead of Option A; both write to ./build)
cmake -B build
cmake --build build --config Release -j $(nproc)

Download a GGUF Quantization

The Q4_K_M quantization is the best default for most users: 2.49 GB, minimal quality loss. For very low VRAM, Q3_K_M at 2.12 GB is the next step down.

pip install -U "huggingface_hub[cli]"

# Download Q4_K_M (recommended default)
huggingface-cli download bartowski/microsoft_Phi-4-mini-instruct-GGUF \
  --include "microsoft_Phi-4-mini-instruct-Q4_K_M.gguf" \
  --local-dir ./phi4-mini-gguf/

Quantization Guide

Quantization	Size	Quality	Recommended when
Q8_0	4.08 GB	Near-lossless	You have ≥6 GB VRAM and want maximum fidelity
Q6_K	3.16 GB	Very high	5 GB VRAM, near-perfect quality
Q5_K_M	2.85 GB	High	4–5 GB VRAM, excellent quality/size tradeoff
Q4_K_M	2.49 GB	Good (default)	Most users — 4 GB VRAM or 8 GB RAM (CPU)
Q3_K_M	2.12 GB	Medium-low	Very tight RAM; expect some degradation
Q2_K	1.68 GB	Low but usable	Emergency RAM constraint only
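
The table can be turned into a small lookup that picks the largest quantization fitting a memory budget. This is only a sketch: the sizes are copied from the table, while the 1 GB headroom for KV cache and runtime buffers is an assumed default, not a measured figure, and pick_quant is a hypothetical helper name.

```python
# File sizes in GB copied from the quantization table, largest first.
QUANT_SIZES_GB = [
    ("Q8_0", 4.08),
    ("Q6_K", 3.16),
    ("Q5_K_M", 2.85),
    ("Q4_K_M", 2.49),
    ("Q3_K_M", 2.12),
    ("Q2_K", 1.68),
]


def pick_quant(memory_gb, headroom_gb=1.0):
    """Return the largest quantization whose file fits in memory_gb minus headroom_gb.

    headroom_gb is an ASSUMED allowance for KV cache and runtime buffers;
    tune it for your context length and backend.
    """
    budget = memory_gb - headroom_gb
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb <= budget:
            return name
    return None  # nothing fits in the given budget


for vram in (4, 5, 6):
    print(f"{vram} GB -> {pick_quant(vram)}")
```

Longer contexts need more headroom, so raising headroom_gb pushes the choice toward smaller files.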

Run Inference

# Interactive chat (CPU)
./build/bin/llama-cli \
  -m ./phi4-mini-gguf/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf \
  -p "<|system|>You are a helpful AI assistant.<|end|><|user|>Explain quantization.<|end|><|assistant|>" \
  -n 512

# GPU-accelerated (offload all layers to CUDA)
./build/bin/llama-cli \
  -m ./phi4-mini-gguf/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf \
  -ngl 999 \
  -p "<|system|>You are a helpful assistant.<|end|><|user|>Write a Rust function to parse JSON.<|end|><|assistant|>" \
  -n 512

Method 4: vLLM (High-Throughput Production Serving)

vLLM is the right choice when you need to serve multiple concurrent users, need OpenAI-compatible API endpoints, or need maximum throughput from a GPU server. It uses PagedAttention for efficient KV-cache management.

pip install vllm

from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-4-mini-instruct", trust_remote_code=True)
sampling_params = SamplingParams(max_tokens=500, temperature=0.0)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the main use cases for small language models?"},
]

output = llm.chat(messages=messages, sampling_params=sampling_params)
print(output[0].outputs[0].text)

Serve an OpenAI-compatible endpoint:

vllm serve microsoft/Phi-4-mini-instruct \
  --trust-remote-code \
  --port 8000
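
Once the server is up, any OpenAI-style client can talk to it. The following is a minimal standard-library sketch that assumes the server from the command above is listening on localhost:8000 and exposes the usual /v1/chat/completions route. The helper names are illustrative.

```python
import json
import urllib.request


def build_completion_request(messages, base_url="http://localhost:8000",
                             model="microsoft/Phi-4-mini-instruct", max_tokens=500):
    """Build (but do not send) a chat-completions request."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat-completions response dict."""
    return response_body["choices"][0]["message"]["content"]


def complete(messages, timeout=120, **kwargs):
    """Send the request to a running server and return the reply text."""
    request = build_completion_request(messages, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Because the request follows the OpenAI wire format, the same helper should work against other servers that expose it by changing base_url and model.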

Method 5: LM Studio (GUI, No Terminal Required)

LM Studio provides a desktop GUI for Linux (AppImage), Windows, and macOS. It downloads GGUF models directly from Hugging Face.

Download the LM Studio AppImage from lmstudio.ai.
Search for Phi-4-mini-instruct in the model catalog.
Select Q4_K_M and click Download.
Click Run to start the local server.

LM Studio exposes an OpenAI-compatible API on localhost:1234, so any tool that speaks OpenAI can point to it.

Chat Prompt Format

If you are using llama.cpp directly, the template is:

<|system|>{system message}<|end|><|user|>{user message}<|end|><|assistant|>

If you are using HuggingFace Transformers, tokenizer.apply_chat_template(messages) handles this automatically.
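
For the llama.cpp case, the template can be rendered from the same role/content message list used elsewhere in this guide. The function below is a small sketch of that string assembly (format_phi4_prompt is a hypothetical name); it covers only plain system, user, and assistant turns.

```python
def format_phi4_prompt(messages):
    """Render role/content messages into the <|role|>...<|end|> template and
    append <|assistant|> so the model continues with its reply."""
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    return "".join(parts) + "<|assistant|>"


prompt = format_phi4_prompt([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantization."},
])
print(prompt)
```

The resulting string can be passed to llama-cli with -p.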

Performance and Benchmarks

The following numbers are from the official Hugging Face model card (February 2025) and the Phi-4 technical reports. Do not treat these as absolute rankings — scores vary by prompt format and evaluation harness.

Benchmark	Phi-4-mini-instruct (3.8B)	Llama 3.2 3B	Qwen2.5-7B	GPT-4o-mini
MMLU (5-shot)	67.3%	~58%	~74%	82%
HumanEval (0-shot)	74.4%	~55%	~72%	87%
MATH	64.0%	~48%	~69%	~80%
GSM8K	88.6%	~77%	~91%	~97%
BigBench Hard	70.4%	—	—	—
ARC Challenge	83.7%	~78%	~85%	~96%

Inference speed (RTX 4090, Q4_K_M GGUF): approximately 300 tokens/second — roughly 2× faster than running a 7–8B model at similar quantization on the same hardware.

Reasoning Variants: Benchmark Comparison

Model	AIME 2024	MATH-500	GPQA Diamond
Phi-4-mini-instruct	10.0%	71.8%	36.9%
Phi-4-mini-reasoning	57.5%	94.6%	52.0%
Phi-4-mini-flash-reasoning	~52%	92.45%	—
o1-mini (reference)	63.6%	90.0%	60.0%

How to Choose Your Deployment Method

Just want to try it, interactive chat: Ollama (ollama run phi4-mini).
Need programmatic control, function calling, or pipeline integration: HuggingFace Transformers.
Running on CPU only or targeting very low VRAM: llama.cpp with Q4_K_M GGUF.
Serving multiple concurrent users in production: vLLM with OpenAI-compatible endpoint.
No terminal experience, GUI preferred: LM Studio (AppImage on Linux).
Need heavy math/reasoning: Swap phi4-mini for phi4-mini-reasoning in any of the above paths.

Real-World Use Cases

IDE Code Completion via Continue.dev

Phi-4 Mini is well-suited as a local code-completion backend for the Continue.dev VS Code extension. In your ~/.continue/config.json:

{
  "models": [{
    "title": "Phi-4 Mini (Local)",
    "provider": "ollama",
    "model": "phi4-mini",
    "apiBase": "http://localhost:11434"
  }]
}

Text Summarization

messages = [
    {"role": "system", "content": "Summarize the following text in two sentences."},
    {"role": "user", "content": "Artificial intelligence is transforming..."},
]
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(messages, max_new_tokens=100, temperature=0.0, do_sample=False)
print(result[0]["generated_text"][-1]["content"])

Retrieval-Augmented Generation (RAG)

Phi-4 Mini's factual knowledge is limited by its training cutoff (June 2024) and 3.8B parameter count. For production use cases that require accurate retrieval of domain knowledge, pair it with a RAG pipeline. The model's 128K context window accommodates large retrieved chunks without truncation.
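
As a rough illustration of the prompt-assembly half of such a pipeline, the sketch below packs already-retrieved text chunks into a system message until a character budget is used up. The retrieval step itself is out of scope, and both build_rag_messages and the 8000-character default are assumptions made for the example.

```python
def build_rag_messages(question, chunks, max_context_chars=8000):
    """Pack retrieved chunks into a system prompt until the character budget is spent.

    chunks is assumed to be sorted most-relevant first.
    """
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += len(chunk)

    # Number the chunks so the model can refer to them in its answer.
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(selected, start=1))
    system = (
        "Answer using only the numbered context below. "
        "If the answer is not in the context, say so.\n\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list can be passed to the Transformers pipeline or llm.chat() in the same way as the other message lists in this guide.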

Common Pitfalls and Troubleshooting

Flash Attention Fails on Older GPUs

Symptom: RuntimeError: FlashAttention only supports Ampere GPUs or newer or similar.

Fix: Pass attn_implementation="eager" to from_pretrained(). This is required for NVIDIA V100 and all generations prior to Ampere (A100/RTX 30-series).

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    attn_implementation="eager",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

CUDA Out of Memory

Symptom: RuntimeError: CUDA out of memory. Tried to allocate X GiB.

Fixes in order of impact:

Switch to a GGUF quantization via llama.cpp or Ollama. Q4_K_M fits in 2.5 GB VRAM.
Enable 4-bit loading via bitsandbytes (load_in_4bit=True) — reduces HF Transformers VRAM to ~2.5 GB.
Set OLLAMA_NUM_PARALLEL=1 if using Ollama with concurrent requests.
Free GPU memory held by other processes: nvidia-smi to identify them, then kill <pid>.

Ollama Version Too Old

Symptom: Model fails to pull or runs with incorrect outputs.

Fix: The phi4-mini tag requires Ollama 0.5.13 or later. Update with curl -fsSL https://ollama.com/install.sh | sh — the installer upgrades in-place.
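
To check the installed version in a script, compare the numbers as tuples rather than as strings (a string comparison would rank 0.5.9 above 0.5.13). This sketch extracts the first x.y.z number it finds, so it does not depend on the exact wording of the ollama --version output. The helper names are hypothetical.

```python
import re


def parse_version(text):
    """Extract the first x.y.z version number from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_output, minimum="0.5.13"):
    """True if the version found in version_output is at least minimum."""
    return parse_version(version_output) >= parse_version(minimum)
```

The input can come from subprocess.run(["ollama", "--version"], capture_output=True, text=True).stdout.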

Slow CPU Inference

Symptom: 1–3 tokens/second on CPU, making interactive use painful.

Fixes:

Use a smaller quantization: Q3_K_M (2.12 GB) or Q2_K (1.68 GB).
Increase CPU thread count in llama.cpp with -t $(nproc).
For sustained CPU workloads, consider numactl --cpunodebind=0 --membind=0 to keep memory access local on multi-socket machines.

Model Repeats Itself in Long Conversations

Symptom: After 10+ turns, responses become repetitive or contradictory.

Fix: This is a known limitation of the 3.8B model size acknowledged in the official model card. Trim older messages from the context window or start a fresh session for long tasks.
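
One simple way to trim the context is a sliding window that always keeps the system prompt and only the most recent messages. The sketch below assumes the role/content message format used throughout this guide; trim_history and the default of six messages are illustrative choices.

```python
def trim_history(messages, max_turns=6):
    """Keep all system messages plus the last max_turns user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    recent = dialogue[-max_turns:] if max_turns > 0 else []
    return system + recent
```

Calling messages = trim_history(messages) before each generation request keeps the prompt length bounded as the conversation grows.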

FAQ

How much VRAM do I need to run Phi-4 Mini?

As little as 2.5 GB for the Q4_K_M GGUF via Ollama or llama.cpp. For the full BF16 HuggingFace model, you need approximately 8 GB. If you have 4–6 GB VRAM, Q4_K_M or Q5_K_M GGUF is your sweet spot.
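
These figures follow from simple arithmetic on the parameter count. The sketch below estimates weight memory only (no KV cache or activations) and uses decimal gigabytes; the function names are hypothetical.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def effective_bits(file_gb, params_billion):
    """Average bits per weight implied by a model file of file_gb gigabytes."""
    return file_gb * 8 / params_billion


print(f"BF16 weights: {weights_gb(3.8, 16):.1f} GB")
print(f"Q4_K_M average: {effective_bits(2.49, 3.8):.2f} bits per weight")
```

The BF16 estimate of about 7.6 GB lines up with the ~8 GB quoted above, and the Q4_K_M file works out to a little over 5 bits per weight on average.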

What is the difference between Phi-4-mini-instruct and Phi-4-mini-reasoning?

They share the same 3.8B parameter base but differ in fine-tuning. The instruct variant is a general-purpose model. The reasoning variant was trained on 30B tokens of synthetic math problems generated by DeepSeek-R1, raising MATH-500 from 71.8% to 94.6% at the cost of reduced performance on non-math tasks. Use reasoning only if your workload is primarily mathematical or logical.

What is Phi-4-mini-flash-reasoning?

Released July 2025, it uses a hybrid architecture that comes close to Phi-4-mini-reasoning's math accuracy (92.45% vs 94.6% on MATH-500) while delivering up to 10× higher decode throughput and lower latency. The context window is 64K instead of 128K. It is currently available only via HuggingFace (not Ollama). It targets edge and mobile inference scenarios.

Can I use Phi-4 Mini commercially?

Yes. All three Phi-4 Mini variants are released under the MIT license, which permits unrestricted commercial use.

Can I run it without a GPU?

Yes, via llama.cpp with GGUF quantization. CPU-only inference at Q4_K_M generates roughly 10–20 tokens/second on a modern 8-core CPU — functional for batch processing and developer testing, but slow for interactive chat.

How does Phi-4 Mini compare to the full Phi-4 (14B)?

Phi-4 (14B) scores higher across all benchmarks — roughly 78% MMLU vs 67.3% for Mini. The trade-off is 4–5× higher VRAM and slower inference. Phi-4 Mini is the right choice when you need to run locally on consumer hardware; Phi-4 14B is better suited for a dedicated GPU server.

Does the 128K context window work in practice?

For Ollama and HuggingFace Transformers, yes — but long contexts increase VRAM and reduce throughput significantly. At 128K tokens, a full BF16 load requires 16–24 GB VRAM. For practical work, 8–32K token contexts are more common and much faster.
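
The growth in VRAM comes mostly from the key/value cache, which scales linearly with context length. The estimate below is a back-of-the-envelope sketch: the 32 layers come from the spec table in this guide, while 8 KV heads and a head dimension of 128 are assumptions that should be checked against the model's config.json.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Size of the key/value cache for one sequence.

    layers=32 is from the spec table in this guide. kv_heads=8 and
    head_dim=128 are ASSUMED values; confirm them in config.json.
    bytes_per_value=2 corresponds to BF16/FP16 cache entries.
    """
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / GIB:.1f} GiB of KV cache")
```

Under these assumptions a full 128K-token sequence needs about 16 GiB of cache on top of the BF16 weights, while an 8K context needs about 1 GiB, which is why shorter contexts are so much cheaper to serve.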

Does it run on AMD GPUs?

Yes, as of ROCm 7.2 (March 2026). Install ROCm 7.2, then use Ollama or llama.cpp with the ROCm backend — no custom patches needed. vLLM also supports AMD via ROCm 7.2 as of vLLM v0.16.0 (February 2026).