How to Run DeepSeek-R1-0528 Locally: Ollama, vLLM, LM Studio & MLX Guide (2026)

Last updated April 2026 — refreshed for current model/tool versions.

What changed in 2026 (read this first if you visited before):

  • DeepSeek's model lineup has advanced significantly. R1-0528 remains valid for its release (May 2025), but the ecosystem has moved forward: DeepSeek-V3.1 (August 2025), V3.2 (December 2025), and now DeepSeek-V4 / V4-Pro (April 24, 2026) are the current frontier models. If you need the latest model rather than R1-0528 specifically, see "The Bigger Picture" section below.
  • Ollama now natively lists DeepSeek-V3.1, V3.2, V4-Flash, and V4-Pro in its library. The deepseek-r1:8b and deepseek-r1:671b tags were silently updated to point to 0528 weights. Pull again to refresh.
  • LM Studio and Jan both gained native DeepSeek support with one-click model download; no manual GGUF merging is needed for most sizes.
  • Apple Silicon users can now run the 8B distilled model at 40–182 tokens/second via MLX (M1 through M4 Max), reportedly 2–3× faster than GGUF/Ollama on the same hardware.
  • System prompt is now supported in R1-0528 (it was not in the original R1). The forced <think>\n prefix is no longer required.
  • API model names are changing: deepseek-reasoner and deepseek-chat are deprecated as of April 2026 and will be removed July 24, 2026. Use deepseek-r1-0528 or the V4 family going forward.

DeepSeek-R1-0528 is a 685-billion-parameter open-source reasoning model released May 28, 2025. It is the most capable version of the R1 series and can run on self-hosted hardware with no cloud subscription. This guide covers every local installation path: Ollama (easiest), LM Studio (best GUI), vLLM (production throughput), Hugging Face Transformers (maximum control), and Apple Silicon via MLX, plus a llama.cpp invocation in the FAQ. It also includes realistic hardware requirements, benchmark numbers from official sources, a quantization selector, and a troubleshooting section drawn from community experience.


TL;DR — Quick-Start Table

| Goal | Best tool | Minimum RAM/VRAM | Command to start |
| --- | --- | --- | --- |
| Try it now, low friction | Ollama + 8B distilled | 8 GB RAM | ollama run deepseek-r1:8b |
| GUI with model browser | LM Studio | 8 GB RAM | Search "deepseek-r1-0528" in LM Studio |
| Full 671B model (consumer) | Ollama + Q2_K_XL GGUF | 180 GB RAM + 24 GB GPU | See full-model section below |
| Production API server | vLLM | Multi-GPU node sized to the weights (roughly 700 GB in FP8 for the full model); a single RTX 3090/4090 for the distills | vllm serve deepseek-ai/DeepSeek-R1-0528 |
| Apple Silicon (M3/M4) | MLX or Ollama | 16 GB unified memory | ollama run deepseek-r1:8b |

DeepSeek-R1-0528: What It Is

DeepSeek-R1-0528 is a reasoning-focused LLM built on the DeepSeek-V3 architecture. The "0528" suffix is a date stamp — May 28, 2025 — marking a significant update to the original R1 weights. Key traits:

  • 685B parameters on the Hugging Face model card: 671B for the main model plus roughly 14B for the Multi-Token Prediction (MTP) module, which is why both figures appear in official docs. Architecture: Mixture-of-Experts (MoE) with BF16/FP8 tensor types.
  • 128K context window, 64K max output tokens.
  • MIT License — commercial use, fine-tuning, and distillation are all permitted.
  • Chain-of-thought reasoning baked into the model via reinforcement learning, not prompted. The model exposes its reasoning in <think>...</think> blocks before the final answer.
  • Improved features vs original R1: system prompt support, function calling, JSON output mode, reduced hallucination rate, no forced <think> prefix.

The Bigger Picture: DeepSeek Since R1-0528

R1-0528 was released in May 2025. The DeepSeek team has shipped several major updates since:

| Model | Released | Key advance |
| --- | --- | --- |
| DeepSeek-R1-0528 | May 2025 | This post's subject: enhanced reasoning R1 |
| DeepSeek-V3.1 | August 2025 | Hybrid thinking/non-thinking, improved efficiency |
| DeepSeek-V3.2 | December 2025 | Further performance gains on coding & math |
| DeepSeek-V4-Flash | April 24, 2026 | 284B total / 13B active, 1M context, MoE |
| DeepSeek-V4-Pro | April 24, 2026 | 1.6T total / 49B active, frontier reasoning, 1M context |

R1-0528 is the right model if you need to reproduce May 2025 results or run a cost comparison against that baseline. For a new project starting today, evaluate V3.2 or V4-Flash; both are available in Ollama's library.


Performance Benchmarks (Official Numbers)

These figures are from the official DeepSeek-R1-0528 Hugging Face model card and the DeepSeek API changelog. Do not compare these to later models without sourcing separately.

| Benchmark | Original R1 | R1-0528 | Improvement |
| --- | --- | --- | --- |
| AIME 2024 | 79.8% | 91.4% | +11.6 pp |
| AIME 2025 | 70.0% | 87.5% | +17.5 pp |
| HMMT 2025 | 41.7% | 79.4% | +37.7 pp |
| LiveCodeBench | 63.5% | 73.3% | +9.8 pp |
| Codeforces (Div1 rating) | 1530 | 1930 | +400 pts |
| MMLU-Redux | 92.9% | 93.4% | +0.5 pp |
| GPQA-Diamond | 71.5% | 81.0% | +9.5 pp |
| Tau-bench (Airline/Retail) | | 53.5 / 63.9 | New metric (function calling) |

The HMMT 2025 gain (+37.7 pp) and AIME 2025 gain (+17.5 pp) are the standout improvements. The model achieves this by reasoning more deeply — average token usage per AIME question increased from ~12K to ~23K tokens. That means answers are slower but substantially more accurate on hard math and logic tasks.


Hardware Requirements

The 685B full model is a serious machine learning workload. Quantized distilled variants are the practical choice for most developers. Here is a concrete breakdown:

| Variant | Params | GGUF size (Q4_K_M unless noted) | Minimum RAM/VRAM | Expected speed |
| --- | --- | --- | --- | --- |
| 1.5B distilled | 1.5B | ~1.1 GB | 4 GB | 50–100 tok/s on CPU |
| 7B distilled (Qwen) | 7B | ~4.7 GB | 8 GB | 30–60 tok/s on RTX 3060 |
| 8B distilled (Qwen3) | 8B | ~5.2 GB | 8 GB | 40–182 tok/s on Apple Silicon via MLX (M1 to M4 Max) |
| 14B distilled | 14B | ~9.0 GB | 16 GB | 20–40 tok/s on RTX 4090 |
| 32B distilled | 32B | ~20 GB | 24 GB VRAM | 10–25 tok/s on RTX 4090 |
| 70B distilled (Llama) | 70B | ~43 GB | 48 GB VRAM | 5–15 tok/s on 2× A6000 |
| 671B full (Q2_K_XL) | 671B | ~217 GB | 180 GB RAM + 24 GB GPU | ~3 tok/s hybrid CPU/GPU |
| 671B full (Q4_K_M) | 671B | ~400 GB | 512 GB unified (M3 Ultra) | ~15 tok/s on M3 Ultra |

Rule of thumb: For general-purpose development and coding assistance, the 14B or 32B distilled models hit the best quality-to-speed ratio on a typical developer workstation. Save the full 671B for evaluation work or production inference clusters.
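
As a rough cross-check on the table above, weight memory is approximately parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: the function name is hypothetical, and real GGUF files mix precisions and need extra room for the KV cache and activations.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead, so treat the
    result as a lower bound on the memory you actually need.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Example inputs (bits per weight are assumptions, not measured values):
#   estimate_weight_gb(8, 16)   -> 8B model stored in BF16
#   estimate_weight_gb(671, 8)  -> 671B model at 8 bits per weight
```

An 8B model in BF16 comes out at 16 GB, which is why the distilled variants are the practical choice on a single consumer GPU.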

Choosing a Quantization Level (Full 671B Model)

If you are running the full model locally, Unsloth provides the most complete GGUF quantization suite. Their Dynamic 2.0 method selectively quantizes MoE layers more aggressively than attention layers, preserving accuracy:

| Quant | Size | Best for |
| --- | --- | --- |
| UD-IQ1_S / UD-IQ1_M | 162–200 GB | Single 24GB GPU + large CPU RAM (≥180 GB) |
| UD-IQ2_XXS / UD-Q2_K_XL | 217–251 GB | Recommended balance: accuracy vs size for CPU inference |
| Q4_K_M | ~400 GB | High-end workstations, Mac Studio M3/M4 Ultra |
| Q8_0 | ~713 GB | Multi-GPU server clusters only |

Download from unsloth/DeepSeek-R1-0528-GGUF on Hugging Face.
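
The quantization table can be turned into a small selector. This is a minimal sketch under stated assumptions: it uses the upper end of each size range from the table as a conservative size, treats RAM plus VRAM as one pool, and ignores KV cache and OS overhead. The function name pick_quant is hypothetical.

```python
# (tier name, conservative size in GB) from the table above, smallest first.
# Using the top of each range is an assumption made to stay on the safe side.
QUANT_TIERS = [
    ("UD-IQ1_S / UD-IQ1_M", 200),
    ("UD-IQ2_XXS / UD-Q2_K_XL", 251),
    ("Q4_K_M", 400),
    ("Q8_0", 713),
]


def pick_quant(ram_gb: float, vram_gb: float = 0.0):
    """Return the largest tier that fits in RAM + VRAM, or None if none fits."""
    budget = ram_gb + vram_gb
    best = None
    for name, size_gb in QUANT_TIERS:
        if size_gb <= budget:
            best = name  # tiers are sorted, so later matches are larger
    return best
```

With 180 GB of RAM and a 24 GB GPU this returns the 1-bit tier; a 512 GB machine gets Q4_K_M. Leave extra headroom in practice, since long contexts add a sizeable KV cache on top of the weights.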


Method 1: Ollama (Fastest Setup)

Ollama is the fastest path from zero to a running model. It handles download, quantization selection, and serving in a single command. If you are exploring local AI agents more broadly, the OpenClaw + Ollama setup guide for running local AI agents covers the full agentic workflow on top of Ollama's API.

Install Ollama

Linux / macOS:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com. Ollama runs as a background service after installation.

Verify your GPU is detected:

nvidia-smi   # NVIDIA
rocm-smi     # AMD ROCm

Pull and run a model:

# Pull and run in one step
ollama run deepseek-r1:8b

# Or pull first, run later
ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b

Sizes available directly from Ollama's library. Only the 8b and 671b tags carry 0528 weights; the other sizes are distills of the original R1:

ollama run deepseek-r1:1.5b   # 1.1 GB — CPU-only capable
ollama run deepseek-r1:7b     # 4.7 GB — 8 GB RAM minimum
ollama run deepseek-r1:8b     # 5.2 GB — default, Qwen3-based
ollama run deepseek-r1:14b    # 9.0 GB — strong reasoning
ollama run deepseek-r1:32b    # 20 GB  — near-full quality
ollama run deepseek-r1:70b    # 43 GB  — requires 48+ GB VRAM
ollama run deepseek-r1:671b   # 404 GB — full model

Run the Full 671B Model via Unsloth GGUF

For the full model on a consumer machine using CPU offloading, use the Unsloth quantized GGUF via the Hugging Face integration:

# Q2 quantization — fits in ~217 GB RAM (with GPU offloading)
ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:UD-Q2_K_XL

# Q4 quantization — better quality, needs ~400 GB
ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q4_K_M

# 8B Qwen3-based distill — great quality, 8 GB minimum
ollama run hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL

Use Ollama's OpenAI-Compatible API

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:8b",
    "messages": [{"role": "user", "content": "Explain binary search trees."}],
    "temperature": 0.6
  }'
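
The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper names build_chat_request and ask are hypothetical, and it assumes an Ollama server is already listening on the default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "deepseek-r1:8b") -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    # OpenAI-compatible responses put the text under choices[0].message.content.
    return reply["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(ask("Explain binary search trees."))
```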

Method 2: LM Studio (Best GUI Experience)

LM Studio provides a desktop app for Windows, macOS, and Linux with a built-in model browser, chat interface, and local API server. It is the recommended option for developers who prefer a visual workflow.

  1. Download LM Studio from lmstudio.ai.
  2. Open the Discover tab and search deepseek-r1-0528.
  3. Select a quantization matching your RAM (Q4_K_M for balance; Q2_K for large RAM / lower VRAM).
  4. Click Download. LM Studio handles sharded downloads automatically.
  5. Load the model, then switch to the Chat tab or enable the Local Server on port 1234 for API access.

Apple Silicon note: LM Studio uses its own MLX-accelerated engine on M-series Macs. The M3 Ultra Mac Studio achieved 15.74 tokens/second on the Q4_K_M 685B model via LM Studio in independent testing. The M4 Max achieves 182 tokens/second on the 8B model with MLX.


Method 3: vLLM (Production Inference Server)

vLLM is the standard for high-throughput production deployments. It supports continuous batching, tensor parallelism across multiple GPUs, and an OpenAI-compatible API. Use it when you need to serve multiple concurrent users or need the highest throughput.

Requirements

  • Python 3.10+
  • CUDA 12.4+ (NVIDIA GPU required)
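  • For the 671B full model: a multi-GPU node whose combined VRAM exceeds the weights, which are roughly 700 GB in FP8 (for example 8× H200 141 GB), plus headroom for the KV cache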
  • For distilled 8B/14B/32B: a single RTX 3090/4090 or A100 is sufficient

Install and Serve

# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install vllm

# Serve the 8B distilled model (single GPU)
vllm serve deepseek-ai/DeepSeek-R1-0528-Qwen3-8B \
  --tokenizer deepseek-ai/DeepSeek-R1-0528-Qwen3-8B \
  --tensor-parallel-size 1

# Serve the full 671B model (multi-GPU, tensor parallel)
# Set --tensor-parallel-size to the number of GPUs in the node
vllm serve deepseek-ai/DeepSeek-R1-0528 \
  --tokenizer deepseek-ai/DeepSeek-R1-0528 \
  --tensor-parallel-size 8 \
  --max-model-len 32768

Query the vLLM Server

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    messages=[{"role": "user", "content": "Explain transformer attention in 3 sentences."}],
    temperature=0.6,
    max_tokens=4096
)
print(response.choices[0].message.content)

The vLLM server exposes an OpenAI-compatible endpoint. Any existing OpenAI SDK client works by changing base_url — no other code changes required.
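
To make the base_url swap concrete, here is a small sketch that maps the local servers in this guide to their endpoints. The ports come from this guide; the /v1 path for LM Studio and the make_client helper are assumptions for illustration.

```python
# Local OpenAI-compatible endpoints mentioned in this guide.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",  # /v1 path assumed
    "vllm": "http://localhost:8000/v1",
}


def make_client(backend: str):
    """Return an OpenAI SDK client pointed at the chosen local server."""
    # Imported lazily so the mapping above is usable without the SDK installed.
    import openai

    # The SDK requires a non-empty key even when the local server does not check it.
    return openai.OpenAI(base_url=BASE_URLS[backend], api_key="not-needed")


# Usage (requires `pip install openai` and a running server):
#   client = make_client("vllm")
#   client.chat.completions.create(model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B", messages=[...])
```

Keeping the endpoint in one mapping means application code can move between a laptop running Ollama and a vLLM server by changing a single string.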


Method 4: Hugging Face Transformers (Maximum Control)

Use Transformers when you need programmatic access to model internals, custom sampling, or integration with a fine-tuning pipeline. This method downloads the BF16 weights (~1.34 TB for the full model), so it is practical only for the 8B distilled variant on most machines.

Install

pip install torch transformers accelerate

Load and Run (8B Distilled)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # distributes across available GPUs/CPU
)

messages = [
    {"role": "user", "content": "Write a Python function to merge two sorted lists."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
    do_sample=True
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

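Memory note: device_map="auto" with torch_dtype=torch.bfloat16 will use approximately 16 GB VRAM for the 8B model (8B parameters × 2 bytes each), before activations and KV cache. To reduce this to ~6 GB, install the bitsandbytes library and pass quantization_config=BitsAndBytesConfig(load_in_4bit=True) to from_pretrained.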


Method 5: Apple Silicon via MLX

On Apple Silicon, Apple's MLX framework is designed around unified memory, and community reports put it at 2–3× the speed of GGUF (Ollama/llama.cpp) on the same machine. For the 8B model on a MacBook with 16 GB RAM, MLX is the best choice.

pip install mlx-lm

# Run the 8B distilled model
mlx_lm.generate \
  --model mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit \
  --prompt "Explain recursive descent parsing." \
  --max-tokens 1024 \
  --temp 0.6

Performance on tested hardware (from community and vendor reports):

  • M4 Max (128 GB): 182 tokens/second on 8B model
  • M3 Ultra (512 GB): ~15 tokens/second on full Q4_K_M 685B model
  • M3 (16–24 GB): 54 tokens/second on 8B model
  • M1 (16 GB): 40 tokens/second on 8B model

If you prefer the Ollama workflow on a Mac, ollama run deepseek-r1:8b works fine — it will automatically use the Metal backend on Apple Silicon. MLX is the option when you need the absolute highest throughput.


Prompt Engineering for R1-0528

R1-0528 uses a chain-of-thought format. The model outputs a <think> block with its reasoning before the final answer. Configure these settings for best results:

System Prompt

System prompt support is new in 0528 (not available in original R1). Use English or Chinese:

You are DeepSeek-R1, created by DeepSeek.
Today is {current_date}.

The original post contained a Chinese-language system prompt with Chinese punctuation (fullwidth period 。). That is valid, since the model is bilingual, but for English-language deployments use the English version above to avoid confusion.
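
Filling in the {current_date} placeholder at request time might look like the sketch below. The ISO date format and the helper names are assumptions; the template text is the English system prompt shown above.

```python
from datetime import date

SYSTEM_TEMPLATE = "You are DeepSeek-R1, created by DeepSeek.\nToday is {current_date}."


def build_system_prompt(today: date) -> str:
    """Fill the template with a date; the ISO format used here is an assumption."""
    return SYSTEM_TEMPLATE.format(current_date=today.isoformat())


def build_messages(question: str, today: date) -> list:
    """Return a chat message list with the system prompt first."""
    return [
        {"role": "system", "content": build_system_prompt(today)},
        {"role": "user", "content": question},
    ]


# Usage:
#   messages = build_messages("Explain binary search trees.", date.today())
```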

Recommended Settings

temperature: 0.6       # Lower = less repetition; do not use default 1.0
top_p: 0.95
max_new_tokens: 64000  # Maximum output length (64K tokens)
context_size: 128000   # Full context window (128K tokens)

File Upload Template

file_template = """[file name]: {file_name}
[file content begin]
{file_content}
[file content end]
{question}"""
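
A short sketch of how the file template above could be filled in before sending the prompt. The helper name build_file_prompt and the sample file are hypothetical.

```python
FILE_TEMPLATE = """[file name]: {file_name}
[file content begin]
{file_content}
[file content end]
{question}"""


def build_file_prompt(file_name: str, file_content: str, question: str) -> str:
    """Wrap a file's text and the user's question in the upload template."""
    # Values are passed as arguments, so braces inside file_content are safe.
    return FILE_TEMPLATE.format(
        file_name=file_name,
        file_content=file_content,
        question=question,
    )


# Usage:
#   prompt = build_file_prompt("notes.txt", open("notes.txt").read(), "Summarize this file.")
```

The result goes in the user message content; remember that the file text counts against the context window.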

Web Search Template

search_answer_en_template = """# The following contents are the search results related to the user's message:
{search_results}

# The user's message is:
{question}"""

How to Choose: Decision Tree

  • I have < 16 GB RAM total → Use the 1.5B or 7B distilled model with Ollama. The full 671B model is not feasible.
  • I have 16–32 GB RAM, no discrete GPU → Run the 8B or 14B distilled model with Ollama or LM Studio. On Apple Silicon, prefer MLX for speed.
  • I have a 24 GB VRAM GPU (RTX 3090/4090) → Run the 32B distilled model in vLLM or Ollama. For the full 671B: use Q2_K_XL GGUF with CPU offloading (3–5 tok/s).
  • I have a multi-GPU server whose combined VRAM exceeds the full model's weights (roughly 700 GB in FP8, for example 8× H200) → Use vLLM with tensor parallelism on the full 671B model for production-grade throughput.
  • I want a GUI, not command line → LM Studio. Runs on Windows, macOS, Linux.
  • I need an OpenAI-compatible API server locally → Ollama (port 11434) or vLLM (port 8000). Both work with any OpenAI SDK client.
  • I need to fine-tune or modify inference internals → Hugging Face Transformers.
  • I'm building something new and don't need R1-0528 specifically → Consider DeepSeek-V3.2 or V4-Flash; both are newer and available in Ollama.


Common Pitfalls and Troubleshooting

Out of Memory Errors

  • Symptom: CUDA out of memory or the process is killed by the OS.
  • Fix (GPU): Switch to a lower quantization (e.g., Q4 → Q2) or use a smaller distilled model.
  • Fix (CPU): Reduce --ctx-size / max-model-len. A 128K context window reserves a large KV cache — drop to 8K–32K if you don't need full context.
  • Fix (vLLM): Reduce --max-model-len first. --gpu-memory-utilization already defaults to 0.90; lower it (for example to 0.85) only if other processes need room on the same GPU.

Slow Inference (Full 671B Model)

  • CPU-only inference on the full model runs at ~0.5–1 token/second. This is expected — the 671B model is not practical for interactive use without significant GPU offloading or multiple high-VRAM GPUs.
  • Use --n-gpu-layers 99 in llama.cpp to request offloading every layer to the GPU; if the model then fails to load, lower the number until it fits in VRAM. Even a single 8 GB GPU will meaningfully speed up a CPU-heavy run.
  • On a machine with mixed RAM + VRAM, aim for the Q2_K_XL quant: 217 GB total, with the GPU handling top layers.

Dependency and Compatibility Errors

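  • CUDA version mismatch: Verify with nvcc --version and nvidia-smi. The vLLM setup above assumes CUDA 12.4+, and the driver shown by nvidia-smi must support the CUDA release your PyTorch and vLLM wheels were built for. Install a matching CUDA toolkit from developer.nvidia.com/cuda-downloads.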
  • Transformers version: The DeepSeek-R1-0528-Qwen3-8B model requires transformers ≥ 4.51.0. Run pip install --upgrade transformers.
  • Ollama model not updating: Run ollama pull deepseek-r1:8b explicitly; the :latest tag will not re-pull if a cached version exists.

Model Outputs Raw <think> Tags

This is expected behavior — R1-0528 uses chain-of-thought reasoning that is visible in the raw output. Ollama and LM Studio both strip or collapse these tags in their UIs. If you are consuming the API directly, parse the output: the content between <think> and </think> is the reasoning trace; the final answer follows.
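
One way to separate the reasoning trace from the final answer when consuming the API directly is shown below. It is a minimal sketch that assumes both the opening and closing tags are present in the output; the function name split_reasoning is hypothetical.

```python
import re

# DOTALL lets the reasoning span multiple lines; the match is non-greedy.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer). Reasoning is empty if no <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Usage:
#   reasoning, answer = split_reasoning(response.choices[0].message.content)
```

Logging the reasoning separately and showing only the answer to end users keeps long traces out of the interface while preserving them for debugging.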

Chinese Punctuation in System Prompt

The original version of this post included a Chinese-language system prompt with a fullwidth period (。, U+3002). This is linguistically correct for Chinese but caused copy-paste issues for English-only users. Use the English system prompt shown above unless your deployment is specifically Chinese-language.

Function Calling Not Working

Function calling is new in R1-0528 (not available in original R1). Ensure you are using the 0528 weights, not earlier R1 weights. In Ollama, run ollama list to confirm the model version. In vLLM, confirm the model path resolves to deepseek-ai/DeepSeek-R1-0528 or deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.
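
For reference, a function-calling request to an OpenAI-compatible endpoint carries a tools array alongside the messages. The sketch below only builds the request body. The get_weather tool is hypothetical, and whether the server returns tool calls depends on its configuration and chat template, which this example does not verify.

```python
# Hypothetical tool definition for illustration; get_weather is not a real API.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question: str, model: str = "deepseek-ai/DeepSeek-R1-0528") -> dict:
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "temperature": 0.6,
    }
```

If the response never contains a tool call for a prompt that clearly needs one, check the model version first, as described above.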


FAQ

Can I run DeepSeek-R1-0528 on a MacBook Air with 8 GB RAM?

Yes, with the 1.5B or 7B distilled model. The 8B Qwen3-based model requires 8–10 GB and will be very tight on an 8 GB machine. Use Ollama (ollama run deepseek-r1:7b) or LM Studio and set a short context window (4K tokens). The full 671B model is not feasible on this hardware.

What is the difference between DeepSeek-R1-0528 and DeepSeek-R1-0528-Qwen3-8B?

DeepSeek-R1-0528 is the full 685-billion-parameter model. DeepSeek-R1-0528-Qwen3-8B is a distilled version — a much smaller model trained to mimic the full model's reasoning behavior. The 8B model is practical on consumer hardware; the full model requires a small cluster or a Mac with 400+ GB unified memory.

Is DeepSeek-R1-0528 still the best reasoning model I can run locally?

As of April 2026, newer models exist: DeepSeek-V3.2 and DeepSeek-V4-Flash are both locally deployable via Ollama and offer improved capabilities over R1-0528 on many tasks. R1-0528 remains relevant for research reproducibility and for users who specifically need the original R1 reasoning training methodology.

Does DeepSeek-R1-0528 support function calling and JSON mode?

Yes — these are new capabilities added in the 0528 update that were not present in the original R1. The official function calling benchmark score is 53.5% (Airline) and 63.9% (Retail) on Tau-bench.

How long does a typical response take?

For a response of one to a few thousand tokens, expect roughly 20–60 seconds on the 8B model with a modern GPU (30–60 tok/s), 1–4 minutes on the full 671B model with the M3 Ultra (~15 tok/s), and 10+ minutes on CPU-only with the 671B Q2 quant. Hard math prompts take far longer because the model reasons deeply before answering: R1-0528 averages about 23K tokens per AIME question (up from about 12K for the original R1), which is roughly 25 minutes at 15 tok/s.
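
Response time is simply generated tokens divided by throughput, so it is easy to estimate for your own hardware. The sketch below uses the token counts and throughput figures quoted in this guide; the function name is hypothetical and prompt processing time is ignored.

```python
def response_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Estimated generation time in seconds, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / tokens_per_second


# Example inputs from this guide:
#   response_seconds(23_000, 15)  -> hard AIME-style question on an M3 Ultra
#   response_seconds(1_800, 30)   -> short answer on the 8B model
```

A 23K-token reasoning trace at 15 tok/s works out to about 1,533 seconds, or roughly 25 minutes, which is why long reasoning on the full model is rarely interactive.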

Can I use the DeepSeek API instead of running locally?

Yes. The DeepSeek API at platform.deepseek.com exposes an OpenAI-compatible endpoint. As of April 2026, the legacy deepseek-reasoner model name (which mapped to R1-0528) is deprecated in favor of the explicit model ID. Use deepseek-r1-0528 as the model name. The V4 family (deepseek-v4-flash, deepseek-v4-pro) is also available via API.

What happened to llama.cpp support?

llama.cpp fully supports DeepSeek-R1-0528 via the GGUF format provided by Unsloth. The chat tokens in the prompt use fullwidth vertical bars (｜, U+FF5C), not ASCII pipes. The basic invocation is:

./llama-cli -m DeepSeek-R1-0528-Q4_K_M.gguf \
  -p "<｜User｜>Your question here<｜Assistant｜>" \
  --temp 0.6 --top-p 0.95 -n 2048 --n-gpu-layers 99

What are the model's known limitations?

From the official model card and community testing: the model is not suitable for safety-critical applications without additional guardrails; long reasoning traces consume substantial context; and performance on very long-context tasks (>64K tokens in a single generation) degrades compared to models natively optimized for that range. For 1M-token context, see DeepSeek-V4.


References and Further Reading

  1. DeepSeek-R1-0528 — Official Model Card (Hugging Face)
  2. Unsloth DeepSeek-R1-0528-GGUF — All quantization variants
  3. DeepSeek API Changelog — Official release notes
  4. DeepSeek-R1 on Ollama — Available sizes and tags
  5. Unsloth: How to Run DeepSeek-R1-0528 Locally — Installation guide
  6. MacStories: Testing DeepSeek R1-0528 on M3 Ultra — Performance benchmarks
  7. DeepSeek-R1 Paper (arXiv 2501.12948) — Reinforcement learning methodology
  8. DeepSeek-R1-0528-Qwen3-8B — Official 8B distilled model card