Run DeepSeek-Prover-V2 7B on Linux: Complete Setup Guide

Last updated April 2026 — refreshed for current model/tool versions and 2025 ecosystem benchmarks.

DeepSeek Prover V2 7B is the most capable open-source formal theorem-proving model at the 7B parameter scale, purpose-built for generating verified proofs in Lean 4. Released in April 2025, it remains the reference deployment target for researchers who need a self-hosted prover that can actually pass peer-review-grade mathematical benchmarks. This guide covers everything you need to run it on Linux in 2026: hardware selection, CUDA setup, HuggingFace inference, GGUF quantized deployment with llama.cpp, and vLLM serving for production workloads.

What changed in 2025–2026 that readers of older guides need to know:New competitors at the 7B scale: Leanabell-Prover-V2-7B (July 2025) and Goedel-Prover-V2-8B (August 2025) now outperform DeepSeek Prover V2 7B at MiniF2F pass@32/128. Goedel-Prover-V2-32B currently holds state-of-the-art at 90.4% on MiniF2F in self-correction mode. DeepSeek Prover V2 7B remains the easiest to self-host with the widest llama.cpp/transformers ecosystem support.CUDA 12.4 + PyTorch 2.6 is the recommended stack in 2026. The old guide pinned CUDA 11.8 / PyTorch with --index-url .../cu118. That index is still available but lags current driver support. Use cu124 commands below.NVIDIA driver 560+ is required for CUDA 12.4 on Ubuntu 22.04 and 24.04. Ubuntu 24.04 has become the recommended OS (EOL April 2029 vs. 22.04's April 2027).GGUF quantized versions exist at multiple bit-depths (Mungert/DeepSeek-Prover-V2-7B-GGUF), making the model runnable on a 16 GB consumer GPU at Q8 or an 8 GB GPU at Q4_K_M — a major change from the 32 GB VRAM minimum cited in 2025 guides.The 88.9% MiniF2F benchmark applies to the 671B model, not the 7B. The 7B model scores lower (community-reported ~76–78% at high sample budgets). Many older guides conflated the two.

Want the full picture? Read our continuously-updated Self-Hosting LLMs Complete Guide (2026) — hardware, ollama and vllm, cost-per-token, and when to self-host.

TL;DR

Question	Answer
Model repo	`deepseek-ai/DeepSeek-Prover-V2-7B` on Hugging Face
Context window	32K tokens
Target language	Lean 4 formal proofs
Minimum GPU (full BF16)	~16 GB VRAM (e.g., RTX 3080 Ti, A4000)
Minimum GPU (Q8 GGUF)	~10 GB VRAM (e.g., RTX 3080, 6700 XT)
Minimum GPU (Q4_K_M GGUF)	~5 GB VRAM — CPU fallback possible
Recommended OS	Ubuntu 22.04 LTS or 24.04 LTS
Recommended Python	3.11 or 3.12
Primary inference library	HuggingFace Transformers 4.51+ / llama.cpp (GGUF)
Best open alternative	Goedel-Prover-V2-8B (stronger benchmarks, similar VRAM)

What Is DeepSeek Prover V2 7B?

DeepSeek Prover V2 is a family of language models released by DeepSeek AI on April 30, 2025, specialized for formal mathematical reasoning in Lean 4. It comes in two sizes: 7B (built on DeepSeek-Prover-V1.5-Base, 32K context) and 671B (built on DeepSeek-V3-Base). This guide covers the 7B variant — the practical deployment target for teams without access to multi-GPU clusters.

Unlike general-purpose LLMs that can describe proofs informally, Prover V2 generates syntactically valid Lean 4 proof terms that can be checked by the Lean 4 kernel. It uses a training pipeline where DeepSeek-V3 decomposes complex theorems into subgoals, the 7B model proves each subgoal independently, and the chain-of-thought plus formal proof fragments are combined into training data via reinforcement learning.

Key Features

Generates complete Lean 4 proof scripts, not pseudocode
32K token context window supports long proofs and mathematical background
Open weights (Model License Agreement — commercial use check required)
Available in safetensors (BF16) and GGUF quantized formats
47,000+ downloads per month on Hugging Face as of April 2026

Benchmark Performance

The following numbers are drawn from the official paper (arXiv:2504.21801) and subsequent community evaluations. All MiniF2F results use pass@N notation — the probability at least one correct proof is found in N samples.

Model	MiniF2F pass@32	MiniF2F pass@8192	PutnamBench	Notes
DeepSeek Prover V2 671B (CoT)	82.4%	88.9%	49/658	State-of-art at release (Apr 2025)
DeepSeek Prover V2 7B	~76% (est.)	—	—	7B distilled variant
Goedel-Prover-V2 8B	84.6%	—	—	Outperforms 671B at pass@32 (Aug 2025)
Goedel-Prover-V2 32B	88.1% / 90.4%*	—	86/658	*self-correction mode; current SOTA
Leanabell-Prover-V2 7B	78.2% (pass@128)	—	—	Verifier-integrated RL (Jul 2025)
DeepSeek Prover V1.5	63.5%	—	—	Previous generation

Key takeaway: The 88.9% figure widely cited for DeepSeek Prover V2 refers to the 671B model at an extreme sample budget (pass@8192 — meaning 8,192 independent inference runs). The 7B model you can realistically run on a workstation performs in the 76–78% range at moderate budgets. That is still an enormous leap over V1.5's 63.5%, and better than any open model at the 7B scale as of mid-2025. As of August 2025, Goedel-Prover-V2-8B has overtaken it in benchmarks but has less deployment ecosystem support.

System Requirements and Prerequisites

Hardware

Config	GPU	VRAM	RAM	Notes
Full BF16 (transformers)	RTX A4000, RTX 3090, RTX 4080	16–24 GB	32 GB	Best quality, requires NVIDIA GPU
Q8 GGUF (llama.cpp)	RTX 3080, RTX 4070	10–12 GB	16 GB	Near-lossless, good throughput
Q4_K_M GGUF (llama.cpp)	RTX 3060, RTX 4060	5–6 GB	16 GB	Modest quality trade-off
CPU-only (no GPU)	None required	—	32+ GB	Very slow (~1–3 tok/s); for testing only

Operating System: Ubuntu 22.04 LTS or 24.04 LTS recommended. Other Debian-based distributions work; RHEL 8/9 is also supported by NVIDIA's CUDA packages.
CUDA: 12.1 minimum; 12.4 recommended (paired with driver 550+)
Python: 3.11 or 3.12 (3.13 not yet fully tested with all ML libraries as of early 2026)
Storage: The 7B model in safetensors BF16 is approximately 14–15 GB. GGUF Q8 is ~7.5 GB. Allocate at least 30 GB of free space including virtual environment.

Setting Up Your Linux Environment

Step 1: Update Your System and Install Core Tools

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip python3-venv git build-essential curl wget

If your Ubuntu default Python is older than 3.11, add the deadsnakes PPA:

sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
sudo apt install -y python3.11 python3.11-venv python3.11-dev

Step 2: Install NVIDIA Drivers and CUDA 12.4

The old guide recommended nvidia-driver-525 and CUDA 11.8. Both are still available but are no longer the recommended path. Driver 560+ is required for CUDA 12.4 and provides better performance on Ada Lovelace and Hopper architectures.

Check your current driver:

nvidia-smi

If you need to upgrade or install fresh on Ubuntu 22.04/24.04:

# Install kernel headers first
sudo apt install -y linux-headers-$(uname -r)

# Install driver (choose the highest available version)
sudo apt install -y nvidia-driver-560

# Reboot to load the new driver
sudo reboot

After reboot, install the CUDA Toolkit 12.4 from NVIDIA's official repository (replace ubuntu2204 with ubuntu2404 if on 24.04):

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-4

Add CUDA to your shell path (add to ~/.bashrc or ~/.zshrc):

export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH

Verify the installation:

source ~/.bashrc
nvcc --version
nvidia-smi

Step 3: Create a Python Virtual Environment

python3.11 -m venv ~/deepseek-prover-env
source ~/deepseek-prover-env/bin/activate
pip install --upgrade pip setuptools wheel

Step 4: Install PyTorch with CUDA 12.4 Support

Use the CUDA 12.4 index URL for PyTorch 2.6.x (the current stable release as of early 2026):

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
  --index-url https://download.pytorch.org/whl/cu124

Verify CUDA is accessible from PyTorch:

python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
# Expected: True 12.4

Step 5: Install Python Packages

pip install \
  transformers>=4.51.0 \
  accelerate>=0.30.0 \
  bitsandbytes>=0.43.0 \
  einops \
  huggingface_hub

Package version notes for 2026:

transformers 4.51+ is required; earlier versions may not correctly handle the model's chat template or BF16 safetensors loading.
bitsandbytes 0.43+ supports 4-bit and 8-bit quantized loading (load_in_4bit / load_in_8bit), which lets you run the model on GPUs with less VRAM.
accelerate 0.30+ is required for device_map="auto" multi-GPU and CPU offload.

Downloading and Running DeepSeek Prover V2 7B

Step 6: Download and Run Inference

The model is hosted at deepseek-ai/DeepSeek-Prover-V2-7B on Hugging Face. The first run downloads approximately 14 GB of safetensors weights; subsequent runs load from the local cache.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "deepseek-ai/DeepSeek-Prover-V2-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # auto-distributes across available GPUs
    torch_dtype=torch.bfloat16, # BF16 for memory efficiency
    trust_remote_code=True
)

# Example: prove a simple set theory theorem in Lean 4
chat = [
    {
        "role": "user",
        "content": (
            "Prove in Lean 4 that for any two sets A and B, "
            "their intersection is a subset of A: A ∩ B ⊆ A."
        )
    }
]

inputs = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=2048,
    temperature=0.1,       # low temperature for deterministic proofs
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Important: Use tokenizer.apply_chat_template (as shown above) rather than raw string concatenation. The model was trained with a specific chat format; manually constructed prompts often produce incomplete or malformed proofs.

Step 7: Running on Lower-VRAM GPUs with bitsandbytes

If you have a GPU with 8–12 GB VRAM, use 8-bit or 4-bit quantized loading:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 8-bit loading (~8–9 GB VRAM)
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit loading (~4–5 GB VRAM, some quality loss)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model_id = "deepseek-ai/DeepSeek-Prover-V2-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,  # or bnb_config_4bit
    device_map="auto",
    trust_remote_code=True
)

Deploying with llama.cpp (GGUF — Recommended for Lower VRAM)

For researchers and practitioners who want a lightweight server without managing a Python environment, llama.cpp with GGUF quantized weights is the most practical path. If you are already running other local models via Ollama or Open WebUI, see the OpenClaw + Ollama setup guide for running local AI agents — many of the infrastructure patterns apply directly.

Step 8: Build llama.cpp with CUDA Support

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with CUDA support (requires CUDA toolkit in PATH)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Verify CUDA build
./build/bin/llama-cli --version

Note: The original guide used LLAMA_CUDA=1 make. That Makefile-based build is deprecated in recent llama.cpp releases; the CMake build above is the current recommended method. The compiled binaries land in build/bin/.

Step 9: Download a GGUF Quantized Model

The community-maintained GGUF repository for DeepSeek Prover V2 7B is Mungert/DeepSeek-Prover-V2-7B-GGUF on Hugging Face. Available quantization levels:

Quantization	File size	Min VRAM	Quality trade-off
Q8_0	~7.5 GB	~10 GB	Minimal — near-lossless
Q6_K	~5.9 GB	~8 GB	Very small
Q4_K_M	~4.4 GB	~6 GB	Moderate — recommended for <8 GB GPUs
Q3_K_M	~3.6 GB	~5 GB	Noticeable on long proofs

Download a specific quantization file:

pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id='Mungert/DeepSeek-Prover-V2-7B-GGUF',
    filename='DeepSeek-Prover-V2-7B-Q8_0.gguf',
    local_dir='./models'
)
"

Step 10: Run Inference with llama.cpp

Interactive CLI mode:

./build/bin/llama-cli \
  --model ./models/DeepSeek-Prover-V2-7B-Q8_0.gguf \
  --prompt "Prove in Lean 4 that 2 + 2 = 4." \
  --ctx-size 8192 \
  --temp 0.1 \
  --n-predict 1024

Launch a local OpenAI-compatible HTTP server (useful for integrating with editors like VS Code + Lean extension):

./build/bin/llama-server \
  --model ./models/DeepSeek-Prover-V2-7B-Q8_0.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --port 8080

The server exposes POST /v1/chat/completions compatible with the OpenAI API. You can point any OpenAI-compatible client at http://localhost:8080.

Production Serving with vLLM

For teams that need to serve the model to multiple users concurrently, vLLM provides significantly higher throughput than the HuggingFace generate loop. vLLM 0.8.x+ supports DeepSeek Prover V2 7B out of the box.

pip install vllm>=0.8.0

python3 -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-Prover-V2-7B \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --port 8000

This starts an OpenAI-compatible API server at http://localhost:8000. Use the standard OpenAI Python client to send requests:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-Prover-V2-7B",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational in Lean 4."}],
    temperature=0.1,
    max_tokens=2048
)
print(response.choices[0].message.content)

How to Choose Your Deployment Method

Use the following decision tree to find the right approach for your situation:

GPU with 16+ GB VRAM, want best output quality → HuggingFace Transformers in BF16 (Steps 3–7 above)
GPU with 8–15 GB VRAM → bitsandbytes 8-bit loading (Step 7) or GGUF Q8 via llama.cpp (Steps 8–10)
GPU with 4–8 GB VRAM → GGUF Q4_K_M via llama.cpp (Steps 8–10)
No GPU / testing only → GGUF Q4_K_M on CPU (expect 1–3 tokens/s; impractical for real proof search)
Serving to multiple users / production → vLLM (requires 16+ GB VRAM)
Integrating with a Lean 4 development environment → llama-server HTTP API (Step 10)
Need the strongest open-source theorem prover at the 7B scale → Consider Goedel-Prover-V2-8B (better MiniF2F score, similar deployment steps)

Tips for Effective Theorem Proving

Be specific about the theorem and target language. The prompt should explicitly mention "Lean 4" and state the theorem precisely in mathematical notation. Ambiguous prompts produce boilerplate or incomplete proofs.
Set temperature between 0.0 and 0.2. Formal proof generation is not a creative task — high temperature injects syntax errors. Most practitioners use 0.0 for greedy decoding or 0.1 for slight stochasticity.
Use pass@N for research workflows. Generate multiple candidate proofs (e.g., 32 samples) and validate each against the Lean 4 checker. This is how the benchmark numbers are produced and how you should think about production use.
Keep max_new_tokens generous. Non-trivial Lean 4 proofs routinely exceed 512 tokens. Set max_new_tokens=4096 or higher for complex theorems, especially if you are proving multi-step results.
Monitor GPU memory. Run nvidia-smi dmon -s mu in a second terminal during inference to catch OOM conditions early.
Implement retry logic. The model occasionally generates structurally invalid proofs that the Lean 4 kernel rejects. Wrapping inference in a retry loop (up to 5 attempts) is standard practice in automated proof pipelines.

Troubleshooting Common Issues

CUDA out of memory (OOM): Reduce context size (--ctx-size in llama.cpp or max_model_len in vLLM). Switch to a lower quantization level. If using HuggingFace, add max_memory={0: "12GiB"} to from_pretrained().
torch.cuda.is_available() returns False: Verify your CUDA install with nvcc --version and that you installed the cu124 wheel (not the default CPU-only torch). Confirm your driver with nvidia-smi — a mismatch between driver and CUDA toolkit version is the most common cause.
Model produces empty or truncated output: Check max_new_tokens — the default of 20 in some transformers versions is far too small for Lean 4 proofs. Also verify pad_token_id=tokenizer.eos_token_id is set when using do_sample=True.
llama.cpp build fails (CUDA not found): Confirm nvcc is in your PATH and the CUDA toolkit is correctly installed. The CMake flag is -DGGML_CUDA=ON, not the old LLAMA_CUDA=1.
trust_remote_code warning: The model's Hugging Face repo ships custom code. Setting trust_remote_code=True is required; review the model card and repo code if your security policy requires audit before enabling this flag.
Dependency conflicts: Always use a clean virtual environment. Installing bitsandbytes, transformers, and PyTorch into an existing environment with conflicting versions is the most common source of import errors.
SSH remote development / Jupyter: If running on a remote server, set up SSH port forwarding: ssh -L 8888:localhost:8888 user@server. For Jupyter: jupyter notebook --no-browser --port=8888.

Running on Cloud GPUs (If Local Hardware Is Insufficient)

If your local GPU does not meet the minimum requirements, cloud GPU providers offer per-hour access to appropriate hardware. Options as of 2026:

Lambda Labs: A100 40 GB instances; straightforward SSH setup. Good for short experiments.
Vast.ai: Competitive pricing on consumer GPUs; good for Q8/Q4 GGUF deployments.
RunPod: Template-based deployments; has preconfigured Jupyter environments.
GMI Cloud: Has published a specific guide for DeepSeek Prover V2 deployment.

For any cloud deployment, the steps in this guide apply identically after you have SSH access. The only difference is you typically skip the NVIDIA driver installation step (cloud instances ship with drivers pre-installed) and start from Step 3 (virtual environment setup).

If your team needs dedicated remote inference infrastructure rather than renting ad-hoc GPU time, Codersera's vetted ML engineers can help architect a self-hosted inference stack matched to your throughput and latency requirements.

Alternative Models to Consider in 2026

The formal theorem proving landscape moved quickly in 2025. Before committing to a production deployment of DeepSeek Prover V2 7B, evaluate these alternatives:

Model	Size	MiniF2F	When to prefer
DeepSeek Prover V2 7B	7B	~76–78% (pass@128)	Widest deployment ecosystem; easiest to set up
Leanabell-Prover-V2	7B	78.2% (pass@128)	Marginally stronger on MiniF2F; same hardware profile
Goedel-Prover-V2	8B / 32B	84.6% / 90.4%	Best open-source results; 32B needs 2× A100
DeepSeek Prover V2 671B	671B	88.9% (pass@8192)	Highest quality; multi-node GPU cluster required

FAQ

Does DeepSeek Prover V2 7B run on consumer GPUs like the RTX 4090?

Yes. The RTX 4090 has 24 GB VRAM, which comfortably fits the model in full BF16 precision (~14–15 GB). With the Q8_0 GGUF via llama.cpp, it also runs well on 16 GB cards like the RTX 3080 Ti or RTX 4080. For 8 GB cards (RTX 3070, RTX 4060 Ti), use Q4_K_M quantization.

What is the difference between the 7B and 671B model?

The 7B variant was fine-tuned from DeepSeek-Prover-V1.5-Base using synthetic data generated by the 671B model. It achieves lower benchmark scores (especially at low sample budgets) but can run on a single consumer GPU. The 671B model achieves 88.9% on MiniF2F at pass@8192 but requires a multi-GPU cluster. For most research teams, 7B is the only practical option for local inference.

Can I use this model for proofs other than Lean 4?

DeepSeek Prover V2 was specifically trained on Lean 4 proof data. It will not reliably produce valid Isabelle/HOL, Coq, or Agda proofs, though its informal mathematical reasoning can still be useful for proof sketching in any language.

How does the 88.9% MiniF2F score compare to human performance?

The MiniF2F benchmark contains 244 high-school mathematics competition problems formalized in Lean 4. A 88.9% pass rate (at 8,192 samples from the 671B model) represents state-of-the-art performance but is achieved at an enormous sample budget. Human experts can formally prove most of these problems but with significant effort. The meaningful comparison is that the model dramatically reduces the search time for proof engineers using Lean 4 as a verification layer.

Is the model free for commercial use?

DeepSeek Prover V2 is released under a custom Model License Agreement. It permits research and non-commercial use freely. Commercial use may require additional review depending on your organization's scale and use case. Check the current license on the Hugging Face model card before production deployment — license terms for open-weight models can be updated by the releasing organization.

Why does my proof look syntactically correct but fail in Lean 4?

The model generates Lean 4 syntax probabilistically — it can produce code that looks plausible but has subtle type errors or missing imports. Always run the generated proof through the Lean 4 kernel (lake build) to verify correctness. The model's 88.9% benchmark score is computed after this verification step; raw generation accuracy is lower.

What is ProverBench?

ProverBench is a new benchmark dataset introduced alongside DeepSeek Prover V2, comprising 325 formalized problems: 15 from AIME 2024–2025, plus 310 problems drawn from textbooks and educational materials spanning number theory, algebra, calculus, and analysis. The DeepSeek Prover V2 671B model solved 6 of the 15 AIME problems (DeepSeek-V3 solved 8 using informal reasoning), highlighting both the model's capability and the remaining gap on competition mathematics.

Does this work with the Lean 4 VS Code extension?

Not directly — the model outputs text, not integrated suggestions. You can use the llama-server OpenAI-compatible endpoint (Step 10 above) as a backend for custom VS Code extensions or Lean 4 tactic scripts that call an LLM API. There is no first-party plugin as of April 2026, but the llama.cpp server approach allows integration with any tool that supports OpenAI-compatible HTTP calls.

Running DeepSeek Prover V2 7B on Linux: A Complete 2026 Guide