DeepSeek

DeepSeek V4 VRAM & GPU Requirements: Pro vs Flash, Every Quantization (2026)

A single tiered table for DeepSeek V4 Pro vs Flash VRAM at every quantization (FP8, FP4+FP8, INT4, Q2) — triangulated across official and community sources.

Published 18 May 2026 • Updated 18 May 2026 • 8 min read

Quick answer. DeepSeek V4 Flash is the realistically self-hostable variant: budget ~170-175 GB total VRAM at native FP4+FP8 (2x H200 or 4x A100 80GB), or ~90-100 GB at community INT4 (4x RTX 4090). V4 Pro needs ~1 TB+ of VRAM and is a datacenter cluster job, not a workstation model.

Ask "how much VRAM does DeepSeek V4 need?" today and you get five different answers. Some pages quote 33 GB, some quote 1.2 TB, and most of them quietly conflate the Pro and Flash variants, mix base and instruct weights, or forget the KV cache entirely. This article fixes that with one tiered table, triangulated across the official Hugging Face model cards, the DeepSeek V4 launch writeup, Unsloth's quant notes, and credible deployment guides. Where sources genuinely disagree, we give a range and cite both rather than inventing a precise number.

The short version: V4 Flash (284B total, 13B active) is the only variant a serious self-hoster should target. V4 Pro (1.6T total, 49B active) is a multi-node datacenter model — closer to running GPT-4-class infrastructure than to a homelab build. If you're still deciding which variant fits your use case, our breakdown of DeepSeek V4 Pro vs Flash on performance and pricing covers the trade-offs in depth.

What are the DeepSeek V4 VRAM requirements at a glance?

This is the citable asset. Numbers are weights only in the size column; the "Min VRAM" column already adds working room for the 1M-token KV cache and runtime overhead. All figures are for the instruct checkpoints unless noted (base checkpoints are FP8-throughout and larger — see the base-model row).

DeepSeek V4 Flash (the self-hostable one)

Quantization	Approx. weights	Min VRAM (incl. KV + overhead)	Example GPU / host
BF16 (de-quantized)	~520-570 GB	~640 GB	8x H100 80GB / 8x H200 (rarely worth it)
FP8 mixed (base ckpt)	~290-295 GB	~320 GB	4x H100 80GB / 4x A100 80GB
FP4+FP8 native (instruct, recommended)	~158-160 GB	~170-175 GB	2x H200 (282GB) or 4x A100 80GB
Q6 / Q5 GGUF	~100-120 GB	~128-160 GB	2x H100 80GB / 1x H200 (tight)
INT4 / Q4 (community AWQ, ~5% quality loss)	~80 GB	~90-100 GB	4x RTX 4090 24GB (96GB) or 2x RTX 6000 Ada
Q3 GGUF	~60 GB	~80-96 GB	1x H100 80GB (very tight) / 2x RTX 6000 Ada
Q2 (extreme, quality risk)	~40 GB	~48-64 GB	2x RTX 4090 / 1x RTX 6000 Ada 48GB

DeepSeek V4 Pro (the datacenter one)

Quantization	Approx. weights	Min VRAM (incl. KV + overhead)	Example host
BF16 (de-quantized)	~3.2 TB	~3.5 TB+	Multi-node cluster only
FP8 mixed (base ckpt)	~1.6 TB	~1.7 TB+	Multi-node H200/B200 cluster
FP4+FP8 native (instruct)	~862 GB	~1.0-1.2 TB	8x H200 141GB / DGX H200 (single node) or 16x H100 multi-node
Q4 (experimental)	~430 GB	~512-640 GB	8x H100 80GB (640GB) — still cluster-class
Q2 (research only)	~216 GB	~256-320 GB	4x H100 80GB — heavy quality loss, not recommended

The single most important takeaway from these two tables: even Q4 V4 Pro (~430 GB) does not fit on an 8x H100 80GB box (640 GB) once you add the KV cache and overhead headroom for 1M context, and the native instruct checkpoint (~862 GB) does not fit at all. Pro is cluster-only at every realistic precision.

How should you read this table without getting burned?

Three caveats decide whether your real deployment matches these numbers:

"Active parameters" does not reduce VRAM. V4 Flash activates only 13B of its 284B parameters per token, but all expert weights have to be resident in GPU memory (or on a fast offload tier) for the router to reach any of them. Active params cut compute and latency, not footprint. This is the single most common source of the wildly-optimistic "33 GB" claims floating around — they're computing 13B x 2 bytes and ignoring the other 271B parameters.
The KV cache is small on V4 — but not zero. V4's hybrid Compressed Sparse Attention is genuinely efficient: DeepSeek reports V4 uses roughly 2% of the KV cache a standard 8-head GQA model would need, and Flash's 1M-token cache lands around ~10 GB. That is why a 158 GB weight set turns into a ~170-175 GB total budget rather than the 300 GB+ you'd expect from older long-context models. Shorter contexts shrink this further.
Base vs instruct matters. The instruct checkpoints ship in native FP4+FP8 mixed precision (FP4 for MoE experts, FP8 for attention/norm/router). The base checkpoints are FP8 throughout and therefore roughly 2x larger. Quote the instruct numbers unless you specifically need the base model for fine-tuning.

If a source gives you a single precise VRAM figure with no mention of which variant, which checkpoint, or what context length, treat it as unreliable. The honest answer is always a range tied to those three variables.

What is the cheapest realistic setup for DeepSeek V4 Flash?

There are three credible tiers for actually running Flash, from cheapest to most robust:

Community INT4 / AWQ, ~90-100 GB total — cheapest viable. A 4x RTX 4090 24GB box (96 GB aggregate) or 2x RTX 6000 Ada can hold an INT4 build. Expect roughly 5% quality degradation versus native FP4+FP8, concentrated on reasoning-heavy tasks. This is not the official recipe — it's a community AWQ/INT4 quant — but it is the realistic entry point for self-hosters who own consumer GPUs. Once Flash is serving, you can point a self-hosted AI coding agent at it for a fully local Cline/Continue workflow.
Native FP4+FP8 on 2x H200 — the recommended target. ~170-175 GB total fits comfortably on 2x H200 (282 GB) or 4x A100 80GB (320 GB). This is the configuration most deployment guides converge on as the "production Flash" sweet spot. On cloud, a 2x H200 pod runs in the rough neighborhood of single-digit dollars per hour on providers like RunPod; an 8x H100 instance with comfortable headroom is materially more expensive (think tens of dollars per hour on-demand). For a wider comparison of providers and rates, see our rundown of the best cloud GPUs for large language models.
Official vLLM recipe, 4x H200 — production-grade. DeepSeek's and the ecosystem's reference deployment for Flash is a 4x H200 node under vLLM with MoE expert parallelism (our production benchmark of vLLM vs Ollama vs LM Studio explains why vLLM is the serving runtime that scales here). Use this when you need full 1M context plus serving concurrency, not just a dev box.

One hardware gotcha worth flagging: as of May 2026, multiple deployment notes report that RTX Pro 6000 Blackwell cards hit a vLLM Inductor compile-path crash on load with V4. The card has the memory on paper, but the software path is not reliable yet — verify current vLLM/driver status before you build a Blackwell box specifically for this model.

For a single-GPU dev/test box, a lone H200 (141 GB) can host a quantized Flash build for experimentation, but it is tight and not a production answer. There is no consumer single-card configuration for native Flash — the smallest honest single-GPU story is a heavily-quantized GGUF on an H100/H200, with measurable quality loss. Once your hardware is sized, our full DeepSeek V4 Flash local setup guide walks the actual install and serving steps end to end.

Companion guide

For the full picture on architecture, benchmarks, pricing, and where V4 fits versus other models, see our DeepSeek V4 Complete Guide (2026).

Why is DeepSeek V4 Pro cluster-only?

V4 Pro is a 1.6T-total-parameter MoE. Even in its most compact native FP4+FP8 instruct form the weights are ~862 GB on disk. Three structural reasons make it a cluster job:

It doesn't fit on a single 8x80GB node. 8x H100 80GB gives you 640 GB of VRAM. The 862 GB native checkpoint overflows that before you load a single KV-cache token. You need 8x H200 141GB (~1,128 GB, single node) or a multi-node H100/H200 setup with InfiniBand.
Quantizing Pro doesn't rescue it. A hypothetical Q4 Pro is still ~430 GB of weights, and once you add 1M-context KV cache plus overhead you're back above what 8x H100 can hold. There is no quant level that turns Pro into a workstation model — only into a slightly smaller cluster model with degraded quality.
The experts demand parallelism. Serving a 1.6T MoE at usable throughput means expert-parallel sharding across many GPUs with high-bandwidth interconnect (NVLink within a node, InfiniBand across nodes). This is operationally a distributed-systems problem, not a "pip install vllm" problem.

If you need Pro-tier quality and don't run a datacenter, the honest recommendation is to use the hosted DeepSeek API or a GPU-cloud endpoint rather than self-hosting. Self-hosting only makes economic and operational sense at Flash scale.

How did we triangulate these numbers?

No single source has a clean, complete, trustworthy table — which is exactly why fragmented and contradictory numbers are everywhere. We cross-checked:

Official primary sources: the DeepSeek V4 Hugging Face model cards (deepseek-ai/DeepSeek-V4-Pro, deepseek-ai/DeepSeek-V4-Flash) and the official DeepSeek V4 launch writeup — for parameter counts (1.6T/49B Pro, 284B/13B Flash), precision format (FP4 experts + FP8 rest for instruct, FP8 throughout for base), and KV-cache efficiency claims (~2% of standard GQA; Flash ~7%, Pro ~10% of V3.2 at 1M tokens).
Quant ecosystem: Unsloth's V4-Flash GGUF notes (Q4_K_M positioned as the quality sweet spot fitting ~1x80GB or 2x48GB) and community INT4/AWQ builds reporting ~80 GB and ~5% quality loss.
Deployment guides: independent self-hosting writeups converging on ~158-160 GB Flash weights / ~170-175 GB total on 2x H200, the 4x H200 official vLLM recipe, the Blackwell vLLM crash caveat, and the "Pro doesn't fit 8x H100" finding.

Where deployment guides disagreed (for example, exact Flash weight size is variously cited as ~158 GB or ~159.6 GB, and "safe" VRAM headroom varies by guide), we present a range rather than a false-precision single number. Vendor-reported KV-cache efficiency figures are labeled as DeepSeek's own claims, not independently benchmarked here.

FAQ

Can I run DeepSeek V4 on a single RTX 4090?

No. A single RTX 4090 has 24 GB of VRAM. The smallest realistic V4 Flash build (extreme Q2) needs ~48-64 GB, and even that comes with serious quality loss. A practical INT4 Flash build needs roughly 4x RTX 4090. There is no single-consumer-GPU configuration for any usable V4 variant.

How much VRAM does DeepSeek V4 Flash need at native FP4+FP8?

Plan for ~170-175 GB of total VRAM: roughly 158-160 GB of FP4+FP8 weights, about 10 GB for the full 1M-token KV cache, and a few GB of runtime overhead. That fits on 2x H200 or 4x A100 80GB.

Why do different sites give such different VRAM numbers for DeepSeek V4?

Three reasons: they conflate Pro (1.6T) with Flash (284B); they confuse the FP8 base checkpoint with the smaller FP4+FP8 instruct checkpoint; and they cite the 13B active parameter count as if it were the memory footprint. All weights must be resident regardless of how few activate per token.

Does the 1M-token context blow up VRAM on DeepSeek V4?

Far less than on older long-context models. V4's hybrid Compressed Sparse Attention keeps the KV cache to roughly 2% of what standard grouped-query attention would need. For Flash, the full 1M-token cache is around 10 GB — small relative to the ~158 GB weights.

Is DeepSeek V4 Pro self-hostable at all?

Only on datacenter clusters. The native instruct checkpoint is ~862 GB and needs ~1 TB+ of VRAM — an 8x H200 single node or multi-node H100/H200 with InfiniBand. It does not fit on an 8x H100 80GB box even at Q4. For most teams the hosted API is the right answer.

What's the cheapest credible way to self-host V4 Flash?

A community INT4/AWQ build on 4x RTX 4090 (~90-100 GB total) is the cheapest viable path, with ~5% quality loss on reasoning. The recommended production setup is native FP4+FP8 on 2x H200; the official vLLM reference is a 4x H200 node.

Should I avoid RTX Pro 6000 Blackwell for V4 right now?

As of May 2026, multiple deployment notes report a vLLM Inductor compile-path crash on load with V4 on RTX Pro 6000 Blackwell. The memory is sufficient on paper, but verify the current vLLM and driver status before committing to a Blackwell build for this specific model.

If you're hiring vetted remote developers experienced with self-hosted LLM infrastructure — vLLM tuning, MoE expert parallelism, multi-GPU and multi-node serving, and the kind of capacity planning this article walks through — codersera.com/hire can match you with engineers who have shipped exactly this. It's a fast way to lower the hiring risk on a notoriously specialized skill set.