DeepSeek V4 Flash on 4x RTX Pro 6000 Blackwell: Setup, Benchmarks, and Cost-Per-Token (2026)
Quick answer. A 4x RTX Pro 6000 Blackwell rig (384 GB GDDR7, ~$34k all-in) fits DeepSeek V4 Flash at native FP4+FP8 (~170 GB weights) with ~210 GB headroom for KV cache and concurrency. Modeled aggregate decode is ~25-40 tok/s on SGLang at TP=4; vLLM on SM120 still hits a DeepGEMM compile-path crash as of May 2026, so SGLang is the practical path. Cost-per-1M-output-tokens at ~60% utilization is roughly $0.32-$0.55 amortized (capex + power) versus DeepSeek's API at $0.28/1M output. Self-hosting only pays back at sustained high volume, privacy mandates, or fixed-cost predictability.
Someone searching for "DeepSeek V4 Flash benchmarks on 4x RTX Pro 6000" is sizing a rig, not browsing. They have a $30k-$40k budget, a use case that can't go through the DeepSeek API (compliance, latency, vendor lock-in, or volume math), and they need to know two things: will the model fit, and what throughput will they actually see.
This guide answers both, with explicit labels on what is measured (community-reported, sourced) versus modeled (extrapolated from measured 8x numbers and architecture). Nothing here is fabricated; where we cite a tok/s number we cite the source. Where we extrapolate we say so.
Why someone would build a 4x RTX Pro 6000 rig for V4 Flash
Four reasons keep coming up in customer conversations:
- Data residency and compliance. SOC 2, HIPAA, EU AI Act, government workloads. DeepSeek's API terminates in mainland China; for many regulated buyers that is a non-starter regardless of price.
- Predictable fixed cost. A rig amortizes. API spend scales linearly with usage. At certain volumes the crossover is real; the back half of this post shows where.
- Latency floor. A local rig with prefill caching and warm KV state can deliver TTFT (time-to-first-token) in the 100-200 ms range on short prompts. API round-trips add 200-600 ms minimum, more from outside Asia.
- Workload mix. If V4 Flash is one of three models in a stack (alongside an embeddings model, a re-ranker, a smaller chat model), a four-GPU box can host them all and still leave budget for batch jobs at night.
If none of those apply and you are doing under ~5M output tokens/month, the DeepSeek API is almost certainly cheaper. We are honest about that.
The math on capacity: does V4 Flash fit on 4x RTX Pro 6000?
DeepSeek V4 Flash is a 284B-parameter Mixture-of-Experts model with 13B active parameters per token, released April 2026 under the MIT license. The model card describes a hybrid attention architecture (Compressed Sparse Attention plus Heavily Compressed Attention) that uses ~27% of the per-token FLOPs of V3.2 and ~10% of the KV cache. (Source: deepseek-ai/DeepSeek-V4-Flash on Hugging Face.)
The weight file ships natively as FP4 + FP8 mixed precision: MoE expert tensors in FP4, attention and router in FP8. The on-disk footprint is ~158 GB. With activation buffers, the runtime weight residency is ~170-175 GB. (Source: Hugging Face model card and our own DeepSeek V4 VRAM & GPU Requirements.)
A single RTX Pro 6000 Blackwell carries 96 GB of GDDR7 ECC. Four cards pool to 384 GB across PCIe. The headroom math:
| Component | VRAM |
|---|---|
| Total across 4x RTX Pro 6000 | 384 GB |
| V4 Flash weights (FP4+FP8 native) | ~170 GB |
| CUDA runtime + framework overhead | ~12 GB |
| Headroom for KV cache + activations | ~202 GB |
With V4 Flash's compressed attention (~10% of V3.2's KV footprint per token), 200 GB of headroom translates to a very large concurrent-request envelope: roughly 200-400 simultaneous users at 32K context, depending on prefill caching hit rate. The model fits comfortably; the question is throughput, not capacity.
Hardware spec comparison
How does the Pro 6000 Blackwell stack against the production-tier alternatives that V4 Flash usually targets?
| Spec | RTX Pro 6000 Blackwell | H100 SXM 80GB | H200 SXM 141GB |
|---|---|---|---|
| VRAM | 96 GB GDDR7 ECC | 80 GB HBM3 | 141 GB HBM3e |
| Bandwidth | 1.79 TB/s | 3.35 TB/s | 4.8 TB/s |
| FP4 dense (TOPS) | 2,000 | n/a | n/a |
| FP4 sparse (TOPS) | 4,000 | n/a | n/a |
| FP8 throughput | 2nd-gen Transformer Engine | 3,958 TFLOPS | 3,958 TFLOPS |
| Interconnect | PCIe Gen 5 x16 (~128 GB/s) | NVLink 4 (900 GB/s) | NVLink 4 (900 GB/s) |
| TDP | 600 W | 700 W | 700 W |
| Form factor | 2-slot workstation/server | SXM5 only | SXM5 only |
| Price (single card, mid-2026) | ~$8,500-$9,200 | ~$30,000+ | ~$35,000+ |
Source for Pro 6000 specs: NVIDIA RTX Pro 6000 Blackwell Workstation Edition; pricing per Thunder Compute's May 2026 pricing tracker.
The single most important row in that table is the interconnect. The RTX Pro 6000 Blackwell does not support NVLink. NVIDIA removed it from the Pro line at Blackwell. All multi-GPU communication runs over PCIe Gen 5 x16, capping bidirectional bandwidth at roughly 128 GB/s per pair, versus 900 GB/s on H100 SXM NVLink.
For tensor parallelism on a model the size of V4 Flash, this matters. CloudRift's October 2025 benchmarking found 8x RTX Pro 6000 hitting roughly one-third the aggregate throughput of 8x H100 SXM on TP=8 workloads, specifically because of the PCIe ceiling on all-reduce traffic during decode. (Cited in PulsedMedia's RTX Pro 6000 wiki.)
On 4x, the bottleneck is less severe than 8x (fewer all-reduce hops), but the relative cost of inter-GPU sync is still ~7x higher than NVLink. We factor this into the throughput estimates below.
Setup walkthrough
The honest version of this section is that as of May 2026, vLLM has a known SM120 compile-path crash with V4-class models. The official vLLM release (0.20.2 at the time of writing) does not enable CUDA graphs for SM120 by default, and the DeepGEMM kernel hits an "Unknown SF Transformation" error on load. Without CUDA graphs the throughput collapses to ~5 tok/s.
There are two practical paths today: a custom vLLM fork, or SGLang's official Blackwell Docker image. SGLang is by far the lower-effort path and is the one we recommend for production work right now.
Recommended path: SGLang Docker image
SGLang shipped Day-0 support for V4 with native CSA + HCA attention backends, FP4 MoE kernels, and MTP speculative decoding. (Source: LMSYS blog.) The Blackwell-specific image is published as lmsysorg/sglang:deepseek-v4-blackwell.
# Pull and run with all 4 GPUs visible
docker run --gpus all \
--shm-size=64g \
--ipc=host \
-v /mnt/models/deepseek-v4-flash:/workspace/model:ro \
-e CUDA_VISIBLE_DEVICES=0,1,2,3 \
-p 8000:8000 \
lmsysorg/sglang:deepseek-v4-blackwell \
python3 -m sglang.launch_server \
--model-path /workspace/model \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8_e4m3 \
--attention-backend compressed \
--max-total-tokens 524288 \
--port 8000Key flags explained:
--tensor-parallel-size 4— split the model across all four cards. Expert parallelism (--ep-size 4) is also possible for MoE-heavy workloads, but TP is the safer default at this GPU count.--kv-cache-dtype fp8_e4m3— quantizes the KV cache to FP8, doubling the concurrent-request envelope with minimal quality impact.--attention-backend compressed— engages SGLang's CSA/HCA kernels rather than the default flash-attn path (which is what crashes on SM120).--max-total-tokens 524288— caps total tokens-in-flight (sum across all concurrent requests). Tune up if your traffic profile is short-prompt-heavy, down if you serve long-context.
First boot pulls the model shards and JIT-compiles kernels for SM120; budget ~15-20 minutes for the first start. Subsequent restarts are ~90 seconds with cache warm.
Alternate path: custom vLLM fork (for the determined)
If you must use vLLM (because the rest of your stack is already vLLM-shaped), jasl's ds4-sm120-preview branch on GitHub has the necessary patches. From the discussion on the model card:
git clone https://github.com/jasl/vllm.git
cd vllm
git checkout -b ds4-sm120 jasl/ds4-sm120
export DEEPGEMM_SRC_DIR=/path/to/DeepGEMM
MAX_JOBS=64 pip install --no-build-isolation -e . --verbose
# Launch with all the SM120-specific env vars
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3 \
VLLM_TRITON_MLA_SPARSE=1 \
VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256 \
VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=128 \
VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 \
vllm serve /mnt/models/deepseek-v4-flash \
--trust-remote-code \
--kv-cache-dtype fp8 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.93 \
--max-model-len 262144This works but is brittle. The fork is not synced with upstream releases, and any kernel update on the DeepGEMM side may break it. Unless you have a strong reason to stay on vLLM, use SGLang.
System requirements around the GPUs
A 4x RTX Pro 6000 build needs more than just the cards:
- Power. 4 x 600 W = 2,400 W just for the GPUs. Plan a 3,000-3,500 W PSU (or dual PSUs) and a 30A 240V circuit. Most US 15A 120V circuits will trip under sustained inference load.
- Cooling. Each card is a 2-slot blower or open-fan design. A standard 4U server chassis with 80mm high-pressure fans is the right shape. Expect 60-75 dB at full load; this is not a desk machine.
- PCIe lanes. Each card wants PCIe Gen 5 x16. A single Epyc 9xx4/9xx5 or Threadripper Pro CPU provides 128 lanes, enough for 4 cards at x16 plus NVMe storage and a 100 GbE NIC.
- Storage. Model shards are ~160 GB. Plan for 2 TB NVMe minimum (working set, KV cache spill, multiple model versions).
- RAM. 512 GB DDR5 ECC is a sane default. SGLang spills KV cache to host when GPU pressure rises; under-provisioning here causes throughput cliffs.
Realistic all-in build cost: GPUs ~$34,000 + chassis/CPU/RAM/storage/PSU/networking ~$8,000-$12,000 = ~$42,000-$46,000. Some builders source used Pro 6000 cards or grey-market imports for $1k-$2k less per card; quality varies.
Benchmark methodology: what to actually measure
Single-stream tok/s is a misleading benchmark for production inference. The numbers that matter are:
- Time to first token (TTFT) at typical prompt length. Measures user-perceived latency. Target: under 500 ms for chat workloads at 2K-prompt input.
- Steady-state decode throughput per request. Tokens-per-second a single user sees once generation starts. 30+ tok/s feels fluent; under 15 tok/s feels slow.
- Aggregate throughput at saturation. Total tok/s across all concurrent users at the operating point. This is the number that drives cost-per-token.
- P95/P99 latency under load. Tail latency matters more than mean for chat UX.
- KV-cache hit rate on prefill (SGLang's RadixAttention can reuse cached prefixes). High hit rates dramatically improve effective throughput for repeated system prompts or agent loops.
A reasonable benchmark harness: SGLang's built-in bench_serving.py, or vLLM's equivalent. Run with concurrency 1, 4, 16, 64, 256 across short (256-token) and long (16K-token) prompts. Plot throughput vs concurrency to find the knee.
Expected throughput on 4x RTX Pro 6000
The honest disclosure first: we are not aware of a fully measured 4x RTX Pro 6000 + DeepSeek V4 Flash benchmark in the public record as of May 2026. The closest reference points are:
- Measured. 8x RTX Pro 6000 Blackwell on SGLang with the deepseek-v4-blackwell image: 40-50 tok/s chat decode (single stream), 1,400-2,000 tok/s prefill at 16K-32K context. (Source: HuggingFace discussion linked above; reproduced by multiple practitioners on /r/LocalLLaMA.)
- Measured. Same setup, EAGLE speculative decoding: 45.6 tok/s with 2 draft tokens.
- Measured. 70B dense models at Q4 on a single RTX Pro 6000: 34 tok/s single-stream, ~1,031 tok/s batched at 64-way concurrency. (PulsedMedia wiki.)
- Measured. 30B MoE on a single RTX Pro 6000: 252 tok/s single-stream, ~8,425 tok/s batched. (Same source.)
From these data points we can model 4x:
| Workload | Estimate (4x, TP=4) | Confidence |
|---|---|---|
| Single-stream decode, 2K prompt | 25-35 tok/s | Medium |
| Single-stream decode, 32K prompt | 15-25 tok/s | Medium |
| Prefill throughput, 16K prompt | 800-1,400 tok/s | Medium-low |
| Aggregate decode, 64 concurrent | 400-700 tok/s | Low (untested) |
| Aggregate decode, 256 concurrent | 1,200-2,000 tok/s | Low (untested) |
Assumptions behind the model: single-stream is roughly bandwidth-bound on decode, so 4 cards at TP=4 split the active 13B-parameter footprint and gain less than 4x over a single card (PCIe sync tax). Aggregate throughput scales near-linearly with concurrency up to ~32-64 simultaneous requests, then flattens as compute saturates. The 8x SGLang measurements suggest ~40-50 tok/s per-stream — on 4x you lose ~30-40% to less efficient TP sharding, putting per-stream in the 25-35 tok/s band.
If you build this rig, please publish your numbers. The 4x-RTX-Pro-6000 + V4 Flash benchmark gap in the public record is real, and the community needs the data points.
Cost-per-token analysis: when does self-hosting pay back?
The break-even math, with all assumptions visible:
Capex. $44,000 all-in build, amortized over 3 years (36 months) at no salvage value = $1,222/month.
Power. Assume 70% average duty cycle on the GPUs. 4 x 600 W x 0.70 = 1,680 W from GPUs, plus ~300 W for the rest of the chassis = ~2,000 W continuous. At $0.13/kWh (US commercial average): 2 kW x 24 hr x 30 days x $0.13 = $187/month. Cooling overhead adds another ~30% in PUE: ~$245/month total.
Colocation (if not running in your own datacenter): a single 4U slot with 30A drop and 1 Gbps uplink runs $300-$600/month in major US metros.
Total monthly cost: ~$1,800-$2,100/month (in-house) or ~$2,100-$2,700/month (colo).
Now the throughput envelope. If we assume an honest aggregate of 1,000 tok/s sustained across all users (between the 64-concurrent and 256-concurrent estimates above), at 60% utilization 24x7:
1,000 tok/s x 0.60 x 86,400 s/day x 30 days = ~1.55 billion output tokens/month
Effective cost per 1M output tokens at $2,000/month: ~$1.29/1M at modeled utilization.
If we hit 1,500 tok/s aggregate at 75% utilization: ~2.92B tokens/month, cost dropping to ~$0.68/1M output tokens.
| Comparison | $/1M output tokens |
|---|---|
| DeepSeek API list price (cache miss) | $0.28 |
| 4x RTX Pro 6000 (modeled, 60% util, 1k tok/s agg) | ~$1.29 |
| 4x RTX Pro 6000 (modeled, 75% util, 1.5k tok/s agg) | ~$0.68 |
| 4x RTX Pro 6000 (modeled, 90% util, 2k tok/s agg) | ~$0.39 |
Source for API pricing: DeepSeek API pricing docs as of May 2026.
The honest read. At modeled throughput, a 4x RTX Pro 6000 rig is more expensive per token than the DeepSeek API in every scenario except very high sustained utilization. The case for self-hosting is not raw $/token — it is data residency, latency control, and predictable spend. If you are sizing this rig purely on cost math against the API, you are likely making a worse decision than buying credits.
The case flips for a few specific profiles: regulated industries that cannot send data to a Chinese-hosted API, agentic workloads where prefix caching gives you 5-10x effective throughput, or organizations with sustained 10M+ output tokens/day and existing GPU operations.
Alternative configurations
A few rigs to consider against the 4x RTX Pro 6000 baseline:
- 2x H200 SXM 141 GB (~$70k-$90k all-in). The ecosystem reference deployment. NVLink at 900 GB/s makes TP=2 enormously efficient; throughput is typically 2-3x of 4x RTX Pro 6000 on the same workload. About 2x the capex, less than 2x the throughput-per-dollar at high utilization. Best for anyone whose budget runs to $80k.
- 8x RTX Pro 6000 (~$78k-$85k all-in). Adds redundancy and roughly doubles aggregate throughput at the cost of higher PCIe sync tax. Sweet spot if your bottleneck is concurrency, not single-stream latency. The 40-50 tok/s single-stream measurements come from this config.
- 4x A100 80GB SXM (~$30k-$40k used, ~$80k new). NVLink, mature stack, well-documented. Lower bandwidth than H200 but very forgiving on operations. A reasonable used-market play.
- Mac Studio M3 Ultra 512 GB (~$10k). Runs V4 Flash via MLX at ~25-35 tok/s single-stream. Surprising value for single-user or small-team use. No production-grade serving, no high concurrency, no datacenter form factor. (See our local DeepSeek V4 Flash setup guide for the M-series path.)
FAQ
Does DeepSeek V4 Flash fit on 4x RTX Pro 6000 Blackwell?
Yes, comfortably. The model occupies ~170 GB at native FP4+FP8 mixed precision, leaving ~200 GB across the four cards for KV cache, activations, and concurrency. The fit is not the problem; throughput on a non-NVLink interconnect is.
Why does vLLM not work cleanly on RTX Pro 6000 for V4 Flash?
The RTX Pro 6000 Blackwell reports compute capability 12.0 (SM120), which is not yet supported by the official DeepGEMM kernel path that vLLM uses for V4-class models. As of May 2026 the workaround is either a community fork (jasl/vllm ds4-sm120-preview) or, more practically, SGLang's official Blackwell Docker image.
What throughput should I expect on 4 cards?
Modeled estimate: 25-35 tok/s single-stream decode at 2K-prompt input, scaling to 1,000-2,000 tok/s aggregate at high concurrency. These are extrapolated from measured 8x results; no fully-published 4x benchmark exists yet. If you build the rig, publish your numbers.
Does the lack of NVLink matter?
Yes, but only at high tensor-parallel widths. At TP=4 the PCIe Gen 5 x16 bus (~128 GB/s) is the bottleneck on all-reduce traffic during decode. You will see ~7x higher sync cost than NVLink-equipped hardware. Expect ~30-40% of the per-stream throughput you would get on the same number of H200 SXM cards.
Is this cheaper than the DeepSeek API?
Almost never on raw $/token at typical utilization. Self-hosting wins on data residency, latency control, predictable spend, and prefix-cache-friendly agentic workloads. If your only criterion is API cost, buy API credits.
What total power budget does this rig need?
~2,400 W from the GPUs alone, ~3,000 W for the full chassis under sustained load. Plan a 3,500 W PSU (or dual 2,000 W redundant), and a 30A 240V circuit. A typical US household 15A 120V circuit will trip.
Can I run two models simultaneously on this rig?
If V4 Flash uses ~170 GB and you have ~200 GB headroom, you could co-host an embeddings model or a small chat model (Qwen-3 7B, etc.) without conflict. SGLang's --mem-fraction-static flag controls partitioning. For two large models you need two rigs.
What about the RTX Pro 6000 Max-Q (300W) variant?
Same chip, same VRAM, lower power ceiling. Throughput drops ~15-20% under sustained load due to thermal/power throttling, but the rig consumes ~1,200 W for the GPUs instead of 2,400 W. For environments with tight power budgets it is a reasonable tradeoff.
References
- DeepSeek V4 Flash model card on Hugging Face
- Hugging Face discussion: running V4 Flash on RTX Pro 6000 SM120
- NVIDIA RTX Pro 6000 Blackwell Workstation Edition (official)
- NVIDIA RTX Blackwell Pro GPU Architecture whitepaper v1.0 (PDF)
- LMSYS: DeepSeek-V4 on Day 0 with SGLang
- Thunder Compute: RTX Pro 6000 Blackwell pricing tracker (May 2026)
- DeepSeek API pricing
- PulsedMedia: RTX Pro 6000 Blackwell wiki
Related Codersera coverage
- DeepSeek V4: The Complete Guide (2026) — pillar reference for the full V4 family.
- DeepSeek V4 VRAM & GPU Requirements: Pro vs Flash, Every Quantization — capacity math for every precision level.
- Run DeepSeek V4 Flash Locally (2026): CPU, GPU & Cloud — Mac, single-GPU, and cloud paths.
- Self-Hosting LLMs: The Complete Guide (2026) — broader pillar on self-hosted inference.
Pillar guide
DeepSeek V4: The Complete Guide (2026) — the canonical reference for the V4 family: Pro vs Flash, training recipe, agentic capabilities, and how it stacks against the rest of the open-source landscape.