On April 24, 2026, DeepSeek shipped two models on the same day: DeepSeek V4-Pro (1.6T total / 49B active parameters) and DeepSeek V4-Flash (284B total / 13B active). Both share the same architecture family, both were trained on 32T+ tokens, and both ship with a legitimate 1M token context window. The headline is the cost gap: Flash output tokens are roughly 12.4× cheaper than Pro at sticker. The surprise is the performance gap, or rather, how small it is. As the Latent Space digest put it bluntly: "Flash@max ≈ Pro@high on reasoning tasks."
This is the DeepSeek V4 Pro vs Flash comparison engineering leaders actually need: not a press-release recap, but a hard look at where the 5-point Artificial Analysis Intelligence Index gap shows up in production, where it doesn't, and how to map each variant to real workloads. We'll cover benchmarks, cost-per-task economics, provider speed tiers, local deployment feasibility, and a use-case decision tree. If you're also weighing DeepSeek V4 against Western frontier models, see our companion pieces on DeepSeek V4 vs Claude Opus 4.7 and DeepSeek V4 vs GPT-5.5 Pro.
Short version: Flash is the default for most production code, RAG, and tool-calling workloads in 2026. Pro is the right choice for hallucination-sensitive enterprise stacks, deep agentic loops, and frontier reasoning. Read on for the data behind that claim.
The two variants in 60 seconds
Both models share architecture: a hybrid attention stack combining Compressed Sparse Attention (CSA), DeepSeek Sparse Attention (DSA), and Heavily Compressed Attention (HCA), alongside Manifold-Constrained Hyper-Connections (mHC) and the Muon optimizer. Critically, Flash is not a distillation of Pro. It is a separate training run at smaller scale within the same family, post-trained with SFT, GRPO, and on-policy distillation. Both are text-only.
| Spec | V4-Pro | V4-Flash |
|---|---|---|
| Total / active params | 1.6T / 49B | 284B / 13B |
| Context window | 1M tokens | 1M tokens |
| Precision (deploy) | FP4 + FP8 mixed | FP8 (FP4 community quants) |
| Deployment weights | ~862 GB | ~158 GB |
| Reasoning modes | Non-Think / Think High / Think Max | High / xHigh |
| Training tokens | 32T+ | 32T+ |
| Modality | Text-only | Text-only |
| FLOPs at 1M ctx vs V3.2 | 27% | 10% |
| KV cache at 1M ctx vs V3.2 | 10% | 7% |
That last row is the quiet story: Flash isn't just smaller, it is more aggressively compressed at long context. At 1M tokens, Flash uses roughly a third of Pro's FLOPs and 70% of Pro's KV cache. That is why Flash can hold a 1M context conversation on a single Mac Studio while Pro requires an 8× H100 server.
V4 Pro vs Flash pricing
Pricing is where the architectural choice cashes out. Through May 31, 2026, DeepSeek is running a 75% promo on V4-Pro input and output tokens. Flash carries no promo, because it doesn't need one.
| Model | Input (sticker) | Input (Pro promo to 2026-05-31) | Output (sticker) | Output (promo) |
|---|---|---|---|---|
| V4-Pro | $1.74 / 1M | $0.435 / 1M (75% off) | $3.48 / 1M | $0.87 / 1M |
| V4-Flash | $0.14 / 1M | n/a | $0.28 / 1M | n/a |
Flash also offers a cache-hit input price of $0.014 / 1M (a 90% discount on cached input) and a community-reported off-peak nighttime discount of roughly 50% on the DeepSeek 1st-party API. At a typical 3:1 input/output blended workload, Flash lands around $0.17 per million blended tokens, while Pro at sticker is $2.17. Even at Pro's promo rate ($0.54 blended), Flash is still ~3.1× cheaper.
The output-token ratio matters most for reasoning models, because reasoning models emit a lot of output. Flash output is 12.4× cheaper than Pro at sticker, 3.1× cheaper at Pro's promo rate. When the promo ends, that ratio reverts to its full 12.4× gap.
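For teams modeling their own traffic shape, the blended-rate arithmetic above is a one-liner. A minimal sketch; the prices are the table figures, and the 3:1 input:output split is an assumption about workload shape, not a DeepSeek-published number:

```python
# Back-of-envelope blended-rate arithmetic for the figures above.
# Prices are $/1M tokens from the tables; the 3:1 input:output split
# is an assumed workload shape, not a DeepSeek-published number.

def blended_rate(input_price: float, output_price: float, input_ratio: float = 3.0) -> float:
    """Cost per 1M blended tokens at a given input:output token ratio."""
    return (input_ratio * input_price + output_price) / (input_ratio + 1)

flash = blended_rate(0.14, 0.28)        # ~$0.175 / 1M blended
pro_sticker = blended_rate(1.74, 3.48)  # ~$2.175 / 1M blended
pro_promo = blended_rate(0.435, 0.87)   # ~$0.544 / 1M blended

print(f"Flash:       ${flash:.3f} /1M")
print(f"Pro sticker: ${pro_sticker:.3f} /1M  ({pro_sticker / flash:.1f}x Flash)")
print(f"Pro promo:   ${pro_promo:.3f} /1M  ({pro_promo / flash:.1f}x Flash)")
```

Swap in your own input:output ratio: reasoning-heavy workloads skew toward output and widen Flash's advantage further.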
Cost-per-task: the $113 vs $1,071 number
Sticker pricing only tells you what a token costs. The number that matters is what a completed task costs. Artificial Analysis publishes the cost to run their full Intelligence Index suite, which is a clean apples-to-apples test because both models run the same prompts to the same termination criteria.
| Model | AA Intelligence Index suite cost | AA Intelligence Index score |
|---|---|---|
| Claude Opus 4.7 | $4,811 | ~54 |
| DeepSeek V4-Pro | $1,071.28 | 52 |
| DeepSeek V4-Flash | $113 | 47 |
| DeepSeek V3.2 | $71 | ~42 |
Flash completes the same suite roughly 9.5× cheaper for 5 index points fewer. It also emits more output tokens than Pro on average (about 240M vs 190M for the suite) because it thinks longer to reach correct answers, yet its per-token price is so much lower that total spend still collapses by an order of magnitude.
The cross-lab comparison is even starker. Claude Opus 4.7 runs the suite at 4.5× the cost of Pro and 42× the cost of Flash for roughly 2 points of additional Intelligence Index. For most production workloads, that is not a defensible spend.
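A quick sanity check on those suite numbers: reasoning-model suite bills are dominated by output tokens, and the quoted output-token counts roughly reproduce the totals. The breakdown below is our back-of-envelope inference, not AA's published accounting:

```python
# Rough sanity check: how much of each suite bill is output tokens?
# Token counts are the approximate figures quoted above; the remainder
# (input + any other charges) is inferred, not AA's published breakdown.

suite = {
    # model: (output_tokens_millions, output_price_per_1M, total_suite_cost)
    "V4-Pro":   (190, 3.48, 1071.28),
    "V4-Flash": (240, 0.28, 113.00),
}

for model, (out_m, price, total) in suite.items():
    output_spend = out_m * price
    print(f"{model}: ${output_spend:,.0f} of ${total:,.0f} "
          f"({output_spend / total:.0%}) is output tokens")
```

Both bills come out roughly 60% output tokens, which is why the 12.4× output-price gap, not the token counts, drives the order-of-magnitude difference.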
DeepSeek V4 Flash benchmarks vs Pro: where the gap is and isn't
Here is the consolidated benchmark table, sourced from Artificial Analysis and the DeepSeek model cards. The right column is the gap, in percentage points unless noted otherwise.
| Benchmark | V4-Pro Max | V4-Flash Max | Gap |
|---|---|---|---|
| AA Intelligence Index | 52 (#3 / 78) | 47 (#9 / 78) | 5 |
| MMLU-Pro | 87.5 | 86.2 | 1.3 |
| GPQA Diamond | 90.1 | 88.1 | 2.0 |
| LiveCodeBench Pass@1 | 93.5 | 91.6 | 1.9 |
| SWE-bench Verified | 80.6 | 79.0 | 1.6 |
| Codeforces rating | 3,206 | 3,052 | 154 (Elo) |
| Terminal-Bench 2.0 | 67.9 | 56.9 | 11.0 |
| HMMT 2026 Feb | 95.2 | 94.8 | 0.4 |
| IMOAnswerBench | 89.8 | 88.4 | 1.4 |
| Humanity's Last Exam (HLE) | 37.7 | ~34.6 | ~3.1 |
| SimpleQA-Verified | 57.9 | 34.1 | 23.8 |
| MRCR @ 1M (MMR) | 83.5 | 78.7 | 4.8 |
| BrowseComp | 83.4 | ~73 | 10.2 |
| AA-Omniscience accuracy | -10 | -23 | 13 |
| AA-Omniscience hallucination rate | 94% | 96% | 2 |
| GDPval-AA | 1554 | 1388 | 166 (Elo) |
Coding: Flash is within 2 points on 2 of 3 benchmarks
This is the cluster that surprised most engineers. LiveCodeBench Pass@1 shows Flash at 91.6 vs Pro at 93.5, a 1.9-point gap. SWE-bench Verified sits at 79.0 vs 80.6, a 1.6-point gap. The Codeforces rating shows a real gap (3,206 vs 3,052, 154 Elo points), which matters for competitive-programming-style problems but rarely shows up in day-to-day backend work. For your TypeScript, Python, or Go agentic coding loops, Flash is the rational default unless you are in the top quartile of difficulty.
Reasoning and math: a wash within margin of error
MMLU-Pro: 1.3-point gap. GPQA Diamond: 2-point gap. HMMT 2026: 0.4 points. IMOAnswerBench: 1.4 points. These are within run-to-run variance. The only meaningful reasoning gap is on HLE (Humanity's Last Exam), where Pro pulls ahead by ~3 points. HLE specifically rewards graduate-level recall and chained inference, and that's where Pro's larger parameter count earns its keep.
Long-context: Pro has a modest edge
At 1M context retrieval (MRCR), Pro scores 83.5 and Flash 78.7, a 4.8-point gap. For most RAG and document-analysis workloads, 78.7 is comfortably above the threshold where retrieval quality starts hurting downstream tasks. Pro starts to matter when you're stuffing 600k+ tokens of legal contracts, codebases, or scientific literature into a single call and need every reference resolved correctly.
Agentic and tool use: the real Pro moat
This is where the Pro premium is justified. Terminal-Bench 2.0 shows an 11-point gap (67.9 vs 56.9) on multi-step shell-and-tool agentic tasks. BrowseComp shows a 10.2-point gap on web-search agent tasks. If your product is a multi-step agent that chains 8-15 tool calls per turn, that gap compounds: each step's error rate multiplies through the chain, as the sketch below illustrates. HN commenter gertlabs frames Flash the same way: "not a smart model on the first try, but it makes up for it over the course of a session." Flash recovers; Pro avoids needing to.
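To see why the gap compounds, model an agent chain as a sequence of independent steps. The per-step success rates below are illustrative placeholders (Terminal-Bench scores are task-level, not per-step), chosen only to show the exponential decay:

```python
# Illustrative only: model an agent chain as n independent steps.
# The per-step success rates are placeholders chosen to show how a
# modest per-step gap compounds; they are not measured benchmark values.

def chain_success(per_step: float, n_steps: int) -> float:
    return per_step ** n_steps

for n in (1, 4, 8, 15):
    pro, flash = chain_success(0.97, n), chain_success(0.93, n)
    print(f"{n:>2} steps: Pro-like {pro:.0%} vs Flash-like {flash:.0%}")
# At 8 steps, a 4-point per-step gap becomes ~78% vs ~56% chain success.
```

This is the shape of the argument, not a measurement: small per-step reliability differences turn into large end-to-end gaps once chains get deep.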
Knowledge and hallucination: Pro's clear win
The biggest single gap on the table is SimpleQA-Verified at 23.8 points (57.9 vs 34.1): Flash answers barely a third of factual-recall questions correctly where Pro answers well over half. AA-Omniscience confirms this: 13 points of accuracy difference, with both models hallucinating at near-ceiling rates (94% vs 96%). For factual customer-facing systems, regulated industries, or any workflow where a confident wrong answer is worse than no answer, this is the deciding metric.
Speed and providers
Headline numbers from the DeepSeek 1st-party API at Max effort: Flash at 81.3 output tokens/sec, Pro at 35.6 t/s. Flash is roughly 2.3× faster on its native endpoint. But provider variance is enormous, and the speed sleeper is Pro on Fireworks.
| Provider | V4-Pro Output t/s | V4-Pro TTFT (s) |
|---|---|---|
| Fireworks | 169.3 | 28.06 |
| Together.ai | 48.3 | 0.99 |
| Novita | 36.0 | 123.41 |
| SiliconFlow | 35.8 | 124.21 |
| DeepSeek 1st party | 35.6 | 1.82 |
| DeepInfra (FP4) | 32.3 | 1.27 |
| Provider | V4-Flash Output t/s | V4-Flash TTFT (s) |
|---|---|---|
| Novita | 85.5 | 67.23 |
| SiliconFlow (FP8) | 83.7 | 68.42 |
| DeepSeek | 81.3 | 70.07 |
Fireworks running Pro at 169 t/s is faster than any Flash provider, with the trade-off of a 28-second TTFT. If you are building an interactive UX, Together.ai's sub-1-second TTFT on Pro is the more practical pick. For batch jobs where total throughput dominates, Fireworks Pro genuinely competes with Flash on wall-clock time.
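A simple latency model makes the trade-off concrete: total wall-clock time is roughly TTFT plus output tokens divided by throughput. The sketch below uses the provider figures from the tables and an assumed 4,000-token completion; it ignores queueing, retries, and network overhead:

```python
# Wall-clock model: total_seconds ≈ TTFT + output_tokens / throughput.
# Figures are from the provider tables above; the 4,000-token completion
# length is an assumption. Ignores queueing, retries, and network overhead.

providers = {
    "Pro @ Fireworks":   (28.06, 169.3),  # (TTFT seconds, output t/s)
    "Pro @ Together.ai": (0.99, 48.3),
    "Flash @ DeepSeek":  (70.07, 81.3),
}

output_tokens = 4_000  # assumed completion length
for name, (ttft, tps) in providers.items():
    total = ttft + output_tokens / tps
    print(f"{name}: {total:.0f}s for {output_tokens} output tokens")
```

On that assumed completion length, Fireworks Pro finishes in roughly 52 seconds against ~119 seconds for Flash on the 1st-party endpoint; for short interactive turns, Together.ai's sub-second TTFT dominates perceived latency instead.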
OpenRouter daily volume tells its own story: 29.2B prompt tokens, 530M completion tokens, 533M reasoning tokens on Flash alone. That is dominant adoption inside two weeks of launch.
Local deployment
V4-Pro: datacenter-only
Pro requires server-class hardware: 8× H100 or H200. At Q4 quantization the weights still occupy roughly 800GB, so this is not homelab-feasible. Notably, DeepSeek positioned Pro as the first DeepSeek model optimized for Huawei Ascend 950 silicon, which is the geopolitical subtext of the launch.
V4-Flash: real local options
Flash has a hard floor of 90GB of pooled memory. Below that, you're paging weights from disk and throughput collapses to unusable levels. Above that floor:
| Hardware | Approx. cost | Throughput (Q4 / Q4_K_M) |
|---|---|---|
| Mac Studio M4 Max 192GB | $5,999 | 25-35 t/s (MLX) |
| RTX PRO 6000 96GB | ~$8,500 | 45-60 t/s |
| Dual H100 80GB | ~$50,000 | 60-90 t/s |
Day-zero tooling: vLLM (FP4/FP8), SGLang, MLX (community Flash port), and llama.cpp (antirez fork). Community quants live at unsloth/DeepSeek-V4-Flash and tecaprovn/deepseek-v4-flash-gguf. AWQ/INT4 community ports are still in flight at the time of writing.
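If you're serving Flash locally, both vLLM and llama.cpp expose OpenAI-compatible HTTP endpoints, so application code doesn't need to change between hosted and local deployments. A minimal sketch, assuming a server on localhost:8000 and a locally registered model name (both are illustrative, not fixed identifiers):

```python
# Query a locally served Flash checkpoint through the OpenAI-compatible
# endpoint that vLLM / llama.cpp's server expose. The base_url, port, and
# model name are illustrative assumptions about your local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # whatever name your server registers
    messages=[{"role": "user", "content": "Summarize this changelog: ..."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```

Because the interface matches the hosted API, switching between on-device Flash and a cloud provider is a one-line base_url change, which is what makes the privacy-sensitive deployment story practical.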
When to use V4 Flash
The default answer for production workloads in 2026 is Flash. Specifically:
- Bulk classification, summarization, and data labeling. The 12.4× output-cost difference is decisive when you're running millions of completions per day, and Flash's accuracy on structured tasks is within 1-2 points of Pro.
- Code autocomplete and mid-tier coding agents. Cursor- or Copilot-style replacements: gertlabs on HN argued Flash is the right pick here, citing speed and the fact that it self-corrects across a session.
- Customer service chatbots and RAG over enterprise corpora. 78.7 MRCR @ 1M is good enough for the vast majority of retrieval workloads, and the cache-hit price of $0.014 / 1M makes high-traffic chat economically viable (see the cache-aware sketch after this list).
- Tool-calling pipelines as a Haiku/Gemma 4 replacement. Flash's tool-calling reliability inside a single chain (1-3 tool calls) is essentially Pro-class. The gap only opens up at deeper agentic depth.
- Long-context analysis where 78.7 MRCR is acceptable. Most enterprise document QA, contract review, and codebase navigation falls here.
- Speed-prioritized interactive UX. 81 t/s on the 1st-party Flash API vs 36 t/s on Pro is a meaningful UX gap.
- Privacy-sensitive workloads. Mac Studio runnable means you can keep the workload entirely on-device for legal, healthcare, or finance use cases.
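On the cache-hit economics from the RAG bullet above: DeepSeek's input caching is reported to be prefix-based, so the discount only applies to the part of the prompt that is byte-identical across requests. A minimal sketch of cache-aware prompt structuring, assuming the OpenAI-compatible hosted endpoint (the model ID shown is illustrative; check the current API docs):

```python
# Sketch: structure prompts so the static prefix is byte-identical across
# requests, maximizing cache hits at the $0.014/1M cached-input rate.
# Prefix caching matches from the start of the prompt, so anything that
# varies per request belongs at the end.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

STATIC_SYSTEM = "You are a support agent for Acme. Policies: ..."  # unchanged across calls

def answer(ticket_text: str) -> str:
    resp = client.chat.completions.create(
        model="deepseek-chat",  # illustrative; check current model IDs
        messages=[
            {"role": "system", "content": STATIC_SYSTEM},  # cacheable prefix
            {"role": "user", "content": ticket_text},      # varies per request
        ],
    )
    return resp.choices[0].message.content
```

The design rule is simple: never interleave per-request content (timestamps, user IDs, retrieved chunks) into the shared prefix, or every request becomes a cache miss.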
When to use V4 Pro
Pro is the right call when the cost of being wrong outweighs the cost of being expensive. Specifically:
- Multi-step agentic loops with chained tool calls. The 11-point Terminal-Bench gap compounds across long chains. If your agent does 8+ tool calls per turn, Pro's reliability earns back its premium.
- Web-search agents. The 10.2-point BrowseComp gap is real. Search agents read messy pages, parse contradictions, and need to recover from dead ends. Pro is materially better at this.
- Factual recall and world knowledge. The 23.8-point SimpleQA gap is the single largest delta on the table. If factual correctness is the product, you need Pro.
- Research-grade math beyond olympiad level. HLE shows a ~3-point gap, and at frontier difficulty those points represent meaningful capability.
- Hallucination-sensitive enterprise workflows. The 13-point AA-Omniscience accuracy gap matters in regulated industries, legal review, medical research, and financial analysis.
- Speed-critical frontier reasoning. Pro on Fireworks at 169 t/s is competitive with Flash on wall-clock time while delivering Pro-tier accuracy.
Real-world reception
The community read on V4 has been remarkably consistent across independent voices. Simon Willison summed it up: "DeepSeek V4 - almost on the frontier, a fraction of the price." Latent Space's digest emphasized two things: that "Flash@max ≈ Pro@high on reasoning tasks" and that V4 represents "legit 1M context for pennies."
"DeepSeek V4 Flash is the model to pay attention to here. It's cheap, effective, and REALLY fast." - HN user gertlabs
The same commenter was sharper on Pro: "The Pro model is slow, not much better in coding reasoning so far when it works, and honestly too unreliable and rate limited to be of much use, currently." That is one user's experience on the DeepSeek 1st-party endpoint, where Pro caps at 35 t/s and is heavily rate-limited; on Fireworks, the speed complaint goes away. The reliability complaint is a function of Pro's deeper agentic chain failures, which the benchmarks confirm.
On Flash's agentic behavior, gertlabs added a useful nuance: "Not a smart model on the first try, but it makes up for it over the course of a session." That matches the benchmark profile: Flash is more error-prone per step, but its self-correction in chat contexts is strong.
Decision tree
If you only have 30 seconds, here is the decision flow as prose:
Start: Is your workload factual recall or regulated/hallucination-sensitive? → Pro.
Otherwise, does your agent chain 8+ tool calls per turn or do web-search-heavy research? → Pro.
Otherwise, is your workload code autocomplete, RAG, classification, summarization, single-step tool calls, or chat? → Flash.
Edge case: Need frontier reasoning at low latency? → Pro on Fireworks (169 t/s, 28s TTFT) for batch; Pro on Together.ai for interactive (sub-1s TTFT).
Edge case: Need local / on-device? → Flash on Mac Studio M4 Max 192GB minimum. Pro is datacenter-only.
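The same flow as a routing sketch. The Workload fields and thresholds below are our framing of the tree above, not a published rubric:

```python
# The prose decision tree above as a routing sketch. The Workload fields
# and thresholds are our own framing, not a DeepSeek-published rubric.
from dataclasses import dataclass

@dataclass
class Workload:
    factual_or_regulated: bool = False  # recall-heavy or hallucination-sensitive
    tool_calls_per_turn: int = 0
    web_search_heavy: bool = False
    must_run_locally: bool = False

def pick_model(w: Workload) -> str:
    if w.must_run_locally:
        return "V4-Flash (local)"  # Pro is datacenter-only
    if w.factual_or_regulated:
        return "V4-Pro"
    if w.tool_calls_per_turn >= 8 or w.web_search_heavy:
        return "V4-Pro"
    return "V4-Flash"  # default: code, RAG, classification, chat

print(pick_model(Workload(tool_calls_per_turn=12)))  # -> V4-Pro
```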
What this means for engineering teams
The economic shift is the headline. A capability that cost $1,071 per Intelligence Index suite run on Pro now costs $113 on Flash for 5 fewer points. Compared to Western frontier models, Flash is roughly 42× cheaper than Claude Opus 4.7 and well under 10% of GPT-5.5 Pro's cost on equivalent suites. That is not a marginal saving; it changes which products are economically viable to build.
For engineering leaders, this means three concrete things. First, your default model for new internal tooling should be Flash, with Pro reserved for the specific workloads where its moat justifies the spend. Second, your RAG and tool-calling stacks should be re-tested against Flash before your next contract renewal with a closed-model vendor. Third, privacy-sensitive workloads now have a credible local option at $6K-$8K of hardware, which makes on-device deployment defensible to compliance teams in healthcare, legal, and finance.
If you're staffing a team to actually integrate these models, Codersera connects companies with vetted senior engineers who have shipped LLM-backed systems in production. Whether you need a Python developer for ML pipelines, a TypeScript developer for agentic frontends, a Go developer for high-throughput inference proxies, a Node.js developer for orchestration layers, or a Rust developer for performance-critical model serving, you can scope and start within days. Learn more about our services, why teams choose us, or browse our AI engineering blog for more deep dives.
FAQ
Is V4 Flash distilled from V4 Pro?
No. Both models are independent training runs within the same architecture family. They share the same hybrid attention design (CSA + DSA + HCA), the same Muon optimizer, and the same 32T+ token training mix, but Flash is not produced by distilling Pro. This is confirmed in the DeepSeek model cards and reiterated in the AINews / Latent Space coverage.
How much does V4 Flash actually cost in production?
At sticker, $0.14 / 1M input and $0.28 / 1M output. With cache hits, input drops to $0.014 / 1M (90% off). Off-peak windows on the DeepSeek 1st-party API are reported to discount Flash by another ~50%. For a typical 3:1 input/output workload, that's about $0.17 per million blended tokens. The full Artificial Analysis Intelligence Index suite costs $113 to run on Flash.
What hardware do I need to run V4 Flash locally?
The hard floor is 90GB of pooled memory. Practical configurations: Mac Studio M4 Max 192GB ($5,999, 25-35 t/s on MLX Q4_K_M), RTX PRO 6000 96GB (~$8,500, 45-60 t/s), or dual H100 80GB (~$50K, 60-90 t/s). Below 90GB you'll be paging weights from disk and throughput collapses to unusable levels.
Can I run V4 Pro locally?
Realistically, no. Pro requires 8× H100 or H200 class hardware. At Q4 quantization the weights still occupy roughly 800GB. This is a datacenter workload.
How does V4 Pro compare to Claude Opus 4.7 and GPT-5.5 Pro?
Pro is roughly 2 Intelligence Index points behind Opus 4.7, at 4.5× lower cost. We've covered the full breakdowns in DeepSeek V4 vs Claude Opus 4.7 and DeepSeek V4 vs GPT-5.5 Pro.
Which provider should I use for V4 Pro?
For batch throughput, Fireworks at 169 t/s is unmatched. For interactive use, Together.ai's sub-1-second TTFT is the right pick. The DeepSeek 1st-party endpoint is the cheapest but is rate-limited and slower.
Does V4 Flash hallucinate more than V4 Pro?
Yes. SimpleQA-Verified shows a 23.8-point gap (57.9 vs 34.1), and AA-Omniscience shows a 13-point accuracy gap. Both models hallucinate at near-ceiling rates on AA-Omniscience (94% Pro vs 96% Flash), so for hallucination-sensitive workloads, Pro is the safer pick but neither is a substitute for grounded retrieval.
Is the V4 Pro 75% promo permanent?
No. The promo runs through May 31, 2026. After that, Pro reverts to $1.74 / 1M input and $3.48 / 1M output, which restores the full 12.4× cost ratio between Pro and Flash.
When should I just use V4 Flash for everything?
If your workload is code autocomplete, RAG, classification, summarization, single-step tool calls, or chat, and your hallucination tolerance is moderate, Flash is the rational default. The 5-point Intelligence Index gap rarely shows up in practice for these workloads, and the cost difference is decisive.
Does V4 support images or audio?
No. Both Pro and Flash are text-only. If you need vision or audio, you'll need a multimodal model from a different family.
Sources and further reading
- DeepSeek V4-Pro on Hugging Face
- DeepSeek V4-Flash on Hugging Face
- DeepSeek V4 collection
- DeepSeek API pricing
- DeepSeek V4 release notes
- Artificial Analysis - V4-Pro
- Artificial Analysis - V4-Flash
- AA - V4-Pro providers
- AA - V4-Flash providers
- AA - DeepSeek is back among the leading open-weights models
- OpenRouter V4-Flash
- OpenRouter V4-Pro
- Hacker News - V4 main thread
- Hacker News - V4 technical paper
- Hacker News - Day 0 SGLang
- Latent Space digest
- Simon Willison on DeepSeek V4
- compute-market local hardware guide
- InsiderLLM V4 Pro vs Flash guide
- BuildFastWithAI - DeepSeek V4 Flash review
- OfficeChai V4 benchmarks and pricing
- DataCamp DeepSeek V4 article
- BenchLM V4-Flash
- NVIDIA - Build with DeepSeek V4 on Blackwell
For more LLM and AI engineering deep dives, see the Codersera blog, the AI tag, or our FAQs on engaging engineering talent.