DeepSeek V4 Flash: Benchmarks, Pricing & Flash vs Pro

Quick answer. Yes, for most teams DeepSeek V4 Flash is worth it. It scores 47 on the Artificial Analysis Intelligence Index (#10 of 85), within ~1-2 points of V4 Pro on coding, at $0.14 input / $0.28 output per million tokens and $0.0028 cached. Default to Flash; escalate to Pro only for long-horizon agents and heavy factual recall.

This is a benchmarks-and-pricing review of DeepSeek V4 Flash, written to answer one question: should you actually build on it, and how does it compare to V4 Pro on the numbers that affect your bill? It is deliberately not an install walkthrough. If you want to run Flash on a Mac Studio or a GPU rig, follow our dedicated DeepSeek V4 Flash local setup guide instead. Here we stay on benchmarks, pricing, and the Flash-vs-Pro decision.

DeepSeek shipped V4 on April 24, 2026 as a two-model family: V4 Pro (1.6T total / 49B active, the leaderboard model) and V4 Flash (284B total / 13B active, the cheap-and-fast one). Most launch coverage was a Pro story. But the model that changes how most engineering teams should think about LLM spend is Flash, because the price-to-capability ratio is the most aggressive on the open-weights board this year. Both are MIT-licensed with full open weights on Hugging Face, and both expose a 1M-token context window.

The verdict up front: on the Artificial Analysis Intelligence Index v4.0 (a neutral third-party benchmark), Flash scores 47 versus Pro's 52 — a 5-point gap — while costing roughly 9.5x less to run the same evaluation suite ($113 vs $1,071, per Artificial Analysis). For coding, classification, RAG, and most product workloads, that 5-point gap is not worth a 9.5x bill. For long-horizon agents and pure factual recall, it is. The rest of this review shows the numbers behind that call.

Is DeepSeek V4 Flash worth it?

For the large majority of LLM-backed product work in 2026, yes. The case rests on three measured facts:

Intelligence is top-tier for its size. AA Intelligence Index 47, ranked #10 of 85 models tracked — well above the ~30 median for comparable models (source: Artificial Analysis, neutral).
Coding quality tracks Pro closely. SWE-bench Verified 79.0 (Pro 80.6) and LiveCodeBench Pass@1 91.6 (Pro 93.5). At IDE-task scale the 1-2 point delta is effectively invisible.
The price gap is meaningful. $0.14 / $0.28 sticker, $0.0028 cached input — versus Pro's permanent $0.435 / $0.87 (DeepSeek made the previous 75% discount permanent on 2026-05-22). Flash output is ~3.1x cheaper than Pro's permanent rate.

Where it is not worth it: workloads dominated by chained multi-step tool use, web-browsing agents, or hallucination-sensitive factual lookups. Those are exactly the tasks the benchmark gap concentrates in, and we map them out in the decision matrix below.

What is DeepSeek V4 Flash?

Flash is the second model in DeepSeek's V4 family, released April 24, 2026. It is not a distillation of Pro — it is its own pretraining run in the same architecture family, with the same MIT license and full open weights published on Hugging Face. The headline specs (vendor model card):

284B total parameters, 13B active per token (Mixture-of-Experts). Pro is 1.6T total / 49B active.
1M-token context window, identical to Pro.
Reasoning modes: High and Max (community-tagged Flash-Max in evaluation tables). A non-reasoning mode also exists and scores materially lower (AA Index 36 vs 47).
Text-only. No image input.

The name is misleading. "Flash" in Google's and OpenAI's product lines historically meant a small distilled model with sharply lower capability. DeepSeek V4 Flash is a smaller-active-params MoE in the same family that, on most benchmarks, sits within a few points of Pro. The exceptions — agentic tool-use, browsing, factual recall — are the whole reason a routing strategy exists, and they are covered below.

What does DeepSeek V4 Flash cost?

Pricing is where Flash earns its place. These are the first-party DeepSeek API rates per million tokens, verified against the official DeepSeek API pricing page (vendor-published, current as of May 2026).

Model / tier	Input (cache miss)	Input (cache hit)	Output	Blended (3:1)
V4 Flash	$0.14	$0.0028 (98% off)	$0.28	~$0.18
V4 Pro (permanent since 2026-05-22)	$0.435	$0.003625	$0.87	~$0.54

The cache-hit lane is the headline. DeepSeek cut all-model cache-hit input pricing to 1/10 of launch (effective 2026/04/26), so Flash cached input is now $0.0028 per million tokens — a 98% discount on an already-cheap sticker. Any chat, agent, or RAG workload with a stable system prompt and reusable context collapses input cost to near zero, automatically; the API manages the cache for you.

Put it in workload terms. A coding agent that emits ~50,000 output tokens per task costs about $0.014 per task on Flash. The same task on Claude Opus 4.7 is roughly $1.25 — close to a 90x output-cost difference (figures via BuildFastWithAI's review; treat as illustrative, your token mix will vary). Even against Pro, Flash output is ~3.1x cheaper at Pro's permanent price ($0.87/M vs $0.28/M, after DeepSeek made the 75% discount permanent on 2026-05-22).

One caveat, labeled clearly: DeepSeek historically offered an extra off-peak (Beijing-night, ~16:30-00:30 UTC) discount on V3/R1. Whether that schedule extends to V4 Flash is unverified — community reports of ~$0.07/$0.14 nighttime rates exist but DeepSeek has not confirmed it for the V4 line. Do not build billing assumptions on it; check the official pricing page before relying on off-peak rates.

How does V4 Flash score on benchmarks?

Two benchmark families matter here: Artificial Analysis (neutral, third-party, runs its own evaluations) and the DeepSeek model card (vendor-reported). Both are labeled in the table.

Benchmark	V4 Flash (Max)	V4 Pro (Max)	Source
AA Intelligence Index v4.0	47 (#10 / 85)	52 (#3 / 85)	Artificial Analysis (neutral)
Cost to run full AA Index suite	$113	$1,071	Artificial Analysis (neutral)
Output speed (1st-party API)	~103 t/s	~29 t/s	Artificial Analysis (neutral)
Time to first token	~1.1 s	~1.9 s	Artificial Analysis (neutral)
SWE-bench Verified	79.0	80.6	Model card / aggregators (vendor)
LiveCodeBench Pass@1	91.6	93.5	Model card / aggregators (vendor)
GDPval-AA (agentic)	1388	1554	Artificial Analysis (neutral)
AA-Omniscience accuracy	-23 (96% halluc.)	-10 (94% halluc.)	Artificial Analysis (neutral)

Read the table as a decision aid, not a scoreboard. The neutral AA Index puts Flash at 47 and Pro at 52. Five points sounds large until you decompose where the gap lives:

Coding (the workload most teams actually run): 1.6pp on SWE-bench Verified, 1.9pp on LiveCodeBench. At IDE-replacement scale this delta does not show up in shipped code quality. Multiple independent reviews land on the same conclusion: benchmark Flash first, escalate to Pro only when a specific task type fails your eval.
Agentic / GDPval-AA: 1388 vs 1554. This is where the gap is real. Long-horizon, many-tool-call loops accumulate error on Flash faster than on Pro.
Factual recall / hallucination: AA-Omniscience -23 vs -10, 96% vs 94% hallucination. Flash is measurably more willing to confidently make things up. Hallucination-sensitive enterprise workflows should route to Pro or add a verification layer.

Independent reviewers summarize it the same way: for typical IDE and code-generation work the gaps are small and the cost advantage is massive; the gap widens on hard multi-file refactors that require holding complex state and on chained agentic tool use.

How fast is V4 Flash?

Speed is the second pillar of the Flash pitch and it is not close. Artificial Analysis measures Flash (Max) at roughly 103 output tokens/sec on the first-party API versus Pro (Max) at about 29 t/s — Flash is ~3.5x faster end-to-end on identical infrastructure, with ~1.1s TTFT vs Pro's ~1.9s. For latency-sensitive UX (inline IDE suggestions, voice, live writing) that difference is the gap between "feels live" and "server is thinking." Flash is also notably verbose, so set output token limits if you are cost-sensitive.

Flash vs Pro: which should you use?

The consensus pattern across DeepSeek's own positioning and every independent review is a router: default to Flash, escalate to Pro on validation failure, and reserve a top-tier frontier model (Claude Opus 4.7 or similar) for the hardest ~5% of tasks. Concrete decision matrix:

Workload	Recommended	Why
IDE coding agent / Copilot replacement	Flash	1-2pp behind Pro on SWE-bench/LiveCodeBench; ~12x cheaper output
Bulk classification / summarization / labeling	Flash	~103 t/s and $0.14/$0.28 sticker; throughput economics decisive
RAG / chatbot with stable system prompt	Flash	$0.0028 cached input makes blended input near-free
Long-context document analysis (1M)	Flash	Same 1M window as Pro at a fraction of the cost
Latency-sensitive interactive UX	Flash	~3.5x faster output than Pro; lower TTFT
Privacy / on-prem / air-gapped	Flash	Fits desk-class hardware; Pro needs a datacenter (see setup guide)
Long-horizon agents (many chained tool calls)	Pro	GDPval-AA 1554 vs 1388; compounding error matters
Web-browsing / research agents	Pro	Better cross-turn synthesis; Flash weaker on browsing
Hallucination-sensitive enterprise lookups	Pro (+ verification)	AA-Omniscience -10 vs -23; fewer confident fabrications
Hard multi-file refactors holding type-level state	Pro	More reliable maintaining invariants across files

Cost-per-task is the clincher. Artificial Analysis spent $113 on Flash vs $1,071 on Pro to complete the same Intelligence Index suite — a 9.5x cost reduction for a 5-point Index drop. If you are not paying for the last 5 points of intelligence on a given task, you should not be paying 9.5x for it. Run both behind a router, send the small slice where Pro's edge is real to Pro, and let the rest ride Flash.

Companion guide

For the full picture across the V4 lineup — architecture, deployment patterns, and how it stacks up against GPT-5.5 and Claude Opus 4.7 — see our DeepSeek V4 complete guide (2026).

How do you call V4 Flash from code?

The DeepSeek API is OpenAI-compatible: endpoint https://api.deepseek.com/v1/chat/completions, model name deepseek-v4-flash, Bearer auth. Prompt caching is automatic when your prompt prefix repeats; the usage object reports how much was billed at the cached rate. This is the minimal integration — for a full local deployment, the setup guide covers MLX/vLLM/llama.cpp.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

SYSTEM_PROMPT = """You are a senior code reviewer. Review the diff for
correctness, security, and performance issues. Return JSON:
{ issues: [{severity, file, line, message}] }"""

def review(diff: str) -> str:
    r = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": diff},
        ],
        response_format={"type": "json_object"},
        temperature=0.2,
    )
    u = r.usage
    hit = getattr(u, "prompt_cache_hit_tokens", 0)
    miss = getattr(u, "prompt_cache_miss_tokens", 0)
    print(f"cache hit {hit} / miss {miss} / out {u.completion_tokens}")
    return r.choices[0].message.content

The trick to maximize cache hits: keep the system prompt byte-stable and put long stable context (style guide, docs, schema) before short variable context (the diff, the user message). DeepSeek's caching is prefix-based; any change to the front of the prompt invalidates the cached prefix.

What do engineers say about V4 Flash?

Independent reception lands consistently: Flash is the model teams are moving real production traffic to, and the recommended posture is "Flash by default, Pro as escalation."

Artificial Analysis: Flash is "well above average among comparable models" on the Intelligence Index, and "back among the leading open-weights models" alongside Pro (neutral benchmark house).
Independent reviews (BuildFastWithAI, evolink, andrew.ooo): "benchmark Flash first, escalate to Pro only when evaluation shows Flash quality is insufficient for a specific task type."
Simon Willison on the V4 launch: "DeepSeek V4 — almost on the frontier, a fraction of the price" (independent commentary).

One housekeeping note for migrations: DeepSeek's legacy deepseek-chat and deepseek-reasoner aliases retire after July 24, 2026. If you are on the old aliases, V4 Flash is the natural default replacement; benchmark it before the cutoff.

FAQ

Is DeepSeek V4 Flash worth it over V4 Pro?

For most workloads, yes. Flash is ~1-2 points behind Pro on coding benchmarks and 5 points behind on the overall AA Intelligence Index, but roughly 9.5x cheaper to run the same evaluation suite. Default to Flash and escalate to Pro only for long-horizon agents, browsing-heavy work, and hallucination-sensitive factual tasks.

How much does DeepSeek V4 Flash cost per million tokens?

$0.14 input (cache miss) and $0.28 output. Cache-hit input is $0.0028 (a 98% discount, effective 2026/04/26). Off-peak nighttime pricing for V4 is community-reported but unconfirmed by DeepSeek.

What does V4 Flash score on the AA Intelligence Index?

47 in reasoning Max mode, ranked #10 of 85 tracked models — well above the ~30 median for comparable models. Non-reasoning mode scores 36. V4 Pro (Max) scores 52.

How does V4 Flash compare to Pro on coding?

SWE-bench Verified 79.0 (Pro 80.6) and LiveCodeBench Pass@1 91.6 (Pro 93.5). The gap is 1-2 points and effectively invisible at IDE-task scale; it widens on hard multi-file refactors that require holding complex state.

Is V4 Flash fast enough for interactive use?

Yes. Artificial Analysis measures ~103 output tokens/sec on the first-party API (Pro is ~29 t/s) with ~1.1s time to first token. It is fast enough for inline IDE suggestions and voice UX, though it is verbose, so cap output tokens if cost matters.

When should I use Pro instead of Flash?

Use Pro for long-horizon agentic loops with many chained tool calls (GDPval-AA 1554 vs 1388), web-browsing agents, hallucination-sensitive enterprise lookups (AA-Omniscience -10 vs -23), and hard multi-file refactors. Default everything else to Flash.

Where do I learn how to run V4 Flash locally?

This review intentionally does not cover installation. Follow our dedicated DeepSeek V4 Flash local setup guide for hardware requirements, quantization choices, and MLX/vLLM/llama.cpp serving.

What is the verdict for engineering teams?

The right default for LLM-backed product in 2026 has shifted. DeepSeek V4 Flash is the model you build on; Pro (or a closed frontier model) is the one you escalate to. On the numbers that matter — neutral AA Intelligence Index 47, coding within 1-2 points of Pro, ~103 t/s, $0.14/$0.28 with $0.0028 cached input — Flash clears the bar for the large majority of workloads at roughly an order of magnitude less cost. The 5-point Index gap to Pro is real, but it concentrates in long-horizon agents and factual recall, which a router can isolate.

The harder part is execution: most of the savings live in caching, routing, and batching — engineering work that never shows up in a benchmark. If you are hiring vetted remote developers experienced with DeepSeek V4, LLM cost engineering, and production AI pipelines, codersera.com/hire matches you with senior, remote-ready talent in TypeScript, Python, Go, Rust, and Node who have shipped this kind of work before.

Sources and further reading

DeepSeek API pricing (vendor, primary)
DeepSeek V4 release notes (vendor, primary)
DeepSeek V4-Flash model card on Hugging Face (vendor)
Artificial Analysis: V4 Flash (neutral, third-party)
Artificial Analysis: V4 Pro (neutral, third-party)
Artificial Analysis: DeepSeek is back among the leading open-weights models (neutral)
BuildFastWithAI: V4 Flash review (independent)
Evolink: DeepSeek V4 API review, Flash vs Pro (independent)
andrew.ooo: V4 Pro vs V4 Flash, May 2026 (independent)
OpenRouter: V4 Flash pricing and benchmarks (aggregator)
Simon Willison on DeepSeek V4 (independent commentary)
Codersera: Run DeepSeek V4 Flash locally — full setup guide (companion install walkthrough)
Codersera: DeepSeek V4 complete guide (2026) (pillar)