DeepSeek V4: The Complete Guide (2026)

The definitive 2026 guide to DeepSeek V4 Pro and V4 Flash: architecture, benchmarks, pricing, hardware, API integration, and migration from Claude or GPT.

Last updated: May 1, 2026. We refresh this guide whenever DeepSeek ships a V4 patch, the major IDEs change their integration, or the pricing moves.

DeepSeek V4 is the open-weights model that quietly reset the price-performance frontier when it shipped on April 24, 2026. Two variants — V4 Pro (1.6T total parameters, 49B active) and V4 Flash (284B total, 13B active) — both ship with a 1M token context window, MIT licensing, and pricing that lands V4 Pro at roughly 1/7th the cost of Claude Opus 4.7 and 1/6th the cost of GPT-5.5 on coding workloads. This page is the single landing surface we point engineering teams to when they need to evaluate, deploy, or migrate to V4. Every section links to a deep-dive when you need to go further.

TL;DR — Should you care?

  • If you ship code with an LLM in the loop: yes. V4 Pro tops LiveCodeBench at 93.5, hits Codeforces ELO 3206 (ahead of GPT-5.5 at 3168), and is statistically tied with Claude Opus 4.7 on SWE-bench Verified (80.6 vs 80.8) — at a small fraction of the cost.
  • If you self-host: V4 Flash is the practical target. It fits on a single H100 only as a tight INT4 build; native FP4+FP8 mixed quantization wants a single H200 (tight) or, comfortably, 4×H200. V4 Pro requires 8×H100 minimum at FP8.
  • If you're paying Anthropic or OpenAI for coding agents: the migration is a one-flag change for OpenAI-SDK clients. Promo pricing through May 31, 2026 halves V4 Pro's output rate and cuts its cache-hit rate to roughly a quarter of the post-promo price, so start the eval now.
  • The catch: the multi-turn reasoning_content 400 error is real and breaks every popular client on first contact. Fix exists; details below.
  • Heads up: the legacy deepseek-chat and deepseek-reasoner aliases retire July 24, 2026 15:59 UTC. If your code hard-codes them, plan the cutover.

What is DeepSeek V4?

DeepSeek V4 is the fourth-generation flagship from DeepSeek AI, released as a preview on April 24, 2026. It supersedes the V3.2-Exp branch and effectively replaces the R1 reasoning line — V4's optional thinking mode covers what R1 was used for. There are two production variants plus base-model variants for fine-tuning:

| Variant | Total params | Active per token | Context | License | Best for |
|---|---|---|---|---|---|
| V4 Pro | 1.6T | 49B | 1M tokens | MIT | Frontier coding agents, complex reasoning, long-horizon tasks |
| V4 Flash | 284B | 13B | 1M tokens | MIT | Cheap inference, IDE integration, high-throughput batch, self-hosting |

If you're choosing between them, the Pro vs Flash comparison walks through cost-per-task math on real workloads. For the official release breakdown, see DeepSeek V4: Full Release Breakdown and V4 Is Here: Full Specs, Benchmarks, and API Guide.

Architecture: what changed vs V3.2

V4 keeps the Mixture-of-Experts backbone DeepSeek has been refining since V2 but introduces three load-bearing changes:

  1. DeepSeek Sparse Attention (DSA). A hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Net effect at 1M context: roughly 27% of the per-token FLOPs and 10% of the KV cache footprint versus V3.2. This is what makes a 1M context window economically viable rather than a marketing bullet.
  2. Three reasoning modes. Non-Think, Think High, Think Max. Default is Think enabled. Unlike R1, V4 supports tool calls in thinking mode — previously you had to choose between reasoning and tools.
  3. Manifold-Constrained Hyper-Connections (mHC) for training stability. Mostly relevant if you're fine-tuning; doesn't change inference behavior.

If you're trying to decide whether to upgrade an existing V3.2 deployment, the V4 vs V3.2 changes guide walks through the breaking changes (mostly API-level: reasoning_content behavior, prefix cache hashing). For the family-wide view across the whole DeepSeek lineup, the V3 vs V4 deep dive covers architecture, benchmarks, and pricing side-by-side.

Benchmarks (the real numbers)

Numbers below are from DeepSeek's official model card on HuggingFace, cross-validated against ArtificialAnalysis where independent evals exist. AIME, MATH-500, and IFEval are not in the official card and we've omitted them rather than quote secondary sources.

| Benchmark | V4 Pro (Max) | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.6 | 80.8 | – | – |
| SWE-bench Pro | 55.4 | 64.3 | 58.6 | 54.2 |
| LiveCodeBench | 93.5 | 88.8 | 91.7 | – |
| Codeforces ELO | 3206 | – | 3168 | – |
| GPQA Diamond | 90.1 | 94.2 | 93.6 | – |
| MMLU-Pro | 87.5 | – | – | – |
| GSM8K | 92.6 | – | – | – |
| TerminalBench 2.0 | 67.9 | – | – | – |
| BrowseComp | 83.4 | 79.3 | 84.4 | – |
| SimpleQA-Verified | 57.9 | – | – | 75.6 |
| MRCR (1M long-context) | 83.5 | – | – | – |
| CorpusQA (1M long-context) | 62.0 | – | – | – |

(– means not reported in the sources above.)

The honest read: V4 Pro leads on coding-style benchmarks (LiveCodeBench, Codeforces) and is essentially tied with Claude Opus 4.7 on SWE-bench Verified. It loses on pure factual recall (SimpleQA-Verified, where Gemini 3.1 Pro dominates) and on hard multi-step reasoning (GPQA Diamond, where Claude leads). On SWE-bench Pro — the harder agent-style coding eval — V4 Pro is meaningfully behind Claude Opus 4.7 (55.4 vs 64.3). For long-horizon, tool-heavy agentic loops, Claude is still the more reliable choice; for raw coding velocity at scale, V4 Pro wins.

For deeper comparisons see DeepSeek V4 Pro Review, V4 vs Claude Opus 4.7 head-to-head, V4 vs GPT-5.5 and GPT-5.5 Pro, and the broader V4 vs Qwen, Kimi, MiniMax, GPT, Claude comparison.

Pricing

API pricing on api.deepseek.com (USD per 1M tokens). Promo and standard prices are listed side by side because a launch promo runs through May 31, 2026 15:59 UTC:

| Model | Cache hit (promo) | Cache miss (promo) | Output (promo) | Cache hit (standard) | Cache miss (standard) | Output (standard) |
|---|---|---|---|---|---|---|
| V4 Pro | $0.0036 | $0.435 | $0.87 | $0.0145 | $0.145 | $1.74 |
| V4 Flash | $0.014 | $0.14 | $0.28 | $0.014 | $0.14 | $0.28 |
| V3.2-Exp | $0.027 | $0.27 | $1.10 | – | – | – |

For comparison: Claude Opus 4.7 is $15 / $75 per million in/out, GPT-5.5 is $5 / $30. V4 Pro at standard pricing is roughly 35× cheaper on input and 17× cheaper on output than GPT-5.5. During the promo window it's effectively 35× cheaper on input and 86× cheaper on output than Claude Opus 4.7. Neither of those ratios is a typo.

The cache-hit price is the load-bearing detail. DeepSeek cut the cache-hit price to one-tenth of its launch level on April 26, 2026, so tokens served from the prefix cache are billed at a small fraction of the cache-miss input rate. Agentic loops with stable system prompts benefit most: well-structured coding agents typically see >70% cache hit rates. For end-to-end pricing math including realistic cache-hit modeling, see How to Use DeepSeek V4 API: Complete Developer Guide.
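Back-of-envelope: at a given prefix-cache hit rate, the blended input price is just a weighted average of the two rates in the table. A quick sketch using the promo V4 Pro numbers (the hit rates shown are illustrative):

# Blended input price per 1M tokens at a given prefix-cache hit rate,
# using the promo V4 Pro numbers from the table above.
CACHE_HIT = 0.0036    # $ per 1M input tokens served from the prefix cache
CACHE_MISS = 0.435    # $ per 1M input tokens processed fresh

def blended_input_cost(hit_rate: float) -> float:
    return hit_rate * CACHE_HIT + (1 - hit_rate) * CACHE_MISS

for rate in (0.0, 0.5, 0.7, 0.9):
    print(f"{rate:.0%} hit rate -> ${blended_input_cost(rate):.3f} per 1M input tokens")
# A 70% hit rate lands around $0.13 per 1M input tokens, far below the cache-miss price.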

Hardware requirements (self-hosting)

Numbers below assume vLLM 0.20.0+ with V4's native FP4+FP8 mixed quantization (FP4 for experts, FP8 for attention/norm/router). Community AWQ INT4 builds for V4 Flash exist with ~5% quality loss but aren't the official recipe.

| Hardware | V4 Flash | V4 Pro | Notes |
|---|---|---|---|
| RTX 4090 / 5090 (24–32GB) | Doesn't fit | Doesn't fit | Even Flash needs ~158GB at native quant |
| Single H100 (80GB) | INT4 only, tight | Doesn't fit | Use AWQ build; expect ~5% quality loss vs native |
| Single H200 (141GB) | FP4+FP8 native, tight | Doesn't fit | Recommended dev target for Flash |
| 2× H100 (160GB) | FP8 native, comfortable | Doesn't fit | Cleanest small-scale Flash deploy |
| 4× H200 / GB200 NVL4 tray | Official vLLM recipe | Doesn't fit | Production Flash target |
| 8× H100 (640GB NVLink) | Overkill | FP8 minimum, capped ~800K context | Production Pro entry point |
| 8× H200 / DGX H200 | Overkill | Official vLLM recipe, full 1M ctx | Recommended Pro deploy |
| 8× B200 / B300 | Overkill | Optimal — DP+EP | Highest throughput single-node |
| RTX PRO 6000 Blackwell | Broken (vLLM #40821) | Broken | SM120 unsupported in vLLM compile path |
| AMD MI325X | Community WIP | Community WIP | No official recipe yet |
| Mac M3 Ultra 192GB (MLX) | 4-bit Flash works | No | Inference only; ~14 min prompt processing for 8K tokens on llama.cpp fork |

The practical version: V4 Flash on 4×H200 (or a single H200 for dev) is the answer for most self-hosted production workloads. Pro is for teams that already have a multi-GPU node and want frontier quality on tap. Avoid RTX PRO 6000 entirely until the vLLM Inductor compile-path issue is resolved — it crashes on load.

For the full step-by-step on bringing Flash up locally — including the vLLM config, FP4+FP8 quantization, and the launch flag set — follow Run DeepSeek V4 Flash Locally: Full 2026 Setup Guide. For the architectural deep-dive on why Flash is the right choice for most teams, see DeepSeek V4 Flash: The Practical Deep Dive.

API integration

DeepSeek exposes both an OpenAI-compatible base URL (https://api.deepseek.com) and an Anthropic-compatible base URL (https://api.deepseek.com/anthropic). For most teams, OpenAI compat is the path of least resistance — your existing SDK works with two changes:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",   # or "deepseek-v4-flash"
    messages=[
        {"role": "system", "content": "You are an expert Python engineer."},
        {"role": "user", "content": "Refactor this function to be O(n)..."},
    ],
    extra_body={"thinking": "high"},   # "off", "high", or "max"
)
print(resp.choices[0].message.content)
print(resp.choices[0].message.reasoning_content)  # populated when thinking mode is on

JSON mode (response_format={"type": "json_object"}), tool calling (tools=[...] with optional strict mode), and SSE streaming all work as expected. V4 supports tool calls in thinking mode — this was R1's biggest limitation and is now fixed.
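Here's what that looks like in practice: a minimal tool-calling sketch reusing the client from above. The get_file schema is illustrative, not part of DeepSeek's API, and the strict flag placement follows the OpenAI convention.

# Strict tool calling against the OpenAI-compatible endpoint, reusing `client` from above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_file",
        "description": "Return the contents of a file in the repo.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        "strict": True,
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "What does config/settings.py set LOG_LEVEL to?"}],
    tools=tools,
    extra_body={"thinking": "high"},  # tool calls are allowed in thinking mode on V4
)

for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)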

The reasoning_content 400 error

This is the single biggest gotcha and the one that breaks every popular client on first contact. Worth understanding in detail.

When V4 returns a response with thinking mode enabled, the message object contains both content (final answer) and reasoning_content (the chain-of-thought). On the next request in the same conversation, you must round-trip reasoning_content back as part of the assistant message. If you only pass content, the API responds with 400 Bad Request: "reasoning_content must be passed back".

This is the opposite of how R1 worked. R1 rejected requests that included reasoning_content on prior turns. V4 requires it. Most clients still hard-code R1's behavior:

  • LiteLLM — issue #26395; strips reasoning_content when serializing message history. Fix is a one-line patch.
  • OpenCode — issues #24190, #24722; documented Anthropic-API workaround.
  • OpenClaw / Kilo Code / Hermes-agent — same root cause, separate issue trackers.
  • Cursor — breaks intermittently on long sessions; community proxy deepseek-cursor-proxy works around it.

There's also a tool-call wrinkle: even on assistant turns where there was no thinking (e.g., a pure tool call), some clients need to include reasoning_content: "" (empty string, not null) to satisfy V4's validator on the next turn. Our complete API developer guide walks through the proxy pattern that fixes this once and for all.
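If you manage your own message history, the fix is mechanical: copy reasoning_content (or an empty string) onto every assistant turn before sending the next request. A minimal sketch, reusing the client from the API example above:

# Round-trip reasoning_content on every assistant turn (V4 requires it; R1 rejected it).
# Use an empty string, not None, for assistant turns that had no thinking.
messages = [{"role": "user", "content": "Find the bug in utils/cache.py"}]

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=messages,
    extra_body={"thinking": "high"},
)

msg = resp.choices[0].message
messages.append({
    "role": "assistant",
    "content": msg.content,
    # This is the field most clients drop, triggering the 400 on the next turn:
    "reasoning_content": getattr(msg, "reasoning_content", None) or "",
})

messages.append({"role": "user", "content": "Now write a regression test for it."})
resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=messages,
    extra_body={"thinking": "high"},
)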

Model names on the API

  • deepseek-v4-pro — V4 Pro (current alias, stable)
  • deepseek-v4-flash — V4 Flash (current alias, stable)
  • deepseek-chat — legacy alias, currently routes to V4 Flash non-thinking mode. Retires July 24, 2026 15:59 UTC.
  • deepseek-reasoner — legacy alias, currently routes to V4 Flash thinking mode. Retires July 24, 2026 15:59 UTC.

Heads up: if your code hard-codes deepseek-chat or deepseek-reasoner, plan the cutover before July 24. Pricing already changed — many users were silently upgraded from V3.2 to V4 Flash on April 24.
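If a hard flag-day is awkward, a small shim that rewrites the legacy names (per the routing above) buys time. The helper below is illustrative, not an official compatibility layer, and the thinking level chosen for deepseek-reasoner traffic is your call:

# Map retiring aliases to explicit V4 model names ahead of the July 24, 2026 cutoff.
# Routing per the list above: deepseek-chat -> Flash non-thinking, deepseek-reasoner -> Flash thinking.
LEGACY_ALIASES = {
    "deepseek-chat": ("deepseek-v4-flash", "off"),
    "deepseek-reasoner": ("deepseek-v4-flash", "high"),  # thinking level shown here is a choice, not mandated
}

def resolve_model(name: str) -> tuple[str, str | None]:
    """Return (model, thinking_mode); thinking_mode is None for non-legacy names."""
    if name in LEGACY_ALIASES:
        return LEGACY_ALIASES[name]
    return name, None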

Local deployment options

| Stack | V4 Flash | V4 Pro | Notes |
|---|---|---|---|
| vLLM 0.20.0+ | Production-ready (official recipe) | Production-ready (official recipe) | The reference deployment. Required flags: --trust-remote-code --kv-cache-dtype fp8 --enable-expert-parallel --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 |
| SGLang 0.4+ | Production-ready | Production-ready | Better burst / short-request throughput than vLLM on agentic loops |
| TGI (HuggingFace) | Beta | Beta | 1M context not officially supported yet |
| Ollama | Experimental | No | llama.cpp MoE routing not yet optimal; quality regressions vs vLLM |
| LM Studio | Yes | No | Single-user; not for production |
| MLX (Mac) | 4-bit Flash works on M3 Ultra 128GB+ | No | Inference only; 192GB recommended for headroom |
| llama.cpp | Experimental fork (antirez/llama.cpp-deepseek-v4-flash) | No | Targets MacBook 128GB w/ 2-bit experts |

One sharp edge worth flagging: there is no standard HuggingFace Jinja chat template for V4. Naive tokenizer pipelines that assume one will silently produce malformed prompts. Use the encoding scripts that ship in each model's HuggingFace repo, not generic chat templates.

For most teams the choice is vLLM (familiar, broad ecosystem, official recipes) or SGLang (faster on coding-agent workloads where long prefix caches dominate). Our local setup guide walks through the vLLM path end-to-end with the correct flags.
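Once the server is up, vLLM exposes an OpenAI-compatible endpoint, so the client code from the API section works unchanged against your own hardware. A minimal sketch, assuming the default port and that the model was served under its HuggingFace name:

# Point the OpenAI SDK at a locally served V4 Flash instead of api.deepseek.com.
# Assumes vLLM's OpenAI-compatible server on its default port 8000.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = local.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",  # must match the name the server was launched with
    messages=[{"role": "user", "content": "Summarize the retry policy in services/http.py."}],
)
print(resp.choices[0].message.content)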

IDE and tool integration

V4 works in every popular agentic-coding IDE; each one has its own integration sharp edge tied to the reasoning_content handshake.

  • Claude Code — official integration via the Anthropic-compat base URL. DeepSeek publishes a dedicated Claude Code guide. The pitch: 1/7 the cost of Opus 4.7 on the same Claude Code workflows.
  • Cursor — custom model via OpenAI-compat (Cmd+, → Models). Long sessions break on the multi-turn 400; community proxy deepseek-cursor-proxy patches it.
  • Cline / Roo Code / Kilo Code — native DeepSeek provider. Disable thinking mode for now or pin to a build with the round-trip fix.
  • Aider — works out of the box: aider --model deepseek/deepseek-v4-pro.
  • Continue.dev — OpenAI-compatible config. Same multi-turn caveat as Cursor.
  • OpenCode / OpenClaw / Hermes-agent — all currently affected by the 400 error in multi-turn agent loops. Watch each project's tracker for May 2026 patches.

Where V4 actually shines

Agentic coding loops

V4 Pro's combination of high LiveCodeBench (93.5), Codeforces 3206, tool-calls in thinking mode, and 1/7 Opus pricing makes it the first model where running a long-horizon coding agent is economically rational. At Claude Opus 4.7 prices, an 8-hour autonomous run typically costs $50–200; the same workload on V4 Pro promo lands at $1.50–6.

Codebase Q&A without RAG

V4's MRCR 1M score of 83.5 and CorpusQA 1M of 62.0 are strong enough that fitting a whole service-sized codebase (~750K tokens) into a single prompt is a viable replacement for a maintained RAG pipeline. Cache-hit pricing makes re-querying the same repo prefix cost roughly $0.0036 per million tokens during the promo ($0.0145 after), which is cheap enough for routine workflows.
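The mechanics are simple: build one deterministic blob of the repo (stable ordering keeps the prefix cache hitting), put it in front, and append the question. The file-walking helper below is illustrative, as are the extension filter and the reuse of the client from the API section:

from pathlib import Path

def repo_as_prompt(root: str, exts=(".py", ".md", ".toml")) -> str:
    """Concatenate a repo into one deterministic blob (stable ordering = stable prefix cache)."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"### {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

corpus = repo_as_prompt("./my-service")  # aim for well under the 1M-token window

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "Answer questions about the following codebase.\n\n" + corpus},
        {"role": "user", "content": "Where is the retry policy for outbound HTTP configured?"},
    ],
)
print(resp.choices[0].message.content)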

Long-document analysis

Legal review, log triage, multi-hour transcript analysis — anywhere the prior pattern was "summarize then re-summarize," V4 lets you keep the full corpus in context. Cache the document, query repeatedly cheaply.

Fine-tuning

NVIDIA shipped an official NeMo AutoModel recipe for V4 Flash on Blackwell GPUs on April 25, 2026 — handles V4-specific Q-LoRA / O-LoRA quirks. Pro fine-tuning needs a multi-node setup; most teams should fine-tune Flash and use Pro as a teacher.

High-throughput batch generation

V4 Flash at $0.14/$0.28 makes it the cheapest viable model for batch workloads (synthetic data, classification at scale, content reformatting). Self-hosted Flash on a 4×H200 node reaches API break-even somewhere in the tens of billions of input tokens per month, depending on what the node costs you.
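The break-even arithmetic is a one-liner once you pin a node cost. The hourly rate below is an assumption, not a quote; substitute what you actually pay:

# Rough self-hosting break-even for Flash batch workloads.
# NODE_COST_PER_HOUR is an assumed 4×H200 rental figure; replace with your real number.
NODE_COST_PER_HOUR = 14.0   # assumption: roughly $3.50 per H200-hour x 4 GPUs
API_INPUT_PER_1M = 0.14     # V4 Flash cache-miss input, $ per 1M tokens

monthly_node_cost = NODE_COST_PER_HOUR * 24 * 30
breakeven_m_tokens = monthly_node_cost / API_INPUT_PER_1M  # in millions of tokens
print(f"Node cost ${monthly_node_cost:,.0f}/month -> break-even around {breakeven_m_tokens:,.0f}M input tokens/month")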

Known issues and limitations

  • Multi-turn thinking-mode 400 error (described above). Real production hazard.
  • vLLM #40821 / #41027 / #26211 — RTX PRO 6000 Blackwell (SM120) crashes on V4 load due to a torch.compile / Inductor auto_functionalized assertion. No fix shipped. Avoid this hardware until resolved.
  • Server busy (HTTP 503/429) at peak hours on the official API. Rate limits are dynamic and unpurchasable; production users front the API with retries, exponential backoff, and Flash fallback (a minimal wrapper is sketched after this list). Third-party hosts (Together AI, DeepInfra, NVIDIA NIM, OpenRouter) are common backstops.
  • Tool-calls leaking into content — intermittent regression where the model emits tool calls as plain text instead of via the tool_calls field. Defensive parsing required.
  • Multimodal gap — V4 is text-only. The separate vision variant lags GPT-5.5 native and Gemini 3.1 Pro substantially. Not a fit for image-heavy agents.
  • Raw factuality — SimpleQA-Verified 57.9 vs Gemini 3.1 Pro 75.6. V4 hallucinates more on closed-book factual queries than the closed leaders.
  • Tool-use reliability on long-horizon agents still trails Claude Opus 4.7 (visible in the SWE-bench Pro gap: 55.4 vs 64.3). For ReAct-style agents this is fine; for tool-heavy structured workflows, benchmark before committing.
  • 402 Insufficient Balance — not retryable; top up.
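For the 503/429 bullet above, the standard pattern looks roughly like this; the attempt count, delays, and fallback model are all knobs to tune (a sketch only, reusing the OpenAI-SDK client):

# Retry-with-backoff plus Flash fallback for peak-hour 503/429s.
import time
import openai

def robust_completion(client, messages, model="deepseek-v4-pro", attempts=4):
    delay = 1.0
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (openai.RateLimitError, openai.InternalServerError, openai.APITimeoutError):
            if attempt == attempts - 1:
                break
            time.sleep(delay)
            delay *= 2  # exponential backoff
    if model != "deepseek-v4-flash":
        # Last resort: degrade to Flash rather than fail the request outright.
        return robust_completion(client, messages, model="deepseek-v4-flash", attempts=2)
    raise RuntimeError("DeepSeek API unavailable after retries")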

Migrating from Claude or GPT to V4

  1. Swap base URL and model name. Existing OpenAI-SDK code keeps working — point at https://api.deepseek.com/v1, change model= to deepseek-v4-pro or deepseek-v4-flash.
  2. Re-tune system prompts. V4 follows instructions slightly more literally than Claude. Prompts that rely on Claude's "infer my intent" behavior need to be made explicit.
  3. Decide thinking-mode policy. Default off for chat-style products (latency); default on for coding agents (quality). Use thinking: "max" for the hardest tasks.
  4. Implement reasoning_content round-tripping if you have any multi-turn flows. Skipping this will cost you a day of debugging.
  5. Re-run your eval suite. Don't trust the published benchmarks — your workload is different. Especially scrutinize tool-heavy and long-horizon scenarios.
  6. Plan for self-hosting if your token volume justifies it. Crossover with the API at standard pricing is roughly $30K/month in spend; with promo pricing it's much higher, so most teams should start on the API.

The migration is genuinely the kind of work where having an engineer who has done it before saves weeks. Hiring a Codersera-vetted Python or ML engineer typically gets you someone who has already shipped this exact migration.

Which DeepSeek model should you use?

  • Coding agents, frontier quality matters most: V4 Pro (review)
  • Coding agents, cost matters most: V4 Flash (deep dive)
  • Reasoning-heavy non-coding work: V4 Pro with thinking: "max" — there's almost no reason to use R1 anymore
  • Already on V3.2, deciding whether to upgrade: read the V3.2 vs V4 changes guide — short answer: yes, unless you have V3.2-specific tokenizer hard-coding
  • Comparing against the closed-source frontier: see V4 vs Qwen, Kimi, MiniMax, GPT, Claude

Frequently asked questions

Is DeepSeek V4 free to use?

The weights are MIT-licensed and free. The hosted API is paid, but with launch-promo pricing through May 31, 2026 it's an order of magnitude cheaper than competing frontier APIs.

Is DeepSeek V4 actually open source?

Yes. Both Pro and Flash are MIT-licensed with weights on HuggingFace at deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash. Base variants for fine-tuning ship alongside. You can self-host commercially without contacting DeepSeek.

What's the difference between V4 Pro and V4 Flash?

Pro is 1.6T total / 49B active and tops the coding benchmarks. Flash is 284B total / 13B active, slightly behind on quality, dramatically cheaper to self-host. The Pro vs Flash comparison walks through cost-per-task math.

Can DeepSeek V4 run on a single GPU?

V4 Flash on a single H100 (80GB) at INT4 — yes, but tight. Single H200 (141GB) is the cleanest single-GPU target. V4 Pro requires multi-GPU regardless of quantization. RTX 4090 / 5090: no.

Can DeepSeek V4 run on a Mac?

V4 Flash on M3 Ultra 192GB via MLX (4-bit) — yes, but only for inference and prompt processing is slow (~14 minutes for 8K tokens on the llama.cpp fork). Practical only for development, not production. V4 Pro: no.

How does V4 compare to Claude Opus 4.7 for coding?

SWE-bench Verified: tied (80.6 vs 80.8). LiveCodeBench: V4 ahead (93.5 vs 88.8). Codeforces ELO: V4 ahead (3206 vs not published for Opus). SWE-bench Pro (long-horizon): Opus ahead (64.3 vs 55.4). Pricing: V4 Pro is roughly 35–86× cheaper than Opus during the promo window. Full breakdown in V4 vs Claude Opus 4.7.

How does V4 compare to GPT-5.5?

Very close on most benchmarks; V4 Pro slightly ahead on Codeforces, slightly behind on GPQA. V4 is cheaper and open-weight; GPT-5.5 has better tool-use reliability and multimodal. See V4 vs GPT-5.5 same-week showdown.

What is the context window of DeepSeek V4?

1,048,576 tokens (1M) on both variants natively. NIAH retrieval >95% to 900K. Reasoning across the full 1M weakens past ~500K, but for codebase Q&A and long-document tasks this is fine.

How much does the DeepSeek V4 API cost per million tokens?

Promo (until 2026-05-31): V4 Pro $0.435 input / $0.87 output / $0.0036 cache hit. Flash $0.14 / $0.28 / $0.014. Standard (post-promo): V4 Pro $0.145 / $1.74 / $0.0145. Flash unchanged.

Why am I getting a 400 reasoning_content error?

Multi-turn conversations with thinking mode require you to round-trip reasoning_content back as part of the assistant message. Most clients (LiteLLM, OpenCode, OpenClaw, Cursor) don't, which causes the 400. The API developer guide has the full fix.

How do I use V4 in Cursor or Claude Code?

Claude Code: official integration via https://api.deepseek.com/anthropic. Cursor: custom model via OpenAI-compat at https://api.deepseek.com/v1 — be aware of long-session breakage and use the community proxy if needed.

Does V4 support tool calls and JSON mode?

Yes to both. Strict-mode tool calls supported. Critically, V4 supports tool calls in thinking mode — R1's biggest limitation is now gone.

When will deepseek-chat and deepseek-reasoner retire?

July 24, 2026 15:59 UTC. They currently alias to V4 Flash non-thinking and thinking respectively. Migrate to explicit deepseek-v4-flash or deepseek-v4-pro now.

Can V4 replace RAG with its 1M context window?

For most service-sized codebase Q&A: yes. MRCR 1M of 83.5 and cache-hit pricing make it economical. For multi-document corpora larger than 1M tokens or strict citation requirements, RAG still wins.

Does V4 support image / multimodal input?

Not in the main V4 release. There's a separate vision variant but it lags GPT-5.5 and Gemini 3.1 Pro substantially. For multimodal use cases, look elsewhere.

How do I fine-tune DeepSeek V4?

Use NVIDIA's official NeMo AutoModel recipe for V4 Flash on Blackwell GPUs. Pro fine-tuning needs multi-node — most teams fine-tune Flash and use Pro as a teacher.

Next steps

If you're evaluating V4 for production:

  1. Read the V4 Pro review for the full technical assessment.
  2. Run the local setup guide on a development H200 to validate quality on your workload.
  3. Use the API developer guide to wire V4 into a non-trivial agent and pressure-test the reasoning_content handshake.
  4. If you're migrating from Claude or GPT and don't have an engineer who has done this before, hire a Codersera-vetted Python or ML engineer. Most of our network has shipped at least one production V4 deployment.