DeepSeek V4-Pro 75% Price Cut Goes Permanent: What It Means for Developers (May 2026)

DeepSeek made its 75% V4-Pro discount permanent on May 22, 2026. Standing rates: $0.435/M input, $0.87/M output. Here is what changed, the new cost-per-quality math vs Claude Opus 4.7 and GPT-5.5, and the migration code.

Published 28 May 2026 • Updated 28 May 2026 • 12 min read

Quick answer. On May 22, 2026 DeepSeek made the 75% V4-Pro promotional discount its permanent standing price. V4-Pro is now $0.435/M cache-miss input, $0.003625/M cached input, and $0.87/M output — roughly 11.5x cheaper than GPT-5.5 on input and 34.5x cheaper on output. V4-Flash pricing is unchanged. The legacy deepseek-chat and deepseek-reasoner aliases retire July 24, 2026; switch to deepseek-v4-pro or deepseek-v4-flash explicitly before then.

For a year, DeepSeek's pricing playbook has been the same: announce a frontier-tier model, then ship a time-boxed promotional discount that anchors developer expectations and trains buyers on a new floor. V4-Pro followed the same script when it launched on April 24, 2026 — full list at $1.74 / $3.48 per million tokens, with a 75% promo discount running through May 31 at 15:59 UTC.

On May 22, DeepSeek broke that pattern in the most useful way possible for everyone using the API. The team announced that the promotional rate would not roll back when the timer hit zero. Instead, the discounted price becomes the permanent list price: $0.435/M input, $0.87/M output, and $0.003625/M cached input. There is no expiry.

That distinction matters more than the headline number. A discount that expires is a marketing event. A discount that does not expire is a market floor. DeepSeek has published a reference price every Chinese frontier lab will need to match, and put cost-rationalization pressure on OpenAI, Anthropic, and Google that none of them has answered as of this writing.

This guide covers what actually changed, the new cost-per-quality math against Claude Opus 4.7 / GPT-5.5 / Qwen3.7 Max, the rate-limit and concurrency profile to plan for, and the working code to migrate a Claude or OpenAI app over without rewriting your stack.

What exactly changed on May 22, 2026?

The change is narrow and surgical. Only one model's pricing was modified: DeepSeek V4-Pro. The full announcement:

Cache-miss input: $1.74 → $0.435 per million tokens (−75%)
Cached input: $0.0145 → $0.003625 per million tokens (−75%)
Output: $3.48 → $0.87 per million tokens (−75%)
Effective date: permanent, no expiry. Was originally set to expire 2026-05-31 15:59 UTC.

What did not change:

V4-Flash pricing stays at launch rates: $0.14/M input, $0.0028/M cached input, $0.28/M output.
The model itself. No weights update, no new checkpoint, no benchmark change. This is purely a pricing event.
Context window: still 1M tokens.
License: still MIT. Open weights still on HuggingFace at deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash.
API endpoints: still OpenAI-compatible at https://api.deepseek.com and Anthropic-compatible at https://api.deepseek.com/anthropic.

For context on the model itself, our DeepSeek V4-Pro review covers the architecture, the SWE-bench / GPQA / AIME scores, and the head-to-head performance — those numbers are unchanged. The pillar guide at DeepSeek V4: pro, flash, pricing and benchmarks gives the broader 2026 landscape including V4-Flash.

Why did DeepSeek make this permanent?

Three factors converged.

First, the cost structure underneath V4 supports it. Unlike V3.2, which ran primarily on Nvidia GPUs, the V4 family was architected from the start around Huawei Ascend 950 and 950PR accelerators. Add in the new hybrid Compressed Sparse Attention (CSA) plus Heavily Compressed Attention (HCA), which the team reports needs only 27% of the per-token inference FLOPs and 10% of the KV-cache memory of V3.2 at 1M-token context, and DeepSeek's per-token serving cost is structurally lower than that of US labs paying retail H100 / H200 hours.

Second, the API price war has a strategic logic. Open weights mean any developer can self-host. The API has to compete with "download the file and run vLLM yourself." Anchoring the API at near-cost cuts off the self-host arbitrage and keeps the user funnel on DeepSeek's infrastructure where they own the telemetry, the upgrade path, and the future revenue.

Third, locking in the floor changes how every downstream platform plans. Cursor, Continue.dev, Claude Code wrappers, OpenRouter resellers: every tool that bundles DeepSeek into a paid plan has been pricing the May 31 expiry into its margins. Making the cut permanent unblocks 12-month enterprise commitments that wouldn't have been signed under "effective until the promo runs out."

What are the new rates compared to Claude Opus 4.7 and GPT-5.5?

The standing rate card across the frontier-adjacent tier as of May 28, 2026:

Model	Input ($/M)	Output ($/M)	Cached input ($/M)
DeepSeek V4-Pro	$0.435	$0.87	$0.003625
DeepSeek V4-Flash	$0.14	$0.28	$0.0028
Qwen3.7 Max	$2.50	$7.50	—
Claude Opus 4.7	$5.00	$25.00	—
GPT-5.5	$5.00	$30.00	—
GPT-5.5 Pro	$30.00	$180.00	—

A blended 1M-input + 1M-output workload (the common reference unit) costs:

V4-Pro: $1.305 (with 75% of input served from cache, drops to about $0.98)
Qwen3.7 Max: $10.00 (7.7x)
Claude Opus 4.7: $30.00 (23x)
GPT-5.5: $35.00 (27x)
GPT-5.5 Pro: $210.00 (161x)

The gap on output tokens — where most chat dollars actually go — is even more striking. V4-Pro is 28.7x cheaper per output token than Opus 4.7, 34.5x cheaper than GPT-5.5, and 207x cheaper than GPT-5.5 Pro.
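The blended figures and multiples above can be reproduced with a few lines of arithmetic. The sketch below copies the rates from the rate card; the `blended_cost` helper and its `cached_fraction` parameter are illustrative names, and models with no listed cached rate are assumed to get no cache discount.

```python
# Standing rates in USD per million tokens, copied from the rate card above.
RATES = {
    "DeepSeek V4-Pro": {"input": 0.435, "output": 0.87, "cached": 0.003625},
    "Qwen3.7 Max": {"input": 2.50, "output": 7.50},
    "Claude Opus 4.7": {"input": 5.00, "output": 25.00},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "GPT-5.5 Pro": {"input": 30.00, "output": 180.00},
}


def blended_cost(model, input_m=1.0, output_m=1.0, cached_fraction=0.0):
    """Cost in USD for input_m / output_m million tokens.

    cached_fraction is the share of input tokens billed at the cached rate.
    Assumption: a model without a listed cached rate gets no discount.
    """
    rate = RATES[model]
    cached_rate = rate.get("cached", rate["input"])
    input_rate = (1 - cached_fraction) * rate["input"] + cached_fraction * cached_rate
    return input_m * input_rate + output_m * rate["output"]


baseline = blended_cost("DeepSeek V4-Pro")
for name in RATES:
    cost = blended_cost(name)
    print(f"{name}: ${cost:.3f} blended ({cost / baseline:.1f}x V4-Pro)")
```

Passing `cached_fraction=0.75` models a workload where three quarters of the input is served from cache, and changing `input_m` and `output_m` lets you plug in your own traffic mix instead of the 1M + 1M reference unit.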

Where does V4-Pro actually win on cost-per-quality?

Pricing alone is a vanity number; what matters is whether the cheaper tokens are good enough for the workload. V4-Pro's published benchmarks (mostly third-party-verified by llm-stats.com and artificialanalysis.ai, some still internal claims) sit close enough to the frontier that the cost gap dominates for most production traffic:

SWE-bench Verified: 80.6%
LiveCodeBench Pass@1: 93.5
Terminal-Bench 2.0: 67.9%
SWE-Bench Pro: 55.4%
MMLU-Pro: 87.5%
GPQA Diamond: 90.1%
AIME 2025: 89.3
Codeforces ELO: 3206 (ahead of GPT-5.4 xHigh at 3168 and Gemini 3.1 Pro at 3052)

V4-Pro is the obvious choice when:

You are doing bulk classification, extraction, summarization, or RAG-style document Q&A. The quality gap vs Opus / GPT-5.5 is invisible in production for this kind of work, and the blended cost gap is 23-27x.
You need single-turn or short-horizon code generation. 80.6% SWE-bench Verified at $0.87/M output is a price-quality point nothing else on the market touches.
You are processing long documents and can structure the workload to re-use a large shared prefix. Cached input at $0.003625/M means you can keep an entire 800-page contract in context for a fraction of a cent per query.
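To make the shared-prefix case concrete, here is a rough per-query estimate for the 800-page contract. The rates come from the rate card; the 500-tokens-per-page figure is an assumption for dense English prose, and `prefix_cost_usd` is an illustrative helper, so measure your own documents with a tokenizer before budgeting.

```python
CACHE_MISS_PER_M = 0.435   # USD per million cache-miss input tokens (rate card)
CACHED_PER_M = 0.003625    # USD per million cached input tokens (rate card)
TOKENS_PER_PAGE = 500      # assumption: rough average, not a measured value


def prefix_cost_usd(pages, cached, tokens_per_page=TOKENS_PER_PAGE):
    """Input cost in USD of sending a shared document prefix once."""
    tokens = pages * tokens_per_page
    rate = CACHED_PER_M if cached else CACHE_MISS_PER_M
    return tokens / 1_000_000 * rate


first_query = prefix_cost_usd(800, cached=False)   # prefix not yet in cache
repeat_query = prefix_cost_usd(800, cached=True)   # prefix served from cache
print(f"first query:  ${first_query:.4f}")
print(f"repeat query: ${repeat_query:.5f}")
```

Only the prefix is counted here; the per-query question and the generated answer are billed on top at the normal input and output rates.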

Opus 4.7 or GPT-5.5 are still worth paying for when:

You are building long-horizon agents that chain hundreds of tool calls across hours. Reliability over long trajectories is still where the frontier models lead.
You are doing novel hard reasoning — Olympiad math, ARC-AGI-style tasks, original research synthesis. GPT-5.5 Pro and Opus 4.7 keep a meaningful edge.
You need multimodal input (vision + audio) as a first-class capability. V4 is text-first; multimodal is hinted but not at Pro tier yet.

Qwen3.7 Max is the other Chinese-frontier option to weigh: it edges V4-Pro by about 3.6 points on reasoning and 2.6 points on coding head-to-head, and generates roughly 3.8x more tokens per second. At a 7.7x cost premium, the cost-per-quality math still favors V4-Pro unless latency or agentic reliability is dominating your UX. The open-source LLMs landscape pillar covers the broader Chinese-frontier comparison if you want the full matrix.

How do you actually switch over?

The cheapest migration is the Anthropic-compatible endpoint if you already have a Claude codebase. Two lines change:

import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/anthropic",
)

resp = client.messages.create(
    model="deepseek-v4-pro",       # or use "claude-opus-4-7" — auto-routed to V4-Pro
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)

DeepSeek's Anthropic shim maps claude-opus-* names to V4-Pro and claude-haiku-* to V4-Flash, so you can A/B test by toggling base_url alone. Caveat: prompt-caching semantics differ. DeepSeek's cache is automatic and far cheaper per cached token, but cache_control blocks from Anthropic's spec are silently ignored — V4 routes through its own cache.
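One way to wire up that A/B toggle is a small helper that picks the client settings from a flag. This is a minimal sketch: `anthropic_client_kwargs` is an illustrative name, and it assumes the Anthropic key lives in `ANTHROPIC_API_KEY` and the DeepSeek key in `DEEPSEEK_API_KEY`.

```python
import os

DEEPSEEK_ANTHROPIC_URL = "https://api.deepseek.com/anthropic"


def anthropic_client_kwargs(use_deepseek, env=os.environ):
    """Return constructor kwargs for anthropic.Anthropic for either backend."""
    if use_deepseek:
        return {
            "api_key": env["DEEPSEEK_API_KEY"],
            "base_url": DEEPSEEK_ANTHROPIC_URL,
        }
    # Leaving base_url unset keeps the SDK pointed at Anthropic's own API.
    return {"api_key": env["ANTHROPIC_API_KEY"]}
```

You would then build the client with `Anthropic(**anthropic_client_kwargs(use_deepseek=True))` and leave the rest of the request code, including the model name, untouched.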

For OpenAI-compatible apps, the only easy-to-miss gotcha is the base URL:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",     # no /v1 suffix; the SDK appends /chat/completions to this base
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Hi"}],
)

Old V3.2-era guides often baked /v1 into the base URL, which 404s with the current SDK. If you are seeing inexplicable 404s on a working API key, that is almost always why.

What about Cursor, Claude Code, and OpenRouter?

Cursor IDE. Settings → Models → Add Model. Name: deepseek-v4-pro. OpenAI Base URL: https://api.deepseek.com (no /v1). Paste your DeepSeek key, click Verify. Two gotchas worth knowing up-front: Cursor's Composer panel hides DeepSeek's reasoning_content field, so use Chat if you want to actually see the chain of thought; and Tab autocomplete still routes to Cursor's proprietary fast model regardless of the custom-model setting. The full Cursor + DeepSeek V4 setup guide walks through the reasoning-proxy trick and the Composer caveat in detail.

Claude Code. Set ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic, ANTHROPIC_API_KEY=<your-deepseek-key>, and ANTHROPIC_MODEL=deepseek-v4-pro. Claude Code speaks the Anthropic Messages API natively, so this is a true drop-in. Same caching caveat as above — DeepSeek handles caching transparently and you do not need cache_control markers.

OpenRouter. Both models are live with multi-provider routing. Useful if you want automatic failover to a different DeepSeek host when the primary endpoint is rate-limiting you.

What are the rate limits at the new price?

DeepSeek does not publish fixed RPM or TPM tables. Instead, the API applies a dynamic concurrency cap per user_id that flexes with overall server load and your recent usage pattern. The currently published caps on the pricing page are:

V4-Pro: 500 concurrent requests per user_id
V4-Flash: 2,500 concurrent requests per user_id
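Because the cap is on concurrency rather than requests per minute, a client-side limit on in-flight requests is a natural fit. The sketch below uses only the standard library; `run_bounded` is an illustrative helper, and the right `limit` for your account is an assumption you should set well below the published cap.

```python
import asyncio


async def run_bounded(tasks, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Results come back in the same order as `tasks`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(task):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await task()

    return await asyncio.gather(*(guarded(task) for task in tasks))
```

Each entry in `tasks` would wrap one API call, for example a function that awaits an async client's request, and something like `asyncio.run(run_bounded(tasks, limit=100))` drives the whole batch.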

Practical implications. When you hit the dynamic cap, you get an HTTP 429. The cap can move during high-traffic windows — during the May 23-24 announcement surge, several developers reported 429 storms on Hacker News even on accounts that had been operating cleanly. Build a retry-with-backoff strategy from day one and keep a fallback path to Claude or GPT for the rare windows where DeepSeek is genuinely saturated.
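A retry-with-backoff wrapper can be as small as the sketch below. It is library-agnostic on purpose: `RateLimited` is a hypothetical marker exception that your own request function would raise when it sees an HTTP 429, and the attempt count and delays are illustrative defaults rather than values DeepSeek recommends.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical marker: raise this when the API answers HTTP 429."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call request() and retry on RateLimited with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller switch to a fallback
            # Exponential ceiling with full jitter, so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

When the final attempt still fails, the exception propagates, which is the natural place to route the request to your Claude or GPT fallback path.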

What is the July 24 alias retirement and why does it matter?

If your code calls deepseek-chat or deepseek-reasoner, you are on borrowed time. Those legacy aliases retire on 2026-07-24 at 15:59 UTC. Today they transparently route to V4-Flash. After the cutoff, they return a hard error, and any production traffic that still uses those names will fail.

The fix is mechanical: search your codebase and config for the old aliases, replace with deepseek-v4-pro or deepseek-v4-flash explicitly, deploy. Do this before mid-July if you want a clean buffer. Note that some popular wrappers (older versions of LangChain's ChatDeepSeek, some Cursor presets) still ship the legacy names by default — pin the version or override the model parameter.
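The search step can be scripted. The sketch below walks a directory tree and reports every line that still mentions a legacy alias; `find_legacy_aliases` and the list of file suffixes are illustrative choices, so extend the suffixes to match whatever config formats your project uses.

```python
import re
from pathlib import Path

# Matches the two legacy alias names called out above.
LEGACY_ALIAS = re.compile(r"\bdeepseek-(?:chat|reasoner)\b")

# Assumption: these are the file types worth scanning in a typical repo.
DEFAULT_SUFFIXES = (".py", ".ts", ".js", ".json", ".yaml", ".yml", ".toml")


def find_legacy_aliases(root, suffixes=DEFAULT_SUFFIXES):
    """Return (path, line_number, alias) for every legacy alias under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in LEGACY_ALIAS.finditer(line):
                hits.append((str(path), lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    for path, lineno, alias in find_legacy_aliases("."):
        print(f"{path}:{lineno}: {alias}")
```

Run it from the repository root, and remember that it will not see aliases injected at runtime through environment variables or a secrets manager, so check those separately.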

Can you self-host V4-Pro instead?

Yes, with caveats. Both V4-Pro and V4-Flash are open weights under MIT on HuggingFace at deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash. The full FP8 V4-Pro checkpoint is roughly 862 GB and needs about 900 GB of VRAM in production. A single 8x H100 node with 80 GB cards totals only 640 GB, so plan for two such nodes, 8x MI300X, or 2x Blackwell B200 nodes.

Quantized Q4_K_M lands V4-Pro in 48-80 GB VRAM (2-4x RTX 4090) but with a quality drop on long-context. For most individual developers, the practical local-dev target is V4-Flash on a single high-end GPU. vLLM ≥0.7.3 is the recommended inference framework — it supports V4's MoE expert parallelism, the hybrid CSA+HCA attention, and 1M-token KV cache management. Ollama works for prototypes but loses MoE-routing efficiency; do not put it under real production load.

At the permanent API price of $0.435 / $0.87 per million tokens, self-hosting is rarely the right economic call unless you have hard data-residency requirements or already own the GPUs and need to amortize them. The break-even point against DeepSeek's API has been pushed considerably further out by the May 22 announcement. For broader hardware planning across the open-weight ecosystem, the open-source LLMs landscape 2026 pillar walks through the full quantization-vs-quality tradeoffs.
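The break-even comparison comes down to two inputs you have to measure yourself: what the hardware costs per hour and how many tokens per second it sustains. The sketch below shows the arithmetic only; both function names are illustrative, and any numbers you pass in are your own, since the article does not publish throughput or hosting figures.

```python
def self_host_cost_per_million(hourly_cost_usd, tokens_per_second):
    """USD per million tokens for a cluster at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def break_even_tokens_per_second(hourly_cost_usd, api_price_per_million):
    """Sustained throughput at which self-hosting matches the API price."""
    tokens_per_hour = hourly_cost_usd / api_price_per_million * 1_000_000
    return tokens_per_hour / 3600


# Hypothetical placeholder inputs; replace with your measured values.
example_hourly_cost = 36.0     # USD per hour for the whole cluster
example_throughput = 1000.0    # sustained output tokens per second
print(self_host_cost_per_million(example_hourly_cost, example_throughput))
print(break_even_tokens_per_second(example_hourly_cost, 0.87))
```

The comparison is deliberately simplified: it ignores idle time, engineering overhead, and the separate input and cached-input rates, all of which tend to push the real break-even further out.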

Does this signal a V5 or V4.5 release soon?

No public announcement of either as of this writing. The May 22 event was a pricing change, not a model release. DeepSeek's roadmap pattern has been roughly nine to twelve months between major versions (V3 → V3.2 → V4), so a V5 in late 2026 is plausible but not confirmed. Any rumors of "DeepSeek V5 leaked" or "V4.5 coming soon" circulating on X or Reddit at the moment trace back to a HuggingFace discussion thread that proposed renaming the V4-Pro PDF — not an actual model.

The practical implication: V4-Pro at the new permanent rate is the model to plan production workloads around for at least the next two to three quarters. There is no near-term "wait for the next version" argument.

Bottom line

The most useful way to think about May 22, 2026 is this. The frontier-adjacent LLM market has, for the first time, a permanent reference price under a dollar per million tokens that is also genuinely usable for coding, reasoning, and long-context work. That changes the math on which workloads stay on Claude or GPT "because they are better" and which ones move because the cost gap, at more than an order of magnitude, outweighs the quality gap.

Plan three things:

Audit your existing OpenAI and Anthropic bills. Identify the workloads where quality is "good enough at any frontier model" (bulk extraction, summarization, classification, code generation for clear tasks). Those are the migration candidates.
Use V4-Pro's cached-input rate aggressively. $0.003625/M cached input is the magic. Front-load shared context — system prompts, document corpora, RAG knowledge bases — so 90% of input tokens hit the cache.
Update your model aliases. Anything calling deepseek-chat or deepseek-reasoner needs to switch to deepseek-v4-pro or deepseek-v4-flash before July 24, 2026.

For most production stacks already using DeepSeek, the next 30 days should be a quiet win: same model, same code, three-quarters off the bill. For stacks still on Claude or GPT for cost-sensitive workloads, the question is no longer "is DeepSeek viable" but "which specific workloads do we move first."


FAQ

What is the new DeepSeek V4-Pro price as of May 2026?

$0.435 per million cache-miss input tokens, $0.003625 per million cached input tokens, and $0.87 per million output tokens. These are the permanent standing rates as of May 22, 2026 — what was formerly a time-boxed 75% promotional discount.

Did V4-Flash get the same price cut?

No. V4-Flash pricing is unchanged: $0.14 per million input tokens, $0.0028 per million cached input tokens, $0.28 per million output tokens. Only V4-Pro was repriced.

Is V4-Pro really 75% off, or is this a different model?

It is the same model that launched April 24, 2026 — 1.6T total parameters, 49B active, MoE, 1M-token context, MIT license. No architectural change, no benchmark change. Only the pricing changed.

How does V4-Pro compare to Claude Opus 4.7 on price?

V4-Pro is roughly 11.5x cheaper than Opus 4.7 on input tokens and 28.7x cheaper on output tokens. A blended 1M-input + 1M-output workload costs about $1.31 on V4-Pro versus $30.00 on Opus 4.7. With 75% of input served from cache, V4-Pro drops to roughly $0.98; with fully cached input, roughly $0.87.

How does V4-Pro compare to GPT-5.5?

V4-Pro is roughly 11.5x cheaper than GPT-5.5 on input and 34.5x cheaper on output. The gap widens against GPT-5.5 Pro to about 207x cheaper on output. On benchmarks, V4-Pro is close on coding and math, behind on the hardest novel reasoning and long-horizon agentic tasks.

Will OpenAI and Anthropic respond with their own price cuts?

No matching cut has been announced as of May 28, 2026. Google quietly lowered Gemini pricing earlier in the month. Industry analysts expect price pressure on the closed-source frontier through Q3 2026, but timing and magnitude are unclear.

What is the rate limit on V4-Pro?

500 concurrent requests per user_id, with a dynamic cap that flexes with overall server load. DeepSeek does not publish fixed RPM or TPM limits. HTTP 429 is returned when you exceed the dynamic cap.

Can I use V4-Pro with my existing Claude code?

Yes. DeepSeek exposes an Anthropic-compatible endpoint at https://api.deepseek.com/anthropic. Change base_url and use either deepseek-v4-pro or claude-opus-* (auto-routed) as the model name. The cache_control blocks are silently ignored — V4 handles caching automatically.

When do the legacy aliases retire?

2026-07-24 at 15:59 UTC. Code calling deepseek-chat or deepseek-reasoner after that returns a hard error. Switch to deepseek-v4-pro or deepseek-v4-flash before then.

Is V4-Pro available on OpenRouter and HuggingFace?

Yes. OpenRouter lists V4-Pro and V4-Flash with multi-provider routing. HuggingFace hosts the open weights at deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash under MIT license.

Is a DeepSeek V5 or V4.5 coming soon?

No public announcement of either as of May 28, 2026. The May 22 event was a pricing change to V4-Pro, not a new model. Plan production workloads on V4-Pro and V4-Flash for at least the next two to three quarters.

Should I self-host V4-Pro instead of using the API at the new price?

Rarely worth it at $0.435 / $0.87 per million tokens. Full FP8 V4-Pro needs around 900 GB of VRAM (more than a single 8x H100 80 GB node, or 8x MI300X). Quantized Q4_K_M fits in 48-80 GB but with quality loss on long-context. Self-hosting now mostly makes sense for hard data-residency requirements or when you already own the GPUs. The break-even moved sharply further out after the May 22 cut.