DeepSeek V4 vs Claude Opus 4.8: 2026 Benchmarks

Quick answer. Pick by workload. DeepSeek V4-Pro, a 1.6T-parameter MIT-licensed MoE released April 24, 2026, beats Claude Opus 4.8 on competitive programming, long-context retrieval, and open-web research at a fraction of the price: at current standing rates it is roughly 11x cheaper on input and 29x cheaper on output. Opus 4.8, now the current top of the Opus line, still wins on real production software engineering, computer use, vision, writing, and code honesty. Self-host V4 for sovereignty; pay for Opus when output quality matters most.

In April 2026, two companies on opposite sides of the Pacific shipped the most consequential AI releases of the year. On April 24, DeepSeek released DeepSeek V4, a 1.6-trillion-parameter mixture-of-experts behemoth under the MIT license, with weights on Hugging Face and a price tag roughly one-seventh of Anthropic's flagship. Anthropic's answer has since matured into Claude Opus 4.8 — the current top of the Opus line, a quiet refinement-pass that tightened the screws on agentic coding, computer use, and code honesty while shipping at the same price as the 4.7 release it succeeds.

The framing writes itself: seven times cheaper, eight points behind on what matters. That is the headline of the DeepSeek V4 vs Claude Opus 4.8 debate, and it is roughly correct, but it badly undersells how strange this matchup actually is. V4-Pro tops Opus on competitive programming, long-context retrieval, and open-web research. Opus crushes V4 on real production software engineering, computer use, vision, and writing. One ships sovereign weights to your own data center; the other ships a hosted API with stricter refusals and a more measurable handle on its own mistakes.

This article is for the engineering leader who has to make a routing decision this quarter, not the casual reader. We will go through pricing, the full benchmark table, the token-burn paradox that catches both vendors, a fresh June 2026 study showing how much of Opus's leaderboard score is actually retrieval, the architectural reason V4 matters technically, the geopolitics of Huawei Ascend, the honest Opus 4.8 picture (gains and gripes), and a use-case-by-use-case recommendation matrix. If you are staffing an AI engineering team in 2026, the model choice and the talent choice are now the same conversation.

The two models in 60 seconds

Before the deep dive, here is the stat-card summary. Both models target the same agentic-coding and long-context-reasoning workloads, but they get there from opposite directions.

Spec	Claude Opus 4.8	DeepSeek V4-Pro
Status	Current Opus (2026), succeeds Opus 4.7	Released April 24, 2026 (preview)
Vendor	Anthropic (closed)	DeepSeek (MIT, open weights)
Total / Active params	Not disclosed	1.6T / 49B
Context window	1M tokens	1M tokens
Max output	128K tokens	384K tokens
Reasoning modes	Adaptive (only) + user-facing effort control	Non-Think / Think High / Think Max
Multimodal	Native vision, 2,576px / 3.75MP	Limited vision
Computer use	Online-Mind2Web 84%	n/a
Distribution	Anthropic API, Bedrock, Vertex, Foundry, Copilot	API + downloadable weights, Huawei Ascend optimized
Input / Output (per 1M tok)	$5 / $25	$0.435 / $0.87

One way to read it: Anthropic shipped the better product; DeepSeek shipped the better artifact. Opus 4.8 is a managed service that wins on the production-quality verticals (computer use, vision, agent loops). V4-Pro is the largest open-weight model in history, and it is competitive enough that, as Simon Willison put it, it is "almost on the frontier, a fraction of the price."

Note the model lineage on the Anthropic side: Opus 4.8 is not a clean-sheet model but the latest refinement-pass on the 4.x Opus line. Anthropic describes it as delivering "improvements across benchmarks" over Opus 4.7 at unchanged pricing — so every comparison below that previously ran against 4.7 now runs against a slightly stronger, same-price 4.8.

Pricing and access

The pricing gap is where most of the noise comes from. At standing rates (the 75% V4-Pro discount became permanent on 2026-05-22), V4-Pro is roughly 29x cheaper on output and 11x cheaper on input than Opus 4.8. V4-Flash, the smaller 284B / 13B-active variant, is in another league entirely.

Model	Input ($/1M tok)	Output ($/1M tok)	Output speed
Claude Opus 4.8	$5.00	$25.00	~45 t/s
Claude Opus 4.8 (fast mode)	$10.00	$50.00	~2.5x standard
DeepSeek V4-Pro	$0.435	$0.87	36.6 t/s
DeepSeek V4-Flash	$0.14	$0.28	variable

Pricing notes. Opus 4.8 ships at the same $5 / $25 per million tokens as Opus 4.7 — the upgrade is free. Its new fast mode runs ~2.5x quicker at $10 / $50 and is roughly 3x cheaper than the fast mode on previous Opus models. Standard pricing keeps the 1M-token context window and 128K max output (model id claude-opus-4-8). On the DeepSeek side, the previously time-boxed 75% V4-Pro discount became the standing rate on 2026-05-22 — $0.435/M input cache-miss, $0.87/M output, $0.003625/M cache-hit per the official DeepSeek pricing page and the Hacker News thread.
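The ratios quoted above follow directly from the price table. A minimal sketch of that arithmetic, assuming the list prices in this article are still current; the dictionary keys and function names are illustrative labels, not vendor identifiers:

```python
# Per-million-token list prices in USD, copied from the pricing table above.
# The keys are labels chosen for this sketch, not official model ids.
PRICES = {
    "opus-4.8": {"input": 5.00, "output": 25.00},
    "v4-pro": {"input": 0.435, "output": 0.87},
    "v4-flash": {"input": 0.14, "output": 0.28},
}
V4_PRO_CACHE_HIT = 0.003625  # USD per 1M cached input tokens (from the text)


def price_ratio(expensive, cheap, kind):
    """How many times more the first model costs per token of the given kind."""
    return PRICES[expensive][kind] / PRICES[cheap][kind]


def request_cost(model, input_tokens, output_tokens):
    """USD cost of one request at list price, ignoring any cache discount."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def v4_pro_input_price(cache_hit_rate):
    """Blended V4-Pro input price per 1M tokens for a given cache-hit fraction."""
    miss = PRICES["v4-pro"]["input"]
    return cache_hit_rate * V4_PRO_CACHE_HIT + (1 - cache_hit_rate) * miss


if __name__ == "__main__":
    print(round(price_ratio("opus-4.8", "v4-pro", "input"), 1))   # 11.5
    print(round(price_ratio("opus-4.8", "v4-pro", "output"), 1))  # 28.7
    print(request_cost("opus-4.8", 20_000, 2_000))
    print(request_cost("v4-pro", 20_000, 2_000))
```

The ratio you actually pay depends on your input-to-output mix: an input-heavy request sits closer to 11x, an output-heavy one closer to 29x.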

Access patterns differ in a way that matters as much as the per-token rate. Opus 4.8 is everywhere a managed model can be: AWS Bedrock, GCP Vertex, Microsoft Foundry, and GitHub Copilot. DeepSeek V4 ships weights to Hugging Face on day zero, and it is the first DeepSeek model optimized end-to-end for Huawei Ascend 950 silicon, which we will return to in the geopolitics section.

If your AI strategy passes through enterprise procurement, that distribution gap is decisive. If it passes through your own GPU cluster (or a sovereign cloud you actually trust), it inverts.

Benchmark deep-dive: DeepSeek V4 vs Claude Opus 4.8

Here is the consolidated benchmark table. We have pulled numbers directly from each vendor's release notes, the Artificial Analysis leaderboard, and the SWE-bench leaderboard.

Benchmark	DeepSeek V4-Pro Max	Claude Opus 4.8	Winner
Artificial Analysis Intelligence Index	52	57	Opus 4.8
SWE-bench Verified	80.6%	88.6%	Opus 4.8 (+8)
SWE-bench Pro	55.4%	64.3%	Opus 4.8
Terminal-Bench 2.0	67.9%	69.4%	Opus 4.8 (close)
LiveCodeBench	93.5	~88.8 (est.)	V4-Pro
Codeforces rating	3,206	not published	V4-Pro
GPQA Diamond	90.1%	94.2%	Opus 4.8
MMLU-Pro	87.5	not published	—
HMMT 2026 Feb	95.2	96.2	Opus 4.8 (close)
BrowseComp	83.4	79.3	V4-Pro
MRCR 1M (8-needle)	0.59 at 1M, >0.82 through 256K	not disclosed	V4-Pro
Output cost / 1M tokens	$0.87	$25	V4-Pro ~29x cheaper
Output speed	36.6 t/s	~45 t/s	Opus 4.8
Multimodal vision	Limited	2,576px high-res	Opus 4.8
Online-Mind2Web (computer use)	n/a	84%	Opus 4.8
Open weights	Yes (MIT)	No	V4-Pro

Real-world software engineering

This is where the engineering buying decision actually lives, and it is where Opus 4.8 wins decisively. SWE-bench Verified at ~88.6% vs 80.6% is an 8-point gap on the most-cited real-issue benchmark — and the upgrade from 4.7's 87.6% widened that lead. SWE-bench Pro keeps Opus ahead by nearly 9 points (64.3% vs 55.4%). That delta shows up in production. The AkitaOnRails 23-model coding eval places the Opus line ahead of V4-Pro on the harder, multi-file, real-repo tasks that look like what your senior TypeScript developers actually do.

⚠️ Sidebar: how much of that lead is retrieval, not reasoning?

A June 2026 Cursor study audited 731 Opus 4.8 Max traces on SWE-bench Pro and found that 63% of "successful" fixes came from the model retrieving a known answer rather than solving the bug from scratch. When the researchers sandboxed git history and restricted network egress, the score fell from 87.1% to 73.0%, a 14.1-point drop. (That 87.1% is the agentic-scaffold "Max" configuration the study audited, higher than the 64.3% standard SWE-bench Pro figure in the table above; different harness, same model.) That is not unique to Anthropic; it is a benchmark-contamination problem the whole field shares. But it is the strongest available evidence for this article's core thesis: the leaderboard number is an upper bound, not your number. Measure $/completed-task on your repo, with your retrieval surface, before you route production traffic.

The picture on Hacker News stays textured. User anonzzzies wrote: "DeepSeek V4 would now be good enough to replace claude for us; we use sonnet only ... it works as well as opus 4.6, 4.7 so far." Translation: if your team's actual baseline is Sonnet (not Opus), V4-Pro is already a credible drop-in at one-fifteenth of the per-task cost. The 8-point benchmark gap matters most when you are at the frontier of what is possible — long-running autonomous agents, security-sensitive refactors, multi-file architectural changes.

Competitive coding

Flip the script for competitive programming. V4-Pro hits a Codeforces rating of 3,206, which clears the 3,000-point Legendary Grandmaster threshold (the site's top tier) and is a number Anthropic conspicuously does not publish for Opus. On LiveCodeBench, V4-Pro scores 93.5 vs an estimated 88.8 for Opus 4.8. If you are building a competitive-programming tutor, an algorithmic-trading research tool, or an Olympiad coach, V4-Pro is the better engine.

Reasoning

Opus 4.8 wins on GPQA Diamond (94.2% vs 90.1%) and edges HMMT 2026 February (96.2 vs 95.2). The two models are within margin of error on math olympiad work, but Opus's GPQA lead points to the same conclusion as SWE-bench Pro: when problems require integrating soft, ambiguous knowledge across a domain, Anthropic's post-training still has the edge. Anthropic also says 4.8 is roughly 4x less likely than 4.7 to let flaws in its own code pass unremarked, a judgment/honesty gain that matters more in agent loops than any single reasoning score.

Long-context retrieval

This is the most under-discussed asymmetry in the comparison. DeepSeek V4-Pro publishes MRCR-1M at 0.59 (8-needle), with greater than 0.82 retention through 256K tokens. Anthropic does not publish a comparable number for Opus 4.8. They publish the marketing claim of a 1M context window at standard pricing, which is genuinely useful, but they do not let you measure how good retrieval actually is at the tail. In a market where every vendor advertises 1M tokens, refusing to be benchmarked is its own statement.
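If a vendor does not publish retrieval-at-distance numbers, you can get a rough reading yourself. The sketch below is a simplified multi-needle check, not the MRCR benchmark: it scatters known facts through filler text and counts how many the model returns. `ask_model` is a hypothetical callable you would wire to whichever API you are testing; `echo_model` is a stand-in so the harness runs offline.

```python
import random


def build_haystack(needles, filler_sentences, seed=0):
    """Scatter one sentence per needle at random positions in the filler text."""
    rng = random.Random(seed)
    lines = list(filler_sentences)
    for key, value in needles.items():
        position = rng.randrange(len(lines) + 1)
        lines.insert(position, f"The code for {key} is {value}.")
    return " ".join(lines)


def score_retrieval(ask_model, needles, haystack):
    """Fraction of needles whose value appears in the model's answer."""
    hits = 0
    for key, value in needles.items():
        answer = ask_model(f"{haystack}\n\nWhat is the code for {key}?")
        if value in answer:
            hits += 1
    return hits / len(needles)


def echo_model(prompt):
    """Offline stand-in for a real model call: answers by plain string search."""
    key = prompt.rsplit("code for ", 1)[1].rstrip("?")
    marker = f"The code for {key} is "
    start = prompt.find(marker)
    if start == -1:
        return ""
    return prompt[start + len(marker):].split(".")[0]


if __name__ == "__main__":
    needles = {f"item{i}": str(1000 + 37 * i) for i in range(8)}
    filler = [f"Filler sentence number {i}." for i in range(2000)]
    haystack = build_haystack(needles, filler)
    print(score_retrieval(echo_model, needles, haystack))
```

To compare models, grow the filler until the prompt approaches the context length you care about and run the same needles against each endpoint.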

Vision and computer use

Opus 4.8 is in a different category here. Online-Mind2Web at 84% for autonomous computer-use tasks — a meaningful jump over both Opus 4.7 and GPT-5.5 — plus a vision pipeline that handles 2,576-pixel images at 3.75 megapixels with 1:1 pixel-to-coordinate mapping (3x the prior limit), means Opus 4.8 can actually drive a browser or a desktop. V4-Pro's vision is limited and there is no published computer-use result. If your roadmap includes RPA, browser-automation agents, or PDF-and-screenshot-heavy workflows, this is not a close call.

Tool calling and agent loops

Opus 4.8's adaptive thinking mode (it is still the only mode — manual extended-thinking budgets return a 400 error) is designed for long, tool-using loops, paired with the task budgets beta that imposes an advisory cap across the full agentic loop rather than a single turn. The 4.8 release adds three things worth wiring into your stack: user-facing effort control in claude.ai (let users dial reasoning depth without an API change), dynamic workflows in Claude Code (hundreds of parallel subagents for codebase-scale migrations), and mid-session system messages in the Messages API (re-steer an agent without tearing down the conversation). V4-Pro's three-mode reasoning ladder (Non-Think, Think High, Think Max — the last needing at least 384K of context) gives you more manual dials, but Opus 4.8 is the more polished agent runtime. Teams shipping production agents on either model should be working with engineers who understand the full stack — see our AI engineering services if that is a gap on your team.

The token-burn paradox

Cheap-per-token is not cheap-per-task. Both vendors are caught by this in 2026, just from opposite directions.

On the Anthropic side, the story changed with 4.8. The much-discussed ~35% token-count jump came with the 4.6 → 4.7 tokenizer swap, and it is now baked in: Opus 4.8 uses the same tokenizer as 4.7, so counts are roughly unchanged 4.7 → 4.8. The better news is efficiency — 4.8's default high effort spends a similar token count to 4.7 but gets more out of it, and on document-heavy workloads Databricks measured a 61% lower token cost over PDFs and diagrams. Adaptive thinking can still ramp into a long reasoning trace without an explicit user opt-in, so your real bill on a coding task can run above the headline rate — but the stealth-hike framing that dogged 4.7 no longer applies to the 4.7 → 4.8 step.

On the DeepSeek side, the cost paradox is worse in a different way. Startup Fortune reported that V4-Pro's reasoning-heavy output costs roughly 15x more to run than V3.2 on equivalent self-hosted infrastructure. The efficiency gains in the architecture (we will get to those) buy you a bigger model that thinks for longer; the net cost-per-task can rise even as cost-per-token falls. DeepSeek has since shipped DSpark speculative decoding for V4 Flash and Pro, lifting throughput 51–400% depending on the workload, which softens the wall-clock side of that bill — but if you are migrating from V3.2 hoping for a cheaper invoice, read the fine print.

The honest takeaway: if you are picking a model on raw $/1M tokens, you are picking it wrong. Run your actual workload through both and measure $/completed-task. Most teams discover the gap is meaningfully smaller than the sticker shock implies — and occasionally that the "cheaper" model is more expensive in production.
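A minimal sketch of the $/completed-task calculation, assuming failed attempts are retried independently and are still billed. The token counts in the example are made-up placeholders, and the success rates simply reuse the SWE-bench Pro figures from the table as stand-ins; substitute measurements from your own workload.

```python
def cost_per_completed_task(input_tokens, output_tokens,
                            input_price, output_price, success_rate):
    """Expected USD spent per successful task.

    Token counts are averages per attempt, prices are USD per 1M tokens,
    and success_rate is the fraction of attempts that complete the task.
    Failed attempts are billed too, so cost per attempt is divided by it.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = (input_tokens * input_price
                        + output_tokens * output_price) / 1_000_000
    return cost_per_attempt / success_rate


if __name__ == "__main__":
    # Placeholder workload: same input, but the cheaper model reasons 5x longer.
    opus = cost_per_completed_task(30_000, 4_000, 5.00, 25.00, 0.643)
    v4_pro = cost_per_completed_task(30_000, 20_000, 0.435, 0.87, 0.554)
    print(f"Opus 4.8: ${opus:.4f} per completed task")
    print(f"V4-Pro:   ${v4_pro:.4f} per completed task")
    print(f"ratio:    {opus / v4_pro:.1f}x")
```

Under these invented numbers the gap shrinks from about 29x per output token to roughly 7x per completed task, which is the effect the paragraph above describes.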

Architecture: why DeepSeek V4 matters technically

Even setting aside the price comparison, V4 is the most architecturally interesting open-weight release of 2026. The headlines:

Hybrid attention. V4 alternates layers between Compressed Sparse Attention (CSA, 4x compression) and Heavily Compressed Attention (HCA, 128x compression). At 1M context, this gets V4-Pro down to 27% of the per-token FLOPs and just 10% of the KV cache compared to V3.2. That is what makes 1M tokens economically possible at $0.435 input.
Manifold-Constrained Hyper-Connections (mHC). The matrices that mix the residual streams are constrained to the Birkhoff polytope, the set of doubly stochastic matrices whose rows and columns each sum to 1. The practical result: signal amplification through the network drops from a chaotic 3000x to a stable 1.6x, which is why training a 1.6T-parameter MoE on 32–33T tokens did not collapse.
Reasoning modes. Non-Think for cheap inference, Think High for default reasoning, Think Max for the big swings (Think Max requires at least a 384K context window, which limits where it can run). Max output runs to 384K tokens, far past Opus's 128K ceiling.
DSpark speculative decoding. A V4-specific speculative-decoding stack for Flash and Pro that lifts throughput 51–400% — a meaningful answer to the "thinks for longer = costs more wall-clock" tradeoff above.
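To see why a Birkhoff-polytope constraint tames amplification, here is a toy sketch, not DeepSeek's implementation. It assumes Sinkhorn-Knopp normalization as the way to push a positive matrix toward doubly stochastic form, then compares the gain of 64 stacked unconstrained mixing matrices against 64 constrained ones. Matrix size, depth, and the gain metric are all choices made for this illustration.

```python
import random


def sinkhorn(matrix, iterations=50):
    """Alternately rescale rows and columns of a positive matrix so both sum to ~1."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    for _ in range(iterations):
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [v / row_sum for v in m[i]]
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def gain(m):
    """Largest row sum; for a non-negative matrix this is its infinity norm."""
    return max(sum(row) for row in m)


def stacked_gain(layers):
    """Gain of the product of all layer matrices, applied in order."""
    product = layers[0]
    for layer in layers[1:]:
        product = matmul(product, layer)
    return gain(product)


rng = random.Random(0)
raw = [[[rng.uniform(0.5, 1.5) for _ in range(4)] for _ in range(4)]
       for _ in range(64)]
constrained = [sinkhorn(m) for m in raw]

if __name__ == "__main__":
    print(f"unconstrained gain after 64 layers: {stacked_gain(raw):.3e}")
    print(f"constrained gain after 64 layers:   {stacked_gain(constrained):.6f}")
```

Products of doubly stochastic matrices are themselves doubly stochastic, so the constrained stack keeps its gain at about 1 regardless of depth, while the unconstrained stack grows exponentially.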

For ML engineers, V4 is a working argument that frontier-scale architecture can change at the substrate level — attention shape, optimizer, residuals, precision, and tokenizer all simultaneously. Latent Space's writeup is correct that V4 is "definitely better than GLM-5.1 but not quite Opus, GPT-5.4 or Gemini 3.1 Pro," and that the open-weight ecosystem remains "4–5 months behind frontier proprietary models." What is new in 2026 is that the gap is now measured in months, not generations. If your team needs people who can actually fine-tune and deploy a model like this, our vetted senior Python developers for AI/ML projects are a starting point.

Geopolitics and sovereignty

The most important thing about DeepSeek V4 is not on the benchmark sheet. V4 is the first DeepSeek model optimized for Huawei Ascend 950 silicon. That is the decoupling moment: a frontier-class open-weight model whose reference inference path runs on Chinese-domestic hardware. MIT Technology Review framed it correctly — V4 matters less for what it does in a single benchmark and more for what it implies about hardware sovereignty for the next generation of models.

For Western enterprises, the story flips. The MIT license and downloadable weights mean V4-Pro is the first model in its weight class that can be deployed inside an EU sovereign cloud, an air-gapped financial-services environment, or a healthcare cluster covered by HIPAA without sending a single token to a foreign API. Macaron's local-deployment guide covers the practical path. If you are in a regulated vertical and have been waiting for a model good enough to self-host, V4-Pro is the moment.

For Anthropic, the implication is that the moat is moving from weights to distribution, safety tuning, and integrated tool use. The Bedrock-Vertex-Foundry-Copilot multi-cloud presence is the moat now, not raw capability.

The Opus 4.8 picture: gains and gripes

It would be dishonest to write a Claude Opus 4.8 review without giving both sides. Start with what genuinely improved over 4.7.

Honesty and judgment. The headline 4.8 gain is not a benchmark point — it is that Anthropic measured the model as roughly 4x less likely than 4.7 to let flaws in its own code pass unremarked, with better overall honesty and self-assessment per early testers. Cognition (the Devin team) reports that 4.8 fixes two of 4.7's most-complained-about behaviors: comment over-verbosity (4.7 narrated every line) and flaky tool-calling. For an autonomous agent, "tells you when it is unsure and stops over-commenting" is worth more than a point on GPQA.

Now the gripes that carried over — or appeared fresh.

Refusals. The Register reported that GitHub AUP-refusal complaints went from "two or three per month" to more than 30 reports in April, and the caution persists on 4.8 — especially around code that touches malware analysis, reverse engineering, or security tooling. Some of this conservatism is by design — Anthropic tuned 4.8 to err cautious on dual-use security code. If your work is in offensive security, run your prompts through it before committing.

Adaptive-thinking confusion and lost dials. 4.8 keeps 4.7's design choices that power users disliked: adaptive thinking is the only mode (no deterministic control over reasoning depth from the API), and sampling parameters like temperature and top_p are still rejected. That makes billing and latency harder to predict.

Mixed anecdotal reports. As with every Anthropic point-release, some users report regressions on specific workloads after the 4.8 update — shallower reasoning on tasks the prior model handled, or occasional hallucinations. These reports are anecdotal and workload-specific; the felt experience varies, and the only reliable test is your own.

For balance, the writing strength is still real. The fundaai 38-task evaluation scored the Opus line at 9.17/10 on writing against V4-Pro Thinking's 8.80. For long-form authoring, executive-summary work, and nuanced editing, Opus remains the model of record.

Recommendation matrix

Mapping use case to model, based on the benchmark and qualitative evidence above.

Use case	Recommended model	Why
Privacy / regulated industry / self-hosting	DeepSeek V4-Pro	MIT license, downloadable weights, no data egress
Best autonomous coding agent	Claude Opus 4.8	SWE-bench Pro 64.3%, Terminal-Bench lead, 4x fewer unremarked code flaws, mature tool loop
Bulk batch inference / cost-sensitive	DeepSeek V4-Flash	$0.14 / $0.28 per 1M tokens; near-V4-Pro quality on simple tasks
Competitive programming or math olympiad	DeepSeek V4-Pro	Codeforces 3,206; LiveCodeBench 93.5
Vision / computer use / long-form writing	Claude Opus 4.8	Online-Mind2Web 84%, 2,576px vision, 9.17/10 writing eval
Long-context retrieval over 256K+ tokens	DeepSeek V4-Pro	Published MRCR-1M numbers; Anthropic does not disclose
Browser automation / RPA	Claude Opus 4.8	Computer use is a closed-API moat
Open research / fine-tuning / model surgery	DeepSeek V4-Pro	Open weights are non-negotiable here

The honest answer for most teams is "both." Route the bulk of your inference to V4-Pro or V4-Flash, and reserve Opus 4.8 for the tasks where its benchmark lead and computer-use capability actually pay for themselves.
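The matrix above maps naturally onto a lookup table. A minimal routing sketch, assuming a hand-labelled use case per request; `claude-opus-4-8` is the model id quoted earlier, while the DeepSeek identifiers, the use-case keys, and the V4-Flash default are placeholders chosen for this example:

```python
# Use case -> model, following the recommendation matrix above.
# The DeepSeek names are hypothetical labels, not confirmed API model ids.
ROUTES = {
    "regulated_self_hosted": "deepseek-v4-pro",
    "autonomous_coding_agent": "claude-opus-4-8",
    "bulk_batch": "deepseek-v4-flash",
    "competitive_programming": "deepseek-v4-pro",
    "vision_or_computer_use": "claude-opus-4-8",
    "long_form_writing": "claude-opus-4-8",
    "long_context_retrieval": "deepseek-v4-pro",
    "browser_automation": "claude-opus-4-8",
    "fine_tuning_research": "deepseek-v4-pro",
}
DEFAULT_MODEL = "deepseek-v4-flash"  # assumption: cheapest model takes the long tail


def route(use_case, data_must_stay_on_prem=False):
    """Pick a model for a use case; unknown use cases fall back to the default."""
    choice = ROUTES.get(use_case, DEFAULT_MODEL)
    if data_must_stay_on_prem and not choice.startswith("deepseek"):
        # A hosted-only model cannot satisfy a no-egress requirement,
        # so fall back to the self-hostable flagship.
        return "deepseek-v4-pro"
    return choice


if __name__ == "__main__":
    print(route("autonomous_coding_agent"))
    print(route("autonomous_coding_agent", data_must_stay_on_prem=True))
    print(route("something_unlisted"))
```

In practice the interesting work is in classifying requests and in measuring whether each route earns its cost, not in the table itself.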

What this means for engineering teams

The DeepSeek vs Claude comparison is no longer about which company to bet on. It is about which model belongs in which slot of your stack. The teams that will move fastest in 2026 are the ones that route intelligently — V4-Flash for the long tail of low-stakes inference, V4-Pro for self-hosted regulated workloads and competitive-programming-style problems, Opus 4.8 for autonomous coding agents, computer use, vision, and writing.

Building that routing layer is not a model problem; it is an engineering problem. You need people who can fine-tune V4-Pro on your codebase, run evaluation harnesses against your real workload (not the marketing benchmark — remember the Cursor retrieval study), wire up MCP servers and tool loops, and stand up a sovereign-cloud deployment that legal will sign off on. If you are scaling that capability, our marketplace specializes in pre-vetted senior engineers across the stacks that this work actually runs on — Python for ML pipelines, Go for inference services, Rust for performance-critical agent runtimes, TypeScript for the agent UIs, and Node.js for orchestration glue. Browse our AI engineering blog for more deep dives on building production-grade LLM systems.

FAQ

Is DeepSeek V4 better than Claude Opus 4.8?

Not on the headline benchmarks that map to production software engineering. Opus 4.8 leads SWE-bench Verified by about 8 points (~88.6% vs 80.6%), SWE-bench Pro by nearly 9 points, and computer use (Online-Mind2Web 84%) is Anthropic-only. V4-Pro wins on competitive programming (Codeforces 3,206), long-context retrieval (MRCR-1M 0.59), and price (~29x cheaper output). The right answer depends on the task, not the leaderboard. A June 2026 Cursor study also found that 63% of the "successful" Opus fixes it audited on SWE-bench Pro came from retrieval rather than reasoning, so test on your own repo.

How much cheaper is DeepSeek V4 than Claude Opus 4.8?

At standing rates (the 75% V4-Pro discount became permanent on 2026-05-22), V4-Pro is $0.435 input / $0.87 output per million tokens, vs Opus 4.8 at $5 / $25 — unchanged from 4.7, so the upgrade is free. That is roughly 29x cheaper on output and 11x cheaper on input. V4-Flash is even cheaper at $0.14 / $0.28. Be aware of the token-burn paradox: V4-Pro's reasoning modes can cost ~15x more to run than V3.2, and Opus's adaptive thinking can ramp token use on a coding task. Always measure $/task, not $/token.

Can I self-host DeepSeek V4-Pro?

Yes. V4-Pro is released under the MIT license with full weights on Hugging Face. It uses FP4+FP8 mixed precision and is the first DeepSeek model optimized for Huawei Ascend 950 chips, but it also runs on standard NVIDIA H100/H200 clusters. Plan for the 1.6T total / 49B active parameter footprint; serving the full Think Max reasoning mode requires at least a 384K context window. DSpark speculative decoding can lift serving throughput 51–400%.
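A back-of-envelope sizing sketch for the weights alone. It assumes 0.5 bytes per parameter for FP4 and 1 byte for FP8, and 80 GB and 141 GB of memory for H100 and H200 respectively. The article does not say what share of the weights is stored in each precision, so the two cases are bounds, and KV cache, activations, and runtime overhead are ignored.

```python
import math

TOTAL_PARAMS = 1.6e12  # total parameters; every expert must be resident in memory
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0}
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def weight_footprint_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(params, bytes_per_param, gpu_memory_gb):
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_footprint_gb(params, bytes_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    for precision, width in BYTES_PER_PARAM.items():
        size = weight_footprint_gb(TOTAL_PARAMS, width)
        for gpu, memory in GPU_MEMORY_GB.items():
            count = min_gpus_for_weights(TOTAL_PARAMS, width, memory)
            print(f"{precision}: {size:.0f} GB of weights -> at least {count} x {gpu}")
```

The 49B active-parameter figure lowers compute per token, not the memory needed to hold the model, so real deployments will need headroom well beyond these lower bounds, especially at long context.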

Does Claude Opus 4.8 still support manual extended-thinking budgets?

No. Like 4.7, Opus 4.8 removed manual extended-thinking budget controls — the API returns a 400 error if you pass them. The only mode is adaptive thinking, paired with an advisory task budgets beta that caps tokens across an entire agentic loop. Sampling parameters like temperature and top_p are also rejected. New in 4.8: user-facing effort control in claude.ai for steering reasoning depth without an API change.

Which model has a longer context window?

Both ship with a 1M-token context window, but V4-Pro allows a far larger 384K-token max output vs Opus 4.8's 128K. The other meaningful difference is disclosure: DeepSeek publishes MRCR-1M results (0.59 at 1M with 8 needles, above 0.82 through 256K), while Anthropic does not publish a comparable retrieval benchmark for Opus 4.8 at the long-context tail. If retrieval-at-distance is on the critical path, V4-Pro is the more measurable choice.

Is Claude Opus 4.8 too restrictive for cybersecurity work?

It can be. The Register reported GitHub AUP-refusal complaints jumped from two or three per month to over 30 in April 2026, and some of that caution is by design. Security educators and reverse-engineering professionals have been particularly vocal. If your team is in offensive security, malware research, or red-teaming, run your actual prompts through both models before committing.

Which model should I use for an autonomous coding agent?

Claude Opus 4.8. The combination of SWE-bench Pro 64.3%, Terminal-Bench at 69.4%, Online-Mind2Web 84% computer use, the improved file-system memory tool, dynamic workflows in Claude Code (hundreds of parallel subagents), and a model that is 4x less likely than 4.7 to let its own code flaws slide makes it the most polished agent runtime available. V4-Pro can match it on contained tasks, but Opus 4.8 is more consistent across long, multi-tool, multi-file workflows.

Which model is better for writing?

Claude Opus 4.8. The fundaai 38-task evaluation scored the Opus line at 9.17/10 on writing vs V4-Pro Thinking at 8.80. For executive briefs, technical documentation, and nuanced editing, Opus remains the model of record — though some users report uneven results on heavier reasoning tasks after the 4.8 update, so verify on your own prompts.

Sources and further reading

Last updated: June 29, 2026. Benchmarks shift weekly; if you spot a number that has moved, ping us on the Codersera blog and we will update.