Last updated: May 1, 2026
Anthropic shipped Claude Opus 4.7 on April 16, 2026, just over two months after Opus 4.6. On paper it is an incremental release. In practice it is the first model where you can hand off a multi-hour engineering task and reasonably expect it to come back with a working pull request. It also ships with a new tokenizer, a reworked thinking API, and a few regressions that will quietly raise your bill if you copy your old prompts forward.
This guide is for engineering leaders, founders, and developers deciding where Opus 4.7 fits in a 2026 stack alongside Sonnet 4.6, Haiku 4.5, GPT-5.5, and DeepSeek V4 Pro. We focus on what changed, what the API actually costs, where it beats and loses to its peers, and when it is the wrong tool for the job.
TL;DR
- What it is: Anthropic's flagship reasoning and coding model, released April 16, 2026. Same $5 / $25 per million tokens (input / output) as Opus 4.6, but a new tokenizer that produces roughly 1.0–1.35x as many tokens per request.
- Why it matters: 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro — a clear lead over GPT-5.5 (58.6%) and DeepSeek V4 Pro (55.4%) on real-world software engineering tasks.
- Where it loses: GPT-5.5 still leads the Artificial Analysis Intelligence Index (60 vs 57) and on terminal/agent breadth. DeepSeek V4 Pro is roughly 7x cheaper at $3.48 per million output tokens.
- What is new: 1M-token context at standard pricing, adaptive thinking only (manual budgets removed), 3.75 MP vision, MCP-Atlas score of 77.3%, task budgets in beta, and a new xhigh effort level in Claude Code.
- What broke: Web research and source-attribution accuracy regressed. Long-form prose got more mechanical. Code comments dropped from ~8% to ~4% of output. Manual thinking.budget_tokens calls now error.
- Bottom line: Default to Sonnet 4.6 for 80% of work. Reach for Opus 4.7 when the task is hard enough that one Opus run beats five Sonnet retries. Use Haiku 4.5 for high-volume routing, classification, and extraction.
What changed from Opus 4.5 and 4.6
Anthropic shipped three Opus-class models in five months: 4.5 in December 2025, 4.6 on February 5, 2026, and 4.7 on April 16, 2026. The cadence is fast enough that "should I upgrade" is a real question, not a reflex.
The headline gains over Opus 4.6:
- Coding: SWE-bench Verified jumps from 80.8% (4.6) to 87.6% (4.7). CursorBench moves from 58% to 70%.
- Vision: 4.7 accepts images up to 2,576 px on the long edge (~3.75 megapixels), more than 3x the resolution of 4.6. XBOW visual-acuity scores went from 54.5% to 98.5%.
- Agentic execution: 14% better on multi-step workflows while using fewer tokens, with a third of the tool errors of 4.6. First Claude model to pass implicit-need tests and to recover gracefully from tool failures that used to halt the agent.
- Instruction following: More literal. 4.6 was loose and would silently skip steps; 4.7 does what you asked, including when what you asked was wrong. Prompts tuned for 4.6 will need re-reading.
And the things that quietly broke:
- New tokenizer. The same English text produces 1.0–1.35x as many tokens as under Opus 4.6's tokenizer. Per-token rates are unchanged; per-request bills can rise meaningfully. Re-cost your workloads before migrating (see the token-counting sketch after this list).
- Manual thinking budgets removed. thinking: {type: "enabled", budget_tokens: N} is no longer accepted. 4.7 only supports adaptive thinking and decides per-step how much to think.
- Thinking content omitted by default. Thinking blocks still stream, but the thinking field is empty unless you opt in.
- Web research regressed. Source attribution, contradiction detection, and citation specificity are all worse than 4.6 in head-to-head testing. If you run a research agent that grounds claims in cited sources, validate before flipping the model.
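A quick way to quantify the tokenizer inflation on your own workload before migrating is to count tokens for the same prompts under both model ids. Here is a minimal sketch using the Anthropic Python SDK's token-counting endpoint; the sample prompts are placeholders, and the claude-opus-4-6 model id is an assumption based on this guide's naming scheme:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Representative prompts from your own workload -- placeholders here.
samples = [
    "Summarise the attached design doc and list open questions.",
    "Refactor the payment retry logic to use exponential backoff.",
]

def count_input_tokens(model_id: str, text: str) -> int:
    """Ask the API how many input tokens this prompt costs under a given model."""
    result = client.messages.count_tokens(
        model=model_id,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

for text in samples:
    old = count_input_tokens("claude-opus-4-6", text)  # assumed 4.6 model id
    new = count_input_tokens("claude-opus-4-7", text)
    print(f"{old} -> {new} tokens ({new / old:.2f}x)")
```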
For a deeper side-by-side with the leading open-weights challenger, see our DeepSeek V4 vs Claude Opus 4.7 comparison.
Architecture: extended thinking, tool use, computer use, MCP
Opus 4.7 is built around four primitives that, together, define what "agentic coding" means in 2026.
Adaptive thinking
Adaptive thinking is now the only thinking mode on Opus 4.7. The model decides per turn whether to run a hidden chain-of-thought and how long it should be. For trivial questions it skips thinking entirely and answers in one round trip. For a SWE-bench-grade bug fix it can think for tens of thousands of tokens before emitting a single character of code.
The trade-off: you lose the deterministic ceiling that budget_tokens gave you in 4.6. To bound spend, use the new task budgets beta, which sets a hard token ceiling on an agentic loop and lets the model see a running countdown so it can finish gracefully instead of cutting off mid-task or surprising you with the bill.
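In practice the migration is mostly deletion: drop the thinking block from your requests and let the model decide. A minimal sketch of the request shape with the Python SDK, assuming the standard Messages API; the prompt and max_tokens value are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

# Opus 4.6 style -- this now returns an error on Opus 4.7:
#   thinking={"type": "enabled", "budget_tokens": 32_000}

# Opus 4.7: omit the thinking config entirely and let adaptive thinking decide
# per turn whether, and how long, to think.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8_000,
    messages=[{"role": "user", "content": "Find and fix the race condition in worker.py"}],
)

# Thinking content is omitted by default on 4.7; hard spend ceilings move to the
# task-budgets beta, which applies to agentic loops rather than single requests.
print(response.content[-1].text)
```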
Tool use and MCP
Tool use is unchanged at the protocol level: you declare tools in the request, Claude emits structured tool_use blocks, you respond with tool_result. The Model Context Protocol (MCP) is now the de-facto standard for connecting Claude to filesystems, databases, browsers, and internal services without wiring each one into your prompt.
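A minimal single-tool sketch of that loop with the Python SDK; the get_open_issues tool is a hypothetical stand-in for whatever MCP servers or in-house tools you actually expose:

```python
import json
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_open_issues",          # hypothetical tool, for illustration only
    "description": "Return open GitHub issues for a repository as JSON.",
    "input_schema": {
        "type": "object",
        "properties": {"repo": {"type": "string"}},
        "required": ["repo"],
    },
}]

messages = [{"role": "user", "content": "Which open issues in acme/api look like duplicates?"}]
response = client.messages.create(model="claude-opus-4-7", max_tokens=4096,
                                  tools=tools, messages=messages)

# Keep looping until Claude stops asking for tools.
while response.stop_reason == "tool_use":
    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = {"issues": []}             # run your real tool or MCP server here
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": json.dumps(result),
        }],
    })
    response = client.messages.create(model="claude-opus-4-7", max_tokens=4096,
                                      tools=tools, messages=messages)

print(response.content[-1].text)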
The behavioral upgrade matters more than the protocol. Opus 4.7 scores 77.3% on MCP-Atlas, a benchmark for scaled multi-tool agentic tasks, and it keeps executing through tool failures that would have halted Opus 4.6. That is the difference between an agent you can let run for an hour and one you have to babysit.
Computer use
The computer-use API (beta header computer-use-2025-11-24) lets Claude take screenshots, move a cursor, click, type, and scroll in a real desktop environment. With 4.7's higher-resolution vision, it can finally read dense web UIs, full-screen IDEs, and design tools without losing detail. It is still beta, still slow, and still best run in a sandboxed VM.
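A request sketch, assuming the tool shape carries over from earlier computer-use betas; the computer_20251124 tool type string is inferred from the beta header and may differ, so treat this as a starting point rather than a reference, and run the returned actions only inside a sandboxed VM:

```python
import anthropic

client = anthropic.Anthropic()

# Sketch only: the tool type string below is an assumption inferred from the
# beta header; check the current computer-use docs before relying on it.
response = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    betas=["computer-use-2025-11-24"],
    tools=[{
        "type": "computer_20251124",     # assumed name, per the 2025-11-24 beta
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open the staging dashboard and screenshot the error panel."}],
)

# Claude responds with tool_use actions (screenshot, click, type, ...) that your
# sandboxed environment must execute and feed back as tool_result blocks.
for block in response.content:
    print(block.type)
```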
Extended-context coding
The 1M-token context window is now standard pricing on Opus 4.7. That is enough to load a mid-size monorepo, a few hundred pages of design docs, or a long Slack thread alongside the actual prompt. Long-context retrieval did not improve uniformly — precise ordinal recall over hundreds of thousands of tokens is still slightly stronger on Opus 4.6. For most agentic coding work, 4.7 wins anyway.
API basics: Messages, caching, batch, citations, files
You call Opus 4.7 with the model id claude-opus-4-7 against the standard Messages API. The platform features that matter for cost and latency:
- System prompts: Standard. Place your tool definitions and reusable context in the system block so they are eligible for caching.
- Prompt caching: Mark a content block with cache_control: {type: "ephemeral"} and Anthropic stores the prefix. Cache writes cost 1.25x input ($6.25/M for 5-minute TTL, $10/M for 1-hour TTL). Cache reads are 10% of input ($0.50/M). Minimum cache size on Opus 4.7 is 4,096 tokens (see the request sketch after this list).
- Batch API: Submit asynchronous jobs and get a 50% discount on both input and output. Effective Opus 4.7 batch rate is $2.50 / $12.50 per million. Combine with caching and your effective input rate drops to roughly $0.25 per million.
- Citations: Pass documents as content blocks with citations.enabled = true and Claude grounds responses in the specific sentences it used. Citations work alongside caching — the source documents cache, the per-response citation blocks do not.
- Files API: Upload a file once, reference it by id from any future Messages request. Useful for repeated PDF, image, or codebase inputs.
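A hedged request sketch that combines caching and citations in one call; the system prompt and document text are placeholders, and the block shapes follow Anthropic's published content-block format:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": "You are a code-review assistant. <long reusable conventions doc here>",
        "cache_control": {"type": "ephemeral"},   # prefix cached on the 5-minute TTL
    }],
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "text", "media_type": "text/plain",
                           "data": "<design doc the answer should cite>"},
                "citations": {"enabled": True},   # ground the reply in specific sentences
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": "Does the retry design contradict the conventions doc?"},
        ],
    }],
)

# usage reports how much of the request hit the cache on subsequent calls
print(response.usage)
```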
For migration patterns from earlier Claude versions, see how to use Claude 4 and Sonnet with Cursor and Windsurf, which covers the IDE-side wiring you will reuse with 4.7.
Benchmarks: what the numbers actually say
Benchmarks are useful for narrowing your shortlist, not for picking a winner. The table below is the current snapshot for Opus 4.7 against the two models it gets compared to most often: GPT-5.5 (high effort) and DeepSeek V4 Pro (max effort).
| Benchmark | Opus 4.7 | GPT-5.5 | DeepSeek V4 Pro | What it measures |
|---|---|---|---|---|
| SWE-bench Verified | 87.6% | 79.2% | ~76% | Real GitHub issue fixes |
| SWE-bench Pro | 64.3% | 58.6% | 55.4% | Harder multi-language SWE tasks |
| LiveCodeBench | 78.5% | ~80% | ~82% | Competitive programming |
| Terminal-Bench 2.0 | ~75% | 82.7% | 67.9% | Shell agent tasks |
| MCP-Atlas | 77.3% | ~74% | ~65% | Multi-tool agentic workflows |
| GPQA Diamond | 94.2% | 93.6% | ~88% | Graduate-level science |
| MMLU-Pro | 89.9% | ~91% | ~87% | Broad knowledge |
| IFEval | 91.2% | ~92% | ~89% | Instruction following |
| HLE (Humanity's Last Exam) | 54.7 | 52.2 | 37.7 | Frontier reasoning |
| AA Intelligence Index | 57 | 60 | ~50 | Composite |
The honest read: Opus 4.7 is the best model on the market for the specific shape of work that is "fix a real bug in a real repo" or "drive a multi-step tool-using agent." GPT-5.5 still has the breadth lead and wins on terminal-style tasks. DeepSeek V4 Pro wins on competitive programming and on cost.
For more on the open-source side, see our DeepSeek V4 complete guide. For older head-to-heads that are still useful for context, see Llama 4 vs Claude 3.7 Sonnet and the DeepSeek V3.1 Terminus vs GPT-5 vs Claude 4.1 comparison.
Pricing across the Anthropic lineup
Per-token pricing on Opus 4.7 is unchanged from 4.6. The catch is the new tokenizer: the same input text now produces up to 35% more tokens, so your per-request cost can drift up even on identical workloads.
| Model | Input ($/M) | Output ($/M) | Cache write 5m ($/M) | Cache read ($/M) | Batch (50% off) | Context |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $6.25 | $0.50 | $2.50 / $12.50 | 1M |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $3.75 | $0.30 | $1.50 / $7.50 | 1M |
| Claude Haiku 4.5 | $1.00 | $5.00 | $1.25 | $0.10 | $0.50 / $2.50 | 200K |
A 1-hour cache TTL is also available at 2x the input rate ($10/M on Opus 4.7) and pays for itself once cache reads exceed about eight per stored prefix.
Real-world cost: a coding-agent workload
Consider an autonomous coding agent that fixes 100 medium-complexity bugs per day. A typical run looks like 50,000 tokens of cached context (codebase, conventions, system prompt), 5,000 tokens of fresh input per task, and 8,000 tokens of generated output (thinking + final code).
- First-task cost: 50K cache write @ $6.25/M = $0.31, plus 5K input @ $5/M = $0.025, plus 8K output @ $25/M = $0.20. Total: $0.54.
- Subsequent task cost (cache warm): 50K cache read @ $0.50/M = $0.025, plus 5K input @ $5/M = $0.025, plus 8K output @ $25/M = $0.20. Total: $0.25.
- Daily total (100 tasks): $0.54 + 99 × $0.25 = ~$25/day, or roughly $750/month.
- Same workload on DeepSeek V4 Pro: approximately $110/month — about 7x cheaper, with a measurable but small quality drop on hard fixes.
- Same workload on Sonnet 4.6: roughly $450/month, with a meaningful quality drop on hard fixes that often shows up as failed tests and retry loops.
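The same Opus 4.7 arithmetic in a few lines of Python, using the rates from the pricing table and the illustrative token counts above; swap in your own numbers to re-cost a workload:

```python
# Opus 4.7 rates, $ per million tokens (from the pricing table above)
CACHE_WRITE, CACHE_READ, INPUT, OUTPUT = 6.25, 0.50, 5.00, 25.00

cached_ctx, fresh_in, out = 50_000, 5_000, 8_000   # tokens per task (illustrative)

first = (cached_ctx * CACHE_WRITE + fresh_in * INPUT + out * OUTPUT) / 1_000_000
warm  = (cached_ctx * CACHE_READ  + fresh_in * INPUT + out * OUTPUT) / 1_000_000

tasks_per_day = 100
daily = first + (tasks_per_day - 1) * warm

print(f"first task ${first:.2f}, warm task ${warm:.2f}")   # ~$0.54 and ~$0.25
print(f"daily ${daily:.2f}, monthly ${daily * 30:,.0f}")   # ~$25/day, roughly the $750/month above
```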
The economic question is rarely "is Opus 4.7 worth $25 more per output million than Sonnet 4.6." It is "does one Opus run beat 2-5 Sonnet retries on this task." For senior-grade engineering work the answer is usually yes. For routine refactors and templated CRUD, it is usually no.
Claude Code and agent capabilities
Opus 4.7 is the default model in Claude Code as of mid-April 2026, with a new xhigh effort level sitting between high and max. xhigh is now the default for Opus 4.7 in Claude Code — Anthropic's own bet that the extra latency is worth it for the quality jump on hard problems.
What you actually feel using Claude Code with 4.7:
- Long-running tasks that used to die at the 30-minute mark now run for hours and recover from individual tool failures.
- Context carries across sessions more reliably — you can stop, come back the next day, and pick up without re-priming.
- Ambiguous instructions get clarifying questions less often; the model commits to a path and executes. This is good when you are right and bad when you are not.
- Security tuning is over-eager. Claude Code 4.7 has been seen flagging static HTML/CSS as potential malware and refusing edits. This is a tuning regression that will likely be patched, but it is worth knowing about before you put it in front of a junior dev.
If you want to run an open-source equivalent for evaluation, see our guide to running open-source Claude Code OSS.
When to use Opus 4.7 vs Sonnet 4.6 vs Haiku 4.5
The most expensive mistake in 2026 is sending all your traffic to Opus. The second most expensive is sending all of it to Haiku. A three-tier router pays for itself within weeks at any non-trivial volume.
| Use case | Recommended model | Why |
|---|---|---|
| Hard SWE-bench-grade bugs, architectural design, deep code review | Opus 4.7 (xhigh) | Quality lead is decisive; one good run beats 3-5 retries. |
| Day-to-day feature work, PR review, content generation, RAG answers | Sonnet 4.6 | Best capability-per-dollar; handles 80% of production work. |
| Routing, classification, extraction, summarisation, chat first-line | Haiku 4.5 | 3x cheaper than Sonnet, fast enough for real-time. |
| Long-document precise ordinal retrieval | Opus 4.6 (still) | 4.7 regressed slightly on this narrow case. |
| Web research with strict citation accuracy | Opus 4.6 or GPT-5.5 | 4.7's source attribution regressed. |
| Bulk inference where cost dominates quality | DeepSeek V4 Pro | ~7x cheaper; trails on hard tasks but close enough for many. |
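A minimal sketch of such a three-tier router; the tier heuristic, task fields, and the Sonnet and Haiku model ids are illustrative assumptions (this guide only confirms claude-opus-4-7), and a production router would learn these boundaries from eval data rather than hard-code them:

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative routing table, mirroring the recommendations above.
MODEL_FOR_TIER = {
    "hard":    "claude-opus-4-7",     # hard bugs, architecture, deep review
    "default": "claude-sonnet-4-6",   # assumed id; day-to-day feature work, RAG
    "cheap":   "claude-haiku-4-5",    # assumed id; routing, classification, extraction
}

def tier_for(task: dict) -> str:
    """Crude heuristic stand-in for a learned router."""
    if task.get("kind") in {"classification", "extraction", "summary"}:
        return "cheap"
    if task.get("estimated_files_touched", 0) > 5 or task.get("failed_attempts", 0) >= 2:
        return "hard"
    return "default"

def run(task: dict) -> str:
    model = MODEL_FOR_TIER[tier_for(task)]
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": task["prompt"]}],
    )
    return response.content[-1].text

print(run({"kind": "bugfix", "failed_attempts": 2, "prompt": "Fix the flaky integration test in ci/"}))
```

The escalation rule (two failed attempts promotes a task to Opus) is one way to encode the "one Opus run beats several Sonnet retries" logic discussed earlier; tune it against your own retry and eval data.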
Known limitations
Anthropic's launch post for Opus 4.7 is, by Anthropic standards, unusually candid about what the model is not. The honest picture from third-party reviews:
- Token inflation. The new tokenizer increases token counts by 12–35% on typical inputs. Your per-request bills will rise even at unchanged per-token rates.
- Web research and citation accuracy regressed. The model is more likely to attribute a claim to the wrong source or paper over conflicting sources. If you ship a research agent, do not migrate without an A/B.
- Code comments dropped. Comment density went from 8.2% of output on 4.6 to 3.8% on 4.7. Code is more compact and slightly harder to maintain. Some static-analysis tools (Sonar) report a small increase in blocker/critical findings vs 4.6.
- Long-form prose got more mechanical. Opus 4.7 reaches for bullets and headings where 4.6 held a flowing narrative. Marketing and editorial teams may prefer to keep 4.6 in the loop for first drafts.
- Competition math. Around 70% on USAMO 2026 — well behind GPT-5.4 and 5.5 on that specific benchmark. Opus 4.7 is not the model to point at olympiad-grade math.
- API surface changes. Manual thinking budgets are gone. Thinking content is hidden by default. Beta headers shifted. Migration is not a model-id swap; re-read your client code.
- Overzealous safety filtering in Claude Code. Benign code occasionally flagged. Expect this to be patched but plan around it for now.
- It still trails GPT-5.5 on the AA Intelligence Index (57 vs 60). Anthropic itself acknowledged Opus 4.7 trails the unreleased Mythos model. The frontier is moving every quarter.
Comparing Opus 4.7 to GPT-5.5 and DeepSeek V4 Pro
The 2026 frontier is a three-way race rather than a single leader.
vs GPT-5.5. GPT-5.5 wins on overall intelligence index, on terminal/agent breadth (Terminal-Bench 2.0 at 82.7% vs Opus 4.7's ~75%), and on cost-per-quality at the top end. Opus 4.7 wins decisively on SWE-bench Verified and SWE-bench Pro, on MCP-Atlas, and on long-running coding agents that need to recover from tool failures. Pick GPT-5.5 if your agent surfaces are heterogeneous and your bottleneck is "can it operate this CLI." Pick Opus 4.7 if your bottleneck is "can it ship this PR."
vs DeepSeek V4 Pro. DeepSeek V4 Pro is roughly 7x cheaper per output token ($3.48/M vs $25/M), open-weights, and within striking distance on most benchmarks (within 5–10 points on SWE-bench Pro and GPQA Diamond). It actually leads on competitive programming. The gap shows on long-horizon agentic work and on hard, novel bug fixes — the exact areas where Opus 4.7 invested. For high-volume inference, internal tools, and cost-sensitive products, DeepSeek V4 Pro is the rational default. For mission-critical engineering work, Opus 4.7 still earns its premium.
FAQ
What is the model id for Claude Opus 4.7 in the API?
claude-opus-4-7. Use it as the model parameter in the Messages API.
Did pricing change from Opus 4.6 to 4.7?
No. Per-token pricing is identical at $5 input / $25 output per million. Effective per-request cost rises because the new tokenizer produces 1.0–1.35x more tokens for the same English text.
Does Opus 4.7 support a 1M-token context window?
Yes, at standard pricing. Output is capped at 128K tokens.
Is manual thinking.budget_tokens still supported?
No. Opus 4.7 supports adaptive thinking only; the manual budget parameter now errors. Use the new task-budgets beta to bound spend on agentic loops.
How much does prompt caching save?
Cache reads cost 10% of the standard input rate ($0.50 per million tokens on Opus 4.7). Cache writes cost 1.25x input for a 5-minute TTL. The break-even point is roughly two cache reads per write.
How much does the Batch API save?
50% off both input and output, bringing effective rates to $2.50 / $12.50 per million on Opus 4.7. Batches can take up to 24 hours and are best paired with the 1-hour cache TTL for shared context.
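A minimal submission sketch with the Python SDK's Message Batches API; the custom_id values and prompts are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",          # placeholder ids, one per job
            "params": {
                "model": "claude-opus-4-7",
                "max_tokens": 2048,
                "messages": [{"role": "user", "content": f"Summarise document {i}."}],
            },
        }
        for i in range(100)
    ]
)

# Poll later (batches can take up to 24 hours), then fetch results by custom_id.
print(batch.id, batch.processing_status)
```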
Should I migrate from Opus 4.6 to 4.7 today?
Yes for coding, agentic workflows, and vision tasks. Stay on 4.6 if your workload is web research with strict citation accuracy, long-form creative prose, or precise ordinal retrieval over hundreds of thousands of tokens.
Is Opus 4.7 better than GPT-5.5?
For software engineering, yes — it leads on SWE-bench Verified (87.6% vs 79.2%) and SWE-bench Pro (64.3% vs 58.6%). For broad intelligence and terminal/agent breadth, GPT-5.5 still leads.
Is Opus 4.7 better than DeepSeek V4 Pro?
On hard, novel coding tasks and long-horizon agents, yes. On cost-per-quality, no — DeepSeek V4 Pro is roughly 7x cheaper per output token and competitive on most benchmarks. Use both: Opus 4.7 for the hardest work, DeepSeek V4 Pro for bulk.
Does Opus 4.7 work with Cursor and Windsurf?
Yes. Both Cursor and Windsurf added Opus 4.7 to their model picker within days of launch. See our guide on using Claude 4 and Sonnet with Cursor and Windsurf for setup patterns that carry forward.
Does Opus 4.7 support computer use?
Yes, via the computer-use-2025-11-24 beta header. The 3.75 MP vision upgrade makes it noticeably better at reading dense web UIs and IDEs than earlier versions.
Where is Opus 4.7 available besides the Anthropic API?
Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, Snowflake Cortex AI, GitHub Copilot Pro+, and Claude Code. Pricing parity varies by platform; the Anthropic-direct API is usually the cheapest.
What is task budgets and when should I use it?
A beta primitive that sets a hard token ceiling on an agentic loop and exposes a running countdown to the model so it prioritises and finishes gracefully. Use it whenever you let Opus run unsupervised for more than a few minutes.
Will Opus 4.7 replace human engineers?
No. It will replace engineers who do not use it. The bottleneck for shipping software is still architecture, code review, judgement on tradeoffs, and accountability for production. Opus 4.7 is a force multiplier on a senior engineer; it is not a substitute for one.
Next steps
If you are deciding where Opus 4.7 fits in your stack, the cheapest experiment is also the most informative: pick one workflow, route it to Opus 4.7 for a week, and measure. If the workflow is "ship more software, faster, with fewer regressions," you also need engineers who can wire it up properly — prompt caching, MCP, task budgets, evals, the lot.
Hire a Codersera-vetted Python or AI engineer to integrate Opus 4.7 into your codebase, build the routing layer that sends the right task to the right model, and stand up the evals that tell you whether it is actually working. Vetted, remote-ready, and available in days — not months.