
Claude Sonnet 5: Benchmarks, Pricing & How It Compares

Anthropic's most agentic Sonnet yet, launched June 30, 2026. Full benchmark table, real pricing (including the tokenizer catch), availability, and honest verdicts vs Sonnet 4.6, Opus 4.8, GPT-5.5 and Gemini.

Published 30 Jun 2026 • Updated 30 Jun 2026 • 13 min read

Quick answer. Claude Sonnet 5 (model ID claude-sonnet-5) launched on June 30, 2026 as Anthropic's most agentic Sonnet yet. It ships with a 1M-token context window and lands close to Opus 4.8 quality for far less money. Introductory pricing is $2/$10 per million tokens through August 31, 2026, then $3/$15. It's the new default for Free and Pro users on claude.ai and is live in Claude Code, the Claude API, Cursor, VS Code, and GitHub Copilot.

On June 30, 2026, Anthropic shipped Claude Sonnet 5, the newest model in its mid-tier "Sonnet" line and, by the company's own framing, its most agentic Sonnet to date. The pitch is simple to state and harder to verify: Sonnet 5 plans multi-step work, drives browsers and terminals, and runs autonomously at a level Anthropic says recently required larger, more expensive models, while carrying the same standard list price as Sonnet 4.6 ($3/$15) and, at its introductory $2/$10 rate, 40% of Opus 4.8's $5/$25.

That positioning — near-flagship quality at workhorse prices — is exactly the kind of claim worth pulling apart before you rewire your stack. This guide walks through what actually changed versus Sonnet 4.6, the benchmarks Anthropic published, real pricing including a tokenizer change that quietly affects your bill, where you can run the model today, and honest head-to-head verdicts: Sonnet 5 against Sonnet 4.6, against Claude Opus 4.8, and against the GPT and Gemini competition. Every number below comes from Anthropic's launch post, its system card, and the official model docs — nothing from guesswork.

What is Claude Sonnet 5?

Claude Sonnet 5 is a general-purpose frontier model positioned in the middle of Anthropic's lineup — above the cheaper Haiku 4.5 and below the flagship Opus 4.8, with the premium-priced Fable 5 sitting at the top of the range. Its calling card is agentic work: not just answering a prompt, but planning a sequence of steps, calling tools, reading the result, and continuing without a human nudging it at every turn. Anthropic describes it plainly as a model built to plan, use browsers and terminals, and run autonomously.

The headline specs:

Context window: 1M tokens. (Launch-day reports noted the window briefly appeared smaller before settling at the full 1M.)
Max output: 128k tokens, raisable to 300k via the batch-API beta header output-300k-2026-03-24.
Adaptive thinking: always on, with effort defaulting to high on the API and in Claude Code.
Knowledge / training cutoff: January 2026.
Latency: Fast.
Model ID: claude-sonnet-5 on the Claude API.

"Adaptive thinking always on" is the most consequential of these for day-to-day use. Rather than you toggling a reasoning mode, the model scales its internal deliberation to the task — light for a quick rewrite, heavier for a multi-file refactor or a research question. Combined with the high effort default, the practical result is that Sonnet 5 spends more compute reasoning before it acts than Sonnet 4.6 did out of the box, which likely contributes to the more reliable task-completion early users report (more on that below).

What's actually new versus Sonnet 4.6?

If you're already running Sonnet 4.6 in production, this is the section that matters. The gains are real and they're not uniform — reasoning, tool use, and long-context coding moved the most. At a glance, the version-over-version deltas on the headline benchmarks:

Terminal-Bench 2.1: +13.4 points (67.0% → 80.4%) — the biggest single jump
Humanity's Last Exam, no tools: +8.6 points (34.6% → 43.2%)
Humanity's Last Exam, with tools: +10.6 points (46.8% → 57.4%)
SWE-bench Pro: +5.1 points (58.1% → 63.2%)
OSWorld-Verified: +2.7 points (78.5% → 81.2%)
GDPval-AA v2 knowledge work: +223 (1395 → 1618)
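The deltas above are plain subtractions of the published scores. A minimal Python sketch that recomputes them follows; the dictionary layout and the `deltas` function name are illustrative choices, not anything Anthropic ships.

```python
# Scores as quoted in this article: (Sonnet 4.6, Sonnet 5).
# Percent benchmarks are in percentage points; GDPval-AA v2 is a rating.
SCORES = {
    "Terminal-Bench 2.1": (67.0, 80.4),
    "Humanity's Last Exam, no tools": (34.6, 43.2),
    "Humanity's Last Exam, with tools": (46.8, 57.4),
    "SWE-bench Pro": (58.1, 63.2),
    "OSWorld-Verified": (78.5, 81.2),
    "GDPval-AA v2": (1395, 1618),
}


def deltas(scores):
    """Return {benchmark: new - old}, rounded to one decimal place."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    for name, delta in deltas(SCORES).items():
        print(f"{name}: +{delta}")
```

Note that the GDPval-AA v2 figure is a rating difference, so it is not directly comparable to the percentage-point gains on the other rows.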

Reasoning

On Humanity's Last Exam, a deliberately brutal academic benchmark, Sonnet 5 scores 43.2% with no tools versus Sonnet 4.6's 34.6%, an 8.6-point jump. With tools enabled, Sonnet 5 reaches 57.4% versus 46.8%, a 10.6-point jump. That with-tools number is the eye-catcher: it sits within half a point of Opus 4.8's 57.9%, meaning that on this particular test, once tools are enabled, the reasoning gap to the flagship almost closes.

Tool use and the terminal

Terminal-Bench 2.1 measures how well a model operates inside a shell — running commands, reading output, recovering from errors. Sonnet 5 jumps to 80.4% from Sonnet 4.6's 67.0%, a 13.4-point gain and the single largest improvement in the headline table. For anyone running agents that live in a terminal, this is the most tangible upgrade.

Agentic coding

On SWE-bench Pro — note, the harder Pro variant, not Verified — Sonnet 5 scores 63.2% versus 58.1%. Cursor's own internal CursorBench tells a similar story: 57% versus 49% for Sonnet 4.6, which Cursor publicly called a meaningful step up. The long-context coding picture is even stronger: on ProgramBench, Sonnet 5 lands in the 76–86% band versus Sonnet 4.6's 52–74%, closing most of the distance to Opus 4.8's 80–90%. If your work involves reasoning across a large codebase rather than a single file, that's the number to watch.

Knowledge work

GDPval-AA v2 scores professional knowledge work — the kind of analysis, drafting, and synthesis a consultant or analyst does. Here Sonnet 5 posts 1618, up substantially from Sonnet 4.6's 1395. It's also the one headline metric where Sonnet 5 edges out Opus 4.8 (1615), which we'll come back to.

The behavioral change: it finishes

Benchmarks aside, the most-cited difference from early users isn't a number — it's completion. The recurring sentiment on Reddit's r/AI_Agents was that Sonnet 5 "is almost as good as Opus 4.8, but very fast and cheaper," and that it actually drives long agentic chains to done rather than stalling, asking for permission, or declaring victory early. That tracks with the higher Terminal-Bench score, the always-on adaptive thinking, and the high effort default. For background on how agentic coding models are evaluated and deployed, our AI coding agents guide covers the broader landscape.

How does Sonnet 5 do on benchmarks?

Here are the headline benchmarks from Anthropic's launch post. Opus 4.8 is included as a reference ceiling; Sonnet 4.6 as the model you're likely upgrading from.

Benchmark	Sonnet 5	Sonnet 4.6	Opus 4.8 (ref)
SWE-bench Pro (agentic coding)	63.2%	58.1%	69.2%
Terminal-Bench 2.1 (terminal & tool use)	80.4%	67.0%	82.7%
OSWorld-Verified (computer use)	81.2%	78.5%	83.4%
Humanity's Last Exam — no tools	43.2%	34.6%	49.8%
Humanity's Last Exam — with tools	57.4%	46.8%	57.9%
GDPval-AA v2 (knowledge work)	1618	1395	1615

Two things to read carefully here. First, the coding benchmark is SWE-bench Pro, the harder variant — don't mistake it for the more commonly quoted SWE-bench Verified, where scores run higher. Second, the pattern is consistent: Sonnet 5 sits clearly above 4.6 across the board, and clearly below Opus 4.8 on everything except the reasoning-with-tools and knowledge-work rows, where it nearly matches or nudges ahead.

Anthropic's system card adds a few more evals worth knowing. On BrowseComp 25, an agentic web-search benchmark, Sonnet 5 scores 84.7% — run with a 10M-token operating limit and context compaction kicking in at 200k, which gives you a sense of how Anthropic expects long-horizon agentic sessions to behave. The same card is where the cross-vendor numbers live, which we'll get to in the competition section.

What does Sonnet 5 cost — and what's the tokenizer catch?

List pricing is straightforward. The wrinkle is underneath it.

Model	Input ($/MTok)	Output ($/MTok)
Sonnet 5 — intro (through Aug 31, 2026)	$2	$10
Sonnet 5 — standard (from Sep 1, 2026)	$3	$15
Sonnet 4.6	$3	$15
Opus 4.8	$5	$25
Haiku 4.5	$1	$5
Fable 5	$10	$50

Through August 31, 2026, Sonnet 5 runs at an introductory $2 input / $10 output per million tokens. From September 1 it moves to $3 / $15 — the same list price Sonnet 4.6 has. OpenRouter is matching the intro rate with a $2/$10 promo. On the surface, that means a quality upgrade at the same list price once you migrate.

Except it's not quite that clean. Sonnet 5 ships with an updated tokenizer that maps the same text to roughly 1.0–1.35x as many tokens. So even at an identical $3/$15 list price, your real cost-per-task can rise by up to about 35% versus Sonnet 4.6, because the same prompt and the same response now count as more tokens. Reddit was quick to flag this as a quiet price increase, with users skeptical that a matching list price really means matching spend. The intro pricing looks set to offset this: $2/$10 is two-thirds of $3/$15, so even at the 1.35x upper end the same text costs about 0.9x what it did on Sonnet 4.6 during the transition window. That makes the September 1 step-up the date to budget around.
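To make the tokenizer effect concrete, here is a rough Python sketch of per-task cost under a token multiplier. The prices are the ones quoted in this article; the `task_cost` helper, the price keys, and the 50k/5k example task are assumptions for illustration, and applying one multiplier to both input and output is a simplification.

```python
# Prices in USD per million tokens (input, output), as quoted in this article.
PRICES = {
    "sonnet-4.6": (3.0, 15.0),
    "sonnet-5-intro": (2.0, 10.0),
    "sonnet-5-standard": (3.0, 15.0),
}


def task_cost(input_tokens, output_tokens, price_key, token_multiplier=1.0):
    """Dollar cost of one task.

    token_multiplier scales both counts to model a tokenizer that emits
    more tokens for the same text (1.0 = no change, 1.35 = quoted upper end).
    """
    in_price, out_price = PRICES[price_key]
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    return (scaled_in * in_price + scaled_out * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical task: 50k input and 5k output tokens as counted for Sonnet 4.6.
    baseline = task_cost(50_000, 5_000, "sonnet-4.6")
    for key in ("sonnet-5-intro", "sonnet-5-standard"):
        for mult in (1.0, 1.35):
            cost = task_cost(50_000, 5_000, key, mult)
            print(f"{key} x{mult}: ${cost:.4f} ({cost / baseline:.2f}x baseline)")
```

At the 1.35x upper end, the intro rate comes out at 0.90x of the Sonnet 4.6 baseline and the standard rate at 1.35x. Your actual multiplier depends on your text, so treat these as bounds rather than a forecast.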

Practical takeaway: don't assume flat spend just because the per-token list price matches. Run a representative sample of your real prompts through both models and compare the actual token counts and dollar totals, not the rate card. If your workload is token-heavy — large contexts, long outputs — the tokenizer change is where the savings can quietly erode.
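One way to structure that comparison is sketched below. The two counter callables are hypothetical: you would supply functions that return real token counts for each model, for example by wrapping whatever token-counting facility your provider exposes. The stand-in counters in the usage section exist only so the sketch runs offline.

```python
from typing import Callable, Iterable


def compare_token_counts(
    prompts: Iterable[str],
    count_old: Callable[[str], int],
    count_new: Callable[[str], int],
) -> dict:
    """Total token counts under two tokenizers and the new/old ratio."""
    old_total = 0
    new_total = 0
    for prompt in prompts:
        old_total += count_old(prompt)
        new_total += count_new(prompt)
    ratio = new_total / old_total if old_total else float("nan")
    return {"old_tokens": old_total, "new_tokens": new_total, "ratio": ratio}


if __name__ == "__main__":
    # Stand-in counters (NOT real tokenizers); replace with real counts.
    def fake_old(text: str) -> int:
        return len(text.split())

    def fake_new(text: str) -> int:
        return len(text.split()) + 1

    sample = [
        "Refactor the billing module",
        "Summarize this contract in five bullets",
    ]
    print(compare_token_counts(sample, fake_old, fake_new))
```

Multiplying the measured ratio by the relevant list prices gives the dollar comparison; a sample drawn from production logs will be more informative than synthetic prompts.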

Where can you use Claude Sonnet 5?

Availability on day one was unusually broad — Anthropic clearly coordinated launch partners. Sonnet 5 is the new default model for Free and Pro users on claude.ai, and available to Max, Team, and Enterprise. First-party surfaces:

Claude Code
Claude API — model ID claude-sonnet-5
claude.ai and Cowork/Chat
AWS Bedrock — anthropic.claude-sonnet-5
Google Vertex AI and Microsoft Foundry

Third-party coding tools went live the same day: VS Code (via the official extension), GitHub Copilot, Cursor, OpenRouter (with the matching $2/$10 promo), and Command Code. The tooling partners pushed support live on day one rather than waiting weeks — a sign of how coordinated the launch was. The model IDs you'll actually paste into configs:

claude-sonnet-5            # Claude API
anthropic.claude-sonnet-5  # AWS Bedrock
output-300k-2026-03-24     # batch-API beta header to raise max output to 300k

If you route Claude through OpenRouter for billing or fallback flexibility, our walkthrough on using Claude Code with OpenRouter covers the setup. For editor workflows, the trade-offs between Cursor's agent and a raw Sonnet session are unpacked in Cursor Composer vs Claude Sonnet.

Sonnet 5 vs Opus 4.8: when should you use which?

This is the real decision for most teams, because the two models now overlap heavily while Opus costs noticeably more: $5/$25 versus Sonnet's $3/$15 at standard rates (about 1.7x), and 2.5x Sonnet's $2/$10 during the intro window. The short version: Sonnet 5 is the new default workhorse; Opus 4.8 is still the accuracy ceiling.
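The price multiples are easy to check. The short Python sketch below uses the list prices quoted in this article; the constant and function names are illustrative.

```python
# List prices in USD per million tokens (input, output), as quoted in this article.
OPUS_4_8 = (5.0, 25.0)
SONNET_5_STANDARD = (3.0, 15.0)
SONNET_5_INTRO = (2.0, 10.0)


def price_ratio(expensive, cheap):
    """Return (input ratio, output ratio) for two (input, output) price pairs."""
    return (expensive[0] / cheap[0], expensive[1] / cheap[1])


if __name__ == "__main__":
    print("Opus vs Sonnet 5 standard:", price_ratio(OPUS_4_8, SONNET_5_STANDARD))
    print("Opus vs Sonnet 5 intro:", price_ratio(OPUS_4_8, SONNET_5_INTRO))
```

These are rate-card ratios only. They assume both models count the same number of tokens for the same text, which may not hold given Sonnet 5's updated tokenizer.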

Where Opus 4.8 still wins. Look back at the benchmark table. Opus leads SWE-bench Pro by about 6 points (69.2 vs 63.2), and also leads Terminal-Bench 2.1, OSWorld-Verified, and Humanity's Last Exam with no tools. On ProgramBench long-context coding, Opus's 80–90% band edges Sonnet 5's 76–86%. If your task is raw, hard accuracy on agentic coding or computer use — the kind of work where a 6-point gap means a real difference in success rate on a tough repository — Opus is still the safer call. Anthropic also recommends Opus for cyber work that needs reduced guardrails, since Sonnet 5 has weaker cyber capability and ships with cyber safeguards on by default.

Where Sonnet 5 wins. Price (cheaper than Opus — $3/$15 vs $5/$25 at standard rates, and less during the intro window), speed (Fast latency), and — notably — GDPval-AA v2 knowledge work, the one headline metric where Sonnet 5 (1618) actually edges Opus 4.8 (1615). On that benchmark for professional knowledge tasks — analysis, drafting, synthesis — Sonnet 5 essentially matches the flagship for less. For the very large volume of agentic coding and tool-use work that doesn't sit right at the difficulty frontier, Sonnet 5 is "near-Opus" enough that the price and speed delta usually wins.

Safety is a quiet Sonnet 5 strength. On the system card's robustness evals, Sonnet 5 tied Opus 4.8 for best-in-class prompt-injection resistance and actually beat it on honesty:

Safety metric (lower is better)	Sonnet 5	Sonnet 4.6	Opus 4.8	GPT-5.5	Gemini 3.5 Flash
Prompt-injection attack success	0.19%	1.41%	0.19%	3.08%	6.66%
MASK lying rate	3.1%	13.3%	6.1%	—	—

A 0.19% prompt-injection attack success rate — matching Opus and far below Sonnet 4.6's 1.41%, GPT-5.5's 3.08%, and Gemini 3.5 Flash's 6.66% — matters a lot if you're deploying agents that read untrusted web content or process third-party documents. The MASK lying rate of 3.1% (versus Sonnet 4.6's 13.3% and even Opus's 6.1%) is the kind of honesty improvement that's easy to ignore until an agent confidently fabricates something in production.
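To put those percentages in operational terms, the Python sketch below converts each attack success rate into an expected count per 10,000 injection attempts. It assumes the benchmark rate carries over unchanged to your traffic and that attempts are independent, both of which are strong assumptions; the function name is illustrative.

```python
# Prompt-injection attack success rates (percent), as quoted in this article.
ATTACK_SUCCESS_PCT = {
    "Sonnet 5": 0.19,
    "Sonnet 4.6": 1.41,
    "Opus 4.8": 0.19,
    "GPT-5.5": 3.08,
    "Gemini 3.5 Flash": 6.66,
}


def expected_successes(attempts, rate_pct):
    """Expected number of successful attacks out of `attempts` tries."""
    return attempts * rate_pct / 100.0


if __name__ == "__main__":
    for model, pct in ATTACK_SUCCESS_PCT.items():
        count = expected_successes(10_000, pct)
        print(f"{model}: ~{count:.0f} per 10,000 attempts")
```

Under those assumptions, that is roughly 19 successful injections per 10,000 attempts for Sonnet 5 and Opus 4.8, against 141 for Sonnet 4.6, 308 for GPT-5.5, and 666 for Gemini 3.5 Flash.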

One organizational nuance worth flagging if you're on an Enterprise plan: Premium seats still default to Opus 4.8. So depending on your seat tier, the "which model" decision may already be made for you; check before assuming everyone's on the new default.

How does Sonnet 5 compare to GPT-5.6 and Gemini 3.5 Pro?

Here's where we have to be honest about what does and doesn't exist. Anthropic published no official comparison between Sonnet 5 and GPT-5.6 or Gemini 3.5 Pro. That's not an oversight to paper over with vibes — GPT-5.6 had not launched as of June 30, 2026 (track that release in our GPT-5.6 release watch), and the cross-vendor numbers in Anthropic's system card are against GPT-5.5 and Gemini 3.5 Flash, not the Pro tier. Treat any Sonnet-5-vs-GPT-5.6 or Sonnet-5-vs-Gemini-3.5-Pro table you see elsewhere as unofficial extrapolation.

What we can show is the apples-to-apples slice from the system card — Sonnet 5 against the GPT and Gemini models Anthropic actually tested:

Model (system-card figures)	SWE-bench Pro	Terminal-Bench 2.1
Claude Sonnet 5	63.2%	80.4%
GPT-5.5	58.6%	83.4%
Gemini 3.5 Flash	55.1%	—

Even within that limited comparison, the leadership is benchmark-dependent. Sonnet 5 leads on SWE-bench Pro (63.2 vs GPT-5.5's 58.6 and Gemini 3.5 Flash's 55.1), but GPT-5.5 edges Sonnet 5 on Terminal-Bench 2.1 (83.4 vs 80.4). So "best agentic coder" isn't a clean win — it depends which axis you weight. Anyone claiming Sonnet 5 categorically beats the field on agentic coding is overstating it.
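The "depends which axis you weight" point can be shown mechanically. The sketch below picks the leader per benchmark from the system-card figures quoted in this article, skipping scores that were not reported; the data layout and the `leader` function are illustrative.

```python
# System-card figures quoted in this article; None marks a score not reported.
CROSS_VENDOR = {
    "SWE-bench Pro": {
        "Claude Sonnet 5": 63.2,
        "GPT-5.5": 58.6,
        "Gemini 3.5 Flash": 55.1,
    },
    "Terminal-Bench 2.1": {
        "Claude Sonnet 5": 80.4,
        "GPT-5.5": 83.4,
        "Gemini 3.5 Flash": None,
    },
}


def leader(scores_by_model):
    """Return the model with the highest reported score."""
    reported = {m: s for m, s in scores_by_model.items() if s is not None}
    return max(reported, key=reported.get)


if __name__ == "__main__":
    for benchmark, scores in CROSS_VENDOR.items():
        print(f"{benchmark}: {leader(scores)}")
```

With only two benchmarks and one missing cell, this is a small sample; it illustrates the split, not an overall ranking.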

For a real GPT-5.6 or Gemini 3.5 Pro comparison, you'll need each vendor's own published figures, and you should flag them as non-apples-to-apples since they won't be run under Anthropic's harness. Our Gemini 3.5 guide tracks Google's side; we'll update the GPT-5.6 comparison once OpenAI ships and publishes numbers under comparable conditions.

Should you switch to Claude Sonnet 5?

For most teams already on Sonnet 4.6, the answer is yes — with eyes open on cost. The reasoning, tool-use, and long-context coding gains are large and consistent, the safety profile is materially better, and during the intro-pricing window through August 31 the upgrade is roughly cost-neutral. The behavioral improvement — agents that finish rather than stall — is the kind of thing that's hard to capture in a benchmark but shows up immediately in real usage.

Three caveats keep this honest:

Mind the tokenizer. Once intro pricing ends, the same $3/$15 list price does not guarantee flat spend, because the new tokenizer counts up to 1.35x as many tokens for the same text. Measure your real workload before assuming Sonnet 4.6 parity.
Opus 4.8 still wins raw accuracy. Among the headline benchmarks, Sonnet 5 beats Opus only on GDPval knowledge work. On SWE-bench Pro, Terminal-Bench, OSWorld, and HLE-no-tools, Opus leads, sometimes by enough to matter on hard tasks. Keep Opus in the toolbox for the frontier-difficulty jobs and cyber work.
It's not a clean crown. Early sentiment treated this as a genuine upgrade, not a coronation. Several community threads noted the gap to Opus 4.8 was smaller than expected but didn't fully close, and Opus loyalists weren't entirely won over. The realistic read: Sonnet 5 is the best value in the Claude lineup, not the outright best model.

A sensible default policy: route the bulk of your agentic coding, tool use, and knowledge work to Sonnet 5; reserve Opus 4.8 for the hardest accuracy-critical jobs and anything needing reduced cyber guardrails; keep Haiku 4.5 for high-volume, latency-sensitive, simple calls. If you're choosing models across the whole ecosystem rather than just within Anthropic's lineup, the AI coding agents guide sets the broader context.
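That policy can be written down as a small routing function. The Python sketch below is one possible encoding, not a recommended production router: the `Task` fields are hypothetical labels you would have to assign yourself, and only `claude-sonnet-5` is a model ID quoted in this article; the Opus and Haiku strings are placeholders to replace with real identifiers.

```python
from dataclasses import dataclass

SONNET = "claude-sonnet-5"      # model ID quoted in this article
OPUS = "OPUS_4_8_MODEL_ID"      # placeholder, not a confirmed identifier
HAIKU = "HAIKU_4_5_MODEL_ID"    # placeholder, not a confirmed identifier


@dataclass
class Task:
    accuracy_critical: bool = False               # hardest, frontier-difficulty jobs
    needs_reduced_cyber_guardrails: bool = False  # cyber work
    simple: bool = False                          # short, low-complexity call
    latency_sensitive: bool = False               # high-volume, fast-response path


def pick_model(task: Task) -> str:
    """Apply the default policy: Opus for the hardest or cyber jobs,
    Haiku for simple latency-sensitive calls, Sonnet 5 for everything else."""
    if task.accuracy_critical or task.needs_reduced_cyber_guardrails:
        return OPUS
    if task.simple and task.latency_sensitive:
        return HAIKU
    return SONNET


if __name__ == "__main__":
    print(pick_model(Task()))                                    # bulk agentic work
    print(pick_model(Task(accuracy_critical=True)))              # hardest jobs
    print(pick_model(Task(simple=True, latency_sensitive=True))) # cheap fast calls
```

Checking the Opus conditions first means an accuracy-critical task is never downgraded just because it is also flagged as latency-sensitive; whether that ordering suits you depends on your own cost and latency budgets.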

If you're a founder or eng lead trying to operationalize a multi-model setup — routing, evals, agent harnesses, the boring plumbing that makes any of this reliable in production — Codersera places vetted remote engineers who've already built it. That's the only pitch here; the rest is just the numbers.

FAQ

What is Claude Sonnet 5's model ID and context window?

The Claude API model ID is claude-sonnet-5; on AWS Bedrock it's anthropic.claude-sonnet-5. The context window is 1M tokens, with 128k max output (raisable to 300k via the batch-API beta header output-300k-2026-03-24).

How much does Claude Sonnet 5 cost?

Introductory pricing is $2 per million input tokens and $10 per million output tokens through August 31, 2026, then $3/$15 from September 1, the same list price as Sonnet 4.6. Note the updated tokenizer counts roughly 1.0–1.35x as many tokens for the same text, so real per-task spend can exceed Sonnet 4.6 despite the matching rate card.

Is Claude Sonnet 5 better than Opus 4.8?

Not across the board. Among the headline benchmarks, Sonnet 5 beats Opus 4.8 only on GDPval-AA v2 knowledge work (1618 vs 1615) and nearly matches it on Humanity's Last Exam with tools (57.4% vs 57.9%). On the safety evals it ties Opus on prompt-injection resistance (0.19% each) and posts a lower MASK lying rate (3.1% vs 6.1%). Opus 4.8 still leads on SWE-bench Pro (69.2% vs 63.2%), Terminal-Bench 2.1, OSWorld-Verified, and HLE with no tools, and is recommended for cyber work. Sonnet 5 wins on price and speed.

How does Sonnet 5 compare to GPT-5.6 and Gemini 3.5 Pro?

There is no official Anthropic comparison to either — GPT-5.6 had not launched as of June 30, 2026, and the system card's cross-vendor figures are against GPT-5.5 and Gemini 3.5 Flash. In that limited set, Sonnet 5 leads SWE-bench Pro (63.2% vs GPT-5.5's 58.6% and Gemini 3.5 Flash's 55.1%), but GPT-5.5 edges it on Terminal-Bench 2.1 (83.4% vs 80.4%). Any GPT-5.6 or Gemini 3.5 Pro head-to-head is unofficial.

Where can I use Claude Sonnet 5?

It's the default for Free and Pro users on claude.ai and available to Max, Team, and Enterprise. It's live in Claude Code, the Claude API, AWS Bedrock, Google Vertex, and Microsoft Foundry, plus third-party tools from day one: VS Code, GitHub Copilot, Cursor, and OpenRouter (with a matching $2/$10 promo).

Should I upgrade from Sonnet 4.6 to Sonnet 5?

For most teams, yes — the reasoning, tool-use, and long-context coding gains are large and consistent, safety is meaningfully better, and intro pricing makes the move roughly cost-neutral through August 31. The main caveat is the tokenizer change: measure your real workload's token counts before assuming flat spend once standard $3/$15 pricing kicks in.
