
Grok Build vs Claude Code vs Codex CLI: Which Coding Agent Wins in 2026?

Grok Build, Claude Code, and Codex CLI compared on benchmarks, pricing, and workflow — including Anthropic's June 15 metered-credit change.

Published 18 May 2026 • Updated 23 May 2026 • 11 min read

Quick answer. On vendor-reported SWE-bench Verified, Codex CLI (GPT-5.5) leads at 88.7%, Claude Code (Opus 4.7) follows at 87.6%, and Grok Build trails at 70.8%. That Grok score belongs to the now-deprecated grok-code-fast-1; the production CLI runs on grok-build-0.1, released 20 May 2026, with a documented 256K context (not the "2M" some launch-week coverage cited) and up to 8 parallel subagents. Claude Code is the maturity pick, but budget-driven teams should weigh Anthropic's June 15 metered-credit change before committing. Verified 23 May 2026.

By mid-May 2026 the dominant question in engineering channels is not whether to run a terminal coding agent — it is which one. Three serious contenders now live in the same shell prompt: Anthropic's Claude Code, OpenAI's Codex CLI, and — as of May 14, 2026 — xAI's Grok Build. All three plan, edit files, run commands, and iterate against your codebase autonomously. They differ sharply on benchmark scores, on how they bill you, and on what their agentic loop actually feels like to work with.

This is a senior-engineer's comparison: labeled benchmarks (no fabricated numbers), a cost model breakdown that flags Anthropic's June 15 billing overhaul, the workflow differences that matter day to day, and a decision matrix for picking the right one per use case. Where neutral third-party data does not exist yet — Grok Build is days old — we say so and give a reasoned framework instead of a number.

Want the full picture? Read our continuously-updated AGENTS.md and SKILL.md complete guide — the open standard 60k+ repos use to give Cursor, Codex, Claude Code, Copilot, and 20+ other agents the build commands, conventions, and boundaries they need to actually follow project rules.

Why does the three-way coding-agent race matter now?

For most of 2025 the terminal-agent space was effectively a two-horse race: Claude Code and Codex CLI. Cursor and others lived in the editor; the CLI tier was Anthropic versus OpenAI. Three things changed that in a single month:

xAI shipped Grok Build (May 14, 2026) and followed up with the purpose-built grok-build-0.1 model on May 20, 2026. It is an early-beta agentic CLI with Plan Mode on by default and up to eight parallel subagents: a credible third entrant, not a toy (xAI announcement, grok-build-0.1 docs).
Anthropic announced a June 15, 2026 billing change that meters programmatic Claude usage separately from interactive use — a material cost shift for any team running Claude Code in automation (The Decoder).
The benchmark gap narrowed. Codex CLI on GPT-5.5 and Claude Code on Opus 4.7 are now within ~1 point of each other on vendor-reported SWE-bench Verified, so the decision is no longer "pick the highest number."

The net effect: choosing a coding agent in 2026 is a real procurement decision with cost, lock-in, and workflow consequences — not a coin flip between two near-identical tools.

What are the three contenders, exactly?

Agent	Vendor	Default model	Launched / status	Context window	Signature feature
Claude Code	Anthropic	Claude Opus 4.7	GA, mature	~200K (1M on Opus 4.7 long-context)	Plan Mode + self-verification on long-horizon tasks
Codex CLI	OpenAI	GPT-5.5	GA, mature	Large (GPT-5.5 class)	Auto-review agent + omnimodal model
Grok Build	xAI	`grok-build-0.1` (released 20 May 2026)	Early beta (CLI launched 14 May 2026)	256K tokens (documented)	Plan Mode by default + up to 8 parallel subagents

Claude Code is the incumbent most teams already run. It is the most mature of the three: stable in CI/CD, well-documented, broad ecosystem. Opus 4.7 added self-verification on long-running agentic tasks, which materially reduces "the agent confidently shipped a broken change" failures — see our complete Claude Opus 4.7 developer guide for the model side of this.

Codex CLI is OpenAI's terminal agent on GPT-5.5. Its differentiator in 2026 is a separate built-in review agent that critiques your diff before you commit, plus a natively omnimodal backing model (text, images, audio, video in one architecture). For a two-way deep dive on just these two incumbents, see our honest Claude Code vs OpenAI Codex engineering-team comparison.

Grok Build is the newcomer. It leads with Plan Mode on by default (Claude Code's most-requested feature), native parallel subagents, full Agent Coordination Protocol (ACP) support, and — as of 20 May 2026 — a purpose-built backing model in grok-build-0.1 with a documented 256K-token context window (some launch-week press cited a "2M-token" context; that figure is not in xAI's docs). It is early beta — gated to SuperGrok Heavy subscribers — and unproven in production (DevOps.com). If you want to try it yourself, our step-by-step Grok Build install guide walks through setup and the SuperGrok Heavy gating.

How do they compare on benchmarks?

Every number below is vendor-reported unless stated otherwise. Treat them as directional, not gospel — each vendor runs its own harness, and harness differences alone can move SWE-bench by several points. There is no neutral, audited head-to-head for all three agents as of mid-May 2026; Grok Build is too new for independent evaluation.

Benchmark	Claude Code (Opus 4.7)	Codex CLI (GPT-5.5)	Grok Build (grok-code-fast-1, deprecated 15 May 2026)	Source label
SWE-bench Verified	87.6%	88.7%	70.8%	All vendor-reported
Terminal-Bench 2.0	69.4%	82.0%–82.7%	Not separately reported	Vendor-reported; Grok unverified

Note on the Grok Build score: the 70.8% SWE-bench Verified figure was reported on xAI's earlier grok-code-fast-1 model (deprecated 15 May 2026, retires 15 Aug 2026). The production Grok Build CLI now runs on grok-build-0.1 (released 20 May 2026), for which xAI has not yet published a SWE-bench Verified number. Expect the Grok Build column to change once xAI publishes one. Directionally, the architecture (a 256K context plus parallel subagents) is meant to lift agentic-loop scores rather than single-pass benchmarks.

Reading the table honestly:

SWE-bench Verified: Codex CLI (88.7%, OpenAI-reported) edges Claude Code (87.6%, Anthropic-reported) by ~1.1 points — within harness noise. Grok Build's underlying coder posts 70.8% on xAI's internal harness, a real gap on this metric (Grey Journal, Vellum on Opus 4.7).
Terminal-Bench 2.0: Codex CLI/GPT-5.5 reports a strong 82%+ (OpenAI-reported); Opus 4.7 reports 69.4% (Anthropic-reported). xAI has not published a comparable Terminal-Bench 2.0 number for Grok Build, so we exclude it rather than guess.

The honest takeaway: on the benchmarks vendors have published, Codex CLI and Claude Code are essentially tied on SWE-bench and Codex leads on Terminal-Bench. Grok Build's raw score is lower, but it is competing on architecture (Plan Mode by default plus parallel subagents), not single-pass benchmark accuracy. Do not pick on the 1-point SWE-bench gap; it will not survive a different harness.
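To make the "within harness noise" reading concrete, here is a minimal sketch that compares the vendor-reported SWE-bench Verified scores pairwise against a noise band. The 3-point threshold is an assumption for illustration only; the article says harness differences can move scores by "several points" without giving an exact figure, so substitute your own.

```python
# Vendor-reported SWE-bench Verified scores from the table above.
SCORES = {
    "Codex CLI (GPT-5.5)": 88.7,
    "Claude Code (Opus 4.7)": 87.6,
    "Grok Build (grok-code-fast-1)": 70.8,
}

# ASSUMPTION: gaps of up to 3 points are treated as harness noise.
# This threshold is illustrative, not a published figure.
HARNESS_NOISE_POINTS = 3.0


def compare(a: str, b: str, noise: float = HARNESS_NOISE_POINTS) -> tuple[float, bool]:
    """Return (gap in points, True if the gap exceeds the noise band)."""
    gap = round(abs(SCORES[a] - SCORES[b]), 1)
    return gap, gap > noise


if __name__ == "__main__":
    names = list(SCORES)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap, meaningful = compare(a, b)
            verdict = "meaningful gap" if meaningful else "within noise"
            print(f"{a} vs {b}: {gap:.1f} pts ({verdict})")
```

With this threshold, the Codex and Claude pair falls inside the band while both comparisons against the Grok score fall well outside it. That matches the reading above, though the Grok figure comes from a deprecated model.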

How do the cost and usage models compare?

This is where the decision actually gets made for most teams, and it is the section that changed most in May 2026.

Plan	Claude Code	Codex CLI	Grok Build
Entry tier	$20/mo (Pro) — Claude Code in terminal	ChatGPT Plus tier	None — no individual tier
Mid tier	$100/mo (Max 5x)	ChatGPT Pro $100 (promo: 10x Codex through May 31)	$99/mo intro (first 6 months)
High tier	$200/mo (Max 20x)	ChatGPT Pro $200 (20x)	$299–$300/mo (SuperGrok Heavy, post-promo)
Reported real-world spend	Plan-bounded; API ~$5/$25 per M tok if metered	~$100–$200/dev/mo (high variance)	Bundled into Heavy subscription only

Sources: Anthropic pricing, OpenAI Codex pricing, Martin Cid on Grok Build pricing.

Three structural differences matter more than the headline numbers:

Grok Build has no cheap on-ramp. There is no $20 tier. Access is bundled into SuperGrok Heavy — $99/month for the first six months, then $299–$300/month. That price effectively rules out the solo developer and targets funded teams treating it as a productivity line item (Grey Journal).
Codex CLI usage limits bite under heavy load. The list price is friendly, but multiple users report 5-hour usage windows depleting fast under continuous agentic work — budget for variance, not the sticker (openai/codex#19571).
Claude Code's billing changes June 15, 2026 — read this before you standardize on it.
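The Grok Build pricing step described above (intro rate, then list price) is easy to misjudge over a full year, so here is a rough sketch of the cumulative subscription cost. The function name and defaults are illustrative. The $299 list price is the lower end of the reported $299–$300 range, and the calculation ignores taxes and any usage-based charges.

```python
def supergrok_heavy_cost(
    months: int,
    intro_price: float = 99.0,   # reported intro rate, first six months
    intro_months: int = 6,
    list_price: float = 299.0,   # ASSUMPTION: lower end of the $299-$300 range
) -> float:
    """Cumulative subscription cost after `months`, intro rate applied first."""
    if months < 0:
        raise ValueError("months must be non-negative")
    at_intro = min(months, intro_months)
    at_list = max(months - intro_months, 0)
    return at_intro * intro_price + at_list * list_price


if __name__ == "__main__":
    for horizon in (6, 12, 24):
        print(f"{horizon:>2} months: ${supergrok_heavy_cost(horizon):,.0f}")
```

Under these inputs the first year comes to $2,388 and two years to $5,976 per seat, which is the figure to compare against a flat $100 or $200 monthly plan.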

What is the Claude Code June 15 billing change?

From June 15, 2026, Anthropic splits Claude billing. Programmatic usage — the Claude Agent SDK, the claude -p headless command, the Claude Code GitHub Actions integration, and third-party apps — draws from a new, separate monthly Agent SDK credit pool, billed at full API rates with no subscription discount. Interactive use (chatting with Claude, and running Claude Code interactively in the terminal) still counts against normal subscription limits (InfoWorld, The Decoder).

Subscription	New Agent SDK monthly credit (programmatic)	Behavior at exhaustion
Pro ($20)	$20 credit	SDK calls fail unless extra usage is toggled on
Max 5x ($100)	$100 credit	Same — pay-as-you-go at API list price, no discount
Max 20x ($200)	$200 credit	Same

Vendor-reported credit figures via The New Stack. The practical impact is concentrated on teams running Claude Code in CI/CD, scheduled jobs, or via the SDK at scale: after June 15 that usage is metered at full API price once the modest credit pool drains. Interactive, human-in-the-loop terminal use is unaffected. If your Claude Code footprint is mostly automation, model the post-June-15 cost now — it can be a multiple of today's effective rate for automation-heavy workloads. We break down exactly what to do before the deadline in our Anthropic June 15 billing change action guide.
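For teams modeling the post-June-15 cost, a minimal sketch of the arithmetic follows. It assumes the "~$5/$25 per M tok" figure quoted earlier means $5 per million input tokens and $25 per million output tokens, and it ignores prompt caching, batch discounts, and any model other than the default. The token volumes in the example are hypothetical placeholders, not measurements.

```python
# Monthly Agent SDK credit per subscription, as reported in the table above.
CREDIT_USD = {"pro": 20.0, "max_5x": 100.0, "max_20x": 200.0}

# ASSUMPTION: the "~$5/$25 per M tok" figure is input/output pricing.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0


def programmatic_cost(plan: str, input_mtok: float, output_mtok: float) -> tuple[float, float]:
    """Return (metered cost at API rates, overage beyond the monthly credit).

    Token volumes are in millions of tokens per month.
    """
    metered = input_mtok * INPUT_USD_PER_MTOK + output_mtok * OUTPUT_USD_PER_MTOK
    overage = max(metered - CREDIT_USD[plan], 0.0)
    return metered, overage


if __name__ == "__main__":
    # Hypothetical CI workload: 60M input and 8M output tokens per month.
    metered, overage = programmatic_cost("max_5x", input_mtok=60, output_mtok=8)
    print(f"metered: ${metered:,.2f}, overage beyond credit: ${overage:,.2f}")
```

In this hypothetical, a Max 5x seat would meter $500 of programmatic usage against a $100 credit, leaving $400 of pay-as-you-go overage if extra usage is enabled. Plug in your own CI token counts before drawing conclusions.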

How do the workflows and UX differ?

Benchmarks measure the model; UX decides whether engineers actually keep the tool. The three diverge most here.

Dimension	Claude Code	Codex CLI	Grok Build
Plan before execution	Plan Mode available; self-verifies on long tasks	Plans internally; auto-review agent gates the diff	Plan Mode on by default — edit/comment/rewrite the plan first
Parallelism	Subagents supported	Subagents for parallel tasks	Up to 8 parallel subagents, native, by default
Review loop	Self-verification before reporting back	Dedicated separate Codex reviewer pre-commit	Plan-level review up front; ACP for multi-agent orchestration
Maturity	Highest — stable in production / CI	High — production-ready	Early beta — not production-hardened
Platform	Broad, incl. Windows	Broad	Beta constraints; verify your platform

The defining workflow contrasts:

Grok Build front-loads control. It writes a plain-English plan listing files to touch, commands to run, and intermediate checks — and you approve, comment on individual steps, or rewrite it before a single line changes. This is the single most-requested Claude Code feature, shipped on by default (Pasquale Pillitteri).
Codex CLI back-loads control. Its differentiator is a separate review agent that critiques the diff before you commit — a second opinion on the output rather than approval of the plan (OpenAI Codex CLI docs).
Claude Code balances both with Plan Mode plus Opus 4.7's self-verification on long-horizon tasks — and wins on the boring thing that matters most in production: it is the most mature and stable of the three.

Companion guide

For the full landscape (every major terminal and editor agent, how the agentic loop works, and how to evaluate one for your team), see our complete guide to AI coding agents in 2026.

Which coding agent wins for which use case?

There is no single winner — the right answer depends on team size, budget, and how much of your usage is automated versus interactive. The decision matrix:

Your situation	Best pick	Why
Solo dev / indie hacker, tight budget	Claude Code (Pro $20) or Codex CLI (Plus)	Only two with a real low-cost on-ramp; Grok Build has no individual tier
Production CI/CD automation, stability-critical	Claude Code or Codex CLI	Both production-hardened; Grok Build is early beta. Model Claude's post-June-15 metered cost first
Automation-heavy Claude SDK / GitHub Actions usage	Re-evaluate before June 15	New Agent SDK credit pool meters programmatic use at full API rate; Codex CLI may be cheaper at scale
Large codebase, whole-repo reasoning	Claude Code long-context (1M Opus 4.7) or Grok Build (if Heavy budget)	Opus 4.7's 1M long-context is the production-proven path; Grok Build's documented 256K context + 8 parallel subagents is architecturally different but newer — pilot for 2 weeks before standardizing
Want plan approval before any file changes	Grok Build	Plan Mode is on by default; you edit/rewrite the plan up front
Want a second-opinion code reviewer in the loop	Codex CLI	Dedicated separate review agent critiques the diff pre-commit
Highest single-pass benchmark accuracy	Codex CLI or Claude Code	~88% vendor-reported SWE-bench, effectively tied; Grok trails at 70.8%
Funded team, multi-agent orchestration ambitions	Grok Build (trial)	Native parallel subagents + full ACP support; pilot for 2 weeks and measure delivery throughput
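The matrix above can be read as a filter followed by a preference ordering. The sketch below encodes a deliberately simplified subset of it: budget and production-CI needs rule Grok Build out, and the two review preferences reorder whatever remains. The class and field names are hypothetical, and the rules are one reading of the table, not a complete decision procedure.

```python
from dataclasses import dataclass

AGENTS = ["Claude Code", "Codex CLI", "Grok Build"]
GROK_MIN_MONTHLY_USD = 99.0  # intro SuperGrok Heavy price; no cheaper tier exists


@dataclass
class TeamProfile:
    budget_per_dev_usd: float
    needs_production_ci: bool = False
    wants_plan_approval_first: bool = False
    wants_diff_reviewer: bool = False


def shortlist(profile: TeamProfile) -> list[str]:
    """Return candidate agents, most preferred first (simplified rules)."""
    picks = list(AGENTS)
    # Grok Build is early beta and has no individual tier.
    if profile.budget_per_dev_usd < GROK_MIN_MONTHLY_USD or profile.needs_production_ci:
        picks.remove("Grok Build")
    # Stable sorts: the preferred agent moves to the front, others keep order.
    if profile.wants_diff_reviewer:
        picks.sort(key=lambda agent: agent != "Codex CLI")
    if profile.wants_plan_approval_first and "Grok Build" in picks:
        picks.sort(key=lambda agent: agent != "Grok Build")
    return picks


if __name__ == "__main__":
    print(shortlist(TeamProfile(budget_per_dev_usd=20)))
    print(shortlist(TeamProfile(budget_per_dev_usd=300, wants_plan_approval_first=True)))
```

A solo developer on a $20 budget gets Claude Code and Codex CLI only, while a funded team that wants plan approval first sees Grok Build at the top of the list as a pilot candidate.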

So which coding agent should you actually use?

The verdict, by profile:

Default / safest choice: Claude Code. It is the most mature, the most stable in production, has a low-cost entry point ($20 Pro), and Opus 4.7's self-verification reduces the worst agentic failure mode. The one asterisk: if your usage is automation-heavy, price the June 15 metered-credit change before you standardize on it.
Best raw scores + built-in review: Codex CLI. It leads vendor-reported SWE-bench (88.7%) and Terminal-Bench 2.0 (82%+), and its separate review agent is a genuine production safety feature. Watch usage-window depletion under heavy continuous load.
Best for funded teams who want plan control and parallelism: Grok Build, as a pilot. Plan Mode by default, up to 8 native parallel subagents, and a documented 256K-token context on its purpose-built grok-build-0.1 model are architecturally distinct. But it is early beta with no cheap tier and a lower published benchmark score. Trial it for two weeks against your real workload; do not put it in production CI yet.

The meta-point: in 2026 the model gap between the top agents is small enough that workflow fit and cost structure, not benchmark deltas, should drive the decision. Run a two-week bake-off on your own codebase with your own definition of "done" — that signal beats any vendor-reported number in this article.
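One way to keep such a bake-off honest is to record every task attempt and compute the same two numbers for each agent: pass rate against your own definition of "done", and cost per passing task. The sketch below assumes you log attempts as (agent, task id, passed, cost in USD) tuples. The record shape and the sample rows are hypothetical placeholders, not results.

```python
from collections import defaultdict
from typing import Iterable, Optional


def summarize(results: Iterable[tuple[str, str, bool, float]]) -> dict[str, dict[str, Optional[float]]]:
    """Aggregate (agent, task_id, passed, cost_usd) records per agent."""
    stats = defaultdict(lambda: {"tasks": 0, "passed": 0, "cost": 0.0})
    for agent, _task_id, passed, cost_usd in results:
        entry = stats[agent]
        entry["tasks"] += 1
        entry["passed"] += int(passed)
        entry["cost"] += cost_usd

    summary = {}
    for agent, entry in stats.items():
        summary[agent] = {
            "pass_rate": entry["passed"] / entry["tasks"],
            # Undefined when nothing passed, so report None instead of dividing by zero.
            "cost_per_pass": entry["cost"] / entry["passed"] if entry["passed"] else None,
        }
    return summary


if __name__ == "__main__":
    # Placeholder rows only; replace with attempts logged from your own repository.
    sample = [
        ("agent-a", "task-1", True, 2.0),
        ("agent-a", "task-2", False, 1.0),
        ("agent-b", "task-1", True, 4.0),
    ]
    for agent, row in summarize(sample).items():
        print(agent, row)
```

Failed attempts still count toward cost, so an agent that passes less often but burns budget on retries shows up with a higher cost per pass rather than hiding behind a low sticker price.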

Who helps teams adopt agentic coding tooling well?

Standing up a coding agent across a team — harness configuration, budget guardrails, CI integration, and an honest evaluation against your codebase — is real engineering work, and getting it wrong is expensive. If you are hiring vetted remote developers experienced with agentic coding tooling, evals, and developer-platform work, Codersera matches you with engineers who have shipped this in production, with a risk-free trial so you can validate technical fit before committing.

FAQ

Is Grok Build better than Claude Code?

Not on raw benchmarks today. The 70.8% SWE-bench Verified figure circulating for Grok Build was reported on xAI's earlier grok-code-fast-1 (deprecated 15 May 2026) versus 87.6% for Claude Code on Opus 4.7. The production Grok Build CLI now runs on the purpose-built grok-build-0.1 (released 20 May 2026), which xAI has not yet benchmarked publicly on SWE-bench Verified. Grok Build competes on architecture instead: a documented 256K context window, up to 8 parallel subagents, and Plan Mode on by default. For production stability and a cheap entry tier, Claude Code is still the safer pick; Grok Build is early beta.

What changes with Claude Code billing on June 15, 2026?

Programmatic Claude usage — the Agent SDK, claude -p, the GitHub Actions integration, and third-party apps — moves to a separate monthly Agent SDK credit pool billed at full API rates ($20 credit on Pro, $100 on Max 5x, $200 on Max 20x, all vendor-reported). When the credit is exhausted, SDK calls fail unless you enable pay-as-you-go extra usage at API list price. Interactive terminal Claude Code and chat use still count against normal subscription limits.

Which coding agent has the highest SWE-bench score?

On vendor-reported SWE-bench Verified, Codex CLI on GPT-5.5 leads at 88.7%, Claude Code on Opus 4.7 is essentially tied at 87.6%, and Grok Build's coder reports 70.8%. The ~1-point gap between Codex and Claude is inside harness noise — each vendor runs its own evaluation harness — so it should not be the deciding factor.

Does Grok Build have a cheap individual plan?

No. There is no $20-equivalent tier. Grok Build access is bundled into SuperGrok Heavy: roughly $99/month for the first six months (introductory), then about $299–$300/month. That pricing targets funded teams, not solo developers — who are better served by Claude Code's $20 Pro tier or Codex CLI on a ChatGPT plan.

What is Plan Mode and which agents have it?

Plan Mode makes the agent write an explicit plan — files to modify, commands to run, intermediate checks — and pause for your approval before changing anything. Grok Build ships it on by default and lets you comment on or rewrite individual steps. Claude Code offers Plan Mode and adds self-verification on long tasks. Codex CLI takes the inverse approach: it gates the resulting diff with a separate review agent rather than the plan up front.

Can I use Grok Build in production CI/CD yet?

Not recommended as of mid-May 2026. Grok Build launched in early beta on May 14, 2026 and is not production-hardened; reviewers consistently flag it as a tool to trial alongside, not replace, a production agent. For CI/CD automation, Claude Code and Codex CLI remain the safer choices until Grok Build reaches general availability.

Which agent is best for large codebases?

Claude Code's Opus 4.7 long-context mode (up to 1M tokens) is the production-proven path. Grok Build's grok-build-0.1 ships a documented 256K-token context plus up to 8 parallel subagents — architecturally distinct, but smaller per-call than Claude Code's long-context mode and still early beta. Verify either on a two-week pilot with your actual repository rather than trusting the context-window number alone.