Grok Build vs Claude Code vs Codex CLI: Which Coding Agent Wins in 2026?

Grok Build, Claude Code, and Codex CLI compared on benchmarks, pricing, and workflow — including Anthropic's June 15 metered-credit change.

Quick answer. For raw benchmark scores Codex CLI (GPT-5.5) leads on vendor-reported SWE-bench Verified at 88.7%, Claude Code (Opus 4.7) follows at 87.6%, and Grok Build trails at 70.8% but ships a 2M-token context and 8 parallel subagents (both xAI-reported). Claude Code is the maturity pick; budget-driven teams should weigh Anthropic's June 15 metered-credit change before committing.

By mid-May 2026 the dominant question in engineering channels is not whether to run a terminal coding agent — it is which one. Three serious contenders now live in the same shell prompt: Anthropic's Claude Code, OpenAI's Codex CLI, and — as of May 14, 2026 — xAI's Grok Build. All three plan, edit files, run commands, and iterate against your codebase autonomously. They differ sharply on benchmark scores, on how they bill you, and on what their agentic loop actually feels like to work with.

This is a senior engineer's comparison: labeled benchmarks (no fabricated numbers), a cost model breakdown that flags Anthropic's June 15 billing overhaul, the workflow differences that matter day to day, and a decision matrix for picking the right one per use case. Where neutral third-party data does not exist yet (Grok Build is days old), we say so and give a reasoned framework instead of a number.

Why does the three-way coding-agent race matter now?

For most of 2025 the terminal-agent space was effectively a two-horse race: Claude Code and Codex CLI. Cursor and others lived in the editor; the CLI tier was Anthropic versus OpenAI. Three things changed that in a single month:

  1. xAI shipped Grok Build (May 14, 2026). An early-beta agentic CLI built on Grok 4.3 beta, with Plan Mode and up to eight parallel subagents by default — a credible third entrant, not a toy (xAI announcement).
  2. Anthropic announced a June 15, 2026 billing change that meters programmatic Claude usage separately from interactive use — a material cost shift for any team running Claude Code in automation (The Decoder).
  3. The benchmark gap narrowed. Codex CLI on GPT-5.5 and Claude Code on Opus 4.7 are now within ~1 point of each other on vendor-reported SWE-bench Verified, so the decision is no longer "pick the highest number."

The net effect: choosing a coding agent in 2026 is a real procurement decision with cost, lock-in, and workflow consequences — not a coin flip between two near-identical tools.

What are the three contenders, exactly?

Agent | Vendor | Default model | Launched / status | Context window | Signature feature
Claude Code | Anthropic | Claude Opus 4.7 | GA, mature | ~200K (1M on Opus 4.7 long-context) | Plan Mode + self-verification on long-horizon tasks
Codex CLI | OpenAI | GPT-5.5 | GA, mature | Large (GPT-5.5 class) | Auto-review agent + omnimodal model
Grok Build | xAI | Grok 4.3 beta | Early beta (May 14, 2026) | 2M tokens (vendor-reported) | Plan Mode by default + up to 8 parallel subagents

Claude Code is the incumbent most teams already run. It is the most mature of the three: stable in CI/CD, well-documented, broad ecosystem. Opus 4.7 added self-verification on long-running agentic tasks, which materially reduces "the agent confidently shipped a broken change" failures.

Codex CLI is OpenAI's terminal agent on GPT-5.5. Its differentiator in 2026 is a separate built-in review agent that critiques your diff before you commit, plus a natively omnimodal backing model (text, images, audio, video in one architecture).

Grok Build is the newcomer. It leads with Plan Mode on by default (Claude Code's most-requested feature), native parallel subagents, full Agent Coordination Protocol (ACP) support, and a vendor-reported 2M-token context window. It is early beta — gated to SuperGrok Heavy subscribers — and unproven in production (DevOps.com).

How do they compare on benchmarks?

Every number below is vendor-reported unless stated otherwise. Treat them as directional, not gospel — each vendor runs its own harness, and harness differences alone can move SWE-bench by several points. There is no neutral, audited head-to-head for all three agents as of mid-May 2026; Grok Build is too new for independent evaluation.

Benchmark | Claude Code (Opus 4.7) | Codex CLI (GPT-5.5) | Grok Build (Grok 4.3 / grok-code-fast-1) | Source label
SWE-bench Verified | 87.6% | 88.7% | 70.8% | All vendor-reported
Terminal-Bench 2.0 | 69.4% | 82.0%–82.7% | Not separately reported | Vendor-reported; Grok unverified

Reading the table honestly:

  • SWE-bench Verified: Codex CLI (88.7%, OpenAI-reported) edges Claude Code (87.6%, Anthropic-reported) by ~1.1 points — within harness noise. Grok Build's underlying coder posts 70.8% on xAI's internal harness, a real gap on this metric (Grey Journal, Vellum on Opus 4.7).
  • Terminal-Bench 2.0: Codex CLI/GPT-5.5 reports a strong 82%+ (OpenAI-reported); Opus 4.7 reports 69.4% (Anthropic-reported). xAI has not published a comparable Terminal-Bench 2.0 number for Grok Build, so we exclude it rather than guess.

The honest takeaway: on the benchmarks vendors have published, Codex CLI and Claude Code are essentially tied on SWE-bench and Codex leads on Terminal-Bench. Grok Build's raw score is lower, but it is competing on architecture (context size + parallelism), not single-pass benchmark accuracy. Do not pick on the 1-point SWE-bench gap; it will not survive a different harness.
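To make the "within harness noise" reading concrete, the sketch below compares the vendor-reported scores against a noise margin. The 3-point margin is an assumption chosen only to stand in for "several points"; it is not a measured value, and the function name is hypothetical.

```python
# Vendor-reported SWE-bench Verified scores from the table above (percent).
SWE_BENCH_VERIFIED = {
    "Codex CLI (GPT-5.5)": 88.7,
    "Claude Code (Opus 4.7)": 87.6,
    "Grok Build": 70.8,
}

# ASSUMPTION: 3 points is an illustrative stand-in for "several points" of
# harness noise; it is not a measured figure.
ASSUMED_HARNESS_NOISE = 3.0


def within_noise(a: str, b: str, noise: float = ASSUMED_HARNESS_NOISE) -> bool:
    """Return True when the score gap between two agents is at most `noise`."""
    gap = abs(SWE_BENCH_VERIFIED[a] - SWE_BENCH_VERIFIED[b])
    return gap <= noise


names = list(SWE_BENCH_VERIFIED)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        verdict = "tied (within assumed noise)" if within_noise(a, b) else "real gap"
        print(f"{a} vs {b}: {verdict}")
```

Under that assumed margin, only the Codex CLI and Claude Code pair counts as tied; either comparison against Grok Build remains a real gap on this metric.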

How do the cost and usage models compare?

This is where the decision actually gets made for most teams, and it is the section that changed most in May 2026.

Plan | Claude Code | Codex CLI | Grok Build
Entry tier | $20/mo (Pro), Claude Code in terminal | ChatGPT Plus tier | None (no individual tier)
Mid tier | $100/mo (Max 5x) | ChatGPT Pro $100 (promo: 10x Codex through May 31) | $99/mo intro (first 6 months)
High tier | $200/mo (Max 20x) | ChatGPT Pro $200 (20x) | $299–$300/mo (SuperGrok Heavy, post-promo)
Reported real-world spend | Plan-bounded; API ~$5/$25 per M tok if metered | ~$100–$200/dev/mo (high variance) | Bundled into Heavy subscription only

Sources: Anthropic pricing, OpenAI Codex pricing, Martin Cid on Grok Build pricing.
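Because the Grok Build price steps up after an introductory period, a monthly sticker comparison can mislead. The sketch below totals twelve months per seat from the table's list prices; the function name is hypothetical, and taxes, seat discounts, and usage overages are deliberately ignored.

```python
def first_year_cost(intro_price: float, intro_months: int, regular_price: float) -> float:
    """Total 12-month subscription cost with an optional introductory period."""
    intro_months = min(intro_months, 12)
    return intro_price * intro_months + regular_price * (12 - intro_months)


# List prices from the pricing table above (USD per month, per seat).
print(first_year_cost(99, 6, 299))   # Grok Build, low end  -> 2388
print(first_year_cost(99, 6, 300))   # Grok Build, high end -> 2394
print(first_year_cost(20, 0, 20))    # Claude Code Pro      -> 240
print(first_year_cost(200, 0, 200))  # Claude Code Max 20x  -> 2400
```

On these list prices alone, the first year of SuperGrok Heavy lands close to a year of Max 20x. The difference appears in year two, when the full $299–$300 monthly rate applies for all twelve months.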

Three structural differences matter more than the headline numbers:

  1. Grok Build has no cheap on-ramp. There is no $20 tier. Access is bundled into SuperGrok Heavy — $99/month for the first six months, then $299–$300/month. That price effectively rules out the solo developer and targets funded teams treating it as a productivity line item (Grey Journal).
  2. Codex CLI usage limits bite under heavy load. The list price is friendly, but multiple users report 5-hour usage windows depleting fast under continuous agentic work — budget for variance, not the sticker (openai/codex#19571).
  3. Claude Code's billing changes on June 15, 2026. Read the next section before you standardize on it.

What is the Claude Code June 15 billing change?

From June 15, 2026, Anthropic splits Claude billing. Programmatic usage — the Claude Agent SDK, the claude -p headless command, the Claude Code GitHub Actions integration, and third-party apps — draws from a new, separate monthly Agent SDK credit pool, billed at full API rates with no subscription discount. Interactive use (chatting with Claude, and running Claude Code interactively in the terminal) still counts against normal subscription limits (InfoWorld, The Decoder).
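For an internal usage audit, the split described above can be written down as a simple lookup. This is a minimal sketch: the surface labels are hypothetical names for the categories in the text, not identifiers from any Anthropic API.

```python
from enum import Enum


class BillingPool(Enum):
    SUBSCRIPTION = "normal subscription limits"
    AGENT_SDK_CREDIT = "Agent SDK credit pool (full API rates)"


# Hypothetical labels for the usage surfaces named in the text.
PROGRAMMATIC = {"agent_sdk", "claude_p_headless", "github_actions", "third_party_app"}
INTERACTIVE = {"chat", "interactive_terminal"}


def billing_pool(surface: str) -> BillingPool:
    """Map a usage surface to the pool it draws from after June 15, 2026."""
    if surface in PROGRAMMATIC:
        return BillingPool.AGENT_SDK_CREDIT
    if surface in INTERACTIVE:
        return BillingPool.SUBSCRIPTION
    raise ValueError(f"unknown usage surface: {surface!r}")


for surface in sorted(PROGRAMMATIC | INTERACTIVE):
    print(f"{surface}: {billing_pool(surface).value}")
```

Tagging each job in your pipeline with one of these labels is a quick way to see how much of your footprint moves to the metered pool.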

Subscription | New Agent SDK monthly credit (programmatic) | Behavior at exhaustion
Pro ($20) | $20 credit | SDK calls fail unless extra usage is toggled on
Max 5x ($100) | $100 credit | Same; extra usage is pay-as-you-go at API list price, no discount
Max 20x ($200) | $200 credit | Same

Vendor-reported credit figures via The New Stack. The practical impact is concentrated on teams running Claude Code in CI/CD, scheduled jobs, or via the SDK at scale: after June 15 that usage is metered at full API price once the modest credit pool drains. Interactive, human-in-the-loop terminal use is unaffected. If your Claude Code footprint is mostly automation, model the post-June-15 cost now — it can be a multiple of today's effective rate for automation-heavy workloads.
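A rough way to model the post-June-15 cost is to value programmatic usage at API list price and subtract the plan's credit. The sketch below assumes the pricing table's "~$5/$25 per M tok" means input and output rates, and it ignores caching and batch discounts. The workload figures and all names are hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: approximate API list prices in USD per million tokens, reading
# the pricing table's "~$5/$25 per M tok" as input/output rates.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0

# Monthly Agent SDK credit per plan, from the table above (USD).
AGENT_SDK_CREDIT = {"Pro": 20.0, "Max 5x": 100.0, "Max 20x": 200.0}


@dataclass
class ProgrammaticEstimate:
    metered_usd: float  # programmatic usage valued at API list price
    covered_usd: float  # portion absorbed by the monthly credit
    overage_usd: float  # billed pay-as-you-go when extra usage is enabled
    blocked_usd: float  # usage that would fail when extra usage is off


def estimate_programmatic(plan: str, input_mtok: float, output_mtok: float,
                          extra_usage: bool = True) -> ProgrammaticEstimate:
    """Estimate one month of programmatic usage against the plan's credit."""
    metered = input_mtok * INPUT_USD_PER_MTOK + output_mtok * OUTPUT_USD_PER_MTOK
    covered = min(metered, AGENT_SDK_CREDIT[plan])
    excess = metered - covered
    if extra_usage:
        return ProgrammaticEstimate(metered, covered, excess, 0.0)
    return ProgrammaticEstimate(metered, covered, 0.0, excess)


# Hypothetical CI workload: 40M input and 8M output tokens in a month on Max 5x.
est = estimate_programmatic("Max 5x", input_mtok=40, output_mtok=8)
print(f"metered ${est.metered_usd:.0f}, credit covers ${est.covered_usd:.0f}, "
      f"overage ${est.overage_usd:.0f}")
# prints: metered $400, credit covers $100, overage $300
```

In this hypothetical, the overage is billed on top of the $100 subscription. With extra usage toggled off, the same $300 of work would fail instead of being billed.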

How do the workflows and UX differ?

Benchmarks measure the model; UX decides whether engineers actually keep the tool. The three diverge most here.

Dimension | Claude Code | Codex CLI | Grok Build
Plan before execution | Plan Mode available; self-verifies on long tasks | Plans internally; auto-review agent gates the diff | Plan Mode on by default; edit/comment/rewrite the plan first
Parallelism | Subagents supported | Subagents for parallel tasks | Up to 8 parallel subagents, native, by default
Review loop | Self-verification before reporting back | Dedicated separate Codex reviewer pre-commit | Plan-level review up front; ACP for multi-agent orchestration
Maturity | Highest; stable in production / CI | High; production-ready | Early beta; not production-hardened
Platform | Broad, incl. Windows | Broad | Beta constraints; verify your platform

The defining workflow contrasts:

  • Grok Build front-loads control. It writes a plain-English plan listing files to touch, commands to run, and intermediate checks — and you approve, comment on individual steps, or rewrite it before a single line changes. This is the single most-requested Claude Code feature, shipped on by default (Pasquale Pillitteri).
  • Codex CLI back-loads control. Its differentiator is a separate review agent that critiques the diff before you commit — a second opinion on the output rather than approval of the plan (OpenAI Codex CLI docs).
  • Claude Code balances both with Plan Mode plus Opus 4.7's self-verification on long-horizon tasks — and wins on the boring thing that matters most in production: it is the most mature and stable of the three.

Which coding agent wins for which use case?

There is no single winner — the right answer depends on team size, budget, and how much of your usage is automated versus interactive. The decision matrix:

Your situation | Best pick | Why
Solo dev / indie hacker, tight budget | Claude Code (Pro $20) or Codex CLI (Plus) | Only two with a real low-cost on-ramp; Grok Build has no individual tier
Production CI/CD automation, stability-critical | Claude Code or Codex CLI | Both production-hardened; Grok Build is early beta. Model Claude's post-June-15 metered cost first
Automation-heavy Claude SDK / GitHub Actions usage | Re-evaluate before June 15 | New Agent SDK credit pool meters programmatic use at full API rate; Codex CLI may be cheaper at scale
Large codebase, whole-repo reasoning | Grok Build (if Heavy budget) or Claude Code long-context | Grok Build's vendor-reported 2M context + 8 subagents is architecturally different; verify on a 2-week trial
Want plan approval before any file changes | Grok Build | Plan Mode is on by default; you edit/rewrite the plan up front
Want a second-opinion code reviewer in the loop | Codex CLI | Dedicated separate review agent critiques the diff pre-commit
Highest single-pass benchmark accuracy | Codex CLI or Claude Code | ~88% vendor-reported SWE-bench, effectively tied; Grok trails at 70.8%
Funded team, multi-agent orchestration ambitions | Grok Build (trial) | Native parallel subagents + full ACP support; pilot for 2 weeks and measure delivery throughput
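Several rows of the matrix can apply to one team at once, so it may help to treat it as a set of independent rules rather than a single lookup. The sketch below encodes five of the rows; the profile fields and the fallback message are hypothetical simplifications, not part of any vendor tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamProfile:
    # Hypothetical flags, each mirroring one row of the matrix above.
    tight_budget: bool = False
    automation_heavy_claude: bool = False
    wants_plan_approval: bool = False
    wants_diff_reviewer: bool = False
    large_codebase_with_heavy_budget: bool = False


def recommend(profile: TeamProfile) -> list[str]:
    """Return every matrix pick that matches the profile, in table order."""
    picks = []
    if profile.tight_budget:
        picks.append("Claude Code (Pro $20) or Codex CLI (Plus)")
    if profile.automation_heavy_claude:
        picks.append("Re-evaluate before June 15")
    if profile.large_codebase_with_heavy_budget:
        picks.append("Grok Build (trial) or Claude Code long-context")
    if profile.wants_plan_approval:
        picks.append("Grok Build")
    if profile.wants_diff_reviewer:
        picks.append("Codex CLI")
    # ASSUMPTION: with no matching row, fall back to a bake-off.
    return picks or ["Run a two-week bake-off on your own codebase"]


print(recommend(TeamProfile(tight_budget=True, wants_diff_reviewer=True)))
```

When the returned picks disagree, that conflict is the signal to trial the candidates side by side rather than to pick from the table.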

So which coding agent should you actually use?

The verdict, by profile:

  • Default / safest choice: Claude Code. It is the most mature, the most stable in production, has a low-cost entry point ($20 Pro), and Opus 4.7's self-verification reduces the worst agentic failure mode. The one asterisk: if your usage is automation-heavy, price the June 15 metered-credit change before you standardize on it.
  • Best raw scores + built-in review: Codex CLI. It leads vendor-reported SWE-bench (88.7%) and Terminal-Bench 2.0 (82%+), and its separate review agent is a genuine production safety feature. Watch usage-window depletion under heavy continuous load.
  • Best for funded teams who want plan control and parallelism: Grok Build — as a pilot. Plan Mode by default, 8 native subagents, and a vendor-reported 2M-token context are architecturally distinct. But it is early beta with no cheap tier and a lower benchmark score. Trial it for two weeks against your real workload; do not put it in production CI yet.

The meta-point: in 2026 the model gap between the top agents is small enough that workflow fit and cost structure, not benchmark deltas, should drive the decision. Run a two-week bake-off on your own codebase with your own definition of "done" — that signal beats any vendor-reported number in this article.
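A bake-off only produces a usable signal if results are recorded the same way for every agent. The sketch below is one minimal way to do that, assuming you log each task with a pass/fail against your own definition of "done" and an attributed cost. The record fields and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    agent: str
    done: bool       # met your team's own definition of "done"
    cost_usd: float  # spend attributed to this task


def summarize(results):
    """Aggregate per-agent pass rate and cost per completed task."""
    totals = {}
    for r in results:
        t = totals.setdefault(r.agent, {"tasks": 0, "done": 0, "cost_usd": 0.0})
        t["tasks"] += 1
        t["done"] += int(r.done)
        t["cost_usd"] += r.cost_usd
    for t in totals.values():
        t["pass_rate"] = t["done"] / t["tasks"]
        # With no completed tasks, cost per completed task is undefined.
        t["cost_per_done"] = t["cost_usd"] / t["done"] if t["done"] else None
    return totals


# Hypothetical results; replace with your own bake-off log.
sample = [
    TaskResult("agent_a", True, 2.0),
    TaskResult("agent_a", False, 1.0),
    TaskResult("agent_b", True, 4.0),
    TaskResult("agent_b", True, 2.0),
]
for agent, stats in summarize(sample).items():
    print(agent, stats["pass_rate"], stats["cost_per_done"])
```

Cost per completed task is tracked alongside pass rate because failed attempts still cost money, which a pass rate alone hides.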

FAQ

Is Grok Build better than Claude Code?

Not on raw benchmarks today. Grok Build's underlying coder posts 70.8% on xAI's internal SWE-bench Verified harness versus 87.6% for Claude Code on Opus 4.7 (both vendor-reported). Grok Build competes on architecture instead: a vendor-reported 2M-token context, up to 8 parallel subagents, and Plan Mode on by default. For production stability and a cheap entry tier, Claude Code is still the safer pick; Grok Build is early beta.

What changes with Claude Code billing on June 15, 2026?

Programmatic Claude usage — the Agent SDK, claude -p, the GitHub Actions integration, and third-party apps — moves to a separate monthly Agent SDK credit pool billed at full API rates ($20 credit on Pro, $100 on Max 5x, $200 on Max 20x, all vendor-reported). When the credit is exhausted, SDK calls fail unless you enable pay-as-you-go extra usage at API list price. Interactive terminal Claude Code and chat use still count against normal subscription limits.

Which coding agent has the highest SWE-bench score?

On vendor-reported SWE-bench Verified, Codex CLI on GPT-5.5 leads at 88.7%, Claude Code on Opus 4.7 is essentially tied at 87.6%, and Grok Build's coder reports 70.8%. The ~1-point gap between Codex and Claude is inside harness noise — each vendor runs its own evaluation harness — so it should not be the deciding factor.

Does Grok Build have a cheap individual plan?

No. There is no $20-equivalent tier. Grok Build access is bundled into SuperGrok Heavy: roughly $99/month for the first six months (introductory), then about $299–$300/month. That pricing targets funded teams, not solo developers — who are better served by Claude Code's $20 Pro tier or Codex CLI on a ChatGPT plan.

What is Plan Mode and which agents have it?

Plan Mode makes the agent write an explicit plan — files to modify, commands to run, intermediate checks — and pause for your approval before changing anything. Grok Build ships it on by default and lets you comment on or rewrite individual steps. Claude Code offers Plan Mode and adds self-verification on long tasks. Codex CLI takes the inverse approach: it gates the resulting diff with a separate review agent rather than the plan up front.

Can I use Grok Build in production CI/CD yet?

Not recommended as of mid-May 2026. Grok Build launched in early beta on May 14, 2026 and is not production-hardened; reviewers consistently flag it as a tool to trial alongside, not replace, a production agent. For CI/CD automation, Claude Code and Codex CLI remain the safer choices until Grok Build reaches general availability.

Which agent is best for large codebases?

For whole-repo reasoning, Grok Build's vendor-reported 2M-token context plus 8 parallel subagents is architecturally the most ambitious — if you have the SuperGrok Heavy budget and can validate it on a trial. Claude Code's Opus 4.7 long-context mode (up to 1M tokens) is the production-proven alternative. Verify either on a two-week pilot with your actual repository rather than trusting the context-window number alone.