AI Agent Benchmarks 2026: Who Leads SWE-bench & GAIA

Quick answer. No single model wins every benchmark in May 2026. Claude Mythos Preview leads SWE-bench Verified (93.9%), GPQA Diamond (94.6%), CharXiv (93.2%), and HLE (64.7%). It also leads SWE-bench Pro, the contamination-resistant coding benchmark, at 77.8%, with Claude Opus 4.7 second at 64.3%. GPT-5.5 tops the Artificial Analysis Intelligence Index (60) and Terminal-Bench 2.0 (82.7%). GPT-5.2 Pro leads GDPval at 74.1% win+tie versus human experts. Gemini 3.5 Flash leads MCP Atlas tool-use (83.6%). How far you should trust a score depends on whether the benchmark is contamination-resistant, scaffolding-controlled, and aligned to your actual workflow.

The May 2026 benchmark landscape is messy on purpose. Five labs are shipping a new frontier model every six weeks, three different scaffolds can swing the same model 30 points up or down, and at least two of the headline benchmarks have been quietly retired by their original maintainers for contamination. This piece is a buyer's-side map of where the scores actually live, who leads what, and which numbers you should pay attention to when picking a model in 2026.

We cover 16 benchmarks across coding, reasoning, multimodal, and agent work: SWE-bench Verified and Pro, Terminal-Bench 2.0, Aider Polyglot, LiveCodeBench Pro, USAMO 2026, GPQA Diamond, MMLU-Pro, GAIA, GDPval, MCP Atlas, WebArena, τ²-Bench Telecom, Humanity's Last Exam, MMMU-Pro, and CharXiv. Where leaderboards disagree, we list the spread.

How do coding benchmarks rank in May 2026?

Coding is the most contested benchmark category because it's the easiest to monetize. It's also where contamination caused the deepest reset.

Benchmark	Leader	Top score	What it actually tests
SWE-bench Verified	Claude Mythos Preview	93.9%	500 hand-curated Python PRs; high contamination
SWE-bench Pro	Claude Mythos Preview	77.8%	1,865 tasks incl. private repos; contamination-resistant
SWE-rebench	Claude Opus 4.6	65.3%	Fresh post-cutoff GitHub issues, auto-mined
Terminal-Bench 2.0	GPT-5.5	82.7%	89 multi-step Docker terminal tasks
Aider Polyglot	Claude Opus 4.5	89.4%	225 hard Exercism problems, 6 languages
LiveCodeBench Pro	Gemini 3.1 Pro	2887 Elo	Live competition problems (LeetCode/AtCoder/CF)
HumanEval	Saturated	93-95% cluster	Retired as a frontier differentiator

SWE-bench Verified: still quoted, but broken

SWE-bench Verified is the 500-task Python PR benchmark that built the "agentic coding" category. As of May 28, 2026, the top five are Claude Mythos Preview (93.9%), GPT-5.5 (88.7% on Marc0.dev's snapshot), Claude Opus 4.8 (88.6%), Claude Opus 4.7 Adaptive (87.6%), and GPT-5.3-Codex (85.0%).

The problem is that contamination is everywhere. OpenAI's internal audit found that frontier models could reproduce verbatim gold patches for some tasks, and that 59.4% of the hardest unsolved problems had flawed test cases. There's also an active reward-hacking vector: agents can drop a conftest.py at the repo root that survives the reset and overwrites test outcomes via a pytest hook. A BenchJack audit found roughly 24% of tasks were affected by future-commit leakage when checked with Gemini 3 Flash. OpenAI stopped reporting Verified scores in early 2026 and now recommends Pro.
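One way a harness could guard against the conftest.py vector described above is to snapshot every conftest.py before the agent runs and flag any that appear or change afterwards. The sketch below is a minimal illustration under that assumption; `snapshot_conftests` and `changed_conftests` are hypothetical helper names, not part of SWE-bench or pytest, and the temporary directory merely stands in for a task repository.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot_conftests(repo_root: Path) -> dict[str, str]:
    """Map each conftest.py (path relative to the repo) to a SHA-256 of its contents."""
    return {
        str(p.relative_to(repo_root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(repo_root.rglob("conftest.py"))
    }


def changed_conftests(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return conftest.py paths that were added or modified between two snapshots."""
    return sorted(path for path, digest in after.items() if before.get(path) != digest)


# Demo on a throwaway directory standing in for a task repository.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "tests").mkdir()
    (repo / "tests" / "conftest.py").write_text("# legitimate fixtures\n")
    before = snapshot_conftests(repo)

    # Simulate an agent dropping a new conftest.py at the repo root.
    (repo / "conftest.py").write_text("# injected hook\n")
    after = snapshot_conftests(repo)

    suspicious = changed_conftests(before, after)
    print(suspicious)
```

This only catches added or edited files; a real harness would likely also need to handle deletions and other pytest plugin entry points.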

SWE-bench Pro: the honest coding leaderboard

Scale AI built SWE-bench Pro to fix Verified's contamination. It's 1,865 tasks across 41 actively maintained repos in Python, Go, TypeScript, and JavaScript. Every task requires at least 10 lines of changes (avg 107 lines across 4.1 files). The killer feature: a chunk of tasks come from private proprietary startup codebases that have never been publicly available, so contamination is legally prevented.

The top five on the Pro public dataset (May 21, 2026 snapshot):

Claude Mythos Preview — 77.8%
Claude Opus 4.7 (Adaptive) — 64.3%
Qwen3.7 Max — 60.6%
GPT-5.4 (xHigh) — 59.1%
GPT-5.3-Codex (agent system) — 56.8%

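The delta between Verified and Pro is the "real" coding capability gap. For the models quoted on both lists here, it runs from about 16 points (Claude Mythos Preview, 93.9% to 77.8%) to about 28 points (GPT-5.3-Codex, 85.0% to 56.8%), and models that score 85%+ on Verified land at roughly 57-78% on Pro. Independent harness audits (Blitzy reported 66.5% on Pro) corroborate the order. If a vendor only quotes Verified, that's a yellow flag in 2026.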
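For the three models this article quotes on both the Verified and Pro lists, the per-model drop can be worked out directly. The snippet below is just that arithmetic; note that the GPT-5.3-Codex Pro figure is listed as an agent-system run, so the two numbers for that model may not share a scaffold.

```python
# Scores quoted in this article for models that appear on both lists (percent).
verified = {
    "Claude Mythos Preview": 93.9,
    "Claude Opus 4.7 (Adaptive)": 87.6,
    "GPT-5.3-Codex": 85.0,
}
pro = {
    "Claude Mythos Preview": 77.8,
    "Claude Opus 4.7 (Adaptive)": 64.3,
    "GPT-5.3-Codex": 56.8,
}

# Drop from Verified to Pro in percentage points, rounded to one decimal.
gaps = {model: round(verified[model] - pro[model], 1) for model in verified}

for model, gap in gaps.items():
    print(f"{model}: {verified[model]}% -> {pro[model]}% (drop of {gap} pts)")
```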

SWE-rebench catches inflated scores

SWE-rebench takes a different anti-contamination approach: it auto-mines fresh GitHub issues that post-date the training cutoff of the model being evaluated. Over 21,000 interactive Python SWE tasks rotate through the pipeline. As of May 25, 2026, the top three are Claude Opus 4.6 (65.3%), GLM-5 (62.8%), and GLM-5.1 (62.7%).
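The core idea, keeping only tasks that post-date the evaluated model's training cutoff, can be sketched in a few lines. This is an illustration of the filtering principle only, not SWE-rebench's actual pipeline; the `Task` type, the task IDs, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    task_id: str
    issue_created: date  # when the GitHub issue was opened


def fresh_tasks(tasks: list[Task], training_cutoff: date) -> list[Task]:
    """Keep only tasks whose issue was opened strictly after the model's cutoff."""
    return [t for t in tasks if t.issue_created > training_cutoff]


# Hypothetical task pool and cutoff, for illustration only.
pool = [
    Task("repo-a#101", date(2025, 11, 3)),
    Task("repo-b#57", date(2026, 2, 14)),
    Task("repo-c#912", date(2026, 4, 30)),
]
cutoff = date(2026, 1, 1)

eligible = fresh_tasks(pool, cutoff)
print([t.task_id for t in eligible])
```

Because the cutoff differs per model, two models may end up evaluated on different task subsets, which is worth keeping in mind when comparing their scores.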

SWE-rebench has been particularly useful for exposing what the maintainers call "Chinese model inflation" — several models that posted competitive Verified scores dropped significantly when re-evaluated on fresh tasks. Closed-frontier Western models drop less, but they still drop. Treat Pro and rebench as the same signal: contamination-controlled coding capability.

Terminal-Bench 2.0: the shell-only eval

Terminal-Bench 2.0 is 89 hand-audited tasks in real Docker containers covering software engineering, security, biology, and gaming. Each task got around three reviewer-hours to verify it's solvable, realistic, and well-specified. The top five on the public snapshot are GPT-5.5 (82.7%), Claude Mythos Preview (82.0%), GPT-5.3-Codex (77.3%), Gemini 3.5 Flash (76.2%), and Qwen3.7 Max (69.7%). With agent scaffolding, Forge Code + Gemini 3.1 Pro hits 78.4% and Factory Droid + GPT-5.3-Codex hits 77.3%.

The signal here is "can the model actually drive a real shell." Frontier models in the snapshot above score roughly 70-83% directly; a harness can add 8-15 points. Terminal-Bench is now the standard companion to SWE-bench Pro for coding-agent buyers: one measures multi-file diff quality, the other measures whether the model knows how to ship the diff.

Aider Polyglot and LiveCodeBench Pro: supporting signals

Aider Polyglot tests 225 Exercism problems across C++, Go, Java, JavaScript, Python, and Rust with two attempts per problem (first failure feeds unit-test output back). GPT-5 leads at 88.0% across the public leaderboard's 22 models; Anthropic reports Claude Opus 4.5 at 89.4% on its own pass. DeepSeek-V3.2-Exp is the top open-source entry at 74.5%.
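The two-attempt protocol described above (try once, then retry with the failing unit-test output as feedback) can be sketched as a small loop. This is a simplified illustration, not Aider's implementation; `solve_with_retry`, `toy_solver`, and `toy_tests` are hypothetical stand-ins for a model call and a test runner.

```python
from typing import Callable, Optional

# A solver takes the problem statement plus optional test feedback and returns code.
Solver = Callable[[str, Optional[str]], str]
# A test runner returns (passed, test_output) for a candidate solution.
TestRunner = Callable[[str], tuple[bool, str]]


def solve_with_retry(problem: str, solver: Solver, run_tests: TestRunner) -> tuple[bool, int]:
    """Try once; on failure, feed the test output back and try a second time.

    Returns (solved, attempts_used).
    """
    feedback: Optional[str] = None
    for attempt in (1, 2):
        candidate = solver(problem, feedback)
        passed, output = run_tests(candidate)
        if passed:
            return True, attempt
        feedback = output  # the second attempt sees the failing test output
    return False, 2


# Toy stand-ins: this "model" only gets it right once it has seen feedback.
def toy_solver(problem: str, feedback: Optional[str]) -> str:
    return "fixed" if feedback else "buggy"


def toy_tests(candidate: str) -> tuple[bool, str]:
    if candidate == "fixed":
        return True, ""
    return False, "AssertionError: expected 4, got 5"


result = solve_with_retry("add two numbers", toy_solver, toy_tests)
print(result)
```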

LiveCodeBench Pro uses an Elo ranking based on live competition problems annotated by release date, so models can be evaluated only on post-cutoff problems. Gemini 3.1 Pro is the runaway leader at 2887 Elo, with Gemini 3 Pro and Gemini 3 Flash trailing at 2439 and 2316. Google's lead here is structural — Gemini's contest-problem training apparently transferred well.
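To get a feel for what a rating gap of this size means, the standard Elo expected-score formula can be applied to the two top ratings quoted above. Treating these as head-to-head ratings is an assumption made purely for illustration; the benchmark's own rating procedure is not described here and may differ.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted above; interpreting them as head-to-head ratings is an assumption.
gemini_31_pro = 2887
gemini_3_pro = 2439

expected = elo_expected_score(gemini_31_pro, gemini_3_pro)
print(f"{expected:.3f}")
```

Under that reading, a gap of about 450 points corresponds to the higher-rated side being expected to come out ahead in the large majority of comparisons.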

HumanEval: saturated, stop quoting it

HumanEval is dead as a differentiator. Top four cluster at 93-95%, contamination is well-documented, and the original 164 problems have been studied so deeply that models effectively memorize them. The conventional advice now: treat HumanEval as a 90% qualification bar, use LiveCodeBench Pro and SWE-bench Pro for real model selection.

Who leads reasoning and math benchmarks?

Three of the four headline reasoning benchmarks (GPQA Diamond, MMLU-Pro, AIME) are saturating at the frontier. USAMO is the new top of the stack because it requires rigorous proofs, not multiple choice.

Benchmark	Leader	Top score	Saturation status
AIME 2026	GPT-5	100%	Solved
USAMO 2026	GPT-5.4	95%	Near-saturated
GPQA Diamond	Claude Mythos Preview	94.6%	Near-saturated
MMLU-Pro	Gemini 3.1 Pro Preview	90.99%	Saturating
HLE	Claude Mythos Preview	64.7%	Room to run

USAMO 2026 and the proof-writing frontier

MathArena scored USAMO 2026 with an LLM-jury grading pipeline that checks full proofs, not just final answers. GPT-5.4 hit 95%, which is near saturation. The drop-off behind it is huge: Gemini 3.1 Pro at 75%, Claude Opus 4.6 at 47%, and the strongest open model (Step-3.5-Flash) at 45%. Compare this to USAMO 2025, where LLMs scored close to zero because proofs were full of circular arguments and unsupported guesses. In one year the top model went from near zero to near saturation, but only the top model.

AIME and HMMT are effectively solved. GPT-5 hits 100% on AIME 2026; frontier models cluster at 95-99% across both. Competition math at the AIME tier is done.

GPQA Diamond and MMLU-Pro: saturating

GPQA Diamond, the "Google-proof" PhD-level science questions, has converged. Claude Mythos Preview leads at 94.6%, with Gemini 3.1 Pro (94.3%), Claude Opus 4.7 (94.2%), GPT-5.5 (93.5%), Qwen3.7 Max (92.3%), and GPT-5.4 (92.0%) all within a narrow band. The 1-2 point gaps you see between evaluators are inter-run variance, not model-capability gaps. MMLU-Pro is in the same place: Gemini 3.1 Pro Preview at 90.99%, Gemini 3 Pro at 90.10%, Claude Opus 4.7 at 89.87%, and Claude Opus 4.5 at 89.5%. These four sit within about 1.5 points of each other.

Humanity's Last Exam: where the frontier still runs

HLE is 2,500 expert-vetted questions across maths, sciences, and humanities. It's the one frontier reasoning benchmark with room to climb. Top scores depend heavily on how much compute and effort the test allows:

Claude Mythos Preview — 64.7% (BenchLM max-effort)
GPT-5.4 Pro — 58.7%
GPT-5.5 Pro — 57.2%
Claude Opus 4.8 (Adaptive, Max Effort) — 45.7% (Artificial Analysis)
Gemini 3.1 Pro Preview — 44.7%

The 20-point spread between scaffolded and bare-model HLE scores tells you how much agent harnesses still matter at the frontier.

What do multimodal benchmarks show?

The multimodal frontier is more interesting than text-only because it's where the contamination ceiling hasn't been hit yet.

MMMU-Pro combines text with images, diagrams, charts, and academic visual reasoning tasks. As of May 27, 2026, GPT-5.4 Pro leads at 94%, with Claude Mythos Preview at 92.7%, Gemini 3.1 Pro at 83.9%, and Gemini 3.5 Flash at 83.6%. The roughly 9-point gap between the top two and the next tier is among the largest in any frontier benchmark right now.

CharXiv tests scientific chart reasoning: can the model interpret plots, diagrams, and data charts from arXiv papers? Claude Mythos Preview leads at 93.2%, with Claude Opus 4.7 Adaptive at 91.0% and Meta's Muse Spark at 86.4%. GPT-5.4 lands at 82.8% and Claude Opus 4.7 without tools at 82.1%. Claude Opus 4.6 was at 65.3%, so the move to Opus 4.7 is the biggest generation-over-generation visual reasoning jump we've seen this year. Claude Mythos's vision improvements show up most clearly here.

How do agent and tool-use benchmarks rank?

This is where the real frontier is in 2026. Coding benchmarks measure one diff. Agent benchmarks measure a workflow. The numbers are uglier and the scaffolding spread is wider.

Benchmark	Bare-model leader	Scaffold leader	Gap
GAIA	GPT-5 Mini — 44.8%	Claude Sonnet 4.5 (HAL) — 74.6%	~30 pts
GDPval	GPT-5.2 Pro — 74.1% W+T	Claude Opus 4.8 — 1,890 Elo	Methodology-dependent
MCP Atlas	Gemini 3.5 Flash — 83.6%	n/a (Scale-hosted harness)	Same scaffold
WebArena	Claude Mythos Preview — 68.7%	DeepSeek V3.2 + Steel.dev — 74.3%	~5-6 pts
τ²-Bench Telecom	Claude Opus 4.6 — 99.3%	JT-35B-Flash — 99.1%	Saturated

GDPval: real knowledge work, not toy tasks

GDPval is OpenAI's benchmark of real-world economically valuable knowledge work. Tasks come from industry professionals with an average of 14 years of experience and cover 44 occupations across the top 9 GDP sectors. Models are graded as win/tie/loss versus an expert deliverable.
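The win+tie metric is simple to compute once each task has a grade against the expert deliverable. The sketch below assumes one "win", "tie", or "loss" label per task; the sample grades are invented for illustration and are not GDPval data, and `win_tie_rate` is a hypothetical helper name.

```python
from collections import Counter


def win_tie_rate(grades: list[str]) -> float:
    """Share of comparisons graded 'win' or 'tie' for the model."""
    if not grades:
        raise ValueError("no grades to score")
    counts = Counter(grades)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")
    return (counts["win"] + counts["tie"]) / len(grades)


# Hypothetical grades for ten tasks, not real GDPval data.
sample_grades = ["win", "win", "tie", "loss", "win", "tie", "loss", "win", "win", "loss"]
gdpval_rate = win_tie_rate(sample_grades)
print(f"win+tie: {gdpval_rate:.1%}")
```

Reporting wins and ties separately alongside the combined figure makes it easier to see how much of a headline number comes from outright wins.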

The top scores:

GPT-5.2 Pro — 74.1% win+tie versus human industry experts
GPT-5.2 Thinking — 70.9% (OpenAI's first model to perform at or above human-expert level on average)
Claude Opus 4.8 Adaptive (Max Effort) — 1,890 Elo on the GDPval-AA Elo ladder

OpenAI's research found that GPT-5.2 Thinking produced GDPval outputs at over 11x the speed and less than 1% of the cost of expert professionals, with frontier models more broadly completing tasks roughly 100x faster and 100x cheaper than human industry experts. Progress is roughly linear over time. If you're trying to answer "can I replace a junior knowledge worker on this task," GDPval is the only benchmark whose scoring even tries to measure that. It's also the most business-relevant number in the bundle.

GAIA and the scaffolding tax

GAIA is a 466-question benchmark from Meta, Hugging Face, and the AutoGPT authors. It tests reasoning, multi-modality, web browsing, and tool use on real-world assistant tasks across three levels (L1 single-step to L3 multi-step browsing + tools).

The headline numbers depend entirely on how the model is scaffolded:

Princeton HAL framework: Claude Sonnet 4.5 — 74.6% (Anthropic sweeps the top 6 HAL spots)
BenchLM published: Claude Mythos Preview — 52.3%, GPT-5.4 Pro — 50.5%, GPT-5.4 — 48.2%
Bare model: GPT-5 Mini — 44.8%, Claude 3.7 Sonnet — 43.9%

The 30-point spread between HAL-scaffolded Sonnet 4.5 and bare GPT-5 Mini is the "scaffolding tax" or, more honestly, the agent harness advantage. When a vendor shows you a GAIA number, ask which scaffold ran it. Without that context, the score is meaningless.
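One way to enforce the "ask which scaffold ran it" rule in your own tracking is to store the scaffold next to every score and refuse cross-scaffold comparisons. The sketch below uses the GAIA figures quoted above; the `Score` type and `compare` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    scaffold: str  # e.g. "HAL" or "bare"
    value: float   # percent


def compare(a: Score, b: Score) -> float:
    """Return a.value - b.value, refusing to compare across benchmarks or scaffolds."""
    if a.benchmark != b.benchmark:
        raise ValueError("scores come from different benchmarks")
    if a.scaffold != b.scaffold:
        raise ValueError(f"scaffold mismatch: {a.scaffold!r} vs {b.scaffold!r}")
    return round(a.value - b.value, 1)


# GAIA figures quoted above.
hal_sonnet = Score("Claude Sonnet 4.5", "GAIA", "HAL", 74.6)
bare_gpt5_mini = Score("GPT-5 Mini", "GAIA", "bare", 44.8)
bare_claude_37 = Score("Claude 3.7 Sonnet", "GAIA", "bare", 43.9)

# Same scaffold: a meaningful like-for-like difference.
like_for_like = compare(bare_gpt5_mini, bare_claude_37)
print(like_for_like)

# Different scaffolds: the comparison is rejected instead of silently computed.
try:
    compare(hal_sonnet, bare_gpt5_mini)
except ValueError as err:
    mismatch = str(err)
    print(mismatch)
```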

MCP Atlas: the real tool-use frontier

MCP Atlas is Scale's new tool-use benchmark, built on top of real Model Context Protocol servers. The dataset is 1,000 human-authored tasks spanning 36 real MCP servers and 220 tools. April 2026 brought an upgraded scoring judge and a 100-call budget per task (up from a 20-turn limit).

The top three on the 500-task public split:

Gemini 3.5 Flash — 83.6%
Claude Opus 4.7 Adaptive — 77.3%
Qwen3.7 Max — 76.4%

This is the genuine open frontier. No frontier model is above 85%. A task passes only if the coverage score is 75% or higher, so the metric is honest about partial success. MCP Atlas is the benchmark to watch in late 2026 because it actually measures the workflow most production deployments care about: connecting a model to real tools and shipping a usable result.
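The pass rule above (a task counts only if its coverage score reaches 75%) is stricter than averaging coverage across tasks. A minimal sketch of the difference, using invented per-task coverage scores rather than real MCP Atlas data; `pass_rate` is a hypothetical helper, and how coverage itself is judged is outside this sketch.

```python
PASS_THRESHOLD = 0.75  # coverage score needed for a task to count as passed


def pass_rate(coverage_scores: list[float], threshold: float = PASS_THRESHOLD) -> float:
    """Fraction of tasks whose coverage score meets or exceeds the threshold."""
    if not coverage_scores:
        raise ValueError("no tasks to score")
    passed = sum(1 for score in coverage_scores if score >= threshold)
    return passed / len(coverage_scores)


# Hypothetical per-task coverage scores, not real MCP Atlas data.
coverage = [1.0, 0.8, 0.75, 0.74, 0.5, 0.9, 0.0, 1.0]

atlas_rate = pass_rate(coverage)
mean_coverage = sum(coverage) / len(coverage)
print(f"pass rate: {atlas_rate:.1%}, mean coverage: {mean_coverage:.1%}")
```

In this toy data the pass rate comes out below the mean coverage, because near-misses such as 0.74 earn no credit under a threshold rule.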

WebArena and τ²-Bench: the old guard

WebArena is the realistic-web-environment benchmark — e-commerce, forums, CMS, and code repos. Claude Mythos Preview leads the public snapshot at 68.7%, with GPT-5.4 Pro at 65.8% and Claude Opus 4.6 at 64.5%. Specialized agentic frameworks beat the bare leaders — OpAgent's planner-grounder-reflector-summarizer multi-agent pipeline reaches 71.6%, and DeepSeek V3.2 with the Steel.dev harness hits 74.3%.

τ²-Bench Telecom is Sierra's dual-control conversational benchmark — the agent has to guide a simulated user through technical troubleshooting. The telecom domain is now effectively saturated: JT-35B-Flash hits 99.1% on Artificial Analysis, Claude Opus 4.6 hits 99.3% on llm-stats, GPT-5.4 hits 98.9% on BenchLM. When everyone's at 99%, the benchmark stops differentiating — expect Sierra to release a harder follow-up.

What does the Artificial Analysis Index actually mean?

The Artificial Analysis Intelligence Index is the closest thing to a composite frontier score. Version 3 (2026) weights 10 evaluations: AA-LCR, AA-Omniscience, CritPt, GDPval-AA, GPQA Diamond, Humanity's Last Exam, IFBench, SciCode, Terminal-Bench Hard, and τ²-Bench Telecom.

Current top:

GPT-5.5 (xhigh) — 60
GPT-5.5 (high) — 59
Claude Opus 4.7 (Adaptive, Max Effort) — 57
Gemini 3.1 Pro Preview — tied with Opus 4.7

GPT-5.5 launched April 23, 2026 and topped the index within 24 hours. It's the first fully retrained base model since GPT-4.5, whereas every GPT-5 release before it (5.0 through 5.4) was post-training on the same foundation. Across the broader index pool, GPT-5.5 leads on 14 individual benchmarks, while on the head-to-head 10-benchmark slice that both providers report, Claude Opus 4.7 leads on 6 of the 10. The right way to read this is "GPT-5.5 has a broader average lead, Opus 4.7 has deeper specialized wins."

Which benchmarks matter for which buying decision?

If you're picking a model in 2026, the leaderboard answer depends on the job.

Picking a coding agent backbone. SWE-bench Pro (real-world coding without contamination) and Terminal-Bench 2.0 (shell agent reliability). Ignore SWE-bench Verified unless you're calibrating to vendor marketing.

Picking a tool-using agent. MCP Atlas (controlled harness, real tools) and τ²-Bench (conversational, dual-control). GAIA is useful but only if you control for scaffold.

Picking a knowledge-work replacement. GDPval is the only honest answer. Win+tie versus a 14-years-experience expert is the actual buying signal.

Picking a research assistant. HLE for room-to-improve frontier reasoning, USAMO for proof writing, MMMU-Pro for chart and diagram reasoning, CharXiv for scientific visualizations.

Looking for a general signal. Artificial Analysis Intelligence Index for one composite number; weight it against your actual workload mix.
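Matching benchmarks to a workload can be made concrete by weighting each benchmark by how much of your work it resembles. The sketch below uses SWE-bench Pro and Terminal-Bench 2.0 scores quoted in this article for three models that appear on both lists; the 30/70 workload mix is a hypothetical assumption, and the GPT-5.3-Codex Pro figure is an agent-system run, so treat the result as illustrative only.

```python
# Scores quoted in this article (percent).
scores = {
    "Claude Mythos Preview": {"SWE-bench Pro": 77.8, "Terminal-Bench 2.0": 82.0},
    "Qwen3.7 Max": {"SWE-bench Pro": 60.6, "Terminal-Bench 2.0": 69.7},
    "GPT-5.3-Codex": {"SWE-bench Pro": 56.8, "Terminal-Bench 2.0": 77.3},
}

# Hypothetical workload mix: 30% multi-file changes, 70% shell automation.
weights = {"SWE-bench Pro": 0.3, "Terminal-Bench 2.0": 0.7}


def weighted_score(model_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(model_scores[bench] * w for bench, w in weights.items())


ranking = sorted(scores, key=lambda m: weighted_score(scores[m], weights), reverse=True)
for model in ranking:
    print(f"{model}: {weighted_score(scores[model], weights):.2f}")
```

With this shell-heavy mix, the second and third places swap relative to a ranking by SWE-bench Pro alone, which is the practical point of weighting by workload.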

What are the gaming and contamination risks by benchmark?

Almost every benchmark in the list above has at least one credible criticism. The honest framing in May 2026 is that there's no clean leaderboard — just leaderboards that are useful for specific decisions.

SWE-bench Verified. Verbatim gold-patch reproduction by frontier models. conftest.py reward-hacking vector. OpenAI's own audit found 59.4% of hardest unsolved problems had flawed tests. Treat as contaminated.
SWE-bench Pro. Private repos prevent direct contamination, but the public 500-task split is what most vendors quote — check whether the score is on public or private. Pro is still the cleanest coding benchmark in 2026.
HumanEval, MBPP. Both saturated and contaminated. Stop quoting them.
MMLU, MMLU-Pro, GPQA Diamond. Saturating at the frontier. Differences inside 2-3 points are noise.
GAIA. Massive scaffold dependence. A scaffold-controlled HAL number means something; a bare-model number means something different. Don't compare.
τ²-Bench Telecom. Saturated. Wait for a harder follow-up.
Aider Polyglot. Many headline scores are self-reported by vendors. Aider's own leaderboard is trustworthy, but third-party reproductions sometimes diverge by 5-10 points.
GDPval. OpenAI runs it; the judging methodology has been peer-reviewed and Artificial Analysis runs its own Elo (GDPval-AA) as a cross-check. The most trusted business-relevance score, with the caveat that "win against a human expert" depends on the rubric.

How have the benchmarks shifted since March 2026?

Two big changes since the last quarter:

First, Claude Mythos Preview swept the top of several reasoning, vision, coding, and agent benchmarks (GPQA, CharXiv, HLE, WebArena, SWE-bench Verified) in May 2026. This is the first time in 2026 that a single model has held that many top-1 positions simultaneously. Whether it holds up at GA, and whether the scores survive third-party re-runs, will define the June/July leaderboard.

Second, GPT-5.5 reshaped the coding benchmarks. Released April 23, 2026 as the first fully retrained base since GPT-4.5, it took the top of the AA Intelligence Index within 24 hours and currently leads Terminal-Bench 2.0 at 82.7%. It scores 88.7% on SWE-bench Verified, second to Mythos. OpenAI's pricing strategy on 5.5 also reset the "dollars per million coding tokens" comparison: Codex CLI now delivers more per task than Claude Code at the $20 tier.

Frequently asked questions

Why do leaderboards disagree on the same model and benchmark?

Three reasons: scaffolding choices (raw vs HAL vs custom harness), effort/compute budgets (max-effort runs vs standard), and run-to-run variance. A 1-3 point difference on the same model is almost always one of those. A 10+ point difference usually means different scaffolds entirely — check before comparing.

Is SWE-bench Verified still worth quoting?

Mostly no. OpenAI stopped reporting Verified scores in early 2026 after finding verbatim gold-patch reproduction. SWE-bench Pro is the contamination-resistant replacement, and its scores run well below Verified (about 16 to 28 points lower for the models quoted on both lists in this piece), which is closer to real capability. If a vendor quotes Verified without Pro, ask them why.

Which benchmark best predicts real production performance?

GDPval for knowledge work, SWE-bench Pro for coding, MCP Atlas for tool-using agents, and Terminal-Bench 2.0 for shell automation. None of them perfectly predicts your workload, so the right move is to run your top two candidates on your own private eval set before committing.

Are Chinese open-source models really competitive in 2026?

Yes, but with an asterisk. GLM-5 and GLM-5.1 hit 62.7-62.8% on SWE-rebench, just behind Claude Opus 4.6. Qwen3.7 Max competes on GPQA, MCP Atlas, and Terminal-Bench 2.0. DeepSeek V4 Pro Max is the top open-source on Aider Polyglot. The asterisk: SWE-rebench's contamination filter has caught several Chinese providers with inflated Verified scores, so look for fresh-task or contamination-resistant numbers before standardizing.

What benchmark should I trust if I can only pick one?

The Artificial Analysis Intelligence Index, weighted by your workload. It bundles ten evaluations, including GDPval-AA, Terminal-Bench Hard, τ²-Bench Telecom, GPQA Diamond, HLE, SciCode, and IFBench, into one composite. GPT-5.5 (xhigh) leads at 60; Claude Opus 4.7 trails at 57. For coding-only buyers, pair the Index with SWE-bench Pro.

How fast do these leaderboards change?

Top-3 positions churn every 6-8 weeks. The AA Intelligence Index has already had four different leaders in 2026 (DeepSeek V4 → Gemini 3 → Claude Opus 4.7 → GPT-5.5). The right framing: don't pick a model on overall benchmark leadership; pick the model that leads on the benchmarks matching your workload, then reassess every quarter.

Do these benchmarks measure agent reliability, or just peak capability?

Mostly peak capability. Benchmark pass rates are typically "solved in N attempts with no time limit." Production reliability — consistency over thousands of runs, error recovery, cost predictability — isn't on any public leaderboard. The closest signal is GDPval's tie/loss split, which catches inconsistent quality.
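The gap between peak capability and reliability is easy to measure on a private eval set by running each task several times. The sketch below contrasts "solved at least once" with "solved on every run"; the run outcomes are invented, and `peak_and_consistent` is a hypothetical helper rather than a metric published by any leaderboard named here.

```python
def peak_and_consistent(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Return (solved at least once, solved on every run) as fractions of tasks."""
    if not results or any(len(runs) == 0 for runs in results.values()):
        raise ValueError("every task needs at least one run")
    n_tasks = len(results)
    peak = sum(any(runs) for runs in results.values()) / n_tasks
    consistent = sum(all(runs) for runs in results.values()) / n_tasks
    return peak, consistent


# Hypothetical outcomes of four runs per task.
run_outcomes = {
    "task-1": [True, True, True, True],
    "task-2": [True, False, True, True],
    "task-3": [False, False, True, False],
    "task-4": [False, False, False, False],
}

peak, consistent = peak_and_consistent(run_outcomes)
print(f"solved at least once: {peak:.0%}, solved every run: {consistent:.0%}")
```

In this toy data the two figures differ by a factor of three, which is the kind of spread a single headline pass rate hides.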

The bottom line

The May 2026 leaderboard tells a consistent story across coding, reasoning, multimodal, and agent benchmarks: Claude Mythos Preview and GPT-5.5 are leapfrogging each other at the frontier, Claude Opus 4.7 is the durable second-place model on the harder contamination-resistant evals, and Gemini 3.x is the surprise on multimodal and live-code benchmarks. Open-source is closing fast at the second tier (DeepSeek, GLM, Qwen) but still trails on the contamination-controlled benchmarks.

The bigger story is that we're past the "one model wins" era. Picking a backbone for a production system in 2026 is a workload-matching exercise across at least three benchmarks, not a top-of-leaderboard glance. If you're building a team around these models — whether you're standardizing on Cursor + Claude Code, dispatching Codex agents in CI, or wiring MCP Atlas-style tool servers into your stack — that's an architecture decision, not a procurement one. Codersera helps companies hire vetted remote developers who already work fluently with these benchmarks and the agents that ride on them, so the "which leaderboard matters for us" conversation happens with engineers who've shipped against all of them.