AI Model Releases — May 2026 Roundup

Quick answer. May 2026 was the busiest model month of the year so far. Three standout May releases: Google's Gemini 3.5 Flash (now the default Gemini model, beats Gemini 3.1 Pro on coding), Alibaba's Qwen 3.7 Max (1M-token context, the strongest Chinese model on Artificial Analysis), and Subquadratic's SubQ with a 12-million-token context window — the largest ever shipped. xAI shipped four updates (Grok 4.3, Build, Skills, Connectors) and Cohere released the open-weight Command A+ (218B MoE). Anthropic's Mythos (released April 7, carried over from April) remains private behind Project Glasswing and continues to define the ceiling.

The May 2026 release window stretched from late April through Google I/O on May 19, and produced more shipped models in three weeks than all of Q1 combined. Every release below has a confirmed launch date, a real benchmark or price, and a place you can actually try it (or a documented reason you can't).

What were the most important AI model releases in May 2026?

If you only have time to scan one table this week, scan this one. Everything below it expands on what makes each release substantive — or in a couple of cases, what makes it more hype than substance.

Model / ReleaseDateHeadline capabilityWhere to try it
Anthropic Claude Mythos PreviewApr 7, 202697.6% USAMO 2026; cybersecurity-grade reasoningProject Glasswing partners only
Google Gemini 3.5 Flash + SparkMay 19, 202676.2% Terminal-Bench; agentic defaultGemini app, AI Studio, Antigravity
Alibaba Qwen 3.7 MaxMay 20, 20261M-token context; 92.4 GPQA DiamondQwen Chat, Alibaba Cloud Model Studio
Mistral Medium 3.5 + Le Chat Work ModeApr 29, 2026128B dense; 77.6% SWE-Bench VerifiedLe Chat (Work Mode), Hugging Face
Baidu ERNIE 5.1May 8, 2026#4 globally on LMArena Search Arenaernie.baidu.com, Qianfan API
Subquadratic SubQ 1M-PreviewMay 5, 202612M-token context window (largest ever)API beta, SubQ Code CLI
ChatGPT Personal FinanceMay 15, 2026Plaid-backed bank account integrationChatGPT Pro (US, web + iOS)
OpenAI Codex MobileMay 14, 2026Approve coding tasks from your phoneChatGPT iOS & Android (all plans)
Klarna Shopping Search in ChatGPTMay 20, 2026100M products, 400M merchant listingsChatGPT app directory
xAI Grok 4.3May 4, 20261M context; 53 on Intelligence Index; native videogrok.com, xAI API
xAI Grok Build 0.1May 14, 2026Agentic coding model (early access)grok.com (Build)
xAI Grok SkillsMay 18, 2026Persistent custom-expertise skill systemgrok.com
xAI Platform ConnectorsMay 22, 2026Vercel, Canva, Gamma, S&P Global integrationsgrok.com Connectors
Cohere Command A+May 20, 2026218B sparse MoE (25B active); Apache 2.0; runs on 2x H100Hugging Face, Cohere API

Why is Anthropic's Mythos still private behind Project Glasswing?

Anthropic announced Claude Mythos Preview on April 7, 2026 and explicitly chose not to release it to the public. It leads this roundup despite being unshippable because, by Anthropic's own numbers, Mythos is the largest single-generation capability jump the lab has ever produced — concentrated in domains the rest of the industry treats as ceilings.

The headline number is USAMO 2026, the US Mathematical Olympiad. Opus 4.6 scores 42.3%. Mythos Preview scores 97.6%. That's a 55-percentage-point jump inside a single generation, on a benchmark designed to break frontier models. Mythos also posts 93.9% on SWE-bench Verified, comfortably ahead of Opus 4.7.

Instead of an API, Anthropic launched Project Glasswing: a partner program using Mythos to find zero-days in critical software. Launch partners include AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA and Palo Alto Networks. Within its first month, Glasswing surfaced over 10,000 vulnerabilities across operating systems and browsers. Anthropic's stated position is that no current safeguard prevents Mythos-class models from being misused at scale, so public release is deferred. For Glasswing partners with direct access, Mythos is priced at $25 per 1M input tokens / $125 per 1M output tokens — roughly 5× the Opus 4.7 sticker — across an active program of roughly 50 vetted organisations (the named launch partners plus ~40 additional critical-infrastructure operators).

What this means in practice: if you're a developer, you're using Opus 4.7 (or 4.6). Mythos is the dotted line on the chart showing where Anthropic thinks the next public release will land. For the publicly-available branch, see our Claude Opus 4.7 complete guide.

What did Google ship at I/O 2026?

Google's I/O on May 19 was a pure-agent keynote. The model: Gemini 3.5 Flash, now the default for the Gemini app and AI Mode in Search globally. The interesting move is that Google made Flash — the small, fast tier — its agentic flagship. The numbers back the rebrand: 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas, 84.2% on CharXiv Reasoning, and a 1656 Elo on GDPval-AA. Google claims Flash beats Gemini 3.1 Pro on agentic and coding benchmarks while running at roughly 4× the tokens-per-second of other frontier models. List price is $1.50 per 1M input tokens / $9.00 per 1M output tokens, undercutting GPT-5.5 and Claude Opus 4.7 on both sides of the meter.

Gemini Spark wraps that model into a personal agent layered into Gmail, Calendar, Docs, and the Android system. It's gated to Google AI Ultra subscribers in the US for now, with trusted tester access at launch. Spark plus the new Google Antigravity agent-first development platform make clear what Google's bet is: not chatbots, not search, but always-on action.

One footnote worth knowing: Gemini 3.5 Pro was delayed at I/O and is now expected next month. If you're benchmarking against Pro tier, you're still on Gemini 3.1 Pro until June.

How does Alibaba's Qwen 3.7 Max compare to the Western frontier?

Alibaba dropped Qwen 3.7 Max at the Alibaba Cloud Summit in Hangzhou on May 20. It is currently the highest-ranked Chinese model on the Artificial Analysis Intelligence Index, scoring 56.6 at launch — top-10 overall across 151 measured models.

The specifics that matter to developers:

  • 1 million-token context window, doubling the previous Qwen Max ceiling
  • 92.4 on GPQA Diamond — ahead of Claude Opus 4.6 Max (91.3), behind GPT-5.5 (93.6)
  • 97.1 on HMMT 2026 February, the highest score in its comparison group
  • $2.50 per 1M input tokens / $7.50 per 1M output tokens, which is roughly half the cost of GPT-5.5 at comparable quality
  • Alibaba's internal coding run logged 1,158 tool calls across 35 hours of autonomous execution

Qwen 3.7 Max is the first release where the China-US gap stops looking like a gap on the metrics most teams care about: long context, reasoning, agentic tool use, price per million tokens. If you've been writing off Chinese semi-open models, re-evaluate on this one.

What's actually new in Mistral Medium 3.5 and Le Chat Work Mode?

Mistral's release on April 29 (announcement coverage rolled into early May) was the cleanest architectural story of the month. Mistral Medium 3.5 is a 128B dense model with a 256k context window that folds instruction-following, reasoning, and coding into a single set of weights — no separate "reasoning mode" toggle, no router between models. Configurable reasoning effort per request lets you trade latency for depth at inference time.

On benchmarks, Medium 3.5 posts 77.6% on SWE-Bench Verified, beating Devstral 2 and Qwen3.5 397B A17B. It runs on four GPUs and ships with open weights on Hugging Face under a modified MIT license. List pricing is $1.50 per million input tokens, which Mistral positions as the most aggressive pricing in the 100B+ dense tier.

Le Chat Work Mode shipped alongside the model: an agentic surface in Le Chat (Free / Pro / Team / Enterprise) where Medium 3.5 works across email, calendar, documents, Jira, and Slack in parallel, surfacing tool calls and reasoning steps inline with explicit approval before sensitive actions. It's the closest thing Mistral has to Anthropic's computer-use story or Gemini Spark — and unlike both, you can self-host the underlying model.

If your team is evaluating coding agents, Mistral Medium 3.5 plus its companion Vibe "Remote Agents" feature put Mistral firmly back into the conversation. See our AI coding agents complete guide for how it stacks against Claude Code, Cursor, Codex, and Cline.

How did Baidu's ERNIE 5.1 climb the LMArena leaderboard?

Baidu released ERNIE 5.1 on May 8, 2026 (the preview had been quietly live since April 30). It is the first Chinese model to crack the global top 10 on LMArena's Search Arena leaderboard, landing at #4 globally with an Elo of 1223. On the standard text leaderboard, ERNIE 5.1 hit #14 — also the highest position any Chinese lab has held in that category.

The technical story is efficiency, not raw scale. ERNIE 5.1 compresses total parameters to roughly one-third of ERNIE 5.0 and active parameters to one-half, while spending only 6% of the pre-training compute of comparable frontier models. On AIME26 (with tool use), it scores 99.6 — second only to Gemini 3.1 Pro on that specific benchmark.

ERNIE 5.1 has a 128k context window, prices below Western frontier models on Baidu's Qianfan platform, and supports both API access and direct chat at ernie.baidu.com. For teams operating in or selling into China, it's now the default reasonable choice; for international teams, it's the model to benchmark when cost-per-token starts mattering more than the last percentage point on a leaderboard.

Is Subquadratic's 12-million-token context window real?

Of every release this month, Subquadratic's SubQ is the most architecturally ambitious. Launched May 5 by a Miami-based startup (founded by ex-SoftBank Vision Fund and Tinder alumni, $29M raised at a $500M valuation), SubQ ships with a 12 million-token context window — roughly 10× larger than any frontier model currently in production.

The architecture story: SubQ is the first model the team is calling "fully subquadratic." For each query token, the model selects a small subset of positions to attend to based on content rather than fixed patterns, then computes exact attention only over those. The claimed performance: a 7.2× speedup at 128K tokens, 52.2× at 1M tokens, and 92.1% on a needle-in-a-haystack benchmark at 12M tokens — a context length where every other frontier model either OOMs or hallucinates.

Two products shipped with the model: an API exposing the full 12M window, and SubQ Code, a CLI coding agent. Both run on neoclouds rather than the major hyperscalers. The roadmap calls for 50M-token context by Q4 2026.

The honest caveat: independent researchers have asked for third-party reproductions of the speedup numbers, and Subquadratic has not yet published a full architecture paper. We'd recommend treating SubQ as the most interesting beta of the month rather than a production dependency, but the long-context demos are real and worth running against your own codebase or document corpus.

What did OpenAI ship in May 2026?

OpenAI didn't launch a new flagship model in May — GPT-5.5 still anchors the lineup, and our GPT-5.5 complete guide covers the current state. But OpenAI shipped three product-layer expansions worth tracking.

ChatGPT Personal Finance (May 15). A preview for ChatGPT Pro subscribers in the US, letting users connect their accounts to over 12,000 financial institutions (Schwab, Fidelity, Chase, Robinhood, Amex, Capital One) via Plaid. The interface is a dashboard plus natural-language Q&A — "how much did I spend on subscriptions last quarter," "what would selling $5k of NVDA do to my tax bill." Intuit integration is on the roadmap. This follows OpenAI's April acquisition of personal-finance startup Hiro.

OpenAI Codex Mobile (May 14). The ChatGPT mobile app on iOS and Android became a remote terminal for Codex. From your phone, you can monitor running coding threads, approve or reject commands, switch models, or kick off a new task. Files, credentials, and execution stay on whatever physical or remote machine Codex is set up on — the phone is purely the control surface. At launch, mobile only connects to the macOS Codex desktop app, with Windows promised but undated. Free, Plus, Pro, and Go plans all get access.

Klarna Shopping Search (May 20). Not an OpenAI product, but a notable distribution moment. Klarna's app in ChatGPT exposes 100 million products and 400 million merchant listings across 13 markets through ChatGPT's natural-language interface, powered by Klarna's Product Search MCP server. AI-driven traffic to retail sites grew nearly 700% during the 2025 holiday season at 31% higher conversion rates — and Klarna is the first large commerce platform to ship an MCP integration directly into ChatGPT at this scale.

What's happening with Manus and the Meta acquisition?

This is the regulatory story of the month, not a model release. On April 27, 2026, China's National Development and Reform Commission ordered Meta to unwind its $2 billion acquisition of Manus, the Chinese-founded (Singapore-headquartered) AI agent startup. The deal had been under review by China's Ministry of Commerce since January under export-control and overseas-investment frameworks.

Manus had launched its Desktop App with "My Computer" in mid-March, bringing its agent onto local Macs and PCs — with terminal access, local file management, GPU utilization for inference, and per-command approval gates. That release made Manus one of the more credible computer-use products on the market — part of what made the acquisition interesting and part of what made Beijing block it.

The practical impact: Manus stays independent (and Chinese-rooted) for now. Meta's bid to leapfrog into the agent layer via acquisition is dead. The agent-wars dynamic — Anthropic computer use, Gemini Spark, Le Chat Work Mode, OpenAI Codex mobile, Manus My Computer — now has one fewer pending consolidation.

How does May 2026 compare to the rest of the year?

Three patterns are now clear five months into 2026:

Frontier capability is consolidating around agents, not chat. Every major lab shipped an agentic product surface this month — Gemini Spark, Le Chat Work Mode, Codex Mobile, SubQ Code, Manus My Computer. The thing being sold is no longer "a smarter model" but "a model that can complete a multi-step task while you do something else."

The China-US capability gap has narrowed to negligible on most benchmarks. Qwen 3.7 Max is genuinely competitive with GPT-5.5 on reasoning and ahead of Opus 4.6 on GPQA. ERNIE 5.1 cracked the LMArena global top 10. Pricing on both is roughly half of Western frontier APIs. For context on the broader open and semi-open ecosystem, see our DeepSeek V4 complete guide.

Context-window scaling has become its own race. Qwen 3.7 Max ships 1M tokens. SubQ ships 12M. The 50M roadmap goal is publicly stated. For codebases, document review, and long-running agent traces, this is the dimension that will reshape how teams structure their RAG pipelines in the second half of the year.

Which May 2026 model should I actually use?

Three rules of thumb, distilled from running these models against real workloads over the past three weeks:

  • For agentic coding workflows: Gemini 3.5 Flash if you're already in the Google ecosystem; Mistral Medium 3.5 if you need self-hosting or open weights; Claude Opus 4.7 if you want the most polished agentic harness today. Mythos is unavailable.
  • For long-context document and codebase reasoning: Qwen 3.7 Max at 1M tokens is the production-safe pick. SubQ at 12M is the experimental pick — worth testing, not yet worth shipping into customer-facing systems.
  • For cost-sensitive deployments: ERNIE 5.1 if your users are in China or the data-residency constraints work; Mistral Medium 3.5 self-hosted for everyone else; Qwen 3.7 Max if you need API simplicity and don't mind paying $2.50/M tokens.

The next dates to watch: Gemini 3.5 Pro (expected June 2026), the next OpenAI flagship (no confirmed date), and any sign that Anthropic is preparing to lift Mythos out of Project Glasswing for limited public access.

What did xAI release in May 2026?

xAI used May to ship four updates back-to-back. Grok 4.3 landed on May 4 as the cost-efficient flagship: a 1M-token context window, a score of 53 on Artificial Analysis's Intelligence Index, and native video input — a meaningful capability bump over Grok 4.2 without a price increase. Ten days later, on May 14, xAI opened early access to Grok Build 0.1, the team's first dedicated agentic-coding model, positioned against Claude Code and Codex rather than the general chat assistant.

The second half of the month was platform plumbing. Grok Skills (announced May 18) is a persistent custom-expertise system — closer to ChatGPT's GPTs than Anthropic's skills, but tuned for the Grok-on-X distribution. Platform Connectors followed on May 22 with integrations into Vercel, Canva, Gamma, and S&P Global, giving Grok read/write access to creative and financial tools without leaving the chat surface. None of these individually rivals Gemini Spark or Le Chat Work Mode, but combined they make Grok a credible fourth pillar in the agent-wars narrative alongside Anthropic, Google, and OpenAI.

What did Cohere release in May 2026?

Cohere shipped Command A+ on May 20-21, 2026 — a 218-billion-parameter sparse mixture-of-experts model with roughly 25 billion active parameters per token, released under the Apache 2.0 license with open weights on Hugging Face. The headline systems story is that Command A+ runs on as few as two NVIDIA H100 GPUs, putting frontier-class agentic capability inside the budget of mid-sized enterprises and sovereign-cloud operators that can't or won't run hyperscaler APIs.

Command A+ is explicitly positioned for agentic workflows in regulated industries — banking, defence, healthcare, sovereign-critical infrastructure — where data residency and model provenance matter more than the last percentage point on a public leaderboard. With Apache 2.0 weights, on-prem deployment is straightforward; with sparse MoE routing, inference cost stays low even at 218B total parameters. It is the most enterprise-shaped open-weight release of the month and the strongest answer yet to "what runs locally and still feels frontier."

What about DeepSeek, Meta, and other labs?

Two notable absences worth closing the loop on. No DeepSeek V5 in May — DeepSeek's most recent flagship is V4, released April 24, 2026, with V4-Pro and V4-Flash variants following shortly after; see our DeepSeek V4 complete guide for the full breakdown. No Llama 5 in May either: Meta did not release a new Llama model, and reports suggest the Llama 5 timeline has slipped while Meta Superintelligence Labs pivots toward a closed-source successor. If you were waiting on a fresh open-weight Meta flagship to compare against Command A+ or Qwen 3.7 Max, the wait continues.

Frequently asked questions

What was the biggest AI model released in May 2026?

By measurable impact on the leaderboards, Google Gemini 3.5 Flash — it became the default Gemini model on May 19 and outperforms the previous Pro tier on agentic and coding benchmarks. By architectural significance, Subquadratic's SubQ, which shipped a 12-million-token context window — roughly 10× larger than any other frontier model. Anthropic's Mythos (April 7) posts higher raw benchmark numbers but is not available to the public.

Is Anthropic's Mythos available to use?

No. Mythos Preview is restricted to Project Glasswing partners — AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks — for cybersecurity research. Anthropic has publicly stated that no current safeguards are strong enough to prevent misuse of Mythos-class models at scale, so general public release is deferred.

How does Qwen 3.7 Max compare to GPT-5.5?

On GPQA Diamond, Qwen 3.7 Max scores 92.4 vs GPT-5.5's 93.6 — close enough that the difference disappears inside benchmark noise. Qwen ships a larger context window (1M vs GPT-5.5's smaller production window) and prices roughly half as much per million input tokens. GPT-5.5 still leads on multimodal benchmarks and has the more mature ecosystem of tools and integrations.

What is Gemini Spark, and how is it different from Gemini 3.5 Flash?

Gemini 3.5 Flash is the underlying model. Gemini Spark is a personal AI agent product built on top of Flash that runs 24/7 across Gmail, Calendar, Docs, and Android. Think of Flash as the engine and Spark as the car. Spark is currently in trusted-tester rollout, with Google AI Ultra subscribers in the US getting beta access first.

Can Subquadratic's SubQ really handle 12 million tokens?

Subquadratic's own benchmarks claim 92.1% on a needle-in-a-haystack test at 12M tokens, which is real if the numbers reproduce. Independent researchers have asked for third-party validation and a full architecture paper, neither of which has been published yet. Treat SubQ as the most interesting long-context experiment of the year, but not a production dependency for customer-facing systems until the numbers are independently confirmed.

What happened with Meta's acquisition of Manus?

China's National Development and Reform Commission blocked Meta's $2 billion acquisition of Manus on April 27, 2026, citing export-control and overseas-investment concerns. Manus remains independent. The startup had launched its Desktop App with the "My Computer" feature in mid-March, which made its agent platform one of the more credible computer-use products on the market.

Which May 2026 model is best for coding?

For most teams: Claude Opus 4.7 or GPT-5.5 remain the safest defaults. Among the May 2026 releases specifically, Gemini 3.5 Flash leads on Terminal-Bench 2.1 (76.2%) and Mistral Medium 3.5 leads on SWE-Bench Verified (77.6%). Mistral has the additional advantage of open weights and self-hosting, which matters for teams with code-residency constraints.

Is Mistral Medium 3.5 open source?

Yes, with caveats. Mistral Medium 3.5 ships as open weights on Hugging Face under a modified MIT license. The model is a 128B dense architecture and runs on four GPUs. The Le Chat Work Mode product layer that wraps it is proprietary, but the underlying model can be self-hosted and fine-tuned.