OpenAI May 2026: GPT-5.5 Instant, Codex Goals, GPT-5.6

GPT-5.5 Instant replaced GPT-5.3 Instant as ChatGPT's default, Codex shipped Goal Mode and richer MCP, and a GPT-5.6 entry briefly surfaced in OpenAI's Codex logs. Here is the complete May 2026 OpenAI changelog and what it means for developers.

Published 28 May 2026 • Updated 28 May 2026 • 12 min read

Quick answer. Between April 23 and May 28, 2026, OpenAI shipped GPT-5.5 as the new flagship model ($5/$30 per million tokens, 1M context, 88.7% SWE-bench Verified), promoted GPT-5.5 Instant to ChatGPT's default with 52.5% fewer hallucinations on high-stakes prompts, rolled out ChatGPT for Excel and Google Sheets, and turned Codex CLI into a persistent autonomous agent runtime through four releases that added Goal Mode by default, conversation search, and richer MCP support. GPT-5.6 has not launched, though a Codex log entry briefly referenced it.

OpenAI is moving fast enough that a five-week window now contains a flagship model launch, a default-model swap, two major product GAs, four Codex CLI releases, and a canary signal for the next model. The May 2026 wave is also genuinely different from prior waves: less about raw intelligence and more about turning GPT-5.5 into something that runs autonomously for hours, persists state across sessions, and shows up inside the tools people already use.

This is the consolidated changelog, with benchmarks, pricing, gotchas, and an honest read on the GPT-5.6 rumor that surfaced in Codex logs mid-month. If you are deciding whether to migrate from GPT-5.4 or another vendor, build on top of Codex, or just want one place to see what shipped, start here.

What shipped in May 2026 (the timeline)

The full sequence, ordered by ship date:

April 23, 2026 — GPT-5.5 launches in the API, ChatGPT Plus/Pro, Codex, and Copilot. $5 / $30 per million tokens, 1M context, 88.7% SWE-bench Verified, 92.4% MMLU, 82.7% Terminal-Bench.
April 25, 2026 — OpenAI publishes the GPT-5.5 prompting guide.
April 30, 2026 — Post-mortem on the “goblin incident,” the model's statistically significant fixation on goblins, gremlins, raccoons, trolls, ogres, and pigeons. Yes, really.
May 5, 2026 — GPT-5.5 Instant rolls out as the new default for every ChatGPT user. 52.5% fewer hallucinations on high-stakes prompts. ChatGPT for Excel and Google Sheets reaches general availability on the same day.
May 7, 2026 — Codex CLI v0.129.0: modal Vim editing. OpenAI also announces GPT-5.5-Cyber, a limited-preview cybersecurity variant under the Trusted Access for Cyber program.
May 18, 2026 — Codex CLI v0.131.0: blended token usage, universal “@” picker across files / dirs / plugins / skills, responsive Markdown tables.
May 18-20, 2026 — Memory Sources rollout on web for Plus and Pro; Google Calendar connector available; locked computer use for eligible Mac users.
May 21, 2026 — Codex CLI v0.133.0: Goal Mode enabled by default, with dedicated storage and per-turn progress tracking.
May 26, 2026 — Codex CLI v0.134.0: conversation history search, `--profile` as the canonical selector, per-server MCP env targeting + OAuth for streamable HTTP.

Mid-May also produced the only legitimate GPT-5.6 signal so far: a rollout-mapping entry briefly referenced gpt-5.6 in Codex logs before reverting to gpt-5.5. That is consistent with backend canary testing, but is not a launch. More on that below.

What is GPT-5.5 Instant, and why does it matter?

GPT-5.5 Instant is OpenAI's low-latency variant of GPT-5.5, and on May 5 it replaced GPT-5.3 Instant as the default model for every ChatGPT user — Free, Go, Plus, Pro, Business, Enterprise, and Edu. In the API it is reachable through the chat-latest alias.

The headline number is hallucination reduction. On OpenAI's internal “HallucinationBench” — a benchmark of medical, legal, and financial questions where wrong answers carry real consequences — the rate dropped from 18.7% on GPT-5.3 Instant to 8.9% on GPT-5.5 Instant. That is a 52.5% relative reduction. On the harder slice of conversations that real users had previously flagged for factual errors, GPT-5.5 Instant reduced inaccurate claims by 37.3%.
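The relative-reduction figure is easy to sanity-check from the two rates quoted above. The sketch below is only that arithmetic; the helper name is ours, and because the published rates are rounded, recomputing from them lands at about 52.4% rather than exactly 52.5%.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Rounded hallucination rates quoted in the article: 18.7% -> 8.9%.
reduction = relative_reduction(18.7, 8.9)
absolute_drop = 18.7 - 8.9  # percentage points, not percent

print(f"relative reduction: {reduction:.1f}%")
print(f"absolute drop: {absolute_drop:.1f} percentage points")
```

The distinction between the two printed numbers matters when reading vendor claims: a 52% relative reduction here corresponds to a drop of just under 10 percentage points.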

Two other shifts are easy to feel in everyday use:

Conciseness. 30.2% fewer words and 29.2% fewer lines on the same prompts. Fewer follow-up questions, less overformatting, and — to quote 9to5Mac's headline — fewer “gratuitous emojis.”
Reasoning lift. AIME 2025 climbs from 65.4 to 81.2. MMMU-Pro (multimodal reasoning) goes from 69.2 to 76.0.

Important nuance: the 52.5% hallucination reduction is OpenAI's number for Instant on high-stakes prompts with tool use enabled. On long-form factuality benchmarks without tool use, GPT-5.5 still hallucinates at roughly 86%, against Claude Opus 4.7's 36%. The gain comes largely from tool grounding and context engineering, not solely from the base model. That distinction matters when you are deciding whether to ship a customer-facing legal or medical surface.

GPT-5.3 Instant remains available for three months in paid users' model settings before retirement, so production migrations have a window.

What are the GPT-5.5 pricing and context window trade-offs?

Across the GPT-5.5 family, the per-million-token rates published since April 23 are:

GPT-5.5 standard: $5 input / $30 output, $0.50 cached input (a 90% discount)
GPT-5.5 Pro: $30 input / $180 output for higher-effort reasoning
Batch / Flex tier: $2.50 / $15 — same 50% discount as previous batch APIs
Priority tier: $12.50 / $75 for guaranteed-throughput workloads
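To make the tiers comparable, here is a minimal cost estimator built only from the rates listed above. The tier keys and the function name are our own labels, not OpenAI identifiers, and the cached-input rate is applied to the standard tier only because that is the only tier the list gives it for.

```python
# Per-million-token rates (USD) as quoted in this article: (input, output).
RATES = {
    "standard": (5.00, 30.00),
    "pro": (30.00, 180.00),
    "batch_flex": (2.50, 15.00),
    "priority": (12.50, 75.00),
}
CACHED_INPUT_RATE = 0.50  # quoted for the standard tier only


def estimate_cost(tier: str, input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    `cached_tokens` is the portion of `input_tokens` served from cache.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    if cached_tokens and tier != "standard":
        raise ValueError("a cached rate is only listed for the standard tier")
    input_rate, output_rate = RATES[tier]
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * input_rate
        + cached_tokens * CACHED_INPUT_RATE
        + output_tokens * output_rate
    ) / 1_000_000
    return round(total, 6)


# Same request (100K input, 5K output) priced three ways.
uncached = estimate_cost("standard", 100_000, 5_000)
mostly_cached = estimate_cost("standard", 100_000, 5_000, cached_tokens=80_000)
batched = estimate_cost("batch_flex", 100_000, 5_000)
print(uncached, mostly_cached, batched)
```

For input-heavy requests like this one, a high cache hit rate saves more than moving to the batch tier, which is worth checking before restructuring a pipeline around batching.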

Context window is where the asymmetry hides. The API exposes 1,000,000 tokens, but inside Codex the effective limit is 400,000. That mismatch has already caused production headaches: in a tracked issue, Codex's auto-compaction logic misfires when developers pass API-sized contexts to Codex sessions, and the runtime emits “exceeds the context window of this model” errors. If you are building agents that span both surfaces, pin your budget to 400K and treat the extra 600K as API-only headroom.
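One way to enforce that budget is a guard that runs before a prompt is dispatched. This is a sketch under stated assumptions: the two limits are the figures quoted above, the function names are hypothetical, and token counts are assumed to come from whatever tokenizer your stack already uses.

```python
API_CONTEXT_LIMIT = 1_000_000    # tokens, as quoted for the API
CODEX_CONTEXT_LIMIT = 400_000    # tokens, as quoted for Codex


def context_budget(targets_codex: bool, reserved_for_output: int = 0) -> int:
    """Return how many prompt tokens are safe for the chosen surface."""
    limit = CODEX_CONTEXT_LIMIT if targets_codex else API_CONTEXT_LIMIT
    if not 0 <= reserved_for_output < limit:
        raise ValueError("reserved_for_output must be in [0, limit)")
    return limit - reserved_for_output


def check_prompt(prompt_tokens: int, targets_codex: bool,
                 reserved_for_output: int = 0) -> int:
    """Raise if the prompt is over budget; otherwise return remaining headroom."""
    budget = context_budget(targets_codex, reserved_for_output)
    if prompt_tokens > budget:
        raise ValueError(
            f"prompt of {prompt_tokens} tokens exceeds budget of {budget}"
        )
    return budget - prompt_tokens


# A 350K-token prompt fits Codex with 20K reserved for output.
headroom = check_prompt(350_000, targets_codex=True, reserved_for_output=20_000)
print(headroom)
```

Failing early in your own code gives a clearer error than the runtime's context-window message, and it keeps agents that span both surfaces on a single budget.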

Long-context quality also leapt this generation. At 512K to 1M tokens, GPT-5.5 retrieves at 74.0% accuracy, against GPT-5.4's 36.6%. Retrieval-heavy use cases — full-codebase analysis, multi-document QA, long policy corpora — are tractable now in ways that required careful chunking last year.

How did Codex CLI change this month?

Codex CLI shipped four releases between May 7 and May 26. The cumulative effect is to turn Codex from an interactive coding assistant into a persistent autonomous runtime. The headline release is v0.133.0 on May 21, which made Goal Mode the default.

Goal Mode lets you define an outcome plus success criteria and walk away. Codex drives toward the goal for hours or, in OpenAI's marketing claim, “even days.” Goal progress persists across turns, sessions, and machine state. It is generally available across the CLI, the IDE extension, and the ChatGPT app.

The other notable May Codex changes:

v0.129.0 (May 7): Modal Vim editing in the TUI composer (/vim, default-mode config, Vim keymap contexts).
v0.131.0 (May 18): Service-tier commands are data-driven; blended token usage is exposed; the “@” picker now searches files, directories, plugins, and skills in one shot; Markdown tables render responsively on narrow terminals.
v0.134.0 (May 26): Search across local conversation history with content-match previews; --profile becomes the canonical profile selector across CLI, TUI permissions, and sandbox flows (legacy profile configs are rejected with migration guidance); MCP setup supports per-server environment targeting and OAuth on streamable HTTP servers; read-only MCP tools advertised with readOnlyHint run concurrently for parallel speedups.

For comparing this evolution against alternatives, see Claude Code vs OpenAI Codex: which agent runtime should developers pick in 2026 and the cross-vendor agent shell comparison.

How does ChatGPT for Excel and Google Sheets work?

ChatGPT for Excel and Google Sheets went generally available on May 5, with GPT-5.5 available in the model picker. Functionally, it adds a sidebar inside Microsoft Excel and Google Sheets that can build, update, and reason about the spreadsheet you are working in. It handles trackers, budgets, formulas, multi-tab files, scenario analysis, and data cleanup.

Two architectural touches make it more than another “ChatGPT in a sidebar” surface:

Skills are reusable playbooks that teach ChatGPT how to handle specific spreadsheet workflows, formats, and review steps for your org.
Apps let the spreadsheet sidebar connect to outside data sources — financial integrations, internal databases — so reasoning is grounded in the right context instead of pure formula inference.

The feature is globally available to Free, Go, Plus, and Pro users, plus Business, Enterprise, Edu, and K-12. The Business and Enterprise tiers have a free preview through June 2, 2026; after that usage follows each plan's credit allocations.

What are Memory Sources, and how do they affect privacy?

Memory Sources is the visible context layer that rolled out alongside GPT-5.5 Instant. It is now live for Plus and Pro on the web, with mobile and lower tiers following over the coming weeks.

When a response is personalized, you can see exactly which sources ChatGPT pulled from: past chats, saved memories, custom instructions, files in your library, and emails from a connected Gmail account. Each source can be corrected, deleted, or marked as “not relevant.” That visibility is the meaningful change: users can finally see why ChatGPT made a personalized recommendation, and correct it before it propagates further. Shared chats do not include the source list, so privacy is preserved at the share boundary.

The Google Calendar integration extends this — connect your calendar and ChatGPT can draft meeting agendas, surface birthday-reminder context, or recommend restaurants for upcoming trips. The privacy tradeoff is the obvious one: anything Gmail or Calendar contains can now surface in answers if you connect those accounts. If you forget you connected Gmail and ask about “our Q2 strategy,” the answer may reference unrelated email threads. Audit the Memory Sources panel periodically.

How does GPT-5.5 compare to Claude Opus 4.7 and Gemini 3.1 Pro?

The most useful frame for May 2026 is that the three flagship models sit within roughly three points of each other on the Artificial Analysis intelligence index. They are no longer differentiated by raw capability; they are differentiated by what they win at.

GPT-5.5 leads on agentic execution — Terminal-Bench (82.7% vs Opus 4.7's 69.4%) and OSWorld (78.7%). It also has the new Codex Goal Mode runtime and the deepest first-party tool integrations.
Claude Opus 4.7 leads on SWE-bench Pro (64.3% vs GPT-5.5's 58.6%), long-form factuality (36% hallucination rate vs 86% on the same long-form benchmark), and writing quality. It still wins enterprise risk-averse use cases.
Gemini 3.1 Pro leads on pure reasoning — GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%) — and price (output is roughly $12 per million, less than half of GPT-5.5 or Opus 4.7).

Pricing comparison per million tokens:

GPT-5.5: $5 / $30 (cached $0.50)
GPT-5.5 Pro: $30 / $180
Claude Opus 4.7: $5 / $25 — but a new tokenizer adds 0-35% effective cost on the same prompts
Gemini 3.1 Pro: ~$2 / $12
Gemini 3.5 Flash (May 20 GA): $1.50 / $9, and beats Gemini 3.1 Pro on coding and agentic benchmarks
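The tokenizer caveat on Opus 4.7 changes the comparison more than the list prices suggest. The sketch below assumes, for illustration, that the 0-35% figure scales token counts uniformly, so effective cost is simply the base rate times one plus the inflation; real prompts will vary, and the helper names are ours.

```python
def effective_rate(base_rate: float, token_inflation: float) -> float:
    """Base per-million rate scaled by extra tokens for the same text."""
    return round(base_rate * (1.0 + token_inflation), 2)


def breakeven_inflation(base_rate: float, competitor_rate: float) -> float:
    """Token inflation at which `base_rate` costs as much as `competitor_rate`."""
    return round(competitor_rate / base_rate - 1.0, 4)


OPUS_OUTPUT = 25.00   # USD per million output tokens, as quoted above
GPT_OUTPUT = 30.00    # USD per million output tokens, as quoted above

best_case = effective_rate(OPUS_OUTPUT, 0.00)
worst_case = effective_rate(OPUS_OUTPUT, 0.35)
breakeven = breakeven_inflation(OPUS_OUTPUT, GPT_OUTPUT)

print(f"Opus 4.7 effective output rate: ${best_case} to ${worst_case}")
print(f"break-even inflation vs GPT-5.5: {breakeven:.0%}")
```

Under that assumption, Opus 4.7's output price advantage disappears once inflation on your prompts passes 20%, so measuring token counts on a representative sample is worth doing before comparing list prices.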

The decision tree most teams are landing on: GPT-5.5 for autonomous agents and Office-embedded workflows, Opus 4.7 for code review and high-trust long-form, Gemini 3.x for cost-sensitive volume or multimodal video. Full background on each in our GPT-5.5 complete guide, the Claude Opus 4.7 guide, and the Gemini 3.5 guide.

What is the GPT-5.6 signal, and should you wait?

Mid-May, a Codex session briefly returned a rollout-mapping log entry that referenced gpt-5.6 instead of gpt-5.5. The entry was reproducible for a short window, then disappeared. The pattern is consistent with backend canary testing — a small percentage of production traffic gets routed to an experimental build to measure performance and behavior. GPT-5.5 itself surfaced in Codex logs roughly 10 to 14 days before its April 23 launch.

What the signal tells you:

GPT-5.6 is plausibly in development.
Polymarket traders price an 80-89% probability of a public release by June 30, 2026.
OpenAI has not published a model card, an API endpoint, benchmarks, or a release date.

What the signal does not tell you: anything substantive about capabilities, context window, pricing, or whether it ships as a default replacement or a parallel SKU. The pragmatic recommendation for anyone planning a Q3 build is to ship on GPT-5.5 now and keep your model id configurable so the migration is a one-line change. Treat any roadmap that assumes GPT-5.6 ships in June as a bet on Polymarket-implied odds, not a vendor commitment. We covered the rumor lifecycle in more depth in our earlier post, GPT-5.6 release date, status, and what is real vs rumored.
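Keeping the model id configurable can be as small as one lookup with a pinned default. In this sketch the environment variable name is hypothetical, and the "gpt-5.6" value is a placeholder for whatever id eventually ships, not a confirmed identifier.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "gpt-5.5"            # the pinned id recommended in this article
MODEL_ENV_VAR = "APP_OPENAI_MODEL"   # hypothetical variable name


def resolve_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured model id, falling back to the pinned default."""
    source = os.environ if env is None else env
    value = source.get(MODEL_ENV_VAR, "").strip()
    return value or DEFAULT_MODEL


# With nothing set, production stays on the pinned default.
print(resolve_model({}))
# A future migration is a config change, not a code change.
print(resolve_model({MODEL_ENV_VAR: "gpt-5.6"}))  # placeholder id
```

Passing the mapping in explicitly keeps the function easy to test, and a blank or whitespace-only value falls back to the default instead of sending an empty model id to the API.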

How should developers migrate and build around this wave?

Three migration patterns are showing up in production deployments this month:

Drop-in upgrade from GPT-5.4 to GPT-5.5. The API is backward compatible. Pin the versioned string gpt-5.5 (not gpt-5.5-latest) so production stays stable. Expect a real quality lift on long-context retrieval and agentic execution, modest gains on simple chat, and possibly a small cost increase since reasoning effort defaults are slightly higher.
Codex CLI migration to Goal Mode. Move long-running developer tasks from chat-with-tools into goal definitions. The biggest unlock is delegating work that genuinely takes hours — multi-file refactors, full test sweeps, dependency migrations — and walking away. Validate the success criteria contract carefully; vague goals produce drift.
Reasoning cost discipline. Set reasoning.effort to low for simple tasks and reserve high for genuinely hard problems. A single high-effort call on a long prompt can consume 20K reasoning tokens at $30 per million — $0.60 per call before the final output, and those tokens also count against your context window.
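The reasoning-cost arithmetic in the last item can be sketched as follows. It uses only the figures quoted above (reasoning tokens billed at the $30 output rate and counted against the context window); the function name and the example prompt size are illustrative assumptions.

```python
OUTPUT_RATE_PER_MILLION = 30.00  # USD, GPT-5.5 standard output rate as quoted


def reasoning_overhead(reasoning_tokens: int, prompt_tokens: int,
                       context_limit: int) -> tuple[float, int]:
    """Return (USD cost of the reasoning tokens, context tokens left over)."""
    cost = reasoning_tokens * OUTPUT_RATE_PER_MILLION / 1_000_000
    remaining = context_limit - prompt_tokens - reasoning_tokens
    return round(cost, 4), remaining


# 20K reasoning tokens on a 300K-token prompt inside a 400K budget.
cost, remaining = reasoning_overhead(
    reasoning_tokens=20_000, prompt_tokens=300_000, context_limit=400_000
)
print(f"reasoning cost: ${cost:.2f}, context left for output: {remaining}")
```

At a thousand such calls per day the reasoning overhead alone is $600, which is why the effort setting deserves a per-task default rather than a global one.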

A minimal example pinning the version and using reasoning:

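from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-5.5",                # pin a versioned string
    messages=[
        {"role": "system", "content": "You are a Python expert."},
        {"role": "user", "content": "Refactor this function to use asyncio..."},
    ],
    # Chat Completions takes a flat reasoning_effort argument;
    # reasoning={"effort": ...} is the Responses API form.
    reasoning_effort="high",
)
print(resp.choices[0].message.content)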

For batch workloads, route everything through the Batch API at 50% off — it is the right default for evaluations, bulk grading, scheduled reports, and any workflow that does not need an interactive response.
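A batch job starts from a JSONL file with one request per line. The sketch below builds that payload for Chat Completions requests; the helper name and the custom_id scheme are our own choices, and the upload steps appear only as comments because they need a live API key.

```python
import json
from typing import Iterable


def build_batch_jsonl(prompts: Iterable[str], model: str = "gpt-5.5") -> str:
    """Serialize prompts as Batch API request lines (one JSON object per line)."""
    lines = []
    for index, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{index}",   # our own id scheme
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines) + "\n" if lines else ""


payload = build_batch_jsonl(["Grade essay 1", "Grade essay 2"])
print(payload, end="")

# Submitting the job (not run here; requires network access and credentials):
#   client = OpenAI()
#   with open("requests.jsonl", "rb") as f:   # file containing `payload`
#       batch_file = client.files.create(file=f, purpose="batch")
#   client.batches.create(
#       input_file_id=batch_file.id,
#       endpoint="/v1/chat/completions",
#       completion_window="24h",
#   )
```

The custom_id is what lets you match results back to inputs, since batch output order is not guaranteed to follow input order.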

What are the real developer gotchas this month?

Five worth flagging:

Codex 400K vs API 1M context mismatch. Feeding Codex a prompt sized for the 1M API limit fails with “exceeds the context window of this model.” Cap at 400K inside Codex.
Reasoning token billing. Thinking tokens bill at the output rate, not a separate tier. They also count against your context budget. Be deliberate about reasoning.effort.
Long-form hallucination. GPT-5.5's 86% rate on long-form factuality stands in stark contrast to Opus 4.7's 36%. For multi-paragraph factual generation without retrieval grounding, this is the model's biggest weakness.
Codex 0.134 profile migration. Legacy profile configs are rejected. Either pass --profile explicitly or update your config files to the new schema.
Memory Sources privacy. If you connect Gmail or Calendar and forget, related email content will surface in responses. Audit the Memory Sources panel before assuming anything is private.

What is still missing from the May 2026 wave?

For all the release cadence, three things did not ship and are worth tracking:

No native multimodal video understanding to match Gemini's roadmap. Sora remains separate; there is no “GPT-5.5-video” that ingests video frames directly in chat.
No long-context tier beyond 1M, while Gemini 3.5 Pro is expected to ship with 2M tokens. For full-codebase or full-corpus jobs, Gemini still owns the high end.
No first-party answer to Anthropic's pricing discipline. Opus 4.7's effective cost can rise by up to 35% from the new tokenizer, but its base $25 output rate is still cheaper than GPT-5.5's $30. OpenAI did not cut prices in May despite Gemini 3.5 Flash landing at $1.50 / $9.

Frequently asked questions

Did GPT-5.6 launch in May 2026?

No. A Codex log entry briefly referenced gpt-5.6 mid-May, consistent with backend canary testing, but there is no model card, no API endpoint, no benchmarks, and no published release date. Polymarket traders give roughly 80-89% odds of a public release by June 30, 2026, which is a betting market signal, not a vendor commitment.

What replaced GPT-5.3 Instant as the ChatGPT default?

GPT-5.5 Instant, on May 5, 2026. It is the new default model for every ChatGPT tier — Free, Go, Plus, Pro, Business, Enterprise, and Edu. Paid users can still select GPT-5.3 Instant from model settings for three months before retirement.

What does GPT-5.5 cost on the API?

$5 per million input tokens and $30 per million output tokens at standard tier. Cached input drops to $0.50 per million (a 90% discount). Batch and Flex tiers run at $2.50 / $15. The Pro variant is $30 / $180 for higher-effort reasoning. Reasoning tokens bill at the output rate and count against your context window.

How big is the GPT-5.5 context window?

1,000,000 tokens in the API and 400,000 tokens inside Codex CLI. The mismatch matters in practice: Codex sessions fed contexts sized for the API's 1M limit fail with a context-window error. Cap your effective context at 400K when targeting both surfaces.

What is Codex Goal Mode, and is it production-ready?

Goal Mode lets you define an outcome and success criteria, and Codex drives toward that goal autonomously for hours or days. It became default-on in Codex CLI v0.133.0 on May 21, 2026, with dedicated persistent storage and progress tracking across turns. It is generally available in CLI, the IDE extension, and the ChatGPT app. Production-readiness depends on how tight your success criteria are — vague goals produce drift on long runs.

How does GPT-5.5 compare to Claude Opus 4.7 for coding?

Opus 4.7 leads on SWE-bench Pro at 64.3% vs GPT-5.5's 58.6%, and its 36% long-form hallucination rate beats GPT-5.5's 86%. GPT-5.5 leads on autonomous execution — Terminal-Bench at 82.7% vs Opus 4.7's 69.4%, and OSWorld at 78.7%. Pick GPT-5.5 for agentic computer-use tasks; pick Opus 4.7 for code review and long-form output where factuality matters.

Should I migrate from GPT-5.4 to GPT-5.5 now?

Yes, if you use long-context retrieval, agentic execution, or Codex CLI: GPT-5.5 brings 74% long-context accuracy at 512K-1M tokens vs GPT-5.4's 36.6%, plus access to Codex Goal Mode. For pure chat completion at low reasoning effort, the upgrade is incremental. Pin the versioned model id gpt-5.5 (avoid gpt-5.5-latest) so production stays stable.

What is the safest way to handle Memory Sources privacy?

Audit the Memory Sources panel in your ChatGPT Plus or Pro account before sharing any chat. If you have connected Gmail or Calendar, related content will surface in responses when context is relevant — even from threads you forgot existed. Shared chats do not include the source list, so the privacy boundary holds at the share point, but the in-account experience does pull from connected accounts.

This roundup will be updated when GPT-5.6 launches, or by July 1, 2026 — whichever comes first. Last updated May 28, 2026.