GPT-5.5: The Complete Developer Guide (2026)

What GPT-5.5 actually ships, what it costs, where it leads on agentic and tool-use reliability, and where Claude Opus 4.7 or DeepSeek V4 are the better pick.

Last updated: May 1, 2026.

OpenAI shipped GPT-5.5 on April 23, 2026, with API access following on April 24. It is, by every public benchmark we have, the strongest agentic coding model on the market today, and the first OpenAI model that genuinely feels like it can sit inside a software engineering pipeline and carry multi-step work end to end without constant supervision. It is also expensive, verbose at high reasoning effort, and not the right default for every job. This guide walks through what the model actually does, what the variants and reasoning levels mean for cost and latency, how it compares to Claude Opus 4.7 and DeepSeek V4 Pro, and where each one wins.

TL;DR

  • Two API variants: gpt-5.5 and gpt-5.5-pro. Five reasoning levels: none, low, medium (default), high, xhigh.
  • Pricing: $5 / $30 per 1M input/output tokens for gpt-5.5, with a 90% cached-input discount ($0.50). gpt-5.5-pro is $30 / $180 per 1M and does not offer a cached-input discount. Batch API is 50% off.
  • Context: 1,000,000 tokens.
  • Headline benchmarks: 82.7% on Terminal-Bench 2.0 (state of the art), 82.6% on SWE-bench Verified, 58.6% on SWE-bench Pro, Intelligence Index 60 on Artificial Analysis (xhigh).
  • Where it wins: agentic tool use, long-running terminal/CLI workflows, token efficiency vs. Claude Opus 4.7 (~72% fewer output tokens on identical coding tasks).
  • Where it loses: Claude Opus 4.7 still leads on SWE-bench Pro (64.3% vs 58.6%) and broad architectural reasoning. DeepSeek V4 Pro is roughly 1/7th the price for ~85% of the capability on most non-frontier work.
  • Recommended API surface: Responses API, not Chat Completions. Web search and computer use are first-class tools.

What GPT-5.5 actually is

GPT-5.5 (codename "Spud") is OpenAI's flagship general-purpose reasoning model, succeeding GPT-5.4 and the GPT-5.3-Codex variant. OpenAI positions it as their strongest agentic coding model, but the model card and the Artificial Analysis evaluations both emphasize that the gains are broader than coding: it is also better at knowledge work, document and spreadsheet manipulation, computer use, and operating across tools without losing the thread of the task.

The most useful way to think about it: GPT-5.5 is the first OpenAI model where the "agent loop" feels like a first-class product surface rather than a wrapper over a chat model. That shows up in the API design (the Responses API has displaced Chat Completions for new projects), in tooling (built-in web search, computer use, file search, code interpreter, and remote MCP all live in the same primitive), and in the way Codex CLI and the Agent SDK route through GPT-5.5 by default.

If you're new to the GPT-5 family, our DeepSeek V3.1 Terminus vs. ChatGPT-5 vs. Claude 4.1 comparison covers the GPT-5 baseline; this guide assumes that lineage and focuses on what changes at 5.5.

Variants, reasoning levels, and pricing

OpenAI ships GPT-5.5 in two API SKUs and three ChatGPT surfaces. The reasoning-effort setting is the single most consequential dial in your API request — it changes both quality and the number of "reasoning tokens" you're billed for.

API SKUs and pricing per 1M tokens

| Model | Input | Cached input | Output | Batch (50% off) | Context | Best for |
|---|---|---|---|---|---|---|
| gpt-5.5 | $5.00 | $0.50 (90% off) | $30.00 | $2.50 in / $15.00 out | 1,000,000 | Default agentic coding, knowledge work, tool use |
| gpt-5.5-pro | $30.00 | not offered | $180.00 | $15.00 in / $90.00 out | 1,000,000 | Frontier reasoning, research-grade tasks, evals where every percentage point matters |

Reasoning levels

Both SKUs accept reasoning.effort with five values: none, low, medium (default), high, and xhigh. Higher effort burns more reasoning tokens (billed as output) and pushes time-to-first-token up sharply — Artificial Analysis measured TTFT around 115 seconds for gpt-5.5 at xhigh, versus a couple of seconds at medium. medium is the recommended starting point. Use xhigh only for jobs where a few extra dollars and a couple of minutes of thinking are worth it; in practice that's competition math, multi-file refactors with subtle invariants, or research-grade analysis.
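Here is what the effort dial looks like in a request. A minimal sketch using the openai Python SDK, assuming the Responses API surface carries over unchanged from earlier GPT-5 releases:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "medium"},  # none | low | medium | high | xhigh
    input="Refactor this function to remove the shared mutable state.",
)

print(response.output_text)
# Reasoning tokens are billed as output, so watch this number as you raise effort.
print(response.usage.output_tokens)
```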

Note that as of GPT-5.4, tool calling is no longer supported in Chat Completions when reasoning.effort is none — another reason to default to the Responses API.

Prompt caching mechanics

Caching is automatic. Any prompt of 1,024 tokens or more becomes a candidate, with cache lookups happening on a per-server basis. Place your stable preamble (system prompt, tool definitions, retrieved-doc context) at the front; put per-request variables at the end. For gpt-5.5, cached entries persist for 24 hours by default — much longer than the 5-10 minute window earlier GPT-5 models used. in_memory caching is no longer offered for 5.5+. The cached-input price drops to $0.50 per 1M, a 90% discount, but again: this discount does not apply to gpt-5.5-pro.
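In code, the ordering advice looks like this. A minimal sketch; the preamble text and the review helper are illustrative, and caching itself needs no flag:

```python
from openai import OpenAI

client = OpenAI()

# Stable scaffolding: system rules, tool docs, few-shot examples.
# Keep it identical across requests so every call shares one cached prefix.
STABLE_PREAMBLE = """You are a code-review agent. <rules, examples, retrieved docs>"""

def review(diff: str):
    return client.responses.create(
        model="gpt-5.5",
        instructions=STABLE_PREAMBLE,        # stable content first
        input=f"Review this diff:\n{diff}",  # per-request variables last
    )

resp = review("--- a/app.py\n+++ b/app.py\n...")
# cached_tokens is the slice of input billed at the 90%-off rate.
print(resp.usage.input_tokens_details.cached_tokens)
```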

Benchmarks: GPT-5.5 vs. Claude Opus 4.7 vs. DeepSeek V4 Pro

The benchmarks worth caring about for engineering buyers are the ones tied to real workflows: SWE-bench Pro and Verified (does the model actually fix GitHub issues?), Terminal-Bench 2.0 (can it drive a CLI through a multi-step task?), LiveCodeBench (held-out competitive programming, harder to memorize), GPQA Diamond (PhD-level reasoning), AIME 2025 (Olympiad math), and MMLU-Pro (a contamination-resistant successor to MMLU). Here is the snapshot at launch.

| Benchmark | GPT-5.5 (xhigh) | Claude Opus 4.7 | DeepSeek V4 Pro (Max) | Notes |
|---|---|---|---|---|
| Artificial Analysis Intelligence Index | 60 | ~57 | 52 | GPT-5.5 leads the composite at launch. |
| SWE-bench Verified | 82.6% | ~80% | ~78% | Some specialist agents (e.g. Mythos) score higher with scaffolding. |
| SWE-bench Pro | 58.6% | 64.3% | 55.4% | Opus 4.7 still leads end-to-end repo work. |
| Terminal-Bench 2.0 | 82.7% | 69.4% | 67.9% | GPT-5.5 is well ahead on CLI/agent loops. |
| LiveCodeBench | ~88% | ~83% | ~80% | Pass@1 on held-out contest problems. |
| GPQA Diamond | ~89% | ~87% | ~82% | PhD-level science MCQ. |
| AIME 2025 | ~96% | ~92% | ~93% | Olympiad math, short-answer. |
| MMLU-Pro | ~87% | ~86% | ~83% | Contamination-resistant MMLU successor. |
| BrowseComp | 84.4% | ~78% | 83.4% | DeepSeek closes a lot of the gap on web research. |

The honest one-line summary: GPT-5.5 leads the agentic and tool-use evals, Claude Opus 4.7 still leads codebase-resolution evals like SWE-bench Pro and CursorBench, and DeepSeek V4 Pro is closer to both than its price tag suggests, especially on browsing, terminal work, and MCP Atlas. For a deeper benchmark-by-benchmark dive on the cheaper alternative see our DeepSeek V4 vs. GPT-5.5 Pro head-to-head and the DeepSeek V4 complete guide.

Token efficiency, not just accuracy

One number that almost never makes the marketing chart but shows up immediately in your bill: on the same coding task, with the same prompt and the same goal, GPT-5.5 produces roughly 72% fewer output tokens than Claude Opus 4.7. That partially neutralizes Opus's per-token price advantage and is one reason routing layers (Cursor, Codex, Cline) increasingly pick GPT-5.5 by default and only escalate to Opus on jobs where the extra architectural reasoning is worth the verbosity premium.

What's new vs. GPT-5 and GPT-5 Codex

GPT-5.5 is more an evolution than a re-architecture, but several capabilities cross the line from "preview" to "production":

  • Agentic coding takes the front seat. GPT-5.3-Codex was the specialist; GPT-5.5 is now competitive with — and in many evals ahead of — the dedicated Codex variants while staying general-purpose. OpenAI's stated direction is to let GPT-5.5 cover most agentic engineering work and reserve Codex-specific models for narrow IDE-embedded loops.
  • Native computer use. First introduced in GPT-5.4; in 5.5 it becomes the default for desktop-control agents. The model takes screenshots, emits mouse/keyboard actions, and can drive Playwright-style browser sessions inside the same call.
  • Built-in web search. Configure with {"type": "web_search"} in the Responses API tools array (a minimal call is sketched after this list). The legacy web_search_preview still works for older integrations but doesn't expose newer filters or the external_web_access control.
  • Lower output verbosity per task at the same accuracy. OpenAI claims, and third-party measurements support, that GPT-5.5 finishes Codex-style tasks with fewer tokens than GPT-5.4 — a real cost reduction on agentic pipelines.
  • Long cache TTL. 24 hours by default for cached prompt prefixes, vs. minutes on earlier GPT-5 models. For RAG and agent loops with stable scaffolding this is a meaningful win.

For a wider lens on how this generation lines up with peers, see our Muse Spark vs. ChatGPT-5.4 vs. Opus 4.6 vs. Gemini 3.1 Pro analysis and the Llama 4 vs. GPT-4.5 reference.

Responses API vs. Chat Completions

OpenAI's official position since GPT-5.4 is that the Responses API is the default for new projects. Chat Completions is still supported, but the gap is widening with each release.

Concrete differences that matter:

  • Built-in tools. Web search, file search, computer use, code interpreter, and remote MCP servers are all first-class in Responses. In Chat Completions you wire them up yourself.
  • State. Pass previous_response_id or use the Conversations API and the platform handles state (a chaining example follows this list). With Chat Completions you re-send the whole transcript, paying for it on every turn.
  • Cache hit rate. OpenAI's internal numbers show 40-80% better cache utilization on Responses vs. Chat Completions — a direct cost cut.
  • SWE-bench delta. Same prompt, same model, but Responses scores about 3 points higher on SWE-bench Verified than Chat Completions, because reasoning state is preserved between tool calls instead of being thrown away each turn.
  • Tool calling at reasoning.effort: none. Not supported in Chat Completions starting with GPT-5.4; works fine in Responses.
  • Structured outputs. Use text.format in Responses, not response_format. Strict mode is recommended as the default — if you've hand-rolled JSON-schema retries, delete that code.
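A minimal sketch of the state difference: chaining turns with previous_response_id instead of re-sending the transcript (the prompts are illustrative):

```python
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-5.5",
    input="Summarize the failure modes in this incident report: ...",
)

followup = client.responses.create(
    model="gpt-5.5",
    previous_response_id=first.id,  # the platform replays prior state server-side
    input="Now draft the postmortem action items.",
)

print(followup.output_text)
```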

If you're maintaining a Chat Completions integration, the migration is mostly mechanical, and the cache-hit savings tend to pay for the engineering time within a few weeks of traffic.

Tool calling, structured outputs, and the agent surface

Tool calling is unchanged in shape from GPT-5.4 — JSON-schema function tools and free-form "custom tools" (for SQL, shell, config payloads) are both supported. The two practical pieces of advice:

  1. Always set strict: true on function tools. The reliability gain is large, and the only reason not to is if you have legacy schemas with patterns the strict validator rejects (in which case, fix the schema).
  2. Stop describing your output schema in the system prompt. Use Structured Outputs with a JSON Schema instead. The model adheres to it without you spending tokens explaining it, and you don't need a "validate, retry, fix" wrapper around the call. A combined sketch of both pieces of advice follows.
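In one hedged sketch, then: the run_sql tool and the answer schema below are illustrative stand-ins for your own.

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "name": "run_sql",
    "description": "Run a read-only SQL query against the analytics DB.",
    "strict": True,  # advice #1: always strict
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
        "additionalProperties": False,
    },
}]

response = client.responses.create(
    model="gpt-5.5",
    tools=tools,
    # Advice #2: the output schema lives in text.format, not in the system prompt.
    text={
        "format": {
            "type": "json_schema",
            "name": "answer",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string"},
                    "row_count": {"type": "integer"},
                },
                "required": ["summary", "row_count"],
                "additionalProperties": False,
            },
        }
    },
    input="How many signups did we get last week?",
)
```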

For full agent loops, the OpenAI Agent SDK (Python and TypeScript) and Codex CLI now both default to GPT-5.5 with medium reasoning. Codex CLI is open-source, written in Rust, and the easiest place to feel the difference between 5.5 and earlier models — the same task usually finishes in fewer turns and fewer tokens.

How to choose: GPT-5.5, Claude Opus 4.7, or DeepSeek V4 Pro

Most production teams running real volume should be using more than one of these. The hard part is the routing logic. A reasonable default (sketched in code after the list):

  • GPT-5.5 (medium): the workhorse. Tool-heavy agent jobs, terminal-driven workflows, structured-output pipelines, anything where you want predictable behavior with the smallest possible token bill.
  • GPT-5.5 Pro / xhigh: reserve for genuine frontier work — research, hard math, multi-file refactors with subtle invariants. Don't put it on the hot path of a high-QPS product.
  • Claude Opus 4.7: the right choice when the task is "hold a 200k-line codebase in your head and reason about an architectural change." It still leads SWE-bench Pro by ~6 points and tends to write more idiomatic prose. The verbosity tax is real; price for it.
  • DeepSeek V4 Pro: the right choice when cost dominates. At roughly $1.74 / $3.48 per 1M tokens it's about 1/7th the price of GPT-5.5 standard, and it's within striking distance on Terminal-Bench, MCP Atlas, BrowseComp, and most knowledge-work evals. Open weights. Slower on the frontier; faster on your bill.
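As a shape rather than a policy, the routing logic can be a plain function. Everything here is illustrative: the task tags, the thresholds, and the non-OpenAI model ID strings are placeholders, not official identifiers.

```python
def pick_model(task_type: str, context_tokens: int, frontier: bool = False) -> str:
    """Toy router. Default cheap and predictable; escalate only when justified."""
    if frontier:                          # research, hard math, subtle refactors
        return "gpt-5.5-pro"
    if task_type == "architecture" and context_tokens > 200_000:
        return "claude-opus-4.7"          # placeholder ID: large-codebase reasoning
    if task_type in {"browse", "bulk", "batch"}:
        return "deepseek-v4-pro"          # placeholder ID: cost-dominated work
    return "gpt-5.5"                      # the workhorse default
```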

For a broader catalog of alternatives — including local models worth running yourself — see our top 10 ChatGPT alternatives in 2026 and the Qwen3.5 Omni-Plus vs. GPT-4o vs. Gemini 3.1 Pro comparison.

Known limitations

OpenAI's marketing won't tell you any of this, so we will:

  • TTFT at high reasoning is brutal. 115-second time-to-first-token at xhigh on the Responses API is not a misprint. If your product UX expects a streaming reply within five seconds, do not put xhigh on the hot path.
  • Verbose at high effort. The model "thinks out loud" with a lot of internal tokens. Those count as billable output. Cap with max_output_tokens for any user-facing flow; see the sketch after this list.
  • Pro tier has no cached-input discount. If your workload has a stable preamble, this nukes one of the main reasons to keep prefixes long. Either drop to gpt-5.5 for the prefix-heavy calls or shorten your context.
  • Still loses to Claude Opus 4.7 on SWE-bench Pro. 58.6% vs. 64.3%. If your evaluation harness is closest to "fix this real GitHub bug across 40 files," Opus is the better default.
  • Web search requires a reasoning model in the API. Non-reasoning GPT-5 surfaces don't expose the tool the same way.
  • API-key auth in Codex CLI lagged the launch. At rollout, GPT-5.5 in Codex required ChatGPT-account sign-in; API-key auth caught up shortly after but check the changelog before you wire automation.
  • Computer-use is still narrow. Best on browser-shaped UIs and well-instrumented desktop apps. Native mobile, custom-rendered canvases, and protected enterprise apps are still flaky.
  • Costs explode quietly. Reasoning tokens are billed as output, web-search tool calls are billed per call, and computer-use sessions accumulate per-screenshot costs. Instrument before you scale.
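For the verbosity point above, a minimal sketch of capping billable output and detecting truncation; the incomplete-status handling assumes the Responses API behavior of earlier GPT-5 releases:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    max_output_tokens=4_000,  # reasoning tokens count against this cap
    input="Explain the race condition in this log excerpt: ...",
)

if response.status == "incomplete":
    # The model hit the cap mid-thought; retry with a higher cap or lower effort.
    print("truncated:", response.incomplete_details.reason)
else:
    print(response.output_text)
```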

What this means if you're hiring

The GPT-5.5 generation moves AI tooling from "smart autocomplete" to "an agent that can drive your CLI, your browser, and your IDE." That changes what a senior engineer is worth — not less, but differently. You still need humans who can architect systems, write the evals, design the prompts, decide which model handles which call, and keep the agent loop from doing something stupid in production. What changes is that one such engineer, properly equipped, replaces a much larger headcount on the routine work.

That's the kind of remote-ready developer Codersera vets for. If you're building agentic tooling, AI-driven backends, or just want a Python or AI engineer who already knows the difference between Responses and Chat Completions, talk to us.

FAQ

When was GPT-5.5 released?

April 23, 2026 in ChatGPT and Codex; April 24, 2026 in the public API.

Is there a GPT-5.5-mini?

Not at launch. The mini tier is still served by gpt-5-mini from the GPT-5 family. OpenAI has signaled a 5.5-mini is on the roadmap but has not committed to a date.

What is GPT-5.5 Pro?

A higher-capability variant available in the API as gpt-5.5-pro and in ChatGPT for Pro, Business, Enterprise, and Edu plans. It's priced at $30 / $180 per 1M tokens (input/output) and does not offer a cached-input discount.

What are the reasoning levels?

none, low, medium (default), high, and xhigh. Higher levels burn more reasoning tokens (billed as output) and have higher TTFT but score better on hard evals.

How big is the context window?

1,000,000 tokens for both gpt-5.5 and gpt-5.5-pro.

How does prompt caching work for GPT-5.5?

Automatic on prompts of 1,024+ tokens. Cache TTL is 24 hours by default for the 5.5+ family. Cached input is billed at $0.50 per 1M (90% off) on gpt-5.5; the discount is not offered on gpt-5.5-pro. Place stable content (instructions, examples, retrieved docs) at the front of your prompt to get the most cache hits.

Is the Batch API supported?

Yes; both SKUs get the standard 50% Batch discount. service_tier: "flex" on the Responses API gives the same 50% off with looser SLAs.

Should I use the Responses API or Chat Completions?

Responses, for any new project. Lower cost from better caching, higher quality on tool-heavy tasks, and built-in web search, computer use, file search, and MCP. Chat Completions is in maintenance mode for new feature surface.

Does GPT-5.5 support structured outputs?

Yes — Structured Outputs with strict JSON Schema validation. Use text.format in Responses (not response_format). Set strict: true on function tools.

How does GPT-5.5 compare to GPT-5 Codex?

On general agentic coding (Terminal-Bench, multi-step CLI tasks) GPT-5.5 is ahead. On narrow IDE-embedded refactoring loops, GPT-5.3-Codex still has a slight edge on some coding-only averages (~63 vs. ~59 on certain coding panels). For most teams the right answer is GPT-5.5 by default and Codex models only inside the Codex IDE/CLI.

How does GPT-5.5 compare to Claude Opus 4.7?

Roughly even at the headline level: Opus leads SWE-bench Pro and architectural reasoning across large codebases; GPT-5.5 leads Terminal-Bench 2.0, agentic tool-use, and token efficiency (~72% fewer output tokens on identical coding tasks). Best production setups route between them.

How does GPT-5.5 compare to DeepSeek V4 Pro?

GPT-5.5 leads the Artificial Analysis Intelligence Index 60 to 52 and is ahead on Terminal-Bench (82.7% vs. 67.9%) and SWE-bench Pro (58.6% vs. 55.4%). DeepSeek V4 Pro is roughly 1/7th the price, has open weights, and gets within a couple of points on BrowseComp and MCP Atlas. For cost-sensitive workloads or self-hosted deployments, V4 Pro is the right choice. See our head-to-head for the deep dive.

Is GPT-5.5 good at long-context tasks?

The 1M-token window is real but not free. Quality degrades on very long contexts as it does for every frontier model. For RAG, smaller, well-curated context still beats stuffing the window. Pair the long window with prompt caching and put your stable scaffolding at the front.

Can GPT-5.5 use a computer?

Yes, via the computer_use tool in the Responses API. It can take screenshots, control mouse and keyboard, and drive browser automation. Best on browser-shaped UIs; still flaky on protected enterprise apps and custom-rendered canvases.

Where can I run GPT-5.5 locally?

You can't — it's API-only and OpenAI does not publish weights. If you need on-prem or air-gapped, the closest open-weight options are DeepSeek V4 and the Qwen3.5 family.

Next steps

If you're shipping with GPT-5.5, the bottleneck is rarely the model — it's the engineer who knows how to wire Responses, structured outputs, prompt caching, tool calling, and a sane fallback policy into a system that doesn't fall over at 3am. Hire a Codersera-vetted Python or AI engineer who has already built one.