Cursor 3 + Composer 2: Walkthrough and the "Beats Opus 4.6 at 1/20 the Price" Reality Check

Quick answer. Cursor 3 (Apr 2 2026) rebuilds the IDE around an Agents Window with parallel agents and cloud↔local handoff. Composer 2 — Cursor's in-house model built on Kimi K2.5 — scores 61.7 on Terminal-Bench 2.0 vs Opus 4.6 at 58.0, at $0.50/$2.50 per million tokens. It is fast and cheap, but still trails GPT-5.5 and Opus 4.7 on the hardest work.

Cursor shipped two things this spring that change the daily-driver math for AI-assisted engineering. Composer 2 (Mar 19 2026) is Cursor's first own-brand coding model with credible frontier benchmarks. Cursor 3 (Apr 2 2026) is a ground-up rebuild of the IDE around agents instead of a text editor with a chat sidebar. Together they push Cursor closer to a managed agent platform and away from the "VS Code fork with autocomplete" lineage it started in.

This walkthrough covers what actually changed in the UI, what Composer 2 is (and isn't), how to enable both, and where the "beats Opus 4.6 at 1/20 the price" headline holds up under load. The short version: Composer 2 is the right default for fast, iterative agent work; Opus 4.7 and GPT-5.5 still win the hardest tasks. The longer version is below.

What changed in Cursor 3?

Cursor 3 is the first Cursor release where the agent is the IDE. The classic three-pane layout (file tree, editor, chat) is still available, but the new default is the Agents Window: a workspace that treats each agent run as a first-class tab with its own context, model, and execution environment.

Five concrete changes:

  • Agents Window. A dedicated workspace that hosts multiple agent sessions. Each session lives in its own tab and can be tiled side-by-side, stacked, or arranged in a grid. Tabs are independent — different repos, different models, different worktrees.
  • Parallel agents. You can run many agents at once across local, worktree, cloud, and remote SSH environments. Each agent operates inside an isolated git worktree (via the /worktree command), so concurrent edits do not collide.
  • Cloud↔local handoff. Agents move between cloud infrastructure and your local machine mid-task. Start a long-running refactor on cloud, hand the diff back to a local agent for review and finishing. Cloud agents produce screenshots and demo videos so you can verify work without running it yourself.
  • Faster repo indexing. Time-to-first-query for the median repo drops from 7.87 s to 525 ms. P90 falls from 2.82 minutes to 1.87 s. P99 falls from 4.03 hours to 21 s. Index reuse across teammates is now the default, so onboarding a new developer no longer means rebuilding the index from scratch.
  • New PR review surface. The Reviews tab shows inline threads and top-level comments; the Commits tab gives a focused history view; the Changes tab adds a file tree and changes picker for navigating large PRs.

The strategic shift is real. Cursor 3's interface is closer to Google Antigravity or Devin's hosted-agent dashboards than to VS Code. If you came to Cursor for autocomplete and the chat panel, the new defaults will feel alien for a week. If you came for Composer or background agents, the new defaults are an obvious upgrade.

What is Composer 2 and how does it compare to Opus 4.7 and GPT-5.5?

Composer 2 is Cursor's second-generation in-house coding model. Cursor initially marketed it as a proprietary model; a developer reading the model identifier discovered it is built on Kimi K2.5, Moonshot AI's open-weight Chinese model, accessed through Fireworks AI under a commercial partnership. Cursor's VP of developer education Lee Robinson later clarified that ~75% of the compute behind Composer 2 was Cursor's own reinforcement learning training; ~25% was the K2.5 base.

The benchmark numbers, on Cursor's published evaluations and the public Terminal-Bench leaderboard:

Model                 Terminal-Bench 2.0   Pricing (in / out per 1M tokens)   Context
Composer 2 (Cursor)   61.7                 $0.50 / $2.50                      256K
Claude Opus 4.6       58.0                 $5 / $25                           1M
GPT-5.4               75.1                 (varies by tier)                   1M+

Two newer frontier releases sit above Opus 4.6: Claude Opus 4.7 leads SWE-bench Pro at 64.3% (real GitHub-issue resolution), and GPT-5.5 still tops Terminal-Bench at around 82.7% on the most recent runs. Composer 2 does not match either of those on the hardest benchmarks — but the headline "beats Opus 4.6 at 1/20 the price" is real.

Price-per-task math, with conservative assumptions. A typical agentic coding task on Cursor (call it 80K input tokens, 20K output tokens):

  • Composer 2: $0.04 input + $0.05 output = ~$0.09 per task
  • Opus 4.6: $0.40 input + $0.50 output = ~$0.90 per task
  • Opus 4.7 (cached-heavy run): ~$0.30–$0.60 per task

For high-volume, low-stakes work — boilerplate, test scaffolding, doc passes — Composer 2's per-task cost rounds to noise. For genuinely hard work — architectural refactors, novel bug investigation, cross-repo reasoning — the marginal extra dollar on Opus 4.7 or GPT-5.5 usually pays for itself in fewer retries and less human review time.
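The per-task arithmetic above is easy to reproduce. The sketch below uses the published per-million-token rates from the table and the same illustrative 80K-in / 20K-out task split; swap in your own traffic numbers to model a team's monthly spend.

```python
def task_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Cost of one agent task given per-million-token rates (USD)."""
    return (in_tokens / 1e6) * in_price_per_m + (out_tokens / 1e6) * out_price_per_m

# Illustrative 80K-input / 20K-output agentic task, at the rates quoted above.
composer2 = task_cost(80_000, 20_000, 0.50, 2.50)   # 0.04 + 0.05
opus46    = task_cost(80_000, 20_000, 5.00, 25.00)  # 0.40 + 0.50

print(f"Composer 2: ${composer2:.2f} per task")  # $0.09
print(f"Opus 4.6:   ${opus46:.2f} per task")     # $0.90
```

At 1,000 such tasks a month the gap is roughly $90 vs $900, which is why the per-task cost "rounds to noise" for high-volume work.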

How do you enable Composer 2 and the Agents Window?

Cursor 3 enables Composer 2 by default in Auto mode for new installs. If you are upgrading from Cursor 2.x:

  1. Update Cursor. Cursor → Check for Updates. You want 3.0 or later (3.1 added tiled layouts in the Agents Window and improved voice input).
  2. Open the Agents Window. Cmd/Ctrl + Shift + A, or the new icon in the activity bar. The window is separate from the editor and can be popped out.
  3. Pick a model. The model picker now lives in the agent tab header, not the global chat panel. Choose composer-2 for Cursor's in-house model, auto to let Cursor route, or any of the third-party models you have keys for.
  4. Toggle environment. Each tab picks its execution environment: local working tree, isolated worktree (/worktree), cloud sandbox, or remote SSH. Cloud and SSH require workspace settings → Agents to be configured first.
  5. (Optional) Enable Design Mode. In an agent tab, Design Mode lets you point-and-click on UI elements in the embedded browser to give the agent a targeted reference. Useful for front-end work; skip it for backend.

If you want to ship Composer 2 across a team, set it as the default in Settings → Models → Default for Auto, then enforce via workspace settings checked into the repo.
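A repo-checked workspace settings file pinning the team default might look something like the fragment below. The file location and key names here are illustrative guesses, not Cursor's documented schema — verify against the current workspace-settings documentation before rolling it out.

```json
{
  "models": {
    "defaultForAuto": "composer-2"
  }
}
```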

How do parallel agents work in practice?

The parallel-agents pattern is the biggest day-to-day workflow change in Cursor 3. Five concrete examples we have used:

  1. Refactor + tests + docs in three tabs. Tab 1: refactor a service to use a new auth helper (Composer 2, local worktree). Tab 2: write tests for the new interface (Composer 2, sibling worktree). Tab 3: update the README and OpenAPI doc (Composer 2, third worktree). All three run concurrently; you merge worktrees at the end. Total wall-clock: ~7 minutes for ~600 lines of changes.
  2. Best-of-N on a hard bug. Spawn three agent tabs against the same bug — one with Composer 2, one with Opus 4.7, one with GPT-5.5 — each in its own worktree. Compare diffs after ~10 minutes. Usually one solution is obviously cleaner; sometimes you cherry-pick parts of two.
  3. Cloud agent + local agent split. Long-running task (large codebase migration, multi-file refactor) on a cloud agent overnight. Next morning, hand the WIP branch off to a local agent in a fresh tab for finishing touches and review. Cloud agent's screenshot/demo log makes the handoff auditable.
  4. Multi-repo coordination. Frontend tab in repo A, backend tab in repo B, infra tab in repo C — all running in parallel. The Agents Window is inherently multi-workspace, so cross-repo work no longer requires three Cursor windows.
  5. Continuous sweep. One tab dedicated to a long-running "fix all flaky tests in tests/integration" loop with a high step budget, while you continue normal work in other tabs. Composer 2's speed (around 200 tok/s in Cursor's own benchmarks) makes sweep-style work feel responsive instead of glacial.

The mental shift is small but real. You stop driving one agent at a time and start orchestrating. Most experienced users settle on 2–4 concurrent agents — beyond that, context-switching cost eats the throughput gains.
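Under the hood, the isolation pattern is ordinary git worktrees: one branch-backed working directory per concurrent task, so parallel edits never touch the same checkout. A minimal sketch of the mechanics, assuming git is on PATH — the "agent" work is stubbed out with a file write; only the git plumbing is real:

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

def git(args, cwd):
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

# Throwaway repo with one commit, so worktrees have a base to branch from.
repo = tempfile.mkdtemp()
git(["init", "-b", "main"], repo)
git(["-c", "user.email=ci@example.com", "-c", "user.name=ci",
     "commit", "--allow-empty", "-m", "init"], repo)

# One isolated worktree per task. Created sequentially: `git worktree add`
# takes a repo-level lock, so concurrent creation can fail.
tasks = ["refactor", "tests", "docs"]
worktrees = {}
for name in tasks:
    path = os.path.join(repo, ".wt", name)
    git(["worktree", "add", "-b", f"agent/{name}", path], repo)
    worktrees[name] = path

# The "agents" then run in parallel, each writing only inside its own tree.
def fake_agent(item):
    name, path = item
    with open(os.path.join(path, f"{name}.txt"), "w") as f:
        f.write(f"work done in {name}\n")

with ThreadPoolExecutor() as ex:
    list(ex.map(fake_agent, worktrees.items()))
```

Each branch (`agent/refactor`, `agent/tests`, `agent/docs`) is then merged back once you have reviewed the diffs, which is the same merge-at-the-end step Cursor's /worktree flow leaves you with.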

When should you use Composer 2 vs an external model?

The decision matrix that has held up across a few weeks of real use:

Task type                                              Best fit              Why
Boilerplate, scaffolding, test writing                 Composer 2            Fast, cheap, accurate enough; cost rounds to zero
Sweep refactors across many files                      Composer 2            200 tok/s speed makes high step counts tolerable
Doc passes, comment cleanup                            Composer 2            Trivially within Composer's strength zone
Architectural refactor across a large codebase         Opus 4.7              Best long-context reasoning; SWE-bench Pro leader
Terminal-heavy DevOps automation                       GPT-5.5               Dominant Terminal-Bench score (~82.7%)
Novel bug investigation                                Opus 4.7 or GPT-5.5   Better at hypothesis search and tool-use precision
Multi-step agent loops with tight budgets              Composer 2            Cheap enough to run with generous step budgets
Production-critical generation (financial, security)   Opus 4.7              Lowest hallucination rate; best at architectural correctness

The lazy default we ended up at: Composer 2 for everything until it visibly fails, then escalate to Opus 4.7 or GPT-5.5. With Cursor's Auto routing you can let the platform make this call, but explicit escalation is faster than re-running a failed Composer task.
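The "cheap first, escalate on failure" default is straightforward to encode if you script agents through your own harness. The model names, `run_task`, and `passes_checks` below are placeholders — Composer 2 itself has no public API, so this pattern only applies to models you can actually call:

```python
ESCALATION_LADDER = ["cheap-model", "frontier-model"]  # placeholder names

def run_with_escalation(task, run_task, passes_checks, ladder=ESCALATION_LADDER):
    """Try the cheapest model first; escalate only when its output fails review.

    run_task(model, task) -> result and passes_checks(result) -> bool are
    supplied by your harness; both are stubs here.
    """
    for model in ladder:
        result = run_task(model, task)
        if passes_checks(result):
            return model, result
    # Every rung failed: surface the last attempt for human review.
    return ladder[-1], result

# Toy harness: the cheap model only "solves" easy tasks.
def run_task(model, task):
    return {"model": model, "ok": task == "easy" or model == "frontier-model"}

model, result = run_with_escalation("hard", run_task, lambda r: r["ok"])
print(model)  # frontier-model
```

The point of encoding it explicitly is the one the paragraph makes: a deliberate escalation after a visible failure is cheaper than repeatedly re-running the cheap model and hoping.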

Companion guide

For everything Cursor — features, workflows, comparisons — see our Cursor IDE complete guide for 2026.

What are Composer 2's known limitations?

Composer 2 is impressive for the price, but it is not a silver bullet. Four issues are worth knowing before you commit a team to it:

  • Cursor-only. Composer 2 is exclusive to the Cursor IDE. There is no public API, no third-party tool integration, no way to call it from your own harness, CLI, or CI pipeline. If your workflow uses Aider, Claude Code, or a homemade agent loop, Composer 2 is not an option.
  • Agent-mode-first. Composer 2 is tuned for the agent loop. It is available for inline edits but the gap vs. Opus 4.7 or GPT-5.5 is larger there than in agent mode. If your team mostly uses inline edits and Tab completions, Composer 2's price advantage matters less.
  • Large-context degradation. 256K of context is generous in absolute terms but well short of the 1M-class windows on Opus 4.7 and GPT-5.5. On tasks that genuinely need 500K+ tokens (whole-repo analysis, long log inspection), Composer 2's quality drops faster than the frontier models'.
  • Tooling rough edges. Early-release issues continue to surface in Cursor's forum — for example, the reasoning_content field from DeepSeek-style thinking models is not rendered cleanly in some agent tabs, and the 200K-token Auto-mode cap on certain DeepSeek configs has been confusing for users expecting full context. These are fixable, not fundamental, but they bite during a migration.

None of these are dealbreakers. They are the reasons to keep Opus 4.7 or GPT-5.5 access alongside Composer 2 instead of switching the team to Composer-only.

Who can help you roll this out?

Adopting Cursor 3 and Composer 2 across a real engineering team is more than "install the update." Done well, it changes how PRs are scoped, how reviews are structured, how budgets are tracked, and how junior engineers are trained. Codersera matches you with vetted remote developers who have shipped agentic-coding workflows in production — Cursor power users, harness authors, and AI-pair-programming-experienced engineers who can lead the rollout instead of being users of it. We run a risk-free trial so you can validate technical fit before committing.

FAQ

Does Composer 2 actually beat Claude Opus on real work?

It beats Opus 4.6 on Terminal-Bench 2.0 (61.7 vs 58.0) and on Cursor's own internal coding evals. It does not beat Opus 4.7, the current Anthropic flagship, on the hardest benchmarks — Opus 4.7 leads SWE-bench Pro at 64.3% on real GitHub issues. For 80% of day-to-day engineering tasks, Composer 2 is good enough at 1/20 the cost; for the remaining 20%, Opus 4.7 still earns its price.

Is Composer 2 just Kimi K2.5 with a different label?

No, but the relationship is closer than Cursor's launch post implied. Cursor's VP of developer education later confirmed Composer 2 uses Kimi K2.5 as the base model, with ~25% of compute from the base and ~75% from Cursor's own reinforcement learning training. The K2.5 weights are accessed through Fireworks AI under a commercial agreement with Moonshot AI. Functionally, treat it as a Cursor-tuned coding specialization of K2.5.

Can I use Composer 2 outside of Cursor?

No. Composer 2 is exclusive to Cursor IDE. There is no public API, no SDK, and no integration with external agent frameworks. If you need a comparably cheap coding model in your own harness, look at the underlying Kimi K2.5 (available via Moonshot AI or Fireworks AI) or open-weight alternatives like DeepSeek V4.

How many parallel agents can I actually run?

Technically there is no hard cap; practically most experienced users settle on 2–4 concurrent tabs. Beyond that the cognitive load of tracking each tab's progress eats the throughput gains. Worktree isolation prevents file conflicts; the bottleneck is human attention, not machinery.

Do I need the Team or Business plan for cloud agents?

Cloud agents and cloud↔local handoff require a paid plan that includes the cloud-agents quota; Cursor's Pro plan exposes cloud agents with usage limits, while Team and Business plans offer larger quotas and shared workspace settings. Check the current pricing page for exact tier inclusions before rolling out across a team.

Will Cursor 3 replace the classic IDE layout entirely?

No. The classic three-pane editor is still available and still receives updates; Cursor 3 just changes what opens by default. You can pin the editor and ignore the Agents Window, or live in the Agents Window and call up the editor for surgical edits. The two are complementary, not exclusive.

Is Cursor 3 stable enough for production teams?

The 3.x line has shipped patch releases on a fast cadence (3.1 added tiled layouts and improved voice input within weeks of 3.0). Early adopters report a noticeable workflow upgrade once they internalise the Agents Window; the rough edges are mostly UI polish and model-specific issues, not core reliability. For a low-risk rollout, pilot with two or three engineers for a week before flipping the team default.