Quick answer. To generate Playwright tests with Claude Code or Cursor, install the Playwright MCP server (npx @playwright/mcp@latest), wire it into your agent, and have the agent drive a real browser so it generates locators from the live DOM instead of guessing. Then run a tight review loop: replace brittle selectors with getByRole/getByTestId, assert on outcomes, and isolate state before you commit.
AI coding agents are good at writing Playwright test scaffolding and bad at writing Playwright tests that survive a week in CI. The gap is not the model — it is grounding. An agent prompted with “write an E2E test for checkout” hallucinates selectors from training data: .btn-primary, #submit, an XPath three divs deep. None of them exist in your app. The test is green on the first run because the agent never actually ran it against your DOM, and red the moment CI does.
The Playwright MCP (Model Context Protocol) server closes that gap. It gives Claude Code or Cursor a live, controllable browser. The agent navigates your real application, reads the real accessibility tree, finds locators that actually resolve, and generates test code grounded in what it observed — not what it remembers. This guide is the full workflow: install, wire-up, generate against the live DOM, the review loop, the three failure modes that still bite, a copy-paste conventions file, and how to keep the output stable in CI. Commands are real and current as of 2026.
Why generate E2E tests with an AI agent at all?
End-to-end tests are the most expensive tests to write by hand and the first ones teams skip. Writing a single well-isolated flow — log in, navigate, perform an action, assert the outcome, clean up — is 30 to 60 minutes of fiddling with selectors and waits. Multiply that by every critical path and the suite never gets written.
An agent with a live browser collapses the selector-discovery cost. It clicks through the flow once, records the locators that resolve, and emits a Playwright spec in seconds. The economics flip: you spend your time reviewing and hardening tests instead of discovering selectors.
The catch — and the reason naive prompting fails — is that without a live browser the agent has nothing to ground on. Three things go wrong every time:
- Invented selectors. The agent writes page.locator('.checkout-button') because that is what tutorial code looks like. Your button has no such class.
- Guessed waits. Lacking a real timing signal, the agent sprinkles page.waitForTimeout(3000), the single biggest cause of flaky Playwright tests.
- No state model. The agent does not know your app needs auth, so the test 302s to a login page and asserts against the wrong DOM.
The MCP workflow fixes all three at the source: the agent sees the real DOM, so selectors resolve; it observes real navigation, so it waits on real conditions; and it hits the login wall itself, so it models auth correctly. For background on where browser-driving agents fit in the wider tooling landscape, see our AI coding agents complete guide for 2026.
What is the Playwright MCP server and how does it work?
The Playwright MCP server (@playwright/mcp, maintained by Microsoft) is a Model Context Protocol server that exposes browser control as tools an AI agent can call: browser_navigate, browser_click, browser_snapshot, browser_generate_locator, and more.
The important design decision: it works off the page’s accessibility tree, not screenshots. The agent receives a structured snapshot — roles, accessible names, states, hierarchy — rather than pixels. That is why it can generate getByRole('button', { name: 'Place order' }) instead of a fragile pixel-matched guess. Structured DOM in, stable locators out.
One trade-off to know going in: accessibility snapshots are token-heavy. Microsoft’s own benchmark puts a typical browser task at roughly 114,000 tokens via MCP versus roughly 27,000 tokens via the newer Playwright CLI (@playwright/cli), which writes snapshots to disk and lets the agent read only what it needs — roughly a 4x reduction (Microsoft-reported). MCP still wins for iterative, exploratory generation where persistent browser state and rich introspection matter; CLI wins for high-throughput agents juggling a large codebase. This guide uses MCP because test generation is exactly the exploratory, stateful case it is built for; the conventions and review loop below apply identically if you switch to CLI later.
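To make the token trade-off concrete, here is a small back-of-the-envelope sketch. It uses only the two figures quoted above; the assumption that one generated flow costs roughly one "typical browser task" is mine, so treat the totals as an order-of-magnitude guide rather than a measurement.

```typescript
// Figures quoted in the text (Microsoft-reported, approximate).
const MCP_TOKENS_PER_TASK = 114_000;
const CLI_TOKENS_PER_TASK = 27_000;

// Ratio of MCP to CLI token use for one task (a little over 4).
const ratio = MCP_TOKENS_PER_TASK / CLI_TOKENS_PER_TASK;

// Estimated tokens for generating `flows` tests, ASSUMING one browser
// task per flow. Real sessions vary with page size and retries.
function estimateTokens(flows: number, tokensPerTask: number): number {
  return flows * tokensPerTask;
}

console.log(`MCP/CLI ratio: ${ratio.toFixed(1)}x`);
console.log(`20 flows via MCP: ${estimateTokens(20, MCP_TOKENS_PER_TASK)} tokens`);
console.log(`20 flows via CLI: ${estimateTokens(20, CLI_TOKENS_PER_TASK)} tokens`);
```

If the totals matter for your budget, measure a real session for your own app and substitute that figure for the per-task constants.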
How do you install and wire up the Playwright MCP server?
Claude Code. One command registers it for the project:
claude mcp add playwright npx @playwright/mcp@latest
Launch Claude Code and run /mcp. You should see playwright listed with its tools (browser_navigate, browser_click, browser_snapshot, …). If it is not there, the registration did not land; re-run the command from the repo root.
Cursor. Settings → MCP → Add new MCP Server, command type, value npx @playwright/mcp@latest. Or drop the standard JSON block into your MCP config (this is the portable form that also works in VS Code):
{
"mcpServers": {
"playwright": {
"command": "npx",
"args": ["@playwright/mcp@latest"]
}
}
}
Useful flags on the args array:
| Flag | What it does |
|---|---|
| --headless | Run the browser headless (headed by default; keep it headed while generating so you can watch and intervene) |
| --isolated | Keep the browser profile in memory, never write it to disk |
| --browser <name> | Target a specific engine: chrome, firefox, webkit, or msedge |
| --caps testing | Enable test-assertion helpers for richer generated specs |
| --user-data-dir <path> | Persistent profile: keep a logged-in session across runs |
For generation work, run it headed. You want to watch the agent click through the flow and step in when it hits auth or an unexpected modal.
How do you generate a test against the live DOM?
Start your app locally (or point at a stable staging URL). Then drive the agent through the loop below. The single most important instruction is to say “use playwright mcp” explicitly — otherwise Claude Code may shell out and run Playwright via bash instead of driving the live browser, which defeats the entire point.
- Handle auth yourself, once. The MCP browser is visible. Let the agent navigate to your app, then you complete login (MFA, OAuth, SSO — things the agent cannot and should not automate interactively). Cookies persist for the session.
- Give a specific scenario. Not “write tests for checkout.” Say: “Using Playwright MCP, walk the checkout flow with a saved card, then with a new card, then with an expired card. Generate one spec file per case.” Specificity is the difference between useful tests and cleanup work.
- Let it explore. The agent calls browser_navigate, browser_click, and browser_snapshot, reads the real accessibility tree, and resolves locators that actually exist.
- Have it emit code grounded in observation. The generated spec uses the locators it verified live (getByRole, getByLabel, getByTestId), not invented CSS.
- Codify auth in the generated file. Interactive login does not belong in CI. Tell the agent to write the spec’s auth using Playwright’s storageState pattern (capture once in a setup project, reuse the saved state) so the committed test runs unattended.
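As a rough sketch of the storageState pattern from the last step, a playwright.config.ts could wire a setup project into the main project as shown here. The path playwright/.auth/user.json, the project names, and the tests/auth.setup.ts file mentioned in the comments are assumptions for illustration, not something the agent or Playwright will create for you.

```typescript
// playwright.config.ts (sketch; file names and paths are assumptions)
import { defineConfig, devices } from '@playwright/test';

// Hypothetical location for the saved session. Keep it out of version control.
const authFile = 'playwright/.auth/user.json';

export default defineConfig({
  testDir: './tests',
  projects: [
    // Runs first. A file such as tests/auth.setup.ts (assumed) logs in once
    // and calls page.context().storageState({ path: authFile }).
    { name: 'setup', testMatch: /.*\.setup\.ts/ },
    {
      name: 'chromium',
      use: {
        ...devices['Desktop Chrome'],
        // Every test in this project starts already authenticated.
        storageState: authFile,
      },
      // Guarantees the saved state exists before any spec runs.
      dependencies: ['setup'],
    },
  ],
});
```

With this shape, spec bodies never contain login steps, so the same files run unattended in CI as long as the setup project can authenticate with credentials supplied by the environment.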
For larger suites, Playwright ships first-class Test Agents. Initialise them for your tool:
npx playwright init-agents --loop=claude # or --loop=vscode
This generates three role-specific agents:
- Planner: explores the app from a seed test and writes a human-readable strategy to specs/*.md.
- Generator: turns that plan into tests/*.spec.ts, verifying selectors and assertions live as it goes.
- Healer: replays a failing test, inspects the current UI, patches the locator or timing, and re-runs until the test is green, or reports that the feature is genuinely broken.
Whether you use the raw MCP loop or the Test Agents, seed the run with one example. A tests/seed.spec.ts that already handles login and lands on your base state cuts token spend dramatically — the agent copies your auth pattern instead of rediscovering it every session.
What are the three failure modes and how do you fix them?
Even with live-DOM grounding, AI-generated Playwright tests fail in three predictable ways. Every one has a concrete fix; bake the fixes into your review checklist.
Failure mode 1: brittle selectors
The agent sometimes still anchors on an auto-generated class or a deep DOM path it saw in the snapshot. It resolves today; it breaks the moment a developer renames btn-primary to button-primary in the design system. The app works fine — your CI is red, and the team is debugging selectors instead of shipping.
Fix. Enforce a hard rule: user-facing, semantic locators only. In review, replace anything brittle:
// Brittle — reject in review
await page.locator('.checkout__form > div:nth-child(3) button').click();
// Stable — require this
await page.getByRole('button', { name: 'Place order' }).click();
// or, when no good role/name exists, an explicit test id
await page.getByTestId('place-order').click();
Priority order: getByRole > getByLabel > getByText > getByTestId > (last resort, justified in a comment) a CSS locator. If a flow has no stable hook, the right fix is to add a data-testid to the app, not to accept a fragile selector in the test.
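The priority order can be written down as a small decision function, which is sometimes easier to apply in review than a prose rule. This is only a sketch: the ElementHooks shape and the pickLocator helper are hypothetical names invented here, not part of Playwright or the MCP server, and the function merely builds the locator expression as a string.

```typescript
// Hooks a reviewer might know about an element. All fields are optional.
// This shape is hypothetical, not a Playwright type.
interface ElementHooks {
  role?: string;
  accessibleName?: string;
  label?: string;
  text?: string;
  testId?: string;
  css?: string;
}

// Quote a value as a single-quoted JavaScript string literal.
function q(value: string): string {
  return `'${value.replace(/\\/g, '\\\\').replace(/'/g, "\\'")}'`;
}

// Return the highest-priority locator expression the available hooks allow.
function pickLocator(h: ElementHooks): string {
  if (h.role && h.accessibleName) {
    return `page.getByRole(${q(h.role)}, { name: ${q(h.accessibleName)} })`;
  }
  if (h.label) return `page.getByLabel(${q(h.label)})`;
  if (h.text) return `page.getByText(${q(h.text)})`;
  if (h.testId) return `page.getByTestId(${q(h.testId)})`;
  if (h.css) return `page.locator(${q(h.css)}) // last resort: justify in review`;
  throw new Error('No usable hook: add a data-testid to the app');
}

// A role plus accessible name wins even when a CSS class is also available.
console.log(pickLocator({ role: 'button', accessibleName: 'Place order', css: '.btn-primary' }));
// With no role, label, or text, the explicit test id is used.
console.log(pickLocator({ testId: 'place-order', css: '.btn-primary' }));
```

The throw in the final branch mirrors the rule in the text: when nothing stable exists, the fix belongs in the app, not in the test.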
Failure mode 2: timing and auto-wait mistakes
The classic tell is page.waitForTimeout(3000) in generated code. It is the number-one cause of flaky Playwright tests: 3 seconds is plenty on the agent’s machine and not enough in a loaded CI runner. Agents also forget that Playwright auto-waits on actions (click, fill), but a one-shot read such as textContent() returns whatever the element holds at that moment and is never retried.
Fix. Delete every fixed sleep and wait on a real state with a web-first assertion:
// Flaky — reject
await page.click('text=Place order');
await page.waitForTimeout(3000);
expect(await page.textContent('.status')).toBe('Order confirmed');
// Stable — web-first assertion auto-retries until true or times out
await page.getByRole('button', { name: 'Place order' }).click();
await expect(page.getByText('Order confirmed')).toBeVisible();
// For an async network step, wait on the response, not the clock
const orderResp = page.waitForResponse(r => r.url().includes('/api/orders') && r.ok());
await page.getByRole('button', { name: 'Place order' }).click();
await orderResp;
Review rule: if you cannot explain in one sentence why a wait exists, delete it and wait on a UI or network condition instead.
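To see why a fixed sleep loses to a state-based wait on both speed and reliability, here is a toy timing model. It is a conceptual simulation only, not how Playwright is implemented: the function names and the 100 ms polling interval are assumptions, while the 5000 ms timeout mirrors Playwright's default expect timeout.

```typescript
// Result of one simulated wait strategy.
interface WaitOutcome {
  passed: boolean;
  waitedMs: number;
}

// Fixed sleep: always waits the full duration, and passes only if the
// UI happened to become ready within it.
function fixedSleep(readyAfterMs: number, sleepMs: number): WaitOutcome {
  return { passed: readyAfterMs <= sleepMs, waitedMs: sleepMs };
}

// State-based wait: checks at each interval and stops as soon as the
// condition holds, or gives up at the timeout.
function pollUntil(readyAfterMs: number, timeoutMs: number, intervalMs = 100): WaitOutcome {
  for (let t = 0; t <= timeoutMs; t += intervalMs) {
    if (t >= readyAfterMs) return { passed: true, waitedMs: t };
  }
  return { passed: false, waitedMs: timeoutMs };
}

// Fast laptop: the confirmation appears after 400 ms.
console.log('laptop sleep', fixedSleep(400, 3000));
console.log('laptop poll ', pollUntil(400, 5000));
// Loaded CI runner: the same confirmation takes 4200 ms.
console.log('ci sleep    ', fixedSleep(4200, 3000));
console.log('ci poll     ', pollUntil(4200, 5000));
```

In this model the sleep wastes 2.6 seconds on the laptop and fails outright on the slow runner, while the polling wait finishes as soon as the state is reached in both cases.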
Failure mode 3: leaked state between tests
Agents generate tests that quietly depend on each other: test A creates an account, test B assumes that account exists, the suite passes serially and explodes when Playwright shards it across workers. Auth, seeded data, and storage state all leak.
Fix. Make every test independent by construction:
- Auth goes in a fixture / setup project that writes storageState once; tests load it, they do not log in inline.
- Each test creates and tears down its own data (API call in beforeEach/afterEach, or a fresh fixture); never reuse a record another test made.
- Run the suite with --workers=4 and --fully-parallel locally before you trust it. Order-dependent tests fail loudly under parallelism, which is exactly what you want before CI does it for you.
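The create-and-tear-down rule is easiest to enforce when the teardown cannot be skipped. The sketch here shows one way to structure that, using a hypothetical ApiClient interface and an in-memory fake so it runs without a server. In a real suite you would likely express the same idea as a Playwright fixture built on the request fixture; the names withOwnRecord and makeFakeApi, and the /api/orders path, are invented for illustration.

```typescript
// Minimal shape of an API client. Hypothetical: it does not import Playwright.
interface ApiClient {
  post(url: string, body: unknown): Promise<{ id: string }>;
  delete(url: string): Promise<void>;
}

// Create a record, hand it to the test body, and always delete it afterwards,
// even when the body throws. No other test ever sees this record.
async function withOwnRecord<T>(
  api: ApiClient,
  collectionUrl: string,
  body: unknown,
  testBody: (record: { id: string }) => Promise<T>,
): Promise<T> {
  const record = await api.post(collectionUrl, body);
  try {
    return await testBody(record);
  } finally {
    await api.delete(`${collectionUrl}/${record.id}`);
  }
}

// In-memory fake used to demonstrate the helper without a backend.
function makeFakeApi(): { api: ApiClient; live: string[] } {
  const live: string[] = [];
  let nextId = 1;
  const api: ApiClient = {
    async post(url) {
      const id = String(nextId++);
      live.push(`${url}/${id}`);
      return { id };
    },
    async delete(url) {
      const i = live.indexOf(url);
      if (i >= 0) live.splice(i, 1);
    },
  };
  return { api, live };
}
```

Because the delete sits in a finally block, a failing assertion inside the test body still cleans up, which is what keeps a red test from poisoning the next one under parallel workers.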
Companion guide
For where AI-generated E2E fits into a broader strategy — unit, integration, contract, and end-to-end layers — see our software testing complete guide for 2026.
What conventions file should the agent read?
The single highest-leverage thing you can do is give the agent a conventions file it reads at the start of every session. Without it you re-explain the rules in every prompt and the agent forgets them on long runs. Keep a short, repo-checked-in doc — call it tests/app.context.md — and reference it from your CLAUDE.md (Claude Code), .cursorrules (Cursor), or AGENTS.md so it is loaded once per session.
Copy-paste starting point:
# tests/app.context.md — Playwright test conventions
## How tests are generated
- Always use the Playwright MCP server against the running app
(http://localhost:3000). Generate locators from the LIVE DOM.
Never invent selectors from memory.
- Follow tests/seed.spec.ts for auth and base navigation.
## Locators (hard rule)
- Allowed, in priority order: getByRole, getByLabel, getByText,
getByTestId.
- CSS / XPath selectors are forbidden unless there is no stable
hook AND you add a one-line comment justifying it.
- No stable hook? Add a data-testid to the app component, then
use getByTestId. Do not ship a brittle selector.
## Waiting (hard rule)
- No page.waitForTimeout(). Ever.
- Wait on state: web-first assertions (expect(locator).toBeVisible()),
page.waitForResponse() for network, page.waitForURL() for nav.
- Rely on Playwright auto-wait for actions; never sleep before a click.
## Isolation (hard rule)
- Auth via storageState only — captured in a setup project,
loaded by tests. No inline login in spec bodies.
- Each test creates and tears down its own data in beforeEach/afterEach.
- No test may depend on data or state created by another test.
- The suite MUST pass with --fully-parallel --workers=4.
## Structure
- One spec file per user flow under tests/.
- Group with test.describe; one assertion target per test where practical.
- Every test asserts an OUTCOME (text/state/URL), not just that an
element is visible.
## File layout
- tests/ spec files
- tests/fixtures/ auth + data fixtures
- specs/ human-readable test plans (Planner output)
The rules above are not Playwright-specific dogma — they are the failure modes from the previous section, written as constraints the agent must satisfy before it is allowed to emit code. That is the whole trick: encode the review checklist as input, not just output.
What does the review loop look like in practice?
Generation is not one-shot. Budget 30–60 minutes per well-tested flow and run this loop:
- Generate against the live DOM with a specific scenario.
- Read the diff, not the green check. The agent did not run this in CI — you cannot trust “it passed.”
- Harden selectors. Replace any non-semantic locator per the conventions file.
- Strengthen assertions. Agents default to toBeVisible(). Push every test to assert a real outcome: confirmed text, a URL change, a row count, a network 200.
- Kill every fixed wait. Replace with a state-based wait.
- Run it parallel and isolated locally: npx playwright test --fully-parallel --workers=4. Fix anything that only passes serially.
- Re-run, then commit.
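Parts of this loop can be pre-screened mechanically before a human reads the diff. The sketch here is a deliberately naive line scanner, not a parser or an official Playwright lint rule: the rule names and regular expressions are assumptions, and it flags every page.locator() call for a reviewer to judge rather than deciding whether a CSS selector is justified.

```typescript
// One finding per offending line in a generated spec.
interface Finding {
  line: number;
  rule: string;
  text: string;
}

// Simple heuristics that mirror the review checklist. Names are illustrative.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: 'no-fixed-wait', pattern: /\.waitForTimeout\s*\(/ },
  { rule: 'no-raw-locator', pattern: /\.locator\s*\(/ },
  { rule: 'no-selector-string-action', pattern: /page\.(click|fill|textContent)\s*\(/ },
];

// Scan spec source text and report lines a reviewer should look at.
function reviewSpec(source: string): Finding[] {
  const findings: Finding[] = [];
  const lines = source.split('\n');
  for (let i = 0; i < lines.length; i++) {
    const text = lines[i].trim();
    if (text.indexOf('//') === 0) continue; // skip comment-only lines
    for (const { rule, pattern } of RULES) {
      if (pattern.test(text)) findings.push({ line: i + 1, rule, text });
    }
  }
  return findings;
}

// Example input: two lines that should be flagged, two that should not.
const sampleSpec = [
  "await page.locator('.checkout__form > div:nth-child(3) button').click();",
  'await page.waitForTimeout(3000);',
  "await page.getByRole('button', { name: 'Place order' }).click();",
  "await expect(page.getByText('Order confirmed')).toBeVisible();",
].join('\n');

for (const f of reviewSpec(sampleSpec)) {
  console.log(`line ${f.line}: ${f.rule}`);
}
```

A scanner like this only narrows where to look. Weak assertions and leaked state still need a human reading the diff and a parallel run.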
Cursor and Claude Code behave the same here — the MCP server is the shared substrate. Use whichever you already drive your codebase with; if you want the editor-native side-by-side experience for the review/harden steps, our Cursor IDE complete guide for 2026 covers that workflow in depth.
How do you keep generated tests stable in CI?
A test that passes once on a laptop is not a passing test. To make generated specs survive CI:
- Run them in CI before you trust them. The MCP loop is interactive and headed; CI is headless and parallel. Generate, push to a branch, let CI run it 3–5 times. Treat the CI result as truth, the local result as a draft.
- Turn on retries with traces, then read the traces. In playwright.config.ts set retries: 2 and, under use, trace: 'on-first-retry'. A test that only passes on retry is not stable; it is a flaky test you have not diagnosed yet. Open the trace, find the real root cause (a missed wait, a leaked state), fix it, and do not let retries paper over it permanently.
- Quarantine, do not delete. A newly generated test that flakes goes into a tagged quarantine project (still runs, does not block the merge) until it is hardened, not commented out and forgotten.
- Reproduce CI conditions locally. Run with --workers=4 headless and, where relevant, CDP network throttling, so the timing profile matches the runner. Most “works on my machine” flake is a hidden fixed-time assumption that only shows under load.
- Pin the MCP/CLI version. @playwright/mcp@latest is convenient for setup, but pin an exact version in CI so an upstream change to snapshot format does not silently shift generated locators.
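The retry, trace, and quarantine points can be combined in one config. What follows is a sketch under assumptions: the @quarantine tag and the project names stable and quarantine are invented here, and in a real repo these projects would be merged with whatever auth or browser projects you already define.

```typescript
// playwright.config.ts (sketch; the @quarantine tag and project names are assumptions)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  fullyParallel: true,
  // Use the same worker count locally and in CI so timing is comparable.
  workers: process.env.CI ? 4 : undefined,
  // Retries exist to capture a trace on failure, not to hide a flake.
  retries: process.env.CI ? 2 : 0,
  use: {
    headless: true,
    // Record a trace only when a test is retried for the first time.
    trace: 'on-first-retry',
  },
  projects: [
    // Blocking project: everything except quarantined tests.
    { name: 'stable', grepInvert: /@quarantine/ },
    // Non-blocking project: run it in a separate CI job that is allowed to fail.
    { name: 'quarantine', grep: /@quarantine/ },
  ],
});
```

Under this layout a flaky test is tagged @quarantine (in its title or through the tag option), the merge gate runs npx playwright test --project=stable, and a second job runs --project=quarantine so the test keeps producing traces until it is hardened.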
The honest framing: the agent gets you a correct-shaped test in seconds; CI is what tells you whether it is a good one. The MCP workflow removes the tedious 80% — selector discovery and flow-walking — so your senior time goes to the 20% that actually determines reliability.
Who should build and own this workflow?
Standing up AI-assisted E2E generation — MCP wiring, a disciplined conventions file, a CI harness that quarantines flake instead of hiding it — is a few days of work for an engineer who has done it before and weeks of trial and error for one who hasn’t. If you’re hiring vetted remote developers experienced with Playwright, Claude Code, and CI test infrastructure, Codersera matches you with engineers who have shipped exactly this in production, with a risk-free trial so you can validate technical fit before you commit.
FAQ
Do I need the Playwright MCP server, or can I just prompt the agent?
You need it. Without a live browser the agent has nothing to ground on — it invents selectors and waits from training data, and the test breaks the first time CI runs it against your real DOM. The MCP server (or the newer Playwright CLI) is what makes the generated locators real. Plain prompting produces tutorial-shaped code, not tests that survive.
Should I use the Playwright MCP server or the Playwright CLI?
Use MCP for interactive test generation — it is the stateful, exploratory case it was built for. Microsoft’s benchmark shows the CLI (@playwright/cli) uses roughly 4x fewer tokens by writing snapshots to disk instead of streaming the accessibility tree into context, which makes CLI better for high-throughput agents juggling a large codebase. The conventions and review loop in this guide apply identically to both.
Is the Claude Code and Cursor workflow different?
No. The Playwright MCP server is the shared substrate; Claude Code and Cursor are just different front ends that call the same browser tools. Setup differs slightly (a claude mcp add command versus Cursor’s MCP settings panel or the JSON config), but the generation loop, failure modes, conventions file, and review steps are identical. Use whichever you already drive your codebase with.
How do I handle login and MFA when generating tests?
Log in yourself during generation — the MCP browser is visible, so complete MFA/OAuth/SSO manually, then tell the agent to continue; cookies persist for the session. For the committed test, never put interactive login in the spec body. Have the agent write auth using Playwright’s storageState pattern: capture the authenticated state once in a setup project and have every test load it.
Why do AI-generated Playwright tests still flake?
Three reasons, in order: brittle selectors that resolve today but break on a UI refactor, fixed waitForTimeout() calls that are fine locally and too short in loaded CI, and leaked state between tests that only surfaces under parallel sharding. All three are fixable in review — semantic locators, state-based waits, and per-test isolation via fixtures — and the fixes belong in a conventions file the agent reads before it generates.
Can I trust a generated test that passes on the first run?
No. The MCP loop is interactive and headed; the agent often has not run the spec under CI conditions at all. Treat a first local pass as a draft. Push it to a branch, let CI run it headless and parallel 3–5 times with retries: 2 and trace: 'on-first-retry', and read the traces. A test that only passes on retry is an undiagnosed flake, not a passing test.
What goes in the conventions file the agent reads?
Encode your review checklist as input. A short tests/app.context.md (referenced from CLAUDE.md, .cursorrules, or AGENTS.md) should pin: locators allowed in priority order with CSS forbidden, no waitForTimeout() ever, auth via storageState only, per-test data setup and teardown, and a hard requirement that the suite passes with --fully-parallel --workers=4. The agent that reads the rules before generating produces far less cleanup work.