Quick answer. Subquadratic, a Miami-based startup that came out of stealth on May 5, 2026 with $29M in seed funding, launched SubQ — the first frontier LLM with a 12-million-token context window. It uses a proprietary architecture called Subquadratic Selective Attention (SSA) that scales linearly with context length instead of quadratically, claiming 52x faster attention at 1M tokens and 92.1% needle-in-a-haystack recall at 12M. Independent verification is still pending.
For three years the long-context conversation has been a slow ladder: 128K, 200K, 1M, 2M. On May 5, 2026, a small Miami startup called Subquadratic jumped six rungs in one move with SubQ, a model advertising a 12-million-token context window — roughly six times Gemini 3.1 Ultra's 2M ceiling and forty-eight times what most production OpenAI and Anthropic models will accept today.
The claim is bold enough that researchers are openly demanding independent benchmarks. Below is what's verifiable from Subquadratic's launch materials, what looks plausible based on the published architecture sketch, and what developers should actually do with a 12M-token window today.
What is SubQ and why does 12M tokens matter?
SubQ is the first model from Subquadratic, a Miami-based AI infrastructure startup founded by CEO Justin Dangel and CTO Alexander Whedon (formerly Head of Generative AI at TribeAI; previously a software engineer at Meta). The team includes roughly eleven PhD researchers with prior stints at Meta, Google, Oxford, Cambridge, ByteDance, and Adobe. The company raised $29M in seed funding at a reported $500M valuation, with investors including Tinder co-founder Justin Mateen, former SoftBank Vision Fund partner Javier Villamizar, and early backers of Anthropic, OpenAI, Stripe, and Brex.
The headline number — 12 million tokens — is roughly:
- The entire Linux kernel source tree (~10M tokens), with room for your patch notes
- War and Peace, Anna Karenina, and the complete works of Dostoevsky together
- About 9,000 pages of legal discovery
- A medium-sized SaaS monorepo plus every PR description from the last two years
If — and this is the load-bearing if — the model can actually use all 12M tokens with meaningful recall, it collapses a category of problems we've been solving with retrieval-augmented generation (RAG), chunking, and summarization. Those workflows exist because dense attention's compute cost scales quadratically with sequence length: doubling the context quadruples the work. At 12M tokens, vanilla attention would require something like 144 trillion pairwise comparisons per layer. Nobody can afford that.
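The quadratic arithmetic above is easy to check. This is a minimal sketch that counts the full n × n query-key grid per layer; causal masking roughly halves the count but does not change the quadratic growth. The function name is illustrative.

```python
def dense_pairs(n_tokens: int) -> int:
    """Query-key comparisons per layer when every token attends to every token."""
    return n_tokens * n_tokens


for n in (128_000, 1_000_000, 12_000_000):
    print(f"{n:>12,} tokens -> {dense_pairs(n):.3e} comparisons per layer")

# Doubling the context quadruples the work.
assert dense_pairs(2 * 1_000_000) == 4 * dense_pairs(1_000_000)
```

At 12,000,000 tokens the count is 1.44 × 10^14, which is the "144 trillion" figure quoted above.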
How does Subquadratic's architecture actually work?
SubQ's secret sauce is an attention mechanism the company calls Subquadratic Selective Attention (SSA). Dense attention compares every token with every other token; SSA, per the company's published description, takes a content-dependent selective approach:
- For each query token, the model picks a small subset of positions in the sequence that actually matter for that token.
- It then computes exact attention only over that subset.
- The selection is learned — content-dependent, not a fixed positional pattern like Longformer's sliding window or BigBird's global tokens.
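The three steps above can be sketched in a few lines of plain Python. This is not Subquadratic's implementation, which is unpublished; it is a generic top-k selection, and it assumes a dot-product score as a stand-in for the learned selector. Note that scoring every key, as done here, is itself linear per query and therefore quadratic overall, so a real subquadratic design needs a cheaper selection step than this one.

```python
import math


def selective_attention(query, keys, values, k):
    """Attend from one query to only the k highest-scoring positions.

    Illustrative only: shows the "select, then attend exactly" shape,
    not how SSA actually selects positions.
    """
    dim = len(query)
    # Step 1: score each position (stand-in for a learned, content-dependent selector).
    scores = [sum(q * x for q, x in zip(query, key)) / math.sqrt(dim) for key in keys]
    # Step 2: keep the k best-scoring positions.
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Step 3: exact softmax attention restricted to the selected subset.
    top = max(scores[i] for i in selected)
    weights = [math.exp(scores[i] - top) for i in selected]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, sorted(selected)


# Toy input: only the last position matches the query.
keys = [[0.0, 0.0]] * 7 + [[5.0, 5.0]]
values = [[float(i)] for i in range(8)]
out, picked = selective_attention([1.0, 1.0], keys, values, k=2)
print(picked, out)
```

In the toy input the matching position dominates the output even though only two of the eight positions are attended to.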
SSA belongs to a different lineage from the other subquadratic families. Mamba and RWKV replace attention entirely with a recurrent state that evolves token by token. That gives true O(n) scaling, but fixed-capacity recurrence means information gets compressed and old context can blur or get overwritten. Fixed-pattern sparse attention (Longformer, BigBird) scales linearly but loses recall on needle-in-a-haystack tasks because the pattern doesn't know where the needle is. SSA tries to combine the strengths of these approaches: linear scaling like Mamba, exact-attention precision over the selected subset like a dense transformer, and content-aware routing so the model decides where to look.
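To make the fixed-pattern failure mode concrete, here is a minimal sketch of which positions a single causal sliding-window layer can see. The window size and positions are illustrative assumptions, and the sketch ignores global tokens and the way stacked layers widen the effective receptive field.

```python
def sliding_window_positions(query_pos: int, window: int) -> range:
    """Positions a causal sliding-window layer lets `query_pos` attend to."""
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


needle_pos = 10          # where the relevant fact sits
query_pos = 1_000_000    # where the question is asked
window = 4_096           # hypothetical window size

visible = sliding_window_positions(query_pos, window)
print(needle_pos in visible)  # False: the needle is outside the fixed window
```

A content-dependent selector has no such hard boundary, which is the property SSA is claimed to add.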
The weights and training recipe aren't open. The arXiv preprint that surfaced post-launch describes the SSA mechanism at a high level but stops short of releasing model checkpoints or a full reproducibility kit. Until the weights or a third-party reproduction land, treat SSA as a credible architectural sketch rather than a settled result.
How does SubQ compare to other frontier long-context models?
Here's the snapshot as of late May 2026:
| Model | Max context | Architecture | Published long-context benchmark |
|---|---|---|---|
| SubQ (Subquadratic) | 12,000,000 tokens | Subquadratic Selective Attention (SSA) | 92.1% needle-in-a-haystack @ 12M; 97.1% RULER @ 128K (vs Opus 4.6's 94.8%); 83 MRCR v2 |
| Gemini 3.1 Ultra | 2,000,000 tokens | Dense transformer w/ MoE | 23 MRCR v2 |
| Claude Opus 4.7 | 1,000,000 tokens (1M tier) | Dense transformer | 78 MRCR v2; 87.6% SWE-bench |
| GPT-5.5 | 400,000 tokens | Dense transformer w/ MoE | 39 MRCR v2 |
On MRCR v2 (multi-needle retrieval), Subquadratic's reported numbers put SubQ ahead of every frontier dense model: 83 vs 78 for Opus 4.7, 39 for GPT-5.5, and 23 for Gemini 3.1 Ultra. On RULER at 128K (a long-context retrieval benchmark), SubQ reports 97.1% accuracy (vs Claude Opus 4.6's 94.8%, per Subquadratic's own technical brief). The 92.1% needle-in-a-haystack figure at 12M tokens is the eye-catching one; for comparison, frontier models that quote a 1M or 2M window often see recall fall off well before the advertised ceiling.
The catch: all of these numbers come from Subquadratic's own technical brief. No independent lab has reproduced them yet. The company has been transparent that the model is in private beta and the API is gated, which is the right call before claims of this magnitude get pressure-tested in the wild. Researchers on X have already asked for held-out evals.
What does 12 million tokens actually enable?
Assume for a moment the recall numbers hold. Here's what changes:
Whole-codebase reasoning without retrieval
Most "AI for codebases" today is a retrieval problem dressed up as a reasoning problem. You embed each file, do similarity search against the user query, stuff the top-k chunks into a 200K window, and hope the model spots the cross-file refactor. With 12M tokens you can load the entire repo plus the test suite plus the last 50 commit messages into a single prompt and ask the model to reason across all of it. Subquadratic ships this as SubQ Code, a CLI agent built on the same model.
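Before loading a whole repository into one prompt, it helps to estimate whether it fits. The sketch below assumes roughly four characters per token, which is a common rule of thumb and not SubQ's tokenizer; the extension list, skipped directories, and output reserve are also illustrative choices.

```python
import os

CHARS_PER_TOKEN = 4          # rough heuristic, not SubQ's tokenizer
CONTEXT_BUDGET = 12_000_000  # advertised window from the launch materials


def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def estimate_repo_tokens(root: str, extensions=(".py", ".ts", ".go", ".md")) -> int:
    """Sum token estimates over source files under `root`."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that should never go into a prompt.
        dirnames[:] = [d for d in dirnames if d not in {".git", "node_modules"}]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    total += estimate_tokens(fh.read())
    return total


def fits_in_window(n_tokens: int, reserve_for_output: int = 100_000) -> bool:
    """True if the estimate plus an output reserve stays within the budget."""
    return n_tokens + reserve_for_output <= CONTEXT_BUDGET
```

Calling `fits_in_window(estimate_repo_tokens("."))` from a repository root gives a first-pass answer; a real tokenizer count would be needed before relying on it.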
Long-form legal research and discovery
A 9,000-page deposition set fits comfortably. So does "every contract this company has signed since 2020" for due diligence work. The MRCR v2 score (multi-needle retrieval) matters more than raw context size here — you want the model to find the three indemnity clauses scattered across the corpus, not just summarize the first 200 pages.
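For a task like the indemnity-clause search, a simple recall score over the expected findings is a reasonable starting point. This is a minimal sketch using verbatim substring matching; it is not how MRCR v2 is scored, and the clause locations are hypothetical.

```python
def multi_needle_recall(answer: str, needles: list[str]) -> float:
    """Fraction of expected needles that appear verbatim in the model's answer."""
    if not needles:
        raise ValueError("need at least one needle")
    text = answer.lower()
    found = sum(1 for needle in needles if needle.lower() in text)
    return found / len(needles)


expected = ["Section 4.2", "Section 11.7", "Exhibit C"]  # hypothetical clause locations
answer = "Indemnity language appears in Section 4.2 and Exhibit C."
print(multi_needle_recall(answer, expected))
```

Here the answer finds two of the three clauses. Substring matching is brittle against paraphrase, so real evaluations usually need normalization or a grader on top.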
Full medical history or research corpus in one prompt
The full longitudinal record for a single patient — visit notes, imaging reports, lab results, prior authorizations — rarely exceeds a few million tokens. Same goes for a literature review across an entire subfield. Both are use cases where the chunking-and-RAG workflow loses information because the model never sees the corpus as a coherent whole.
Long-running agent state
An agent that keeps weeks of conversation, tool outputs, and intermediate scratchpads in its prompt rather than swapping to vector storage. Subquadratic's SubQ Search deep-research tool is the company's first showcase of this pattern.
For deeper background on the broader long-context and open-source-LLM landscape these models sit in, see Codersera's open-source LLMs landscape 2026 guide, and if you're considering running comparable models yourself, the self-hosting LLMs complete guide covers the infrastructure tradeoffs.
Does SubQ actually use the full 12M, or is it the classic long-context illusion?
This is the question that matters more than the headline number, and it deserves a careful answer.
Most "long-context" models since 2023 advertise a window much larger than they can actually exploit. The "lost in the middle" phenomenon, where models retrieve information placed in the middle of a long prompt less reliably than information near the start or end, is well documented. Gemini 1.5's original 1M context was a marketing milestone; in practice many production teams cap usage at 200-300K because recall degrades beyond that.
Subquadratic's published numbers (92.1% needle-in-a-haystack at 12M, 83 on MRCR v2) are aggressive enough that, if reproducible, they'd make SubQ the first frontier system to use a multi-million-token window with frontier-quality recall. Three honest caveats:
- Needle-in-a-haystack is the easy benchmark. It tests whether the model can find a single inserted sentence. Real-world tasks need reasoning across dispersed information, which MRCR v2 starts to capture but still understates.
- No external reproduction yet. The benchmarks come from Subquadratic's own report. The arXiv companion paper describes SSA but doesn't ship weights.
- Frontier-scale subquadratic models have historically regressed elsewhere. Mamba, RWKV, Hyena, RetNet, and Kimi Linear all showed linear scaling at small sizes but underperformed dense attention on downstream benchmarks at frontier scale, or ended up in hybrid configurations. SubQ may have cracked that, or the same regressions may only surface once independent groups put it under load.
The right posture for developers right now: treat SubQ as a serious-looking architectural claim worth testing on your own data, not a settled win.
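Testing on your own data can start with a small needle-in-a-haystack harness that sweeps the needle's depth in the prompt. The sketch below assumes a hypothetical `ask_model` callable that takes a prompt and returns text; no SubQ API shape is implied. A stand-in "model" that only reads the tail of its prompt is included so the harness runs as written.

```python
from typing import Callable


def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    sentences = [filler] * n_sentences
    index = round(depth * n_sentences)
    sentences.insert(index, needle)
    return " ".join(sentences)


def needle_recall(ask_model: Callable[[str], str], depths, n_sentences: int = 1000) -> dict:
    """Map each depth to whether the model's answer contains the needle value."""
    needle = "The vault code is 7421."
    question = "\n\nWhat is the vault code? Answer with the number only."
    results = {}
    for depth in depths:
        haystack = build_haystack("The sky was grey that day.", needle, n_sentences, depth)
        results[depth] = "7421" in ask_model(haystack + question)
    return results


# Stand-in "model" that only reads the last 2,000 characters of its prompt.
def forgetful_model(prompt: str) -> str:
    return "7421" if "7421" in prompt[-2000:] else "unknown"


print(needle_recall(forgetful_model, depths=[0.0, 0.5, 1.0]))
```

The stand-in only recovers the needle at depth 1.0, which is the kind of position-dependent failure a depth sweep is meant to expose. Replacing the filler with real documents and the single needle with several dispersed facts moves this closer to the tasks that actually matter.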
What does SubQ cost and how do I get access?
Subquadratic is launching three products in private beta:
- SubQ API — direct access to the model with the full 12M-token window.
- SubQ Code — a CLI coding agent built on the same model (the natural killer app: load an entire repo).
- SubQ Search — a deep-research tool that exploits the long window for multi-document synthesis.
Pricing is the other headline. The company claims running the RULER 128K eval cost approximately $8 on SubQ versus roughly $2,600 on Claude Opus at the same context length. That comparison underpins the 1,000x efficiency claim, although the two quoted figures on their own work out to roughly 325x, and it's the number researchers most want validated. If even half that gap survives independent benchmarking, the economics of long-context work change materially. Public per-token API pricing wasn't published at launch; access is currently waitlist-based via subq.ai.
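The ratio implied by the two reported dollar figures can be checked directly. The sketch uses only the numbers quoted above; the variable names are illustrative.

```python
subq_cost_usd = 8.0        # reported cost of the RULER 128K run on SubQ
opus_cost_usd = 2_600.0    # reported cost of the same run on Claude Opus

ratio = opus_cost_usd / subq_cost_usd
print(f"Implied cost ratio: {ratio:.0f}x")  # 325x

# What the SubQ run would have to cost for a full 1,000x gap at the same Opus price.
print(f"Cost needed for 1,000x: ${opus_cost_usd / 1_000:.2f}")  # $2.60
```

The gap between 325x and 1,000x may come from rounding or from a different comparison than the one quoted, which is one more reason to wait for independent numbers.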
What are the real risks and tradeoffs?
- Latency at extreme contexts. Even with linear scaling, 12M tokens is a lot of bytes to ship and process. Time-to-first-token will not be Claude Haiku-fast on a million-token prompt. Subquadratic claims a 52x speedup over dense attention at 1M, but that's relative, not absolute.
- Memory cost. Linear compute does not necessarily mean linear memory. KV-cache footprint at multi-million tokens is still meaningful, and serving costs may push the company toward aggressive batching or quantization.
- Quality at the long tail. Multi-needle retrieval and synthesis tasks beyond what MRCR v2 measures (e.g., "find every dependency on this deprecated function across a 10M-token monorepo and propose a migration") are the real test. Run your own evals on real data before betting a production workflow on the 12M number.
- Single-vendor risk. SubQ is closed weights from a 12-person seed-stage startup. The right way to use it today is for tasks where the gain is large enough to justify the lock-in — and to keep your prompts portable enough to fall back to Opus or Gemini if the company hits turbulence.
- Reproducibility gap. No independent lab has confirmed the headline benchmarks, so treat the marketing numbers as a hypothesis. The third-party MRCR v2 runs reported so far came in at 65.9, well below Subquadratic's 83, which is a strong skeptic signal.
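The memory-cost point in the list above can be made concrete with the standard KV-cache formula for a transformer: one key vector and one value vector per token, per layer, per KV head. The model dimensions below are hypothetical, since SubQ's configuration is not public, and it is not known whether SSA keeps a full KV cache at all.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard transformer KV cache in bytes."""
    # Factor of 2 covers the key and the value stored for every token.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical dimensions for illustration; SubQ's real configuration is not public.
config = dict(n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_value=2)

for n in (128_000, 1_000_000, 12_000_000):
    gib = kv_cache_bytes(n, **config) / 2**30
    print(f"{n:>12,} tokens -> {gib:,.0f} GiB")
```

Under these assumed dimensions each token costs 256 KiB of cache, so a 12M-token prompt would need about 2.9 TiB for a single sequence. The growth is linear, but the constant is large enough to dominate serving cost.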
FAQs
Is SubQ open source?
No. SubQ's weights are closed, and access is via Subquadratic's private-beta API. The company has released a technical brief describing the SSA architecture, but no checkpoint, training code, or full reproducibility kit.
How is SSA different from Mamba or RWKV?
Mamba and RWKV replace attention entirely with a recurrent state that evolves token by token. SSA keeps attention but makes it content-dependent and sparse: for each token, the model selects a small subset of positions to attend to, then computes exact attention over that subset. This is meant to combine Mamba's linear scaling with the precise long-range recall of dense attention.
Is 12 million tokens actually useful, or just a marketing number?
It's useful if the model can actually reason across the full window. Subquadratic's reported 92.1% needle-in-a-haystack at 12M and 83 on MRCR v2 suggest it can — but those numbers are self-reported and haven't been independently reproduced.
How much does SubQ cost?
Per-token pricing wasn't published at launch. The company's headline efficiency claim is ~$8 to run the RULER 128K eval on SubQ vs ~$2,600 on Claude Opus at the same context. Those two figures imply a gap of about 325x; the 1,000x efficiency figure attached to the claim is still awaiting independent verification.
When will I be able to use it?
SubQ is in private beta as of May 2026 via subq.ai. Subquadratic has signaled a 50-million-token (some sources cite 100M) target context window by Q4 2026 and broader API availability over the next quarters.
Does this make RAG obsolete?
Not yet. Even if SubQ delivers on its claims, RAG still wins on cost for narrow lookups and on freshness for data that changes faster than you can re-feed the model. A 12M-token window changes the calculation for tasks where the full corpus needs to be reasoned over as a unit — but it doesn't eliminate the case for retrieval-augmented patterns.
What should I actually do with this information?
If you have a real long-context problem — whole-codebase refactoring, legal discovery, multi-document research synthesis — get on the SubQ beta waitlist and run your own evals. Don't rip out your existing retrieval pipeline based on a single launch announcement. The architecture is interesting enough that even if SubQ's specific numbers don't hold up, the SSA approach is likely to influence the next wave of long-context models from larger labs.