SubQ Explained: The First 12M-Token Subquadratic LLM (2026)

SubQ claims to be the first fully subquadratic LLM with a 12M-token context window. Here's what's verified, what isn't, and why the architecture matters.

Published 18 May 2026 • Updated 18 May 2026 • 9 min read

Quick answer. SubQ is a large language model from Miami startup Subquadratic, launched May 5, 2026, claiming to be the first LLM on a fully subquadratic (linear-scaling) attention architecture with a 12-million-token context window. The architecture and launch are real and widely reported, but the headline efficiency claims are vendor-run, single-shot, and not yet independently reproduced.

On May 5, 2026, a Miami startup called Subquadratic came out of stealth claiming it had built the first large language model to escape the quadratic-attention constraint that has shaped every major transformer since 2017. Its model, SubQ, claims a native 12-million-token context window on an architecture where compute grows roughly linearly with context length instead of quadratically. The company raised a $29M seed round and ships the model behind a private-beta API.

That is a genuinely large claim, and the AI engineering community split immediately between "biggest thing since the transformer" and "AI Theranos." This guide separates what is verified (the launch, the architecture concept, the funding, the benchmark methodology) from what is not (the specific cost multiples, the 12M-token retrieval quality, real-world performance), so you can decide how much weight to put on it before it touches your stack.

What is SubQ and who built it?

SubQ is the flagship model of Subquadratic, a Miami-based AI infrastructure startup founded by CEO Justin Dangel and CTO Alexander Whedon (previously at Meta and Head of Generative AI at TribeAI). The launch was covered by The New Stack, SiliconANGLE, eWeek, VentureBeat, DataCamp, and discussed heavily on Hacker News, so the company and the launch event are not in doubt.

The verified facts:

Launch date: May 5, 2026, out of stealth.
Funding: $29M seed round, reported at a roughly $500M valuation. Investors include Tinder co-founder Justin Mateen, former SoftBank Vision Fund partner Javier Villamizar, the JAM Fund, and angels who were early in Anthropic, OpenAI, Stripe, and Brex.
Products: a developer/enterprise API exposing the full context window, SubQ Code (a CLI coding agent), and SubQ Search (a long-context search tool).
Availability: private beta, waitlist only. No public pricing. No open weights and no full technical report or peer-reviewed paper at launch.

That last point is the single most important piece of context for everything that follows: every performance number associated with SubQ today is vendor-reported, run under conditions Subquadratic controlled, and not independently reproduced.

Why does a 12M-token subquadratic context matter?

Context length has been the most expensive axis in LLM scaling because of how attention works. In a standard transformer, every token attends to every other token, so doubling the input roughly quadruples the attention compute. As Subquadratic's CTO framed it: with quadratic scaling, double the input and you need 4x the compute; with linear scaling, you need just 2x.
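To make the scaling difference concrete, here is a minimal Python sketch under a deliberately simple cost model in which attention compute is proportional to n raised to an exponent (2 for dense attention, 1 for an idealized linear-scaling design). The function name and the cost model are illustrative assumptions, not anything Subquadratic has published, and real systems have constant factors this ignores.

```python
def attention_cost_multiplier(growth_factor: float, exponent: float) -> float:
    """How much attention compute grows when the input grows by `growth_factor`,
    assuming compute is proportional to n ** exponent (a simplified cost model)."""
    return growth_factor ** exponent


QUADRATIC = 2.0  # dense attention: every token scores every other token
LINEAR = 1.0     # idealized linear-scaling attention

for growth in (2, 10, 100):
    quad = attention_cost_multiplier(growth, QUADRATIC)
    lin = attention_cost_multiplier(growth, LINEAR)
    print(f"{growth}x input -> {quad:.0f}x compute (quadratic), {lin:.0f}x (linear)")
# 2x input -> 4x compute (quadratic), 2x (linear)
# 10x input -> 100x compute (quadratic), 10x (linear)
# 100x input -> 10000x compute (quadratic), 100x (linear)
```

The gap between the two columns widens as the input grows, which is why any cost advantage from linear scaling depends so heavily on the context length being compared.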

This is why almost every frontier model in 2026 caps practical context somewhere between 128K and 1M tokens, and why quality often degrades well before the advertised ceiling. A genuinely subquadratic model changes the economics of a few concrete workloads:

Workload	Why a huge linear-scaling context helps
Whole-repository reasoning	Load an entire mid-size codebase (millions of tokens) in one pass instead of chunking and re-retrieving.
Long-document analysis	Entire legal corpora, full books, multi-year log archives in a single prompt.
Long-horizon agents	Keep full trajectory history in context rather than summarizing and losing detail.
Cost-sensitive RAG	If long context is cheap enough, some retrieval pipelines collapse into a single long prompt.

The key qualifier: these gains materialize only if both the linear scaling and the retrieval quality at 12M tokens hold up. Neither has been independently demonstrated yet.

How does subquadratic attention differ from a standard transformer?

SubQ is built on what the company calls SSA — Subquadratic Sparse Attention. The accessible version of the idea:

In a standard transformer, attention is dense: for each token, the model computes a relevance score against every other token in the context. With n tokens that is roughly n² comparisons. Most of those comparisons turn out to be near-zero — the model spends enormous compute confirming that token 5 is irrelevant to token 4,000,000.

SSA's bet is that you can skip most of that wasted work. According to Subquadratic's own description, SSA:

Routes attention by content, not position. For each query, the model selects which positions actually matter and computes attention only over that subset, regardless of where in the sequence they sit.
Scales with selected positions, not total length. The attention cost for each query grows with the number of tokens the model chooses to look at (a small k), not with the full sequence length n, so total cost is on the order of n × k rather than n². That is the source of the "subquadratic" / near-linear claim.
Preserves sparse retrieval from arbitrary positions. Unlike recurrent or compression-based long-context tricks, the company says SSA can still recover a specific fact introduced millions of tokens earlier rather than losing it to a lossy summary.
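The general idea of content-based sparse attention can be sketched in a few lines of plain Python. This is a generic top-k illustration, not Subquadratic's SSA, whose internals are not public. All function names here are hypothetical, and the toy "router" below still scores every key to pick its top k, so it does not achieve subquadratic cost on its own; a real system would need a cheaper selection mechanism.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Softmax attention for one query, restricted to the given key positions."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, keys[p]) / scale for p in positions])
    dim = len(values[0])
    return [sum(w * values[p][i] for w, p in zip(weights, positions))
            for i in range(dim)]


def dense_attention(queries, keys, values):
    """Every query attends to every position: n x n comparisons."""
    every_position = range(len(keys))
    return [attend(q, keys, values, every_position) for q in queries]


def topk_sparse_attention(queries, keys, values, top_k):
    """Each query attends only to its top_k highest-scoring positions,
    wherever they sit in the sequence."""
    outputs = []
    for q in queries:
        # Toy router (ASSUMPTION): scores all keys, so selection is still O(n) per query.
        selected = heapq.nlargest(top_k, range(len(keys)),
                                  key=lambda p: dot(q, keys[p]))
        outputs.append(attend(q, keys, values, selected))
    return outputs
```

With `top_k` equal to the sequence length the sparse version reproduces dense attention; with a small `top_k` the softmax and value mixing touch only k positions per query, which is where the claimed savings would come from if selection itself can be made cheap.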

The model is trained in three stages — pre-training, supervised fine-tuning, then reinforcement learning targeted at long-context retrieval and coding. Conceptually this is an evolution of a well-established research direction (sparse and linear attention have years of literature behind them); the novel claim is shipping it as a fully subquadratic frontier-grade model rather than a research prototype or a hybrid.

A widely shared technical objection from AI engineer Will Depue is worth noting: he argued SubQ is "almost surely a sparse-attention finetune of Kimi or DeepSeek", meaning the strong base-model behavior comes from existing open-source weights with SSA bolted on. DataCamp's coverage notes the CTO confirmed SubQ builds on open-source base models (most plausibly something in the DeepSeek V4 family) rather than training from scratch. That is consistent with Depue's read and is not, by itself, a knock, but it does reframe what is actually novel here.

What are the claims versus what is actually verified?

This is the section to read twice. The efficiency numbers are the headline, and they are also where reporting varies the most. Different credible outlets cite cost/efficiency advantages ranging from ~5x to ~1,000x — not because anyone is lying, but because the multiple depends heavily on context length and which comparison you pick. Treat any single number with suspicion.

Claim	What Subquadratic states	Verification status
Speed vs FlashAttention at 1M tokens	~52-56x faster	Vendor-reported; consistent across the company's own materials, not independently reproduced
Cost vs frontier models (general)	~1/5 the cost of Claude Opus / GPT-5.5 at comparable workloads	Vendor-reported; no public API pricing exists to check it
Cost on the RULER 128K eval	~$8 for SubQ vs ~$2,600 for Opus (~300x on that one test)	Vendor-reported, single-run; the ~300x is specific to one benchmark, not a general multiple
Compute reduction at full 12M tokens	~1,000x vs other frontier models	Vendor-reported, extrapolated to a context length no benchmark publicly tests
RULER @ 128K accuracy	~95% vs Opus ~94.8%	Vendor-run; the accuracy gap is negligible — the claimed story is cost, not quality
MRCR v2 (long-context retrieval)	Lab 83-86%; production 65.9% vs GPT-5.5 74%, Opus 4.7 32.2%	Vendor-reported; the production score beats Opus 4.7 but trails GPT-5.5. Note the ~17-point lab-vs-production gap, unexplained by the company
SWE-Bench Verified (coding)	~81.8%	Vendor-reported; below Opus 4.7 (~87.6%) and GPT-5.5 (~88.7%) — SubQ lags frontier here
12M-token retrieval quality	"Over 90% on needle-in-a-haystack at 12M"	Unverified. Public benchmarks only test up to ~1M tokens; the 12M figure is asserted, not demonstrated

The honest summary: the direction of the claims is internally consistent (a sparse-attention model should be cheaper at long context), the architecture concept is sound and well-precedented, but the specific multiples are marketing-grade until someone outside the company reproduces them. The cost-advantage figure in particular is context-dependent — quoting "1,000x cheaper" as a flat fact is wrong; the company's own framing ties that number specifically to the 12M-token regime.
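The arithmetic behind two of the figures in the table is easy to check yourself. The sketch below uses only the vendor-reported numbers quoted above; the function and variable names are illustrative, and nothing here validates the underlying measurements.

```python
# All inputs are the vendor-reported figures from the table above.
RULER_COST_SUBQ_USD = 8
RULER_COST_OPUS_USD = 2_600

MRCR_LAB_RANGE = (83.0, 86.0)  # percent, research result
MRCR_PRODUCTION = 65.9         # percent, deployed model


def cost_multiple(baseline_usd: float, candidate_usd: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    return baseline_usd / candidate_usd


def lab_production_gap(lab_range, production):
    """Gap in percentage points at the low and high end of the lab range."""
    low, high = lab_range
    return round(low - production, 1), round(high - production, 1)


ruler_multiple = cost_multiple(RULER_COST_OPUS_USD, RULER_COST_SUBQ_USD)
gap_low, gap_high = lab_production_gap(MRCR_LAB_RANGE, MRCR_PRODUCTION)

print(f"RULER 128K cost multiple: {ruler_multiple:.0f}x")                 # 325x
print(f"MRCR v2 lab-vs-production gap: {gap_low} to {gap_high} points")   # 17.1 to 20.1
# Spread between the smallest (~5x) and largest (~1,000x) claimed multiples.
print(f"Spread across claimed multiples: {1000 / 5:.0f}x")                # 200x
```

The "~300x" on RULER works out to about 325x from the quoted dollar amounts, and the smallest and largest claimed multiples differ by a factor of 200, which is the practical reason to never quote one of them without its context length.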

How can you actually try SubQ today?

Practically: you mostly can't yet, and that is part of the skepticism.

API: private beta, waitlist only at subq.ai. No public pricing, no self-serve sign-up at launch.
SubQ Code: CLI coding agent, same private-beta gate.
SubQ Search: long-context search product, initial launch positioned as free.
Weights: not open. The company has said it will not open-source in the near term, though it has floated customer-specific trained variants.
Technical report / paper: a full model card is listed as "coming soon"; no peer-reviewed paper at launch.

If you want to evaluate it seriously, the right move is to join the waitlist and run your own long-context retrieval and coding evals on your own data the moment you get access — not to plan around the published numbers.
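A minimal sketch of what "your own long-context retrieval eval" could look like is below: a needle-in-a-haystack harness that works against any function mapping a prompt to an answer. Everything in it is an assumption for illustration. `ask_model` is a hypothetical callable you would wire to whatever client you receive with beta access (no SubQ API details are public), the filler text is a placeholder for your own data, and `stub_model` exists only so the harness runs offline.

```python
import random
import re
from typing import Callable

# Placeholder filler; in a real eval, use documents from your own domain.
FILLER = [
    "The quarterly report was filed on time.",
    "Rain is expected along the coast this weekend.",
    "The committee postponed its decision again.",
    "A new bakery opened across from the station.",
]


def build_haystack(needle: str, total_sentences: int, depth: float,
                   rng: random.Random) -> str:
    """Filler text with `needle` inserted at a relative depth (0.0 = start, 1.0 = end)."""
    body = [rng.choice(FILLER) for _ in range(total_sentences)]
    body.insert(int(depth * total_sentences), needle)
    return " ".join(body)


def run_needle_eval(ask_model: Callable[[str], str], depths=(0.0, 0.5, 1.0),
                    total_sentences: int = 200, trials: int = 5,
                    seed: int = 0) -> dict:
    """Return retrieval accuracy per needle depth for any prompt -> answer callable."""
    rng = random.Random(seed)
    accuracy = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            secret = str(rng.randint(100_000, 999_999))
            haystack = build_haystack(f"The vault code is {secret}.",
                                      total_sentences, depth, rng)
            prompt = haystack + "\n\nWhat is the vault code? Answer with the number only."
            if secret in ask_model(prompt):
                hits += 1
        accuracy[depth] = hits / trials
    return accuracy


def stub_model(prompt: str) -> str:
    """Offline stand-in (hypothetical); replace with a call to a real model client."""
    match = re.search(r"The vault code is (\d+)\.", prompt)
    return match.group(1) if match else ""


print(run_needle_eval(stub_model))
```

For a meaningful result you would scale `total_sentences` toward the context lengths you actually care about, run multiple trials per depth rather than a single shot, and run the same harness against the models you already use so the comparison is yours rather than the vendor's.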

Who should actually care about SubQ?

SubQ is only interesting if your bottleneck is genuinely long context. Be honest about whether it is:

Care now (track it closely): teams doing whole-repo code understanding, long-document analysis at scale, or long-horizon agent work where context summarization is actively losing you accuracy.
Care later (wait for independent results): teams whose workloads fit comfortably in 128K-256K tokens. Frontier models already handle that well and have proven, reproducible track records and real pricing.
Don't reorganize around it yet: anyone tempted to make architecture bets on the 12M-token or 1,000x figures. Those are the least verified claims in the entire announcement.
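One way to be honest about which bucket you are in is to estimate how many tokens your workload actually needs. The sketch below is a rough estimator, assuming about 4 characters per token; that ratio is a rule-of-thumb assumption, real tokenizers vary by language and content, and the function names and the 256K threshold mapping are illustrative choices based on the buckets above.

```python
import math
from pathlib import Path

CHARS_PER_TOKEN = 4.0  # ASSUMPTION: rough rule of thumb, not a real tokenizer


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Very rough token estimate from character count."""
    return math.ceil(len(text) / chars_per_token)


def estimate_tree_tokens(root: str, suffixes=(".py", ".md", ".txt")) -> int:
    """Sum rough token estimates for matching files under a directory."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += estimate_tokens(text)
    return total


def triage(tokens: int) -> str:
    """Map an estimate onto the buckets above (thresholds are illustrative)."""
    if tokens <= 256_000:
        return "care later: fits in the 128K-256K range current models handle"
    return "care now: exceeds 256K, long context may be a real bottleneck"


print(triage(estimate_tokens("x" * 400_000)))    # 100,000 tokens -> care later
print(triage(estimate_tokens("x" * 8_000_000)))  # 2,000,000 tokens -> care now
```

Pointing `estimate_tree_tokens` at a repository or document folder gives a first-order answer; if the estimate lands near a threshold, count with the actual tokenizer of the model you plan to use before drawing conclusions.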

What are the main reasons for skepticism?

The critical case against taking SubQ at face value, drawn from VentureBeat, DataCamp, and the Hacker News thread:

No weights, no paper, no independent reproduction. Every number is vendor-controlled, some from single runs because of cost. Multiple researchers have publicly said they will not form an opinion until an outside party reproduces the results.
No public pricing. The entire pitch is cost efficiency, yet there is no published API price, which makes the cost-per-task claims impossible to validate independently.
Benchmark scope is narrow. The three highlighted benchmarks (RULER, MRCR, SWE-Bench) all target exactly what SubQ is optimized for: long-context retrieval and coding. There is no published data on general reasoning, math, multilingual ability, safety, or short-input performance — and on coding (SWE-Bench) SubQ actually trails the frontier.
The lab-vs-production gap. A ~17-point drop on MRCR v2 between the research result and the deployed model, unexplained by the company, is a yellow flag for how the headline numbers were produced.
Valuation vs evidence. ~$500M valuation at seed with no public model, no peer-reviewed research, and no disclosed revenue, backed by an investor base skewed toward consumer/growth rather than deep technical AI. That does not make the tech fake, but it does mean hype incentives are strong.
Precedent. Magic.dev made structurally similar long-context and efficiency claims in 2024, raised heavily, and showed limited real-world adoption by early 2026. Extraordinary long-context claims have underdelivered before.

None of this means SubQ is fake — sparse attention is real research and the team is credible. It means the correct posture for an engineering team is interested but unconvinced: track it, join the waitlist, and reserve judgment until independent benchmarks land.

FAQ

Is SubQ real or is it vaporware?

SubQ and Subquadratic are real and well-documented: the May 5, 2026 launch, the $29M seed round, the founders, and the products were covered by The New Stack, SiliconANGLE, eWeek, VentureBeat, and DataCamp. What is unproven is the performance — there are no open weights, no technical paper, and no independent reproduction of the efficiency or 12M-token claims as of mid-2026. "Real company, unverified claims" is the accurate framing.

What does subquadratic attention actually mean?

Standard transformer attention scales quadratically: doubling the context roughly quadruples the compute, because every token attends to every other token. Subquadratic attention scales more slowly than that. SubQ's SSA selects only the most relevant positions per query, so attention cost grows with the small number of selected tokens rather than the full sequence length, giving roughly linear scaling in the company's description.

Is the "1,000x cheaper" claim true?

It is not a verified flat fact. Subquadratic's own framing ties the ~1,000x figure specifically to the full 12M-token regime — a context length no public benchmark tests. At 1M tokens the company claims roughly 5x cheaper, and one specific benchmark (RULER 128K) showed roughly 300x on that single test. The multiple depends entirely on context length and comparison, and none of it has independent confirmation. Treat "1,000x" as an unverified, best-case vendor figure.

Can I use SubQ in production today?

No. As of mid-2026 the API, SubQ Code, and SubQ Search are private-beta, waitlist-only, with no public pricing. There are no open weights. For production long-context work today, proven frontier models with published pricing and reproducible benchmarks remain the safer choice; treat SubQ as something to evaluate on your own data once you get beta access.

How does SubQ compare to Claude Opus 4.7 and GPT-5.5?

On vendor-run benchmarks, SubQ's retrieval accuracy at 128K is roughly comparable to Claude Opus 4.7. On long-context retrieval (MRCR v2), its production score of 65.9% is claimed to beat Opus 4.7 (32.2%) and Gemini 3.1 Pro, but it trails GPT-5.5 (74%). On coding (SWE-Bench Verified) SubQ trails the frontier (roughly 82% versus ~88% for Opus 4.7 and GPT-5.5) and also lags the strongest open-weights coding models it is reportedly fine-tuned from. The differentiator the company pushes is cost at long context, not raw capability, and that differentiator is the least independently verified part.

Why are researchers skeptical of SubQ?

Because the entire pitch rests on numbers nobody outside the company can check: no released weights, no technical report, no public pricing, some single-run benchmarks, narrow eval scope, and an unexplained ~17-point lab-vs-production gap on long-context retrieval. The $500M seed-stage valuation with no public model amplifies hype incentives, and there is precedent (Magic.dev) for similar long-context claims underdelivering. The consensus position is to wait for independent reproduction.

What should engineering teams do about SubQ right now?

Join the waitlist if long context is a genuine bottleneck for you, and prepare your own long-context retrieval and coding evals on your own data so you can judge it the moment you get access. Do not make architecture or vendor decisions based on the published 12M-token or 1,000x figures, and do not migrate production workloads off proven models until independent benchmarks confirm the claims.