In October 2025 NVIDIA shipped a desktop box it called an "AI supercomputer": the DGX Spark, built on the GB10 Grace Blackwell Superchip, with 128GB of unified memory in a footprint that fits next to a keyboard. A few months earlier, the GeForce RTX 5090 arrived as the most powerful consumer GPU ever sold, with 32GB of GDDR7 and bandwidth that embarrasses everything below it. Both run local large language models. They could not be more different about how.
If you write code with local models — vibe-coding inside an editor, running an agent loop overnight, or prototyping with a 70B before you commit to a cloud bill — this is the buying decision that actually matters in 2026. One machine is fast but small. The other is roomy but slow. This piece walks the real benchmarks (prefill and decode, kept separate because they tell opposite stories), the prices that have drifted far from MSRP, the power draw, and the honest "which should you buy" verdict, using only measured numbers from LMSYS, named benchmarkers, and the GitHub recipes people actually published.
What are the DGX Spark and RTX 5090, exactly?
They are answers to two different questions. The RTX 5090 answers "how fast can a single consumer card generate tokens?" The DGX Spark answers "how big a model can I fit on my desk without a server rack?"
The DGX Spark is built around NVIDIA's GB10 Grace Blackwell Superchip: a 20-core ARM CPU (10× Cortex-X925 + 10× Cortex-A725) fused to a Blackwell GPU, sharing 128GB of coherent unified LPDDR5x memory. NVIDIA quotes up to 1 PFLOP of sparse FP4 tensor performance, and LMSYS pegs its real AI capability as landing between an RTX 5070 and a 5070 Ti — not in the same league as a 5090 on raw compute. The headline number is the memory: because CPU and GPU share one 128GB pool, the Spark can load models that simply will not fit on a 32GB card. NVIDIA's official capacity claims are inference on models up to 200B parameters, fine-tuning up to 70B, and — with two units linked over ConnectX-7 (2× QSFP, 200 Gb/s) — models up to 405B in FP4.
The RTX 5090 is a Blackwell consumer GPU: 21,760 CUDA cores, 5th-generation Tensor Cores, 32GB of GDDR7 on a 512-bit bus, and a 575W total graphics power rating. The number that decides everything downstream is its memory bandwidth: 1,792 GB/s, roughly 6.5× the Spark's 273 GB/s. That gap is the single biggest driver of the 5090's decode-speed advantage on any model that fits inside its 32GB. The catch is right there in the spec: 32GB is a hard ceiling. There is no spilling gracefully into system RAM at usable speed — a model either fits, or it doesn't.
So the trade is stark. The Spark gives you 4× the memory at roughly one-sixth the bandwidth. Which one wins depends entirely on what you're trying to run and how.
Why does memory bandwidth decide everything?
The most common mistake people make reading local-LLM benchmarks is treating "tokens per second" as one number. It's two, and they're governed by different bottlenecks.
- Prefill (a.k.a. prompt processing) is reading your input — the system prompt, the codebase context, the conversation so far. It's compute-bound and embarrassingly parallel, so it loves raw FLOPs and big batches.
- Decode is generating the output one token at a time. Each new token requires streaming the entire model's weights through the compute units once. That makes decode memory-bandwidth-bound: the faster you can move weights, the faster tokens come out.
This is why the Spark and the 5090 produce such lopsided results depending on which half you measure. On NVIDIA's own GB10, prefill is genuinely strong — LMSYS measured the Spark at 2,053 tps prefill on GPT-OSS 20B (MXFP4, Ollama) and 803 tps prefill on Llama 3.1 70B (FP8, SGLang). Those are healthy throughput numbers. But decode tells the other story: 49.7 tps decode on that same GPT-OSS 20B, and a painful 2.7 tps decode on the 70B. The 70B loads — it just generates at a speed LMSYS frankly described as prototyping, not production.
The 5090, with 6.5× the bandwidth, flips decode on its head. On the same GPT-OSS 20B (MXFP4, Ollama), LMSYS clocked it at 8,519 tps prefill / 205 tps decode — about 4× the Spark's decode speed on a model that fits both machines. When you're sitting in an editor waiting for a completion, decode is the felt experience. 205 tokens/sec reads as instant; 49.7 reads as sluggish; 2.7 reads as "go get coffee."
The lesson for buyers: a vendor slide showing "2,000+ tps" is almost certainly quoting prefill or batched aggregate throughput. For interactive single-user coding, look mainly at single-stream decode — the rest is throughput context, not the latency you feel.
How do they compare on the spec sheet?
| Spec | DGX Spark (GB10) | RTX 5090 |
|---|---|---|
| Architecture | GB10 Grace Blackwell Superchip | Blackwell (consumer) |
| Memory | 128GB unified LPDDR5x (CPU+GPU) | 32GB GDDR7 |
| Memory bandwidth | 273 GB/s | 1,792 GB/s (~6.5×) |
| Compute (peak) | Up to 1 PFLOP sparse FP4 | 21,760 CUDA cores, 5th-gen Tensor |
| AI capability tier (LMSYS) | Between RTX 5070 and 5070 Ti | Flagship consumer |
| CPU | 20-core ARM (10× X925 + 10× A725) | N/A (host CPU separate) |
| Power | ~195.5W measured at wall (240W USB-C cap) | 575W GPU TGP |
| Max model (inference) | Up to 200B params (405B on two clustered units) | Whatever fits in 32GB |
| Launch price | $3,999 (Founders Edition) | $1,999 MSRP |
| Mid-2026 street (social-sourced, volatile) | ~$4,699 | ~$3,658–$4,329 |
A note on those street prices: they are social-sourced from a widely-viewed June 2026 thread, not official, and they move week to week. The 5090's MSRP is $1,999, but real Amazon listings in mid-2026 ran $3,658–$4,329 — nearly double sticker. The Spark launched at $3,999 for the Founders Edition and was quoted around $4,699 on the street. Treat both as moving targets and check live before you buy; the gap between MSRP and reality is the whole reason the value math below is messy.
How do they compare on real coding-model benchmarks?
Spec sheets argue; benchmarks decide. The clearest head-to-head comes from LMSYS's October 2025 review and a comprehensive June 2026 benchmark thread from @stevibe, both run on models developers actually reach for.
| Model / workload | DGX Spark | RTX 5090 | Source |
|---|---|---|---|
| GPT-OSS 20B MXFP4 (Ollama) — prefill | 2,053 tps | 8,519 tps | LMSYS |
| GPT-OSS 20B MXFP4 (Ollama) — decode | 49.7 tps | 205 tps (~4×) | LMSYS |
| Qwen3.6 35B-A3B Q4 — decode | 59.98 t/s | 160.37 t/s (~2.7×) | @stevibe |
| Qwen3.6 35B-A3B Q4 — time-to-first-token | 228 ms (faster) | 409 ms | @stevibe |
| Llama 3.1 70B FP8 (SGLang) — decode | 2.7 tps | Won't fit in 32GB | LMSYS |
Read those rows carefully and a consistent picture emerges. On the two models that fit both machines, the 5090 is 2.7× to 4× faster at decode — the number you feel. But notice the one row the Spark wins: time-to-first-token. On Qwen3.6 35B-A3B, the Spark answered in 228 ms versus the 5090's 409 ms. For comparison, @stevibe's full ladder on that same Qwen run put the RTX 3090 at 49.78 t/s (852 ms TTFT) and the RTX 4090 at 118.93 t/s (686 ms) — so the Spark's decode lands between a 3090 and a 4090, while its TTFT beats all of them. If your workload is latency-sensitive (short bursty completions, fast first-token feel) rather than throughput-sensitive (long generations), that gap narrows the Spark's apparent disadvantage.
And then there's the row the 5090 can't even show up for: Llama 3.1 70B doesn't fit in 32GB. The Spark runs it at 2.7 tps — slow, but it runs. That's the entire trade in a single line.
What can the DGX Spark run that a 5090 can't?
Capacity is the Spark's whole pitch, and there are three workloads where it earns the price.
1. Big-model prototyping. A single Spark loads 70B, 120B, even up to 200B-parameter models. You won't serve production traffic on a dense 70B at 2.7 tps, but you can validate that a model behaves on your data, sketch a fine-tune target, or run an overnight eval before you rent an H100. The 5090 simply cannot hold these models at all.
2. High-concurrency batched serving. This is the Spark's quiet superpower. Single-stream decode is bandwidth-starved, but batching amortizes weight movement across many simultaneous requests. LMSYS measured Llama 3.1 8B (SGLang) climbing from 20.5 tps decode at batch 1 to 368 tps aggregate at batch 32, and clocked DeepSeek-R1 14B (FP8) at batch 8 sustaining 2,074 tps prefill / 83.5 tps decode without thermal throttling. If you're serving a team, running an agent swarm, or batch-processing a dataset, aggregate throughput is what counts, and here the Spark is genuinely useful.
3. NVFP4-native workloads on modern engines. Early GGUF/llama.cpp numbers badly understate the GB10. With NVFP4-native models and current engines, the picture brightens fast: the Atlas Rust engine hit 102 stable tok/s on Qwen3.5-35B-A3B NVFP4 on a single GB10 (125+ with multi-token prediction), and Ivan Fioravanti pushed Qwen3.6-27B NVFP4 to 183 tok/s at batch 16 with vLLM. Those are real, recent, single-box numbers that change the calculus if you commit to the NVFP4 path.
For frontier-class Mixture-of-Experts models, you need two Sparks. A published DeepSeek V4 Flash recipe on 2× Spark (vLLM FP8, tensor-parallel 2) hit ~41 t/s decode single-stream, ~1,785 tps prefill, and ~350 t/s aggregate at concurrency 32. The asterisks are real: it needs the $180 ConnectX cable, two boxes, and its KV cache is a shared ~1.1M-token pool — push long context and high concurrency past that and requests time-slice (preempt) rather than crash. For context on the model itself, see our DeepSeek V4 complete guide.
What about price and power?
Power is where the Spark lands a clean, uncontested win. An owner measured the Spark drawing ~195.5W maximum from the wall (it's capped at 240W over USB-C), versus the RTX 5090's 575W for the GPU alone — before you add the host CPU, board, and the rest of a desktop. The Spark is dramatically more power- and noise-efficient. If it lives on your desk and runs agent loops for hours, that's a meaningfully quieter, cooler, cheaper-to-run machine.
Price is murkier. On MSRP, the 5090 ($1,999) looks like half the Spark ($3,999). But mid-2026 street prices erased most of that gap — the 5090 ran ~$3,658–$4,329 and the Spark ~$4,699. Worse, comparing a bare GPU to a complete computer is apples to oranges: the 5090 needs a host machine (motherboard, CPU, PSU rated for 575W-plus, cooling, case), while the Spark is the whole computer. Build a capable 5090 workstation and the total can close on, or pass, the Spark's price. The honest framing is: at street prices these are roughly the same money, and the decision should hinge on capability fit, not a sticker delta that barely exists.
Is the DGX Spark's software actually usable?
This is the part early reviews got genuinely mixed about, and it's worth being honest in both directions. The Spark is an ARM64 (aarch64) box built on the GB10's sm_121 GPU architecture — a target much of the local-LLM ecosystem wasn't compiled against on day one, so early adopters hit real toolchain and dependency friction that x86/CUDA users never see.
The improvement since has been rapid. NVFP4-native builds, the SGLang Spark image, and engines like Atlas closed that gap fast — the same hardware now serves 102 stable tok/s on a real model (Qwen3.5-35B-A3B via Atlas). NVIDIA ships a Docker-based path that's reasonable once you know it. The serve command LMSYS used looks like this:
docker run --gpus all --shm-size 32g -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" --ipc=host lmsysorg/sglang:spark \
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 --host 0.0.0.0 --port 30000And speculative decoding (EAGLE3), which LMSYS measured giving up to a 2× throughput bump on the Spark, adds a draft model:
docker run --gpus all --shm-size 32g -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--env "SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1" \
--ipc=host lmsysorg/sglang:spark \
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 --host 0.0.0.0 --port 30000 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 32 \
--mem-fraction 0.6 \
--cuda-graph-max-bs 2 \
--dtype float16The 5090, by contrast, runs on the CUDA x86 platform almost every local-LLM tool targets first — Ollama, LM Studio, vLLM, and llama.cpp support it out of the box. If you want zero friction, that's a real advantage, and one of the most-upvoted Spark threads (the "AMA") still flagged the early ecosystem pain. If you're deciding which serving engine to standardize on, our vLLM vs Ollama vs LM Studio comparison is the companion read, and our self-hosting LLMs guide covers the broader setup.
What about Mac Studio and Strix Halo?
A pure two-way framing would be dishonest, because anyone who's read the Reddit threads knows there are at least two more machines in this exact bracket — both offering 128GB-class unified memory at similar money.
Mac Studio (M3 Ultra) brings unified memory with 819 GB/s bandwidth — about 3× the Spark's 273 GB/s, though still well under the 5090's 1,792 GB/s. On the DeepSeek V4 Flash single-stream comparison, a Mac M2 Ultra (192GB) ran 29.7 t/s versus a single Spark's ~14 t/s and an RTX Pro 6000 (96GB) at 46.9 t/s. One pointed r/LocalLLaMA thread called the Spark "a bad 4K investment vs Mac." That's the bear case, and it deserves a hearing. The Mac's higher bandwidth and its MLX tooling make it a strong capacity-play rival; our Apple Silicon LLMs guide goes deep on that path.
Strix Halo / Ryzen AI Halo is AMD's 128GB unified-memory answer. In a widely-upvoted (241) community comparison, the Spark's clearest edge was the CUDA/NVIDIA software stack; for basic inference the two land close unless you specifically need CUDA. If your pipeline depends on CUDA kernels, the Spark's moat is real. If it doesn't, the gap shrinks.
The takeaway: the Spark isn't competing only against the 5090. It's competing against every 128GB unified-memory box, and its differentiator over the Mac and AMD options is specifically the NVIDIA software ecosystem and CUDA — not raw bandwidth, where it loses to all of them.
Which should you buy?
Skip the hedging. Here's the decision, mapped to who you actually are.
Buy the RTX 5090 if you're a single developer doing fast local coding on models that fit in 32GB. That covers a lot of ground in 2026 — a 5090 owner thread (837 upvotes) reports zero regret running Qwen 3.6 and Gemma 4 dense models for local vibe-coding inside 32GB, with the honest caveat that 32GB isn't forgiving with every model. If your daily driver is a quantized 20–35B model and you want the snappiest possible decode, the 5090 is the clear pick. It's faster where speed is felt, every tool supports it instantly, and at street prices it's no longer meaningfully cheaper — it's just better at this one job.
Buy the DGX Spark (or two) if capacity beats speed for your work: prototyping 70B–200B models before a cloud commit, serving many concurrent users or an agent swarm where aggregate throughput matters, running NVFP4-native workloads on modern engines, or building inside the CUDA ecosystem on a quiet 195W desktop. LMSYS's own framing is the fair one: the Spark is "not built to compete head-to-head" with a 5090 on speed — it's for prototyping, experimentation, and edge research. Go in wanting what it's good at, and it delivers. Go in expecting 5090 decode speed, and you'll be the person writing the disappointed Reddit post.
Look harder at a Mac Studio or RTX Pro 6000 if you want big-model capacity and better single-stream speed than the Spark, and you don't strictly need CUDA. The RTX Pro 6000 was measured 6–7× faster than the Spark across batch sizes in one visualization of the LMSYS data; the Mac trades some of that for efficiency and 192GB+ configs.
Whichever box you land on, remember the hardware is the easy half. Getting reliable local inference, agent loops, and a coding workflow that survives contact with a real codebase is where teams burn weeks. If you'd rather ship product than babysit a serving stack, Codersera places vetted remote engineers who've built exactly these local-LLM and agentic-coding pipelines — a faster path than learning SM121 quirks the hard way. For the surrounding tooling, our AI coding agents guide is a good next stop.
FAQ
Is the DGX Spark faster than the RTX 5090?
No, not for single-stream decode. On models that fit both, the RTX 5090 is roughly 2.7–4× faster at generating tokens — LMSYS measured 205 tps vs 49.7 tps on GPT-OSS 20B. The Spark's advantage is capacity (128GB vs 32GB) and aggregate throughput under heavy batching, not raw speed. The one metric it wins is time-to-first-token (228 ms vs 409 ms on Qwen3.6 35B-A3B).
Can the RTX 5090 run 70B models locally?
Not a dense FP8 70B — it won't fit in 32GB of VRAM. The 5090 is hard-capped at 32GB with no usable spillover, so it's limited to the models and quantizations that fit that 32GB budget. The DGX Spark, by contrast, loads Llama 3.1 70B — but only generates at about 2.7 tps, which LMSYS calls prototyping rather than production.
Why is the DGX Spark so slow at generating tokens?
Memory bandwidth. Decode (token generation) requires streaming the entire model through compute once per token, so it's bandwidth-bound. The Spark's LPDDR5x runs at 273 GB/s versus the 5090's 1,792 GB/s GDDR7 — about 6.5× less. That bandwidth gap, not compute, is why single-stream decode is slow. The Spark claws speed back through batching and NVFP4-native engines (102–183 tok/s on Qwen MoE models).
How much do the DGX Spark and RTX 5090 cost in 2026?
The RTX 5090 launched at $1,999 MSRP but ran ~$3,658–$4,329 on the mid-2026 street (social-sourced, volatile). The DGX Spark launched at $3,999 (Founders Edition) and was quoted around $4,699. At real street prices they're roughly the same money once you account for the 5090 needing a full host workstation around it, so price is rarely the deciding factor — capability fit is.
Can two DGX Sparks run frontier models like DeepSeek V4 Flash?
Yes. Two Sparks linked over ConnectX-7 (a $180 cable) run DeepSeek V4 Flash via vLLM with tensor-parallel 2: ~41 t/s decode single-stream, ~1,785 tps prefill, and ~350 t/s aggregate at concurrency 32. NVIDIA officially supports models up to 405B in FP4 on two clustered units. The caveats: you need two boxes plus the cable, and the KV cache is a shared ~1.1M-token pool — push long context and high concurrency past it and requests time-slice (preempt) rather than OOM.
Is the DGX Spark's software stack ready to use?
It's much better than at launch. Early on, the ARM64 / sm_121 platform meant toolchain and dependency friction the x86/CUDA world never sees. NVFP4-native models and modern engines (SGLang's Spark image, the Atlas Rust engine, vLLM) closed most of that gap — the same hardware now hits 100–183 tok/s on Qwen MoE models. The 5090 still has the smoother day-one experience since it's a standard CUDA x86 card every tool supports.