Local LLM Hardware Showdown — June 2026: DGX Spark vs Strix Halo vs RTX 6000 Pro vs M5 Max

Three 128GB-class boxes plus a 96GB workstation card, at four very different price points. We synthesise what practitioners with the hardware on their desks are actually reporting.

Published 16 Jun 2026 • Updated 16 Jun 2026 • 8 min read

We refresh this guide whenever a new SKU ships, vendors revise pricing, or practitioner benchmarks land that change the recommendation.

Quick answer. Want CUDA + the lowest total cost of ownership for serious local inference? DGX Spark ($4k). Want a single machine that's both your daily driver and your LLM box? M5 Max 128GB MacBook Pro ($5k). Need workstation-grade training and don't blink at the bill? RTX 6000 Pro ($10k). AMD Strix Halo (Ryzen AI Max+ 395) is tempting on price but currently the riskiest pick — read the caveats before you buy.

Mid-2026 is the most interesting moment local-LLM hardware has had so far. Three credible 128GB-class boxes ship at very different price points (NVIDIA's DGX Spark at $4k, AMD's Strix Halo / Ryzen AI Max+ 395 at around $2k–$3k, and Apple's M5 Max 128GB MacBook Pro at $5k), plus the workstation-tier RTX 6000 Pro at $10k for teams that need real training throughput.

We do not have these boxes on our desks. What we have is the strongest independent reporting we could find from practitioners who do, cross-validated where claims overlap, and a framework we'd use ourselves if buying tomorrow. This page is the single landing surface we point engineering teams to when scoping a local-LLM workstation purchase.

TL;DR — Should you care?

If you're an indie hacker on a budget: The Strix Halo is the cheapest path to 128GB of unified memory, but the software story (no CUDA, no MLX, real-world bandwidth well below the marketing spec) means you're signing up for headaches. Wait, or get a used Mac Studio.
If you're an AI researcher who needs CUDA: DGX Spark at $4k is the obvious answer. Same software stack as the cloud, 128GB, fits on a desk.
If you're a founder who wants one machine for everything: M5 Max 128GB. It's your daily driver, your LLM box, your travel rig. MLX is genuinely fast on Apple Silicon now.
If you're an enterprise prosumer / training small models: RTX 6000 Pro. 96GB of real workstation VRAM, full CUDA stack, the only one of these that actually trains anything serious without complaint.

The four contenders

NVIDIA DGX Spark — $4,000

NVIDIA's first "personal AI computer." 128GB unified LPDDR5X memory, GB10 Grace Blackwell superchip, full CUDA stack, roughly a Mac Mini footprint. The pitch: take the cloud software environment you already use and put it on your desk. Every inference engine — vLLM, llama.cpp, TensorRT-LLM, SGLang — just works.

AMD Strix Halo / Ryzen AI Max+ 395 — ~$2,000–$3,000

The cheapest path to 128GB. Sold inside the Framework Desktop, HP Z2 Mini, and several mini-PC OEM builds. Paper bandwidth is competitive (~256 GB/s claimed). In practice, as we'll see, the usable number is materially lower, and the software stack is ROCm-or-Vulkan rather than CUDA.

NVIDIA RTX 6000 Pro Blackwell — $10,000

The workstation tier. 96GB GDDR7 ECC VRAM, 600W TDP, real PCIe in a workstation chassis. The only box here that can train serious models at speed, not just run inference on them. If you're doing LoRA / QLoRA / SFT on 13B–70B models for work, the price is a rounding error against engineering hours saved.

Apple MacBook Pro M5 Max 128GB — $5,000

The dark horse. 128GB unified memory, MLX has matured into a competitive inference runtime, and unlike the other three this is also a laptop you carry. The trade-off is no CUDA — you live in MLX / GGUF land. For solo founders who want one machine, increasingly the obvious pick. Deep dive in our Apple Silicon LLMs guide.

DGX Spark vs Strix Halo — the head-to-head

The most direct independent comparison currently available is from @sudoingX, who has both boxes on his desk. Important disclosure up front (his words):

"full disclosure, same as the last post. nvidia and amd plus framework all sent these boxes for honest testing, no money…"

— @sudoingX, June 2026

Vendor-supplied review units are the norm in this corner of the industry. We weight @sudoingX's results because he discloses, his numbers line up with what practitioners with retail units are independently measuring, and he runs both boxes with identical models and prompt suites. The framing post:

"the results are in. two 128gb boxes on my desk, the nvidia dgx spark and the amd strix halo. everyone argues which one…"

— @sudoingX, June 2026 (144 likes / 6 RTs)

The headline finding that surprised people: on Strix Halo, Vulkan beats ROCm by roughly 17% on token generation, while the two are dead-tied on prompt processing.

"yep, vulkan beat rocm by ~17% on token gen, dead tied on prompt processing. full breakdown is its own post soon. and th…"

— @sudoingX, June 2026

That's a meaningful result. The recommended runtime on Strix Halo right now is not the AMD-branded ROCm path most buyers reach for first — it's Vulkan via llama.cpp. Head-to-head against DGX Spark on the same models, the Spark wins on token generation by a comfortable margin and is roughly comparable on prompt processing, with the added benefit that every CUDA-targeted runtime runs unmodified.

The AMD Strix Halo caveat

The case against the Strix Halo as a serious LLM box, in the strongest terms we've seen, comes from @jun_song:

"Never buy the AMD AI Max+ 395.
> No CUDA, no MLX
> Actual usable bandwidth is incredibly slow at around 180GB/s
> Price isn't m…"

— @jun_song, June 2026 (280 likes / 12 RTs)

The 180 GB/s figure is the load-bearing number. AMD markets a higher theoretical ceiling (roughly 256 GB/s on paper), but practitioners measuring real LLM workloads keep landing around 180 GB/s, closer to a high-end laptop CPU than to a dedicated AI accelerator. Token generation is memory-bandwidth-bound, so this number caps it directly. @TheAhmadOsman adds a warning about the numbers in circulation:

"I am seeing a lot of posts on Ryzen AI Halo with blatantly wrong prices & performance numbers. Are these undisclosed…"

— @TheAhmadOsman, June 2026 (53 likes / 2 RTs)

To be fair: Strix Halo is the cheapest 128GB-unified-memory box you can buy, it doesn't lock you into a single vendor's software stack, and the Framework Desktop is genuinely repairable. If your primary use case is 30B-class models at modest speeds for hobby work, and you prefer the open-software side, it can make sense. But you are buying a hobbyist machine, not a workhorse — expect Vulkan-flag weekends rather than shipping product.
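
To make the bandwidth ceiling concrete, here is a rough back-of-the-envelope sketch. It assumes a dense model whose weights are all read from memory once per generated token, and it ignores KV-cache reads, compute time, and runtime overhead, so it gives an upper bound rather than a prediction. The function name and the 4.5 bits-per-weight figure for a Q4-style quantisation are our illustrative assumptions, not measurements from any of the practitioners quoted above.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float,
                             params_billion: float,
                             bits_per_weight: float = 4.5) -> float:
    """Rough upper bound on tokens/s when decoding is memory-bandwidth-bound.

    Assumes a dense model: every weight is read once per generated token.
    bits_per_weight=4.5 is an illustrative figure for a Q4-style quantisation.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


# 180 GB/s is the practitioner-reported figure; 256 GB/s is the paper spec.
for label, bandwidth in [("reported", 180.0), ("paper spec", 256.0)]:
    for params in (30, 70):
        ceiling = decode_ceiling_tok_per_s(bandwidth, params)
        print(f"{label:>10} {bandwidth:.0f} GB/s, {params}B dense: "
              f"at most ~{ceiling:.1f} tok/s")
```

Under these assumptions a 30B-class dense model at 180 GB/s tops out at roughly 10 to 11 tokens per second, which matches the "modest speeds" framing above. Mixture-of-experts models read only their active parameters per token, so their ceiling is higher than this dense estimate suggests.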

The Apple Silicon angle

The M5 Max 128GB MacBook Pro at $5k is the most underappreciated entry. Apple's unified memory means the GPU sees the full 128GB without PCIe round trips, MLX is now competitive on token generation with DGX Spark for many model sizes, and quantised builds of Llama 4 70B, DeepSeek V4 distillations, and Qwen 3.5 run comfortably.

The case for Apple: this is the only box that's also your daily driver. No separate machine for email, IDE, design, meetings. For solo founders where the LLM box would otherwise sit idle most of the day, that's a real economic argument.

The case against: no CUDA. If production targets a Linux + NVIDIA cluster, your local environment never perfectly mirrors it. Training is also significantly weaker than the RTX 6000 Pro. Full details in our Apple Silicon LLMs complete guide.

Heterogeneous setups — an emerging pattern

One of the more interesting open questions in the 2026 local-LLM scene is whether you should pick a single box at all. @TheAhmadOsman floats the idea:

"I haven't seen any good LLM software for heterogeneous hardware, like DGX Spark for prefill and Mac Studio for decoding,"

— @TheAhmadOsman, June 2026 (43 likes)

The intuition is real. Prefill (prompt processing) is compute-bound and loves CUDA throughput; decoding (token generation) is memory-bandwidth-bound and well-served by Apple's unified memory. A setup that ran prefill on a DGX Spark and streamed the KV cache to a Mac Studio for decoding would, in theory, beat either box alone on long-context workloads.
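
A toy cost model shows why the split is attractive on paper. It is a sketch under loud assumptions: the two Box entries are hypothetical machines with made-up throughput numbers (not measurements of a DGX Spark or a Mac Studio), prefill cost uses the common approximation of about 2 FLOPs per parameter per prompt token for a dense model, decode cost assumes all weights are read once per generated token, and the KV-cache size and link speed are illustrative inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    name: str
    compute_tflops: float   # sustained matmul throughput (hypothetical)
    bandwidth_gb_s: float   # usable memory bandwidth (hypothetical)


def prefill_seconds(box: Box, params_billion: float, prompt_tokens: int) -> float:
    # Compute-bound: ~2 FLOPs per parameter per prompt token (dense model).
    flops = 2 * params_billion * 1e9 * prompt_tokens
    return flops / (box.compute_tflops * 1e12)


def decode_seconds(box: Box, weights_gb: float, new_tokens: int) -> float:
    # Bandwidth-bound: every generated token reads all weights once.
    return new_tokens * weights_gb / box.bandwidth_gb_s


def request_seconds(prefill_box, decode_box, *, params_billion, weights_gb,
                    prompt_tokens, new_tokens, kv_cache_gb, link_gb_s):
    # Moving the KV cache between machines only costs time when they differ.
    transfer = 0.0 if prefill_box is decode_box else kv_cache_gb / link_gb_s
    return (prefill_seconds(prefill_box, params_billion, prompt_tokens)
            + transfer
            + decode_seconds(decode_box, weights_gb, new_tokens))


# Invented numbers for illustration only.
compute_box = Box("compute-heavy box", compute_tflops=100.0, bandwidth_gb_s=250.0)
bandwidth_box = Box("bandwidth-heavy box", compute_tflops=30.0, bandwidth_gb_s=500.0)

workload = dict(params_billion=70, weights_gb=40.0, prompt_tokens=32_000,
                new_tokens=500, kv_cache_gb=10.0, link_gb_s=1.25)  # ~10 GbE link

for prefill_box, decode_box in [(compute_box, compute_box),
                                (bandwidth_box, bandwidth_box),
                                (compute_box, bandwidth_box)]:
    total = request_seconds(prefill_box, decode_box, **workload)
    print(f"prefill on {prefill_box.name}, decode on {decode_box.name}: {total:.1f} s")
```

With these invented numbers the split finishes ahead of either box alone, but the result is sensitive to the link: the KV-cache transfer term grows with context length, so a slow network can erase the gain.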

Nobody has shipped good software for this yet. SGLang and vLLM have prototype "disaggregated prefill" features, but they target homogeneous CUDA clusters, not cross-vendor desktops. Watch this space — if disaggregated inference goes mainstream by Q4 2026, the buying calculus changes meaningfully.

Buying recommendation framework

If you are…	And your budget is…	Buy	Why
An AI researcher / engineer who needs CUDA parity with the cloud	~$4k	DGX Spark	Same software stack as production, 128GB, every inference engine works.
A solo founder / consultant who wants one machine for everything	~$5k	M5 Max 128GB MacBook Pro	Daily driver + LLM box + travel rig in one. MLX is genuinely fast.
A small team training / fine-tuning models	~$10k	RTX 6000 Pro	96GB workstation VRAM, real training throughput, full CUDA.
A hobbyist who wants 128GB cheap and is comfortable tuning	~$2–3k	Strix Halo (eyes open)	Cheapest 128GB box, but expect Vulkan-flag weekends and 180GB/s bandwidth.
Anyone who can wait 6 months	—	Wait	Disaggregated inference + new Apple/NVIDIA SKUs are likely to land Q4 2026.

Where this fits

Hardware is half the picture. Once you've picked a box, the next question is which models actually run well on it and how to operationalise the stack. Four Codersera resources carry the rest of the load:

Self-hosting LLMs — the complete guide (2026) — runtimes, quantisation, serving, monitoring, security. The canonical "now what do I run on this $4k box" page.
Apple Silicon LLMs — the complete guide (2026) — MLX, Metal, Ollama, LM Studio. Deep dive on the M5 Max path specifically.
Open-Source LLMs Landscape (2026) — which models actually run well on each tier of hardware, ranked and compared quarterly.
Local AI Model Picker — our free tool that takes your hardware specs and recommends the largest model you can comfortably run.

FAQ

Which one is best for inference only?

DGX Spark at $4k. CUDA parity with the cloud, every inference engine works unmodified, and the practitioner benchmarks have it ahead of Strix Halo on token generation by a comfortable margin.

Which one is best for training / fine-tuning?

RTX 6000 Pro. It's the only box on this list with real workstation-grade compute and ECC VRAM. The Spark and Strix Halo can do LoRA on small models; for anything serious you want the 6000 Pro.

Which one is best for long-running agent loops?

DGX Spark or M5 Max, depending on whether your agent stack is CUDA-bound or cross-platform. The Strix Halo is not currently recommended for production agent workloads — the bandwidth ceiling hurts.

What's the best option at $5,000 or under?

DGX Spark at $4k for an inference-focused box, or M5 Max 128GB at $5k if you want it to double as your daily driver. We'd skew toward the Spark for pure LLM work and the Mac for everything else.

MacBook Pro M5 Max vs Mac Studio M5 Ultra?

If you're stationary, the Studio at the same memory tier is slightly faster and meaningfully cheaper. If you travel or work from multiple locations, the MacBook Pro wins for the obvious reason. There's no LLM-specific reason to prefer one over the other beyond raw bandwidth.

Can I run Llama 4 70B on these boxes?

Yes, on all four — quantised. Llama 4 70B at Q4 sits comfortably in 48–64GB depending on context length, well under the 128GB ceiling on the Spark / Strix Halo / M5 Max, and trivially within the RTX 6000 Pro's 96GB.
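
The arithmetic behind that range can be sketched as follows. This is a rough estimate, not a measurement: the 4.5 bits-per-weight figure for Q4 and the layer, KV-head, and head-dimension values are illustrative assumptions for a 70B-class dense model with grouped-query attention, not published specifications for Llama 4 70B, and runtime overhead is ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    # params (in billions) * bits per weight / 8 bits per byte -> gigabytes
    return params_billion * bits_per_weight / 8


def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    # Keys and values (the factor of 2), stored per layer and per token.
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


# Hypothetical architecture values, chosen only for illustration.
ARCH = dict(layers=80, kv_heads=8, head_dim=128)

model_gb = weights_gb(params_billion=70, bits_per_weight=4.5)
for context in (8_192, 32_768, 65_536):
    total = model_gb + kv_cache_gb(context, **ARCH)
    fits_96 = "yes" if total <= 96 else "no"
    fits_128 = "yes" if total <= 128 else "no"
    print(f"context {context:>6}: ~{total:.0f} GB "
          f"(fits 96 GB: {fits_96}, 128 GB: {fits_128})")
```

Under these assumptions the weights take about 39 GB, and the total lands at roughly 50 GB with a 32k context and roughly 61 GB at 64k, consistent with the range quoted above.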

Can I run DeepSeek V4 locally?

The smaller distillations (16B, 33B), yes — on any of these boxes. The full DeepSeek V4 weights are far too large for any single-desktop setup; that's a multi-GPU server workload. See our DeepSeek V4 complete guide for the model-side details.

Does Strix Halo work with MLX?

No. MLX is Apple-Silicon-only. On Strix Halo you're choosing between ROCm and Vulkan, and the current data suggests Vulkan is the better default.

Does the Spark fit on a normal desk?

Yes. It's roughly the footprint of a Mac Mini, runs quiet enough for an office, and draws under 200W. That part of the pitch is real.

Is it worth waiting for the next generation?

If you can wait 6 months and you're not bottlenecked today: yes. Disaggregated inference software, Apple's M6 generation, and a likely NVIDIA Spark refresh are all expected to land in late 2026 / early 2027. If you need a box now to ship product, none of the current options are bad; pick the one that matches your software stack and move on.
