Ollama

Local AI Runtime Update: What Shipped in Ollama, vLLM, llama.cpp, MLX, and LM Studio in May 2026

May 2026 was a heavy ship month for local AI runtimes. Ollama added Codex App support. vLLM 0.21 stabilised DeepSeek V4 on Blackwell. llama.cpp merged MTP speculative decoding. MLX hit 4x faster on M5. LM Studio shipped stable MTP. Practical runtime-by-runtime changelog.

Published 28 May 2026 • Updated 07 Jun 2026 • 12 min read

Quick answer. May 2026 shipped material upgrades across every major local AI runtime. Ollama went 0.23.0 to 0.24.0 in 11 days, adding Codex App support, Gemma 4 MTP speculative decoding via the MLX runner, and a reworked MLX sampler (Claude Desktop support was added in 0.23.0 and then withdrawn in 0.23.2 because the integration was limited to Anthropic models). vLLM v0.21.0 stabilised DeepSeek V4 on Blackwell with a new TOKENSPEED_MLA backend and made speculative decoding respect reasoning budgets. llama.cpp merged Qwen 3.6 MTP support (PR #22673) and shipped Windows CUDA 13.1 prebuilts at build b9196. MLX 0.31.x plus macOS 26.2 unlocked M5 Neural Accelerators for up to 4x faster TTFT. LM Studio 0.4.13 added parallel vision predictions; 0.4.14 promoted MTP speculative decoding to stable.

If you maintain a local-LLM stack, the last three weeks of May 2026 mattered. Five runtimes shipped real changes — not just version bumps. This is the practical changelog: what landed, where to find it, and what it means if you are picking a runtime today.

This update is a companion to our broader Ollama vs LM Studio vs vLLM vs llama.cpp vs MLX (2026) comparison — the comparison covers the trade-offs and target users; this post covers the ship velocity in one specific month.

What shipped in Ollama in May 2026?

Ollama published five point releases (0.23.0 through 0.23.4) and one minor (0.24.0) in 11 days. The headline themes were the new ollama launch command growing past CLI coding tools into desktop apps, and the MLX runner getting smarter on Apple Silicon.

v0.23.0 — May 3, 2026

ollama launch opencode moved to inline config: passes its config via the OPENCODE_CONFIG_ENVIRONMENT environment variable instead of writing a separate file, matching how the other launch integrations work.
Initial Claude Desktop support via ollama launch claude-desktop — connecting Claude Cowork and Claude Code to Ollama Cloud models.
New models added to the registry: NVIDIA Nemotron 3 Omni and Poolside Laguna XS.2.
Gemma 4 renderer updated for better thinking and tool calling.
Model recommendations now update server-side without requiring an Ollama upgrade.

v0.23.1 — May 5, 2026 (the speed bump)

This is the one to pay attention to. Ollama 0.23.1 added Gemma 4 MTP (Multi-Token Prediction) speculative decoding on Mac via the MLX runner — landing the same day Google released the Gemma 4 MTP drafter weights. The result: over 2x speed increase on Gemma 4 31B coding tasks on Apple Silicon. The drafter reuses the target model's KV cache and activations, so no redundant context recalculation eats into the win.

v0.23.2 to v0.23.4 — May 7-14, 2026

v0.23.2 removed Claude Desktop from ollama launch because the third-party integration was limited to Anthropic models, which doesn't align with Ollama's open-model focus. Users who'd already configured it can run ollama launch claude-desktop --restore to revert. This is the kind of housekeeping that's worth knowing about if you scripted the 0.23.0 integration into a setup tool.
0.23.3 had a Windows 11 high-memory bug; promptly fixed in 0.23.4.
0.23.4: ollama launch opencode now supports vision models with image inputs; fixed formatting of Claude tool results when using local image paths.

v0.24.0 — May 14, 2026

Released the same day as 0.23.4, Ollama 0.24 is the headline drop of the month:

Codex App support via ollama launch codex-app. OpenAI's desktop Codex experience runs against Ollama models: parallel-thread worktrees, built-in git, browser-based local server inspection, review mode for code commenting. Supported models include kimi-k2.6, glm-5.1, nemotron-3-super, gemma4:31b, and qwen3.6. The launch command bypasses manual env vars, custom endpoints, and config.toml.
Reworked MLX sampler for improved generation quality on Apple Silicon.
/api/show response caching — median latency improved ~6.7x, which makes integrations like VS Code feel dramatically faster on cold model lookups.
ollama launch opencode picked up vision-model image-input support (carried over from 0.23.4).

What shipped in vLLM in May 2026?

vLLM had a quieter calendar month but a heavier engineering month: v0.21.0 landed on May 15, 2026 as a stabilisation release on top of the v0.20.0 base that introduced DeepSeek V4. Plus an EAGLE 3.1 reveal late in the month for the next minor.

vLLM v0.21.0 — May 15, 2026

DeepSeek V4 stabilisation + perf — primary focus of the release.
KV Offload now integrates with the Hybrid Memory Allocator, including scheduler-side sliding-window group support and full HMA enablement. Concrete throughput win on hybrid-architecture models that previously wasted up to 79.6% of KV-cache capacity on mllama, 56.25% on Ministral.
Speculative decoding now respects reasoning / thinking budgets — correct spec decode on reasoning models, which had been a quiet correctness bug.
New TOKENSPEED_MLA attention backend for DeepSeek-R1 / Kimi-K2.5 prefill and decode on Blackwell GPUs.
Persistent MLA for the sparse backend; faster per-token FP8 group-quant packed kernel; FP8 on NVIDIA Thor / SM110; CUTLASS scaled mm for non-compatible sizes.
New model architectures: MiMo-V2.5, Laguna XS.2, Moondream3, Qianfan-OCR, Cohere MoE, Cohere Eagle.
Speculative decoding added for EAGLE-for-Mistral, Gemma 4 MTP, MTP for MiMo-V2.5, Cohere Eagle.
Breaking: C++20 is now required for PyTorch compatibility; Transformers v4 is deprecated. New env var VLLM_SKIP_MODEL_NAME_VALIDATION is your escape hatch.
Docker image trimmed ~2.5 GB via deferred FlashInfer cubin download. vLLM ≥ 0.21.0 uses ROCm 7.2.2.

A v0.21.1rc0 also went up on May 15 but didn't reach GA by the end of the month.

EAGLE 3.1 — May 26, 2026 (lands in v0.22.0)

Joint EAGLE-team / vLLM / TorchSpec announcement. EAGLE 3.1 fixes a long-running correctness regression where speculative decoding loses acceptance length under long context, unusual chat templates, and out-of-distribution system prompts. The root cause is "attention drift": as speculation depth increases, the drafter gradually shifts attention away from sink tokens toward its own generated tokens. The fix adds FC normalization after each target hidden state and feeds post-norm hidden states into the next decoding step.

Result: up to 2x longer acceptance length on long-context workloads vs EAGLE 3, better robustness to chat templates and system prompts, more stable acceptance under varied serving conditions. It is config-driven and backward-compatible with EAGLE 3 checkpoints. Ships in vLLM v0.22.0.

Quick context if you're picking between runtimes for production: vLLM benchmarks from May 2026 show ~4,741 tok/s on GPT-OSS-120B at 100 concurrent on 2x H100s, and roughly 2.3x Ollama's throughput under 8 concurrent users on Llama 3 8B. The throughput gap widens as concurrency rises. For more on this trade-off, see our self-hosting LLMs guide.

What shipped in llama.cpp in May 2026?

llama.cpp does not use semver — it ships continuous builds. April's range was b8607 through b8779; May continued the cadence with notable milestones at b9196 (May 18, 2026, Windows prebuilts for CUDA 13.1, Vulkan, HIP, SYCL) and a build on May 28 that bumped to CUDA 13.3 DLLs.

MTP speculative decoding merged (PR #22673)

The big landing in May: Multi-Token Prediction support by am17an, merged into the mainline master branch. Works against Qwen 3.6's MTP heads — the model drafts a few tokens ahead and the verifier checks them in one pass.

Qwen 3.6 27B dense: ~2x generation throughput in single-user scenarios.
Qwen 3.6 35B-A3B MoE: more complicated. At batch=1 the expert-union overhead — every drafted token has a high chance of pulling a different expert slice into compute, and the verifier loads the union — can wipe out the speed gain. Consumer-grade benchmarks on RTX 3090 show no net speedup over the autoregressive baseline on MoE single-stream. Production servers can amortise via batching; solo users running MoE locally should not expect MTP to be a free win. Practical local-LLM workflows often involve choosing between dense and MoE for exactly this reason.

TurboQuant tracking

Discussion #20969 in the llama.cpp tracker mirrors the ICLR 2026 TurboQuant paper (Zandieh et al.) with a working CPU implementation (18/18 tests passing, MSE within 1% of the paper). TQ3 gives 4.9x compression vs FP16; TQ4 gives 3.8x. Forks (atomic-llama-cpp-turboquant) bundle it with Gemma 4 MTP and Qwen 3.6 NextN speculative decoding and report +30-50% throughput.

Backend perf updates

CUDA: kernel fusion accelerates token generation; the mmvq small-batch path cut eval time ~40% vs vLLM on Qwen single-stream. On RTX 4090, baseline 77 t/s -> optimised 96 t/s (+24%), prompt eval +96% to 252 t/s.
Metal: optimised concat kernel with row batching for small widths to improve GPU occupancy; test_cpy extended for reshape ops; GGML_OP_SET kernel thread fix.
Vulkan: now genuinely competitive — on Qwen Coder Next, Vulkan can beat CUDA by ~40% on some hardware. Vulkan still has the broadest hardware coverage: Nvidia, AMD, Intel, older GPUs, Apple/Asahi via MoltenVK.
Multi-backend builds via -DGGML_CUDA=ON -DGGML_VULKAN=ON; runtime --device flag to pin backend.

What shipped in MLX in May 2026?

The MLX core (ml-explore/mlx) is on the 0.31.x line — roughly a release every 3-4 weeks. 0.31.2 is the current docs reference; recent patches (0.31.1) added CUDA quantized GEMV and fp16 accumulation for 4-bit GEMV. ml-explore/mlx was updated May 24, 2026.

The big M5 story

Apple's research team published "Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU" in May 2026. The takeaway: every M5 GPU core now has dedicated matrix-multiplication hardware (Neural Accelerators), and MLX is the only framework that targets them.

Up to 4x speedup for time-to-first-token on language model inference vs M4 baseline.
3.8x faster generating a 1024x1024 image with FLUX-dev-4bit (12B) vs M4.
30-60% faster on most workloads on M5 vs M4 overall.
M5's 153 GB/s memory bandwidth contributes +19-27% generation speed vs M4 even before the Neural Accelerators kick in.
Requires macOS 26.2 or later AND MLX 0.30.0+ to unlock the Neural Accelerators path. If you are stuck on an older macOS on an M5, you are only getting the memory-bandwidth gains.

MLX for VLMs

For vision-language models, mlx-vlm ships separate handling of the vision projector (which most cross-framework runners do not). Qwen 3.6 (35B-A3B released April 16, 27B released April 22) has both mlx-lm and mlx-vlm support, with community Hugging Face repos (Unsloth) shipping 4-bit and 8-bit MLX quants ready to load.

MLX in Ollama and LM Studio

The interesting structural shift in May: Ollama and LM Studio both lean harder on MLX as the Apple-Silicon backend. Ollama 0.24's reworked MLX sampler and 0.23.1's Gemma 4 MTP-via-MLX both rely on the upstream framework. LM Studio's mlx-engine v1.8.1 added parallel predictions for vision models. The practical implication: if you are on Apple Silicon, you are increasingly running MLX under the hood no matter which wrapper you pick.

What shipped in LM Studio in May 2026?

LM Studio shipped two notable builds on the 0.4.13 line in May.

0.4.13 Build 1 — May 13, 2026

mlx-engine v1.8.1 — significant performance improvements.
Parallel predictions for vision-capable models: Qwen 3.5, Qwen 3.6, Gemma 4. This is the "run multiple vision inferences in parallel on one model load" surface that finally makes local VLMs feel usable for batch workloads.
Fixed a chat-input bug where newlines were compacted on paste.
Bug fixes and security hardening; recommended for all users.

0.4.14 — late May 2026

Stable release of MTP Speculative Decoding in 0.4.14 Build 4 — speeds up generation with models that ship built-in multi-token-prediction heads (Gemma 4, Qwen 3.6 inherit the speed-up).
Real-world throughput: 1.5x to 3x tokens/sec depending on model, task, and hardware. Unlike classic draft-model speculative decoding, MTP doesn't need a second model loaded in VRAM — the heads are part of the target model.

Context from earlier 2026 still relevant

MCP Host since 0.3.17 — connect MCP servers to the app and expose tools to local models.
OAuth for MCP servers since 0.4.10 — supports MCP servers that require sign-in.
llmster headless daemon — shipped with 0.4.0 as the GUI-less core for Linux / Windows / Mac headless deploys.
REST API on localhost:1234 mirrors OpenAI's /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models; OpenAI- and Anthropic-compatible.

How do the May 2026 changes affect runtime selection?

The May 2026 batch did not redraw the boundaries between runtimes, but it sharpened them:

Ollama is now the cleanest path to wire local OR cloud models into Codex App, Claude Code, OpenCode, Droid, and similar developer surfaces with a single ollama launch command. Strongest case for the "I just want a Codex / Claude Code / Cursor backend" user. Claude Desktop is the explicit exception — Anthropic's third-party hook is model-restricted, so Ollama withdrew that integration in 0.23.2.
vLLM remains the production-serving choice. v0.21.0 makes the DeepSeek V4 + Kimi K2.6 + Cohere MoE story production-grade on Blackwell, and EAGLE 3.1 (v0.22.0, coming) will make long-context speculative decoding reliable for the first time.
llama.cpp remains the most portable, lowest-floor runtime — and now has real MTP speculative decoding for Qwen 3.6 dense models. Caveat: MoE single-stream MTP is not a win on consumer hardware.
MLX is now mandatory on M5 — if you bought one and aren't using MLX directly or through Ollama/LM Studio, you're leaving 3-4x perf on the table. Requires macOS 26.2.
LM Studio is the easiest GUI front-end for MLX with the new parallel-vision and MTP work. Best for solo developers who want a desktop UI plus an OpenAI-compatible local API.

How Codersera helps teams self-host

If you are setting up a local-LLM stack — picking a runtime, sizing GPUs, wiring tools — Codersera engineers handle the choices our content covers: vLLM production deploys on Blackwell, Apple Silicon dev rigs, Ollama integration with Claude Code and Codex workflows. We supply vetted developers who have already shipped on these runtimes. Extend your engineering team when the bench is overloaded.

Updated for June 2026: what to watch next

This changelog is kept current as the runtimes keep shipping. As of early June 2026, here is what carried over from May and what to track next — check each project's release page before you upgrade, because the cadence is fast and breaking changes land without much warning.

vLLM v0.22.0 (EAGLE 3.1). The biggest pending item from May. EAGLE 3.1 was announced May 26, 2026 to fix the long-context "attention drift" acceptance-length regression and ship inside v0.22.0. It is config-driven and backward-compatible with EAGLE 3 checkpoints, so existing speculative-decoding setups upgrade in place. The v0.21.1rc0 cut that went up on May 15 is also worth watching for GA. Track the vLLM releases page.
Ollama post-0.24. May closed on 0.24.0 (Codex App support, reworked MLX sampler, cached /api/show). The ollama launch integration surface is where new work keeps landing — confirm any desktop-app integration you script against is still supported before relying on it, given the Claude Desktop add-then-remove in 0.23.x.
llama.cpp continuous builds. May ended with a build bumping to CUDA 13.3 DLLs past b9196. The MTP merge (PR #22673) is now in master, so dense Qwen 3.6 gets the ~2x single-user win out of the box — but the MoE single-stream caveat still stands on consumer GPUs.
MLX on M5. The macOS 26.2 + MLX 0.30.0+ requirement to unlock the M5 Neural Accelerators (up to 4x TTFT) is unchanged. If you are on an M5 and not on macOS 26.2, that path is still locked. Watch the MLX releases for the 0.31.x to 0.32 line.
LM Studio. MTP speculative decoding is stable as of 0.4.14 Build 4; the practical 1.5x-3x throughput range depends on model and hardware, and no second draft model is loaded into VRAM.

Bottom line: May 2026 sharpened the boundaries between these runtimes rather than redrawing them, and June's releases are mostly follow-through on that work. If you are standing up a stack now, the runtime-selection guidance above still holds. For the broader trade-offs, the full runtime comparison and the self-hosting LLMs guide stay the canonical references.

FAQ

Which local AI runtime should I use in May 2026?

If you are on an Apple Silicon Mac for personal use: MLX directly, or LM Studio / Ollama on top of MLX. If you are running a production inference server with concurrency: vLLM v0.21.0. If you want maximum portability and don't need top throughput: llama.cpp. If you want a desktop GUI plus an OpenAI-compatible local API: LM Studio. See our full comparison.

What was the biggest Ollama release in May 2026?

v0.24.0 on May 14, 2026 added ollama launch codex-app for OpenAI's desktop Codex experience, reworked the MLX sampler for better generation quality on Apple Silicon, and cached /api/show responses for ~6.7x lower median latency on integrations like VS Code. Note: Claude Desktop support was added in 0.23.0 then removed in 0.23.2 because the third-party hook was limited to Anthropic models — not because of an Ollama bug.

Does vLLM v0.21.0 break my existing setup?

Two breaking changes worth checking before you upgrade. First, C++20 is now required for PyTorch compatibility — your build toolchain needs to support it. Second, Transformers v4 is deprecated — pin v5 or accept the deprecation warnings now and migrate before they become errors. There is also a new VLLM_SKIP_MODEL_NAME_VALIDATION env var if model-name validation is too strict for your custom checkpoints.

Does llama.cpp's new MTP support speed up MoE models like Qwen 3.6 35B-A3B?

It speeds up the dense Qwen 3.6 27B by roughly 2x in single-user scenarios. For the 35B-A3B MoE at batch=1, multiple independent benchmarks on consumer GPUs (RTX 3090) show no net speedup over the autoregressive baseline — the expert-union overhead in the verifier pass eats the win. Production servers that batch requests can still benefit. If you are running MoE locally on one GPU, do not expect MTP to be a free win.

What does MLX on M5 actually give me?

Up to 4x faster time-to-first-token for LLM inference, 3.8x faster FLUX 1024x1024 image generation, and 30-60% faster across most ML workloads compared to M4. You need macOS 26.2 or later AND MLX 0.30.0+ to unlock the Neural Accelerators path. If you are running older macOS on an M5, you are only getting the memory-bandwidth gains (still +19-27% vs M4).

Is LM Studio the right way to use MLX?

It is one of the easiest. LM Studio 0.4.13's mlx-engine v1.8.1 added parallel predictions for vision models, and 0.4.14 Build 4 promoted MTP speculative decoding to stable. The MCP Host functionality from 0.3.17 plus OAuth-for-MCP from 0.4.10 also make it a viable tool-using local agent surface. The llmster daemon means you don't need the GUI to use it.

When will EAGLE 3.1 be in vLLM?

The announcement (May 26, 2026) confirms EAGLE 3.1 ships in vLLM v0.22.0. As a config-driven extension it stays backward-compatible with EAGLE 3 checkpoints, so existing speculative-decoding setups will not need to be rebuilt.