Quick answer. For one-developer prototyping on any OS, pick Ollama. For a GUI-first model browser on Mac or Windows, pick LM Studio. For multi-user production serving on NVIDIA or AMD GPUs, pick vLLM — it hits roughly 16-20x Ollama’s concurrent throughput thanks to PagedAttention and continuous batching. For embedded or weird-hardware deployments, drop down to llama.cpp directly. On Apple Silicon, MLX is now the fastest path — and Ollama 0.19+ uses MLX under the hood on M-series chips.
In 2026, “run an LLM locally” is no longer one decision — it’s five. The ecosystem has cleanly bifurcated into runtimes optimized for different workloads: hobby vs production, CLI vs GUI, NVIDIA vs Apple Silicon, single-user vs concurrent serving. Pick the wrong one and you either burn three days wiring up paged attention you didn’t need, or you ship a prototype that falls over at 10 concurrent users.
This guide compares the five dominant runtimes head-to-head: Ollama, LM Studio, vLLM, llama.cpp, and MLX. Per-runtime deep dives, a throughput-and-feature matrix, and a decision tree at the bottom so you can ship today.
The five runtimes at a glance
| Runtime | Primary form | Best for | Model format | OS / hardware sweet spot | Concurrent throughput |
|---|---|---|---|---|---|
| Ollama | CLI + REST API | Developer prototyping | GGUF (llama.cpp-derived), MLX on M-series | macOS / Linux / Windows; single user | ~40 tok/s peak |
| LM Studio | Desktop GUI + headless mode | Model browsing, GUI users, solo devs | GGUF + MLX | macOS / Windows; integrated GPU friendly | ~50-90 tok/s with continuous batching |
| vLLM | Python server | Production multi-user serving | HF safetensors, AWQ, GPTQ, FP8 | Linux + NVIDIA/AMD GPUs (A100, H100, MI300); Windows via WSL2; Mac via vllm-metal plugin | ~800-12,500 tok/s |
| llama.cpp | C/C++ binary + server | Embedded, edge, weird hardware | GGUF (it invented the format) | Anywhere a C compiler runs — CPU, CUDA, ROCm, Vulkan, Metal | Comparable to Ollama (Ollama wraps it) |
| MLX | Python library | Apple Silicon native, research | MLX safetensors, mlx_lm format | macOS only; M1+ chips | ~130-230 tok/s single user |
The headline: these are not direct substitutes. Ollama and LM Studio are experience layers; llama.cpp and MLX are engines; vLLM is a serving system. Ollama and LM Studio both wrap llama.cpp (and increasingly MLX on Mac). vLLM is its own animal — built ground-up for GPU serving, not local-first development.
What is Ollama and when should you use it?
Ollama is the easiest entry point to local LLMs. A one-line install, a Docker-style ollama pull qwen3.5 to grab a model, and ollama run qwen3.5 to chat. Under the hood it’s a Go process wrapping llama.cpp (on x86 / non-Apple hardware) or MLX (on Apple Silicon as of Ollama 0.19, March 2026).
Strengths.
- Zero-config model management — one command pulls, quantizes, and serves.
- OpenAI-compatible REST API at
localhost:11434. Most agentic frameworks (Cursor, Continue, Aider, OpenWebUI) target Ollama by default. - Now MLX-accelerated on M-series Macs — the 0.19 preview saw decode jump from 58 to 112 tok/s on an M5 Max running Qwen 3.5-35B-A3B.
Weaknesses.
- One concurrent request at a time by default. Under load, throughput collapses; concurrent benchmarks show vLLM hitting ~793 tok/s while Ollama tops out at ~41 tok/s.
- No GUI — CLI-only unless you bolt on OpenWebUI.
- Model library is curated; rare or custom models still require manual GGUF conversion.
Pick Ollama if you’re a developer prototyping locally, building an agent on top of an OpenAI-compatible API, or running an internal tool for <5 users. For deeper Ollama-specific workflows, the self-hosting LLMs complete guide covers production hardening, reverse-proxy setup, and model selection.
What is LM Studio and when should you use it?
LM Studio is the GUI-first answer to Ollama. It looks like a chat app, but its real superpower is the built-in model browser — you search Hugging Face inside the app, see quantization recommendations based on your RAM and GPU, and click Download. No CLI, no JSON config.
What changed in 2026.
- 0.4.0 (Jan 2026) added llmster, a pure headless mode for servers and CI. LM Studio can now run on a box with no display attached.
- 0.4.0 also shipped continuous batching via llama.cpp’s parallel-slot feature, finally letting LM Studio serve multiple concurrent requests.
- 0.4.2 (Feb 2026) extended continuous batching to the MLX engine. On Mac, LM Studio is now genuinely competitive with vLLM for small concurrent workloads.
Strengths.
- Polished UX for non-developers — designers, PMs, and writers can run a 14B model without touching a terminal.
- Excellent Vulkan support, which means integrated GPUs (AMD APUs, Intel Arc) often outperform Ollama’s CUDA-only fast path.
- Built-in playground, system-prompt presets, and chat history persistence.
Weaknesses.
- Not open source (the engine is, the desktop shell isn’t). Some teams won’t deploy it for licensing reasons.
- No Linux desktop — only macOS, Windows, and Linux headless via llmster.
- Resource overhead from the Electron shell — ~300-500 MB RSS before you even load a model.
Pick LM Studio if you want a model browser as good as the App Store, if your team isn’t shell-fluent, or if you’re on an AMD APU laptop where Vulkan offloading actually matters.
What is vLLM and when should you use it?
vLLM is the production serving runtime — the one you reach for when “works on my Mac” is no longer the goal and “serves 500 concurrent users at 80ms P99” is. It was built at UC Berkeley in 2023 and is now the standard inference engine behind most open-source LLM API providers (Together, Anyscale, Fireworks, plus a long tail of self-hosters).
What makes vLLM different.
- PagedAttention. The KV cache is managed in fixed-size blocks like OS virtual memory pages, eliminating the 60-80% memory waste that naive implementations suffer. Wasted VRAM drops to under 4%.
- Continuous batching. Requests are batched at the token level, not the request level. A new user’s prompt joins an in-flight batch on the next decode step instead of waiting for the current batch to finish.
- Tensor parallelism. Split a 70B model across 4xH100s with one config line; vLLM handles the all-reduce.
- Prefix caching. Shared system prompts are computed once and reused across requests — massive win for RAG and agent workflows.
Throughput. On a single A100 80GB, vLLM hits ~2,500 tok/s for Llama 2 7B and ~800 tok/s for 13B. On an H100, Llama 3.1 8B in BF16 reaches ~12,500 tok/s. Under 8-concurrent-user benchmarks, vLLM delivers ~2.3x higher throughput than Ollama; under heavy load the gap widens to 16-20x.
Weaknesses.
- Linux + NVIDIA/AMD GPUs is the supported path. No native Windows (use WSL2). Mac is only via the community
vllm-metalplugin, not core vLLM — treat it as experimental. - Steep learning curve. You’ll touch Python, CUDA versions, ROCm if on AMD, NCCL for multi-GPU. Plan a day of devops, not an evening.
- Models are happiest in HF safetensors (with AWQ, GPTQ, FP8, or NVFP4 quantization). vLLM does have experimental GGUF loading since v0.7, but throughput is significantly below llama.cpp’s native GGUF path and the maintainers have floated deprecating it. Treat vLLM and llama.cpp as living in different format universes — don’t plan production around vLLM-on-GGUF.
Pick vLLM when you’re serving an API to >500 requests/hour with SLAs, when you have at least one A100/H100/MI300 (or rent one), and when single-digit-millisecond P99 matters. The Llama 4 complete guide covers vLLM deployment recipes for the Meta-released models.
What is llama.cpp and when should you use it directly?
llama.cpp is the C/C++ reference implementation that started the local-LLM movement. Georgi Gerganov’s original goal was to run Llama on a MacBook with no Python. Five years later, llama.cpp is the actual inference engine behind Ollama, LM Studio, Jan, OpenWebUI, GPT4All, and a dozen other wrappers. When you use Ollama, you’re using llama.cpp with a friendlier face.
Why use llama.cpp directly?
- Weird hardware. Llama.cpp runs on Raspberry Pi, Android phones, RISC-V boards, and 10-year-old laptops with AVX2 CPUs. Ollama needs a more modern target.
- Embedded. If you’re shipping a desktop app that bundles an LLM, llama.cpp is a 5 MB static binary you can ship inside your app. Ollama assumes a system daemon.
- Quantization control. Llama.cpp lets you create your own GGUFs with 1.5-bit through 8-bit integer quantization, importance matrices for ≤3-bit accuracy retention, and per-tensor mixing. Q4_K_M is the default sweet spot; Q5_K_M trades 20% size for noticeably better quality.
- The lowest level of abstraction. If something breaks at midnight, llama.cpp is open C++ you can read and patch. Ollama is a Go process wrapping it.
Throughput on the same hardware sits between raw llama.cpp and Ollama. Ollama wraps llama.cpp in a Go process that adds ~50% overhead on Mac — community benchmarks have raw llama.cpp Metal at ~89 tok/s on the same M4 Max where Ollama hits ~43 tok/s. On Apple Silicon, llama.cpp’s Metal backend still lags the MLX-native path (roughly 1.4–1.6x on dense models, up to 3x on MoE), and the 2026 community is migrating Mac workloads to MLX where available.
Pick llama.cpp directly when you need embedded deployment, custom quantization, or guaranteed open-source license clarity. Most other times, use Ollama and inherit llama.cpp’s strengths.
What is MLX and why does it matter on Apple Silicon?
MLX is Apple’s ML framework, released late 2023 and now (in 2026) the fastest path to local inference on M-series Macs. It was designed from day one for the unified memory architecture — CPU and GPU share the same memory pool, so you avoid the GPU-to-CPU copies that dominate inference on discrete-GPU systems.
Why MLX wins on Apple Silicon.
- Unified memory. A 70B model in 4-bit fits in 40 GB of unified RAM on an M3 Max; on a discrete-GPU system you’d need a 48GB GPU AND 40GB of system RAM for the staging copy.
- M5 Neural Accelerators. Apple’s M5 chip ships dedicated matrix-multiplication units. Apple’s own benchmarks show up to 4x speedup in time-to-first-token vs M4 baseline.
- Real measured throughput. On a Mac mini M4 Pro running Qwen3-Coder-30B-A3B (a MoE model), MLX achieved ~130 tok/s vs Ollama’s legacy llama.cpp backend at ~43 tok/s — a 3x gap that’s partly Ollama’s Go-wrapper overhead (raw llama.cpp Metal on the same chip clocks ~89 tok/s). MLX-vs-raw-llama.cpp is closer to 1.4–1.6x on dense models, up to 3x on MoE. Under steady-state with the prompt cache warm, MLX sustained ~230 tok/s.
How MLX changed Ollama. On March 30, 2026, Ollama announced its Apple Silicon path is now powered by MLX. Ollama 0.19 on an M5 Max running Qwen 3.5-35B-A3B saw prefill jump from 1,154 to 1,810 tok/s (57% faster) and decode jump from 58 to 112 tok/s (93% faster). LM Studio’s MLX engine sees similar gains.
You rarely interact with MLX directly. You install Ollama or LM Studio, they detect Apple Silicon, and MLX runs underneath. Direct MLX use makes sense for ML researchers fine-tuning models on Mac with mlx_lm.lora, or for production Mac deployments where you want zero abstraction overhead.
How do throughput and memory actually compare?
Single-user, all five runtimes feel about the same on the same hardware — ~30 tok/s on an RTX 4090 with a 24B model. The differences show up in three places:
- Concurrent serving. vLLM’s PagedAttention and continuous batching give it a 16-20x throughput advantage over Ollama under load. If you’re serving >10 simultaneous users, this is the only number that matters.
- Memory efficiency. vLLM wastes <4% of VRAM on KV cache fragmentation. llama.cpp/Ollama waste 30-50% by default. On a 24GB card with a 70B Q4 model, that’s the difference between “it loads” and “OOM at 2K context.”
- Apple Silicon. MLX is 2-3x faster than the legacy llama.cpp Metal path. If you’re on a Mac and not using an MLX-backed runtime, you’re leaving half your performance on the table.
Which models does each runtime support?
- Ollama: ~150 curated models in its library, plus any GGUF you import. Llama 3/4, Qwen 3.5, DeepSeek V4, Gemma 4, Mistral, Phi, Kimi K2.6, all the usual suspects.
- LM Studio: Any GGUF or MLX model from Hugging Face. The model browser ranks results by your hardware capability.
- vLLM: Any Hugging Face transformer model in safetensors. The model catalog is the largest of the five — if it’s on HF and the architecture is supported, vLLM runs it.
- llama.cpp: Anything with a GGUF, plus its own conversion scripts from HF safetensors. The longest tail of supported architectures — older models, fine-tunes, and merges all work.
- MLX: Any model on Hugging Face in
mlx-community/namespace, plus self-converted viamlx_lm.convert. Coverage of recent flagship models is excellent; the long tail of fine-tunes lags llama.cpp.
For an opinionated tour of which open-weight models are worth running locally in 2026, see the open-source LLMs landscape and the deep dive on DeepSeek V4.
How do deployment patterns differ?
- Ollama: System daemon on
localhost:11434. Drop in behind nginx, expose as an OpenAI-compatible API. Docker image is the most common production form. - LM Studio: Either desktop GUI (single user) or llmster headless on a server. Best for internal teams with one shared Mac mini or workstation.
- vLLM: Containerized Python serving on Kubernetes, with horizontal pod autoscaling and a load balancer in front. The full production pattern: vLLM + Ray Serve + a feature flag store + per-tenant rate limits.
- llama.cpp: Either embedded in another binary (Tauri apps, Electron with native modules) or as
llama-serverbehind nginx. Lowest operational overhead of the five. - MLX: Python script with
mlx_lm.server, or wrapped by Ollama/LM Studio. Production Mac deployments typically run mlx_lm.server behind a launchd service.
Decision tree: which runtime should you pick?
- Are you serving an API to >100 concurrent users with SLAs? — vLLM on Linux + NVIDIA/AMD. No other option scales the same way.
- Are you on a Mac with an M-series chip? — Ollama (which now uses MLX) for general work, LM Studio for GUI-first workflows, direct MLX for research and fine-tuning.
- Are you on Windows or AMD/Intel iGPU? — LM Studio. Its Vulkan offloading beats Ollama’s CUDA-only fast path on integrated GPUs.
- Are you a developer prototyping an agent or building an internal tool? — Ollama. Best-in-class OpenAI compatibility and a model library that just works.
- Are you shipping a desktop app that embeds an LLM? — llama.cpp. The only runtime designed to be statically linked into your binary.
- Are you on Raspberry Pi, mobile, or pre-2020 hardware? — llama.cpp directly. Ollama’s minimum bar is higher.
- Are you a researcher fine-tuning or experimenting with custom architectures? — MLX on Mac, vLLM (via its training-compatible siblings) on Linux GPUs.
The bottom line
The five runtimes are not competitors so much as different layers of the same stack. llama.cpp and MLX are the engines. Ollama and LM Studio are the developer-experience layers on top. vLLM is its own thing — built for a different problem (concurrent serving) on different hardware (data-center GPUs).
Most teams in 2026 use at least two: Ollama or LM Studio for local development and demos, plus vLLM when something ships to production. That’s the right pattern. Don’t try to use vLLM as your laptop chat client, and don’t try to use Ollama as your customer-facing API.
FAQ
Is Ollama just llama.cpp with a wrapper?
Historically yes — on x86 hardware, Ollama is a Go process wrapping llama.cpp. As of Ollama 0.19 (March 2026), the Apple Silicon path uses MLX instead. So Ollama is now “llama.cpp on x86, MLX on M-series, plus a model registry and HTTP API on top.”
Can vLLM run GGUF models?
Yes, but only as an experimental single-file loader (added in v0.7), and throughput is significantly below llama.cpp’s native GGUF path. The vLLM maintainers have floated deprecating GGUF and bitsandbytes support entirely because usage is <1% of installs. Production-safe vLLM means HF safetensors with AWQ, GPTQ, FP8, or NVFP4 — not GGUF. If you have a GGUF you want at production scale, re-quantize from the original safetensors weights or stay on llama.cpp.
Does LM Studio have a headless mode for servers?
Yes, since version 0.4.0 (January 2026). The feature is called llmster and runs LM Studio’s inference engine on a box with no display attached. It also added continuous batching via llama.cpp’s parallel-slot support, which 0.4.2 extended to the MLX engine.
What is the fastest local LLM runtime on Mac in 2026?
MLX, either used directly or via Ollama 0.19+/LM Studio. On a Mac mini M4 Pro running Qwen3-Coder-30B-A3B, MLX hit ~130 tok/s vs the legacy llama.cpp Metal backend at ~43 tok/s. On M5 chips with the new Neural Accelerators, Apple measured up to 4x improvements in time-to-first-token vs M4.
How much faster is vLLM than Ollama under concurrent load?
Roughly 16-20x at peak throughput. At 8 concurrent users, Red Hat’s 2026 benchmarks show ~2.3x. At higher concurrency, vLLM hits ~793 tok/s vs Ollama’s ~41 tok/s, with P99 latency of 80 ms vs Ollama’s 673 ms. The gap is PagedAttention plus continuous batching — Ollama doesn’t do either.
Can I use the same model file across all five runtimes?
No. GGUF (used by Ollama and llama.cpp) is incompatible with vLLM’s safetensors-based formats, and MLX uses its own format converted via mlx_lm.convert. In practice you download separate copies of a model in each format from Hugging Face — storage is cheap, and the conversion overhead isn’t worth doing yourself.
Which runtime should I learn first as a developer?
Ollama. The investment compounds: every agentic framework speaks Ollama’s API, and the patterns transfer to vLLM (which also exposes an OpenAI-compatible endpoint) when you scale up. Add llama.cpp directly only when you need embedded deployment, and add MLX only when you’re optimizing on Mac.
Is local-LLM inference actually faster than calling Claude or GPT-5?
For first-token latency on a single request, yes — a local Qwen 3.5-30B on MLX hits time-to-first-token under 200 ms vs 400-800 ms over the network to a hosted API. For raw throughput on long generations, no — hosted APIs run on H100/H200 clusters with batched inference you can’t replicate at home. Local wins on latency, privacy, and cost-per-token-once-amortized; hosted wins on absolute throughput and zero ops.