LM Studio Complete Guide (2026): Run Local LLMs With a Real GUI

What LM Studio is, how to install it on Mac, Windows and Linux, how the OpenAI-compatible server works, MLX vs llama.cpp on Apple Silicon, document chat (RAG), the lms CLI, and where it beats Ollama and llama.cpp.

Quick answer. LM Studio is a free desktop app from Element Labs that downloads and runs open-source LLMs (Llama, DeepSeek, Qwen, Mistral, Gemma, Phi) entirely on your machine. It ships a chat UI, a Hugging Face model browser, an OpenAI-compatible API server on localhost:1234, document chat (RAG), an lms CLI, and the llmster headless daemon. On Apple Silicon it can use Apple's MLX backend, which is typically 30 to 50 percent faster than llama.cpp on Metal. It runs on Windows 10+, macOS 14+ (Apple Silicon only) and Linux, free for personal use.

Updated 2026-05-23.

What LM Studio is

LM Studio is a desktop application for discovering, downloading and running open-source large language models on your own hardware. It is built by Element Labs and is free for personal use. Where Ollama is a server with a CLI bolted on, and llama.cpp is a C++ inference engine, LM Studio is a polished GUI first, with a server, CLI and headless daemon layered underneath.

It targets three audiences in one product: a non-engineer who wants to chat with Llama 3.3 on their laptop, an engineer who wants an OpenAI-compatible endpoint for prototyping, and an ops team that needs a headless inference daemon on a Linux box. The feature surface reflects that mix: a Hugging Face model browser, a chat window with RAG, a Developer tab for serving the model over HTTP, and an lms CLI for scripting.

For the broader landscape (Ollama, llama.cpp, vLLM, Jan, Open WebUI) see our self-hosting LLMs guide and the head-to-head OpenClaw vs LM Studio vs Ollama comparison. This guide is the LM Studio canonical reference.

What problem does LM Studio solve?

Running an open-source LLM locally used to mean cloning llama.cpp, compiling it with the right BLAS flags, hunting down a GGUF quant on Hugging Face, and writing a shell script to start a server. LM Studio collapses that into three clicks: install the app, search for a model, click Download. Two more clicks load it into memory and start a chat.

The deeper problem is offline-by-default inference. Nothing in LM Studio's chat path phones home. Models live on your disk, inference happens on your CPU or GPU, and the OpenAI-compatible server binds to localhost unless you explicitly expose it. For regulated environments (legal, healthcare, finance) this is often the only acceptable shape.

Install LM Studio on Mac, Windows and Linux

Grab the installer from lmstudio.ai. The download page detects your OS automatically.

  • macOS: requires macOS 14 (Sonoma) or newer and an Apple Silicon Mac (M1/M2/M3/M4 family). Intel Macs are unsupported as of LM Studio 0.3.x.
  • Windows: Windows 10 or 11, x64 or ARM64. GPU acceleration via CUDA (NVIDIA), Vulkan or DirectML.
  • Linux: AppImage build covering most distros. GPU acceleration via CUDA (NVIDIA) or ROCm (AMD).

The installer is a normal desktop package (.dmg, .exe, .AppImage). No Docker, no Python venv, no Homebrew formula required. First launch downloads a small runtime bundle for your platform (llama.cpp build, plus MLX engine on macOS).

Loading your first model (GGUF and MLX)

Open the Discover tab in the left sidebar. The search box queries Hugging Face directly. Type a model name ("llama-3.3-8b", "qwen2.5-coder", "deepseek-coder") and you will see every quant available, sized for your machine. LM Studio annotates each quant with a green or yellow badge telling you whether it will fit in your RAM/VRAM at full GPU offload.

Two formats matter:

  • GGUF is the cross-platform format llama.cpp uses. Works on every OS, every GPU backend. Quants like Q4_K_M, Q5_K_M, Q8_0 trade size for quality.
  • MLX is the model format for Apple's open-source MLX framework, used only on Apple Silicon Macs. Same model, repackaged to run on the Apple GPU with unified memory. MLX does not target the Neural Engine.

On an M-series Mac, prefer MLX where available. LM Studio's MLX backend is typically 30 to 50 percent faster than its llama.cpp/Metal backend on the same hardware, with lower memory pressure. The chat tab works identically once the model is loaded.

The chat UI and document chat (RAG)

The Chat tab is the obvious entry point: a familiar message thread with a system prompt panel on the right. You can spin up multiple conversations, fork from a message, regenerate, edit prior turns, and tweak sampling (temperature, top-p, top-k, repeat penalty) live.
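For readers who want to know what those sampling knobs actually do, here is a minimal sketch of temperature, top-k and top-p filtering over a toy set of logits. It illustrates the general technique only; it is not LM Studio's or llama.cpp's implementation, repeat penalty is left out, and the function name and default values are assumptions chosen for the example.

```python
import math


def apply_sampling_filters(logits, temperature=0.8, top_k=40, top_p=0.95):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature rescales logits: lower values sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Softmax, subtracting the max for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(p for _, p in kept)
    return {tok: p / mass for tok, p in kept}
```

The sampler then draws the next token from the returned distribution. Lowering top_k or top_p narrows the candidate set; raising temperature flattens it.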

The non-obvious feature is the paperclip icon: attach a PDF, TXT, DOCX, Markdown or code file and LM Studio will either (a) inline it into the context if it fits, or (b) chunk and embed it for retrieval-augmented generation (RAG). The dual-mode behaviour is automatic, based on document length and the loaded model's context window.
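The dual-mode decision can be pictured with a rough sketch like the one below. LM Studio's real thresholds and chunking rules are not documented here, so the 4-characters-per-token heuristic, the reserved-token budget and the chunk sizes are all assumptions, and the function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token count; about 4 characters per token is a common English heuristic."""
    return max(1, len(text) // chars_per_token)


def choose_mode(document, context_window, reserved_tokens=1024):
    """Return "inline" if the document fits next to the conversation, else "rag"."""
    # Keep some of the window free for the system prompt, chat history and the reply.
    budget = context_window - reserved_tokens
    return "inline" if estimate_tokens(document) <= budget else "rag"


def chunk_text(text, chunk_chars=800, overlap_chars=100):
    """Split text into overlapping character windows ready for embedding."""
    if overlap_chars >= chunk_chars:
        raise ValueError("overlap_chars must be smaller than chunk_chars")
    step = chunk_chars - overlap_chars
    return [text[start:start + chunk_chars] for start in range(0, len(text), step)]
```

A short README would come back as "inline" against an 8K context window, while a long PDF would be routed to "rag" and split by chunk_text. The overlap keeps sentences that straddle a boundary retrievable from either side.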

RAG mode uses nomic-ai/nomic-embed-text-v1.5-GGUF as the default embedding model and ChromaDB-style chunk retrieval under the hood. You can swap the embedding model from the Developer tab. For a deeper dive on the local-RAG pattern, see our DeepSeek V4 Flash local setup writeup.
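To make the retrieval step concrete, the sketch below builds a request for the server's /v1/embeddings endpoint and ranks chunks by cosine similarity. It is a generic illustration of embedding-based retrieval, not LM Studio's internal pipeline. The model id string is an assumption (use whatever id your install reports), and the helper names are hypothetical.

```python
import json
import math
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address


def build_embeddings_request(texts, model="text-embedding-nomic-embed-text-v1.5"):
    """Build (without sending) a POST /v1/embeddings request. The model id is an assumption."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are closest to the query embedding."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]
```

With the server running, passing the request to urllib.request.urlopen returns an OpenAI-style response whose data entries each carry an embedding list. Feed those vectors to top_k_chunks and prepend the winning chunks to the prompt.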

Run an OpenAI-compatible API server

Switch to the Developer tab (formerly "Local Server"). Load a model into memory, then click Start Server. By default it binds to http://localhost:1234 and exposes endpoints that mirror the core of OpenAI's REST API:

  • POST /v1/chat/completions
  • POST /v1/completions
  • POST /v1/embeddings
  • GET /v1/models

Because the request and response shapes match OpenAI's, OpenAI client libraries work with no code changes beyond the base URL. Point the SDK at your local server:

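from openai import OpenAI

# The local server does not check the key, but the SDK requires a non-empty string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="llama-3.3-8b-instruct",  # use the id of the model you loaded (see GET /v1/models)
    messages=[{"role": "user", "content": "What is RAG?"}],
)
print(resp.choices[0].message.content)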

This unlocks the entire OpenAI ecosystem locally: LangChain, LlamaIndex, Continue.dev, Cursor's local-model mode, Open WebUI, and our own MCP toolchains. The server logs incoming requests in the Developer tab so you can debug payloads end-to-end.
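If you would rather not install an SDK, the same call can be sketched with the Python standard library alone. This is a minimal illustration under the assumptions already used in this guide (default port 1234, a placeholder model id). The helper names are hypothetical, and the request is only built, not sent, until you call ask.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address


def build_chat_request(prompt, model="llama-3.3-8b-instruct", temperature=0.7):
    """Build (without sending) a POST /v1/chat/completions request."""
    payload = {
        "model": model,  # placeholder id; use the one your server lists
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion dict."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt, timeout=120):
    """Send the request; this only works while the LM Studio server is running."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Calling ask("What is RAG?") with a model loaded returns the reply text. Keeping request construction separate from sending makes the payload easy to inspect next to the server log in the Developer tab.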

The lms CLI

LM Studio ships an lms CLI installed alongside the GUI. After first launch, run lms bootstrap once to register the binary on your PATH. The CLI covers the same surface as the GUI plus headless deployment:

# model lifecycle
lms get llama-3.3-8b-instruct
lms ls                        # list installed models
lms load llama-3.3-8b-instruct
lms unload --all

# server lifecycle
lms server start --port 1234
lms server status
lms server stop

# headless daemon (Linux/macOS)
lms daemon up
lms daemon down

# interactive REPL
lms chat llama-3.3-8b-instruct

# log streaming
lms log stream

For scripted deploys (CI, Docker-on-Linux, on-prem servers) llmster is the daemon flavour: same engine, no GUI dependency. The official docs walk through registering it as a systemd unit so it survives reboots.
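For a scripted bring-up, one possible shape is a small wrapper that runs the lms commands shown above in order and stops on the first failure. It uses only the subcommands and the --port flag listed in this section. The function names are hypothetical, and the sketch assumes lms is already on your PATH.

```python
import subprocess


def startup_commands(model, port=1234):
    """Return the lms invocations, in order, for a scripted bring-up."""
    return [
        ["lms", "load", model],
        ["lms", "server", "start", "--port", str(port)],
        ["lms", "server", "status"],
    ]


def run_startup(model, port=1234):
    """Run each command in turn; check=True raises CalledProcessError on a non-zero exit."""
    for argv in startup_commands(model, port):
        subprocess.run(argv, check=True)
```

Passing each command as an argument list rather than a shell string avoids quoting problems with model names, and building the list separately makes the sequence easy to log or dry-run in CI.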

MLX vs llama.cpp on Apple Silicon

This is the question that drives most Mac LM Studio installs. The short version: on M-series Macs, MLX usually wins. MLX is designed around Apple Silicon's unified memory and Metal GPU, whereas llama.cpp's Metal backend is one target of a cross-platform engine. Neither path uses the Neural Engine. For most modern models (Llama 3.x, Qwen 2.5, Gemma 2, Mistral, Phi) MLX delivers 30 to 50 percent higher tokens-per-second with equal or lower memory use.

Three caveats:

  • MLX has narrower model coverage. Brand-new architectures often land in GGUF first.
  • MLX quant options are simpler (4-bit, 8-bit, fp16). GGUF's K-quants give you finer control.
  • On Intel Macs MLX is unavailable. The Apple Silicon requirement is hard.

Practical rule: download both formats if your Mac can hold them, then benchmark with the Chat tab's tokens-per-second counter. Pick whichever wins on your machine for your model.
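If you prefer numbers you can log, a rough benchmark can also be scripted against the local server. The sketch below shows only the arithmetic: throughput from a token count and wall-clock time, and the relative speedup between two runs. The function names are hypothetical. Wall-clock timing includes prompt processing, so treat the result as an approximation of the in-app counter rather than a match for it.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Throughput for one generation."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def speedup_percent(candidate_tps, baseline_tps):
    """How much faster the candidate is than the baseline, in percent."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


def time_generation(generate):
    """Time a callable that returns the number of completion tokens it produced."""
    start = time.perf_counter()
    tokens = generate()
    return tokens_per_second(tokens, time.perf_counter() - start)
```

In practice, generate would send one chat completion request and return the usage.completion_tokens value from the OpenAI-style response. Run it several times per format with the same prompt and compare medians, since the first request after loading a model is often slower.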

LM Studio vs Ollama vs llama.cpp

The three tools overlap heavily but optimise for different defaults.

| Feature | LM Studio | Ollama | llama.cpp |
|---|---|---|---|
| Primary surface | Desktop GUI | CLI + REST | C++ binary + REST |
| Install | Installer (.dmg/.exe/.AppImage) | One-line script or installer | Source compile or release binary |
| Model registry | Hugging Face browser | ollama.com registry + HF | Manual GGUF download |
| Apple Silicon backend | MLX + Metal | Metal | Metal |
| OpenAI-compatible server | Yes (localhost:1234) | Yes (localhost:11434) | Yes (llama-server) |
| Headless daemon | llmster | ollama serve | llama-server |
| Document chat (RAG) | Built-in | Via Open WebUI etc. | None |
| Docker image | No | Yes (official) | Yes (community) |
| Best for | Desktop power user, GUI-first eng | CLI-first eng, container deploys | Max throughput on a single GPU |

Reach for LM Studio when you want a real graphical interface, MLX acceleration on a Mac, and an OpenAI-compatible endpoint without writing any code. Reach for Ollama when you live in the terminal and want Docker-native deploys. Reach for llama.cpp when you need every last token-per-second and are comfortable with C++ build flags. For production at scale (high concurrency, paged attention) skip all three and use vLLM.
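Because all three expose an OpenAI-compatible server, switching between them from client code can be as small as changing a base URL. The sketch below assumes default local ports. The LM Studio and Ollama ports come from the comparison above; the llama-server port is an assumption based on its usual default, so check your own launch flags.

```python
OPENAI_COMPATIBLE_BASE_URLS = {
    "lmstudio": "http://localhost:1234/v1",
    "ollama": "http://localhost:11434/v1",
    "llama.cpp": "http://localhost:8080/v1",  # assumption: llama-server's default port
}


def chat_completions_url(backend):
    """Return the chat completions endpoint for a named local backend."""
    try:
        base = OPENAI_COMPATIBLE_BASE_URLS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return f"{base}/chat/completions"
```

Keeping the mapping in one place lets you benchmark the same prompt against each runtime without touching the rest of your client code.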

Code completion with LM Studio

LM Studio is increasingly viable as a backend for local code completion. The shape is: load a coder-tuned model in LM Studio, start the OpenAI-compatible server, and point your editor's AI extension at http://localhost:1234/v1.

Two editor patterns work well today:

  • Continue.dev (VS Code, JetBrains). In ~/.continue/config.json set the models entry to provider: "openai" with apiBase: "http://localhost:1234/v1" and the model id matching what you loaded in LM Studio (e.g. qwen2.5-coder-14b-instruct).
  • Cursor. In Settings, override the OpenAI base URL to http://localhost:1234/v1. Cursor will route chat through your local model; Tab completion still needs Cursor's small remote model unless you switch it off.
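As a rough illustration of the Continue.dev entry described above, the following sketch generates the JSON fragment. The provider, apiBase and model values come from this guide. The title field and the surrounding models array are assumptions about Continue's config.json layout, so compare against your installed version before pasting.

```python
import json


def continue_model_entry(model_id, title="LM Studio (local)"):
    """Build one entry for the models list in ~/.continue/config.json."""
    return {
        "title": title,  # assumption: display label shown in Continue's model picker
        "provider": "openai",
        "model": model_id,  # must match the id of the model loaded in LM Studio
        "apiBase": "http://localhost:1234/v1",
    }


config_fragment = {"models": [continue_model_entry("qwen2.5-coder-14b-instruct")]}
print(json.dumps(config_fragment, indent=2))
```

Merge the printed entry into your existing config rather than overwriting the file, since Continue keeps other settings alongside the models list.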

Recommended models for code completion in mid-2026: Qwen2.5-Coder-14B for the best general code quality on a 32 GB machine, Qwen2.5-Coder-7B for laptops, DeepSeek-Coder-V2-Lite as a strong alternate. All ship in both GGUF and MLX. The Qwen 3.5 pillar covers the model side; this section covers the runtime.

One realistic caveat: local code models are still meaningfully behind Claude Sonnet, GPT-5.5 and Gemini 2.5 Pro on hard refactors and multi-file reasoning. Use LM Studio for offline work, sensitive codebases, or as a fast path for simple completions; reach for a frontier model when the task is hard.

MCP support and agentic workflows

Starting in LM Studio 0.3.x the app ships first-class MCP (Model Context Protocol) support. The Developer tab has an MCP Servers panel where you can paste config for any MCP server (filesystem, SQLite, web search, your own) and the loaded model will see those tools in its context.

This pulls LM Studio into the same world as Claude Desktop and Cursor: a local model with a real tool-use surface. Practical wins:

  • Wire the filesystem MCP server and ask a local Qwen2.5-Coder model to read and edit files on disk.
  • Add a SQLite MCP and run analytical queries against a local database via natural language.
  • Combine with web-search MCPs for offline-first research that fetches when needed.

Tool-use quality is a function of the model, not LM Studio. Qwen2.5, Llama 3.3 and DeepSeek-Coder-V2 all handle MCP function calling competently; smaller 3B-class models will struggle.
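For orientation, here is a sketch of what a filesystem MCP entry might look like, generated from Python so it can be templated per project. It assumes the mcpServers layout used by other MCP hosts such as Claude Desktop and Cursor, the reference filesystem server launched through npx (which needs Node.js installed), and a hypothetical project directory. Confirm the exact format against LM Studio's MCP documentation.

```python
import json


def filesystem_mcp_config(allowed_dir):
    """Build an MCP config that exposes one directory through the filesystem server."""
    return {
        "mcpServers": {
            "filesystem": {
                "command": "npx",
                # The trailing argument is the directory the server may read and write.
                "args": ["-y", "@modelcontextprotocol/server-filesystem", allowed_dir],
            }
        }
    }


# "/path/to/project" is a placeholder, not a real location.
print(json.dumps(filesystem_mcp_config("/path/to/project"), indent=2))
```

Scope the allowed directory as narrowly as you can. A local model with file tools can edit anything under that path.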

Common pitfalls

  • Out-of-memory crashes on large models. Watch the green/yellow badge in the model browser. A yellow badge means the quant will spill to swap, which murders throughput.
  • Slow first token after idle. LM Studio unloads models from VRAM after an idle timeout. Tune Keep model in memory in the Developer tab.
  • Server only accessible from localhost. By design. Tick Serve on local network in the Developer tab to bind to 0.0.0.0. Combine with a firewall rule.
  • MLX model missing for a brand-new architecture. Fall back to GGUF until the MLX community ports it. The lag is usually days to weeks.
  • Intel Mac users. macOS 14 + Apple Silicon is required for current LM Studio. Older 0.2.x builds support Intel but are no longer updated.

When NOT to use LM Studio

  • Production inference at scale. Use vLLM or TGI. LM Studio is a single-process, single-node tool.
  • Headless Linux server with no GUI ever needed. The llmster daemon works, but Ollama's Docker-first design is friendlier.
  • You want fine-grained CLI control over llama.cpp build flags. Use llama.cpp directly.
  • Multi-tenant deployment. LM Studio is single-user; there is no auth on the local API.

FAQ

Is LM Studio free?

Yes for personal use. The desktop app and the lms CLI cost nothing. A commercial tier exists for organisations bundling LM Studio into paid products.

Does LM Studio send my data anywhere?

No. All inference happens locally. The only network calls are to Hugging Face for model downloads and to LM Studio's update server for app updates. Both can be blocked at the firewall once you have the models you want.

What is the OpenAI-compatible server in LM Studio?

A REST API on http://localhost:1234 that mirrors OpenAI's /v1/chat/completions, /v1/completions, /v1/embeddings and /v1/models endpoints. Any OpenAI SDK or tool works against it by swapping the base URL.

How does LM Studio compare to Ollama?

Both wrap llama.cpp. LM Studio is GUI-first with MLX on Apple Silicon; Ollama is CLI-first with Docker and a model registry. LM Studio is friendlier for desktop power users; Ollama is friendlier for container and CI deployments.

Does LM Studio support MLX on Apple Silicon?

Yes. MLX is the default engine on M-series Macs when an MLX build of the model is available. It is typically 30 to 50 percent faster than the Metal/llama.cpp path.

Can I run LM Studio headless on a Linux server?

Yes. Install the AppImage, run lms daemon up and start the server with lms server start. The official docs include a systemd unit example.

Does LM Studio do RAG?

Yes. Attach a PDF, TXT, DOCX, Markdown or code file to a chat. Short documents are inlined into the context; long ones are chunked and embedded locally, using nomic-embed-text-v1.5 by default, and the relevant chunks are retrieved into context per turn.

Which models can I run in LM Studio?

Any GGUF model on Hugging Face, plus MLX-format models on Apple Silicon. Llama 3.x, Qwen 2.5, DeepSeek, Mistral, Gemma 2, Phi-3.5 and dozens of fine-tunes are all one-click installs from the Discover tab.

What hardware do I need?

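16 GB of RAM is the comfortable minimum for 7B/8B models at Q4. 32 GB unlocks 13B and 14B comfortably; 64 GB+ for 30B and larger. Apple Silicon's unified memory is shared between CPU and GPU, so most of it is usable as VRAM. A 70B model at Q4 needs roughly 40 GB for the weights alone, so it will not fit on a 36 GB M3 Pro; plan on 64 GB for that class.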
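A back-of-the-envelope way to sanity-check a model against your machine is to multiply parameter count by bits per weight. The sketch below does that arithmetic. The bits-per-weight figures are approximate averages assumed for illustration (real GGUF files vary by architecture), and the flat overhead for KV cache and runtime is a guess that grows with context length.

```python
# Approximate bits per weight for common GGUF quants (assumed averages).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}


def weights_gb(params_billion, quant):
    """Estimated size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion, quant, memory_gb, overhead_gb=2.0):
    """True if weights plus a flat allowance for KV cache and runtime fit in memory."""
    return weights_gb(params_billion, quant) + overhead_gb <= memory_gb
```

For example, weights_gb(8, "Q4_K_M") comes out near 4.85 GB, which is why an 8B model at Q4 sits comfortably on a 16 GB machine. Remember that the OS and other apps also need memory, so leave headroom beyond what fits reports.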

Is LM Studio open source?

The GUI app is closed source. The underlying engines (llama.cpp, MLX) are open source. The lms CLI and llmster daemon are closed-source binaries shipped with the app.

References