Apple Silicon LLMs: Complete Guide to Running Models on Mac (2026)

The complete Codersera guide to running large language models locally on Mac. MLX has won the Apple Silicon performance race; this is what to buy, install, download, and how to think about it.

Updated 31 May 2026 • 15 min read

Quick answer. In May 2026, the Mac is a credible local-LLM box. Apple Silicon's unified memory means a 64 GB MacBook runs models that won't fit on a 24 GB RTX 4090, and MLX — Apple's native ML framework — has become the fastest way to run them, beating llama.cpp by 30–40% on M5 hardware. For most users the right stack is Ollama 0.19+ (which now uses MLX under the hood on Apple Silicon) for everyday chat and agent work, MLX-LM directly when you want maximum performance or to fine-tune, and LM Studio when you want a GUI. A 32 GB M-class Mac runs 30B mixture-of-experts models at ~100 tokens/sec; a 64 GB Mac runs 70B at usable speeds; multi-Mac clusters over Thunderbolt 5 now run frontier 120B+ models for sovereign teams. This guide covers what to buy, what to install, which models to download, and how to think about quantization, fine-tuning, and clustering on Apple Silicon.

Why run LLMs locally on a Mac in 2026?

Three things changed in 2024–2026 that turned the Mac from a curiosity into the default local-LLM machine for individual developers and small teams.

Unified memory matured into an unfair advantage. On a discrete-GPU PC, the model has to fit in the GPU's VRAM — typically 8 to 24 GB on consumer hardware, and the model weights have to be copied across PCIe to get there. On Apple Silicon, the CPU and GPU share the same pool of memory at full bandwidth. A 64 GB MacBook Pro can load a 70-billion-parameter model at 4-bit quantization and start serving tokens in seconds, no copying. The 24 GB VRAM ceiling that limits a $1,600 RTX 4090 simply doesn't apply.

MLX won. Apple's open-source MLX framework, released late 2023, hit production maturity in 2025 and pulled decisively ahead of llama.cpp's Metal backend in 2026. On the M5 generation, MLX is 30–60% faster on most workloads and 3–4× faster on prompt processing thanks to the Neural Accelerators embedded in every GPU core. The Hugging Face mlx-community org now hosts ~4,800 pre-converted models. Ollama, the most popular local-LLM CLI, switched its Apple Silicon backend to MLX in version 0.19 in March 2026 — a quiet but enormous performance win for anyone already using it.

The model landscape moved. 2026 produced a wave of mixture-of-experts (MoE) open-weight models — Qwen 3.5 / 3.6 / 3.7, DeepSeek V4 Flash, Kimi K2.6, Mixtral families — that ship a small "active" parameter budget (3B–17B activated per token) on top of a large total parameter pool. They run fast on Mac because only the active experts have to be loaded into the working set, and Apple Silicon's bandwidth handles MoE routing well. The result: a 32 GB Mac running Qwen3-Coder-30B-A3B is the practical equivalent of a much bigger GPU box for most chat and code-completion workloads.

None of this means a Mac replaces a multi-GPU cloud setup for serving paying customers — it doesn't. But for privacy-sensitive work, offline use, agent prototyping, learning, and the entire "I want to use a strong model without an API bill" pattern, the Mac in 2026 is the answer.

What is the Apple Silicon advantage, exactly?

Three hardware properties drive everything:

1. Unified memory architecture (UMA). On the M-series chip, the CPU, GPU, and Neural Engine share a single pool of RAM with no copies between them. When you load a 40 GB model on a 64 GB Mac, every component can read those weights at full memory bandwidth — typically 400–500 GB/s on a Max-class chip, 800+ GB/s on Ultra. By contrast, an NVIDIA GPU has to copy weights from system RAM into VRAM across PCIe (~64 GB/s on PCIe 5.0 x16) before it can use them, and if the model doesn't fit in VRAM at all, it can't run.

2. Memory bandwidth that's actually high for the price. Token generation in transformers is memory-bandwidth-bound; almost no compute matters compared to "how fast can I read the weights for the next token." Bandwidth ranks roughly: M-class Pro ≈ 273 GB/s, Max ≈ 500 GB/s, M5 Max ≈ 614 GB/s, M2/M3 Ultra ≈ 800 GB/s. An RTX 4090 manages 1,008 GB/s but only on the 24 GB that fits inside it.

3. Neural Accelerators on M5. Apple's M5 chip embeds dedicated AI accelerators inside every GPU core (40 in M5 Max), and MLX uses them automatically. Apple's January 2026 paper showed Qwen3-14B at 4-bit running 4.06× faster on time-to-first-token and 1.19× faster on token generation versus M4 — and that's the accelerator contribution alone, before the bandwidth gain. For interactive chat and long-context agent workloads, M5 is a genuine generational step, not just a refresh.

The catch: bandwidth governs decode speed, accelerators govern prefill speed. M5's 3.5× advertised "AI gain" is mostly about prefill (the model reading your prompt). Token-by-token output speed scales with bandwidth, which only improved ~15% from M4 to M5. Don't expect a 3.5× speedup on every workload — expect a dramatic speedup on time-to-first-token for long prompts, especially in agent loops.

What is MLX and why did it win?

MLX (github.com/ml-explore/mlx) is an array-computation framework for Apple Silicon, built and open-sourced by Apple in late 2023. Think NumPy + autograd + a Metal backend, with Python and Swift APIs. Around it, mlx-lm (github.com/ml-explore/mlx-lm) provides LLM-specific tooling: model loading, generation, quantization, LoRA fine-tuning, and a tokenizer wrapper.

MLX's design choices that mattered: a lazy computation graph (no eager-mode overhead), zero-copy weights through unified memory (no CPU↔GPU shuttling), and a function-transform model (grad, vmap, compile) that lets the framework fuse operations into one Metal kernel. The result is consistently faster than llama.cpp's Metal backend on the same model + quant, with the gap widening on M5 because MLX explicitly uses the Neural Accelerators that llama.cpp doesn't.

Where it stood as of May 2026: mlx v0.31.x, mlx-lm v0.31.x, roughly a release every 3–4 weeks. MoE is fully supported (the kernel work landed in late 2025). Quantization tooling covers 3-bit through 8-bit plus mixed-precision modes (the mxfp8 and nvfp4 formats from the broader 4-bit-floating-point ecosystem). Distributed inference and training across multiple Macs via mx.distributed is production-shaped. The mlx-community Hugging Face org has ~4,810 converted models from ~4,830 community members — pick a model, append -4bit or -8bit, and it usually exists.

MLX also ships a CUDA backend (~9% of the codebase is CUDA-targeted), so the framework isn't strictly Apple-only — it can run on Linux + NVIDIA. Practically, you wouldn't use MLX on NVIDIA over vLLM or TensorRT-LLM, but it broke the "MLX locks you in" framing.

Pick your stack: MLX-LM, Ollama, LM Studio, llama.cpp, or vllm-mlx?

The Mac local-LLM stack used to be a confusing five-way choice. In 2026 it's much cleaner, because most of these now share MLX under the hood:

Tool	What it is	Right when
Ollama 0.19+	CLI + REST server, MLX backend on Apple Silicon since March 2026	Everyday chat, agent workflows, you want a one-line install and an OpenAI-compatible API. The default for most users.
MLX-LM	Apple's official Python CLI: `mlx_lm.generate`, `mlx_lm.server`, `mlx_lm.lora`	Maximum speed, fine-tuning, scripting in Python, custom inference loops. The right answer when Ollama is too opinionated.
LM Studio	Desktop GUI that wraps both GGUF (llama.cpp) and MLX backends, plus a model marketplace	You want a GUI, want to browse and download models visually, want MCP server support, or are setting up a non-technical user.
llama.cpp / GGUF	The cross-platform C++ reference, Metal backend on Mac	A model is brand-new and only has GGUF conversions yet, or you want truly portable inference code that runs on Mac/Linux/Windows from one binary.
vllm-mlx	vLLM's API and PagedAttention, MLX as the kernel layer	You're serving multiple concurrent users / agent fleets and need batched throughput. Worse single-user speed, much better total throughput.

Independent benchmarks on an M4 Pro 64 GB running DeepSeek V3 at Q4 in May 2026 showed: Ollama 0.19+ ≈ 58 tokens/sec single-user with ~45 ms time-to-first-token (the best for interactive chat); vllm-mlx ≈ 42 tokens/sec single-user but ~1,150 tokens/sec aggregate at 32 concurrent users (the best for agent fleets); llama.cpp Metal ≈ 52 tokens/sec, the slowest of the MLX-backed group but still the broadest model support. The "Ollama is slow on Mac" critique that was true through 2025 is now mostly false — upgrade to 0.19 or later.

Jan and GPT4All are also valid choices for the "fully offline desktop app, no telemetry" niche; both lag MLX-native tools on raw speed but are excellent for privacy-strict users.

What can I actually run on my Mac?

The honest buying-guide table — what runs comfortably at 4-bit (Q4) quantization, with expected tokens/sec on an M4 Max class chip via MLX. M5 numbers run ~15% faster on token generation and dramatically faster on time-to-first-token:

Unified RAM	Realistic ceiling	Representative model	Speed (M4 Max, Q4)
8 GB	3–4 B dense, small MoE	Gemma 4 E2B, Phi-3.5-mini	60–90 tok/s
16 GB	7–8 B dense	Llama 3.2 7B, Mistral 7B	50–70 tok/s
24 GB	13–14 B dense	Qwen 3.5 14B	30–45 tok/s
32 GB	30B MoE — the 2026 sweet spot	Qwen3-Coder-30B-A3B	~130 tok/s
48 GB	30–40 B dense	Command-R 35B	18–25 tok/s
64 GB	70 B Q4 comfortably	Llama 3.1 70B, Qwen3.5-122B-A10B	12–15 tok/s on 70B
96–128 GB	70 B bf16, 8x22B MoE	Mixtral 8x22B, DeepSeek V4 Flash (aggressive quant)	8–15 tok/s on 70B FP16
192 GB Ultra	405 B at Q2, DeepSeek V4 at Q3	Llama 3.1 405B Q2_K	3–6 tok/s
512 GB Studio Ultra	DeepSeek V4 at usable quant	DeepSeek V4 Q4	15–25 tok/s
Multi-Mac JACCL cluster	120B+ at higher precision	DeepSeek V4 bf16, Llama 4 400B	10–20 tok/s, scales with cluster

The sweet spot for most developers in May 2026 is a 64 GB M4 Max or 64 GB M5 Max. It runs the 30B-A3B MoE class at over 100 tokens/sec (faster than most cloud APIs for interactive work), runs 70B dense at acceptable speed, and leaves headroom for the context window — which matters more than people realise, because a 128 K-token context can easily eat 8–12 GB on top of the model weights.

If you have an 8 GB Mac, you can still run something useful: Gemma 4 E2B, Phi-3.5-mini, or a tiny Qwen 3.5 variant. These are not GPT-5.5-replacements, but they handle classification, summarisation, light coding assistance, and structured-output tasks well, and they run completely offline.

Which models should I download right now?

The May 2026 picks, organised by what you're trying to do. All have native MLX conversions in mlx-community:

Best general-purpose chat model on 32 GB: Qwen 3.6 35B-A3B MoE. Smart, fast, multilingual, instruction-following is on par with hosted Claude Sonnet for everyday tasks.

Best coding assistant on a Mac: Qwen3-Coder-30B-A3B at 4-bit. Roughly 130 tok/s on a 64 GB M4 Pro, ~230 tok/s on M5 Max. The state of the art for offline code completion in 2026. See our AI coding agents guide for context on how it compares to Claude Code and Cursor.

Best reasoning model on 64 GB: DeepSeek V4 Flash at Q4 (284B total / 13B active MoE). Released late April 2026 with MIT license. Punches above its weight class on math and coding benchmarks. The full DeepSeek V4 Pro (1.6T total) only fits on a 512 GB Mac Studio Ultra or a small cluster — see our DeepSeek V4 guide.

Best 70B-class workhorse on 96 GB: Llama 4 Maverick at Q4. The most-deployed local-LLM size for serious work. Stable, well-tuned, fine-tunable. See our Llama 4 guide.

Best long-context model: Kimi K2.6 at aggressive quant on Mac Studio Ultra, or Llama 4 Scout for its 10M-token context. Kimi K2.6 leads agentic benchmarks. See our Kimi K2.6 guide.

Best tiny model for 8–16 GB Macs: Gemma 4 E2B / E4B. Google's small-model family, multimodal, 256K context. See our Gemma 4 guide.

Best vision-capable model: Qwen3-VL-30B-A3B. ~68 tok/s on M4 Max, 32 GB minimum. The MLX conversion handles continuous batching for vision through vllm-mlx if you need it.

The full open-weights landscape, including Qwen, Llama, DeepSeek, Mistral, Gemma, and the rest, is mapped in our open-source LLMs landscape pillar. The infrastructure side — cloud GPU runners, Kubernetes deployment, multi-tenant serving — is in our self-hosting LLMs guide.

How should I think about quantization on Mac?

Quantization compresses weights from 16-bit floats down to lower-bit representations, trading some accuracy for huge memory savings. The two formats that matter on Mac:

MLX native quants (-4bit, -8bit, mixed precision). The default. mlx_lm.convert --hf-path X --q-bits 4 converts any Hugging Face model. Quality at 4-bit on 7B+ models is within 1–2% of bf16 on standard benchmarks. Sub-3B models suffer noticeably at 4-bit — prefer 8-bit there.

GGUF (Q2_K through Q8_0, plus K-quants and IQ-quants). The cross-platform standard, used by llama.cpp, Ollama (the legacy backend), and LM Studio. Q4_K_M is the safe default for 13B+ models — within ~1% of FP16 on quality. Q5_K_M for 7B and smaller. Q8_0 if you have the RAM headroom and want near-lossless. Below Q4_K_M, use importance-matrix quantization (imatrix) variants, otherwise quality drops noticeably.

What to skip on a Mac: AWQ, GPTQ, EXL2. All three are NVIDIA tensor-core-optimised; no Mac tool supports them natively. The few "AWQ on Mac" claims you'll see online are running the underlying base weights through a conversion, not the AWQ format itself.

The general rule for Mac users in 2026: start at MLX 4-bit. Move to 8-bit if your model is under 7B or you notice quality issues. Move to bf16 if you have memory headroom and care about long-form reasoning quality. Only reach for GGUF when the model you want hasn't been converted to MLX yet (the gap is usually a few days).

Can I run frontier models by clustering multiple Macs?

Yes, and this is the most interesting 2026 development for sovereign-AI teams. The pattern: connect 2–4 Macs over Thunderbolt 5, install MLX, run distributed inference across the cluster. A small group of Mac Studio Ultras now serves DeepSeek V4 at higher precision than any single machine could fit.

The enabling technology is JACCL — a distributed backend Apple shipped with macOS 26.2 that runs MLX collectives over RDMA on Thunderbolt 5, hitting 50–60 Gbps with sub-50 µs latency. An order of magnitude lower latency than the previous ring-allreduce backend. Requires a fully-connected Thunderbolt topology (every Mac directly cabled to every other Mac), which limits practical cluster sizes to 4–6 nodes.

The community tool that wraps this is EXO, which auto-discovers Macs on the local network and distributes a single model across them, transparently. Spin up two Mac Studio Ultras, point EXO at them, and you have a single-binary serving endpoint for a 400B+ MoE that wouldn't fit on either individually.

When clustering is worth it: you're a small team, you need data sovereignty (the entire model and prompt history never leaves your office), you'd otherwise be on a $50K/mo OpenAI bill, and you can amortise the hardware (~$30K–$60K depending on configuration) over a year. When it's not worth it: you're a solo developer; rent an H100 on Lambda for $2.50/hour instead.

What about Apple's own on-device LLM (Foundation Models)?

In WWDC 2025, Apple shipped the Foundation Models framework in iOS / iPadOS / macOS 26 — a ~3 B-parameter on-device LLM with a Swift API. It powers Apple Intelligence behind the scenes. From a developer's perspective, you call a Swift function, you get tool-using LLM output, and nothing leaves the device.

Strengths: free, offline, private, optimised for Apple Silicon, no install. Weaknesses: closed weights, Swift-only, fixed size, modest quality (3 B parameters tuned for on-device, not for hard reasoning). It's not a replacement for an MLX-served open-weight model — it's a complementary option when you ship a Mac/iOS app and want a default model that just works.

WWDC 2026 (June 8–12) is widely expected to expand the framework — bigger on-device models, better Swift / Xcode integration, possibly a "Core AI" replacement for Core ML. If that ships, an Apple Foundation Models pillar is the natural follow-up to this guide.

Can I fine-tune on my Mac?

Yes, with caveats. MLX-LM ships LoRA, QLoRA, and DoRA fine-tuning natively, with realistic memory footprints:

7–8 B model: QLoRA at 4-bit fits in 7–8 GB working memory — runs on a 16 GB MacBook Pro
13–14 B: 14–18 GB working memory, comfortable on 32 GB
32 B at QLoRA: ~20–25 GB, possible on 48 GB+
70 B: requires 96 GB+ unified memory; tight but works

Real-world example: a Mistral 7B QLoRA run on 5,000 examples takes ~90 minutes on an M2 Max 32 GB, peak ~7 GB RAM. The unified-memory trick again — a 32 GB Mac fine-tunes models that OOM an RTX 3090's 24 GB VRAM. The tradeoff: NVIDIA still trains 2–4× faster on whatever fits on it, so for serious multi-day fine-tuning work, rent a cloud H100.

The CLI is dead simple: mlx_lm.lora --model mlx-community/Llama-3.1-8B-Instruct-4bit --train --data ./data --iters 1000. Our fine-tuning LLMs guide covers the practical recipe — dataset format, LoRA hyperparameters, evaluation — in depth.

How do I get started in five minutes?

Path A — Ollama (recommended for most users).

Install: brew install ollama then ollama serve &
Pull a model: ollama pull qwen3-coder:30b
Chat: ollama run qwen3-coder:30b
For programmatic use, hit the OpenAI-compatible endpoint at http://localhost:11434/v1

Path B — MLX-LM directly (recommended for max performance and fine-tuning).

Install: pip install mlx-lm
Generate: mlx_lm.generate --model mlx-community/Qwen3-Coder-30B-A3B-4bit --prompt "Write a Python script that..."
Serve an OpenAI-compatible API: mlx_lm.server --model mlx-community/Qwen3-Coder-30B-A3B-4bit
Fine-tune: mlx_lm.lora --model <model> --train --data <path> --iters 1000

Path C — LM Studio (recommended for GUI users). Download from lmstudio.ai, hit the model marketplace, filter by "MLX" or "Apple Silicon", click download, click load, chat. Includes an OpenAI-compatible local server you can flip on with one click.

FAQ

Is MLX faster than llama.cpp on Mac?

Yes, in 2026, by 30–60% on most workloads on M4 and M5 hardware. The gap is widest on prompt processing (time-to-first-token), where MLX uses Apple's Neural Accelerators that llama.cpp doesn't. On older M1 and M2 chips the gap is smaller. For any new Mac you buy in 2026, default to MLX-backed tooling (Ollama 0.19+, MLX-LM, vllm-mlx, or LM Studio's MLX backend).

How much RAM do I really need for serious local-LLM work?

32 GB is the entry point for genuinely useful work — it runs the 30B-A3B MoE class at ~100 tok/s, which feels like a cloud API. 64 GB is the sweet spot — adds 70B dense models and gives breathing room for context. 96–128 GB is for power users who want bf16 70B or larger MoEs. Anything below 32 GB confines you to small models that are good for narrow tasks but won't replace a hosted API.

Should I buy a MacBook or a Mac Studio for local LLMs?

MacBook Pro 14"/16" with M4 Max 64 GB if you need portability — it's the best laptop money can buy for LLM work. Mac Studio M4 Max 64 GB if you want a desk machine at lower cost. Mac Studio M2/M3 Ultra 192 GB only if you need to run 100B+ models locally; the cost-per-token is much better than buying a discrete NVIDIA setup at the same scale, but you're paying for the unified memory pool.

Can I run Claude or GPT on my Mac?

No — neither Anthropic nor OpenAI release weights, so their models can only be accessed via their respective APIs. The best open-weight alternatives in May 2026 are DeepSeek V4 (rivals GPT-5.5 on many benchmarks), Kimi K2.6 (rivals Claude Opus 4.6 on agentic tasks), and Qwen 3.6 / 3.5 (the strongest all-around open-weights family). For an honest comparison see our open-source LLMs landscape pillar.

What's the difference between MLX 4-bit and GGUF Q4?

Both compress a 16-bit model to roughly 4 bits per weight. MLX's 4-bit format is optimised for Apple's GPU shaders and runs faster on Apple Silicon. GGUF Q4_K_M is portable across platforms. Quality is comparable for 7B+ models. Pick MLX if the model has been converted (most have); pick GGUF only when the model is too new to have an MLX variant.

Does the M4 vs M5 difference matter for LLMs?

Yes, for prefill speed. M5 is dramatically faster at processing long prompts (3–4× faster time-to-first-token in Apple's published benchmarks) because of the Neural Accelerators. Token-by-token generation is only ~15% faster because that's memory-bandwidth-bound and bandwidth only grew modestly. If you do a lot of long-context work (RAG, agent loops, code repos), M5 is worth the upgrade. If you do short chat, the difference is smaller.

Can I use a Mac as the primary serving box for a SaaS app?

Honest answer: usually no. A Mac handles 1–10 concurrent users gracefully via vllm-mlx; beyond that you want a cloud GPU. The Mac is excellent for internal tools, agent prototypes, dogfooding, and small-team deployments. For production serving at scale, see our self-hosting guide for the cloud/Kubernetes path.

What about training a model from scratch on my Mac?

Not realistic at meaningful scale — you'd need weeks to months of GPU time even for a small model, and a Mac Studio Ultra doesn't beat a single rented H100 for raw FLOPS. The right path for "I want to train a model" is cloud H100/H200 spot instances. See our guide on self-training small LLMs for the realistic recipe — what fits, what costs what, and when nanochat is the right starting point.

How do I keep MLX up to date?

pip install --upgrade mlx mlx-lm once a month is enough. Ollama updates itself when you run brew upgrade ollama; LM Studio prompts you in-app. The model side moves faster than the framework side — check mlx-community on Hugging Face weekly for new conversions.

What's missing or coming next?

WWDC 2026 (June 8–12) is the next big inflection point. Expected: Foundation Models framework v2, deeper Xcode integration ("Xcode Intelligence"), possibly a Core AI replacement for Core ML, and likely larger on-device models. The macOS 27 release in autumn 2026 will probably ship more JACCL improvements for clusters. On the open-weights side, expect Llama 5 (probably 2027 per Meta), Qwen 4, and continued MoE growth.

Self-hosting LLMs in the cloud — the Linux + Kubernetes + multi-GPU side of the same coin
Fine-tuning LLMs — practical recipes for LoRA / QLoRA / DoRA / MLX-LoRA / Unsloth / Axolotl
Self-training a small LLM from scratch — when pre-training a model yourself actually makes sense
Open-source LLMs landscape — the full family of open-weight models in 2026
DeepSeek V4 complete guide
Qwen 3.5 complete guide
Llama 4 complete guide
Gemma 4 complete guide
Kimi K2.6 complete guide
AI coding agents (Claude Code, Cursor, Copilot)