How to Run GLM-5.2 Locally — Hardware, Quants, and Setup
Quick answer. GLM-5.2 is a 744B-parameter Mixture-of-Experts model from Z.ai. Unsloth's 2-bit dynamic GGUF compresses it from 1.51 TB down to about 239 GB, so it can plausibly run on a 4x RTX 3090 rig with 192 GB system RAM, a 256 GB+ Mac Studio, or a high-RAM dual-socket workstation. Expect roughly 3 to 9 tokens per second on consumer hardware.
Why you might want to run GLM-5.2 locally
GLM-5.2 launched on June 13, 2026 with open weights under the MIT license, a 1M-token context window, and benchmark scores that put it in serious contention with closed frontier models (Pandaily, June 2026). For the deeper background on what the model is and how it compares to GPT-5.5, Claude Opus 4.7, and DeepSeek V4, see our GLM-5.2 pillar guide. This post is the hands-on companion: which rig you actually need, which Unsloth quant to download, and the exact commands to get a first token out.
Self-hosting makes sense in three situations:
- Data residency — your prompts and codebase never leave the box.
- Cost ceiling — heavy agent loops at the Z.ai pay-per-token rate of $1.40 in / $4.40 out per million tokens (AI Pricing Guru, 2026) compound fast.
- Fine-tuning — the MIT license permits commercial derivatives. Cloud APIs cannot give you that.
If none of those apply, the Z.ai cloud API is almost certainly the right starting point — skip to that section.
TL;DR — minimum requirements by quant level
The figures below are from Unsloth's official GLM-5.2 documentation (Unsloth, 2026) and a community self-hosting guide published June 17, 2026 (ofox.ai):
| Quant | Disk / RAM footprint | Reported minimum hardware |
|---|---|---|
| UD-IQ1_S (1-bit) | ~217 GB | 1x 24 GB GPU + 192 GB RAM with MoE offload |
| UD-IQ2_M (2-bit) | ~239 GB | 256 GB unified-memory Mac, or 1x 24 GB GPU + 256 GB RAM |
| 3-bit dynamic | ~290–360 GB | Dual high-VRAM GPU + 384 GB RAM, or 384 GB+ Mac |
| Q4_K_M / UD-Q4_K_XL | ~372–476 GB | 4x H100 80 GB (tight) or workstation with 512 GB RAM |
| UD-Q5_K_XL | ~570 GB | Multi-node H100 / H200, generally lossless |
| Q8_0 | ~810 GB | 8x H200 141 GB cluster |
| BF16 (full) | ~1.51 TB | 16x H100 80 GB or 8x H200 141 GB |
Unsloth describes the 4-bit and 5-bit dynamic variants as generally lossless against the BF16 baseline, while the 2-bit dynamic quant retains roughly 82% of accuracy at 16% of the original size (AI Weekly, June 2026).
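As a rough sanity check on the table, the sketch below subtracts a quant's footprint from a rig's combined GPU and host memory. The footprints are the approximate figures quoted above; the `Rig` class and `headroom_gb` helper are hypothetical names introduced for illustration. It assumes weights are the only consumer of memory, so the KV cache, the OS, and runtime overhead all have to fit in whatever headroom is left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rig:
    name: str
    vram_gb: float  # pooled GPU memory (0 for unified-memory or CPU-only boxes)
    ram_gb: float   # host RAM or unified memory


# Approximate footprints in GB, copied from the table above.
QUANT_GB = {"UD-IQ1_S": 217, "UD-IQ2_M": 239, "Q8_0": 810, "BF16": 1510}


def headroom_gb(quant: str, rig: Rig) -> float:
    """Memory left after loading the weights; negative means they cannot all fit."""
    return rig.vram_gb + rig.ram_gb - QUANT_GB[quant]


rigs = [
    Rig("1x 24 GB GPU + 192 GB RAM", 24, 192),
    Rig("4x RTX 3090 + 192 GB RAM", 4 * 24, 192),
    Rig("256 GB Mac (unified memory)", 0, 256),
]

for rig in rigs:
    for quant in ("UD-IQ1_S", "UD-IQ2_M"):
        print(f"{rig.name:30s} {quant:9s} headroom {headroom_gb(quant, rig):+.0f} GB")
```

A small or negative headroom means the configuration relies on memory-mapped weights or a shorter context, so treat the "minimum hardware" column as a floor rather than a comfortable target.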
Quant table — what to actually download
For everyone outside a research lab, the practical choice is between three Unsloth GGUFs:
| File | Approx. size | Estimated speed | Recommended rig |
|---|---|---|---|
| UD-IQ2_M (2-bit dynamic) | ~239 GB | ~3–9 tok/s on consumer hardware (DEV / ComputeLeap, June 15 2026) | 256 GB Mac Studio, or 4x 3090 + 192 GB RAM |
| UD-Q4_K_XL (4-bit dynamic) | ~376 GB | Not separately documented; expect higher single-stream quality, similar throughput on offload-bound setups | 4x H100 80 GB tight, or 512 GB-RAM workstation |
| Q8_0 (8-bit static) | ~810 GB | Per-token speed dominated by interconnect; benchmark on your own rig | 8x H200 141 GB cluster |
One reference benchmark published with the Unsloth release reports approximately 8.7 tok/s on a single H200 with the 2-bit dynamic GGUF (DEV / ComputeLeap, June 15 2026). We have not independently verified consumer-GPU tok/s figures for GLM-5.2 specifically — the 3–9 tok/s envelope above is Unsloth's stated expectation for consumer hardware, and your real number will depend on context length, prompt-processing strategy, and how aggressively you offload MoE experts.
Hardware path 1: 4x RTX 3090 + 192 GB RAM
This is the most common serious local rig in 2026: 96 GB of pooled VRAM, ample DDR4/DDR5 host RAM, and a used price in the $5–7k range for the GPUs alone. The ~239 GB 2-bit dynamic GGUF does not fit in 96 GB of VRAM, so whatever does not fit stays in host RAM, where llama.cpp runs the offloaded MoE experts on the CPU.
- OS: Ubuntu 22.04 or 24.04. Driver 550+, CUDA 12.4+.
- NVLink not required; PCIe 4.0 x16 per card is the practical floor.
- Storage: a single 1 TB NVMe is comfortable. Read speed matters at first load, not during inference.
- PSU: 4x 3090 pulls ~1,400 W under load. Power-limit to 280 W per card if you only have a 1,600 W PSU.
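The power-limit advice is simple arithmetic, sketched below. The helper name `psu_headroom_w` is hypothetical, and the 350 W uncapped figure is only the ~1,400 W total from the list split evenly across four cards. On NVIDIA cards the cap itself is usually applied with `nvidia-smi -pl 280` (root required).

```python
def psu_headroom_w(psu_watts: float, cards: int, per_card_limit_w: float) -> float:
    """Watts left for CPU, RAM, drives, and fans once every GPU sits at its limit."""
    return psu_watts - cards * per_card_limit_w


capped = psu_headroom_w(1600, 4, 280)    # cards power-limited to 280 W
uncapped = psu_headroom_w(1600, 4, 350)  # ~1,400 W total, split evenly
print(f"capped: {capped:.0f} W spare, uncapped: {uncapped:.0f} W spare")
```

The uncapped case leaves little margin for the rest of the system and for transient spikes, which is why the cap is worth setting on a 1,600 W supply.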
Build llama.cpp with CUDA, point it at the GGUF, and set --n-gpu-layers high enough that the dense layers are GPU-resident while the experts spill to host RAM. The Unsloth docs recommend the -ot "exps=CPU" override (long form: --override-tensor) to pin all expert layers to CPU; this preserves prompt-processing speed on the GPU.
Hardware path 2: Mac Studio M3 Ultra (256 GB+)
Apple Silicon is the cleanest local path for a single user. The 2-bit Unsloth GGUF fits inside a 256 GB unified-memory Mac Studio, and llama.cpp's Metal backend handles the model end-to-end. Reported throughput on this configuration is in the 3–9 tok/s range, which is enough for solo coding-agent work but not enough for a multi-developer team (ofox.ai, June 17 2026).
A 512 GB M3 Ultra unlocks the 4-bit dynamic GGUF, which Unsloth describes as generally lossless against BF16. Ivan Fioravanti (@ivanfioravanti) has demonstrated MLX batch-generation improvements on the M3 Ultra 512 GB platform for other large MoE models; we have not seen a public GLM-5.2-specific tok/s number from that rig yet, so plan on benchmarking your own.
Hardware path 3: CPU-only dual-socket workstation
A dual Xeon Sapphire Rapids or EPYC Genoa box with 768 GB DDR5 can run the 2-bit and 4-bit dynamic GGUFs entirely on the CPU. The MoE architecture is the reason this is even feasible: with only ~40B parameters active per token, the bottleneck is memory bandwidth, not raw compute (Latent.Space, June 2026).
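To make the bandwidth argument concrete, here is a back-of-the-envelope ceiling, not a benchmark. It assumes every active weight is read once per generated token, and that active weights use the file-wide average bits per weight (dynamic quants vary by layer, so this is only approximate). The 300 GB/s bandwidth is a placeholder; substitute the figure measured on your own machine. Both function names are hypothetical.

```python
def bits_per_weight(file_gb: float, total_params_b: float) -> float:
    """Average bits per parameter implied by the file size (1 GB = 1e9 bytes)."""
    return file_gb * 8 / total_params_b


def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float, bpw: float) -> float:
    """Upper bound on tokens/s if each active weight is read once per token."""
    gb_read_per_token = active_params_b * bpw / 8
    return bandwidth_gb_s / gb_read_per_token


bpw = bits_per_weight(239, 744)  # 2-bit dynamic GGUF, 744B total parameters
ceiling = decode_ceiling_tok_s(300.0, 40, bpw)  # 300 GB/s is an assumed placeholder
print(f"{bpw:.2f} bits/weight, ceiling of about {ceiling:.1f} tok/s")
```

Real throughput lands well below this ceiling because of NUMA effects, KV-cache reads, and compute overhead.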
Expectations:
- Throughput: single-digit tok/s. Documented expectations are in the same 3–9 tok/s envelope as Mac Studio (DEV / ComputeLeap, June 15 2026); we have not seen a verified dual-Xeon GLM-5.2 benchmark and recommend timing your own run.
- Prompt processing: noticeably slower than GPU paths. A 64k-token prompt can take several minutes to ingest.
- Power: ~600 W steady. Quieter, simpler, and easier to colocate than 4x 3090.
This path makes sense when you already own the workstation. Building one from scratch only to run GLM-5.2 is hard to justify over the Mac Studio or 4x 3090 routes.
Hardware path 4: Z.ai cloud API
If you have no local rig, the Z.ai cloud API is the fastest start. Two flavors are available:
- Pay-per-token: $1.40 / 1M input tokens, $0.26 / 1M cached input, $4.40 / 1M output tokens (AI Pricing Guru, June 2026).
- GLM Coding Plan subscription: roughly $3–6/mo (Lite), $15–19/mo (Pro), or $80/mo (Max), with an Anthropic-compatible endpoint that drops into Claude Code or any Anthropic SDK client by overriding ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN (Nerova, June 2026).
The Coding Plan is what most developers actually want — a flat monthly bill, no token math, and full access to the 1M-token context. Start there, profile your usage for a month, and only consider self-hosting once you know whether your workload would burn through Max-tier limits.
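For the "profile your usage" step, a small estimator like the one below turns a month of token counts into a pay-per-token bill using the rates quoted above. The workload numbers in the example are invented purely for illustration, and the function name is hypothetical.

```python
# USD per million tokens, as quoted above (AI Pricing Guru, June 2026).
PRICE_PER_M = {"input": 1.40, "cached_input": 0.26, "output": 4.40}


def monthly_api_cost(input_m: float, cached_input_m: float, output_m: float) -> float:
    """Pay-per-token cost in USD for token volumes given in millions."""
    return (
        input_m * PRICE_PER_M["input"]
        + cached_input_m * PRICE_PER_M["cached_input"]
        + output_m * PRICE_PER_M["output"]
    )


# Hypothetical heavy agent workload: 200M fresh input, 600M cached input, 40M output.
cost = monthly_api_cost(200, 600, 40)
print(f"Estimated pay-per-token bill: ${cost:,.2f}/month")
```

Comparing that figure with the subscription tiers, and with the amortized cost of a local rig, gives a first-order answer on whether self-hosting is worth pursuing.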
Step-by-step install — llama.cpp + Unsloth GGUF
These steps target the 4x 3090 path on Ubuntu 22.04. Mac Studio users can skip the CUDA flags and pass -DGGML_METAL=ON instead.
```bash
# 1. Build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"

# 2. Install huggingface-cli and pull the 2-bit dynamic GGUF
pip install "huggingface_hub[cli]"
huggingface-cli download unsloth/GLM-5.2-GGUF \
  --include "UD-IQ2_M/*" \
  --local-dir ~/models/glm-5-2

# 3. Launch llama-server with MoE offload
#    -ot is the short form of --override-tensor.
#    --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access.
./build/bin/llama-server \
  --model ~/models/glm-5-2/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  -ot "exps=CPU" \
  --threads "$(nproc)" \
  --host 0.0.0.0 --port 8080
```
The -ot "exps=CPU" override is a regex match on tensor names: every tensor whose name contains exps (the Mixture-of-Experts weights) is placed on the host CPU, while attention and dense layers stay on the GPU. This is the configuration Unsloth documents for 24–96 GB VRAM rigs; if you run short of memory, drop --ctx-size first.
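To see which tensors an `exps=CPU` override would catch, the sketch below applies the same pattern to a few tensor names. The names follow llama.cpp's usual `blk.N.*` naming but are illustrative, not read from the actual GLM-5.2 GGUF. Python's `re.search` stands in for llama.cpp's own regex matching, which is an assumption about equivalent behavior for a plain substring pattern like this one. Both helper functions are hypothetical.

```python
import re


def split_override(arg: str) -> tuple[str, str]:
    """Split a 'pattern=buffer' override string at the last '='."""
    pattern, _, buffer_type = arg.rpartition("=")
    return pattern, buffer_type


def placement(tensor_name: str, override: str, default: str = "GPU") -> str:
    """Return the buffer a tensor would land on under a single override rule."""
    pattern, buffer_type = split_override(override)
    return buffer_type if re.search(pattern, tensor_name) else default


# Illustrative tensor names; check your own load log for the real ones.
names = [
    "blk.0.attn_q.weight",
    "blk.3.ffn_gate_exps.weight",
    "blk.3.ffn_up_exps.weight",
    "blk.3.ffn_down_exps.weight",
    "blk.3.ffn_gate_shexp.weight",
]

for name in names:
    print(f"{name:32s} -> {placement(name, 'exps=CPU')}")
```

Note that a name ending in `shexp` (the convention for shared-expert tensors) does not contain `exps`, so under this pattern such tensors would stay on the GPU.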
Verify the server is up:
```bash
curl -s http://localhost:8080/v1/models | jq .
```
Verifying the install
A sanity-check prompt that exercises both reasoning and code:
```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {"role": "user", "content": "Write a Python function that returns the n-th Fibonacci number using iterative DP. Then explain the time and space complexity in one sentence each."}
    ],
    "max_tokens": 400
  }' | jq -r '.choices[0].message.content'
```
You should see a clean iterative function plus two complexity statements (O(n) time, O(1) space) within ~30–60 seconds on a 4x 3090 + 192 GB rig. If output begins inside that window but generation is below 1 tok/s, you are almost certainly bound by host-RAM bandwidth — the next section covers common failure modes.
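To put a number on throughput rather than eyeballing it, a minimal client along these lines can time the same request. It assumes the server returns an OpenAI-style `usage.completion_tokens` field in the response body; verify that against your build. The function names are hypothetical, and the resulting rate is end-to-end, so it includes prompt processing.

```python
import json
import time
import urllib.request


def build_payload(prompt: str, max_tokens: int = 400, model: str = "glm-5.2") -> dict:
    """Chat-completions request body matching the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def tokens_per_second(response: dict, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock time (prompt processing included)."""
    return response["usage"]["completion_tokens"] / elapsed_s


def timed_chat(base_url: str, prompt: str) -> tuple[dict, float]:
    """POST one chat request and return (parsed response, elapsed seconds)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body, time.perf_counter() - start


# Usage against a running server (not executed here):
# body, elapsed = timed_chat("http://localhost:8080", "Explain MoE routing in two sentences.")
# print(f"{tokens_per_second(body, elapsed):.1f} tok/s")
```

Run it a few times with a short and a long prompt: the gap between the two shows how much of the wall time goes to prompt ingestion rather than generation.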
Common errors and fixes
Out of memory at load. Drop one quant level (1-bit if you were on 2-bit), lower --ctx-size to 16384, or add swap. A 256 GB swap file on NVMe is a legitimate temporary workaround, though it cuts throughput by 30–50%.
First token takes 60+ seconds. Prompt processing on long contexts is dominated by attention. Build llama.cpp with -DGGML_CUDA_FORCE_MMQ=ON and keep attention on the GPU: leave --n-gpu-layers as high as your VRAM allows, and lower it only as far as needed to make the model fit.
Generation is below 2 tok/s. Check nvtop. If GPU utilization is near zero during generation, the GPUs are waiting on the offloaded experts, and the limit is host-memory bandwidth or, if the model does not fully fit in RAM, disk paging. In the paging case, keep the model file on NVMe (not a SATA SSD). Also confirm that your -ot override is actually matching by checking the load log for the blk.X.ffn_*_exps.weight tensors being assigned to a CPU buffer.
Kernel panic on Mac Studio. macOS pages aggressively when an app touches more than ~85% of unified memory. Run sudo sysctl iogpu.wired_limit_mb=$((220*1024)) to raise the GPU wired-memory limit to 220 GB before launching llama.cpp. The setting resets on reboot.
Z.ai endpoint returns 401. The Anthropic-compatible endpoint requires ANTHROPIC_AUTH_TOKEN, not ANTHROPIC_API_KEY, for the Coding Plan flow. The OpenAI-compatible endpoint at /api/coding/paas/v4 uses standard Authorization: Bearer headers.
FAQ
Can I run GLM-5.2 on a single RTX 4090?
Not in any practical sense. The 2-bit dynamic GGUF is 239 GB and a 4090 has 24 GB VRAM, so at least 215 GB has to come from somewhere else. With 256 GB of fast host RAM and MoE offload it will technically run, but throughput drops to roughly 1–3 tok/s and prompt processing on long contexts becomes painful.
Is the 2-bit Unsloth quant good enough for coding work?
For most workflows, yes. The 2-bit dynamic GGUF retains around 82% of full-precision accuracy according to Unsloth's measurements (AI Weekly, June 2026). If you need near-lossless behavior, jump to UD-Q4_K_XL or UD-Q5_K_XL, both of which Unsloth describes as generally lossless against the BF16 baseline.
How does this compare to running DeepSeek V4 or Kimi K2.6 locally?
All three are large MoE models in the same self-hosting bracket. DeepSeek V4's full-precision footprint is similar; Kimi K2.6 sits slightly larger. See the DeepSeek V4 guide and the Kimi K2.6 guide for model-specific quant notes. For broader background on open-weight MoE inference, the open-source LLMs landscape covers the field.
Does GLM-5.2 work with Ollama and LM Studio?
Yes. Unsloth's documentation lists llama.cpp, Ollama, LM Studio, and vLLM as supported runtimes for the dynamic GGUFs (Unsloth, 2026). Ollama wraps llama.cpp under the hood, so throughput should be close to a direct llama.cpp build with equivalent settings. LM Studio is the easiest GUI path on macOS.
What about agentic workflows — does GLM-5.2 play well with coding agents?
The model is positioned by Z.ai as a coding-first frontier model with native tool-use and a 1M-token context (Latent.Space, June 2026). The Coding Plan endpoint is Anthropic-compatible, which means Claude Code, Cline, and other Anthropic-SDK agents work out of the box. For a broader review of which agents work best with which models, see our AI coding agents guide.
What to read next
- GLM-5.2 complete guide — model architecture, benchmarks, pricing, comparisons.
- DeepSeek V4 guide — the other big open-weight MoE worth considering.
- Kimi K2.6 guide — Moonshot's frontier release and how it stacks up.
- AI coding agents — pairing GLM-5.2 with Claude Code, Cline, Cursor, and friends.
- Self-hosting LLMs guide — broader patterns for production self-hosted inference.