How to Install and Use GLM-4.7 — Setup Guide (April 2026)

Learn how to install and run GLM-4.7 locally or via API. Complete guide with benchmarks, pricing comparisons, step-by-step installation for 5 methods, and hands-on examples.

How to Install and Use GLM-4.7 — Setup Guide (April 2026)
Install and Use GLM-4.7

Last updated April 2026 — refreshed for current model/tool versions, pricing, and the GLM-4.7 / GLM-4.7-Flash / GLM-5 ecosystem.

GLM-4.7 is Z.ai's December 2025 open-weights flagship: a 358B-parameter Mixture-of-Experts model with about 32B active per token, MIT-licensed, and competitive with Claude Sonnet 4.5 on real coding benchmarks. This guide is the practical, sources-cited setup reference for running it — locally on llama.cpp/Ollama, on a multi-GPU rig with vLLM or SGLang, or via API — and for picking between the full GLM-4.7, the lighter GLM-4.7-Flash (30B-A3B), and the newer GLM-5 family that shipped in February 2026.

What changed since the original post (December 2025 → April 2026)GLM-4.7-Flash shipped on January 20, 2026 — a 30B-A3B MoE that runs on a single 24 GB GPU and scores 59.2 on SWE-bench Verified. It is currently free via the Z.ai API.GLM-5 launched on February 11, 2026: a ~745B-parameter MoE (44B active, 256 experts), trained on Huawei Ascend hardware, with new pricing tiers replacing the $3/month promo. The Coding Plan now starts at ~$10/month (Lite $30/quarter, Pro $90/quarter, Max $240/quarter, with Q2 2026 discounts of 10%).Z.ai API pricing for GLM-4.7 is now $0.60/M input ($0.11/M cached) and $2.20/M output — significantly lower than the $2.80/M figure from launch posts.Cerebras serves GLM-4.7 at ~922–1,500 tokens/sec (Artificial Analysis measures 922 tok/s; Cerebras' own marketing claims up to 1,500), making it the fastest hosted option by ~20× over typical GPU providers.Ollama, llama.cpp, vLLM, and SGLang all have first-class GLM-4.7 support on their main branches; tool-call parser is now glm47 (not glm4_moe) and the reasoning parser remains glm45.Unsloth GGUFs ship dynamic quants from 1-bit (UD-TQ1_0, ~75 GB) to 8-bit, with UD-Q2_K_XL (~135 GB) the recommended local trade-off. The 1-bit variant runs in Ollama on a single high-RAM workstation.

Want the full picture? Read our continuously-updated Open-Source LLMs Landscape (2026) — every notable open-weights model, license, and hosting cost.

TL;DR — which GLM should you run?

Use caseBest pick (April 2026)Why
Cheapest hosted coding agentGLM-4.7 via Z.ai Coding Plan ($30/quarter)Tight integration with Claude Code, Cline, Roo Code; 87.4 on τ²-Bench tool use.
Fastest hosted GLM-4.7Cerebras Inference~922 tok/s output (Artificial Analysis); 20× faster than typical GPU clouds.
API on a budget, pay-per-tokenOpenRouter z-ai/glm-4.7$0.38/M in, $1.74/M out, ~203K context, automatic provider failover.
Local on one 24 GB GPUGLM-4.7-Flash (30B-A3B)Free on Z.ai; 59.2 SWE-bench; 60–220 tok/s on RTX 3090/4090.
Local on a workstation (128 GB+ RAM)GLM-4.7 UD-Q2_K_XL via llama.cpp~135 GB on disk; fits a 24 GB GPU + 128 GB RAM with MoE offloading.
Frontier open weights, no expense sparedGLM-5 (745B, 44B active)State-of-the-art open model as of Feb 2026; needs ≥4× H100 to self-host.

What GLM-4.7 actually is

GLM-4.7 is the December 22, 2025 release from Z.ai (the rebranded Zhipu AI / THUDM). It is a Mixture-of-Experts model with 358B total parameters, ~32B active per forward pass, and a 200K-token context window (203K via OpenRouter; 131,072 max new_tokens per response in the official chat template). The weights are MIT-licensed, published under zai-org/GLM-4.7 on Hugging Face, with FP8 and BF16 variants. The technical report is on arXiv as 2508.06471.

The headline 2025 benchmarks (from the official model card and Z.ai's launch announcement):

  • SWE-bench Verified: 73.8% (+5.8 over GLM-4.6)
  • SWE-bench Multilingual: 66.7% (+12.9 over GLM-4.6)
  • LiveCodeBench v6: 84.9
  • Terminal Bench 2.0: 41.0 (+16.5)
  • τ²-Bench (tool use): 87.4 — highest among open-source models on launch
  • AIME 2025: 95.7
  • HLE (with tools): 42.8 (+12.4)
  • BrowseComp: 67.5

The model introduces three reasoning modes that matter for agent code:

  • Interleaved Thinking — reasoning tokens emitted before each response and before each tool call.
  • Preserved Thinking — keeps the reasoning chain across turns instead of resetting it (a meaningful change for long agent sessions).
  • Turn-level Thinking — toggle reasoning depth per turn via chat_template_kwargs.

Path 1: Use GLM-4.7 via the Z.ai API (5 minutes)

If you do not need on-prem weights, this is the path of least resistance. Z.ai exposes an OpenAI-compatible endpoint, so any client that speaks Chat Completions works.

Get an API key

  1. Sign up at chat.z.ai or docs.z.ai.
  2. Create an API key from the developer console.
  3. Pricing as of April 2026 (per the official pricing page):
    • GLM-4.7 — Input $0.60/M, Cached input $0.11/M, Output $2.20/M
    • GLM-4.7-Flash — currently free for input, cached input, and output
    • GLM-4.6 — same as 4.7 ($0.60 / $2.20)

Minimal Python client

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZAI_KEY",
    base_url="https://api.z.ai/api/paas/v4",
)

resp = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "Write a Python LRU cache with TTL."},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    temperature=1.0,
    top_p=0.95,
)
print(resp.choices[0].message.content)

To turn the chain-of-thought off (cheaper, faster, fewer reasoning tokens), set enable_thinking: false in chat_template_kwargs.

Z.ai Coding Plan vs OpenRouter vs Cerebras

ProviderPricing (per M)SpeedNotes
Z.ai pay-per-token$0.60 in / $2.20 out~50–80 tok/sDirect API, simplest billing.
Z.ai Coding Plan (Lite/Pro/Max)$30 / $90 / $240 per quarterQuarterly subscription with usage caps; ties into Claude Code, Cline, Roo Code, OpenCode.
OpenRouter$0.38 in / $1.74 outvaries by upstreamMulti-provider failover; 203K context.
Cerebras Inferencepremium tier~922 tok/s (AA), up to 1,500 (vendor)Wafer-scale ASIC; for latency-sensitive agent loops.

Path 2: Run GLM-4.7-Flash locally on a single GPU

If you have a 24 GB consumer GPU (RTX 3090, RTX 4090, RTX 4500 Ada, M-series Mac with ≥32 GB unified memory), GLM-4.7-Flash is the right model. It is a 30B-A3B MoE — 30B total, ~3.6B active — released January 20, 2026, MIT-licensed, with the same chat template and tool-call parser as GLM-4.7. Real-world numbers:

  • SWE-bench Verified: 59.2 (vs 22 for Qwen3-30B-A3B-Thinking, 34 for GPT-OSS-20B).
  • RTX 3090, Q4_K_M, 4K context: ~93 tok/s decode, ~2,000 tok/s prefill.
  • RTX 3090, Q4_K_M, 32K context: ~43 tok/s decode, ~620 tok/s prefill.
  • Loads ~23 GB VRAM at 65K context with 4-bit quant.

Quickest setup with Ollama

# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.com/install.sh | sh

# Pull GLM-4.7-Flash (Q4_K_M is the default sweet spot)
ollama pull glm-4.7-flash

# Run it
ollama run glm-4.7-flash

Then point any OpenAI-compatible client (Continue, Cline, Aider, OpenCode) at http://localhost:11434/v1.

llama.cpp directly (more knobs)

# Build llama.cpp from source for the latest GLM-4.7 changes
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# Pull a quant from Unsloth or Bartowski
huggingface-cli download unsloth/GLM-4.7-Flash-GGUF \
  --include "*Q4_K_M*" --local-dir ./glm47flash

# Serve it (OpenAI-compatible endpoint on :8080)
./build/bin/llama-server \
  -m ./glm47flash/GLM-4.7-Flash-Q4_K_M.gguf \
  --jinja \
  --temp 1.0 --top-p 0.95 \
  -c 32768 \
  -ngl 99

Two non-obvious flags that catch people:

  • --jinja is mandatory — without it the GLM-4.7 chat template is not applied and tool calls and thinking blocks come out malformed.
  • -ot ".ffn_.*_exps.=CPU" offloads only the MoE expert layers to system RAM while keeping attention/embeddings on GPU. Use it when you cannot fit the full quant in VRAM.

Path 3: Run the full GLM-4.7 (358B) locally

This is realistic only with a 24 GB GPU + 128 GB RAM (or more), an Apple Silicon Mac with ≥256 GB unified memory, or a multi-GPU rig. Even then, expect 1–20 tokens/sec on most consumer setups. The Hacker News thread on the launch is candid about this: a dual Xeon with 768 GB RAM gets ~1–2 tok/s; an M3 Ultra gets ~20 tok/s on a 4-bit quant. For interactive use, the API or Cerebras wins on cost-per-token.

If you still want to self-host, use Unsloth's dynamic GGUFs:

QuantDiskMin VRAM + RAMNotes
UD-TQ1_0 (1-bit)~75 GBany GPU + 96 GB RAMNative Ollama support; quality drop noticeable.
UD-Q2_K_XL (2-bit)~135 GB24 GB VRAM + 128 GB RAMRecommended balance.
Q4_K_XL (4-bit)~210 GB40 GB VRAM + 165 GB RAMCloser to BF16 quality; needs an A100/H100 or Mac Studio Ultra.
BF16 full~720 GB8× H100 or equivalentProduction deployment.

llama.cpp run command (Q2 quant)

./build/bin/llama-cli \
  -m ./GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
  --jinja \
  --temp 1.0 --top-p 0.95 \
  --ctx-size 32768 \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"

Ollama (1-bit only on most consumer hardware)

OLLAMA_MODELS=unsloth ollama run hf.co/unsloth/GLM-4.7-GGUF:TQ1_0

Path 4: vLLM and SGLang for multi-GPU production

Both engines support GLM-4.7 on their main branches as of April 2026. Use the FP8 weights for the best memory/quality trade-off on H100s.

vLLM

pip install -U vllm --pre \
  --index-url https://pypi.org/simple \
  --extra-index-url https://wheels.vllm.ai/nightly

vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-fp8

SGLang

docker pull lmsysorg/sglang:dev

python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --tp-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --served-model-name glm-4.7-fp8 \
  --host 0.0.0.0 --port 8000

Both expose OpenAI-compatible endpoints. Pair them with the broader local-agent stack — for example, the OpenClaw + Ollama setup guide for running local AI agents walks through wiring a self-hosted endpoint into a Claude-Code-style coding agent, which is the most common reason teams self-host GLM-4.7 in the first place.

Configuring thinking mode and tool calling

By default vLLM and SGLang enable thinking mode. Disable it for short, latency-sensitive tasks:

{
  "model": "glm-4.7",
  "messages": [...],
  "chat_template_kwargs": {
    "enable_thinking": false
  }
}

For multi-turn agents where you want the reasoning chain to persist across turns (Preserved Thinking), set:

"chat_template_kwargs": {
  "enable_thinking": true,
  "clear_thinking": false
}

Tool calling uses the OpenAI tools schema. The vLLM/SGLang flags --tool-call-parser glm47 and --reasoning-parser glm45 are required — using the older glm4_moe parser will silently break tool dispatch on some prompts.

Decision tree: which engine for which workload

  • You want answers in 5 minutes → Z.ai API + the Python snippet above. Done.
  • You want the cheapest hosted coding agent → Z.ai Coding Plan Lite ($30/quarter), then plug Claude Code, Cline, or Roo Code into it.
  • You need 1,000+ tok/s for a production agent loop → Cerebras Inference.
  • You have a 24 GB GPU and want offline coding → GLM-4.7-Flash through Ollama.
  • You have a workstation with 128 GB RAM and a 24 GB GPU → llama.cpp + UD-Q2_K_XL with MoE offload. Expect 5–15 tok/s.
  • You have multi-GPU H100s → vLLM with the FP8 weights and speculative MTP. This is what production deployments at small AI labs are using.
  • You want frontier open-weights and the hardware budget → look at GLM-5 (Feb 2026, 745B/44B active) instead. It surpasses GLM-4.7 on most coding benchmarks.

Independent benchmarks (April 2026)

The numbers Z.ai publishes match what third parties measure on most workloads, but a few real-world data points are worth knowing:

  • Artificial Analysis ranks Cerebras' GLM-4.7 endpoint at ~922 tok/s output, vs ~33.8 tok/s on Amazon Bedrock — a 27.2× spread. See artificialanalysis.ai/models/glm-4-7.
  • Code Arena (LMArena's coding leaderboard): GLM-4.7 was the #1 open-source entry on launch and remained in the top 5 overall through Q1 2026.
  • WebDev Arena: #6 overall, #1 open model.
  • Practitioner reports on r/LocalLLaMA and Hacker News are mixed in the way you would expect from a model this large: very strong on UI/web generation and tool calls; weaker than Qwen3-30B-A3B-Thinking on some pure-logic LiveBench problems; "best of any 70B-or-less model" feedback for the Flash variant.
  • GLM-4.7 vs Claude Sonnet 4.5: GLM-4.7 wins on LiveCodeBench v6 (84.9 vs 64.0) and ties or trails on SWE-bench Verified (73.8 vs ~77 for Sonnet 4.5). Sonnet 4.5 is still preferred for "code taste" by many testers; GLM-4.7 is preferred for cost.

Common pitfalls and troubleshooting

  • Forgetting --jinja in llama.cpp is the single most common bug report. The chat template encodes both the thinking blocks and the tool-call format; without it, you will see literal <|assistant|> tokens in the output and tool calls will not parse.
  • Wrong tool-call parser: --tool-call-parser glm47 is the correct parser for GLM-4.7 and 4.7-Flash. The older glm4_moe parser was for GLM-4.5/4.6.
  • Underestimating MoE memory: even though only 32B parameters are active, all 358B must be loaded. Active-parameter count tells you compute cost, not memory cost.
  • Ollama hangs at first prompt: usually means the model is paging from disk because RAM is full. Either drop to a smaller quant or add swap (and accept the speed penalty).
  • Loop / repetition at long contexts (especially Flash, pre-January 21, 2026): Unsloth re-uploaded all GLM-4.7-Flash quants on January 21 to fix this. Re-pull if you grabbed yours earlier.
  • Cheap third-party providers serving degraded quants: a recurring complaint on HN is that some OpenRouter upstreams serve INT4 quants of GLM-4.7 without disclosure, leading to noticeably worse outputs than Z.ai direct. Pin to a known-good provider when quality matters.
  • Tool calls inside thinking blocks: GLM-4.7 emits tool calls only after the thinking block closes. If your harness streams partial JSON during reasoning, you will see false "no tool call" errors. Wait for the finish_reason.

GLM-4.7 vs the competitive landscape (April 2026)

ModelTotal / Active paramsLicenseContextSWE-bench VIndicative price ($/M out)
GLM-4.7358B / 32B MoEMIT200K73.8$2.20
GLM-4.7-Flash30B / ~3.6B MoEMIT131K59.2$0 (free)
GLM-5~745B / 44B MoEMIT200KSOTA open~$2.56 (OpenRouter)
Claude Sonnet 4.5proprietary200K~77.2$15
DeepSeek V4~671B / 37B MoEMIT128K~70$1.10
Qwen3-30B-A3B-Thinking30B / 3B MoEApache-2.0256K~22$0–0.30
GPT-5.5 (closed)proprietary~256K~78~$20

When self-hosting actually makes sense

The honest answer in 2026: most teams should use the API. The break-even on self-hosting GLM-4.7 against Z.ai's $2.20/M output pricing is a steady ~50M+ output tokens/month before electricity, hardware amortization, and operator time start beating the cloud bill — and that ignores the 27× speed advantage Cerebras has over consumer GPUs. Self-hosting wins on three specific axes: regulatory data-residency requirements, sustained agent loops where Cerebras-class throughput is unaffordable, and R&D where you need to fine-tune. If you are evaluating whether to staff up for that workload, Codersera's network of vetted remote ML and infrastructure engineers is built around exactly this kind of work.

FAQ

Is GLM-4.7 free to run?

The weights are MIT-licensed, so yes — you can run them locally for free. The Z.ai API charges $0.60/M input and $2.20/M output for GLM-4.7, but GLM-4.7-Flash is currently free on the Z.ai API. Self-hosting only saves money at very high sustained volume.

What hardware do I need to run GLM-4.7 locally?

For the full 358B model: minimum 24 GB VRAM + 128 GB system RAM with the UD-Q2_K_XL quant and MoE offloading, expect 5–15 tok/s. For GLM-4.7-Flash (the 30B-A3B variant): a single 24 GB consumer GPU is enough and you will see 60–220 tok/s.

Should I use GLM-4.7 or GLM-5?

GLM-5 (released February 11, 2026, 745B/44B active) outperforms GLM-4.7 on most coding and agentic benchmarks but needs significantly more hardware to self-host — 8× H100 or equivalent. If you are using a hosted API and price is similar, GLM-5 is the better default for new projects. If you are self-hosting, GLM-4.7 or GLM-4.7-Flash are the practical picks.

How does GLM-4.7 compare to Claude Sonnet 4.5 for coding?

GLM-4.7 wins on LiveCodeBench v6 (84.9 vs 64.0) and is competitive on SWE-bench Verified (73.8 vs ~77.2 for Sonnet 4.5). On subjective code-quality / "taste" tests, Sonnet 4.5 is still preferred by many practitioners. On pure cost, GLM-4.7 is roughly 7× cheaper per output token.

What is the difference between GLM-4.7 and GLM-4.7-Flash?

GLM-4.7 is the 358B/32B-active flagship. GLM-4.7-Flash is a 30B/3.6B-active distilled variant designed for single-GPU inference. They share the same chat template, tool-call format, and reasoning modes — Flash is the right choice when you cannot fit the full model and the right choice for cost-sensitive batch agents (it is currently free on Z.ai).

Does GLM-4.7 support tool calling?

Yes — it uses the standard OpenAI tools schema. In vLLM and SGLang you must pass --tool-call-parser glm47 (not the older glm4_moe) and --reasoning-parser glm45. τ²-Bench score on tool use is 87.4, the highest among open-source models on launch.

What is "Preserved Thinking"?

It is GLM-4.7's mode that keeps the chain-of-thought tokens in the conversation history across turns instead of discarding them after each reply. For long agent sessions this means the model can refer back to its earlier reasoning, which materially improves multi-step task completion at the cost of more context tokens.

Why does my llama.cpp run produce garbled output?

Almost always because --jinja was omitted. GLM-4.7 uses a custom chat template that encodes thinking blocks and tool-call delimiters; without --jinja, llama.cpp falls back to the legacy template and the output looks broken.

References & further reading