GLM-5.2 complete guide (2026)

Z.ai's GLM-5.2 is the leading open-weights LLM on the Artificial Analysis Intelligence Index v4.1. 744B params (40B active), 1M-token context, MIT-licensed weights. Architecture, benchmarks, pricing, and a 3-path local-inference playbook.

Published 16 Jun 2026 • Updated 06 Jul 2026 • 14 min read

Quick answer. GLM-5.2 is Z.ai's flagship open-weights LLM released on 13 June 2026. It is a 744B-parameter Mixture-of-Experts model (~40B active per token) with a usable 1M-token context window, MIT-licensed weights, and two reasoning-effort levels. On the Artificial Analysis Intelligence Index v4.1 it scores 51, the highest of any open-weights model to date, and it is reported to match Claude Opus 4.8 and beat GPT-5.5 on several long-horizon coding benchmarks.

Update (July 2026): there is no newer GLM release yet, so GLM-5.2 remains Z.ai's flagship. Two additions since publication. First, OpenRouter now routes GLM-5.2 at $0.56/M input and $1.76/M output, cheaper than z.ai direct at $1.40/$4.40. Second, for local runs, mainline llama.cpp uses a dense-attention fallback because the sparse DSA path isn't supported yet (issue #24730), so long-context throughput is below the advertised numbers.

Also in this series

This pillar sits alongside our other frontier-model deep-dives. If you are evaluating GLM-5.2 you probably want to compare it against:

DeepSeek V4 complete guide (2026) — the other MIT-licensed MoE giant.
Kimi K2.6 complete guide (2026) — Moonshot's agent-swarm flagship.
Claude Opus 4.7 complete guide (2026) — the closed-frontier reference Anthropic ships.
GPT-5.5 complete guide (2026) — OpenAI's current flagship.
Open-source LLMs landscape (2026) — the field, including Llama 4 and the long tail.

What is GLM-5.2?

GLM-5.2 is the flagship model in the GLM-5 family from Z.ai (the consumer-facing brand of Zhipu AI, the Tsinghua-spinout team behind the General Language Model lineage). It was announced on 13 June 2026 and the MIT-licensed open weights, standalone API, and chatbot rolled out across the following week (MarkTechPost, 14 June 2026).

The pitch is coding-first agentic work at open-weights pricing. GLM-5.2 ships with a usable 1M-token context window, two reasoning-effort levels ("thinking" and "max thinking"), and the same MIT license that has made earlier GLM checkpoints popular for self-hosted deployments. It is delivered through the GLM Coding Plan (Lite / Pro / Max / Team tiers) and through Z.ai's metered API, and the open weights are published under zai-org/GLM-5.2 on Hugging Face.

For Codersera readers the headline shift is the leverage. Previously, teams chasing Opus-4-class quality had to either pay closed-API rates or accept a quality gap in exchange for self-hosting. GLM-5.2 narrows that gap meaningfully while keeping weights you can actually run inside your own VPC.

How does GLM-5.2 fit in the GLM lineage?

The GLM family started as a bilingual (English-Chinese) research line out of Tsinghua University in 2021 and has shipped roughly two major versions per year since. The GLM-5 generation in particular has been positioned around "vibe coding to agentic engineering" (the title of the GLM-5 family arXiv paper). GLM-5.0 introduced the modern MoE shape; GLM-5.1 raised the context to 200K and tightened tool-use; GLM-5.2 is the agentic-coding flagship with the jump to a 1M context window and substantially better long-horizon scores. On Artificial Analysis's Intelligence Index v4.1, the version-over-version delta from GLM-5.1 (40) to GLM-5.2 (51) is +11 points — a larger jump than most minor-version releases (Artificial Analysis, 17 Jun 2026).

Architecture: MoE shape, parameters, context window

GLM-5.2 is a Mixture-of-Experts transformer.

Total parameters: 744 billion (some reporting cites 753B; both figures appear across vendor and reseller pages — Artificial Analysis uses 744B / 40B active in its model card).
Active parameters per token: ~40B.
Context window: 1,000,000 tokens (variant glm-5.2[1m]); 5x the 200K window in GLM-5.1.
Max output tokens: 131,072 per response.
Reasoning modes: two thinking-effort levels exposed at the API, letting callers trade latency for quality.
License: MIT, weights on Hugging Face (zai-org/GLM-5.2).

Sources for the architecture claims above: Artificial Analysis model card, MarkTechPost launch report (14 June 2026), and Unsloth's GLM-5.2 run-locally guide.

Internally, the architecture introduces an updated multi-token-prediction (MTP) layer for speculative decoding and an "IndexShare" component that vendor write-ups describe as a routing/sharing optimisation across experts. Z.ai has not published the full technical report at the time of writing, so these structural notes should be treated as vendor-confirmed but independently unverified.

Benchmarks vs Opus 4.8, GPT-5.5, DeepSeek V4, Kimi K2.7

Z.ai notably published no benchmark numbers at launch, which is unusual for a flagship release (MarkTechPost, 14 June 2026). The numbers below come from third-party evaluations (Artificial Analysis), VentureBeat reporting, and community reviews.

Benchmark	GLM-5.2	Comparator	Source
AA Intelligence Index v4.1	51 (leading open-weights)	Claude Opus 4.8: 56 · GPT-5.5 (xhigh): 55 · MiniMax-M3 / DeepSeek V4 Pro: 44	Artificial Analysis, 17 Jun 2026
GDPval-AA v2 (agentic)	1524	GPT-5.5 (xhigh): 1514 · MiniMax-M3: 1418 · DeepSeek V4 Pro max: 1328	Artificial Analysis, 17 Jun 2026
AA-Briefcase (long-horizon knowledge work)	1266	Claude Fable 5: 1587 · Claude Opus 4.8 (max): 1356	Artificial Analysis, 17 Jun 2026
SWE-bench Pro (coding)	62.1	GPT-5.5: 58.6	VentureBeat, Jun 2026
FrontierSWE	74.4%	GPT-5.5: 72.6% · Claude Opus 4.8: 75.1%	VentureBeat, Jun 2026
Terminal-Bench 2.1 (agentic shell)	81.0	Claude Opus 4.8: 85.0 · GLM-5.1: 62.0	The Decoder, Jun 2026
MCP-Atlas (tool-use)	77.0	GPT-5.5: 75.3 · Claude Opus 4.8: 77.8	The Decoder, Jun 2026
AIME 2026 (math)	99.2 (reported)	—	Unsloth docs
GPQA-Diamond (science)	91.2 (reported)	—	Unsloth docs
HLE (with tools)	40 (+12 vs GLM-5.1) (reported)	—	Artificial Analysis, 17 Jun 2026
CritPt (science)	21 (+16 vs GLM-5.1) (reported)	—	Artificial Analysis, 17 Jun 2026

The headline reading: GLM-5.2 leads open-weights on every public agentic and coding benchmark we could find, and is competitive with — though not ahead of — the frontier closed models (Claude Fable 5, Claude Opus 4.8) on the broader Intelligence Index. On long-horizon coding specifically (SWE-bench Pro, FrontierSWE) it edges past GPT-5.5 and lands near Opus 4.8.

One known efficiency caveat: Artificial Analysis flags that GLM-5.2 burns roughly 43k output tokens per task in their evaluation harness, versus ~24k for MiniMax-M3 and ~35k for Kimi K2.6 — so the strong intelligence numbers come at a token-efficiency cost (Artificial Analysis).
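To put the token-efficiency gap in concrete terms, the sketch below compares the per-task output-token figures quoted above and converts them into decode time. The decode rate passed to `generation_seconds` is a hypothetical input, not a measured number; substitute whatever your endpoint actually sustains.

```python
# Output tokens per task, as quoted from Artificial Analysis in this guide.
OUTPUT_TOKENS_PER_TASK = {
    "GLM-5.2": 43_000,
    "Kimi K2.6": 35_000,
    "MiniMax-M3": 24_000,
}


def relative_overhead(model: str, baseline: str) -> float:
    """How many times more output tokens `model` emits per task than `baseline`."""
    return OUTPUT_TOKENS_PER_TASK[model] / OUTPUT_TOKENS_PER_TASK[baseline]


def generation_seconds(model: str, tokens_per_second: float) -> float:
    """Time spent decoding one task's output at an assumed steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return OUTPUT_TOKENS_PER_TASK[model] / tokens_per_second
```

With these figures, `relative_overhead("GLM-5.2", "MiniMax-M3")` is about 1.8, and at an assumed 50 tokens/second a single GLM-5.2 task spends 860 seconds decoding.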

How to run GLM-5.2 (three paths)

Three deployment paths cover most production needs. We list them from lowest to highest operational lift.

Path 1: Z.ai API (cloud)

The lowest-friction path. Z.ai's metered API exposes GLM-5.2 directly at $1.40 / 1M input tokens, $0.26 / 1M cached-input tokens, and $4.40 / 1M output tokens (Artificial Analysis pricing). That puts output cost at roughly 5x–8x cheaper than Claude Opus 4.8 and ~1/6 of GPT-5.5 Pro for equivalent workloads (VentureBeat, Jun 2026).

The model is also routed through OpenRouter and other aggregators if you want OpenAI-compatible request shapes without re-keying. For coding-agent-heavy usage the GLM Coding Plan subscription (Lite $18/mo → Pro/Max/Team) is often cheaper than the metered API.
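For budgeting, the per-token prices quoted in this guide can be turned into a per-request estimate. The sketch below is a rough calculator, not a billing tool: the `PriceSheet` class and `request_cost` function are names invented for this example, the OpenRouter entry has no cached rate because none is quoted here, and all prices should be re-checked with the provider before you rely on them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriceSheet:
    """Prices in USD per 1M tokens."""
    input: float
    output: float
    cached_input: Optional[float] = None  # None when no cached rate is quoted


# Figures as quoted in this guide.
ZAI_DIRECT = PriceSheet(input=1.40, output=4.40, cached_input=0.26)
OPENROUTER = PriceSheet(input=0.56, output=1.76)


def request_cost(prices: PriceSheet, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request; cached_tokens is the cached share of the input."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    # Fall back to the normal input rate when no cached rate is known.
    cached_rate = prices.cached_input if prices.cached_input is not None else prices.input
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * prices.input
             + cached_tokens * cached_rate
             + output_tokens * prices.output)
    return total / 1_000_000
```

As a sanity check, a task that emits 43,000 output tokens costs about $0.19 in output charges at the direct Z.ai rate.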

Path 2: Unsloth GGUF (locally, with quantisation)

Unsloth has published dynamic GGUF quants that compress the 1.51 TB BF16 checkpoint to as little as 217 GB:

Quant	Disk size	Memory required (RAM + VRAM)
UD-IQ1_S (1-bit dynamic)	217 GB	~223 GB
UD-IQ2_M (2-bit dynamic)	239 GB	~245 GB
4-bit	—	372–475 GB
8-bit	—	~810 GB
Full BF16	1.51 TB	—

According to Unsloth, the 2-bit dynamic quant retains roughly 82% of the BF16 accuracy across their evaluation set while being 84% smaller (Unsloth docs). Realistic configurations:

Mac with 256 GB unified memory — the 2-bit quant fits directly. Practical for solo developers running tools like llama.cpp or LM Studio.
1×24 GB GPU + 256 GB system RAM — 2-bit quant with MoE offloading to RAM. Best price/performance for a single workstation.
Multi-GPU rigs (e.g., 4×3090 + ~192 GB RAM) — feasible per community reports on r/LocalLLaMA, with throughput in the single-digit tokens/second range.
CPU-only on dual-Xeon + ~768 GB RAM — runnable but slow; reasonable for batch inference rather than interactive use.

Per Unsloth, expect roughly 3–9 tokens/second on consumer hardware with 2-bit quants depending on memory bandwidth and offloading strategy (Unsloth docs). Community-reported throughput on 4×3090 setups falls inside that envelope. We have not independently re-benchmarked these numbers.
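A quick way to sanity-check a machine against the quant table is to compare combined RAM and VRAM with the memory column. The sketch below does only that; it uses the table's figures as-is (taking the upper bound for 4-bit), and the `headroom_gb` parameter is a hypothetical knob for OS and KV-cache overhead that the table does not break out.

```python
from typing import Optional

# "Memory required (RAM + VRAM)" in GB, from the Unsloth table above.
QUANT_MEMORY_GB = {
    "UD-IQ1_S": 223,
    "UD-IQ2_M": 245,
    "4-bit": 475,  # upper end of the 372-475 GB range
    "8-bit": 810,
}


def fits(quant: str, vram_gb: float, ram_gb: float, headroom_gb: float = 0.0) -> bool:
    """True if the quant's stated memory need fits in combined VRAM + RAM."""
    return QUANT_MEMORY_GB[quant] + headroom_gb <= vram_gb + ram_gb


def largest_fitting_quant(vram_gb: float, ram_gb: float,
                          headroom_gb: float = 0.0) -> Optional[str]:
    """Highest-memory (roughly, highest-precision) quant that fits, or None."""
    candidates = [q for q in QUANT_MEMORY_GB if fits(q, vram_gb, ram_gb, headroom_gb)]
    return max(candidates, key=QUANT_MEMORY_GB.get) if candidates else None


def size_reduction(quant_disk_gb: float, full_disk_gb: float) -> float:
    """Fraction by which the quant is smaller than the full checkpoint."""
    return 1.0 - quant_disk_gb / full_disk_gb
```

For the configurations listed above, a 1×24 GB GPU with 256 GB of RAM resolves to `UD-IQ2_M`, and `size_reduction(239, 1510)` reproduces the roughly 84% figure Unsloth quotes for the 2-bit quant.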

Path 3: vLLM or SGLang on a fully-loaded server

For production self-hosting (high QPS, low latency, batch serving), vLLM with the BF16 or FP8 weights is the canonical path. This is the route to use if you are operating GLM-5.2 inside a VPC for data-residency or compliance reasons — for example, a regulated codebase you cannot send to a third-party API.

Weight memory at FP8 is approximately 744 GB; BF16 is approximately 1.488 TB, before KV-cache and ~10–20% runtime overhead. Practical configurations for production serving start at 16×H100 80GB for FP8 with comfortable headroom, with 8×B200-class hardware as an alternative (Spheron deployment guide). On vLLM, the relevant flag is --enable-expert-parallel; on SGLang, expert parallelism is controlled by --ep-size (older releases used --enable-ep-moe), so check the server arguments for the version you run. Both runtimes distribute experts across GPUs and route tokens via NVLink/NVSwitch.
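The weight-memory figures follow directly from parameter count times bytes per parameter. The sketch below reproduces that arithmetic and derives a lower bound on GPU count; it deliberately ignores KV-cache, and the 20% default overhead is simply the upper end of the range quoted above, so treat the result as a floor rather than a sizing recommendation.

```python
import math

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_memory_gb(total_params_billion: float, dtype: str) -> float:
    """Weight footprint in GB: 1e9 params x bytes per param / 1e9 bytes per GB."""
    return total_params_billion * BYTES_PER_PARAM[dtype]


def min_gpus(total_params_billion: float, dtype: str, gpu_memory_gb: float,
             overhead_fraction: float = 0.20) -> int:
    """Smallest GPU count whose combined memory holds weights plus runtime overhead.

    KV-cache is NOT included, so real deployments need more than this floor.
    """
    needed_gb = weight_memory_gb(total_params_billion, dtype) * (1 + overhead_fraction)
    return math.ceil(needed_gb / gpu_memory_gb)
```

For 744B parameters, `min_gpus(744, "fp8", 80)` returns 12, which is consistent with 16×H100 80GB being described as having comfortable headroom once KV-cache and whole-node rounding are accounted for.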

The MIT license allows commercial deployment without revenue thresholds. For agentic workloads, GLM-5.2 supports the OpenAI-compatible function-calling format and integrates cleanly into Claude Code, Cline, Cursor, and other agent harnesses by pointing the base URL at your self-hosted endpoint and exposing the model id (apidog integration guide). In our reading of the public benchmarks, tool-call parsing is robust across multi-step sessions; the 81.0 Terminal-Bench 2.1 score is a useful proxy for shell-agent reliability.
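On the harness side, the OpenAI-compatible function-calling format boils down to a tool schema sent with the request and a dispatcher that turns each returned tool call into a `tool` message. The sketch below shows that loop step in isolation, with no network calls. The `run_tests` tool and the registry are hypothetical and exist only for illustration; the message shapes follow the OpenAI chat-completions convention, which this guide reports GLM-5.2 as supporting.

```python
import json


def run_tests(path: str) -> dict:
    """Hypothetical tool: a real agent would invoke a test runner here."""
    return {"path": path, "passed": True}


# Tool schema to pass as `tools=` in a chat-completions request.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite under a path.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

REGISTRY = {"run_tests": run_tests}


def dispatch_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the `tool` message to append to the chat."""
    fn = tool_call["function"]
    handler = REGISTRY.get(fn["name"])
    if handler is None:
        result = {"error": "unknown tool: " + fn["name"]}
    else:
        try:
            # Arguments arrive as a JSON-encoded string, not a dict.
            args = json.loads(fn["arguments"] or "{}")
            result = handler(**args)
        except (json.JSONDecodeError, TypeError) as exc:
            result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }
```

Returning errors as tool results, rather than raising, lets the model see the failure and retry with corrected arguments, which matters in long multi-step sessions.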

If you want a deeper walkthrough of self-hosting at this scale across model families, our self-hosting LLMs complete guide (2026) covers cluster sizing, KV-cache tuning, and the choice between vLLM, SGLang, and TGI for MoE workloads. Our AI coding agents complete guide (2026) covers the harness side — what to wire GLM-5.2 into once it is serving.

What use cases is GLM-5.2 best at?

From the benchmark profile and early community reports, GLM-5.2 is most differentiated on three workload shapes.

Long-horizon coding agents. SWE-bench Pro (62.1) and FrontierSWE (74.4%) are explicitly long-horizon multi-file benchmarks; the 81.0 Terminal-Bench 2.1 result reinforces that this model holds context across many tool calls. If you are running an agent that needs to make 20+ sequential edits across a real codebase, GLM-5.2 is currently the strongest open-weights option.
Repository-scale comprehension. The 1M context window is the practical enabler. You can drop in a mid-sized service (~50K LOC of dense code, or ~150K LOC of typical TypeScript) and have the model reason over it in a single call rather than chunking through retrieval.
Self-hosted MCP tool-use. The 77.0 MCP-Atlas score is just below Claude Opus 4.8's 77.8, and the MIT license means you can stand it up behind your firewall without sending tool-call payloads to a third-party endpoint. That combination is hard to find elsewhere in mid-2026.
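Before relying on single-call repository comprehension, it helps to estimate whether a codebase fits in the window at all. The sketch below uses a crude characters-per-token heuristic; the 4.0 ratio, the file-extension list, and the assumption that reserved output tokens count against the 1M window are all assumptions made for this example, and a real tokenizer count will differ.

```python
import os

CHARS_PER_TOKEN = 4.0        # rough assumption; varies by language and code style
CONTEXT_TOKENS = 1_000_000   # glm-5.2[1m] window quoted in this guide
MAX_OUTPUT_TOKENS = 131_072  # per-response cap quoted in this guide
SKIP_DIRS = {".git", "node_modules", "dist", "build"}


def tokens_from_chars(char_count: int) -> int:
    """Convert a character count to an approximate token count."""
    return int(char_count / CHARS_PER_TOKEN)


def estimate_repo_tokens(root: str,
                         extensions=(".py", ".ts", ".tsx", ".js", ".go")) -> int:
    """Walk a source tree and estimate the tokens in files with matching extensions."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune vendored and generated directories in place.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    total_chars += len(fh.read())
    return tokens_from_chars(total_chars)


def fits_in_context(repo_tokens: int, prompt_tokens: int = 2_000,
                    reserved_output: int = MAX_OUTPUT_TOKENS) -> bool:
    """True if repo + prompt + reserved output stay within the context window."""
    return repo_tokens + prompt_tokens + reserved_output <= CONTEXT_TOKENS
```

If the estimate lands near the limit, count with the model's actual tokenizer before committing to a no-retrieval design.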

Workloads where GLM-5.2 is not the obvious pick: anything vision-heavy (no native vision support at launch), latency-sensitive interactive use cases where its high output-token consumption (43k/task on AA-Briefcase) hurts wall-clock time, and use cases needing the absolute frontier on hard reasoning, where Claude Fable 5 still leads.

What are GLM-5.2's limitations?

Four honest constraints to plan around.

No vision modality at launch. Confirmed by Jeremy Howard's hands-on assessment (see Real-world impressions below) and by the model card on Hugging Face. For multimodal pipelines you will still need a vision-capable model in the chain.
Token-hungry on long agents. Artificial Analysis measures ~43k output tokens per AA-Briefcase task vs ~24k for MiniMax-M3. At $4.40 / 1M output tokens that is still cheap in absolute terms, but it shows up as latency in interactive agents.
Self-identification quirk. GLM-5.2 sometimes insists it is Claude. Documented, harmless to code-correctness, but you will want to guard system prompts that rely on model self-identification.
Sparse first-party documentation. Z.ai published no benchmark report at launch and the full technical report had not appeared at the time of writing. Most of the technical detail in the wild today comes from third parties; depending on your compliance posture this can be a procurement friction.

How do you call the GLM-5.2 API?

The Z.ai API is OpenAI-SDK-compatible. Once you have a key from the GLM Coding Plan or the metered API, the minimum-viable Python call looks like this:

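from openai import OpenAI

# The stock OpenAI SDK works once base_url points at Z.ai's endpoint.
client = OpenAI(
    api_key="YOUR_ZAI_KEY",  # placeholder; load from an env var or secret store in real code
    base_url="https://api.z.ai/api/paas/v4",
)

resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "Refactor this Express handler to async/await..."},
    ],
    temperature=0.2,  # low temperature keeps refactoring output more repeatable
)
# The reply text is on the first choice.
print(resp.choices[0].message.content)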

For the 1M-context variant, swap glm-5.2 for glm-5.2[1m] and set the appropriate max_tokens for the response (up to 131,072). To use the higher reasoning mode, pass the vendor-specific thinking-effort parameter as documented in the Z.ai API reference. If you are routing through OpenRouter, the model id is z-ai/glm-5.2 and the rest of the call shape is unchanged.
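Switching between the direct API, the 1M-context variant, and OpenRouter only changes the base URL and model id, so it is convenient to keep that mapping in one place. The sketch below builds request parameters without making a network call. The `build_request` helper and the `PROVIDERS` table are names invented for this example, the OpenRouter base URL is OpenRouter's standard OpenAI-compatible endpoint, and no 1M-context id is listed for OpenRouter because this guide does not give one. The thinking-effort parameter is left out for the same reason.

```python
from typing import Any, Dict, List

MAX_OUTPUT_TOKENS = 131_072  # per-response cap quoted in this guide

PROVIDERS: Dict[str, Dict[str, Any]] = {
    "zai": {
        "base_url": "https://api.z.ai/api/paas/v4",
        "model": "glm-5.2",
        "long_context_model": "glm-5.2[1m]",
    },
    "openrouter": {
        "base_url": "https://openrouter.ai/api/v1",
        "model": "z-ai/glm-5.2",
        "long_context_model": None,  # not specified in this guide
    },
}


def build_request(provider: str, messages: List[Dict[str, str]],
                  long_context: bool = False, max_tokens: int = 4096) -> Dict[str, Any]:
    """Return the base URL plus chat-completions kwargs for the chosen provider."""
    if provider not in PROVIDERS:
        raise ValueError("unknown provider: " + provider)
    if not 1 <= max_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError("max_tokens must be between 1 and %d" % MAX_OUTPUT_TOKENS)
    cfg = PROVIDERS[provider]
    model = cfg["long_context_model"] if long_context else cfg["model"]
    if model is None:
        raise ValueError("no long-context model id known for " + provider)
    return {
        "base_url": cfg["base_url"],
        "params": {"model": model, "messages": messages, "max_tokens": max_tokens},
    }
```

With the OpenAI SDK installed, the result plugs in as `OpenAI(api_key=key, base_url=req["base_url"]).chat.completions.create(**req["params"])`.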

Real-world impressions

Two community signals worth flagging.

Jeremy Howard's read. Fast.ai co-founder Jeremy Howard shared in mid-June 2026 that for his use he found GLM-5.2 "at least as good as Opus 4.8 and GPT 5.5", with the main gap being its lack of vision support. The post drew several thousand engagements and is consistent with the third-party Artificial Analysis numbers above.

The Claude-identity quirk. Developer Cooper (@peakcooper) demonstrated that GLM-5.2 sometimes insists, when asked, that it is Claude from Anthropic — and refuses to update its self-identification even after being shown its local agent config. This is consistent with patterns seen in earlier GLM checkpoints and in DeepSeek V3 (which sometimes called itself ChatGPT). The most-cited explanation is training-data contamination from Claude responses scraped from the web or used during instruction-tuning; this remains an unverified hypothesis. It is not a correctness issue for code or reasoning, but it is something to handle if you embed model-identification into system prompts or guardrails.

Together these say something useful: GLM-5.2 is real, the quality is in the same neighbourhood as the closed frontier, and the rough edges are operational — not capability — concerns.

Leaderboard update (June 2026). GLM-5.2 also took the #1 spot on Design Arena for web and UI generation, posting an Elo of 1360 to top the board ahead of the field. That result lines up with its #1 open-weights placement on the Artificial Analysis Intelligence Index, and for teams generating front-end or design code it is the most directly relevant third-party signal to date.

GLM-5.2 vs other frontier open-weights

How does GLM-5.2 line up against the other open-weights models a serious team would shortlist in mid-2026?

Model	Params (total / active)	Context	License	AA Intelligence Index
GLM-5.2	744B / 40B	1M	MIT	51
DeepSeek V4 Pro	1.6T / 49B	1M	MIT	44
DeepSeek V4 Flash	284B / 13B	1M	MIT	—
Kimi K2.6 / K2.7	~1T-class MoE	262K	Modified MIT	—
MiniMax-M3	—	—	open weights	44
Llama 4 (latest)	—	—	Llama community	—

Index scores from Artificial Analysis (17 Jun 2026). The dashes indicate values we did not verify on a primary source for this guide.

If you are evaluating across the family rather than just on GLM-5.2 in isolation, the matching Codersera pillars are useful comparison companions:

DeepSeek V4 complete guide (2026) — frontier MoE, MIT license, native 1M context, Flash and Pro variants.
Kimi K2.6 complete guide (2026) — agent-swarm-first architecture from Moonshot.
Claude Opus 4.7 complete guide (2026) — closed-frontier reference for the same workloads.
GPT-5.5 complete guide (2026) — the other closed-frontier reference.
Open-source LLMs landscape (2026) — the wider field, including Llama 4 and the rest of the long tail.

Heuristic for picking between open-weights options today: if you need maximum raw quality on agentic coding inside a self-hostable, MIT-licensed model, GLM-5.2 is the current default. If you need cheaper inference and are willing to trade some quality, DeepSeek V4 Flash is the inexpensive workhorse. If your workload is heavy on multi-agent orchestration, Kimi K2.6 / K2.7 are worth a serious look.

Economics: GLM-5.2 vs the closed frontier

Cost is the second axis of the open-weights argument. At $1.40 / 1M input and $4.40 / 1M output, GLM-5.2's metered API runs at roughly 5x to 8x lower output cost than Claude Opus 4.8, and approximately one-sixth the cost of GPT-5.5 Pro for equivalent workloads (VentureBeat, Jun 2026). With cached input at $0.26 / 1M, about 81% below the standard input rate, large agent workloads that re-use system prompts and codebase context can cut the input side of the bill by up to roughly 5x.

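Self-hosted economics are different but increasingly competitive. Take a single 8×H100 cluster (~$10/hr on most clouds, or about $7,300 per month if it runs continuously) sustaining hundreds of tokens per second of GLM-5.2 throughput. Note that 8×80 GB is 640 GB of GPU memory, below the ~744 GB FP8 weight footprint given earlier, so this sizing only works with a lower-precision quant such as the 4-bit builds (372–475 GB). Measured against the $4.40 / 1M output price alone, $7,300 buys roughly 1.66B output tokens from the Z.ai API; once the input tokens that accompany agent traffic are billed as well, breakeven arrives much earlier, somewhere around 600M output tokens per month per cluster, depending on traffic mix and utilisation. Most teams paying for an open-weights deployment are not optimising for cost; they are optimising for data control, latency on tail requests, or the ability to fine-tune on proprietary code. GLM-5.2's MIT license unlocks all three without revenue thresholds.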

The simpler take: if you are spending under ~$15K/month on coding-model inference and don't have a data-residency requirement, the API path is almost certainly cheaper than self-hosting. Above that, the math tips toward your own cluster.
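The break-even point is sensitive to how many input tokens accompany each output token, so it is worth running the arithmetic for your own traffic mix. The sketch below is a simplified model under stated assumptions: a 730-hour month, the direct Z.ai prices quoted above, no cached-input discount, and a hypothetical `input_per_output` ratio that you would measure from your own logs.

```python
HOURS_PER_MONTH = 730  # average month, assuming the cluster runs continuously


def monthly_cluster_cost(hourly_rate_usd: float,
                         hours: float = HOURS_PER_MONTH) -> float:
    """Monthly rental cost of a cluster billed by the hour."""
    return hourly_rate_usd * hours


def breakeven_output_tokens(monthly_cost_usd: float,
                            output_price_per_m: float = 4.40,
                            input_price_per_m: float = 1.40,
                            input_per_output: float = 0.0) -> float:
    """Output tokens per month at which the API bill equals the cluster cost.

    input_per_output is the number of billed input tokens per output token;
    0.0 means only output tokens are counted.
    """
    # API dollars spent per 1M output tokens, including the input that rides along.
    blended_per_m = output_price_per_m + input_per_output * input_price_per_m
    return monthly_cost_usd / blended_per_m * 1_000_000
```

At $10/hr the cluster costs $7,300 per month, which equals about 1.66B output tokens if only output is counted, and about 640M output tokens if each output token comes with five billed input tokens. Utilisation below 100% and engineering time both push the real break-even higher.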

FAQ

When was GLM-5.2 released?

Z.ai announced GLM-5.2 on 13 June 2026 across the GLM Coding Plan tiers, with the metered API and MIT-licensed open weights rolling out across the following week (MarkTechPost, 14 June 2026).

How many parameters does GLM-5.2 have?

It is a Mixture-of-Experts model with approximately 744 billion total parameters and about 40 billion active per token, per the Artificial Analysis model card. Some vendor write-ups quote 753B; both numbers appear in the wild.

What context window does GLM-5.2 support?

A usable 1,000,000-token context window in the glm-5.2[1m] variant, with up to 131,072 output tokens per response — a 5x jump from GLM-5.1's 200K window.

Is GLM-5.2 open source?

The weights are released under the MIT license on Hugging Face under zai-org/GLM-5.2. The training code and full technical report were not published at launch. "Open weights" is therefore more accurate than "open source" in the strict sense.

Can I run GLM-5.2 locally?

Yes, via Unsloth's dynamic GGUF quantisations. A 2-bit dynamic quant fits on a 256 GB unified-memory Mac or a 1×24 GB GPU plus 256 GB of system RAM, with throughput in the 3–9 tokens/second range on consumer hardware (Unsloth docs). The full BF16 footprint of 1.51 TB needs well over 1.5 TB of aggregate GPU memory, which means a multi-node cluster; a single 8×H100-80GB node has only 640 GB. The FP8 weights (~744 GB) are the more practical server target, starting at 16×H100 80GB.

How does GLM-5.2 compare to GPT-5.5 and Claude Opus 4.8?

On SWE-bench Pro (62.1) and FrontierSWE (74.4%) GLM-5.2 edges past GPT-5.5; on FrontierSWE it is also within ~1 point of Claude Opus 4.8 (VentureBeat, Jun 2026). On the broader Artificial Analysis Intelligence Index v4.1 it sits behind Opus 4.8 (56) and GPT-5.5 xhigh (55) at 51 — still the open-weights leader (Artificial Analysis, 17 Jun 2026).

Why does GLM-5.2 sometimes say it is Claude?

This is a well-documented behavioural quirk seen in GLM-series and other Chinese open-weights models. The most-cited explanation is training-data contamination from Claude outputs (whether scraped from the web or used in instruction-tuning) but this remains a hypothesis without a published technical audit. It does not affect code or reasoning correctness; it is a self-identification artefact only.

How much does the GLM-5.2 API cost?

Direct Z.ai pricing is $1.40 / 1M input tokens, $0.26 / 1M cached-input tokens, and $4.40 / 1M output tokens (Artificial Analysis pricing). The GLM Coding Plan subscription starts at $18/month for the Lite tier.

Building with GLM-5.2? Hire the engineers who know open-weights operations.

The hard part of GLM-5.2 isn't picking the right quant. It's the production tail: MoE-aware KV-cache tuning, multi-GPU sharding strategies for 744B-parameter checkpoints, agent loops that don't waste the 1M context, and evals that match your real workload rather than someone else's leaderboard. That is engineering work, and the talent pool with hands-on open-weights operations experience is still small.

Codersera's vetted developers have shipped LLM-backed systems in production — self-hosted inference, agent stacks, retrieval pipelines, the lot. If you are standing up GLM-5.2 (or any frontier open-weights model) inside your stack and want senior help fast, talk to us about extending your team. Risk-free trial, week-one productive, no long contract.