Ornith 1.0 vs GLM 5.2: Best Open Coding Model?

Both shipped in late June 2026 as MIT open-weights coding models. GLM-5.2 (~750B MoE, 1M context) narrowly leads Terminal-Bench 2.1 (81.0 vs 77.5) and offers a hosted API at $1.4/$4.4 per 1M tokens. Ornith-1.0 is self-host-only, but its 35B variant runs on a single consumer GPU. Pick GLM-5.2 for hosted long-context work; pick Ornith-35B for local agentic coding.

Within about 24 hours of each other in late June 2026, two labs unveiled MIT-licensed open-weights coding models aimed squarely at agentic workflows — the kind where a model runs in a loop inside a terminal, edits files, runs tests, and iterates until a task is done. On June 24, Z.ai (Zhipu) published GLM-5.2, a ~750B-class mixture-of-experts model with a usable 1M-token context. On June 25, a lab called DeepReinforce shipped Ornith-1.0, not a single model but a four-model family (9B, 31B, 35B MoE, 397B MoE) built on a "self-scaffolding" reinforcement-learning idea.

Both are genuinely strong, and both are free to download and run. Each lab benchmarks mostly against the closed frontier — Ornith against Claude Opus 4.7, GLM-5.2 against Opus 4.8 — though DeepReinforce's own table also puts the two head-to-head. So the question developers are actually asking — Ornith 1.0 vs GLM 5.2, which is the better open coding model right now? — has a more interesting answer than "whichever benches higher." At the flagship tier they trade blows on benchmarks. The real fork is deployment: one of these you can run on a laptop, the other you basically have to rent. This piece walks through architecture, the coding benchmarks (and the cross-vendor caveats that matter more than the scores), hardware requirements, API pricing, and a concrete pick-this-for-that verdict.

What shipped, and why a developer should care

The short version: in a single week, two open-weights coding releases landed within a few points of the closed frontier on the vendors' own benchmarks, and they did it in two completely different shapes.

GLM-5.2 went live first inside Z.ai's GLM Coding Plan, with the MIT weights and the public API following roughly a week later. It is one big model. Its headline numbers are a 1M-token context and an 81.0 on Terminal-Bench 2.1 — up from GLM-5.1's 63.5 — which puts it within a few points of Claude Opus 4.8 (85.0) and, per Z.ai's own table, ahead of Gemini 3.1 Pro.
Ornith-1.0 is the opposite bet: a family that scales from a 9B you can run on a single GPU to a 397B flagship. Its pitch is not "we beat Opus" (though DeepReinforce frames the 397B as matching Opus 4.7); it's that the smaller checkpoints punch far above their weight because of how they were trained. The 35B MoE in particular is the standout — it competes near the flagship tier on coding benchmarks while staying runnable on consumer hardware (more below).

If you build with AI coding agents (OpenHands, a custom harness, or anything that takes an OpenAI-compatible endpoint), these two releases matter because they're among the strongest open-weights options yet for real agentic tasks, not just autocomplete. For broader context on where these land, our open-source LLMs landscape and AI coding agents guide map the surrounding field.

What is Ornith 1.0?

Ornith-1.0 is DeepReinforce's open-weights coding family, released June 25, 2026, under an MIT license. There are four checkpoints, all post-trained on top of pretrained Gemma 4 and Qwen 3.5 bases:

9B Dense — the "edge" model, fits a single 80GB GPU comfortably and runs quantized on a laptop.
31B Dense — listed in the announcement and blog.
35B MoE — the sweet spot for local agentic coding.
397B MoE — the flagship.

Every checkpoint ships a 256K (262,144-token) context window and is released in BF16, FP8, and GGUF (quantized for llama.cpp / Ollama). One practical caveat worth flagging up front: the announcement lists a 31B Dense variant, but the GitHub serving recipes only document the 9B, 35B, and 397B checkpoints. If you're planning around the 31B, confirm the checkpoint is actually on Hugging Face before you commit; don't assume it from the marketing.

The "self-scaffolding" idea, briefly

The interesting part of Ornith isn't the parameter counts; it's the training method. Most RL-for-code setups bolt the model into a fixed, human-designed harness (the scaffold that decides how a task gets decomposed, which tools get called, how retries work) and then optimize only the model's answers inside that fixed structure. Ornith's framework optimizes both halves. At each RL step the model first proposes a refined, task-specific scaffold, then generates a solution rollout inside it, and the reward propagates back to both stages. In other words, the model learns the orchestration that elicits its own better answers, not just the answers.
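To make "the reward propagates back to both stages" concrete, here is a toy two-stage policy-gradient loop. It is not DeepReinforce's training code: the two-option scaffold and solution choices, the pass/fail rule, the learning rate, and the step count are all invented for illustration, and plain REINFORCE stands in for whatever RL algorithm the lab actually uses.

```python
import math
import random

def softmax(prefs):
    """Turn raw preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(prefs, chosen, reward, lr):
    """One REINFORCE step for a softmax policy (gradient of log pi(chosen))."""
    probs = softmax(prefs)
    for i in range(len(prefs)):
        grad = (1.0 if i == chosen else 0.0) - probs[i]
        prefs[i] += lr * reward * grad

def train(steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    scaffold_prefs = [0.0, 0.0]                # stage 1: which scaffold to propose
    solution_prefs = [[0.0, 0.0], [0.0, 0.0]]  # stage 2: solution policy per scaffold
    for _ in range(steps):
        s = rng.choices([0, 1], weights=softmax(scaffold_prefs))[0]
        a = rng.choices([0, 1], weights=softmax(solution_prefs[s]))[0]
        # Hypothetical verifier: only scaffold 1 combined with solution 1 passes.
        reward = 1.0 if (s == 1 and a == 1) else 0.0
        # The same scalar reward updates BOTH stages.
        reinforce_update(scaffold_prefs, s, reward, lr)
        reinforce_update(solution_prefs[s], a, reward, lr)
    return softmax(scaffold_prefs), softmax(solution_prefs[1])

scaffold_probs, solution_probs = train()
print("P(scaffold 1):", scaffold_probs[1])
print("P(solution 1 | scaffold 1):", solution_probs[1])
```

Both probabilities drift upward together because a single reward signal credits the scaffold choice and the solution produced inside it. A fixed-harness setup would only ever update the second stage.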

Why that matters in practice: a lot of an agent's real-world performance comes down to harness quality — how it breaks a problem down, when it runs tests, how it recovers from a failed edit. A model that has learned to build its own harness during training tends to behave better when you drop it into a generic agent loop, because it isn't relying on you to hand-craft the perfect scaffold. DeepReinforce devotes a section of its writeup to addressing reward hacking, which matters because letting a model author its own scaffold is exactly where reward hacking shows up (a self-generated scaffold can learn to satisfy the verifier without actually doing the task).

One non-negotiable operational detail: Ornith is a reasoning model that emits <think>…</think> blocks. If you load it naively without a reasoning parser, those blocks will dump straight into your output. It needs recent runtimes — Transformers ≥ 5.8.1, vLLM ≥ 0.19.1, SGLang ≥ 0.5.9 — and the recommended sampling is temperature=0.6, top_p=0.95, top_k=20 (the benchmark numbers below were produced at temperature=1.0).
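If you are stuck with a runtime or client that has no reasoning parser, a client-side fallback is to strip the blocks yourself. The sketch below assumes the reasoning is wrapped in literal <think> and </think> tags as described above; the helper name and the handling of truncated output are illustrative choices, and a server-side parser remains the cleaner option.

```python
import re

# Sampling values recommended in the text for interactive use.
RECOMMENDED_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}

# Matches a complete <think>...</think> block, including newlines.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove reasoning blocks from raw model output.

    If generation was cut off mid-reasoning (an opening tag with no close),
    everything from the dangling <think> onward is dropped as well.
    """
    cleaned = _THINK_BLOCK.sub("", text)
    dangling = cleaned.find("<think>")
    if dangling != -1:
        cleaned = cleaned[:dangling]
    return cleaned.strip()

raw = "<think>\nPlan: read the file, then patch it.\n</think>\nHere is the patch."
print(strip_think(raw))
```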

What is GLM-5.2?

GLM-5.2 is Z.ai's flagship open-weights coding model, also MIT-licensed, published June 24, 2026. Where Ornith is a family, GLM-5.2 is a single large MoE. Its exact parameter count is genuinely uncertain: DeepReinforce's comparison table labels it "GLM-5.2-744B," while community posts cite 754B and 756B. Treat it as a ~750B-class MoE — that's the honest framing until Z.ai publishes a precise figure — versus Ornith's largest at 397B.

Two architectural pieces define it:

IndexShare sparse attention. Every 4 sparse-attention (DSA) layers share one lightweight indexer, which cuts per-token FLOPs by 2.9x at 1M context. This is the trick that makes a 1M-token window "solid" rather than nominal — it's what keeps long-context inference from collapsing under its own quadratic cost.
Improved MTP layer. A reworked multi-token-prediction (MTP) layer raises speculative-decoding acceptance length by up to 20%, which translates into faster decoding.
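The sharing pattern in IndexShare can be pictured as a simple layer-to-indexer mapping. This sketch assumes consecutive layers are grouped in blocks of four, which the text implies but does not spell out, and the layer count is a made-up number. It only shows how many indexer passes the sharing removes; it does not model where the 2.9x FLOPs figure comes from.

```python
SHARE_GROUP = 4  # from the text: every 4 DSA layers share one indexer

def indexer_for_layer(layer_idx: int, group: int = SHARE_GROUP) -> int:
    """Index of the shared indexer serving a given sparse-attention layer."""
    return layer_idx // group

def indexer_count(num_layers: int, group: int = SHARE_GROUP) -> int:
    """Number of indexers needed; ceiling division covers a partial last group."""
    return -(-num_layers // group)

num_layers = 10  # hypothetical layer count, for illustration only
mapping = {layer: indexer_for_layer(layer) for layer in range(num_layers)}
print(mapping)
print(indexer_count(num_layers), "indexers instead of", num_layers)
```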

GLM-5.2 exposes two reasoning-effort levels — GLM-5.2 (max) and GLM-5.2 (high). Z.ai explicitly recommends max for coding. The other big practical fact: there is a first-party hosted API at the same price as GLM-5.1, so you can use this model today without owning a GPU cluster. We go deeper on the model itself in the GLM-5.2 complete guide, and compare it to the closed frontier in GLM-5.2 vs Claude Opus 4.8.

One limitation worth flagging: GLM-5.2 is text-only. No vision. For backend, infra, and CLI work that's irrelevant, but for iterative UI and design work it's a real handicap — more on that in the verdict.

How do the coding benchmarks compare?

Here's the head-to-head at flagship scale, drawn from both vendors' own published tables. Read the caveat below before you weight any single number.

Benchmark (flagship)	Ornith-1.0-397B	GLM-5.2 (max)	Reference
Terminal-Bench 2.1 (Terminus-2)	77.5	81.0	Opus 4.8 = 85.0
SWE-bench Pro	62.2	62.1	— (effective tie)
SWE-bench Verified	82.4	Not reported	Opus 4.7 = 80.8
NL2Repo	48.2	48.9	—

The pattern: GLM-5.2 leads Terminal-Bench 2.1 (81.0 vs 77.5), the two are effectively tied on SWE-bench Pro (62.1 vs 62.2), GLM edges NL2Repo by 0.7, and Ornith reports a strong SWE-bench Verified of 82.4 while GLM-5.2's Verified number simply isn't in either table. So on the metric where Ornith looks best (Verified), there's no GLM number to compare against; on the metric where GLM looks best (Terminal-Bench), it's a clear but not huge lead of 3.5 points. Calling either one "the better coder" from this table alone is overreach.
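For anyone who wants to check the gaps rather than eyeball them, the arithmetic is short. The scores are copied from the flagship table; the dictionary layout and function name are just one convenient way to hold them.

```python
# Flagship scores from the table above: (Ornith-1.0-397B, GLM-5.2 max).
# None marks a score that was not reported.
SCORES = {
    "Terminal-Bench 2.1": (77.5, 81.0),
    "SWE-bench Pro": (62.2, 62.1),
    "SWE-bench Verified": (82.4, None),
    "NL2Repo": (48.2, 48.9),
}

def glm_minus_ornith(scores):
    """Per-benchmark gap in points; positive means GLM-5.2 is ahead."""
    gaps = {}
    for name, (ornith, glm) in scores.items():
        if ornith is None or glm is None:
            gaps[name] = None  # no comparison possible
        else:
            gaps[name] = round(glm - ornith, 1)
    return gaps

GAPS = glm_minus_ornith(SCORES)
for name, gap in GAPS.items():
    label = "n/a" if gap is None else f"{gap:+.1f}"
    print(f"{name:22s} {label}")
```

The largest gap is 3.5 points and the smallest is 0.1, which is well inside the noise you would expect from differing harnesses.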

The caveat that matters more than the numbers

These are not apples-to-apples. Ornith's comparison table runs its own harnesses — OpenHands, mini-SWE-agent, Terminus-2 / Claude Code — at temperature 1.0. GLM-5.2's blog leans on a different set of long-horizon benchmarks entirely (FrontierSWE, PostTrainBench, SWE-Marathon). Both vendors do publish their benchmark setups, but they don't use the same ones — so when DeepReinforce reports a GLM-5.2 column, it's GLM running inside DeepReinforce's harness, not Z.ai's. Cross-vendor self-reported wins are as much marketing artifacts as measurements. Trust the direction (these two are roughly peer-level at the top), not the second decimal place.

On GLM-5.2's own long-horizon turf, Z.ai reports it's the highest-ranked open-source model on all three of its benchmarks: on FrontierSWE it trails Opus 4.8 by ~1% while edging GPT-5.5 by 1% and Opus 4.7 by 11%; on PostTrainBench it beats Opus 4.7 and GPT-5.5 and is second only to Opus 4.8; on SWE-Marathon it trails Opus 4.8 by 13%. That's a credible "best open-source on long, multi-step tasks" claim — within its own measurement frame.

Where Ornith's smaller models change the math

The flagship table is the least interesting part of the Ornith story. The reason to care about Ornith is what its mid-size checkpoints do per parameter:

Ornith variant	Type	Terminal-Bench 2.1	SWE-bench Verified	SWE-bench Pro
Ornith-1.0-9B	Dense (edge)	43.1	69.4	42.9
Ornith-1.0-35B	MoE	64.2	75.6	50.4
Ornith-1.0-397B	MoE (flagship)	77.5	82.4	62.2

The 35B MoE scores 64.2 on Terminal-Bench 2.1 — and DeepReinforce reports that beats Qwen3.5-397B's 53.5 on the same test. A 35B model out-scoring a 397B one on Terminal-Bench is the standout result in Ornith's table. Its SWE-bench Verified of 75.6 is also within striking distance of the 397B's 82.4. The 9B edge model hits 69.4 SWE-bench Verified, beating Gemma 4-31B's 52 and matching or exceeding Qwen3.5-35B on several tasks — notable for something you can quantize onto a gaming laptop. GLM-5.2 has no equivalent in this size class; it is one model, and it is large.
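One way to read "per parameter" is to ask how much of the flagship's score each smaller checkpoint keeps relative to its size. The sketch below uses the scores from the table and the nominal total parameter counts; for the MoE models the active parameter count per token is lower and is not given in the source, so treat the size ratio as a rough guide only.

```python
# Scores from the Ornith table: (Terminal-Bench 2.1, SWE-bench Verified, SWE-bench Pro).
ORNITH = {
    "9B": {"params_b": 9, "scores": (43.1, 69.4, 42.9)},
    "35B": {"params_b": 35, "scores": (64.2, 75.6, 50.4)},
    "397B": {"params_b": 397, "scores": (77.5, 82.4, 62.2)},
}

def retained_vs_flagship(variant: str, flagship: str = "397B"):
    """Fraction of the flagship's score kept on each benchmark, plus the size ratio."""
    small, big = ORNITH[variant], ORNITH[flagship]
    ratios = tuple(round(s / b, 3) for s, b in zip(small["scores"], big["scores"]))
    size_ratio = round(small["params_b"] / big["params_b"], 3)
    return ratios, size_ratio

for name in ("9B", "35B"):
    ratios, size_ratio = retained_vs_flagship(name)
    print(name, "size ratio:", size_ratio, "score ratios:", ratios)
```

On these numbers the 35B keeps roughly 83% of the flagship's Terminal-Bench score and about 92% of its SWE-bench Verified score at under 9% of the total parameter count.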

What hardware do you actually need to run them?

This is the axis that actually decides the choice for most teams, and it's wildly asymmetric.

Ornith: genuinely local-friendly

The Ornith 9B fits on a single 80GB GPU in full precision; the 35B and 397B MoE checkpoints shard across multi-GPU nodes with tensor parallelism. But the interesting numbers are the quantized 35B builds, which is where most local users will live. Community VRAM figures for Ornith-1.0-35B:

Quant	Loaded VRAM	Notes
Q3_K_M	~17 GB	3.87 BPW, 84.4% top-1 token match vs BF16
Q4_K_M	~21.2 GB	—
Q5_K_M	~24.7 GB	—
Q6_K	~28.5 GB	—

Throughput is the part that surprised people: the 35B runs at roughly 25–35 tok/s on an RTX 4060 8GB laptop with CPU offload, and about 35 tok/s on an NVIDIA DGX Spark. That's usable agent speed on hardware a lot of developers already own. If you've never quantized and served a local model before, our guides on self-hosting LLMs and Ollama vs LM Studio vs vLLM vs llama.cpp vs MLX cover the runtime tradeoffs.
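If you want to turn the VRAM table into a quick decision, a small lookup does it. The figures are the community-reported ones above; the one-gigabyte headroom and the function names are assumptions for illustration, and real usage also grows with context length because of the KV cache.

```python
# Community-reported loaded VRAM for Ornith-1.0-35B GGUF builds (GB), from the table above.
QUANT_VRAM_GB = {"Q3_K_M": 17.0, "Q4_K_M": 21.2, "Q5_K_M": 24.7, "Q6_K": 28.5}

def largest_quant_that_fits(vram_gb: float, headroom_gb: float = 1.0):
    """Return the least-compressed quant that fits fully on the GPU.

    headroom_gb is a hypothetical safety margin for the desktop, driver context, etc.
    Returns None when even the smallest build would need CPU offload.
    """
    budget = vram_gb - headroom_gb
    fitting = [(need, name) for name, need in QUANT_VRAM_GB.items() if need <= budget]
    return max(fitting)[1] if fitting else None

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: params * bits / 8, in GB."""
    return params_billion * bits_per_weight / 8

for card in (8, 16, 24, 32):
    print(f"{card} GB card ->", largest_quant_that_fits(card))
print(round(weight_gb(35, 3.87), 1), "GB of weights at 3.87 BPW")
```

The last line is a sanity check on the table: 35B parameters at 3.87 bits per weight comes to about 16.9 GB of weights, which lines up with the reported ~17 GB for Q3_K_M. A None result for the 8 GB card matches the laptop figure above, which relies on CPU offload.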

GLM-5.2: not really a self-host model

At ~750B parameters, GLM-5.2 cannot be run locally at full quality on consumer hardware. The viral "run it on 16GB VRAM" claims you'll see on Twitter are Unsloth's 1-bit GGUF — real, impressive engineering, and a real quality loss versus full precision. It also runs on a Mac Studio via llama.cpp. But if your plan is "download the open weights and serve them at production quality on your own box," GLM-5.2 mostly doesn't fit that plan unless you have serious multi-GPU iron. The honest framing: GLM-5.2's openness is about licensing and the API, not about your laptop.

So the hardware story is the cleanest differentiator in this whole comparison. Ornith you self-host. GLM-5.2 you call.

How much does each cost to run?

Here the asymmetry flips into a different shape. Ornith has no first-party hosted API at launch — it is open-weights self-host only. When the official @ornith_ account was asked "where is the API," it replied "chirp chirp! recorded!" — i.e., not yet. So Ornith's cost is entirely your own infrastructure: GPU rental or owned hardware, plus your time. There's no per-token bill, but there's also no managed endpoint to hit on day one.

GLM-5.2 has a public API at GLM-5.1 pricing:

	Ornith-1.0	GLM-5.2
License	MIT	MIT
Hosted API	None at launch	Yes (z.ai)
Input price (per 1M)	—	$1.4
Cached input (per 1M)	—	$0.26
Output price (per 1M)	—	$4.4
Context	256K	1M
Params	9B / 31B / 35B MoE / 397B MoE	~750B MoE
Vision	Not positioned as vision	No (text-only)

$1.4 in / $0.26 cached / $4.4 out per 1M tokens is cheap for a frontier-adjacent coder, especially with the cached-input rate doing heavy lifting in agent loops that re-send the same repo context every step. The cost comparison nets out to a familiar build-vs-buy decision. If you have idle GPUs or you run high enough volume that per-token API costs hurt, Ornith's zero-marginal-cost self-host wins. If you'd rather not operate inference at all, GLM-5.2's API is the only managed option of the two.
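To see how much the cached-input rate matters in an agent loop, it helps to run the numbers. The prices below are the ones quoted above; the workload (40 steps, 100K tokens of context per step, 2K tokens out, 90% cache hits) is entirely hypothetical, and real cache hit rates depend on how the provider matches prefixes.

```python
# z.ai list prices from the table above, in USD per 1M tokens.
PRICE_INPUT = 1.4
PRICE_CACHED_INPUT = 0.26
PRICE_OUTPUT = 4.4

def request_cost(input_tokens: int, output_tokens: int, cached_fraction: float = 0.0) -> float:
    """USD cost of one request; cached_fraction is the share of input billed at the cached rate."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = fresh * PRICE_INPUT + cached * PRICE_CACHED_INPUT + output_tokens * PRICE_OUTPUT
    return total / 1_000_000

# Hypothetical agent run: 40 steps, each re-sending 100K tokens of context
# and producing 2K tokens of output.
STEPS, CONTEXT, OUTPUT = 40, 100_000, 2_000
no_cache = STEPS * request_cost(CONTEXT, OUTPUT)
mostly_cached = STEPS * request_cost(CONTEXT, OUTPUT, cached_fraction=0.9)
print(f"no cache:   ${no_cache:.2f}")
print(f"90% cached: ${mostly_cached:.2f}")
```

For this made-up workload the bill drops from about $5.95 to about $1.85 per run. Multiply by your own daily run count to see where a self-hosted Ornith box starts to pay for itself.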

How do you actually serve each one?

For Ornith, you bring your own runtime. The flagship 397B with vLLM, tensor-parallel across 8 GPUs:

MODEL=deepreinforce-ai/Ornith-1.0-397B
vllm serve $MODEL \
  --served-model-name Ornith-1.0 \
  --tensor-parallel-size 8 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code

Or with SGLang:

python -m sglang.launch_server \
  --model-path deepreinforce-ai/Ornith-1.0-397B \
  --served-model-name Ornith-1.0 \
  --tp 8 --host 0.0.0.0 --port 8000 \
  --context-length 262144 \
  --mem-fraction-static 0.85 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3

For the dense 9B you drop the tensor-parallel flag entirely (--tensor-parallel-size in vLLM, --tp in SGLang), since it fits a single 80GB GPU, and the GGUF builds of the 9B and 35B run through Ollama or llama.cpp directly. Note the --reasoning-parser and --tool-call-parser flags in both recipes: those are what keep the <think> blocks and tool calls from leaking into your output. Skip them and you'll get raw reasoning text in your responses.

For GLM-5.2 there's nothing to serve — it's an OpenAI-compatible API:

curl -X POST "https://api.z.ai/api/paas/v4/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{"model":"glm-5.2","messages":[{"role":"user","content":"..."}],"thinking":{"type":"enabled"},"reasoning_effort":"max","max_tokens":4096,"temperature":1.0}'

Or use the official zai-sdk Python package (pip install zai-sdk, then from zai import ZaiClient), or point any OpenAI-compatible client at base URL https://api.z.ai/api/paas/v4/. Either way, set reasoning_effort: "max" — that's the coding-recommended setting. Both models expose OpenAI-compatible endpoints, so wiring either into an agent harness or a custom loop is mechanical once the endpoint is up.
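Because both sides speak the same chat-completions shape, one small helper can target either. The sketch below uses only the Python standard library and builds the requests without sending them. The GLM fields mirror the curl call above; the ZAI_API_KEY variable name, the prompt, and the /v1 prefix on the local server (the usual OpenAI-compatible route on vLLM and SGLang) are assumptions.

```python
import json
import os
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str, api_key: str = "", **extra):
    """Build (but do not send) a POST to an OpenAI-compatible /chat/completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}], **extra}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

# Hosted GLM-5.2, mirroring the curl call above. ZAI_API_KEY is a hypothetical env var name.
glm_req = build_chat_request(
    "https://api.z.ai/api/paas/v4/", "glm-5.2", "Explain this stack trace.",
    api_key=os.environ.get("ZAI_API_KEY", ""),
    thinking={"type": "enabled"}, reasoning_effort="max", max_tokens=4096,
)

# Self-hosted Ornith behind one of the serving recipes above
# (port 8000, served model name Ornith-1.0), with the recommended sampling values.
ornith_req = build_chat_request(
    "http://localhost:8000/v1", "Ornith-1.0", "Explain this stack trace.",
    temperature=0.6, top_p=0.95, max_tokens=4096,
)

# To actually send one: urllib.request.urlopen(glm_req), then JSON-decode the body.
print(glm_req.full_url)
print(ornith_req.full_url)
```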

Which model should you pick for which job?

Strip away the benchmark theater and the decision is actually clean, because the two models barely overlap in deployment.

Pick Ornith-1.0 (35B) when:

You want local or edge agentic coding — air-gapped environments, on-prem, privacy-sensitive code that can't leave your network.
You own GPUs (even a single 80GB card, or a 24–32GB consumer card for the quantized 35B) and want zero per-token cost.
You're running an always-on agent loop — scanning a codebase, running tests on a schedule — where per-token API costs would compound, and a self-hosted model's zero marginal cost wins.
You want a model whose smaller checkpoints genuinely outperform their size class, so you can fit "good enough" agentic coding on modest hardware.

Pick GLM-5.2 (max) when:

You want a managed endpoint today and don't want to operate inference at all.
You're doing whole-repo, long-context work — the 1M-token window and IndexShare architecture are built for feeding entire repositories in.
You need the strongest open-source long-horizon performance and you're willing to verify it on your own tasks (Z.ai's FrontierSWE / SWE-Marathon results back this up within their frame).
Cost predictability via API ($1.4/$4.4 per 1M, $0.26 cached) beats the capex and ops of self-hosting for your volume.

Pick neither — or pick carefully — when:

You're doing iterative UI/design work. GLM-5.2 is text-only, so it can't see the rendered output it's iterating on — developers flag this as a real weakness for visual work. Ornith is also not positioned as a vision model. For visual front-end iteration you may still want a multimodal model in the loop.
You need both local AND frontier-flagship quality. Ornith's local-friendly checkpoints aren't the 397B, and the 397B isn't local-friendly. There's no free lunch where one checkpoint gives you both.

If your work straddles both modes, a common pragmatic setup is GLM-5.2 via API for the heavy whole-repo planning and Ornith-35B locally for the high-volume, privacy-sensitive, or always-on grunt work.
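That hybrid setup boils down to a routing rule. Here is a minimal sketch of one, assuming a task can be described by its prompt size and two flags; the Task fields, backend labels, and the exact 1,000,000-token figure for the "1M" window are illustrative, and the rule ignores the output-token budget.

```python
from dataclasses import dataclass

ORNITH_CONTEXT = 262_144   # 256K window, from the text
GLM_CONTEXT = 1_000_000    # "1M" window; the exact token limit is an assumption

@dataclass
class Task:
    prompt_tokens: int
    private: bool = False    # code that must not leave the network
    planning: bool = False   # heavy whole-repo planning work

def pick_backend(task: Task) -> str:
    """Route a task to 'ornith-local' or 'glm-api' following the split described above."""
    if task.prompt_tokens > GLM_CONTEXT:
        raise ValueError("task exceeds both context windows; split it first")
    if task.private:
        # Private code never goes to the hosted API, even if it is too large locally.
        if task.prompt_tokens > ORNITH_CONTEXT:
            raise ValueError("private task exceeds the local context window; split it first")
        return "ornith-local"
    if task.planning or task.prompt_tokens > ORNITH_CONTEXT:
        return "glm-api"
    return "ornith-local"  # default: high-volume grunt work stays local

print(pick_backend(Task(prompt_tokens=600_000)))
print(pick_backend(Task(prompt_tokens=50_000, private=True)))
print(pick_backend(Task(prompt_tokens=50_000, planning=True)))
```

Defaulting to the local model keeps marginal cost at zero for routine work and sends only the large or planning-heavy requests to the paid endpoint.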

What the community signal actually points to

Strip out the hot takes and two grounded signals remain, and they track the deployment split.

Ornith's story is local runnability. The 35B's quantized builds fit consumer GPUs and run at usable agent speed (the VRAM and throughput numbers above are community-measured), which is unusual for a model competing near the flagship tier. There's a catch the discussion keeps returning to: Ornith has no first-party hosted API yet — asked where the API is, the official @ornith_ account replied only "chirp chirp! recorded!" So self-hosting isn't optional today; it's the only way to run it.

GLM-5.2's story is the text-only limitation. Developers doing iterative UI and front-end work flag the lack of vision as a real weakness — the model can't see the rendered result it's iterating on, so visual debugging is awkward. For backend, infrastructure, and CLI work it's a non-issue. And while the open weights are downloadable, running GLM-5.2 locally means heavy quantization (Unsloth's 1-bit GGUF on ~16GB VRAM, or a Mac Studio via llama.cpp) — a real quality trade versus the hosted API.

FAQ

Is Ornith 1.0 or GLM-5.2 better for coding?

At flagship scale they're peers: GLM-5.2 leads Terminal-Bench 2.1 (81.0 vs Ornith-397B's 77.5), they effectively tie on SWE-bench Pro (62.1 vs 62.2), and Ornith reports a strong SWE-bench Verified of 82.4 (GLM's Verified isn't published). The better question is deployment: Ornith you can self-host down to a laptop-class 35B, GLM-5.2 you mostly call via API. Pick by how you want to run it, not by a 3.5-point benchmark gap.

Can I run GLM-5.2 locally?

Not at full quality on consumer hardware — it's a ~750B-class MoE. Community runs exist via heavy quantization (Unsloth's 1-bit GGUF on ~16GB VRAM, or a Mac Studio via llama.cpp), but those trade real quality for the ability to load it. If you want production-quality GLM-5.2, use the z.ai API. For local-first agentic coding, Ornith-35B is the better fit.

How much does GLM-5.2 cost, and does Ornith have an API?

GLM-5.2's API is $1.4 per 1M input tokens, $0.26 cached input, and $4.4 per 1M output — identical to GLM-5.1. Ornith-1.0 has no first-party hosted API at launch; it's open-weights self-host only (vLLM, SGLang, llama.cpp, Ollama). If you need a managed endpoint today, GLM-5.2 is the only one of the two that offers it.

What hardware runs Ornith-1.0-35B?

The quantized 35B fits common GPUs: ~17 GB VRAM at Q3_K_M, ~21.2 GB at Q4_K_M, ~24.7 GB at Q5_K_M, ~28.5 GB at Q6_K. It runs around 25–35 tok/s on an RTX 4060 8GB laptop with CPU offload, and ~35 tok/s on an NVIDIA DGX Spark. The dense 9B fits a single 80GB GPU in full precision; the 397B needs multi-GPU tensor parallelism.

Are the published benchmark comparisons trustworthy?

Directionally, yes; precisely, no. Each vendor uses its own harnesses and benchmark sets — Ornith runs OpenHands / Terminus-2 / mini-SWE-agent at temperature 1.0, GLM-5.2's blog reports FrontierSWE / PostTrainBench / SWE-Marathon. When one vendor lists the other's score, it's their harness, not the other lab's. Use the numbers to conclude "these are roughly peer-level," then validate on your own tasks before committing.

Does either model support vision?

No. GLM-5.2 is explicitly text-only, which developers flag as a real weakness for iterative UI/design work — it can't see the rendered output it's iterating on. Ornith-1.0 is likewise pitched as a coding/reasoning model, not a multimodal one. For visual front-end iteration you'll still want a multimodal model in the loop.

The honest verdict

Ornith 1.0 vs GLM 5.2 isn't really a benchmark fight — it's a build-vs-buy fork dressed up as one. They're close enough at the top that the leaderboard shouldn't decide it. Choose Ornith-1.0-35B if you want capable agentic coding you fully own and can run locally; choose GLM-5.2 (max) if you want a hosted, long-context, whole-repo coder and would rather call an API than operate inference. Both are MIT, both are free to download, and both are real steps forward for open-weights coding.

Whichever you adopt, the model is the easy part; building the agent harness, evals, and review process around it is where projects actually succeed or stall. If you're scaling AI-assisted development and want senior engineers who've shipped production systems with these tools, Codersera helps teams extend their engineering bench with vetted remote developers — so you can move fast on the agent workflows without carrying all the hiring risk yourself.