Ornith 1.0 vs Qwen 3.7: Best Local Coding Model in 2026?
On 25 June 2026, a relatively small lab called DeepReinforce dropped Ornith 1.0 on Hugging Face: an MIT-licensed family of agentic coding models that, by its own numbers, posts 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1 — enough to claim it surpasses Claude Opus 4.7. The launch thread on r/LocalLLaMA drew 360 upvotes. The headline that grabbed everyone, though, was the lineage: Ornith is post-trained on top of Qwen 3.5 and Gemma 4. A community-driven RL pass on Alibaba's open base, beating Alibaba's own flagship.
That flagship is Qwen 3.7 — and here is where the popular framing of "two open coding models" falls apart. Qwen 3.7 is not open. The 3.7 line ships only as a closed API (Qwen3.7-Max and Qwen3.7-Plus on Alibaba Cloud Model Studio). The last open Qwen was 3.6. So this comparison is not local-vs-local. It is the open model you run on your own GPUs versus the closed model you rent by the token. This guide compares the two and helps you decide which to use for agentic and local coding work, with every benchmark, variant, VRAM figure, and price drawn from the primary sources.
What shipped, and why the lineage matters
Two releases, two very different distribution models:
- Ornith 1.0 — DeepReinforce's open-weights family, announced 25 June 2026, MIT-licensed, described as a "self-improving" agentic-coding system. It is post-trained on pretrained Qwen 3.5 and Gemma 4 bases using reinforcement learning, then released across four declared sizes — three with published checkpoints at launch, the fourth announced but not yet posted (see the variant note below) — each with a 256K context window and an OpenAI-compatible interface. For the published sizes you can download the weights and run them.
- Qwen 3.7 — Alibaba's flagship series, available as closed API endpoints only.
Qwen3.7-Maxis the public tier: agent-centric, text-in/text-out, a 1,000,000-token context window, and prompt caching. You call it; you do not host it.
The lineage twist is worth pausing on because it is easy to garble. Ornith's base is Qwen 3.5 (and Gemma 4). Ornith's comparison target is Qwen 3.7-Max. So DeepReinforce took the older, open Qwen generation, ran its own RL recipe on it, and — on its own evaluation harness — came out ahead of the newer, closed Qwen. If you want the foundation models in this story, our guides to Qwen 3.5 and Gemma 4 cover the bases Ornith is built from.
One caveat to keep front of mind for the rest of this article: every Ornith benchmark below is DeepReinforce's own self-reported number — their harness, their settings, their evaluation choices. None of it is independent SWE-Bench leaderboard verification yet. Treat the scores as a vendor claim with a plausible methodology, not as third-party fact. That skepticism is the right default for any model that launches with a chart showing it beating Claude.
What is Ornith 1.0?
Ornith 1.0 is a reasoning-first agentic coding model. Assistant turns open with a <think> block before the model emits a tool call or an answer, which means you have to serve it with the right reasoning and tool-call parsers or your agent loop breaks (more on that below). It ships in four declared sizes, all sharing the 256K (262,144-token) context window and an OpenAI-compatible API surface:
| Variant | Type | Context | Local footprint | GGUF? | Best for |
|---|---|---|---|---|---|
| Ornith-1.0-9B | Dense | 256K | bf16 fits one 80GB GPU; GGUF quant for lighter cards | Yes | Single-GPU / edge |
| Ornith-1.0-31B | Dense | 256K | Announced — no published checkpoint yet | TBD | (see note) |
| Ornith-1.0-35B | MoE | 256K | Multi-GPU; runs as Q4 GGUF on an 8GB+32GB laptop at 25-35 t/s | Yes | Laptop / prosumer |
| Ornith-1.0-397B | MoE | 256K | Sharded across multiple GPUs (tensor-parallel); FP8 build ~halves VRAM | No | Datacenter / flagship |
A discrepancy to flag honestly: the launch blog and a widely-shared Reddit terminology guide both list the 31B Dense variant, but the official GitHub checkpoint table only documents the 9B Dense, 35B MoE, and 397B MoE. As of launch there is no published 31B Dense checkpoint link. Treat 31B as announced but possibly-unreleased — do not plan a deployment around it until Hugging Face actually lists the weights.
The MIT license is the genuinely important fact for teams. It permits commercial use, modification, and redistribution with effectively no strings — a cleaner footing than many "open" model licenses that carve out usage restrictions. For background on why that matters when you self-host, see our self-hosting LLMs guide.
Why "Qwen 3.7" is not a model you run locally
This is the correction the "two open models" framing needs. Alibaba did not open-weight the 3.7 line. On Alibaba Cloud Model Studio, qwen3.7-max and qwen3.7-plus are listed as closed API models. There is no weights download, no GGUF, no Ollama pull. The r/LocalLLaMA thread titled "Qwen is never going to open source Qwen 3.7" hit 544 upvotes and lays out the consensus: the big 3.7 models are locked down, and Qwen 3.6 is the last open release. One Twitter exchange captured the consequence neatly — a user complained that no finetuned fork of Qwen 3.7 beats the original, and the reply pointed out the obvious: there are no community forks at all, because there are no weights to fork.
So if your requirement is "runs on my hardware," the honest local-vs-local matchup is Ornith 1.0 versus Qwen 3.6, not Qwen 3.7. If you specifically want the open Qwen generation on your own box, follow our how to run Qwen 3.6 locally walkthrough. The "3.7" in this comparison is always the API.
What you get for that API: Qwen3.7-Max is the flagship of the 3.7 series, text-in/text-out, positioned for agentic work — coding, office tasks, long-horizon autonomy. The standout spec is the 1,000,000-token context window, roughly 4x (about 3.8x) Ornith's 256K, plus prompt caching to cut cost on repeated context. Pricing per OpenRouter is $1.25 per 1M input tokens and $3.75 per 1M output tokens, with cache reads at $0.25/M. If you want the full rundown of the launch, our Qwen 3.7-Max launch guide goes deeper.
How do Ornith 1.0 and Qwen 3.7 compare on coding benchmarks?
Here is a selected comparison from DeepReinforce's launch post — Ornith-397B against Qwen3.7-Max, with two Claude Opus columns for orientation. Again: these are Ornith's own evals, so the Ornith-vs-Qwen comparison is run on DeepReinforce's harness. The Claude Opus columns are included because Ornith positions itself against them directly.
| Benchmark | Ornith-1.0-397B | Qwen3.7-Max | Claude Opus 4.7 | Claude Opus 4.8 |
|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | 73.5 | 70.3 | 85 |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | 69.8 | — | — |
| SWE-Bench Verified | 82.4 | 80.4 | 80.8 | 87.6 |
| SWE-Bench Pro | 62.2 | 60.6 | — | — |
| SWE-Bench Multilingual | 78.9 | 78.3 | — | — |
| NL2Repo | 48.2 | 47.2 | — | — |
| ClawEval | 77.1 | 65.2 | — | — |
Read this carefully and three things stand out.
First, Ornith-397B beats Qwen3.7-Max on every coding row shown — but the margins are mostly thin. SWE-Bench Verified is 82.4 vs 80.4, a two-point edge. SWE-Bench Multilingual is 78.9 vs 78.3, basically noise. NL2Repo is 48.2 vs 47.2. On a vendor's own harness, a one-to-two point lead is not a decisive win; it is a tie you would not bet money on reproducing independently. The genuine gaps are Terminal-Bench Claude Code (78.2 vs 69.8) and ClawEval (77.1 vs 65.2) — both agentic/tool-use-heavy evals, which is exactly where an RL post-train optimized for agent loops should show up.
Second, the Claude comparison is honest about its own ceiling. Ornith claims 397B matches or surpasses Claude Opus 4.7 (70.3 TB-2.1 / 80.8 SWE-Verified) — and the numbers back that on these two rows. But it openly shows that it still trails Claude Opus 4.8 (85 TB-2.1 / 87.6 SWE-Verified) by a wide margin, and trails GLM-5.2-744B on Terminal-Bench too (GLM posts 81.0/82.7). So the accurate framing is "open model that reached last-generation frontier-closed quality," not "open model that beat the current frontier." If you are weighing the current closed frontier, our Claude Opus 4.8 launch guide has the full picture.
Third, context window is the one spec where Qwen 3.7 simply wins. Ornith caps at 256K; Qwen3.7-Max offers 1M. For very large monorepos or long-horizon agent runs that need to hold an entire codebase in context, that roughly 4x headroom is a real capability advantage no benchmark row captures.
Which Ornith variant should you run locally?
This is where the open-vs-closed split actually pays off: you get to pick the size that fits your hardware. The smaller Ornith variants are surprisingly strong.
Ornith-1.0-9B (Dense) scores 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified, and DeepReinforce claims it matches or exceeds Gemma 4-31B and Qwen 3.6-35B despite being a fraction of their size. In bf16 it fits on a single 80GB GPU, and a GGUF quant brings it comfortably onto smaller cards. This is the "one good GPU" tier.
Ornith-1.0-35B (MoE) is the sweet spot for most developers. It posts 64.2 on Terminal-Bench 2.1 and 75.6 on SWE-Bench Verified — the blog text actually cites 64.4 vs the table's 64.2, a minor internal inconsistency worth noting — and DeepReinforce claims it beats Qwen3.5-35B, Qwen3.6-35B, and even Qwen3.5-397B on Terminal-Bench. Because it is a Mixture-of-Experts design, only a slice of the 35B total parameters activates per token, which is what keeps it laptop-viable.
The proof is a real datapoint from r/LocalLLM: a user ran Ornith-1.0-35B GGUF at Q4_K_L (bartowski's quant) on an ASUS TUF laptop — an RTX 4060 with 8GB VRAM and 32GB system RAM — via a llama.cpp server build (b9672) and got 25-35 tokens/second. A 35B-total-parameter model running at usable speed on an 8GB laptop GPU is the headline practical result of the whole release. MoE sparsity plus aggressive quantization makes it work.
Ornith-1.0-397B (MoE) is the flagship and a different proposition entirely. It needs to be sharded across multiple GPUs with tensor parallelism (the published serve recipe uses eight), and an FP8 build roughly halves the VRAM. There is no GGUF for it; this is datacenter or rented-cluster territory. For most teams, calling 397B through a hosted endpoint (or just using Qwen3.7-Max's API) is more economical than standing up the hardware.
If you are deciding how to host any of these, our comparison of Ollama vs LM Studio vs vLLM vs llama.cpp vs MLX maps each runtime to a use case.
How do you serve Ornith 1.0?
Ornith exposes an OpenAI-compatible interface, so once it is running, your existing client code points at it unchanged. The catch is the reasoning model setup: because every assistant turn emits a <think> block, you must enable the reasoning and tool-call parsers, or your agent's tool calls and chain-of-thought parsing will break.
Minimum runtimes: Transformers >=5.8.1, vLLM >=0.19.1, or SGLang >=0.5.9. Recommended sampling is temperature 0.6, top_p 0.95, top_k 20.
Serving the 397B flagship via vLLM (drop --tensor-parallel-size for the 9B, which fits one 80GB GPU):
MODEL=deepreinforce-ai/Ornith-1.0-397B
vllm serve $MODEL \
--served-model-name Ornith-1.0 \
--tensor-parallel-size 8 \
--host 0.0.0.0 --port 8000 \
--max-model-len 262144 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--trust-remote-codeOr via SGLang:
MODEL=deepreinforce-ai/Ornith-1.0-397B
python -m sglang.launch_server \
--model-path $MODEL \
--served-model-name Ornith-1.0 \
--tp 8 \
--host 0.0.0.0 --port 8000 \
--context-length 262144 \
--mem-fraction-static 0.85 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3Note the tool-call parsers carry Qwen names — qwen3_xml and qwen3_coder — a direct fingerprint of the Qwen 3.5 lineage. For the laptop path, pull a GGUF quant such as bartowski/deepreinforce-ai_Ornith-1.0-35B-GGUF (file Ornith-1.0-35B-Q4_K_L.gguf) and serve it with a recent llama.cpp build (b9672 or newer) or Ollama. The full checkpoint matrix — bf16, FP8, and GGUF — lives in the Hugging Face collection.
Calling Qwen 3.7, by contrast, is just an API key and a base URL pointed at Alibaba Cloud Model Studio or OpenRouter — no GPU, no parsers, no runtime version pinning. That operational simplicity is the whole pitch of the closed model. If you are wiring either into an agent harness, our AI coding agents guide covers the orchestration layer.
What do developers actually say?
The early reception splits along predictable lines, and the social signal is more useful than the benchmark table for setting expectations.
On Ornith, the local-LLM crowd is enthusiastic but measured. The Hugging Face launch thread on r/LocalLLaMA hit 360 upvotes — strong top-of-feed interest. The standout demo came from a developer on X (@anshuc) who showed off a shader-art piece built on a fully local Ornith instance (around 1,861 views). That captures the genuine appeal: a model good enough for real generative coding tasks, running entirely on your own machine, with no per-token meter. The louder marketing claims (one tweet shouted that Ornith "outscored Claude Opus 4.7" with its 82.4 on SWE-Bench Verified) are circulating too, but they are just echoing the vendor's launch chart — discount accordingly.
On Qwen3.7-Max, the verdict is genuinely polarized. On r/opencode, one thread titled "Didn't expect Qwen 3.7 Max to be this good!" (142 upvotes) praised it warmly, while a near-simultaneous thread titled "Qwen 3.7 Max is extremely stupid" (143 upvotes) pushed back just as hard. That split is the real signal: a benchmark-strong agentic model that some users find sharp and others find unreliable, and the well-known risk of any autonomous editor is that it can take destructive actions inside a real repo when run unsupervised. Weigh that against the headline scores. The fair read of Qwen3.7-Max is that it shines as a high-level reasoning and review layer and is riskier as an unsupervised autonomous editor.
Which should you choose: run Ornith locally or call Qwen 3.7 via API?
The decision is less "which scores higher" and more "which distribution model fits your constraints," because on raw coding quality these two are within a couple of points of each other on the only table that compares them directly.
Run Ornith 1.0 locally if:
- You need data to stay on your hardware — regulated industry, sensitive codebase, or just principle. This is Ornith's single biggest advantage and Qwen 3.7 cannot match it.
- You want zero marginal cost per token after the hardware is paid for. That economics is real for high-volume agent loops.
- You have at least a capable laptop (8GB VRAM + 32GB RAM) for the 35B GGUF, or one 80GB GPU for the 9B, or a multi-GPU box for 397B.
- You want MIT-licensed weights you can modify, fine-tune, and ship commercially without restriction.
Call Qwen 3.7-Max via API if:
- You need the 1M-token context window — roughly 4x Ornith's 256K — for very large repos or long-horizon autonomous runs.
- You want zero operational overhead: no GPUs, no parser flags, no runtime version pinning, no quantization decisions.
- Your volume is low enough that $1.25/$3.75 per 1M tokens beats owning hardware — at those rates a million mixed tokens runs only a couple of dollars, so intermittent use is cheap.
- You are using it as a reasoning/review layer rather than an unsupervised autonomous editor.
A pragmatic hybrid that several teams will land on: run Ornith-35B locally for the bulk of day-to-day coding and private work, and reach for Qwen3.7-Max's API only when you need its 1M context or a high-level architectural review. The two are complementary more than they are rivals — one is your always-on local workhorse, the other a metered specialist you call when the job demands it.
If your constraint is the opposite — you want the strongest possible closed model and cost is secondary — the benchmarks point past both of these to Claude Opus 4.8 and GLM-5.2; our GLM-5.2 vs Claude Opus 4.8 coding comparison covers that tier.
The honest bottom line
Ornith 1.0 is a genuinely notable release: a small lab took Alibaba's open Qwen 3.5 base, ran a reinforcement-learning post-train, and produced an MIT-licensed model that — on its own evals — trades blows with the current closed Qwen flagship and reaches last-generation Claude territory, all while running on a gaming laptop in its 35B form. That is the open-weights ecosystem working exactly as it should.
But keep two asterisks attached. The benchmarks are self-reported, the margins over Qwen3.7-Max are slim on most rows, and Ornith openly trails the actual current frontier (Opus 4.8, GLM-5.2). And the comparison is structurally apples-to-oranges: Ornith is open and local, Qwen 3.7 is closed and API-only. The most accurate one-liner is not "Ornith beats Qwen 3.7" — it is "Ornith gives you frontier-adjacent coding quality you can actually own, and Qwen 3.7 gives you a larger-context managed service you rent." Pick on distribution, privacy, and context needs, not on a two-point benchmark gap.
If you are scaling AI-assisted development and want engineers who can stand up local-LLM infrastructure, wire models into agent harnesses, and ship production code around them, Codersera connects teams with vetted remote developers who do exactly this kind of work.
FAQ
Is Qwen 3.7 open source like Ornith 1.0?
No. Qwen 3.7 is closed and API-only — the public tiers are Qwen3.7-Max and Qwen3.7-Plus on Alibaba Cloud Model Studio, with no downloadable weights. The last open-weight Qwen release was 3.6. Ornith 1.0, by contrast, is MIT-licensed open weights you can download and run locally. If you want an open Qwen to self-host, you need Qwen 3.6, not 3.7.
Is Ornith 1.0 actually based on Qwen 3.7?
No — this is the common confusion. Ornith 1.0 is post-trained on Qwen 3.5 and Gemma 4 bases, not Qwen 3.7. Qwen 3.7 is its comparison target, not its foundation. The notable part of the story is that Ornith took the older, open Qwen 3.5 generation and, on DeepReinforce's own benchmarks, came out ahead of the newer closed Qwen 3.7-Max on every coding row.
Can I run Ornith 1.0 on a laptop?
Yes, the 35B MoE variant. A documented r/LocalLLM run had Ornith-1.0-35B GGUF at Q4_K_L (bartowski's quant) on an RTX 4060 (8GB VRAM) plus 32GB system RAM via llama.cpp build b9672, hitting 25-35 tokens/second. Mixture-of-Experts sparsity is what keeps a 35B-total model viable on 8GB. The 9B Dense is even lighter; the 397B flagship needs a multi-GPU server.
How much does Qwen 3.7-Max cost?
Per OpenRouter, Qwen3.7-Max is $1.25 per 1M input tokens and $3.75 per 1M output tokens, with cached input reads at $0.25/M. It offers a 1,000,000-token context window. At those rates a million mixed input/output tokens costs only a couple of dollars, so intermittent architect-style use is inexpensive; sustained high-volume agent loops are where the cost adds up and local models pull ahead.
Are Ornith's benchmark numbers independently verified?
No. Every Ornith score — the 82.4 SWE-Bench Verified, 77.5 Terminal-Bench 2.1, and the full comparison table — is DeepReinforce's own self-reported result, run on their own harness and settings. There is no independent SWE-Bench leaderboard verification yet. Treat the numbers as a credible vendor claim, not third-party fact, and expect some regression on independent re-runs.
Which is better for agentic coding, Ornith or Qwen 3.7?
On DeepReinforce's table, Ornith-397B leads Qwen3.7-Max on the agentic rows by the widest margins — Terminal-Bench Claude Code 78.2 vs 69.8 and ClawEval 77.1 vs 65.2 — which suggests its RL post-train genuinely helped tool-use loops. But Qwen3.7-Max's 1M context is better for very large repos, and some users find it unreliable when run fully unsupervised. For private, high-volume agent work, Ornith local wins; for large-context or hands-off-but-supervised reasoning, Qwen 3.7's API is the easier call.