
Ornith 1.0 vs Claude Opus 4.8 for Coding (2026)

Ornith 1.0 is a free, MIT-licensed, self-hostable coding model. Opus 4.8 is the closed frontier flagship. A benchmark-grounded, harness-honest comparison of where each wins on agentic coding in 2026.

Published 30 Jun 2026 • Updated 30 Jun 2026 • 12 min read

Quick answer. Ornith 1.0 (DeepReinforce, MIT-licensed, June 25 2026) and Claude Opus 4.8 (Anthropic, closed, May 28 2026) land close on coding. Opus 4.8 leads the SWE-bench family (~88 vs 82.4 on Verified) and long context (1M tokens); Ornith's 397B flagship is competitive on Terminal-Bench (harness-dependent), and the smaller Ornith models run free on your own hardware. Opus wins the accuracy benchmarks; Ornith wins on cost and privacy.

On June 25, 2026, DeepReinforce released Ornith 1.0 — a family of open-source models built specifically for agentic coding, shipped under the permissive MIT license in four sizes (9B Dense, 31B Dense, 35B MoE, 397B MoE). A month earlier, on May 28, Anthropic shipped Claude Opus 4.8, its closed frontier flagship with a 1M-token context window and a familiar API price tag. For the first time in a while, an open-weight model is posting coding benchmarks within striking distance of the best closed model on the market — and you can run a usable version of it on a laptop.

That makes for a genuinely interesting head-to-head, but only if you read the numbers honestly. The launch marketing benchmarks Ornith against Claude Opus 4.7, not 4.8. The Terminal-Bench comparison depends heavily on which harness you trust. And the model that posts the headline scores is a 397-billion-parameter behemoth that isn’t practical to run on consumer hardware. This piece separates the real wins from the framing, with every number sourced from the launch materials and independent write-ups.

What is Ornith 1.0?

Ornith 1.0 is a coding-specialized LLM family from DeepReinforce, released as open weights under the MIT license. There are four variants, post-trained on top of Gemma 4 and Qwen 3.5 base models:

Ornith-1.0-9B (Dense) — the edge model, designed to run on modest hardware.
Ornith-1.0-31B (Dense) — a mid-tier dense model.
Ornith-1.0-35B (MoE) — a mixture-of-experts model that activates only ~3B parameters per token, so it punches above its memory footprint.
Ornith-1.0-397B (MoE) — the datacenter flagship that posts the headline benchmark numbers.

The technically interesting part is how it was trained. Most agentic-coding RL pipelines hand the model a fixed, human-designed "scaffold" — the task harness that decides how the model reads files, runs commands, and checks its own work. Ornith's headline contribution is self-scaffolding RL: instead of a fixed harness, the model learns to write its own task scaffold during reinforcement learning. Scaffold and solution rollouts are optimized jointly, and higher-reward scaffolds get selected automatically. In effect, the model learns not just to solve coding tasks but to build the tooling it uses to solve them.
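To make the selection idea concrete, here is a toy sketch in Python. It is not DeepReinforce's training code or algorithm: the scaffold names, the success probabilities, and the `select_scaffold` helper are all invented for illustration, and a real pipeline would update model weights rather than pick from a fixed menu. It only shows the core loop described above: run solution rollouts under each candidate scaffold, score them by reward, and keep the scaffold that earns the most.

```python
import random
from statistics import mean

# Hypothetical candidate scaffolds, each mapped to an assumed per-rollout
# success probability. Names and numbers are made up for illustration.
SCAFFOLDS = {
    "read_then_patch": 0.40,
    "run_tests_first": 0.55,
    "plan_patch_verify": 0.70,
}


def rollout_reward(success_prob: float, rng: random.Random) -> float:
    """Simulate one solution rollout: reward 1.0 if the task passes, else 0.0."""
    return 1.0 if rng.random() < success_prob else 0.0


def select_scaffold(scaffolds, rollouts_per_scaffold=2000, seed=0):
    """Score each scaffold by its mean rollout reward and return the best one."""
    rng = random.Random(seed)
    scores = {
        name: mean(rollout_reward(prob, rng) for _ in range(rollouts_per_scaffold))
        for name, prob in scaffolds.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = select_scaffold(SCAFFOLDS)
    for name, score in scores.items():
        print(f"{name}: mean reward {score:.3f}")
    print("selected:", best)
```

The point of the toy is the coupling: a scaffold is only ever judged by the reward of the solutions produced under it, which is also why scores earned with a learned scaffold are hard to compare against scores earned with a generic one.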

That design choice matters for how you read the benchmarks — more on that below — but the practical upshot is a model line aimed squarely at multi-step, tool-using coding work rather than single-shot completion. The official launch tweet from @ornith_ pulled 6,705 likes and over 5 million views, and GGUF quantizations for the 9B and 35B were on Hugging Face within days. If you want the broader context on where this sits, our open-source LLM landscape guide tracks the wider field.

What is Claude Opus 4.8?

Claude Opus 4.8 is Anthropic's closed-weight frontier model, released May 28, 2026. It is not a coding-specialized model — it's a general-purpose flagship that happens to be excellent at code, and it tops several of the public coding leaderboards. Its headline specs:

Context window: 1M tokens of input, 128K tokens of output.
Pricing: $5 per million input tokens and $25 per million output tokens on the standard tier; a fast mode runs $10/$50 at roughly 2.5x speed.
General reasoning: 93.6 on GPQA Diamond, a hard science-reasoning benchmark, alongside its coding scores.

You consume Opus 4.8 through Anthropic's API, Claude Code, or the various IDE integrations. There are no weights to download, no self-hosting, and no way to keep your code off a vendor's servers. In exchange you get the stronger model on the public SWE-bench numbers and by far the longer context window of the two (1M tokens versus Ornith's 256K). For setup, agent integration, and the full spec sheet, see our Claude Opus 4.8 launch guide.

How do Ornith 1.0 and Claude Opus 4.8 compare on coding benchmarks?

Here's the head-to-head on the benchmarks both camps report, using Ornith's 397B flagship as the open-source representative and noting harness caveats where they exist. All figures are self-reported by the respective vendors or aggregated by llm-stats; treat them as launch-day claims, not independently audited results.

Benchmark	Ornith-1.0-397B	Claude Opus 4.8	Claude Opus 4.7
SWE-bench Verified	82.4	88.6 (87.6 per DeepReinforce table)	80.8
SWE-bench Pro	62.2	69.2	64.3
SWE-bench Multilingual	78.9	84.4	—
Terminal-Bench 2.1	77.5	74.6 (Terminus-2) / 85 (DeepReinforce table)	70.3
NL2Repo	48.2	—	—
ClawEval (avg)	77.1	—	78.2

Read top to bottom, the pattern is consistent: Opus 4.8 wins the SWE-bench family clearly. On SWE-bench Verified, the most-cited agentic-coding benchmark, Opus 4.8 sits around 88 versus Ornith-397B's 82.4 — a gap of roughly 5 to 6 points. On SWE-bench Pro (a harder, more contamination-resistant variant) it's 69.2 versus 62.2. On SWE-bench Multilingual it's 84.4 versus 78.9. These are not rounding errors; on raw patch-correctness accuracy, the closed flagship leads.
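If you want to check those gaps yourself, the arithmetic is short. The sketch below copies the vendor-reported scores from the table and subtracts them; the dictionary names and the `gaps` helper are just illustrative.

```python
# Vendor-reported scores copied from the comparison table (not independently audited).
ORNITH_397B = {
    "SWE-bench Verified": 82.4,
    "SWE-bench Pro": 62.2,
    "SWE-bench Multilingual": 78.9,
}
OPUS_4_8 = {
    "SWE-bench Verified": 88.6,  # DeepReinforce's own table lists 87.6 instead
    "SWE-bench Pro": 69.2,
    "SWE-bench Multilingual": 84.4,
}


def gaps(challenger, leader):
    """Return leader minus challenger for every shared benchmark, to one decimal."""
    return {
        name: round(leader[name] - challenger[name], 1)
        for name in challenger
        if name in leader
    }


if __name__ == "__main__":
    for name, gap in gaps(ORNITH_397B, OPUS_4_8).items():
        print(f"{name}: Opus 4.8 leads by {gap} points")
```

Swapping in the 87.6 figure from DeepReinforce's table narrows the Verified gap from 6.2 to 5.2 points, which is where the "5 to 6" range comes from.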

The exception — and the one Ornith's marketing leans on hardest — is Terminal-Bench 2.1, where Ornith-397B posts 77.5. That clearly beats Opus 4.7's 70.3. Whether it beats Opus 4.8 is the contentious part, and it deserves its own section.

Why the Terminal-Bench numbers don't line up

This is the single most important caveat in the entire comparison, so don't skip it. Ornith-397B scores 77.5 on Terminal-Bench 2.1. For Opus 4.8 on the same benchmark, you'll see two very different numbers depending on the source:

74.6 — measured on the public Terminus-2 harness, as reported by llm-stats.
85 — listed in DeepReinforce's own comparison table.

A gap of more than 10 points for the same model on the same benchmark is a harness problem, not a model problem. Terminal-Bench scores depend enormously on the scaffold wrapped around the model: the tool definitions, the retry logic, the way the agent loop is structured. Recall that Ornith's whole training method is self-scaffolding: it optimizes its own harness during RL. That means a cross-table Terminal-Bench comparison where Ornith runs its learned scaffold and Opus runs a generic public harness is close to apples-to-oranges. You're partly measuring the harness, not the model.

The honest read: if you trust the public Terminus-2 number (74.6), Ornith's 77.5 edges out Opus 4.8 on this one benchmark. If you trust DeepReinforce's own table (85), Opus 4.8 wins it comfortably. Either way, the result is harness-sensitive enough that it should not anchor your decision. Treat Terminal-Bench as "Ornith is competitive on terminal-driven agentic tasks," not "Ornith beats the frontier." For background on how agent harnesses shape real-world results, our AI coding agents guide goes deeper.
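A few lines of Python show how completely the verdict depends on which Opus 4.8 number you plug in. The scores come from the sources named above; the `verdicts` helper and the labels are illustrative only.

```python
ORNITH_TB = 77.5  # Ornith-397B on Terminal-Bench 2.1, as reported by DeepReinforce

# The two published Opus 4.8 figures for the same benchmark.
OPUS_TB_BY_SOURCE = {
    "Terminus-2 (llm-stats)": 74.6,
    "DeepReinforce table": 85.0,
}


def verdicts(ornith_score, opus_by_source):
    """Map each source to (leader, margin in points) for that source's Opus score."""
    results = {}
    for source, opus_score in opus_by_source.items():
        margin = round(ornith_score - opus_score, 1)
        if margin > 0:
            leader = "Ornith-397B"
        elif margin < 0:
            leader = "Opus 4.8"
        else:
            leader = "tie"
        results[source] = (leader, abs(margin))
    return results


if __name__ == "__main__":
    for source, (leader, margin) in verdicts(ORNITH_TB, OPUS_TB_BY_SOURCE).items():
        print(f"{source}: {leader} leads by {margin} points")
```

The leader flips from Ornith (by 2.9) to Opus (by 7.5) without either model changing, which is the whole argument for not anchoring on this benchmark.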

What about the smaller, laptop-runnable Ornith models?

The 397B flagship posts the headline scores, but it needs server-class, multi-GPU hardware. It's not a laptop model. The genuinely interesting story for most developers is the smaller end of the family, where Ornith's claims get surprising.

Model	Type	SWE-bench Verified	Terminal-Bench 2.1	Runs locally?
Ornith-1.0-9B	Dense	69.4	43.1	Yes — edge hardware
Ornith-1.0-35B	MoE (~3B active)	—	64.2	Yes — 8GB GPU + 32GB RAM
Ornith-1.0-397B	MoE	82.4	77.5	No — multi-GPU server
Claude Opus 4.8	Closed	88.6	74.6	No — API only

Two things stand out. First, the 9B edge model scores 69.4 on SWE-bench Verified — well below the frontier, but reportedly beating much heavier 30B-plus models like Gemma 4-31B and Qwen 3.6-35B. For a 9B model you can run on modest hardware, that's a strong showing on a benchmark where most small models collapse.

Second, the 35B MoE is the sweet spot for local agentic coding. Because it's a mixture-of-experts that activates only ~3B parameters per token, it runs fast for its size. A Reddit user on r/LocalLLM reported the 35B GGUF Q4 build running at 25-35 tokens/second on an ASUS laptop with an 8GB RTX 4060 and 32GB of RAM via llama.cpp. Developer Alex Finn (@AlexFinn) posted a similarly positive, widely shared hands-on take on the 35B running locally.
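To translate tokens per second into wall-clock time, here is a small Python sketch. The 50,000-token output size is an illustrative assumption, not a measurement, and the helper ignores prompt processing and tool-call latency, so treat the result as a lower bound.

```python
def generation_minutes(output_tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to generate output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second / 60


if __name__ == "__main__":
    # Reported range for the 35B Q4 build on an 8GB RTX 4060: 25-35 tok/s.
    for rate in (25, 35):
        minutes = generation_minutes(50_000, rate)
        print(f"{rate} tok/s -> {minutes:.1f} minutes for 50,000 output tokens")
```

At the reported rates, that hypothetical 50,000-token task takes roughly 24 to 33 minutes of pure generation: slow for interactive use, workable for an unattended agent loop.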

The 31B Dense model sits between the two and is the least-benchmarked of the family at launch; if you have the VRAM for it but not for the 397B, it's worth testing, but the MoE-vs-dense efficiency math means the 35B MoE will usually be faster for comparable quality. For most local users the decision is simply 9B (tight hardware) or 35B MoE (the default), with the 31B and 397B as edge cases at opposite ends of the hardware spectrum.

That's the real headline for self-hosters: not "open source matches the frontier" (it doesn't, on SWE-bench), but "you can get frontier-ish agentic coding running locally on a consumer laptop, for free." If that's your goal, pair this with our roundup of the best free local LLM tools.

What does it cost — free local vs the Opus API?

This is where the comparison stops being close and starts being lopsided — in Ornith's favor. Opus 4.8 is metered per token; Ornith is free to run.

Cost dimension	Ornith 1.0	Claude Opus 4.8
License	MIT (open weights)	Closed / proprietary
Per-token price	$0 — self-hosted	$5/M input, $25/M output (standard)
Fast tier	n/a	$10/M in, $50/M out (~2.5x speed)
Marginal cost of a run	Electricity only	Scales with every token
Up-front cost	Hardware you already own / buy once	None
Privacy	Code never leaves your machine	Code goes to Anthropic's API

The economics flip depending on volume. For occasional use, the Opus API is cheap in absolute terms: a few dollars a month, with no hardware to buy or maintain. But agentic coding is token-hungry: a single long-horizon task that reads a repo, runs tests, and iterates can burn hundreds of thousands of output tokens, and output is the expensive side at $25/M. If you're running agents in a loop (the always-on, overnight pattern more developers are moving toward), those tokens compound fast, and a self-hosted model that costs nothing per token becomes very attractive.

A rough worked example makes the crossover concrete. Say one agentic task averages 200K input tokens (repo context, file reads, test output fed back over many turns) and 50K output tokens. On Opus 4.8 standard pricing that's $1.00 of input plus $1.25 of output — about $2.25 per task. Run 30 of those a day and you're near $67/day, or roughly $2,000/month per developer running agents heavily. The same workload on a self-hosted 35B MoE costs the electricity to keep an 8GB GPU busy — cents per day. The Opus math is fine for light or occasional use; it's the always-on, loop-it-overnight pattern that makes the free local model compelling. (Treat these figures as illustrative — your token mix will differ — but the shape holds: API cost scales with usage, self-hosted cost doesn't.)
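The same arithmetic as a reusable Python sketch, so you can plug in your own token mix. The prices are the standard and fast-tier rates quoted above; the `Pricing` class and helper names are illustrative, and the 30-day month is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """API price in dollars per million tokens."""
    input_per_million: float
    output_per_million: float


OPUS_STANDARD = Pricing(input_per_million=5.0, output_per_million=25.0)
OPUS_FAST = Pricing(input_per_million=10.0, output_per_million=50.0)


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at the given per-million-token prices."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


def monthly_cost(pricing, input_tokens, output_tokens, tasks_per_day, days=30):
    """Dollar cost of running the same task tasks_per_day times for `days` days."""
    return task_cost(pricing, input_tokens, output_tokens) * tasks_per_day * days


if __name__ == "__main__":
    per_task = task_cost(OPUS_STANDARD, 200_000, 50_000)
    per_month = monthly_cost(OPUS_STANDARD, 200_000, 50_000, tasks_per_day=30)
    print(f"per task:  ${per_task:.2f}")    # $2.25
    print(f"per month: ${per_month:,.2f}")  # $2,025.00
```

On the fast tier the same workload doubles to $4.50 per task, or $4,050 a month.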

The privacy axis is binary and often decides it on its own. With Ornith, your proprietary code never leaves your machine. With Opus, every file the agent reads goes to a third-party API. For regulated industries, security-sensitive codebases, or teams that simply can't send source to a vendor, that single fact can make the open model the only option regardless of the benchmark gap. Our self-hosting LLMs guide covers the operational side of running a private model in earnest.

How do you run Ornith 1.0 locally?

Day-one tooling support was unusually good for an open-weight launch. Ornith runs in Ollama, llama.cpp, vLLM, and Unsloth out of the box, and GGUF quantizations (including bartowski's quants) for the 9B and 35B were on Hugging Face within days.

The simplest path is Ollama — pull and run the size you want:

ollama run ornith:35b   # 35B MoE — the local default
ollama run ornith:9b    # 9B — for tighter hardware

If you'd rather drive the quantized GGUF directly — the route the r/LocalLLM benchmark used to hit 25-35 tok/s on an 8GB RTX 4060 — grab the Q4 build and run it under llama.cpp:

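# 35B MoE, Q4 quant, via llama.cpp
# model file: deepreinforce-ai_Ornith-1.0-35B-Q4_K_L.gguf
# Assumed invocation (not taken from the Reddit benchmark): tune -ngl, the number of layers offloaded to the GPU, to fit your VRAM
llama-cli -m deepreinforce-ai_Ornith-1.0-35B-Q4_K_L.gguf -ngl 20 -c 8192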

For most laptops and single-GPU desktops, the 35B MoE at Q4 is the variant to start with — it's the best balance of quality and speed for local agentic coding. The 9B is the fallback for tighter hardware. If you're choosing a local runtime, our breakdown of Ollama vs LM Studio vs vLLM vs llama.cpp vs MLX will help you pick the right one for your machine.
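Once a model is pulled, you can also script it instead of chatting in the terminal. The sketch below assumes Ollama is serving on its default local port (11434) and that the `ornith:35b` tag from the commands above is installed; it uses Ollama's `/api/generate` endpoint with streaming turned off. The helper names are illustrative.

```python
import json
import urllib.request

# Ollama's default local endpoint; change this if you run the server elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "ornith:35b") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally served model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "ornith:35b", timeout: float = 600) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example (requires a running Ollama server with the model pulled):
#   print(generate("Write a Python function that reverses a string."))
```

Because the request never leaves localhost, this is also the privacy property in practice: the prompt, and any code pasted into it, stays on your machine.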

Where does each model actually win?

Strip away the launch framing and the decision comes down to a handful of clear tradeoffs.

Claude Opus 4.8 wins on:

Raw SWE-bench accuracy. A 5-6 point lead on SWE-bench Verified and clear wins on Pro and Multilingual. It leads all three SWE-bench variants reported here, so on hard, real-world patch correctness it's the more reliable model.
Long context. 1M input tokens versus Ornith's 256K. For whole-repo reasoning or massive context dumps, Opus has a structural advantage.
Zero ops. No GPUs, no quants, no driver versions — just an API key.

Ornith 1.0 wins on:

Cost. Free, MIT-licensed, zero per-token charge. For high-volume or always-on agent loops, this is decisive.
Privacy and control. Your code never leaves your hardware. Open weights mean you can fine-tune, audit, and deploy without vendor lock-in.
Local feasibility. The 35B MoE delivers usable agentic coding at 25-35 tok/s on an 8GB consumer GPU. Nothing closed comes close to that on-device.
Terminal-driven tasks. Competitive-to-leading on Terminal-Bench (with the harness caveat), thanks to self-scaffolding.

A reasonable default for many teams is to run both: the 35B MoE locally for the bulk of iterative, private, token-heavy agent work, and Opus 4.8 via API for the hard cases where you need the extra accuracy or the million-token context. The benchmark gap is small enough that the open model handles most day-to-day coding, and the closed model is there for the minority of tasks that genuinely need the frontier. If you're weighing other open contenders too, our GLM 5.2 vs Opus 4.8 comparison covers an adjacent matchup.
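One way to express that split is a small routing rule. This is a hypothetical sketch, not a recommended policy: the `Task` fields, the route labels, and the decision order are assumptions, and it presumes your local build can actually use the 256K window quoted above, which depends on your quant and memory.

```python
from dataclasses import dataclass

ORNITH_CONTEXT = 256_000   # Ornith context window cited in this comparison
OPUS_CONTEXT = 1_000_000   # Opus 4.8 input context window


@dataclass
class Task:
    context_tokens: int
    must_stay_private: bool = False  # source code may not leave the machine
    is_hard: bool = False            # flagged as needing peak accuracy


def route(task: Task) -> str:
    """Pick a backend: privacy first, then context size, then difficulty."""
    if task.context_tokens > OPUS_CONTEXT:
        raise ValueError("task exceeds both context windows")
    if task.must_stay_private:
        if task.context_tokens > ORNITH_CONTEXT:
            raise ValueError("private task does not fit the local context window")
        return "ornith-local"
    if task.context_tokens > ORNITH_CONTEXT or task.is_hard:
        return "opus-api"
    return "ornith-local"


if __name__ == "__main__":
    print(route(Task(context_tokens=50_000)))                # ornith-local
    print(route(Task(context_tokens=500_000)))               # opus-api
    print(route(Task(context_tokens=50_000, is_hard=True)))  # opus-api
```

Privacy is checked first on purpose: a private task that is also hard still stays local, because sending it out is not an option.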

A note on trusting these benchmarks

Everything above rests on launch-day numbers, and you should treat them with appropriate skepticism. Three caveats worth holding onto:

First, the benchmarks are self-reported. DeepReinforce published its own scores at launch, and early community reaction on r/LocalLLaMA was notably wait-and-see. Independent third-party runs hadn't landed at the time of writing; until they do, read the SOTA-adjacent claims as unverified.

Second, the marketing compares Ornith against Opus 4.7, not 4.8. Ornith's headline comparisons ("Terminal-Bench 77.5 vs Claude Opus 4.7's 70.3") use the older flagship. Against the current Opus 4.8, the SWE-bench Verified gap is 5-6 points in Opus's favor. Don't let the 4.7 framing imply Ornith beats the current frontier across the board. It doesn't.

Third, SWE-bench Verified is Python-heavy, and no benchmark predicts performance on your specific codebase. A model that tops SWE-bench may stumble on your TypeScript monorepo or your legacy C++ service. The only benchmark that matters is the one you run on your own tasks — so before you commit either way, try both on a representative slice of your actual work.
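A minimal harness for that kind of side-by-side test might look like the Python sketch below. Everything here is a placeholder: the two "models" are stub functions standing in for a local Ornith call and an Opus API call, and the checkers are simple string tests where you would normally run your own test suite.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]               # prompt -> model output
Task = Tuple[str, Callable[[str], bool]]   # (prompt, checker for the output)


def pass_rate(model: Model, tasks: List[Task]) -> float:
    """Fraction of tasks whose output passes its checker."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for prompt, check in tasks if check(model(prompt)))
    return passed / len(tasks)


def compare(models: Dict[str, Model], tasks: List[Task]) -> Dict[str, float]:
    """Run every model over the same task list and report each pass rate."""
    return {name: pass_rate(model, tasks) for name, model in models.items()}


if __name__ == "__main__":
    # Stub tasks and stub models, for illustration only.
    tasks = [
        ("add 2 and 2", lambda out: out.strip() == "4"),
        ("uppercase a", lambda out: out.strip() == "A"),
    ]
    models = {
        "local-stub": lambda prompt: "4",
        "api-stub": lambda prompt: "4" if "add" in prompt else "A",
    }
    print(compare(models, tasks))
```

The design choice that matters is running both models over the identical task list with the identical checker, so the only variable left is the model and whatever harness you wrapped around it.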

The fair summary: Ornith 1.0 is a remarkable open-source release that closes most of the gap on agentic coding for a free, self-hostable model. It is not "equal to the frontier." Opus 4.8 still wins on raw SWE-bench accuracy and long context. But "closes most of the gap, runs free on a laptop, keeps your code private" is a genuinely new thing in mid-2026, and for a large share of real coding work it's enough.

FAQ

Is Ornith 1.0 better than Claude Opus 4.8 for coding?

Not on raw accuracy. Opus 4.8 leads SWE-bench Verified by roughly 5-6 points (~88 vs 82.4 for Ornith's 397B flagship) and wins SWE-bench Pro and Multilingual. Ornith is competitive on Terminal-Bench (77.5, harness-dependent) and wins decisively on cost and privacy because it's free and self-hostable. "Better" depends on whether you're optimizing for peak accuracy or for free, private, local execution.

Can I run Ornith 1.0 on a laptop?

Yes — the smaller variants. The 35B MoE (Q4 GGUF) runs at 25-35 tokens/second on an 8GB RTX 4060 laptop with 32GB RAM via llama.cpp, and the 9B runs on lighter hardware. The 397B flagship that posts the headline benchmarks needs datacenter-class multi-GPU hardware and is not laptop-runnable.

How much does Claude Opus 4.8 cost versus Ornith?

Opus 4.8 costs $5 per million input tokens and $25 per million output tokens on the standard tier (fast mode is $10/$50 at ~2.5x speed). Ornith is free under the MIT license — your only cost is the hardware and electricity to run it. For high-volume or always-on agent loops, the self-hosted model is dramatically cheaper at the margin.

What is self-scaffolding RL?

It's Ornith's training method. Instead of using a fixed, human-designed task harness during reinforcement learning, the model learns to write its own scaffold — the tooling it uses to read files, run commands, and check its work. Scaffold and solution are optimized jointly, with higher-reward scaffolds selected automatically. It also explains why Ornith's Terminal-Bench numbers are hard to compare directly against models running a generic public harness.

Why are the Terminal-Bench scores inconsistent?

Because Terminal-Bench results depend heavily on the harness wrapped around the model. Opus 4.8 scores 74.6 on the public Terminus-2 harness but is listed at 85 in DeepReinforce's own table. Since Ornith optimizes its own harness via self-scaffolding, any cross-table Terminal-Bench comparison is partly measuring the scaffold, not the model. Always check which harness a Terminal-Bench number used.

Should I switch from Opus to Ornith?

For most teams, run both rather than switching outright: the 35B MoE locally for high-volume, private, iterative agent work, and Opus 4.8 via API for the hardest cases needing top accuracy or its 1M-token context. If privacy or per-token cost is your binding constraint, Ornith can be the primary model. If peak reliability on complex tasks is what you need, keep Opus in the loop.

Whichever way you lean, the right move is to test both on a representative slice of your own codebase before committing.