Ornith

How to Run Ornith 1.0 Locally

Ornith 1.0 is DeepReinforce's open-source, self-scaffolding family of agentic coding models, post-trained on Qwen 3.5 and Gemma 4. This guide shows how to run each variant locally - 9B on a laptop, 35B MoE on a 24GB card, 397B on an 8-GPU box - with Ollama, LM Studio and vLLM, plus agent settings.

Published 30 Jun 2026 • Updated 30 Jun 2026 • 17 min read

Ornith 1.0 is DeepReinforce's open-source, MIT-licensed, self-scaffolding family of agentic coding models. The fastest path is Ollama: install it, then run ollama run ornith:9b on a laptop or ollama run ornith:35b on a 24 GB GPU. Use LM Studio for a GUI, or vLLM to serve the 35B-FP8 and 397B from Hugging Face. Two sizes are one-command runnable; the 31B Dense is announced but has no public checkpoint yet.

DeepReinforce shipped Ornith 1.0 on June 25, 2026, and it is one of the more interesting open-weights releases of the year — not because it sweeps every leaderboard (it doesn't), but because of how it was trained. Ornith is a family of agentic coding models under the permissive MIT license, published on Hugging Face under the deepreinforce-ai org and listed on Ollama within days. There are four announced sizes; two of them you can pull and run in about five minutes if you already have a local stack.

This is the practical guide: which variant fits your hardware, the exact commands to run each under Ollama, LM Studio, and vLLM, what the benchmarks actually say once you strip the launch-day overclaiming, and how to wire a locally served Ornith into a real coding agent — including the tool-call and reasoning-parser settings that people are tripping over in the first week. If you set up other open models recently, little here will surprise you; local-LLM tooling has matured to where a new model is mostly a new tag. If you are newer to it, our roundup of the best free local LLM tools in 2026 is a good companion.

What is Ornith 1.0, and why the training matters

Most "agentic" coding models are a base LLM plus a human-designed harness: someone hand-writes the scaffolding — the prompt templates, the tool-call format, the plan-then-act loop, the retry logic — and the model is trained to operate inside it. The scaffold is fixed and external. Ornith's headline idea is to stop treating the scaffold as a separate, hand-built artifact.

Ornith is self-scaffolding. During reinforcement learning, the process jointly produces both the solution rollouts and the task-specific scaffolds that guide them. Per DeepReinforce's technical write-up, each RL step runs in two stages: conditioned on a task and the scaffold previously used for it, the model first proposes a refined scaffold; then, conditioned on that scaffold, it generates a solution rollout. The reward from the rollout flows back to both stages, so the model is optimized not just to answer well but to author the orchestration that elicits the answer. Over training this becomes a feedback loop in which scaffolds are continually mutated and selected toward the ones that produce higher-reward trajectories — per-task strategies emerge automatically, without a hand-engineered harness. That is the part DeepReinforce is actually selling, and it is the reason the model behaves more like a coding agent and less like a chatbot that happens to write code.

Letting a model write its own scaffold invites reward hacking — a self-authored harness can learn to satisfy the verifier without doing the task (reading visible test files, hardcoding expected outputs, copying an oracle solution in the environment). DeepReinforce says it defends against this in three layers: a fixed outer trust boundary (the environment, tool surface, and test isolation are immutable and out of the model's reach); a deterministic monitor that zero-rewards any trajectory that reads withheld paths or edits verification scripts; and a frozen LLM judge that acts as a veto on top of the verifier. Worth knowing, because it is the difference between "the benchmark numbers mean something" and "the model gamed its own grader."

A few concrete facts to pin down before you install anything:

Base models (per variant): the 9B Dense, 35B MoE, and 397B MoE are post-trained on Qwen 3.5; the 31B Dense is post-trained on Gemma 4, per the official Ornith FAQ. This is post-training, not a from-scratch pretrain — DeepReinforce's contribution is the RL and the self-scaffolding objective layered on those bases. The Qwen lineage has a practical consequence covered below: Ornith uses Qwen-style tool-calling and a <think> reasoning block. If you already run Qwen locally, our guide to running Qwen 3.6 locally covers a closely related setup.
License: MIT. Commercial use, fine-tuning, and shipping it inside a product are all permitted without a separate grant.
Context window: 256K tokens per the Ollama listing — that is 262,144 tokens, and one public vLLM run serves the 35B at that full length. Long enough to hold a meaningful slice of a repo plus the agent's working state.
Distribution: weights on Hugging Face under deepreinforce-ai, with formats varying by size (9B in bf16 + GGUF; 35B in bf16 + GGUF + FP8; 397B in bf16 + FP8 — full breakdown below); the 9B and 35B are also on the Ollama library.

Which Ornith 1.0 variant should you run?

Four variants: two dense, two mixture-of-experts (MoE). The practical decision is almost entirely about VRAM, so here is the lineup mapped to what it needs to load. Memory figures come from DeepReinforce's official Ornith site and model page; download sizes come from the Ollama library listing.

Variant	Architecture	Base	Approx. memory to run	How to run it
9B	Dense, 9B	Qwen 3.5	~19 GB (bf16) · ~6 GB (Q4) · 5.6 GB Ollama tag	Ollama, LM Studio
31B	Dense, 31B	Gemma 4	~62 GB (bf16) · ~20 GB (Q4)	vLLM — once a public checkpoint exists
35B MoE	MoE, 35B total (~3B active/token)	Qwen 3.5	~25 GB (Q5_K_M) · 21 GB Ollama tag	Ollama, LM Studio, vLLM
397B MoE	MoE, 397B total	Qwen 3.5	~400 GB (bf16) · ~200 GB (FP8), across 8×80 GB GPUs	vLLM (multi-GPU)

My read on the lineup:

9B is the default and the right starting point. Quantized to Q4 it fits in roughly 6 GB, so it runs on a mid-range laptop GPU, an 8 GB card, or any Apple Silicon Mac with 16 GB of unified memory. It is also the ornith:latest tag, so ollama run ornith gives you this one. Early community reports say even the 9B at Q6_K holds up inside an agent loop (one pairs it with the Hermes agent) — not flagship-quality, but a real, private workhorse.
35B MoE is the sweet spot for a 24 GB card or a 32 GB+ Mac. Because only ~3B parameters are active per token, it generates at a speed closer to a small dense model while drawing on far more total capacity — DeepReinforce even frames it as effectively faster than the 9B for that reason. The Ollama tag is 21 GB and the Q5_K_M footprint is ~25 GB, so a 24 GB GPU or a 32 GB+ unified-memory Mac is the practical floor, with context length affecting headroom. This is the one I would actually use day to day.
31B Dense is the odd one out: it is the only variant built on Gemma 4 rather than Qwen 3.5, it is heavier to run than the 35B MoE (a dense 31B keeps all 31B active), and — critically — there is no public checkpoint for it yet. The Hugging Face collection ships 9B, 35B, and 397B in bf16/GGUF/FP8; the 31B is announced but absent. Treat it as not-yet-runnable until DeepReinforce publishes it.
397B MoE is the flagship and the one the headline numbers are about — but it needs roughly 200 GB even in FP8, spread across eight 80 GB GPUs. This is a datacenter or rented-8×H100 model, not a homelab one. Run it "locally" only if locally means a node you control.

If you are weighing runtimes in general, our breakdown of Ollama vs LM Studio vs vLLM vs llama.cpp vs MLX goes deeper than this post can; the short version is below.

Run Ornith 1.0 with Ollama (the easy path)

Ollama is the lowest-friction way to get Ornith running and the only runtime here with official Ornith library tags. Grab Ollama from ollama.com — native macOS and Windows apps, plus a one-line Linux installer. Then pull and run a model.

For most laptops, start with the 9B:

ollama run ornith:9b

If you have a 24 GB GPU or a Mac with plenty of unified memory, the 35B MoE is the better experience:

ollama run ornith:35b

And if you just want the default without thinking about it, this pulls the latest tag, which is the 9B:

ollama run ornith

The tags currently published to the Ollama library are:

ornith:latest — the 9B, 5.6 GB download, 256K context
ornith:9b — 5.6 GB, 256K context
ornith:35b — 21 GB, 256K context

Note what is not there: the 31B and 397B are not on Ollama. For the 397B you pull the weights from Hugging Face and serve them with vLLM (below); the 31B has no public checkpoint at all yet. The first time you run a tag, Ollama downloads it; after that the same command starts a chat session. The daemon also exposes an OpenAI-compatible API on http://localhost:11434/v1 the moment it is running, which is what makes Ornith easy to plug into editors and agents. One practical note: don't assume the full 256K context for local runs — a lower context setting is usually faster and far more memory-stable, and Ornith's long reasoning blocks (more on that below) eat context quickly.

Run Ornith 1.0 with LM Studio (GUI + local server)

If you would rather not live in a terminal, LM Studio is the friendliest option. It ships a model browser, a chat UI, and a built-in OpenAI-compatible server, and it runs GGUF files — which Ornith provides. The workflow:

Install LM Studio and open the model search/discover tab. Our LM Studio complete guide covers the install and quant-picking UI in detail.
Search for Ornith, or load the GGUF repos directly: deepreinforce-ai/Ornith-1.0-9B-GGUF for the 9B and deepreinforce-ai/Ornith-1.0-35B-GGUF for the 35B MoE.
Pick a quant that fits your memory — Q4 for the 9B if you are tight on VRAM, Q5_K_M for the 35B if you have ~25 GB to spare — and download it.
Load the model and chat, or flip on LM Studio's Local Server to expose an OpenAI-compatible endpoint (default http://localhost:1234/v1) for use from your editor or scripts.

There are no Ornith-specific LM Studio commands to publish — it is a GUI; the steps above are just standard LM Studio usage pointed at the Ornith GGUF repos. It is the easiest way to A/B different quant levels of the 9B and 35B without memorizing flags, and the server toggle gives you the same OpenAI-compatible API as Ollama in a couple of clicks. One Ornith-specific caveat carries over from below: because it is a reasoning model, set a generous max-tokens / response length in the chat settings, or the model can spend its whole budget thinking and never reach the answer.

Run the bigger variants with vLLM

For the 35B at full or FP8 precision and the 397B flagship — anything beyond the convenience tags — vLLM is the standard tool. It is a high-throughput GPU inference server with tensor parallelism for splitting a model across cards, and it speaks the OpenAI API out of the box. DeepReinforce doesn't publish Ornith-branded vLLM commands, so the snippets below are standard vLLM usage pointed at the Ornith Hugging Face repos — but two flags are genuinely Ornith-specific and you will want them.

The simplest case — serve the 35B from its bf16 weights:

pip install vllm
vllm serve deepreinforce-ai/Ornith-1.0-35B --trust-remote-code

For agentic use you almost certainly want the FP8 build, a long context, and — this is the important part — the parsers that make Ornith's <think> reasoning and Qwen-style tool calls come back in clean OpenAI format. A public run on an NVIDIA GB10 / DGX Spark-class box documents a working configuration; lightly generalized:

vllm serve deepreinforce-ai/Ornith-1.0-35B-FP8 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name ornith-35b \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.80 \
  --max-num-batched-tokens 4096 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code

What those Ornith-specific flags do, and why they matter:

--tool-call-parser qwen3_xml — Ornith inherits Qwen's tool-call format. With this parser, function-calling prompts come back with finish_reason: "tool_calls" and proper OpenAI-format tool_calls, which is what an agent harness needs.
--reasoning-parser qwen3 — Ornith emits a <think> block before its answer. This parser splits that into a separate reasoning field instead of leaking it into content, so your client can hide the thinking and act on the answer.
--kv-cache-dtype fp8 and --gpu-memory-utilization 0.80 — needed to fit a long context on a single ~128 GB-class box; the same report saw a cudaErrorIllegalInstruction at 0.85 during CUDA-graph capture and had to back off to 0.80.

The 397B flagship needs roughly eight 80 GB GPUs (per DeepReinforce's own hardware note), served in FP8 with tensor parallelism — the command itself is standard vLLM:

vllm serve deepreinforce-ai/Ornith-1.0-397B-FP8 --tensor-parallel-size 8 --trust-remote-code

The repos actually visible in the deepreinforce-ai collection are the 9B (bf16 + GGUF), the 35B (bf16 + GGUF + FP8), and the 397B (bf16 + FP8). The 31B Dense has no repo, so there is nothing to vllm serve for it today; when it lands, the pattern is identical — pass its repo id. Once any model is up, vLLM serves an OpenAI-compatible endpoint on http://localhost:8000/v1.

Rule of thumb on runtimes: Ollama for the 9B/35B on one machine, LM Studio for a GUI and easy quant swapping, vLLM when you need throughput, multi-GPU, long context, or the FP8 builds Ollama doesn't carry.

Per-OS quick notes

The commands above are cross-platform, but the practical details differ by OS:

Linux (NVIDIA): the full menu is open to you. Ollama's one-line installer, LM Studio's AppImage, or vLLM with CUDA. vLLM is Linux-first — the 35B-FP8 and 397B paths assume a recent CUDA stack and, for the flagship, multi-GPU with tensor parallelism. Add --trust-remote-code (Ornith ships a custom chat template) and budget 8–10 minutes of cold-start on the first vLLM launch while weights load and CUDA graphs capture.
macOS (Apple Silicon): Ollama and LM Studio both run natively on the Metal backend and are the realistic paths; vLLM is not the Mac story. The 9B Q4 is comfortable on a 16 GB Mac; the 35B MoE wants 32 GB+ of unified memory. Because MoE keeps only ~3B active per token, the 35B feels snappier on a Mac than its 35 GB-class total would suggest. Our Apple Silicon LLM guide covers unified-memory headroom and how to leave room for the rest of your system.
Windows: the native Ollama and LM Studio apps both work. For vLLM, run it under WSL2 with the CUDA toolkit rather than native Windows — that is the supported, low-friction path.

How good is Ornith 1.0, really?

Unlike a lot of releases, DeepReinforce published benchmarks for the small variants too — so the models you actually run at home have numbers, not just the flagship. All figures below are self-reported from the official page; read them as a vendor's best case, not an independent audit. The first table is the 397B flagship against the current frontier.

Benchmark	Ornith 397B	Qwen3.5 397B	GLM-5.2 744B	Opus 4.7	Opus 4.8
Terminal-Bench 2.1 (Terminus-2)	77.5	53.5	81.0	70.3	85.0
Terminal-Bench 2.1 (Claude Code)	78.2	48.6	82.7	69.7	78.9
SWE-Bench Verified	82.4	76.4	—	80.8	87.6
SWE-Bench Pro	62.2	51.6	62.1	64.3	69.2
SWE-Bench Multilingual	78.9	69.3	—	—	—
NL2Repo	48.2	36.8	48.9	—	69.7
ClawEval (avg)	77.1	70.7	—	78.2	—
SWE Atlas (QnA / RF / TW)	41.2 / 42.6 / 39.1	20.4 / 18.4 / 18.5	—	40.3 / 48.6 / 38.5	48.8 / 46.7 / —

The honest framing, because this is exactly the claim that gets inflated in launch coverage: the 397B Ornith beats Claude Opus 4.7 on the two headline benchmarks — Terminal-Bench 2.1 and SWE-Bench Verified. That is a real result for an MIT-licensed, downloadable model. But it is not a clean sweep: Opus 4.7 still leads Ornith on SWE-Bench Pro (64.3 vs 62.2), ClawEval (78.2 vs 77.1), and SWE Atlas RF (48.6 vs 42.6). And the flagship clearly trails the genuine frontier — Claude Opus 4.8 is ahead on essentially everything it reports, and GLM-5.2-744B leads on both Terminal-Bench harnesses. So the accurate one-liner is: an open, MIT-licensed coding model that edges Opus 4.7 on the marquee tests while trailing Opus 4.8 and GLM-5.2-744B. For where Ornith sits among the other open models worth running, see our open-source LLM landscape for 2026.

Two caveats matter more than the exact decimals. First, harness dependence is real and large. Look at the top two rows: the same model on the same Terminal-Bench 2.1 scores 77.5 under the Terminus-2 harness but 78.2 under the Claude Code harness — the harness alone moves the number. DeepReinforce's footnotes spell out that every score comes from a specific harness (Terminus-2, Claude Code 2.1.126, OpenHands, mini-SWE-agent) at a specific temperature and context length. Change the loop and you change the result, which is the self-scaffolding thesis made literal. Second, these are self-reported; the r/LocalLLaMA release thread carried a wry caveat from the start — "let's see if this holds" — and independent reproductions were still trickling in at publish time.

Now the part the original launch buzz buried — the variants you can actually run:

Benchmark	Ornith 9B	Ornith 35B	Qwen3.5 35B	Qwen3.6 35B	Gemma4 31B
Terminal-Bench 2.1 (Terminus-2)	43.1	64.2	41.4	52.5	42.1
SWE-Bench Verified	69.4	75.6	70.0	73.4	52.0
SWE-Bench Pro	42.9	50.4	44.6	49.5	35.7
NL2Repo	27.2	34.6	20.5	29.4	15.5
ClawEval (avg)	63.1	69.8	65.4	68.7	48.5

This is the more useful story for local use. The 35B MoE is a genuine standout in its weight class — 64.2 on Terminal-Bench (Terminus-2) clears Qwen 3.6-35B (52.5) and even beats Qwen 3.5's own 397B (53.5) on that test, and it tops the field on SWE-Bench Verified, SWE-Bench Pro, NL2Repo, and ClawEval among same-size peers. The 9B is a respectable triage model: 43.1 on Terminal-Bench is roughly on par with Gemma 4-31B and far above same-size Qwen 3.5-9B (21.3), though it sits below the 35B on every row. The takeaway: the 35B MoE is the local model worth building a workflow around, the 9B is the fast private fallback, and neither will match the 397B (or a hosted frontier model) on the longest multi-file agentic tasks.

Wire Ornith 1.0 into a coding agent

The whole point of a self-scaffolding model is to drive an agent, and the path there is the OpenAI-compatible endpoint every runtime exposes — Ollama on :11434/v1, LM Studio on :1234/v1, vLLM on :8000/v1. Point any tool that accepts a custom OpenAI base URL at that address, select the Ornith model, and you have a local backend. Our guide to AI coding agents covers the broader landscape of harnesses; the Ornith-specific notes are below.

DeepReinforce positions Ornith as terminal-agent-native and lists out-of-the-box compatibility with Claude Code, OpenHands, OpenClaw, and Hermes Agent — and the benchmark footnotes back that up, since the official scores were produced inside exactly those harnesses (Claude Code 2.1.126 and OpenHands among them). A community run additionally wires the 35B into opencode as a custom OpenAI provider with apiKey: "EMPTY". Editors like Cline and Cursor accept a custom OpenAI base URL too, so a locally served Ornith can back their agent features — that is a generic capability of those tools, not an Ornith-branded integration, so treat it as "should work" rather than "officially supported."

Three things will make or break the experience, and all three come from Ornith being a Qwen-derived reasoning model:

Separate the reasoning from the answer. Ornith is a reasoning model that emits a <think> block before its answer (DeepReinforce says every response starts with one). Under vLLM, --reasoning-parser qwen3 peels it into a reasoning field; with Ollama or LM Studio the runtime handles the split, but your agent client still needs to expect a thinking phase and not treat it as the final output.
Make sure tool calls parse. Ornith uses Qwen-style tool calling. Under vLLM that means --enable-auto-tool-choice --tool-call-parser qwen3_xml so calls return as OpenAI tool_calls. If your harness sees tool calls arrive as plain text instead of structured calls, the parser is the first thing to check.
Give it room to think. Reasoning models burn tokens before they answer. In one measured run, a code-generation prompt capped at 2,048 output tokens spent the entire budget inside the reasoning block and never reached the code. For agent use, set max output to at least 4,096 (the model intro itself used ~1,300 reasoning tokens before a 448-token answer).

Pair the 35B MoE with a focused agent harness and a sane token budget and you get a private, MIT-licensed coding agent whose model inference stays on hardware you control — no code goes out over a third party's API, which for proprietary work is frequently the whole reason to run locally. (Your editor or agent harness may still send its own telemetry; only the model inference is guaranteed local.)

Troubleshooting

The rough edges in the first week clustered around a few predictable spots:

Out of memory / CUDA crashes on vLLM. Long context is the usual culprit. Add --kv-cache-dtype fp8 so the KV cache doesn't blow the budget at 256K, and back --gpu-memory-utilization down — a documented GB10 run crashed with cudaErrorIllegalInstruction at 0.85 during graph capture and only stabilized at 0.80. If you hit block_size ... must be <= max_num_batched_tokens, raise --max-num-batched-tokens to 4096. On Ollama or LM Studio, the equivalent fix is simpler: lower the context length.
Slow tokens/sec. Expectations matter here. On a GB10 / DGX Spark-class box, the 35B-FP8 measured ~38 tok/s at 262K context — respectable for a 35B-class model on that hardware, but not datacenter throughput. On consumer GPUs, prefer the 35B MoE over a dense model of similar footprint (only ~3B active per token), keep context modest, and use FP8 or a Q5_K_M GGUF rather than bf16.
Picking a quant. 9B: Q4 fits ~6 GB and is the entry point for 8 GB cards and 16 GB Macs; step up to Q5/Q6 if you have the room and want fewer mistakes (Q6_K is reported to behave well in agent loops). 35B: Q5_K_M (~25 GB) is the sweet spot on a 24 GB+ GPU or 32 GB+ Mac; FP8 is the better choice under vLLM. Reach for bf16 only with memory to burn.
Tool calling is unreliable or comes back as text. This is the most common complaint, and it traces to Ornith's Qwen-style format. Under vLLM, set the parsers above. On Ollama, if the stock tags misbehave for tool/thinking use, a community-maintained re-export tuned for tool and thinking support in agent harnesses has circulated (search the Ollama library for community Ornith ports) — useful if the official tags don't slot cleanly into your harness.
It says it's "Qwen" when asked who it is. Not a bug. Because Ornith is post-trained on Qwen, it can name its base model when asked to introduce itself. Don't use "what model are you?" as an identity check — confirm by repo id and tag instead.

The bottom line

Ornith 1.0 is worth your time for two reasons: the self-scaffolding training is a real idea rather than a marketing line, and the local variants are genuinely runnable with published numbers to back them. Start with ollama run ornith:9b to kick the tires, move to ollama run ornith:35b if you have a 24 GB card or a roomy Mac, and reach for vLLM with the Hugging Face FP8 weights when you need long context, tool calling, or the 397B flagship. Keep the benchmark story straight — it edges Opus 4.7 on the marquee tests, not Opus 4.8 or GLM-5.2-744B — and remember the 31B Dense isn't downloadable yet. For teams that want a private, MIT-licensed model on hardware they control, that is a strong, honest place to land; if you would rather have engineers who already work this way build it into your stack, Codersera can help you extend your team with vetted remote developers.

FAQ

Is Ornith 1.0 free to use commercially?

Yes. DeepReinforce released Ornith 1.0 under the MIT license, which permits commercial use, modification, fine-tuning, and redistribution — you can build it into a product without a separate agreement. As always, read the license text in the Hugging Face repo for the authoritative terms.

What is the smallest GPU that can run Ornith 1.0?

The 9B at Q4 needs roughly 6 GB, so an 8 GB GPU or any Apple Silicon Mac with 16 GB of unified memory runs it comfortably; the Ollama 9B tag is a 5.6 GB download. For the 35B MoE, plan on about 25 GB (Q5_K_M), which suits a 24 GB card or a 32 GB+ Mac.

Can I download and run the 31B Dense variant?

Not yet. The 31B Dense (the only Gemma 4-based variant) is announced in the lineup, but as of publish time there is no public checkpoint for it on Hugging Face — the collection ships the 9B, 35B, and 397B in bf16, GGUF, and FP8. Wait for DeepReinforce to publish the 31B before building around it.

Does Ornith 1.0 beat Claude Opus 4.8?

No. The 397B flagship scores 77.5 on Terminal-Bench 2.1 (Terminus-2) and 82.4 on SWE-Bench Verified, which tops Claude Opus 4.7 on both — but it trails Claude Opus 4.8 across the board and trails GLM-5.2-744B on Terminal-Bench. It is a strong open-weights model that sits just behind the current frontier rather than leading it.

Why does Ornith need special vLLM flags for tool calling?

Ornith is post-trained on Qwen 3.5, so it uses Qwen-style tool calling and a <think> reasoning block. Under vLLM you set --tool-call-parser qwen3_xml so function calls return as structured OpenAI tool_calls, and --reasoning-parser qwen3 so the thinking is split into a separate field instead of polluting the answer. Without them, agent harnesses often see tool calls as plain text.

Which Ornith variant is best for a Mac or a 24 GB GPU?

The 35B MoE. Because only ~3B parameters are active per token, it runs close to a small dense model's speed while drawing on far more total capacity, and at Q5_K_M it fits in roughly 25 GB. Run it with ollama run ornith:35b, or load deepreinforce-ai/Ornith-1.0-35B-GGUF in LM Studio. It also posts the strongest same-size benchmark numbers of any local Ornith variant.