OmniCoder 9B

OmniCoder 9B: Benchmarks, GGUF Quants, and Local Setup Guide (2026)

What OmniCoder 9B is, its lineage and license, vendor-reported benchmarks, the full GGUF quant table, and step-by-step Ollama and llama.cpp setup.

Published 18 May 2026 • Updated 15 Jun 2026 • 11 min read

Quick answer. OmniCoder 9B is an open-source (Apache 2.0) coding-agent model from Tesslate, fine-tuned on Qwen3.5-9B with 425K+ agentic coding trajectories. It ships official GGUF quants from 3.83 GB (Q2_K) to 17.9 GB (BF16), runs locally via Ollama or llama.cpp, and targets autonomous tool-use and terminal coding rather than chat.

OmniCoder 9B is one of the more interesting small coding models to land in 2026: a 9-billion-parameter model built specifically to behave like a coding agent — recovering from errors, reading before writing, emitting edit diffs instead of full-file rewrites — at a size that runs on a single consumer GPU or a recent Apple Silicon laptop. This guide covers what it actually is, where it comes from, its benchmark numbers (all vendor-reported, labelled as such), the complete GGUF quant table with real file sizes, and copy-pasteable local setup for both Ollama and llama.cpp.

If you only want to deploy it as fast as possible, skip to the setup section. If you are deciding whether it belongs in your stack, read the lineage and comparison sections first — a 9B model's value depends heavily on what you are comparing it against.

What is OmniCoder 9B?

OmniCoder 9B is a coding-agent model published by Tesslate on Hugging Face (Tesslate/OmniCoder-9B). It is a 9-billion-parameter model released under the Apache 2.0 license, which means you can run it commercially, self-host it, and redistribute fine-tunes without a separate commercial agreement.

The key framing: this is not a general chat model that happens to code. Per the model card, it was fine-tuned specifically on agentic coding behaviour — multi-step tool use, terminal operations, and the kind of read-diagnose-edit loops that an autonomous coding agent runs. Tesslate reports it learns patterns like reading a file before writing to it, responding to LSP diagnostics, and producing scoped edit diffs rather than rewriting whole files.

Maker: Tesslate
Parameters: 9B
Base model: Qwen3.5-9B
License: Apache 2.0
Native context: 262,144 tokens (vendor states it is extensible to 1M+)
Primary use: agentic / tool-calling coding, not open-ended chat
Distribution: full-precision weights plus an official GGUF repo for local inference

What is OmniCoder 9B built on?

OmniCoder 9B is a fine-tune of Qwen3.5-9B. It inherits Qwen3.5's hybrid architecture — Gated Delta Networks interleaved with standard attention — which is what gives the 9B model its long native context and reasonable long-context inference cost. It also inherits Qwen3.5's optional thinking mode (reasoning emitted inside <think>...</think> spans).

According to Tesslate's model card, the fine-tune used LoRA SFT (rank 64, alpha 32) on roughly 425,000+ curated agentic coding trajectories, trained in bf16 on 4×NVIDIA H200 with the Axolotl framework. The training trajectories were built from agentic and coding reasoning traces — the card names Claude Opus 4.6, GPT-5.4, GPT-5.3-Codex, and Gemini 3.1 Pro as sources of successful trajectories, and targets scaffolding patterns from Claude Code, OpenCode, Codex, and Droid. Treat these as Tesslate's stated methodology, not independently audited claims.

The practical takeaway: OmniCoder 9B's ceiling is roughly Qwen3.5-9B's general capability, with the delta being agentic behaviour and tool-use formatting. If you want a baseline for what the un-fine-tuned lineage can already do as a coding agent, our walkthrough of running and benchmarking Qwen3.5 as a free local coding agent is the closest reference point. It is not a from-scratch model and should not be expected to beat a much larger model on raw reasoning — its pitch is competence-per-byte for autonomous coding loops.

How does OmniCoder 9B perform on benchmarks?

All numbers below are vendor-reported by Tesslate on the official model card. They have not been independently reproduced here, and small-model benchmark scores are notoriously sensitive to prompt format and sampling settings. Use them as directional, not definitive.

Benchmark	Metric	OmniCoder 9B	Qwen3.5-9B base (per card)	Delta
AIME 2025	pass@5	90% (27/30)	not stated	—
GPQA Diamond	pass@1	83.8% (166/198)	81.7%	+2.1 pts
GPQA Diamond	pass@3	86.4% (171/198)	not stated	—
Terminal-Bench 2.0	pass rate	23.6% (21/89)	14.6%	+8.99 pts (+61%)

How to read this honestly:

Terminal-Bench 2.0 is the relevant signal. It measures the thing OmniCoder 9B is built for — completing real terminal/coding tasks autonomously. A jump from 14.6% to 23.6% over the base model is a meaningful relative gain from fine-tuning, but 23.6% in absolute terms is still a model that fails most hard agentic tasks. That is expected at 9B; calibrate accordingly.
GPQA and AIME measure reasoning/math, not coding. The small GPQA lift and high AIME pass@5 mostly tell you the fine-tune did not damage the inherited Qwen3.5 reasoning ability. They are not coding-quality evidence.
No SWE-bench Verified number is published on the card at the time of writing. Anyone citing a SWE-bench figure for this model should be treated as unverified until Tesslate publishes one.

Which GGUF quant should you download?

Tesslate publishes an official quantized repo at Tesslate/OmniCoder-9B-GGUF. The table below lists every file in that repo with its exact size as shown in the Hugging Face file listing. The "approx. VRAM/RAM" column is an engineering estimate (model file size plus headroom for KV cache and runtime overhead at a modest context), not a number from the model card.

Quant	File size	Approx. VRAM/RAM to run comfortably	Notes
Q2_K	3.83 GB	~6 GB	Smallest; noticeable quality loss, last resort
Q3_K_S	4.26 GB	~6–7 GB	Tight footprint
Q3_K_M	4.62 GB	~7 GB	Balanced low-end
Q3_K_L	4.93 GB	~7 GB	Slightly better than Q3_K_M
Q4_0	5.31 GB	~8 GB	Legacy 4-bit
Q4_K_S	5.35 GB	~8 GB	Good balance
Q4_K_M	5.74 GB	~8 GB	Recommended default for most users
Q5_0	6.31 GB	~9 GB	Higher quality
Q5_K_S	6.31 GB	~9 GB	Higher quality
Q5_K_M	6.52 GB	~9–10 GB	High quality, balanced
Q6_K	7.36 GB	~10–11 GB	Near-lossless
Q8_0	9.53 GB	~12–13 GB	Highest-quality quantization
BF16	17.9 GB	~20–24 GB	Full precision; use the GPU weights instead unless you need GGUF tooling

Practical guidance:

8 GB GPU / 16 GB Mac: use Q4_K_M. It is the recommended default and the best quality-per-byte point for a 9B model.
12 GB+ GPU / 24 GB+ Mac: step up to Q6_K or Q8_0 if you are using the model for real agentic work where quantization artefacts compound across many tool-call turns.
Under 8 GB: Q3_K_M will run, but on a model this small the quality drop at 3-bit is real. Prefer Q4_K_M with CPU/RAM offload over a 3-bit fit if you can tolerate the slower tokens/sec.
Skip Q2_K unless you have no alternative — at 9B, 2-bit degrades agentic reliability enough to undermine the model's whole purpose.

How do you run OmniCoder 9B with Ollama?

Ollama can pull GGUF directly from a Hugging Face repo, so you do not need a community mirror. Use the official Tesslate GGUF repo and pick a quant tag.

# Pull and run the recommended Q4_K_M quant from the official repo
ollama run hf.co/Tesslate/OmniCoder-9B-GGUF:Q4_K_M

# Higher quality if you have the VRAM
ollama run hf.co/Tesslate/OmniCoder-9B-GGUF:Q8_0

For agentic / tool-calling use, Tesslate recommends a lower sampling temperature for more deterministic behaviour. You can pin that with an inline modelfile:

# Create a tuned variant for agent use
cat > Modelfile <<'EOF'
FROM hf.co/Tesslate/OmniCoder-9B-GGUF:Q4_K_M
PARAMETER temperature 0.3
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER num_ctx 32768
EOF

ollama create omnicoder-agent -f Modelfile
ollama run omnicoder-agent

A few notes. The card's recommended general settings are temperature 0.6 / top-p 0.95 / top-k 20; for agentic/tool-calling tasks it suggests lowering temperature to roughly 0.2–0.4. The native context is 262,144 tokens, but set num_ctx to what your hardware can actually hold in KV cache — 32K is a sensible starting point on an 8 GB GPU; raising it raises memory use sharply.

How do you run OmniCoder 9B with llama.cpp?

llama.cpp can also fetch directly from the Hugging Face repo with --hf-repo and --hf-file, which is the cleanest path and matches the official model-card instructions.

# Interactive chat (downloads the quant on first run)
llama-cli \
  --hf-repo Tesslate/OmniCoder-9B-GGUF \
  --hf-file omnicoder-9b-q4_k_m.gguf \
  -p "Refactor this function and explain the change." \
  -c 32768

# OpenAI-compatible server (point Continue/Aider/your harness at it)
llama-server \
  --hf-repo Tesslate/OmniCoder-9B-GGUF \
  --hf-file omnicoder-9b-q4_k_m.gguf \
  -c 32768 \
  --port 8080

With llama-server running, you have an OpenAI-compatible endpoint at http://localhost:8080/v1 that any agent harness expecting the OpenAI API can target — if you want a ready-made local agent loop around it, our self-hosted AI coding agent setup with Ollama + Cline/Continue wires this exact pattern end to end. Add GPU offload flags (-ngl) to push layers onto the GPU; on Apple Silicon llama.cpp uses Metal automatically. For a deeper, end-to-end deployment walkthrough — including vLLM, hardware sizing, and an agent-loop test harness — see our dedicated install guide linked below, and our vLLM vs Ollama vs LM Studio production benchmark if you are choosing a serving stack for real traffic rather than a laptop demo.

Companion guide

For how OmniCoder 9B fits among the wider field of self-hostable models — sizes, licenses, and trade-offs — see our open-source LLMs landscape for 2026. For an end-to-end deployment walkthrough, see our OmniCoder 9B local install guide.

How does OmniCoder 9B compare to other small coding models?

At the 7–9B tier, the question is rarely "which is smartest" — they are all capped by size. The question is which behaviour you want. OmniCoder 9B's differentiator is that it was fine-tuned for autonomous agent loops, not chat or autocomplete.

Model	Size	License	Best at	Trade-off vs OmniCoder 9B
OmniCoder 9B	9B	Apache 2.0	Agentic loops, tool use, terminal tasks	—
Qwen3.5-9B (base)	9B	Apache 2.0	General reasoning + code chat	Stronger general chat; weaker agentic tool-use formatting (it is the base OmniCoder fine-tunes)
Qwen2.5-Coder 7B	7B	Apache 2.0	Autocomplete / tab-complete, small refactors	Better fill-in-the-middle; not built for multi-step agent loops
Yi-Coder 9B	9B	Apache 2.0	Single-shot code generation (strong HumanEval)	Higher single-shot code scores; not agent-tuned
DeepSeek-Coder-V2-Lite 16B (MoE)	16B (MoE)	Permissive	Broad code generation at ~3B speed	Larger footprint (~10 GB); strong general coding, not agent-loop-specialised

Decision shortcut:

Want a local model that drives an agent harness (tool calls, terminal, edit diffs)? OmniCoder 9B is purpose-built for exactly this.
Want the best IDE autocomplete at this size? Qwen2.5-Coder 7B is still the reference point.
Want the strongest single-shot code generation in a 9B Apache-2.0 model? Yi-Coder 9B's HumanEval reputation is the one to benchmark against.
Have ~16 GB and want broad coding range? DeepSeek-Coder-V2-Lite is worth a head-to-head.

Comparison caveat: cross-model claims above are based on each model's general reputation and published positioning, not a single controlled benchmark run on identical hardware. For a buying decision, benchmark the two or three finalists on your tasks.

What are the limitations of OmniCoder 9B?

It is a 9B model. Terminal-Bench 2.0 at 23.6% means it fails the majority of hard autonomous tasks. It is a capable assistant inside a well-scaffolded loop, not a drop-in replacement for a frontier hosted agent.
Non-English coverage is unevaluated. The model card explicitly states performance on non-English tasks has not been extensively evaluated.
Tool-calling is scaffolding-sensitive. Tesslate states the format is flexible but works best with the scaffolding patterns seen in training (Claude Code / OpenCode / Codex / Droid-style). Wildly different harness formats may underperform.
Benchmarks are vendor-reported. No independent third-party reproduction is cited here, and no SWE-bench Verified score is published. Validate on your own evals before relying on it.
Quantization compounds in agent loops. Low-bit quants that look fine for single-turn chat can degrade reliability across a long multi-turn tool-use trajectory. Prefer Q5/Q6/Q8 for serious agentic use if hardware allows.
Community Ollama mirrors vary. Several third-party Ollama uploads exist; prefer pulling from the official hf.co/Tesslate/OmniCoder-9B-GGUF repo so you know exactly which quant and revision you are running.

If you are evaluating local coding models to embed in a product or internal developer platform, the hard part is rarely picking the model — it is building the agent harness, evals, and guardrails around it so a 9B model performs reliably, on top of the operational realities our complete guide to self-hosting LLMs in 2026 covers. Codersera matches you with vetted remote developers experienced with local LLM deployment, GGUF/quantization tuning, and agentic tooling, with a risk-free trial so you can confirm technical fit before you commit.

Updated for June 2026: what to re-check before you deploy

This guide was reviewed in June 2026. The core facts below still hold against Tesslate's official model card and GGUF repo, but small coding models move fast — verify these three things against the live listings before you commit a download or quote a benchmark.

Confirm the exact GGUF filename and revision. Ollama and llama.cpp pull by tag/filename from hf.co/Tesslate/OmniCoder-9B-GGUF. Tesslate can re-upload quant revisions, so open the Hugging Face file listing and match the filename (e.g. omnicoder-9b-q4_k_m.gguf) and size before you script it into a deployment. Q4_K_M (5.74 GB) remains the recommended default for an 8 GB GPU or a 16 GB Apple Silicon Mac.
Benchmarks are still vendor-reported. As of this review, the AIME 2025, GPQA Diamond, and Terminal-Bench 2.0 (23.6%) figures are still Tesslate's own, and there is still no published SWE-bench Verified score on the card. If you see a third-party SWE-bench number circulating for this model, treat it as unverified until it lands on the official model card.
Re-test your harness format. OmniCoder 9B's tool-calling is scaffolding-sensitive. If you have upgraded your agent runner (Claude Code, OpenCode, Codex, or a custom loop) since first deploying, re-run a short agent-loop eval — a changed prompt format stacked on aggressive quantization is the most common cause of a sudden reliability drop on a model this small.

Nothing in the setup, quant table, or comparison sections above has been superseded as of June 2026; this section flags what to re-verify rather than replacing any of it.

FAQ

Is OmniCoder 9B free and open source?

Yes. OmniCoder 9B is released by Tesslate under the Apache 2.0 license, which permits commercial use, self-hosting, modification, and redistribution. There is no separate paid tier required to run the open weights locally; your only cost is the hardware you run it on.

What hardware do you need to run OmniCoder 9B?

The recommended Q4_K_M quant is a 5.74 GB file and runs comfortably on roughly an 8 GB GPU or a 16 GB Apple Silicon Mac at a modest context. Higher-quality Q6_K/Q8_0 quants want ~10–13 GB. You can run smaller 3-bit quants under 8 GB, but quality drops noticeably on a model this size.

Is OmniCoder 9B better than its Qwen3.5-9B base model?

For agentic coding, Tesslate's reported numbers say yes: +8.99 points on Terminal-Bench 2.0 (a 61% relative gain) and a small GPQA lift. For general chat the base model may feel comparable, since OmniCoder's gains are concentrated in tool-use and terminal behaviour rather than broad reasoning. These are vendor-reported figures.

Does OmniCoder 9B work with Ollama?

Yes. Pull it directly from the official GGUF repo with ollama run hf.co/Tesslate/OmniCoder-9B-GGUF:Q4_K_M. Community Ollama mirrors also exist, but using the official Hugging Face repo guarantees you know the exact quant and revision you are running.

What is OmniCoder 9B best used for?

It is purpose-built to act as a local coding agent inside a scaffolded loop — multi-step tool calls, terminal operations, read-before-write edits, and responding to LSP diagnostics. It is less suited to being a general chatbot or a pure IDE autocomplete engine, where other small models specialise.

Are OmniCoder 9B's benchmarks independently verified?

Not as cited here. The AIME 2025, GPQA Diamond, and Terminal-Bench 2.0 numbers come from Tesslate's official model card and are vendor-reported. No independent third-party reproduction and no SWE-bench Verified score are published at the time of writing, so validate on your own tasks before relying on them.

How large is the OmniCoder 9B context window?

The native context is 262,144 tokens, inherited from the Qwen3.5-9B hybrid architecture, and Tesslate states it is extensible to 1M+ tokens. In practice you should set the runtime context (num_ctx / -c) to what your hardware's KV cache can hold; 32K is a reasonable starting point on an 8 GB GPU.