Quick answer. A self-hosted AI coding agent is Cline or Continue.dev pointed at http://localhost:11434, where Ollama serves an open coding model: Qwen3-Coder 30B or Devstral Small 2 on a 24GB GPU, or a Kimi-K2.6-class model on a bigger box. It runs offline at zero per-token cost. Honest caveat: community reports put a good local model at roughly 70–85% of cloud Claude on everyday single-file work, with a wider gap on multi-file reasoning.
A self-hosted AI coding agent is no longer a toy in 2026. The runtime (Ollama), the IDE agents (Cline, Continue.dev), and the open coding models (the Qwen3-Coder, Devstral, DeepSeek-V4, and Kimi-K2.6 families) have all matured enough that a single 24GB GPU runs a credible autonomous coding loop with your source never leaving the machine. The two reasons teams do this are unchanged and compelling: privacy (proprietary code never transits a third-party API) and zero marginal cost (no per-token bill, no rate limits, no monthly cap to defend at the next review).
This guide is command- and config-heavy on purpose. It gives you the exact stack, the install and config for both Cline and Continue.dev, a hardware/VRAM table per model and quant, and — most importantly — an honest quality reality check against cloud Claude and Composer so you self-host where it pays and don't where it doesn't.
What is a self-hosted AI coding agent?
It is three layers, all running on your own hardware:
- The model runtime: Ollama. Manages model downloads, quantization, and serving over a local HTTP API on port 11434. No API key, no account, no network egress.
- The open coding model. An open-weight model (Apache 2.0 / MIT / Modified-MIT licensed) pulled into Ollama: Qwen3-Coder 30B, Devstral Small 2, or a Kimi-K2.6 / DeepSeek-V4-class model if you have the VRAM for it.
- The agent in your IDE. Cline (autonomous VS Code agent that plans, edits files, runs commands) or Continue.dev (chat + edit + autocomplete across VS Code, JetBrains, Neovim). Both speak to Ollama's local endpoint directly.
The contrast with cloud agents (Claude Code, Cursor Composer) is straightforward: cloud sends your code and prompts to a vendor API for the strongest models at a per-token price; self-host keeps everything local at the cost of a measurable quality step-down and the responsibility of running the infrastructure yourself.
Why should you self-host a coding agent at all?
Two reasons carry real weight; the rest are nice-to-haves.
- Privacy and data control. For regulated codebases, client work under strict NDAs, or anything where "our source code went to a third-party LLM API" is a sentence you cannot say to legal, local is the only answer that survives the conversation. Ollama runs entirely on-machine with no telemetry of prompt content; the model never sees the internet.
- Zero per-token cost. After the hardware spend, inference is free. No metered API, no surprise four-figure month from a runaway agent loop, no per-developer cap to administer. For high-volume, lower-stakes work (boilerplate, scaffolding, test stubs, routine refactors), the economics are decisive.
- Offline capability. The whole stack works with no internet connection — useful on locked-down networks, air-gapped environments, or just a bad-wifi flight.
- No rate limits or vendor lock-in. You own the model weights. A model that ships today still runs identically in two years; a deprecated cloud model does not.
What you trade away is raw capability on the hardest tasks and the operational simplicity of someone else running the GPUs. The rest of this guide is about making that trade with eyes open.
What is the recommended self-host stack?
For a 24GB GPU (RTX 4090, RTX 3090, or equivalent), the pragmatic default in 2026:
| Layer | Choice | Why |
|---|---|---|
| Runtime | Ollama (latest) | Simplest local serving; handles quant + KV cache; one HTTP endpoint |
| Primary model | qwen3-coder:30b (Q4_K_M) | MoE: 30B total / ~3.3B active per token — big-model quality, small-model speed; ~17–19GB at Q4 (per Ollama / Unsloth) |
| Dense alternative | Devstral Small 2 (24B) or Qwen3.6-27B | ~68% / ~77% SWE-bench Verified respectively (vendor-reported: Mistral / Qwen); fits 24GB |
| Autocomplete model | qwen2.5-coder:1.5b | Tiny, fast, code-trained — right size for tab completion |
| Agent (autonomous) | Cline (VS Code) | Plans, edits files, runs commands — the closest local analogue to Claude Code |
| Agent (chat/edit/complete) | Continue.dev | Multi-IDE; clean role separation; better for assisted editing than full autonomy |
If you have more than 24GB (a 48GB card, dual GPUs, or a 64GB+ unified-memory Mac), the upgrade path is a larger or higher-quant model from the Kimi-K2.6 / DeepSeek-V4 / Qwen3-Coder-Next family — these are the open models community benchmarks report as closest to frontier closed models in 2026, but they need substantially more memory than a single 24GB card provides at usable quants.
How do you install Ollama and pull a coding model?
Install Ollama, pull the models, confirm it serves:
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Pull the primary agent model and a small autocomplete model
ollama pull qwen3-coder:30b
ollama pull qwen2.5-coder:1.5b
# Confirm the local API is up (default port 11434)
curl http://localhost:11434/api/tags
Now the single most important step, and the one nearly every first-time setup gets wrong: Ollama's default context window is far too small for an agent. Ollama defaults a model's context to a low value (commonly 2048–4096 tokens, depending on the Ollama version and the model's baked-in parameters). An autonomous agent like Cline blows past that within the first few tool calls; the symptom is the agent silently failing or looping partway through a task. Raise it with a Modelfile, because a Modelfile PARAMETER num_ctx takes precedence over environment variables and over the model's baked-in default:
# Modelfile
FROM qwen3-coder:30b
PARAMETER num_ctx 65536
# Build a custom tag with the larger context
ollama create qwen3-coder-agent -f Modelfile
# Use qwen3-coder-agent as the model name in Cline/Continue from here on
A 64K window is widely reported by the agent-tooling community as the single most impactful reliability fix for tool-calling agents on Ollama. The tradeoff is real and linear: the KV cache grows with context length, so doubling num_ctx roughly doubles the KV-cache VRAM on top of the model weights. Size the window to what fits alongside the weights on your card (see the VRAM table below).
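The linear relationship between num_ctx and KV-cache memory can be sketched with back-of-the-envelope arithmetic. This is a minimal Python sketch, not Ollama's own accounting: it uses the usual estimate for a grouped-query-attention transformer with an unquantized 16-bit cache, and the layer count, KV-head count, and head dimension are illustrative assumptions for a 30B-class MoE model, not figures from this guide. Check the model card before relying on them.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a single sequence.

    Every layer stores one key and one value vector per token (the factor 2),
    each of size n_kv_heads * head_dim, at bytes_per_value bytes per element.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# Assumed, illustrative architecture values (not taken from the guide).
ASSUMED_ARCH = {"n_layers": 48, "n_kv_heads": 4, "head_dim": 128}

if __name__ == "__main__":
    for num_ctx in (4096, 32768, 65536):
        gib = kv_cache_bytes(num_ctx, **ASSUMED_ARCH) / 2**30
        print(f"num_ctx={num_ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions a 64K window costs about 6 GiB on top of the weights, and a 32K window half of that, which is why the window has to be sized against the card rather than simply maximized.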
How do you configure Cline for fully local coding?
Cline is the autonomous option — it plans, edits files, and runs commands, which is the closest local experience to Claude Code. Setup:
- Install the Cline extension from the VS Code marketplace.
- Open the Cline panel → settings gear (top-right).
- Set API Provider to Ollama.
- Set Base URL to http://localhost:11434.
- Set the Model to your custom tag, e.g. qwen3-coder-agent (the one with num_ctx 65536, not the raw qwen3-coder:30b).
- In Settings → Features, enable Use Compact Prompt. Cline's full system prompt is large; the compact prompt materially reduces the per-turn token load, which matters far more on a local model than on a cloud one.
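Before pointing Cline at the endpoint, it can help to confirm that the custom tag actually exists on the server. The following is a small Python sketch under stated assumptions: it calls Ollama's /api/tags endpoint (the same one the curl check uses) and assumes the response is a JSON object with a "models" list whose entries carry a "name" field. The helper names are hypothetical and exist only for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same base URL as in the Cline settings


def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True if `wanted` is listed, with or without an explicit ':latest' tag."""
    names = model_names(tags_payload)
    return wanted in names or f"{wanted}:latest" in names


def fetch_tags(base_url=OLLAMA_URL, timeout=5):
    """GET /api/tags from a running Ollama server and parse the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        tags = fetch_tags()
    except (OSError, ValueError) as exc:  # server down, timeout, or non-JSON reply
        print(f"Ollama not reachable at {OLLAMA_URL}: {exc}")
    else:
        if has_model(tags, "qwen3-coder-agent"):
            print("qwen3-coder-agent is available")
        else:
            print("qwen3-coder-agent is missing; run `ollama create` first")
```

If the tag is missing, Cline will either fail to connect or fall back to whatever model name was typed in, so this check is cheapest to run once right after ollama create.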
Operational discipline matters more locally than in the cloud. Keep tasks tightly scoped, and start a fresh Cline task whenever context grows large rather than letting one session accumulate — a local model degrades faster than a frontier model as the window fills, and a bloated window also costs you VRAM you would rather spend on the weights.
How do you configure Continue.dev for local models?
Continue.dev is the better fit when you want assisted editing and autocomplete rather than full autonomy, and it cleanly separates models by role. Install the Continue extension, then use this config.yaml:
name: Local Config
version: 0.0.1
schema: v1
models:
  - name: Qwen3 Coder 30B
    provider: ollama
    model: qwen3-coder-agent
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply
    capabilities:
      - tool_use
  - name: Qwen2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
The roles array is the whole point: a model tagged autocomplete is used only for tab completion, while one tagged chat, edit, apply is used for chat sessions, the Edit feature, and applying diffs. Splitting a tiny model onto autocomplete keeps keystroke latency low while the heavy model handles reasoning. A common production pattern is to keep autocomplete and routine edits fully local and route only the hardest agent tasks to a cloud model; Continue's role routing makes that a config change, not a workflow change.
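To make the role split concrete, here is a minimal Python sketch of the lookup the roles array implies. It is an illustration only, not how Continue resolves models internally: the config is mirrored as a plain dict to avoid a YAML dependency, and models_for_role and uncovered_roles are hypothetical helper names.

```python
# Mirror of the config above as a plain dict, so no YAML parser is needed.
CONFIG = {
    "models": [
        {"name": "Qwen3 Coder 30B", "model": "qwen3-coder-agent",
         "roles": ["chat", "edit", "apply"]},
        {"name": "Qwen2.5 Coder 1.5B", "model": "qwen2.5-coder:1.5b",
         "roles": ["autocomplete"]},
    ]
}


def models_for_role(config, role):
    """Return the Ollama model tags whose roles list contains `role`."""
    return [m["model"] for m in config["models"] if role in m.get("roles", [])]


def uncovered_roles(config, required=("chat", "edit", "apply", "autocomplete")):
    """Return the required roles that no model in the config serves."""
    return [r for r in required if not models_for_role(config, r)]


if __name__ == "__main__":
    for role in ("chat", "edit", "apply", "autocomplete"):
        print(role, "->", models_for_role(CONFIG, role))
    print("uncovered:", uncovered_roles(CONFIG))
```

A check like uncovered_roles is a cheap way to catch a config where, for example, the autocomplete entry was removed while moving a role to a cloud model.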
What hardware and VRAM do you need per model?
The two consumers of VRAM are the model weights (fixed once you pick a model + quant) and the KV cache (grows linearly with num_ctx). Budget for both. Figures below are community/vendor-reported approximate weight footprints — add headroom for the KV cache at your chosen context length:
| Model | Type | Quant | Approx. weights VRAM | Fits 24GB? |
|---|---|---|---|---|
| qwen2.5-coder:1.5b | Dense | Q4_K_M | ~1–2 GB | Yes (autocomplete) |
| qwen2.5-coder:7b | Dense | Q4_K_M | ~5 GB | Yes |
| Devstral Small 2 (24B) | Dense | Q4_K_M | ~14–16 GB | Yes (tight w/ large ctx) |
| Qwen3.6-27B | Dense | Q4_K_M | ~18–22 GB | Yes (small ctx only) |
| qwen3-coder:30b | MoE (~3.3B active) | Q4_K_M | ~17–19 GB | Yes (recommended) |
| Kimi-K2.6 / DeepSeek-V4 class | Large MoE | Q4 | Far >24 GB | No — needs 48GB+ / multi-GPU / big unified mem |
Practical reading of this table: on a single 24GB card the MoE qwen3-coder:30b at Q4_K_M is the sweet spot. It gives you 30B-class quality at roughly the per-token speed of a ~3.3B dense model, because only ~3.3B parameters are active per token. The memory cost is still that of the full 30B weights (the ~17–19GB in the table), but that leaves room for a usable context window. Dense 27B+ models technically fit but leave little KV-cache headroom, so you end up choosing between model size and context length. The frontier-class open models (Kimi-K2.6, DeepSeek-V4) are genuinely strong but are not a single-24GB-card story at usable quants.
Two quant rules of thumb: Q4_K_M is the standard quality/size balance and what most local setups should default to; going below Q4 (Q3, Q2) saves memory but the code-quality drop is steep and usually not worth it for an agent that needs to produce correct code.
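The "budget for both" advice reduces to simple arithmetic, sketched here in Python. The inputs are assumptions rather than measurements: roughly 4.85 effective bits per weight is a commonly quoted ballpark for Q4_K_M, the 1 GB overhead allowance is a placeholder for runtime buffers, and the KV-cache sizes are hypothetical values you would replace with your own estimate for the chosen num_ctx.

```python
def approx_weights_gb(n_params, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(vram_gb, weights_gb, kv_cache_gb, overhead_gb=1.0):
    """True if weights + KV cache + a fixed overhead allowance fit in VRAM."""
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb


# Assumption: ~4.85 effective bits per weight for Q4_K_M.
Q4_K_M_BITS = 4.85

if __name__ == "__main__":
    weights = approx_weights_gb(30e9, Q4_K_M_BITS)  # 30B total parameters
    print(f"~{weights:.1f} GB of weights")
    for kv_gb in (2.0, 4.0, 8.0):  # hypothetical KV-cache budgets
        print(f"KV cache {kv_gb} GB fits on 24 GB: {fits(24, weights, kv_gb)}")
```

With these inputs the weights come to about 18.2 GB, in line with the ~17–19 GB row in the table, and the remaining headroom on a 24GB card is what caps the context window.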
Companion guide
For the full landscape of agents — cloud and local, autonomous and assisted — and how to pick one for your team, see our AI coding agents complete guide for 2026.
How good is a local coding agent versus cloud Claude?
This is the section that decides whether self-hosting is the right call, so it gets no spin. The honest picture from 2026 community benchmarks:
- The open/closed gap has narrowed dramatically but not vanished. On SWE-bench Verified, frontier closed models (GPT-5.5, Claude Opus 4.7) sit around 82%, and the strongest open models (Kimi-K2.6, Qwen3.6-class) are reported within striking distance — but those are the giant models, not what fits on one 24GB card.
- What actually fits on 24GB is meaningfully behind frontier on hard tasks. Community benchmarking of a strong local 32B-class model on consumer hardware reports it landing roughly within 85–90% of cloud Claude Sonnet on straightforward single-function generation and code explanation, while complex multi-file reasoning and subtle bug detection still clearly favor cloud Claude.
- The practical split most teams report is that a good local model on a consumer GPU handles roughly 70–80% of daily coding prompts at a quality the developer is happy with — boilerplate, scaffolding, routine refactors, single-file logic — while the remaining 20–30% (cross-cutting refactors, architecture, gnarly debugging) still go to a cloud model.
Against Cursor Composer / Claude Code specifically: the gap is widest exactly where those tools are strongest — long-horizon autonomous multi-file work with large context. A local 24GB agent is a good assistant and a competent autonomous worker on bounded tasks; it is not a drop-in replacement for a frontier cloud agent on a sprawling, ambiguous, whole-repo task. Treat the "~70–85% of cloud quality" framing as a community-reported directional figure, not a guarantee — it varies by language, task type, and how disciplined you are about context.
When is self-hosting worth it, and when is it not?
Self-host when:
- Code privacy is non-negotiable (regulated industry, strict NDA client work, air-gapped network).
- You run high volume of routine work (scaffolding, tests, boilerplate, single-file refactors) where the per-token cloud bill is the dominant cost.
- You already own a 24GB+ GPU — the marginal cost of inference is then genuinely zero.
- You want a no-rate-limit, offline-capable assistant and can accept the quality step-down on hard tasks.
Don't self-host (or go hybrid) when:
- Your work is dominated by hard, ambiguous, multi-file or whole-repo tasks — the frontier-model gap is exactly where it costs you the most.
- You don't already have the hardware and your usage wouldn't amortize a GPU purchase against cloud spend.
- You need the absolute best result on a task and the cost is irrelevant relative to engineer time saved — a frontier cloud agent still wins outright there.
The pragmatic answer for most teams is hybrid: local for the 70–80% of routine, privacy-sensitive, high-volume work, cloud for the hard 20–30%. Continue.dev's role routing makes this a configuration decision rather than a tooling rewrite. For the broader question of self-hosting open models well — serving, quantization, throughput, and the operational reality of running model infrastructure — see our self-hosting LLMs complete guide for 2026.
FAQ
Can a 24GB GPU really run a useful coding agent?
Yes, with realistic expectations. An MoE model like qwen3-coder:30b at Q4_K_M (~17–19GB weights) leaves room for a usable context window on a single 24GB card and gives you 30B-class quality at the speed of a ~3.3B model. It is a strong assistant and a competent autonomous worker on bounded, single-file tasks. It is not a frontier-cloud replacement on sprawling multi-file work — community benchmarks consistently show that gap.
Why does my local agent fail or loop partway through a task?
Almost always the Ollama context window. Ollama defaults models to a small num_ctx (commonly 2048–4096), and an autonomous agent exceeds that within a few tool calls, after which it silently fails or loops. Build a custom Modelfile with PARAMETER num_ctx 65536 and ollama create a new tag — the Modelfile value takes precedence over env vars and the model's baked-in default. This is the single most impactful reliability fix.
Should I use Cline or Continue.dev?
Use Cline if you want autonomous, agentic behavior — it plans, edits files, and runs commands, the closest local experience to Claude Code. Use Continue.dev if you want assisted chat/edit plus fast autocomplete with clean role separation across VS Code, JetBrains, and Neovim, and especially if you want to route some roles to cloud and keep the rest local. Many teams run both: Continue for inline work, Cline for bounded autonomous tasks.
What quant should I pick for local coding?
Default to Q4_K_M — it is the standard quality/size balance and what most local coding setups should use. Going below Q4 (Q3, Q2) saves VRAM but the code-correctness drop is steep and rarely worth it for an agent expected to produce working code. Spend any extra VRAM on a larger context window or a bigger model before you spend it on a higher quant of a smaller one.
Is a local model really as good as cloud Claude?
Not on the hardest tasks. Community benchmarks put a strong local 32B-class model at roughly 85–90% of cloud Claude Sonnet on straightforward single-function work and well behind on complex multi-file reasoning and subtle bug detection. Treat "~70–85% of cloud quality on everyday work" as a directional, community-reported figure that varies by language, task, and context discipline — not a benchmark guarantee.
Does self-hosting actually save money?
If you already own a 24GB+ GPU, yes — marginal inference cost is zero, with no per-token bill, rate limits, or per-developer caps to administer. If you'd have to buy the hardware, run the amortization against your actual cloud spend and usage volume; for low usage, cloud is often cheaper than a GPU purchase. The strongest non-cost argument is privacy, which no amount of cloud spend buys back.
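The amortization check is a one-line calculation, shown here as a minimal Python sketch. All numbers in the example are hypothetical placeholders, and the optional monthly running cost (electricity, for instance) is an added assumption so the function can be used conservatively; leave it at zero to match the "marginal cost is zero" framing.

```python
import math


def breakeven_months(hardware_cost, monthly_cloud_spend, monthly_running_cost=0.0):
    """Whole months until the hardware pays for itself, or None if it never does."""
    monthly_saving = monthly_cloud_spend - monthly_running_cost
    if monthly_saving <= 0:
        return None  # local running costs eat the entire cloud saving
    return math.ceil(hardware_cost / monthly_saving)


if __name__ == "__main__":
    # Hypothetical inputs: 2000 for the GPU, 150/month of displaced cloud
    # spend, 20/month of running cost (same currency throughout).
    print(breakeven_months(2000, 150, 20))
```

If the result is longer than you expect to keep the hardware, or the function returns None, cloud is the cheaper option on cost alone and the decision falls back to privacy.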
Can I run this fully offline?
Yes. Once Ollama and the model weights are pulled, the entire stack — Ollama, the model, Cline or Continue.dev — runs with no internet connection. Your code never leaves the machine and there is no external dependency at inference time, which is exactly why it works on air-gapped or locked-down networks.