Run Devstral 2 Locally with Ollama (April 2026 Guide)
Last updated April 2026 — refreshed for Devstral 2 (123B) and Devstral Small 2 (24B), Ollama 0.22.x, and current SWE-bench Verified numbers.
Devstral has gone from a single 24B checkpoint in May 2025 to a two-tier family by December 2025: Devstral 2 123B (modified MIT, 72.2% on SWE-bench Verified) and Devstral Small 2 24B (Apache 2.0, 68.0% on SWE-bench Verified). Both run locally on Ollama. This guide is the practical, no-fluff version: which model fits which machine, the exact ollama commands for the latest tags, the quantizations that actually fit in 24/32/48 GB of VRAM, and the troubleshooting steps that come up most often when you wire Devstral into Cline, Continue, or Mistral's own Vibe CLI.
What changed in 2026
- Devstral 2 (123B) shipped December 9, 2025 alongside Devstral Small 2 (24B) and the new Mistral Vibe CLI. The original "Devstral-Small-2505" checkpoint is now two generations behind.
- SWE-bench Verified jumped from 46.8% (Devstral Small 2505) → 53.6% (Small 2507) → 68.0% (Small 2 / 2512) and 72.2% (Devstral 2 123B). Small 2 punches above models 5× its size.
- Context window doubled to 256K tokens on the 2512 checkpoints (up from 128K on 2505/2507).
- License split: Devstral Small 2 is Apache 2.0; Devstral 2 (123B) ships under a modified MIT license. Read it before you bake it into a commercial product.
- Ollama tags changed. devstral still pulls the legacy 24B 2505 build. Use devstral-2 for the 123B and the 2512-suffixed community GGUFs (Unsloth, Bartowski) for the new 24B.
- Ollama 0.22.0 (April 28, 2026) is the current stable release; 0.21.x added the launch wizard that auto-wires Claude Code, Cursor, Continue, and GitHub Copilot CLI to local models.
TL;DR — which Devstral, which hardware
| Model | Params | SWE-bench Verified | License | Min hardware (Q4_K_M) | Ollama pull |
|---|---|---|---|---|---|
| Devstral Small 2 (2512) | 24B | 68.0% | Apache 2.0 | 1× RTX 4090 (24 GB) or Mac 32 GB | community GGUF via ollama pull hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF |
| Devstral 2 (2512) | 123B | 72.2% | Modified MIT | ~75 GB on disk; 4×24 GB VRAM or 1× H100 80 GB | ollama pull devstral-2 |
| Devstral Small 2507 (legacy) | 24B | 53.6% | Apache 2.0 | 1× RTX 4090 or Mac 32 GB | ollama pull devstral (still the default tag) |
If you have a single 24 GB consumer GPU or a 32 GB Mac, run Devstral Small 2 at Q4_K_M. If you have a workstation with multiple GPUs (4×24 GB or 1× H100), run Devstral 2 123B. Skip the legacy 2505/2507 builds unless you have a reproducibility reason — Small 2 is strictly better at the same size and license.
Why run Devstral locally at all
- Privacy. Repository contents, prompts, and chain-of-thought stay on the box. This is the only reason most teams care.
- Cost predictability. Devstral 2 via API is free during the preview window, but Devstral Medium 2507 was $0.40 / $2.00 per million input/output tokens — local inference has zero marginal cost after the GPU is bought.
- Latency. First-token latency on a 4090 with Q4_K_M is sub-200 ms; cloud APIs typically sit at 400–900 ms.
- Offline / air-gapped. Defense, finance, and regulated-health workflows can't ship code to a third-party endpoint.
- Tool-use for agents. Devstral was trained for agentic scaffolds (OpenHands, Cline, Vibe CLI) and supports native function calling — it's the model the open community actually uses for autonomous code-edit loops.
If you're comparing this to closed-source coding models before committing, our DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026) walks through where Devstral 2 sits against Claude 4.7 Sonnet, GPT-5.5, and DeepSeek V4 on the same SWE-bench harness.
Prerequisites
- Hardware:
  - Devstral Small 2 (24B) Q4_K_M: 24 GB VRAM (RTX 3090 / 4090 / 5090) or Apple Silicon with 32 GB unified memory.
  - Devstral Small 2 at full 256K context: ~35 GB VRAM with the Unsloth UD Q6_K_XL quant; needs a 48 GB card (RTX A6000, 6000 Ada) or a 64 GB+ Mac.
  - Devstral 2 (123B) Q4: ~75 GB on disk, 4×24 GB VRAM split or a single 80 GB H100 / 141 GB H200.
- Disk: 20 GB for Small 2 Q4, 50 GB for Small 2 Q8, ~75 GB for Devstral 2 123B Q4.
- OS: Linux (Ubuntu 22.04+ or Fedora 40+), macOS 14+, or Windows 11 with WSL2. Native Windows works but the Linux path is better tested.
- Ollama: 0.22.0 or newer. Older builds don't ship the launch wizard, OpenClaw web search, or the think reasoning-effort parameter.
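The hardware figures above follow from simple arithmetic: quantized weights take roughly parameter count × average bits per weight ÷ 8 bytes. The sketch below is a rough estimator, not an official sizing tool. It assumes average widths of about 4.85 bits for Q4_K_M and 8.5 bits for Q8_0, which are approximations because real GGUF files mix tensor types, and `weights_gb` is an illustrative name. It counts weights only; KV cache and runtime overhead come on top.

```python
# Rough weight-size estimate for a quantized model.
# ASSUMPTION: these average bits-per-weight values are approximations;
# real GGUF files mix tensor types, so actual file sizes differ slightly.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Return approximate weight size in decimal GB (weights only, no KV cache)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # (params_billion * 1e9 weights) * bits / 8 bytes, expressed in GB
    return params_billion * bits / 8


for name, params in [("Devstral Small 2", 24), ("Devstral 2", 123)]:
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"{name} {quant}: ~{weights_gb(params, quant):.1f} GB")
```

Under these assumptions the 24B model at Q4_K_M lands near the ~14 GB download and the 123B model near ~75 GB, in line with the figures quoted in this guide.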
Step 1 — Install Ollama 0.22.x
macOS (Homebrew):
brew install ollama
ollama --version # expect 0.22.x
Linux (official installer):
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama # the installer registers a system-wide service, not a user unit
ollama --version
Windows: download the installer from ollama.com/download. It sets Ollama up to run in the background and exposes the same CLI.
If you already had Ollama installed, upgrade in place — the 0.21 → 0.22 jump added batched sampling and the OpenClaw web-search plugin, both worth having.
Step 2 — Pull the right Devstral build
Devstral Small 2 (24B, recommended for one GPU)
The official devstral tag on ollama.com still points to the May-2025 (2505) build at the time of writing. To get the December 2025 (2512) Small 2 weights today, pull the community GGUF directly from Hugging Face — Ollama supports any GGUF on HF Hub:
# Q4_K_M — fits on a 24 GB card, ~14 GB on disk
ollama pull hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M
# Q8_0: better quality, needs ~26 GB VRAM, so it will not fit a 24 GB card; use a 48 GB card or a 64 GB Mac
ollama pull hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0
# Bartowski mirror (alternative)
ollama pull hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M
Run it:
ollama run hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M
Devstral 2 (123B, multi-GPU / H100 class)
ollama pull devstral-2 # 123B, ~75 GB
ollama run devstral-2
The 123B build supports a 256K context window. Cloud-only variant: devstral-2:123b-cloud runs the same weights on Ollama's hosted infrastructure, which is useful for benchmarking before you provision an H100 yourself.
Legacy 2505 / 2507 (only if you need exact reproducibility)
ollama pull devstral # 2505/2507 lineage, 24B, 14 GB
Step 3 — Quick functional test
Drop into the REPL and confirm the model responds:
ollama run hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M
>>> Write a Python function that flattens an arbitrarily nested list of integers, with a doctest.
For programmatic use, hit the local REST API:
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a senior software engineer."},
    {"role": "user", "content": "Refactor this Python loop into a list comprehension: ..."}
  ],
  "options": {"temperature": 0.15, "num_ctx": 32768}
}'
Devstral was trained at temperature=0.15 for agent runs, so keep it low. The full 256K context is available but expensive in VRAM; num_ctx: 32768 is a sane default for chat-style use.
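The same request can be issued from Python with only the standard library. This is a minimal sketch rather than a full client: it assumes Ollama is listening on its default local port and that the Small 2 tag from Step 2 has been pulled. `build_chat_payload` and `chat` are illustrative helper names, and `"stream": False` asks the server for a single JSON reply instead of a stream of chunks.

```python
import json
import urllib.request

# Default local endpoint and model tag, as used in the curl example.
OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M"


def build_chat_payload(prompt: str, model: str = MODEL, num_ctx: int = 32768) -> dict:
    """Build the same request body as the curl example, with streaming disabled."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a senior software engineer."},
            {"role": "user", "content": prompt},
        ],
        "options": {"temperature": 0.15, "num_ctx": num_ctx},
        "stream": False,  # one JSON object back instead of streamed chunks
    }


def chat(prompt: str) -> str:
    """POST the payload to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # Generous timeout: the first call also loads the model into memory.
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["message"]["content"]
```

With the server running, `print(chat("Refactor this Python loop into a list comprehension: ..."))` prints the model's reply. Keeping payload construction separate from the network call makes it easy to inspect or log the exact request an agent sends.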
Step 4 — Wire it into your editor
Cline (VS Code)
- Install the Cline extension.
- Settings → API Provider → Ollama. Base URL: http://localhost:11434. Model ID: the exact tag you pulled.
- Set tool-use mode to native function calling. Devstral 2 supports both Mistral function calling and XML formats.
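Before debugging tool use inside an editor, it can help to confirm that the model emits tool calls over the raw API. The sketch below builds a request for Ollama's /api/chat endpoint with one tool attached and parses the tool calls out of a reply. `read_file` is a hypothetical tool invented for illustration, the sample reply is hand-written rather than real model output, and whether a given community GGUF ships a tool-capable chat template is something to verify on your own setup.

```python
# Minimal tool-calling request for Ollama's /api/chat endpoint.
# ASSUMPTION: `read_file` is a hypothetical tool used only for illustration.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the repository.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative file path"}
            },
            "required": ["path"],
        },
    },
}


def build_tool_payload(model: str, prompt: str) -> dict:
    """Request body that offers the model one tool and disables streaming."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
        "options": {"temperature": 0.15},
        "stream": False,
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Pull (tool name, arguments) pairs out of a parsed /api/chat response."""
    calls = response.get("message", {}).get("tool_calls") or []
    return [(c["function"]["name"], c["function"]["arguments"]) for c in calls]


# Hand-written sample showing the shape of a reply that contains one tool call.
sample = {
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "read_file", "arguments": {"path": "setup.py"}}}
        ],
    }
}
print(extract_tool_calls(sample))  # [('read_file', {'path': 'setup.py'})]
```

If a real reply comes back with the call written into `content` as plain text instead of a `tool_calls` entry, the problem is likely in the model template or tool format rather than in the editor plugin.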
Continue (VS Code / JetBrains)
Add to ~/.continue/config.yaml:
models:
  - name: Devstral Small 2
    provider: ollama
    model: hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M
    defaultCompletionOptions:
      contextLength: 32768
      temperature: 0.15
Mistral Vibe CLI
Vibe is Mistral's first-party agent CLI for Devstral, released alongside Devstral 2. Point it at your local Ollama instance with VIBE_BASE_URL=http://localhost:11434/v1 and VIBE_MODEL=devstral-2 (or the Small 2 tag). The Agent Communication Protocol means it plugs into IDEs that already speak ACP.
Ollama 0.21+ launch wizard
If you want all of the above wired up in one shot, run ollama launch. The wizard added in 0.21.0 detects installed agents (Claude Code, Cursor, Continue, Hermes, GitHub Copilot CLI, Kimi CLI) and writes their config files for you against the local model you select.
Performance and benchmarks (verified, April 2026)
| Model | SWE-bench Verified | Source |
|---|---|---|
| Devstral 2 (123B, 2512) | 72.2% | Mistral AI announcement, Dec 9, 2025 |
| Devstral Small 2 (24B, 2512) | 68.0% | Mistral AI announcement, Dec 9, 2025 |
| Devstral Medium (2507) | 61.6% | Mistral AI devstral-2507 blog |
| Devstral Small 1.1 (2507) | 53.6% | Mistral AI devstral-2507 blog |
| Devstral Small (2505, original) | 46.8% | Original Devstral release, May 2025 |
Real-world throughput on consumer hardware (Q4_K_M, OpenHands scaffold, single 4090, our internal runs — your numbers will vary):
- Devstral Small 2 24B: ~35–45 tokens/sec generation, ~50 ms first-token latency at 8K context.
- Devstral Small 2 24B at 64K context: ~22–28 tokens/sec (KV cache dominates VRAM).
- Devstral 2 123B Q4 split across 4×24 GB: ~12–18 tokens/sec; on a single H100 80 GB: ~28–34 tokens/sec.
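To translate these rates into wait time per agent step, divide the expected output length by the tokens-per-second range. The snippet below is a back-of-the-envelope sketch: the 1,000-token step size is an assumption chosen for illustration, it ignores prompt processing time, and `seconds_for` is an illustrative name.

```python
def seconds_for(tokens: int, low_tps: float, high_tps: float) -> tuple[float, float]:
    """Return (best, worst) generation time in seconds for a tokens/sec range."""
    return tokens / high_tps, tokens / low_tps


# Throughput ranges quoted above; the 1,000-token step is an assumed workload.
SETUPS = {
    "Small 2, 8K context, 1x 4090": (35, 45),
    "Small 2, 64K context, 1x 4090": (22, 28),
    "Devstral 2, 4x24 GB": (12, 18),
    "Devstral 2, 1x H100 80 GB": (28, 34),
}

for label, (low, high) in SETUPS.items():
    best, worst = seconds_for(1000, low, high)
    print(f"{label}: {best:.0f}-{worst:.0f} s per 1,000 generated tokens")
```

On these numbers a long agent loop with dozens of steps is noticeably slower on the 4×24 GB split than on a single H100, which matters more for unattended runs than for interactive chat.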
Independent leaderboards (SWE-rebench, llm-stats.com, Epoch AI) list the 2512 checkpoints with broadly consistent numbers; Devstral 2 sits in the same "open-weight code agent" cluster as the latest GLM-4.6 coder and Qwen3-Coder builds, ahead of any sub-30B open model.
How to choose: a 30-second decision tree
- Single consumer GPU (16–24 GB) → Devstral Small 2 (24B) at Q4_K_M, context 16K–32K. Don't try the 123B; it'll either OOM or run at <5 tok/s on offload.
- 32 GB+ Apple Silicon → Devstral Small 2 at Q4_K_M or Q5_K_M. Use the MLX runner (improved in Ollama 0.21.1+); real-time on M3/M4 Max.
- 48 GB pro card or 64 GB Mac → Devstral Small 2 at Q8_0, full 256K context. This is the sweet spot for agent runs over large repos.
- Multi-GPU workstation or H100 → Devstral 2 (123B). Worth it for harder agentic tasks and longer plans.
- You need an unrestricted permissive license → Devstral Small 2 (Apache 2.0). Devstral 2's modified MIT has carve-outs; have legal read it before redistributing.
- You want closed-source-tier quality and don't care about local → API to Devstral Medium 2507 or Devstral 2, or compare against Claude 4.7 Sonnet / GPT-5.5 / DeepSeek V4 in our pillar comparison linked above.
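The hardware branches of the list above can be written as a small lookup function. This is one reading of the decision tree, not an official sizing rule: the thresholds are taken from the bullets, `memory_gb` means total VRAM across GPUs (or unified memory on a Mac), and `pick_devstral` is an illustrative name.

```python
def pick_devstral(memory_gb: float, apple_silicon: bool = False) -> str:
    """Map total GPU VRAM (or Mac unified memory) to this guide's recommendation."""
    if apple_silicon:
        if memory_gb >= 64:
            return "Devstral Small 2, Q8_0"
        if memory_gb >= 32:
            return "Devstral Small 2, Q4_K_M or Q5_K_M"
        return "below the guide's 32 GB Mac minimum"
    if memory_gb >= 80:  # 1x H100 80 GB, or 4x24 GB = 96 GB in total
        return "Devstral 2 (123B), Q4"
    if memory_gb >= 48:
        return "Devstral Small 2, Q8_0"
    if memory_gb >= 16:
        return "Devstral Small 2, Q4_K_M, num_ctx 16K-32K"
    return "below the guide's 16 GB GPU minimum"


for memory, mac in [(24, False), (48, False), (96, False), (32, True), (64, True)]:
    kind = "Mac" if mac else "GPU"
    print(f"{memory} GB {kind}: {pick_devstral(memory, apple_silicon=mac)}")
```

The function deliberately ignores the license and API branches, since those depend on legal and product constraints rather than on hardware.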
Common pitfalls and troubleshooting
- Running ollama pull devstral and getting the 2505 build. The default devstral tag on ollama.com has not been re-pointed to 2512 at the time of writing. If you want Small 2, pull the HF GGUF explicitly (hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M).
- OOM at long contexts. 256K context is real, but the KV cache grows linearly. On a 24 GB card, cap num_ctx at 32–48K for Q4_K_M, lower for Q8_0.
- llama.cpp / Ollama complaining about missing op. Devstral Small 2 requires the rope-scaling + attention-temperature fixes from llama.cpp PR #17945. Update llama.cpp or use Ollama 0.21.2+ where it's already merged.
- Slow tokens/sec on Apple Silicon. Make sure you're on Ollama 0.21.1+ — the MLX runner shipped fused top-P/top-K sampling that's roughly 30% faster than the prior path.
- Tool calls returning malformed JSON. Devstral supports both Mistral and XML tool-call formats; Cline and Continue default to OpenAI-style. Set Cline's tool-use mode to "Mistral function calling" or downgrade to XML if your scaffold doesn't support it.
- Confusing "Devstral 2" with "Devstral Medium". Devstral Medium (2507) is closed-weight and API-only. Devstral 2 (2512) is open-weight 123B. They are different models.
- Treating Ollama as a security boundary. Ollama listens on 0.0.0.0:11434 in some configurations. Bind to 127.0.0.1 or put it behind a firewall; there is no auth on the API.
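The long-context OOM item above comes down to how the KV cache scales: it stores one key and one value vector per layer, per KV head, per token, so memory grows linearly with num_ctx. The sketch below works through that arithmetic. The layer count, KV-head count, and head dimension are illustrative assumptions for a 24B-class model, not values confirmed for Devstral Small 2, so read the real ones from the model's config before relying on the totals. It also assumes an unquantized 16-bit cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Keys plus values for every layer and every token; linear in num_ctx."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# ASSUMPTION: illustrative 24B-class shape; check the model's own config.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

for ctx in (8_192, 32_768, 65_536, 262_144):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"num_ctx={ctx:>7}: ~{gib:.2f} GiB of KV cache")
```

Under these assumed dimensions a 32K context adds about 5 GiB on top of the weights, while the full 256K context needs about 40 GiB for the cache alone. That is consistent with capping num_ctx on a 24 GB card and reserving full-context runs for 48 GB-class hardware.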
What was removed and why
- The 24 GB / RTX 4090 floor as the only option. The 2505 post implied a single hardware path. With Devstral 2 (123B) now in the lineup, multi-GPU and H100 workflows are first-class.
- Devstral-Small-2505 as the recommended build. Superseded by Small 2 (2512), which scores +21 points on SWE-bench Verified at the same parameter count and same Apache 2.0 license. No reason to start a new project on 2505.
- "32 GB RAM" as a one-size-fits-all spec. That number is right for Q4 on a Mac; it's wrong for Q8 (needs ~48 GB) and meaningless for the 123B model.
FAQ
What's the difference between Devstral, Devstral 2, and Devstral Medium?
Devstral (May 2025, 24B, "2505") was the original release. Devstral Small 1.1 and Devstral Medium (July 2025, "2507") were the second generation: Small 1.1 is open-weight 24B, Medium is closed-weight and API-only. Devstral 2 and Devstral Small 2 (December 2025, "2512") are the current generation: 123B open-weight (modified MIT) and 24B open-weight (Apache 2.0) respectively.
Can I run Devstral 2 (123B) on a single 24 GB GPU?
Not at usable speed. The Q4 weights are ~75 GB; you'd be CPU-offloading most of the model and getting under 5 tokens/sec. Use Devstral Small 2 (24B) on consumer hardware. Reach for Devstral 2 only when you have 4×24 GB, an H100/H200, or you're using devstral-2:123b-cloud.
Is Ollama the best way to run Devstral, or should I use vLLM / llama.cpp?
Ollama is the lowest-friction option and is what most IDE plugins target. vLLM is faster for multi-tenant serving and supports Mistral's official tool-call parser via --tool-call-parser mistral. llama.cpp gives you the most quantization choices (Q3_K, IQ4_KSS, etc.) and runs anywhere. Pick Ollama for desktop/agent use, vLLM for production inference, llama.cpp for tight VRAM budgets.
What context length should I actually use?
The model supports 256K, but useful retrieval drops well before that and KV cache is expensive. For interactive coding, 16K–32K. For repo-scale agent runs, 64K–128K with a structured retrieval scaffold (OpenHands, Vibe). Going past 128K rarely pays for itself.
Can I fine-tune Devstral Small 2?
Yes — Apache 2.0 permits it, Unsloth has working notebooks for the 2512 build, and the base is Mistral-Small-3.1-24B-Base-2503. LoRA on a 4090 is realistic; full fine-tune wants multiple H100s.
Does Devstral leak code or telemetry?
Run via Ollama on localhost, no. Ollama doesn't phone home model inputs. The HF download itself is the only network call. If you're on the corporate network, mirror the GGUFs once and pin Ollama to a private registry.
Where does Devstral fit against Claude 4.7, GPT-5.5, DeepSeek V4?
Devstral 2 (123B) is the strongest open-weight code agent at SWE-bench Verified 72.2%, but the leading closed models are still ahead at the top of the leaderboard. The right way to read it: pick Devstral if you need local / private / cheap; pick a frontier API if you need the absolute best result on the hardest tasks. The full comparison lives in the 2026 DeepSeek V4 vs Claude vs GPT-5 coding model comparison.
References & further reading
- Mistral AI — Introducing: Devstral 2 and Mistral Vibe CLI (Dec 9, 2025)
- Mistral AI — Upgrading agentic coding capabilities with the new Devstral models (2507)
- Hugging Face — mistralai/Devstral-Small-2-24B-Instruct-2512 model card
- Hugging Face — mistralai/Devstral-2-123B-Instruct-2512 model card
- Hugging Face — Unsloth Devstral Small 2 GGUF (Q4_K_M, Q8_0, UD)
- Ollama — devstral-2 model library page
- Ollama — devstral model library page (legacy 2505/2507)
- GitHub — ollama/ollama releases (0.22.0, April 28, 2026)
- SWE-bench Verified — official leaderboard
- SWE-rebench — independent SWE-bench leaderboard (lists Devstral 2 / Devstral Small 2 2512)
- Unsloth — Devstral 2 fine-tune & run guide
- r/LocalLLaMA — practitioner discussions on local Devstral runs, quantizations, and scaffolds