
Run Devstral 2 Locally with Ollama (May 2026 Guide)

Published 25 May 2025 • Updated 31 May 2026 • 12 min read

Quick answer. Run Devstral 2 with Ollama using the official tags: ollama pull devstral-small-2 (24B, 68.0% SWE-bench Verified, fits a 24 GB RTX 4090 or 32 GB Mac at Q4_K_M) or ollama pull devstral-2 (123B, 72.2% SWE-bench Verified, needs 4×24 GB VRAM or a single H100/H200). Requires Ollama 0.13.3 or newer; 0.23.4 (May 13, 2026) is the documented baseline (0.24.0 shipped May 14).

Last updated May 2026 — refreshed for the official devstral-small-2 and devstral-2 Ollama library tags, Ollama 0.23.x, and the current SWE-bench Verified leaderboard.

Devstral has gone from a single 24B checkpoint in May 2025 to a two-tier family by December 2025: Devstral 2 (123B) (modified MIT, 72.2% on SWE-bench Verified) and Devstral Small 2 (24B) (Apache 2.0, 68.0% on SWE-bench Verified). Both now ship as first-party Ollama library models. This guide is the practical, no-fluff version: which model fits which machine, the exact ollama pull commands for the official tags, the quantizations that actually fit in 24/32/48 GB of VRAM, and the troubleshooting steps that come up most often when you wire Devstral into Cline, Continue, Claude Code, or Mistral's own Vibe CLI.

What changed since the last revision.

Official Ollama tags shipped. Both devstral-small-2 (24B) and devstral-2 (123B) now have first-party Ollama library entries. You no longer need to chase community GGUFs from Hugging Face for the 2512 weights — pulling devstral-small-2:24b gives you the same Q4_K_M build at ~15 GB.

Ollama 0.23.4 (May 13, 2026) is the documented baseline and 0.24.0 (May 14, 2026) added Codex App support. 0.23.0 introduced ollama launch, which auto-wires Claude Code, OpenCode, Codex, and Droid to a local model in one command. 0.23.2 cached /api/show responses for ~6.7× faster IDE loads. Vision input is a property of the Devstral Small 2 weights, not of ollama launch.

SWE-bench Verified positioning. The May 2026 leaderboard has shifted: Claude Mythos Preview posts the highest score at 93.9% but is still in preview, and GPT-5.5 leads the released models at 88.7%. The strongest open-weight cluster is now DeepSeek V4 Pro Max (80.6%), MiniMax M2.5 (80.2%), and Kimi K2.6 (80.2%). Devstral 2 (72.2%) and Devstral Small 2 (68.0%) sit in the next tier: still the best code-specific open-weight models you can run on consumer or single-H100 hardware, but no longer at the top of the open-weight leaderboard overall.

Context window: 256K tokens on the 2512 checkpoints (up from 128K on 2505/2507).

License split: Devstral Small 2 is Apache 2.0; Devstral 2 (123B) ships under a modified MIT license — read it before you bake it into a commercial product.

Want the full picture? Read our continuously-updated Self-Hosting LLMs Complete Guide (2026) — hardware, ollama and vllm, cost-per-token, and when to self-host.

TL;DR — which Devstral, which hardware

| Model | Params | SWE-bench Verified | License | Min hardware (Q4_K_M) | Ollama pull |
| --- | --- | --- | --- | --- | --- |
| Devstral Small 2 (2512) | 24B | 68.0% | Apache 2.0 | 1× RTX 4090 (24 GB) or Mac 32 GB | `ollama pull devstral-small-2` |
| Devstral 2 (2512) | 123B | 72.2% | Modified MIT | ~75 GB on disk; 4×24 GB VRAM or 1× H100 80 GB | `ollama pull devstral-2` |
| Devstral Small 2507 (legacy) | 24B | 53.6% | Apache 2.0 | 1× RTX 4090 or Mac 32 GB | `ollama pull devstral` (legacy default tag) |

If you have a single 24 GB consumer GPU or a 32 GB Mac, run Devstral Small 2 at Q4_K_M. If you have a workstation with multiple GPUs (4×24 GB) or an H100, run Devstral 2 (123B). Skip the legacy 2505/2507 builds unless you have a reproducibility reason: at the same parameter count and the same Apache 2.0 license, Small 2 scores about 21 points higher on SWE-bench Verified than the original 2505 release (46.8%) and about 14 points higher than Small 1.1 (2507, 53.6%).

Why run Devstral locally at all

Privacy. Repository contents, prompts, and chain-of-thought stay on the box. This is the only reason most teams care.
Cost predictability. Devstral 2 via API is free during the preview window, but Devstral Medium 2507 was $0.40 / $2.00 per million input/output tokens — local inference has zero marginal cost after the GPU is bought.
Latency. First-token latency on a 4090 with Q4_K_M is sub-200 ms; cloud APIs typically sit at 400–900 ms. Ollama 0.23.2's API-show cache further cuts model-switching overhead inside IDE plugins.
Offline / air-gapped. Defense, finance, and regulated-health workflows can't ship code to a third-party endpoint.
Tool-use for agents. Devstral was trained for agentic scaffolds (OpenHands, Cline, Vibe CLI, Claude Code via ollama launch) and supports native function calling — it's the model the open community actually uses for autonomous code-edit loops.

If you're comparing this to closed-source coding models before committing, our DeepSeek V4 vs Claude vs GPT-5: AI coding model comparison (2026) walks through where Devstral 2 sits against Claude 4.7 Sonnet, GPT-5.5, and DeepSeek V4 on the same SWE-bench harness.

Prerequisites

Hardware:
- Devstral Small 2 (24B) Q4_K_M: 24 GB VRAM (RTX 3090 / 4090, or the 32 GB RTX 5090) or Apple Silicon with 32 GB unified memory.
- Devstral Small 2 at full 256K context: ~35 GB VRAM with a Q6_K-class quant — needs a 48 GB card (RTX A6000, 6000 Ada) or a 64 GB+ Mac.
- Devstral 2 (123B) Q4: ~75 GB on disk, 4×24 GB VRAM split or a single 80 GB H100 / 141 GB H200.
Disk: ~15 GB for Small 2 Q4, ~26 GB for Small 2 Q8, ~75 GB for Devstral 2 123B Q4.
OS: Linux (Ubuntu 22.04+ or Fedora 40+), macOS 14+, or Windows 11 with WSL2. Native Windows works but the Linux path is better tested.
Ollama: 0.13.3 or newer is the documented minimum for devstral-small-2. We recommend 0.23.4 or newer for the launch wizard, the API-show cache, and the latest fixes.

How do I install Ollama 0.23.x?

macOS (Homebrew):

brew install ollama
ollama --version   # expect 0.23.x

Linux (official installer):

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama   # the installer registers a system-level service, not a user unit
ollama --version

Windows: download the installer from ollama.com/download. It sets Ollama up to run in the background and exposes the same CLI.

If you already had Ollama installed, upgrade in place — the 0.22 → 0.23 jump added ollama launch, the /api/show cache, and OpenCode vision support, all worth having.
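To check an existing install against the version floor mentioned above, a small script can parse the output of ollama --version. This is a minimal sketch: the 0.13.3 and 0.23.4 thresholds come from this guide, while the helper names (parse_version, classify, installed_version) are made up for illustration, and it assumes the ollama binary is on your PATH.

```python
import re
import subprocess
from typing import Tuple

Version = Tuple[int, int, int]

MINIMUM: Version = (0, 13, 3)      # documented minimum for devstral-small-2 (per this guide)
RECOMMENDED: Version = (0, 23, 4)  # baseline this guide recommends


def parse_version(text: str) -> Version:
    """Extract the first x.y.z version number found in `text`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def classify(version: Version) -> str:
    """Return 'too-old', 'ok', or 'recommended' for a parsed version."""
    if version < MINIMUM:
        return "too-old"
    if version < RECOMMENDED:
        return "ok"
    return "recommended"


def installed_version() -> Version:
    """Run `ollama --version` and parse whatever version string it prints."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return parse_version(result.stdout + result.stderr)
```

Calling classify(installed_version()) on a machine with Ollama installed returns one of the three labels; anything reported as "too-old" should be upgraded before pulling devstral-small-2.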

Which `ollama pull` command do I use for Devstral 2?

As of December 2025, both Devstral 2 builds have official Ollama library entries — you don't need to chase community GGUFs on Hugging Face anymore.

Devstral Small 2 (24B, recommended for one GPU)

# Default tag — Q4_K_M, ~15 GB, fits on a 24 GB card
ollama pull devstral-small-2

# Explicit quantization tags
ollama pull devstral-small-2:24b-instruct-2512-q4_K_M   # ~15 GB
ollama pull devstral-small-2:24b-instruct-2512-q8_0     # ~26 GB
ollama pull devstral-small-2:24b-instruct-2512-fp16     # ~48 GB

# Run it
ollama run devstral-small-2

Devstral 2 (123B, multi-GPU / H100 class)

ollama pull devstral-2          # 123B, ~75 GB
ollama run devstral-2

The 123B build supports a 256K context window. There's also a hosted variant — devstral-2:123b-cloud — that runs the same weights on Ollama's infrastructure, useful for benchmarking before you provision an H100 yourself.

Legacy 2505 / 2507 (only if you need exact reproducibility)

ollama pull devstral            # 2505/2507 lineage, 24B, 14 GB

How do I verify the model runs?

Drop into the REPL and confirm the model responds:

ollama run devstral-small-2
>>> Write a Python function that flattens an arbitrarily nested list of integers, with a doctest.

For programmatic use, hit the local REST API:

curl http://localhost:11434/api/chat -d '{
  "model": "devstral-small-2",
  "messages": [
    {"role": "system", "content": "You are a senior software engineer."},
    {"role": "user", "content": "Refactor this Python loop into a list comprehension: ..."}
  ],
  "options": {"temperature": 0.15, "num_ctx": 32768}
}'

Devstral is meant to be sampled at a low temperature for agent runs (0.15 here), so keep it low. The full 256K context is available but expensive in VRAM; num_ctx: 32768 is a sane default for chat-style use.
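The same request can be sent from Python with only the standard library. The sketch below mirrors the curl call above and adds "stream": false so the server returns one JSON object instead of a stream of chunks. The function names build_chat_payload and chat are illustrative, and it assumes an Ollama server is already listening on the default localhost:11434.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint used in this guide


def build_chat_payload(prompt: str, model: str = "devstral-small-2",
                       temperature: float = 0.15, num_ctx: int = 32768) -> dict:
    """Build a non-streaming /api/chat request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a senior software engineer."},
            {"role": "user", "content": prompt},
        ],
        "stream": False,  # one JSON response instead of newline-delimited chunks
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }


def chat(prompt: str, url: str = OLLAMA_URL, timeout: float = 300.0) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["message"]["content"]
```

With the model pulled and the server running, chat("Refactor this loop into a list comprehension: ...") returns the assistant's reply as a string. The generous timeout is a guess to cover the first call, which also loads the model into memory.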

How do I wire Devstral into my editor?

Ollama launch (Claude Code, OpenCode, Codex, Droid)

Fastest path: ollama launch. The wizard detects installed agents (Claude Code, OpenCode, Codex, Droid) and writes their config files for you against the local model you select. Because Devstral Small 2 is natively multimodal, ollama launch opencode with a Devstral Small 2 model accepts image inputs out of the box.

ollama launch                     # interactive picker
ollama launch claude              # wire Claude Code directly
ollama launch opencode            # OpenCode (vision-capable with devstral-small-2)
ollama launch codex               # OpenAI Codex CLI / Codex App (0.24.0+)
ollama launch droid               # Droid

Cline (VS Code)

Install the Cline extension.
Settings → API Provider → Ollama. Base URL: http://localhost:11434. Model ID: devstral-small-2 (or devstral-2).
Set tool-use mode to native function calling — Devstral 2 supports both Mistral function calling and XML formats.
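Outside Cline, native function calling can be exercised directly against Ollama's /api/chat endpoint by passing a tools list. The following is a minimal sketch rather than Cline's actual implementation: read_file is a hypothetical tool invented for this example, and the helper names are illustrative. It only builds the request body and parses a response, so it runs without a server.

```python
import json
from typing import Dict, List, Tuple


def build_tool_request(prompt: str, model: str = "devstral-small-2") -> dict:
    """Build an /api/chat body that offers the model one callable tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": 0.15},
        "tools": [{
            "type": "function",
            "function": {
                "name": "read_file",  # hypothetical tool, defined only for this sketch
                "description": "Return the contents of a file in the repository.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string", "description": "Repo-relative path"},
                    },
                    "required": ["path"],
                },
            },
        }],
    }


def extract_tool_calls(response: dict) -> List[Tuple[str, Dict]]:
    """Return (tool_name, arguments) pairs from a chat response, if any."""
    calls = []
    for call in response.get("message", {}).get("tool_calls") or []:
        function = call["function"]
        arguments = function.get("arguments", {})
        if isinstance(arguments, str):
            # Some OpenAI-style servers send arguments as a JSON string.
            arguments = json.loads(arguments)
        calls.append((function["name"], arguments))
    return calls
```

A scaffold would POST build_tool_request(...) to the server, run whatever extract_tool_calls returns, and append each result as a "tool" role message before calling the model again. An empty list means the model answered in plain text.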

Continue (VS Code / JetBrains)

Add to ~/.continue/config.yaml:

models:
  - name: Devstral Small 2
    provider: ollama
    model: devstral-small-2
    defaultCompletionOptions:
      contextLength: 32768
      temperature: 0.15

Mistral Vibe CLI

Vibe is Mistral's first-party agent CLI for Devstral, released alongside Devstral 2. Point it at your local Ollama instance with VIBE_BASE_URL=http://localhost:11434/v1 and VIBE_MODEL=devstral-2 (or devstral-small-2). The Agent Communication Protocol means it plugs into IDEs that already speak ACP.

Performance and benchmarks (verified, May 2026)

| Model | SWE-bench Verified | Source |
| --- | --- | --- |
| Devstral 2 (123B, 2512) | 72.2% | Mistral AI announcement, Dec 9, 2025 |
| Devstral Small 2 (24B, 2512) | 68.0% | Mistral AI announcement, Dec 9, 2025 |
| Devstral Medium (2507) | 61.6% | Mistral AI devstral-2507 blog |
| Devstral Small 1.1 (2507) | 53.6% | Mistral AI devstral-2507 blog |
| Devstral Small (2505, original) | 46.8% | Original Devstral release, May 2025 |

Real-world throughput on consumer hardware (Q4_K_M, OpenHands scaffold, single 4090, our internal runs — your numbers will vary):

Devstral Small 2 24B: ~35–45 tokens/sec generation, ~50 ms first-token latency at 8K context.
Devstral Small 2 24B at 64K context: ~22–28 tokens/sec (KV cache dominates VRAM).
Devstral 2 123B Q4 split across 4×24 GB: ~12–18 tokens/sec; on a single H100 80 GB: ~28–34 tokens/sec.

Independent leaderboards (SWE-rebench, llm-stats.com, Epoch AI) list the 2512 checkpoints with broadly consistent numbers. The open-weight leaderboard moved in early 2026 — MiniMax M2.5, DeepSeek V4 Pro Max, and Kimi K2.6 now sit in the 80% band — but Devstral 2 remains the strongest code-specific open-weight model in the 24B–123B range and is the one the agent-tooling community has actually integrated against.

How to choose: a 30-second decision tree

Single consumer GPU (16–24 GB) → Devstral Small 2 (24B) at Q4_K_M, context 16K–32K. Don't try the 123B; it'll either OOM or run at <5 tok/s on offload.
32 GB+ Apple Silicon → Devstral Small 2 at Q4_K_M or Q5_K_M. The MLX runner (improved in Ollama 0.21+) is real-time on M3/M4 Max.
48 GB pro card or 64 GB Mac → Devstral Small 2 at Q8_0, full 256K context. This is the sweet spot for agent runs over large repos.
Multi-GPU workstation or H100 → Devstral 2 (123B). Worth it for harder agentic tasks and longer plans.
You need an Apache-permissive license → Devstral Small 2 (Apache 2.0). Devstral 2's modified MIT has carve-outs; have legal read it before redistributing.
You want closed-source-tier quality and don't care about local → API to Devstral Medium 2507 or Devstral 2, or compare against Claude 4.7 Sonnet / GPT-5.5 / DeepSeek V4 in our pillar comparison linked above.
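The hardware branches of the decision tree above can be written down as a small lookup. This is a sketch of one reading of the tree, not an official sizing tool: the function name pick_devstral is invented, treating 256K as 262144 tokens is an assumption, the 16K versus 32K split below 24 GB is an interpolation of the 16K–32K range, and the context value for the 123B model is simply this guide's chat default rather than a tested limit.

```python
from typing import Dict, Optional, Union

Choice = Dict[str, Union[str, int]]


def pick_devstral(memory_gb: float, apple_silicon: bool = False) -> Optional[Choice]:
    """Map GPU VRAM (or Apple unified memory) to this guide's suggested setup."""
    # Multi-GPU workstation (4x24 GB) or H100: the guide only gives this path
    # for discrete GPUs, so Apple Silicon never selects the 123B model here.
    if not apple_silicon and memory_gb >= 80:
        return {"tag": "devstral-2", "quant": "Q4", "num_ctx": 32768}

    # 48 GB pro card or 64 GB Mac: Q8_0 with the full context window.
    q8_floor = 64 if apple_silicon else 48
    if memory_gb >= q8_floor:
        return {"tag": "devstral-small-2", "quant": "Q8_0", "num_ctx": 262144}

    # Single consumer GPU (16-24 GB) or 32 GB+ Mac: Q4_K_M with a capped context.
    q4_floor = 32 if apple_silicon else 16
    if memory_gb >= q4_floor:
        num_ctx = 32768 if memory_gb >= 24 else 16384
        return {"tag": "devstral-small-2", "quant": "Q4_K_M", "num_ctx": num_ctx}

    return None  # below the hardware floors described in this guide
```

For example, pick_devstral(24) selects devstral-small-2 at Q4_K_M with a 32768-token context, and pick_devstral(8) returns None. License and quality considerations from the last two branches are deliberately left out, since they are not a function of memory.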

Common pitfalls and troubleshooting

Pulling ollama pull devstral and expecting Small 2. The default devstral tag still points to the 2505/2507 lineage. For Small 2 you want devstral-small-2; for the 123B you want devstral-2. Both are official Ollama library entries.
OOM at long contexts. 256K context is real, but the KV cache grows linearly. On a 24 GB card, cap num_ctx at 32–48K for Q4_K_M, lower for Q8_0.
Old Ollama on the box. Devstral Small 2 needs Ollama 0.13.3 minimum, but several rope-scaling and sampler fixes only landed in 0.21–0.23. If you see garbled output, broken tool calls, or unexpectedly slow tokens, upgrade to the latest 0.23.x first.
Slow tokens/sec on Apple Silicon. Make sure you're on Ollama 0.21.1+ — the MLX runner shipped fused top-P/top-K sampling that's roughly 30% faster than the prior path.
Tool calls returning malformed JSON. Devstral supports both Mistral and XML tool-call formats; Cline and Continue default to OpenAI-style. Set Cline's tool-use mode to "Mistral function calling" or downgrade to XML if your scaffold doesn't support it.
Confusing "Devstral 2" with "Devstral Medium". Devstral Medium (2507) is closed-weight and API-only. Devstral 2 (2512) is open-weight 123B. They are different models.
Treating Ollama as a security boundary. Ollama listens on 0.0.0.0:11434 in some configurations. Bind to 127.0.0.1 or put it behind a firewall — there is no auth on the API.
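For the last pitfall, the bind address is normally controlled by the OLLAMA_HOST environment variable, and an unset value means the loopback default. The sketch below classifies such a value as loopback-only or exposed. It is a heuristic under stated assumptions: the helper names are illustrative, it inspects only the string you pass in, and it does not probe what a running server has actually bound to.

```python
import ipaddress
import os
from typing import Optional
from urllib.parse import urlsplit


def bind_host(ollama_host: Optional[str]) -> str:
    """Return the host part of an OLLAMA_HOST-style value (default 127.0.0.1)."""
    if not ollama_host:
        return "127.0.0.1"
    try:
        ipaddress.ip_address(ollama_host)  # bare IP such as "0.0.0.0" or "::"
        return ollama_host
    except ValueError:
        pass
    # Add a scheme-less "//" prefix so urlsplit parses host:port forms.
    value = ollama_host if "//" in ollama_host else "//" + ollama_host
    return urlsplit(value).hostname or "127.0.0.1"


def is_loopback_only(ollama_host: Optional[str]) -> bool:
    """True when the value resolves to a loopback address or 'localhost'."""
    host = bind_host(ollama_host)
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unknown hostname: treat as exposed to be safe


def current_setting_is_safe() -> bool:
    """Check the OLLAMA_HOST value visible to this process."""
    return is_loopback_only(os.environ.get("OLLAMA_HOST"))
```

A value such as 0.0.0.0:11434 is reported as exposed, which is the case where a firewall or a reverse proxy with authentication is needed in front of the API.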

What was removed and why

The Hugging Face GGUF detour. Earlier revisions of this guide told you to ollama pull hf.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF because there was no official 2512 tag. As of December 2025 both devstral-small-2 and devstral-2 are first-party Ollama library entries; the HF route is now optional.
The 24 GB / RTX 4090 floor as the only option. The 2505 post implied a single hardware path. With Devstral 2 (123B) in the lineup, multi-GPU and H100 workflows are first-class.
Devstral-Small-2505 as the recommended build. Superseded by Small 2 (2512), +21 points on SWE-bench Verified at the same parameter count and same Apache 2.0 license.
"32 GB RAM" as a one-size-fits-all spec. That number is right for Q4 on a Mac; it's wrong for Q8 (needs ~48 GB) and meaningless for the 123B model.

Need engineers who already know this stack

If you're standing up a private code-agent platform on Devstral and don't want to spend two months hiring for it, Codersera places vetted remote developers with hands-on local-LLM and agentic-tooling experience — usually within two weeks. The same talent pool covers the adjacent skills (vLLM serving, Ollama at scale, Cline / Continue / OpenHands / Claude Code integration) that this kind of deployment touches.

FAQ

What's the difference between Devstral, Devstral 2, and Devstral Medium?

Devstral (May 2025, 24B, "2505") was the original release. Devstral Small 1.1 / Medium (July 2025, "2507") was the second generation — Small 1.1 is open-weight 24B, Medium is closed-weight API-only. Devstral 2 and Devstral Small 2 (December 2025, "2512") are the current generation: 123B open-weight (modified MIT) and 24B open-weight (Apache 2.0) respectively, both available as official Ollama library tags.

What is the Ollama tag for Devstral 2?

The official Ollama library tags are devstral-2 (123B) and devstral-small-2 (24B). Pull either with ollama pull devstral-2 or ollama pull devstral-small-2. The 24B default tag is Q4_K_M (~15 GB); the explicit quantization tags are devstral-small-2:24b-instruct-2512-q4_K_M, devstral-small-2:24b-instruct-2512-q8_0, and devstral-small-2:24b-instruct-2512-fp16.

Can I run Devstral 2 (123B) on a single 24 GB GPU?

Not at usable speed. The Q4 weights are ~75 GB; you'd be CPU-offloading most of the model and getting under 5 tokens/sec. Use Devstral Small 2 (24B) on consumer hardware. Reach for Devstral 2 only when you have 4×24 GB, an H100/H200, or you're using devstral-2:123b-cloud.
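The file sizes quoted in this guide follow from simple arithmetic: parameter count times average bits per weight, divided by eight. The sketch below reproduces them. The bits-per-weight figures are approximate averages assumed for illustration (they are not stated in this guide and vary slightly per model), and the 2 GB headroom default is an arbitrary allowance for KV cache and runtime overhead.

```python
# Approximate average bits per weight for each quantization (assumed values).
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q8_0": 8.5, "fp16": 16.0}


def weight_size_gb(params_billion: float, quant: str) -> float:
    """Estimate the weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, vram_gb: float,
         headroom_gb: float = 2.0) -> bool:
    """True when weights plus a rough headroom allowance fit in VRAM."""
    return weight_size_gb(params_billion, quant) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for params, quant in [(24, "q4_K_M"), (24, "q8_0"), (24, "fp16"), (123, "q4_K_M")]:
        print(f"{params}B {quant}: ~{weight_size_gb(params, quant):.1f} GB")
```

Under these assumptions the 24B model comes out near 15 GB at Q4_K_M, and the 123B model near 75 GB, which is why fits(123, "q4_K_M", 24) is False and most of the larger model would spill to CPU on a single 24 GB card.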

Is Ollama the best way to run Devstral, or should I use vLLM / llama.cpp?

Ollama is the lowest-friction option and is what most IDE plugins target. vLLM is faster for multi-tenant serving and supports Mistral's official tool-call parser via --tool-call-parser mistral. llama.cpp gives you the most quantization choices (Q3_K, IQ4_XS, etc.) and runs anywhere. Pick Ollama for desktop/agent use, vLLM for production inference, llama.cpp for tight VRAM budgets.

What context length should I actually use?

The model supports 256K, but useful retrieval drops well before that and KV cache is expensive. For interactive coding, 16K–32K. For repo-scale agent runs, 64K–128K with a structured retrieval scaffold (OpenHands, Vibe). Going past 128K rarely pays for itself.
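To see why long contexts are expensive, it helps to estimate the KV cache, which grows linearly with num_ctx. The sketch below uses the standard formula for a grouped-query attention transformer with an unquantized cache. The architecture numbers in the example (40 layers, 8 KV heads, head dimension 128) are placeholders chosen for illustration, not confirmed values for any Devstral checkpoint; read the real ones from the model's config before trusting the result. It also ignores sliding-window attention and KV cache quantization, both of which would lower the figure.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store keys and values for `num_ctx` tokens.

    The leading factor 2 covers one key tensor and one value tensor per layer;
    bytes_per_elem=2 corresponds to an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * num_ctx


def kv_cache_gib(num_ctx: int, **arch: int) -> float:
    """Same estimate expressed in GiB (2**30 bytes)."""
    return kv_cache_bytes(num_ctx, **arch) / 2**30


if __name__ == "__main__":
    # Placeholder architecture values, NOT taken from a Devstral model card.
    arch = {"n_layers": 40, "n_kv_heads": 8, "head_dim": 128}
    for ctx in (16384, 32768, 65536, 131072):
        print(f"num_ctx={ctx}: ~{kv_cache_gib(ctx, **arch):.1f} GiB")
```

Whatever the real architecture values are, the linear relationship holds: doubling num_ctx doubles the cache, which is the reason the 16K–32K range is comfortable on a 24 GB card while six-figure contexts need far more memory.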

Can I fine-tune Devstral Small 2?

Yes — Apache 2.0 permits it, Unsloth has working notebooks for the 2512 build, and the base is Mistral-Small-3.1-24B-Base-2503. LoRA on a 4090 is realistic; full fine-tune wants multiple H100s.

Does Devstral leak code or telemetry?

Run via Ollama on localhost, no. Ollama doesn't phone home model inputs. The model pull itself is the only network call. If you're on the corporate network, mirror the model once and pin Ollama to a private registry.

Where does Devstral fit against Claude 4.7, GPT-5.5, DeepSeek V4?

The May 2026 SWE-bench Verified leaderboard has GPT-5.5 at 88.7%, Claude Opus 4.7 (Adaptive) at 87.6%, and the leading open-weight cluster — MiniMax M2.5, DeepSeek V4 Pro Max, Kimi K2.6 — in the low 80s. Devstral 2 (72.2%) sits below that tier but remains the most agent-tooling-integrated open-weight code model in the 24B–123B range. Pick Devstral if you need local / private / cheap; pick a frontier API if you need the absolute best result on the hardest tasks. The full comparison lives in the 2026 DeepSeek V4 vs Claude vs GPT-5 coding model comparison.
