Install and Run Cherry Studio with Ollama on Windows (2026 Guide)

Last updated April 2026 — refreshed for Cherry Studio v1.9.3, Ollama v0.22.0, and the current 2026 model lineup (Gemma 3, Qwen 3, Llama 3.3, DeepSeek-R1, Phi-4).

Cherry Studio is an open-source desktop client (AGPL-3.0) that gives Windows users a single interface for both cloud LLMs (OpenAI, Anthropic, Gemini, DeepSeek, Mistral) and local engines like Ollama. This guide walks through a clean Windows 10/11 install of Ollama 0.22.0 and Cherry Studio 1.9.3, picks 2026-current models that actually fit your hardware, and shows the exact configuration values that work on the first try.

What changed in 2026 (read this first):

  • Cherry Studio 1.9.x renamed the old "Plugins" system to Skills, added the CherryClaw autonomous-agent layer with scheduled tasks and Telegram/Discord/Slack integrations, and shipped a built-in Cherry Assistant for in-app diagnostics. The repo is at ~44.8k GitHub stars.
  • Ollama 0.22.0 (April 28, 2026) is the current stable. The Windows installer lives at %LOCALAPPDATA%\Programs\Ollama, requires no admin rights, and bundles its own CUDA toolkit so NVIDIA GPUs (compute capability 5.0+, driver 531+) are auto-detected.
  • Default models have moved on. Llama 2 and Mistral 7B v0.3 are still pullable but rarely the right starting point in 2026. The current sweet-spot picks are gemma3:4b, qwen3:8b, llama3.3:8b, phi4:14b, and deepseek-r1:7b/14b for reasoning.
  • AMD GPUs on Windows now work via ROCm v6.1 for RX 7000/6000 and select Radeon PRO cards; an experimental Vulkan path covers everything else. AMD support remains stronger on Linux than Windows.
  • Cherry Studio supports MCP (Model Context Protocol) servers, so any Ollama model with tool-calling can be wired into filesystem, web-search, GitHub, Postgres, or your own custom MCP servers — no separate orchestration needed.

Want the full picture? Read our continuously-updated Self-Hosting LLMs Complete Guide (2026) — hardware, Ollama and vLLM, cost-per-token, and when to self-host.

TL;DR — the 5-minute path

Step | Command / Action | Time
1. Install Ollama | Download installer from ollama.com/download/windows, run, accept defaults | ~2 min
2. Pull a model | ollama pull gemma3:4b (3.3 GB) or qwen3:8b (5.2 GB) | ~3 min on 100 Mbps
3. Verify daemon | Browse to http://localhost:11434 — should return "Ollama is running" | ~10 s
4. Install Cherry Studio | Grab v1.9.3 .exe from GitHub releases, install, launch | ~1 min
5. Wire it up | Settings → Model Providers → Ollama → toggle on → API URL http://localhost:11434 → Add model gemma3:4b | ~30 s

Why Cherry Studio + Ollama in 2026

Ollama is the runtime: it pulls quantised GGUF model files from its registry, manages a local OpenAI-compatible HTTP server on port 11434, and handles GPU offload. It's not a UI — and that's the point. Cherry Studio is the UI: chat threads, document RAG, agent workflows, MCP server orchestration, voice input, image generation, and a unified provider list that lets you flip between a local Ollama model and Claude 4.7 or GPT-5.5 without changing apps. FOSS Force and multiple 2026 reviews note Cherry Studio has become one of the top-three desktop clients for local LLMs, alongside LM Studio and Open WebUI.
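
That OpenAI-compatible surface is easy to see for yourself: Ollama answers standard chat-completion requests on /v1/chat/completions, which is the same shape of traffic Cherry Studio and any OpenAI-SDK tool will send it. A minimal PowerShell sketch, assuming Ollama is installed and gemma3:4b has been pulled (both covered below):

# Chat completion against Ollama's OpenAI-compatible endpoint
$body = @{
    model    = "gemma3:4b"
    messages = @(@{ role = "user"; content = "Say hello in five words." })
} | ConvertTo-Json -Depth 5

$resp = Invoke-RestMethod -Method Post -Uri "http://localhost:11434/v1/chat/completions" `
    -ContentType "application/json" -Body $body

# Print just the assistant's reply
$resp.choices[0].message.content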

If you're standing up an end-to-end local agent stack (not just a chat window), the broader walkthrough — OpenClaw + Ollama setup guide for running local AI agents — extends this same Ollama install into a multi-tool agent pipeline.

System requirements (verified April 2026)

Cherry Studio 1.9.3

  • OS: Windows 10 (1809+) or Windows 11. Also runs on macOS 11+ and Linux (Flatpak, DEB, RPM, AUR, Nix).
  • CPU: Any modern x64 (Intel 8th-gen+, AMD Zen+).
  • RAM: 4 GB for the Electron app itself; total RAM budget is dominated by the model you load.
  • Disk: ~600 MB for Cherry Studio install. Models live under Ollama's data dir.

Ollama 0.22.0 — what your hardware can actually run

  • NVIDIA GPU: Compute capability 5.0+ (Maxwell or newer — practically any GTX 900-series and up), driver 531+. RTX 30/40/50 series are the sweet spot. Ollama bundles CUDA, so no separate toolkit install is needed.
  • AMD GPU on Windows: ROCm v6.1 covers RX 7900 XTX/XT, 7800 XT, 7700 XT, 7600/XT, 6950 XT, 6900 XT, 6800/XT, plus Radeon PRO W7000/W6000 series. Older AMD cards fall back to CPU or experimental Vulkan.
  • CPU-only: Works for sub-4B models (e.g. gemma3:1b, llama3.2:3b) at usable speed (~10–25 tok/s on a recent x64 CPU). Don't bother with 7B+ on CPU unless you have 64 GB RAM and patience.
  • RAM/VRAM by model size (Q4_K_M quantisation): 1B–3B → 2–4 GB; 7B–8B → 5–8 GB; 13B–14B → 9–11 GB; 27B–32B → 18–24 GB; 70B → ~45 GB.
  • Disk: Plan for 20–50 GB if you want to keep 3–5 models around.

Step 1 — Install Ollama for Windows

  1. Visit ollama.com/download/windows and download OllamaSetup.exe.
  2. Run it. The installer drops binaries in %LOCALAPPDATA%\Programs\Ollama, adds that path to your user PATH, and registers a background service. No admin rights required.
  3. Open PowerShell and verify:
ollama --version
# expected: ollama version is 0.22.0 (or newer)

ollama list
# empty initially — that's fine

If ollama isn't recognised, close and reopen PowerShell (PATH refresh) or sign out and back in.

Optional: relocate the model store

Models go to %USERPROFILE%\.ollama\models by default. Each model is several GB — if your C: drive is small, point Ollama at a bigger drive:

# In PowerShell (no elevation needed for a User-level variable):
[Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama-models", "User")

# Restart the Ollama tray app for the change to take effect.
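
A quick way to confirm the change took effect after the restart (a sketch; D:\ollama-models matches the example above):

# Read the variable back, then pull a small model and check it lands on D:
[Environment]::GetEnvironmentVariable("OLLAMA_MODELS", "User")
# expected: D:\ollama-models

ollama pull gemma3:1b
Get-ChildItem "D:\ollama-models" -Recurse -File | Measure-Object -Property Length -Sum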

Optional: expose Ollama on your LAN

[Environment]::SetEnvironmentVariable("OLLAMA_HOST", "0.0.0.0:11434", "User")

Only do this on a trusted network — Ollama has no auth.
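
If you do open it up, it's worth scoping the Windows firewall rule to your own subnet rather than allowing everything. A sketch using the built-in NetSecurity cmdlets; 192.168.1.0/24 is an example subnet, substitute your own:

# Allow inbound 11434 only from the local subnet (run in an elevated PowerShell)
New-NetFirewallRule -DisplayName "Ollama LAN" -Direction Inbound -Protocol TCP `
    -LocalPort 11434 -RemoteAddress 192.168.1.0/24 -Action Allow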

Step 2 — Pull a 2026-current model

Pick one based on your VRAM. All commands assume PowerShell.

# Tiny — runs anywhere, even CPU-only laptops
ollama pull gemma3:1b           # 815 MB

# Recommended starting point — balanced quality/speed on 8 GB VRAM
ollama pull gemma3:4b           # 3.3 GB
ollama pull llama3.2:3b         # 2.0 GB

# Best general-purpose chat for 8–12 GB VRAM
ollama pull qwen3:8b            # 5.2 GB
ollama pull llama3.3:8b         # 4.9 GB

# Reasoning (chain-of-thought) — DeepSeek-R1 distilled
ollama pull deepseek-r1:7b      # 4.7 GB
ollama pull deepseek-r1:14b     # 9.0 GB

# Coding assistant — best HumanEval scores at this size
ollama pull qwen2.5-coder:14b   # 9.0 GB

# Punches above its weight on math/logic — needs 16 GB VRAM
ollama pull phi4:14b            # 9.1 GB

Quick smoke test from the CLI before plumbing in Cherry Studio:

ollama run gemma3:4b "Write a one-sentence summary of the Goldilocks principle."

If you see a coherent reply, the runtime is healthy. The one-shot form above exits on its own; in an interactive ollama run session, type /bye to quit.
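
You can also confirm the HTTP API is answering, since that is what Cherry Studio will talk to. The /api/tags endpoint returns the exact model tags you'll need to enter in the next step:

# List installed models over the API (same list as `ollama list`)
(Invoke-RestMethod "http://localhost:11434/api/tags").models | Select-Object name, size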

Step 3 — Install Cherry Studio 1.9.3

  1. Go to github.com/CherryHQ/cherry-studio/releases and grab Cherry-Studio-1.9.3-x64-setup.exe (or the portable build if you don't want a system install).
  2. SmartScreen may flag the installer because Cherry Studio doesn't sign Windows binaries with an EV cert. Click More info → Run anyway. The release artefact's SHA-256 is published on the GitHub releases page — verify if you're cautious.
  3. Run the installer. Launch Cherry Studio. On first run it asks for a language and theme.

Step 4 — Wire Cherry Studio to Ollama

  1. Click the gear icon in the bottom-left of Cherry Studio.
  2. Open Model Providers (also labelled "Model Services" in older builds).
  3. Find Ollama in the provider list and toggle it on.
  4. Set the fields exactly as below:
Field | Value
API Host / API Address | http://localhost:11434
API Key | leave blank (Ollama doesn't require one)
Keep Alive | 5m (default) — bump to 30m if you switch chats often and don't want the model unloaded between turns (a server-side alternative follows these steps)
  5. Click Manage (or + Add) and enter the exact tag you pulled — e.g. gemma3:4b, qwen3:8b. Not every build auto-lists local Ollama models; if yours doesn't show them, type the tags in manually.
  6. Open a new chat, pick Ollama as the provider and your model as the target. Send a prompt — the first response will be a touch slow as the model loads into VRAM; subsequent tokens stream at full GPU speed.
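
If you'd rather enforce the keep-alive on the Ollama side, so it applies to every client and not just Cherry Studio, Ollama honours an OLLAMA_KEEP_ALIVE environment variable. A sketch, following the same pattern as the earlier env-var commands:

# Keep loaded models resident for 30 minutes after the last request
[Environment]::SetEnvironmentVariable("OLLAMA_KEEP_ALIVE", "30m", "User")
# Restart the Ollama tray app for the change to take effect.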

Performance — what to actually expect in 2026

Concrete numbers, sourced from public 2026 benchmarks (Hugging Face open-LLM roundup, Local AI Master's Ollama leaderboard, and Microsoft's Phi-4 technical report). All scores are for the unquantised model unless noted.

Model | Size on disk (Q4_K_M) | Min VRAM | HumanEval (code) | MATH | MMLU | Best for
Gemma 3 4B | 3.3 GB | 4 GB | ~63% | ~50% | ~67% | Daily chat on 6–8 GB GPUs
Llama 3.3 8B | 4.9 GB | 6 GB | ~72% | 68.0% | ~71% | General workhorse
Qwen 3 8B | 5.2 GB | 6 GB | ~78% | ~74% | ~73% | Multilingual + tool use
DeepSeek-R1 14B (distill) | 9.0 GB | 10 GB | ~74% | ~83% | ~74% | Reasoning, math, agents
Phi-4 14B | 9.1 GB | 12 GB | ~80% | 80.4% | ~76% | Logic, structured problem solving
Qwen2.5-Coder 14B | 9.0 GB | 12 GB | 89.2% | ~67% | ~70% | Coding (top open model in its size class)
Qwen2.5-Coder 32B | 20 GB | 24 GB | 92.7% | ~75% | ~74% | Coding on 24 GB GPUs (RTX 3090/4090/5090)
Llama 3.3 70B | 43 GB | 48 GB | ~83% | ~77% | ~83% | Best local quality if you have the VRAM

Real-world throughput on a single RTX 4070 (12 GB) running Q4_K_M is roughly 60–90 tok/s for 7B–8B models and 30–45 tok/s for 14B models. CPU-only on a Ryzen 7 5800X tops out around 15 tok/s for a 3B model and is uncomfortably slow above that.
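
To measure throughput on your own machine rather than taking these numbers on faith, Ollama can report it directly. A quick sketch: --verbose prints timing stats (look for "eval rate") after the response, and ollama ps shows how much of the model actually landed on the GPU:

# Token throughput for a one-shot prompt
ollama run gemma3:4b --verbose "Explain TCP slow start in two sentences."

# While a model is loaded: its size and the CPU/GPU split it's running with
ollama ps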

How to choose your model — decision tree

  • Laptop, integrated GPU, ≤16 GB RAM: gemma3:1b or llama3.2:3b. Forget anything bigger.
  • 8 GB VRAM (RTX 3060, 4060, 4070 laptop): gemma3:4b for chat, qwen3:8b for general use, deepseek-r1:7b for reasoning.
  • 12 GB VRAM (RTX 4070, 5070, 3080 12 GB): add phi4:14b and qwen2.5-coder:14b to your roster.
  • 16–24 GB VRAM (RTX 4080/4090, 5080/5090): qwen2.5-coder:32b for coding, deepseek-r1:32b for reasoning.
  • 48 GB+ VRAM (dual-GPU, A6000, RTX 6000 Ada): llama3.3:70b — best local quality there is in 2026 outside of MoE giants.
  • You want vision (images in/out): llava:13b or gemma3:27b (Gemma 3 has native vision in this gen).
  • You want function-calling / MCP tools: qwen3, llama3.3, and deepseek-r1 all behave well — pair them with Cherry Studio's MCP server tab.

Advanced — features worth knowing in Cherry Studio 1.9.x

RAG over your own documents

Drag a PDF, Word doc, Markdown, or whole folder into a chat. Cherry Studio chunks, embeds (using either an Ollama embedding model like nomic-embed-text or a cloud embedder), and retrieves on every turn. Local-only RAG with a local embedder means your documents never leave the machine.
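
If you want the embedding side to stay local too, pull an embedding model first and sanity-check it through the API. A sketch using nomic-embed-text and Ollama's embeddings endpoint (newer builds expose it as /api/embed):

ollama pull nomic-embed-text      # ~274 MB embedding model

$body = @{ model = "nomic-embed-text"; input = "The Goldilocks principle in one line." } | ConvertTo-Json
$resp = Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/embed" `
    -ContentType "application/json" -Body $body
$resp.embeddings[0].Length        # vector dimensionality (768 for nomic-embed-text)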

MCP servers

Settings → MCP. Cherry Studio ships pre-configured connectors (filesystem, web search, GitHub, fetch, memory). Add your own by URL or stdio command. Pair with a tool-using model like qwen3:8b or llama3.3:8b and Cherry Studio will route tool calls automatically.
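
As a concrete example, the reference filesystem MCP server from the modelcontextprotocol project runs over stdio via npx (Node.js required). You can launch it once from PowerShell to confirm it starts, then register the same command and arguments in Cherry Studio's MCP settings; the Documents path here is just an example scope:

# Start the filesystem MCP server over stdio, limited to one folder (Ctrl+C to stop)
npx -y @modelcontextprotocol/server-filesystem "$env:USERPROFILE\Documents"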

Skills (formerly Plugins)

The 1.9.1 rename consolidated the old plugin marketplace into a unified Skills panel — translation, summarisation, prompt templates, and custom JS skills.

CherryClaw autonomous agent

1.9.x ships CherryClaw: a personality-driven agent layer with scheduled tasks and Telegram, Discord, Slack, Feishu, WeChat, and QQ integrations. Useful if you want a 24/7 local-LLM bot doing scheduled work; less useful as a daily chat surface.

Common pitfalls and troubleshooting

  • "Connection refused" on localhost:11434. Ollama tray app isn't running. Open Start → Ollama, or run ollama serve in PowerShell. Check that nothing else (LM Studio's server, a custom proxy) has bound the same port.
  • Model loads into RAM instead of VRAM. Either your VRAM is too small for the chosen quantisation, or your driver is older than 531. Update via GeForce Experience or the AMD installer. Watch nvidia-smi while sending a prompt — VRAM should jump.
  • Throughput collapses after a few turns. Long context fills the KV cache. Cap the context window via the model's num_ctx option (or the OLLAMA_CONTEXT_LENGTH environment variable in recent builds), e.g. 8192, or pick a model with stronger long-context handling (Qwen 3 supports 128K).
  • SmartScreen blocks the Cherry Studio installer. Expected — the project doesn't ship EV-signed binaries. Verify the SHA-256 against the GitHub release page if you're cautious, then "Run anyway".
  • Chinese UI strings appear sporadically. Cherry Studio is a Chinese-led project; some agent prompts and changelog text default to zh-CN. Settings → General → Language → English.
  • "Model not found" in Cherry Studio. The tag has to match the Ollama tag exactly, including the size suffix. Run ollama list and copy the first column verbatim.
  • RAG returns junk on long PDFs. Switch the embedding model to nomic-embed-text or mxbai-embed-large — both run cheaply on Ollama and beat the default tiny embedder.
  • AMD GPU not detected. Confirm your card is on the Windows ROCm v6.1 list (RX 7000, 6000 series, select W-series). Otherwise enable Vulkan path or fall back to CPU.
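
For the first item above (connection refused or a port conflict), this quick check shows whether anything is listening on 11434 and which process owns it:

# Is anything bound to 11434, and what process is it?
$conn = Get-NetTCPConnection -LocalPort 11434 -State Listen -ErrorAction SilentlyContinue
if ($conn) { Get-Process -Id $conn.OwningProcess | Select-Object Id, ProcessName }
else { "Nothing is listening on 11434 - start Ollama." }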

Privacy and security notes

  • By default, Ollama binds 127.0.0.1 only — local-loopback, not on the LAN. Setting OLLAMA_HOST=0.0.0.0 exposes it; do this only on trusted networks.
  • Ollama has no auth. If you need auth + multi-user, put it behind a reverse proxy (Caddy, Nginx) or use Open WebUI's auth layer in front.
  • Cherry Studio stores chat history locally in an Electron sqlite file under %APPDATA%\CherryStudio. Back this up if your conversations matter.
  • Both projects are open source — Cherry Studio (AGPL-3.0) at CherryHQ/cherry-studio, Ollama (MIT) at ollama/ollama.

When local-LLM stacks make sense in production

Local Cherry Studio + Ollama is a great fit for individual engineers, R&D prototyping, and any workflow where data residency rules out cloud calls. It is not a drop-in replacement for production multi-user LLM serving — for that you'd reach for vLLM, TGI, or NVIDIA NIM. Teams shipping LLM-backed product features still typically want a senior backend engineer who's done the inference-serving work before; if that's where your roadmap is, Codersera helps companies extend engineering teams with vetted remote developers who already have the local-LLM and ML-infra reps in.

FAQ

Is Cherry Studio free?

Yes. The desktop app is AGPL-3.0 and free for personal and team use. There's a paid Enterprise Edition for commercial deployments that need centralised admin, SSO, and audit. Cloud-model usage costs whatever the underlying provider charges (OpenAI, Anthropic, etc.); local Ollama models are free.

Can Cherry Studio run fully offline?

Yes — once you've pulled an Ollama model and (optionally) a local embedding model, you can disconnect from the internet and continue chatting, doing RAG, and using local MCP servers. Cloud providers and any MCP server that talks to a remote API obviously won't work offline.

Cherry Studio vs LM Studio vs Open WebUI — which should I pick?

LM Studio is best if you want a built-in model browser and don't need cloud-provider mixing. Open WebUI is best if you want a multi-user web UI behind your home server. Cherry Studio is best if you want a single desktop app that mixes local Ollama with cloud LLMs (Claude 4.7, GPT-5.5, Gemini), plus first-class MCP, RAG, and agent support.

Why is Llama 2 missing from your model list?

It's superseded. Llama 3.3 8B beats Llama 2 13B on essentially every benchmark while being smaller. Use Llama 3.3, Llama 3.2, or Llama 3.1 (still maintained) instead. The llama2 tag in Ollama still works for reproducibility but isn't a 2026 recommendation.

Does Cherry Studio support image generation?

Yes — through cloud providers (OpenAI, Stability) and any local image model exposed via a compatible API. Pure-local image gen typically goes through ComfyUI or Stable Diffusion WebUI rather than Ollama, since Ollama focuses on language models.

Can I use Ollama from Cherry Studio on a different machine?

Yes. Set OLLAMA_HOST=0.0.0.0:11434 on the Ollama machine, open the firewall, and in Cherry Studio set API Host to http://<that-machine-ip>:11434. Watch the no-auth caveat — only do this on networks you control.
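
Before pointing Cherry Studio at the remote box, it's worth checking reachability from the client machine; 192.168.1.50 below is a placeholder for your Ollama host's address:

# From the Cherry Studio machine: can we reach the remote Ollama port?
Test-NetConnection -ComputerName 192.168.1.50 -Port 11434

# And does the API answer?
Invoke-RestMethod "http://192.168.1.50:11434/api/version"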

How big a model can a 16 GB MacBook / 16 GB Windows laptop run?

For shared system RAM (no dedicated GPU), stick to 7B–8B models in Q4_K_M. They'll fit but run at 5–15 tok/s on CPU. With a discrete GPU at 8 GB VRAM, the same 7B–8B models run at 60+ tok/s. 13B–14B in Q4_K_M needs 10–12 GB VRAM to be enjoyable.

What's the catch with DeepSeek-R1 distill models?

They're "distilled" — DeepSeek-R1's reasoning behaviour was used to fine-tune existing Llama and Qwen base models. They reproduce a lot of the chain-of-thought style at much smaller sizes (7B, 14B, 32B) than the 671B original. Quality is genuinely good for math, logic, and coding; expect verbose <think>-tagged reasoning before the final answer.

References & further reading