Last updated April 2026 — refreshed for current Cherry Studio v1.9.x and Ollama 0.22.x releases, current models (Llama 4, DeepSeek V4, Qwen 3, Gemma 4), and Ubuntu 24.04 LTS as the default target.
This is a working setup guide for running large language models entirely on your own Ubuntu box, with Cherry Studio as the desktop chat UI and Ollama as the local inference engine. No cloud, no API keys, no per-token billing — just a native AppImage talking to http://localhost:11434. Tested on Ubuntu 22.04 LTS and 24.04 LTS as of April 2026.
What changed in 2026 (read this first if you set this up before)
- Cherry Studio v1.9.3 (released April 24, 2026) ships .AppImage, .deb and .rpm for both x86_64 and arm64. The old separate "x86_64 AppImage" naming used in 2025 guides is gone; the current arm64 AppImage is around 207 MB.
- Ollama 0.22.0 (April 28, 2026) added a built-in web-search tool (OpenClaw), better "thinking" model controls, and first-class support for Nemotron 3 Omni, Poolside Laguna XS.2, GLM-5, Kimi-K2.5, MiniMax, DeepSeek V4 and gpt-oss alongside the existing Llama, Qwen and Gemma families.
- Recommended models have moved on. Llama 3 → Llama 4, DeepSeek V3 → deepseek-v4 / deepseek-v4-flash, Qwen 2.5 → Qwen 3, Gemma 3 → Gemma 4. The old ollama pull llama3 still works but is no longer the right default.
- Hardware floor has crept up. 16 GB RAM and 30 GB disk is enough for 7–8B class models in Q4. For Llama 4 70B / DeepSeek V4 you want 64 GB+ system RAM or a 24 GB+ VRAM GPU.
- Ubuntu 24.04 LTS is now the default target. Both 22.04 and 24.04 work; 20.04 is end-of-standard-support and not recommended.
- The Ollama install script now provisions a dedicated ollama system user and a real systemd unit on Ubuntu, so you no longer need to babysit a foreground ollama serve.
Want the full picture? Read our continuously-updated Llama 4 Complete Guide (2026) — Scout and Maverick variants, MoE architecture, and deployment patterns.
TL;DR
| What | Command / value |
|---|---|
| Install Ollama | curl -fsSL https://ollama.com/install.sh \| sh |
| Pull a current model | ollama pull llama4 or ollama pull qwen3 |
| Ollama API | http://localhost:11434 (OpenAI-compatible) |
| Cherry Studio download | github.com/CherryHQ/cherry-studio/releases — pick .AppImage or .deb |
| Cherry Studio provider | Settings → Model Services → Ollama, API base http://localhost:11434, no key |
| Min spec for 7–8B Q4 | Ubuntu 22.04+, 16 GB RAM, 30 GB disk |
| Recommended for 70B class | 64 GB RAM or a 24 GB+ VRAM GPU (RTX 4090 / 5090, A6000) |
Why Cherry Studio + Ollama (in 2026)
You can run a model from the terminal with ollama run, and you can put Open WebUI in front of it. Cherry Studio sits in a different niche: it's a native desktop app (Electron, but a polished one — over 40k GitHub stars) that lets you keep multiple providers side by side. The same window can talk to your local Ollama instance, OpenAI, Anthropic, Gemini, DeepSeek, Groq and ~300 other models, with shared chat history, prompt templates, file attachments, knowledge bases, and an agent / mini-program system.
Concrete reasons teams pick this combo over Open WebUI or LM Studio:
- Native UI feel. AppImage launches in a second; no Docker, no browser tab, no port collisions with other web apps you run.
- Multi-provider in one history. Switch from llama4 on Ollama to GPT-5.5 mid-thread for a sanity check without copy-pasting.
- Knowledge bases and file types. Drag in PDFs, Office files, images; Cherry Studio handles ingestion and embeddings.
- Free and source-available. AGPL-style license; no seat tax.
For a deeper architectural view of running local agents end-to-end, see the OpenClaw + Ollama setup guide for running local AI agents — Cherry Studio is the chat front-end; OpenClaw is the agent layer that drives Ollama as a tool-using runtime.
Prerequisites
- OS: Ubuntu 22.04 LTS or 24.04 LTS (or Debian 12 / Pop!_OS 22.04+). Anything older than 22.04 is unsupported in 2026.
- CPU: 64-bit x86_64 with AVX2, or arm64 (Ampere / Apple M-series via VM). AVX-512 helps but is not required.
- RAM: 16 GB minimum for 7–8B Q4 models, 32 GB for 13–14B, 64 GB+ for 70B class CPU inference.
- GPU (optional but recommended): NVIDIA GPU with 8 GB+ VRAM and a recent CUDA driver (12.x). Ollama auto-detects and uses it; AMD ROCm is supported on RX 7000/9000 series.
- Disk: 30 GB free for one mid-size model and Cherry Studio. Each 7B Q4 GGUF is ~4–5 GB; 70B Q4 is ~40 GB.
- Network: Needed only for the initial install and model pulls. Inference itself runs offline.
- FUSE: AppImages need libfuse2 on 22.04, libfuse2t64 on 24.04.
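The RAM and CPU requirements above can be checked before you install anything. The following is a minimal sketch, not an official preflight tool: the function names are hypothetical, the thresholds are simply the RAM figures from the list, and the RAM reading assumes a Linux /proc filesystem.

```shell
#!/bin/sh
# Hypothetical helper names; thresholds are the RAM figures from the prerequisites list.

# Print the largest model class suggested for a given RAM size (whole GiB).
model_class_for_ram() {
    if [ "$1" -ge 64 ]; then
        echo "70B class"
    elif [ "$1" -ge 32 ]; then
        echo "13-14B"
    elif [ "$1" -ge 16 ]; then
        echo "7-8B"
    else
        echo "below minimum"
    fi
}

# Installed RAM rounded to the nearest GiB, read from /proc/meminfo (Linux only).
# MemTotal is in kB and excludes memory reserved by firmware or an iGPU.
installed_ram_gib() {
    awk '/^MemTotal:/ { printf "%d\n", ($2 + 524288) / 1048576 }' /proc/meminfo
}

# Succeeds if the CPU advertises AVX2 (only meaningful on x86_64).
has_avx2() {
    grep -q avx2 /proc/cpuinfo
}

# Example: model_class_for_ram "$(installed_ram_gib)"
```

On a machine where firmware or an integrated GPU reserves a large slice of memory, the rounded figure can land just under a threshold, so treat a "below minimum" result on a nominal 16 GB box as a prompt to look at free -h rather than a hard verdict.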
Step 1: Install Ollama on Ubuntu
Use the official one-liner. It creates a dedicated ollama system user, drops the binary into /usr/local/bin, and registers a systemd unit that auto-starts on boot.
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
systemctl status ollama --no-pager
Expected output: ollama version is 0.22.x and an active (running) systemd service listening on 127.0.0.1:11434.
If you ever need to expose it on the LAN (don't do this on an untrusted network — Ollama has no auth), drop a systemd override:
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
GPU detection
If you have an NVIDIA card, install the proprietary driver (sudo ubuntu-drivers install) and reboot. Confirm with nvidia-smi. Ollama 0.22 will pick it up automatically; you'll see cuda in journalctl -u ollama. For AMD, install the ROCm 6.2+ packages from AMD's repo before installing Ollama.
Step 2: Pull a current 2026 model
Don't blindly default to llama3 in 2026 — it's not a bad model, but the cost/quality frontier moved. Pick one for your hardware:
| Model | Pull command | RAM/VRAM (Q4) | Best for |
|---|---|---|---|
| Llama 4 8B | ollama pull llama4 | ~6 GB | General chat, coding helper, tight VRAM |
| Qwen 3 14B | ollama pull qwen3:14b | ~10 GB | Multilingual, long context, agents |
| DeepSeek V4 Flash | ollama pull deepseek-v4-flash | ~14 GB (MoE, 13B active) | Reasoning, math, code, 1M-token context |
| Gemma 4 9B | ollama pull gemma4 | ~7 GB | Tool calling, structured output, multimodal |
| gpt-oss 20B | ollama pull gpt-oss | ~14 GB | OpenAI-style local fallback |
| Llama 3.2 3B | ollama pull llama3.2 | ~3 GB | Laptops, fast iteration, the most-pulled small model on Ollama |
Pull a model and sanity-check it from the CLI before wiring up the GUI:
ollama pull llama4
ollama run llama4 "Summarise the PEP 8 line-length rule in one sentence."
ollama list
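Since Cherry Studio talks to Ollama over HTTP, it can also help to exercise the HTTP API directly, not just the CLI. The sketch below assumes a default local install on port 11434 and uses Ollama's /api/generate endpoint with streaming turned off; build_generate_payload and ask_ollama are hypothetical helper names, and the escaping is deliberately minimal.

```shell
#!/bin/sh
# Build the JSON body for POST /api/generate.
# Only backslashes and double quotes are escaped, so keep prompts on one line.
build_generate_payload() {
    model=$1
    prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"model":"%s","prompt":"%s","stream":false}\n' "$model" "$prompt"
}

# Send one non-streaming prompt to a local Ollama server and print the raw JSON reply.
# Assumes the default address http://localhost:11434.
ask_ollama() {
    curl -fsS http://localhost:11434/api/generate \
        -H 'Content-Type: application/json' \
        -d "$(build_generate_payload "$1" "$2")"
}

# Example: ask_ollama llama4 "Summarise the PEP 8 line-length rule in one sentence."
```

If ask_ollama returns JSON with a response field, the server side is healthy and any remaining problem is in the GUI configuration.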
Step 3: Install Cherry Studio v1.9.x
Grab the latest release from github.com/CherryHQ/cherry-studio/releases. As of this update the current release is v1.9.3 (April 24, 2026).
Option A — .deb (recommended on Ubuntu)
wget https://github.com/CherryHQ/cherry-studio/releases/download/v1.9.3/Cherry-Studio-1.9.3-amd64.deb
sudo apt install ./Cherry-Studio-1.9.3-amd64.deb
cherry-studio
This drops a desktop entry, sets file associations and handles dependencies. Use the arm64 .deb on ARM machines.
Option B — AppImage (portable, no root)
# For arm64 systems; replace with the amd64 AppImage URL when published for your release
wget https://github.com/CherryHQ/cherry-studio/releases/download/v1.9.3/Cherry-Studio-1.9.3-arm64.AppImage
chmod +x Cherry-Studio-1.9.3-arm64.AppImage
./Cherry-Studio-1.9.3-arm64.AppImage
If the AppImage fails silently, you're missing FUSE:
# Ubuntu 22.04
sudo apt install libfuse2
# Ubuntu 24.04
sudo apt install libfuse2t64
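If you script this step across several machines, the package choice can be derived from the release number. This is a small sketch covering only the two releases this guide names; fuse_pkg_for and current_ubuntu_version are hypothetical helper names, and anything other than 22.04 or 24.04 is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Map an Ubuntu release number to the FUSE 2 package name given in this guide.
fuse_pkg_for() {
    case $1 in
        22.04) echo libfuse2 ;;
        24.04) echo libfuse2t64 ;;
        *) echo unknown; return 1 ;;
    esac
}

# Read VERSION_ID from /etc/os-release in a subshell so nothing leaks into the caller.
current_ubuntu_version() {
    ( . /etc/os-release && echo "$VERSION_ID" )
}

# Example: sudo apt install "$(fuse_pkg_for "$(current_ubuntu_version)")"
```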
Step 4: Connect Cherry Studio to Ollama
- Launch Cherry Studio.
- Click the gear icon (bottom-left) to open Settings.
- Open Model Services (called "Model Providers" in older builds).
- Find Ollama in the provider list and toggle it on.
- Set:
  - API Address: http://localhost:11434 (note: no trailing /v1 needed; Cherry Studio appends it)
  - API Key: leave blank (Ollama has no auth)
  - Keep Alive: 30m is a sane default; it keeps the model resident between messages so you don't pay reload latency on every send
- Click + Add under Models and type the exact model name (llama4, qwen3:14b, deepseek-v4-flash); it must match what ollama list shows.
- Open a new chat, pick Ollama as the provider, pick your model, send "ping". You should see streaming tokens within a second or two.
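A common slip in this step is pasting an address that already ends in /v1 or a trailing slash. The sketch below shows one way to tidy the value before entering it; normalize_api_base is a hypothetical helper name, and the only assumption is the one stated above, that Cherry Studio appends /v1 itself.

```shell
#!/bin/sh
# Strip trailing slashes and a trailing /v1 so the value matches what the
# Ollama provider field expects (the client adds /v1 on its own).
normalize_api_base() {
    base=$1
    while [ "${base%/}" != "$base" ]; do
        base=${base%/}
    done
    base=${base%/v1}
    printf '%s\n' "$base"
}

# Example: list the models the server reports, using Ollama's /api/tags endpoint:
#   curl -fsS "$(normalize_api_base http://localhost:11434/v1/)/api/tags"
```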
How to choose the right model
A simple decision path:
- Laptop, 16 GB RAM, no dGPU → llama3.2 (3B) or gemma4:9b at Q4. Fast, usable, doesn't thrash swap.
- Desktop, 32 GB RAM, RTX 3060 / 4060 (8–12 GB) → llama4 (8B) or qwen3:14b at Q4_K_M.
- Workstation, 64 GB RAM, RTX 4090 / 5090 (24 GB+) → deepseek-v4-flash, llama4:70b at Q4, or Qwen 3 32B at Q5.
- You need long context (100k+ tokens) → deepseek-v4-flash (1M context) or Qwen 3 with extended-context tags.
- You're building an agent that calls tools → Gemma 4 or Qwen 3; both have native tool-calling that survives in JSON mode.
- Coding-only workload → Qwen 2.5-Coder 32B is still the best single-GPU coding model in Q1 2026 (92.7% HumanEval), though Llama 4 is closing fast.
Performance — what to expect in 2026
Approximate token-generation rates (prompt eval excluded) on Ubuntu 24.04, Ollama 0.22, Q4 quants:
| Hardware | Llama 4 8B Q4 | Qwen 3 14B Q4 | Llama 4 70B Q4 |
|---|---|---|---|
| CPU only, Ryzen 9 7950X, 64 GB DDR5 | ~12–15 tok/s | ~7–9 tok/s | ~1.5–2 tok/s |
| RTX 4070 Ti (12 GB) | ~70–90 tok/s | partial offload, ~25–30 tok/s | CPU+GPU split, ~3 tok/s |
| RTX 4090 (24 GB) | ~110–140 tok/s | ~55–70 tok/s | partial offload, ~6–8 tok/s |
| RTX 5090 (32 GB) | ~150+ tok/s | ~80–95 tok/s | ~15–20 tok/s (full offload) |
Numbers are rough and depend heavily on quant level, batch size and context length. Independent r/LocalLLaMA threads quote ~55 tok/s for Llama 3.1 8B on a single modern GPU, which lines up with what we see for Llama 4 8B. Treat these as order-of-magnitude, not benchmarks; your kernel/driver/quant combination will shift them.
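To get a figure for your own hardware, you can compute the generation rate from the counters Ollama returns. The final JSON object from /api/generate includes eval_count (tokens generated) and eval_duration (nanoseconds); the sketch below just divides one by the other. tokens_per_second is a hypothetical helper name, and extracting the two fields from the JSON is left to whatever tool you prefer.

```shell
#!/bin/sh
# Generation rate in tokens per second.
#   $1 = eval_count (tokens generated)
#   $2 = eval_duration (nanoseconds)
tokens_per_second() {
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Example: 100 tokens generated in 2 seconds
tokens_per_second 100 2000000000
```

Because eval_duration excludes prompt evaluation, this matches the "prompt eval excluded" convention used in the table.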
What was removed and why
- Default llama3 recommendation: superseded by Llama 4 as of the April 2026 update. llama3 still pulls fine, but isn't the right default.
- Hard-coded Cherry Studio v1.0.x AppImage URL: Cherry-AI's release naming and asset layout changed; always grab the latest from the releases page.
- Ubuntu 20.04 mention: out of standard support; the install script's libstdc++ requirement no longer matches stock 20.04.
- "libfuse2" universal advice: on 24.04 the package is libfuse2t64 (the t64 transition).
Common pitfalls and troubleshooting
AppImage exits silently
Run it from a terminal and watch stderr. 90% of the time it's missing FUSE — install libfuse2 (22.04) or libfuse2t64 (24.04). Less commonly, the file isn't executable; chmod +x it.
Cherry Studio shows "connection refused"
curl http://localhost:11434
# Should print: Ollama is running
sudo systemctl status ollama
journalctl -u ollama -n 50 --no-pager
If the service is down, sudo systemctl start ollama. If it's running but Cherry Studio still can't reach it, check that you didn't put https:// in the API address — Ollama is plain HTTP locally — and that no firewall (ufw) is blocking 11434 on loopback.
"Model not found" in Cherry Studio
Cherry Studio doesn't auto-discover models. The name in + Add must match ollama list exactly, including the tag (qwen3:14b, not qwen3-14b).
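A quick way to confirm a name before typing it into Cherry Studio is to compare it against the first column of ollama list. The sketch below reads that output on stdin; model_installed is a hypothetical helper name, and it assumes the usual layout of a header row followed by one model per line. An untagged name is checked as name:latest, since that is the tag Ollama assigns when none is given.

```shell
#!/bin/sh
# Succeeds if the given model name appears in `ollama list` output on stdin.
# A name without a tag is compared as "<name>:latest".
model_installed() {
    want=$1
    case $want in
        *:*) ;;
        *) want="$want:latest" ;;
    esac
    # Skip the header row, then compare the NAME column exactly.
    awk -v want="$want" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# Example: ollama list | model_installed qwen3:14b && echo "ok to add"
```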
OOM kill or swap thrashing
You're trying to run a model that doesn't fit. Drop a quant level (Q5 → Q4_K_M → Q4_0), or pick a smaller variant. Watch free -h and nvidia-smi while the first prompt streams.
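A back-of-the-envelope estimate helps decide whether a model will fit before you spend time pulling it. The sketch below uses weights ≈ parameters × bits per weight ÷ 8, which lines up with the ~4–5 GB (7B) and ~40 GB (70B) Q4 figures quoted in the prerequisites. The 4.5 bits-per-weight value for Q4 and the 20% allowance for context and runtime overhead are assumptions for illustration, not measured values, and both function names are hypothetical.

```shell
#!/bin/sh
# Approximate weight size in GB.
#   $1 = parameters in billions, $2 = average bits per weight (assumed ~4.5 for Q4)
estimate_model_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Succeeds if the estimate plus an assumed 20% overhead fits in $3 GB of RAM or VRAM.
fits_in_memory() {
    awk -v p="$1" -v b="$2" -v avail="$3" \
        'BEGIN { exit !(p * b / 8 * 1.2 <= avail) }'
}

# Example: a 70B model at ~4.5 bits per weight against 16 GB of RAM
fits_in_memory 70 4.5 16 || echo "70B Q4 will not fit in 16 GB"
```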
Slow first token, fast after
Cold model load. Set Keep Alive to 30m–1h in Cherry Studio, or set OLLAMA_KEEP_ALIVE=1h as an Environment= line in the Ollama systemd override.
GPU is installed but Ollama uses CPU
Check nvidia-smi shows the driver, then journalctl -u ollama — you should see "CUDA driver detected". On AMD, you need ROCm 6.2+ and a supported card; consumer Radeon support is partial.
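Scanning the journal by eye gets tedious, so a keyword filter can help. The sketch below only looks for the strings "cuda" and "rocm" anywhere in the log text it is given; the exact wording of Ollama's log lines varies between releases, so treat the result as a hint rather than proof. gpu_backend_from_log is a hypothetical helper name.

```shell
#!/bin/sh
# Classify log text on stdin by keyword. This is a heuristic, not a parser:
# it reports the first backend name it can find, case-insensitively.
gpu_backend_from_log() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -qi 'cuda'; then
        echo cuda
    elif printf '%s\n' "$log" | grep -qi 'rocm'; then
        echo rocm
    else
        echo cpu-or-unknown
    fi
}

# Example: journalctl -u ollama --no-pager | gpu_backend_from_log
```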
Advanced: keeping the stack tidy
Manage your model zoo
ollama list # what's installed
ollama ps # what's loaded in memory right now
ollama pull qwen3:14b # add another
ollama rm llama3 # free disk
ollama show llama4 --modelfile # inspect the Modelfile
Bake a system prompt into a model
Cherry Studio supports per-chat system prompts, but if you want a "company assistant" model that's always grounded the same way, create a Modelfile:
FROM llama4
PARAMETER temperature 0.3
SYSTEM "You are a concise senior engineer. Answer in <=200 words. Cite file paths when relevant."
ollama create senior-eng -f Modelfile
ollama run senior-eng
Then add senior-eng in Cherry Studio's Ollama provider.
Run Ollama on a non-default port
Edit the systemd override (sudo systemctl edit ollama), set OLLAMA_HOST=127.0.0.1:12345, restart, and update Cherry Studio's API base to match.
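For unattended setups, the same override can be written without the interactive editor. The sketch below only renders the drop-in text; render_ollama_override is a hypothetical helper name, and the target path in the comment is the standard location that systemctl edit uses for a unit's override file.

```shell
#!/bin/sh
# Print a systemd drop-in that binds Ollama to the given host and port.
#   $1 = bind address, $2 = port
render_ollama_override() {
    printf '[Service]\nEnvironment="OLLAMA_HOST=%s:%s"\n' "$1" "$2"
}

# Example (writes the drop-in, then reloads and restarts the service):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   render_ollama_override 127.0.0.1 12345 |
#       sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null
#   sudo systemctl daemon-reload && sudo systemctl restart ollama
render_ollama_override 127.0.0.1 12345
```

Keeping the bind address on 127.0.0.1 preserves the loopback-only default; swap in 0.0.0.0 only under the LAN caveats from Step 1.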
FAQ
Is Cherry Studio free?
Yes. The desktop app is open source on GitHub (40k+ stars) and free to use. You only pay if you bring your own paid API keys for hosted providers like OpenAI or Anthropic. Ollama is also free.
Does Cherry Studio send my chats to a cloud server?
When you use the Ollama provider, no — every token stays on your machine. Cherry Studio only talks to whatever provider you configure. Add OpenAI as a provider and use it, and that traffic obviously goes to OpenAI.
Why pick Cherry Studio over Open WebUI or LM Studio?
Cherry Studio is native (no Docker, no browser tab) and lets you mix local Ollama models with cloud providers in the same chat history with shared knowledge bases. Open WebUI is browser-first and Docker-native — better if you want to expose it to a team. LM Studio is Mac/Windows-leaning and bundles its own inference engine; Cherry Studio + Ollama keeps inference and UI separate.
Can I run this on a server and access it from another machine?
Cherry Studio is a desktop app, so not remotely. But you can run Ollama on a beefy Linux box, point Cherry Studio (running on your laptop) at http://server-ip:11434, and put the Ollama port behind a VPN or SSH tunnel. Don't expose Ollama to the public internet — it has no authentication.
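The SSH tunnel option can be sketched as a local port forward, so that Cherry Studio on the laptop keeps using http://localhost:11434 while the traffic travels over SSH. tunnel_command is a hypothetical helper that only prints the command, and me@gpu-box is a placeholder for your own login and host.

```shell
#!/bin/sh
# Print an ssh command that forwards a local port to Ollama on the remote host.
#   $1 = user@host of the server running Ollama
#   $2 = local port to listen on (defaults to 11434)
tunnel_command() {
    printf 'ssh -N -L %s:localhost:11434 %s\n' "${2:-11434}" "$1"
}

# Example with a placeholder host:
tunnel_command me@gpu-box
```

With the tunnel running, Ollama on the server can stay bound to 127.0.0.1, which avoids exposing the unauthenticated port on the network at all.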
How big a model can I really run on 16 GB RAM?
Comfortably: 7–9B parameter models at Q4 (Llama 4 8B, Gemma 4 9B). Tightly: 13–14B at Q3. Beyond that you'll be swapping. If you have a GPU with 8–12 GB VRAM, Ollama will offload as much as fits and run the rest on CPU.
What's the difference between Ollama 0.21 and 0.22?
0.22.0 (April 28, 2026) added a built-in web-search tool (OpenClaw integration), better controls for "thinking"/reasoning models, MLX performance fixes, and support for Nemotron 3 Omni and Poolside Laguna XS.2. If you set up Ollama in 2025, run curl -fsSL https://ollama.com/install.sh | sh to upgrade in place.
Can Cherry Studio do RAG over my documents?
Yes. Cherry Studio has a Knowledge Base feature: drag in PDFs, Word docs, Markdown, images. It chunks and embeds them, and the chat can retrieve from them. Pair it with a local embedding model in Ollama (nomic-embed-text or similar) for a fully local pipeline.
If you'd rather not run this yourself
Setting up local LLM infrastructure is straightforward; productionising it (auth, rate limits, multi-tenant access, SOC 2 paperwork, GPU autoscaling) is not. Codersera places vetted remote developers with practical Linux, Ollama, vLLM and on-prem AI experience — useful when "let's run our own models" graduates from a side project to a stack your team actually depends on.
References & further reading
- CherryHQ/cherry-studio — Releases (GitHub)
- CherryHQ/cherry-studio — main repository
- Cherry Studio docs — Ollama provider configuration
- ollama/ollama — Releases (GitHub)
- Ollama Linux installation docs
- Ollama model library
- FOSS Force — Run your favourite LLMs on Linux with Cherry Studio (Feb 2026)
- r/LocalLLaMA — Cherry Studio: a desktop client supporting multiple providers
- Latent Space — Top Local Models List, April 2026
- Codersera — OpenClaw + Ollama setup guide for running local AI agents (pillar)
- Codersera — Install and Run Cherry Studio on Linux Ubuntu: A Complete Guide
- Codersera — Installing and running Cherry Studio with Ollama on a Mac
- Codersera — Install and Run Cherry Studio on Windows: A Complete Guide