This report explains how to build a modern, completely local AI stack that runs entirely on your CPU using three components: the Qwen3.5‑0.8B language model, the Ollama runtime, and the OpenClaw agent gateway.
The goal is a capable everyday assistant with full privacy, no per‑token costs, and the ability to work offline.
Alibaba’s Qwen 3.5 series is a family of open‑source multimodal language models designed to bring "flagship level" intelligence to smaller sizes, from 0.8B to 9B parameters in the compact tier and much larger models in the medium and flagship tiers.
At the small end, Alibaba explicitly markets Qwen3.5‑0.8B and 2B as optimized for phones, laptops, and edge devices, highlighting faster speed and smaller memory footprint compared to the 4B and 9B variants.
For a CPU‑only setup, smaller parameter counts mean lower memory use, faster token generation, and shorter model load times.
You will not get the same deep reasoning quality as the 4B or 9B versions, but for everyday chat, basic coding help, and lightweight analysis, Qwen3.5‑0.8B performs far above what earlier sub‑1B models could achieve.
Ollama is a local LLM runtime that wraps the performant llama.cpp backend in a friendly CLI and HTTP API, with an integrated model library.
- `ollama pull` downloads a model; `ollama run` starts a chat with it.
- An HTTP API is exposed at `http://localhost:11434`, making it easy to integrate with tools such as OpenClaw or Open WebUI.

Guides and community tests often describe Ollama as the fastest path from zero to a working local LLM in 2026, especially for developers who are comfortable with the terminal.
While much Ollama marketing focuses on GPU acceleration, it also runs entirely on CPU.
If no supported GPU is detected, Ollama falls back to a CPU‑only llama.cpp build. For this guide, CPU‑only mode is sufficient for Qwen3.5‑0.8B on most laptops and desktops.
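Because all inference happens on the CPU, thread count is the main performance knob. Ollama's chat endpoint accepts an `options` object on each request; the sketch below builds a request body that pins the thread count via Ollama's `num_thread` parameter (the model name and thread value are illustrative assumptions for this stack):

```python
import json

def cpu_chat_payload(prompt, model="qwen3.5:0.8b", threads=8):
    """Build a /api/chat request body that pins the CPU thread count."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # num_thread controls how many CPU threads the llama.cpp backend
        # uses; matching your physical core count is a good starting point.
        "options": {"num_thread": threads},
    }

print(json.dumps(cpu_chat_payload("Hello", threads=8), indent=2))
```

Setting `num_thread` above your physical core count rarely helps; hyperthreaded logical cores tend to give diminishing returns for inference.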
OpenClaw is an open‑source local AI gateway and agent framework. It sits between LLMs and local or cloud tools, allowing workflows that can read and write files, run scripts, call APIs, and maintain long‑term memory.
OpenClaw is often positioned as a more security‑aware, locally focused alternative to cloud‑hosted agent platforms, aiming to keep data and execution on your own machine.
Putting these three components together gives a powerful local stack: Qwen3.5‑0.8B provides the intelligence, Ollama serves it through a simple local API, and OpenClaw orchestrates tools, files, and memory around it.
This makes the stack particularly attractive for privacy‑sensitive users, developers who want full control over their tooling, and anyone who needs an assistant that keeps working offline.
Based on official docs, community guides, and Qwen 3.5 installation examples via Ollama:
| Component | Minimum for Qwen3.5‑0.8B (CPU‑only) | Recommended for smoother experience |
|---|---|---|
| OS | Windows 10+, macOS, or modern Linux | Same, ideally recent macOS/Linux kernel |
| CPU | Any 4‑core CPU from last 5–7 years | 8+ cores (e.g., Ryzen 5/7, i5/i7 10th gen+) |
| RAM | 8 GB total (2–3 GB free) | 16 GB+ |
| Storage | ~4 GB free for tools + model | 10+ GB to try more models |
| GPU | Not required | Optional; CPU‑only works fine for 0.8B |
Qwen3.5‑0.8B specifically is advertised as a model that can "run on almost anything," with the Ollama guide citing ~500 MB of storage and minimal hardware.
The same guide provides a concise hardware table for the small models when run with Ollama:
| Model | Parameters | Approx. model size (quantized) | Minimum RAM/VRAM | Typical use case |
|---|---|---|---|---|
| Qwen3.5‑0.8B | 0.8B | ~500 MB | 2 GB | Phones, basic laptops, edge devices |
| Qwen3.5‑2B | 2B | ~1.5 GB | 4 GB | Lightweight agents, enhanced reasoning |
| Qwen3.5‑4B | 4B | ~2.5 GB | 6 GB | General‑purpose laptop assistant |
| Qwen3.5‑9B | 9B | ~5 GB | 8 GB | Higher quality reasoning on stronger machines |
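The sizes in the table follow from a simple rule of thumb: a quantized model occupies roughly parameters × bits‑per‑weight ÷ 8 bytes on disk. The sketch below uses ~4.5 bits per weight as a stand‑in for a typical 4‑bit quantization plus metadata overhead; that figure is an assumption, and actual file sizes vary by quantization format.

```python
def quantized_size_gb(params_billion, bits_per_weight=4.5):
    """Rough on-disk size of a quantized model, in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for p in (0.8, 2, 4, 9):
    print(f"{p}B parameters -> ~{quantized_size_gb(p):.2f} GB")
```

For 0.8B parameters this gives about 0.45 GB, in line with the ~500 MB figure cited above; for 9B it gives about 5 GB, matching the table.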
For this article, the focus is on 0.8B because it guarantees good CPU‑only performance on mainstream hardware.
On Windows, download `OllamaSetup.exe` from the download page and run it, then verify the install in PowerShell:

```powershell
ollama -v
ollama list
```
The API will be available at http://localhost:11434 once the service is running.
On macOS, drag `Ollama.app` to Applications and launch `Ollama.app` once; it starts the background service. Verify in a terminal:

```bash
ollama -v
ollama list
```
On Linux, the official install script is the easiest path:
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama -v
ollama list
```
Alternatively, download the standalone ollama-linux-amd64 binary from the download page and run it directly without root:
```bash
./ollama-linux-amd64 serve &
./ollama-linux-amd64 run llama2
```
This approach is commonly used on clusters or locked‑down servers where sudo is not available.
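Whichever install path you used, you can confirm the service is reachable from code as well as from the CLI. Ollama's `GET /api/tags` endpoint lists installed models; this stdlib‑only Python sketch assumes the default port:

```python
import json
import urllib.request

def names_from_tags(body):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in body.get("models", [])]

def list_models(base_url="http://localhost:11434"):
    """Ask the running Ollama service which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return names_from_tags(json.load(resp))

# list_models()  # requires the Ollama service to be running
```

An empty list simply means the service is up but no models have been pulled yet.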
Once Ollama is installed, pulling Qwen3.5‑0.8B is a single command.
Ollama exposes Qwen 3.5 0.8B as a named library entry:
```bash
ollama pull qwen3.5:0.8b
```
The official library page describes Qwen 3.5 as a family of open‑source multimodal models and includes the 0.8B variant as a ready‑to‑run model.
Start a chat session directly in the terminal:
```bash
ollama run qwen3.5:0.8b
```
Type a simple prompt such as:
Explain what Qwen3.5‑0.8B is in two sentences.
Exit with `/bye` (or Ctrl+D) when done.
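The same conversation works over the HTTP API, which is exactly how OpenClaw will drive the model later. Here is a minimal non‑streaming sketch against Ollama's `/api/chat` endpoint (stdlib only; assumes the default port and that the model has already been pulled):

```python
import json
import urllib.request

def extract_reply(body):
    """Pull the assistant's text out of a non-streaming /api/chat response."""
    return body["message"]["content"]

def chat(prompt, model="qwen3.5:0.8b", base_url="http://localhost:11434"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object, not a JSON-lines stream
    }
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# chat("Explain what Qwen3.5-0.8B is in two sentences.")
```

With `"stream": False` the service returns a single JSON object whose `message.content` field holds the full reply; by default the endpoint streams JSON lines instead.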
To confirm that the model is usable for basic assistant tasks on CPU, try a few short prompts such as a summary, a small code snippet, and a translation.
Users report that even the smallest Qwen 3.5 models show strong instruction following and multilingual capability compared to older tiny models.
The official OpenClaw installer script works on macOS, Linux, and Windows (PowerShell, typically via WSL2):
```bash
curl -fsSL https://openclaw.ai/install.sh | bash
```
This installs the CLI globally (via npm under the hood when needed), checks for Node.js 22+, and may run an onboarding wizard.
If Node 22+ is already installed and npm is configured:
```bash
npm install -g openclaw@latest
openclaw onboard --install-daemon
```
The --install-daemon flag registers OpenClaw as a background service (systemd on Linux, launchd on macOS), ensuring the gateway keeps running across reboots.
After installation, run the basic diagnostics:
```bash
openclaw doctor     # check for config issues
openclaw status     # gateway status
openclaw dashboard  # open browser UI
```
If the Control UI opens and shows a healthy gateway, OpenClaw is ready.
OpenClaw communicates with local LLMs via HTTP APIs. Ollama exposes such an API at http://localhost:11434, which OpenClaw can call as a tool.
While exact configuration files vary by version, the high‑level pattern (based on a public tutorial for using OpenClaw with Ollama as a local data analyst) is:
A simplified pseudo‑configuration might look like this (YAML‑ish for illustration):
```text
# skills/local-ollama-qwen35-08b.skill.md
---
name: local-qwen35-08b
summary: "Use local Qwen3.5-0.8B via Ollama to answer questions."
steps:
  - id: ask-model
    tool: http
    args:
      method: POST
      url: "http://localhost:11434/api/chat"
      body:
        model: "qwen3.5:0.8b"
        messages:
          - role: system
            content: "You are a helpful local assistant."
          - role: user
            content: "{{ input }}"
```
In real OpenClaw setups, this is written as a Markdown front matter block plus narrative instructions, but the idea is the same: a step that posts user input to the Ollama chat API and returns the model’s reply.
Make sure the Ollama service is running (start it with `ollama serve` if required) and that Qwen3.5‑0.8B has been pulled. If the response is slow but steady, your CPU is handling the 0.8B model as expected.
A well‑documented example of OpenClaw + Ollama usage is a Local Data Analyst demo that takes in a dataset and writes `trend_chart.png`, `analysis_report.md`, and `tool_trace.json` to disk. According to the tutorial:

- A web front end (`web_assistant.py`) handles file uploads and sends a slash command to OpenClaw.
- A worker script (`main.py`) reads the dataset, infers relevant columns, generates charts, and writes outputs to disk.

Adapting this to Qwen3.5‑0.8B simply means using `qwen3.5:0.8b` as the Ollama model in the skill configuration.
Within a short time, you should see:

- A generated chart (`trend_chart.png`).
- A written analysis report (`analysis_report.md`).
- A trace of the tool calls (`tool_trace.json`).

Even with a tiny 0.8B model, this demonstrates how tool‑augmented reasoning can compensate for limited raw model capacity when the tools and workflow are well‑designed.
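For a feel of the worker side of such a demo, here is an illustrative stand‑in (not the tutorial's actual `main.py`; the file names and column handling are assumptions) that reads a CSV and writes a small Markdown report:

```python
import csv
import statistics
from pathlib import Path

def analyze(csv_path, report_path="analysis_report.md"):
    """Compute means for numeric CSV columns and write a Markdown report."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    lines = ["# Analysis report", ""]
    for col in rows[0]:
        try:
            values = [float(r[col]) for r in rows]
        except ValueError:
            continue  # non-numeric column: skip it
        lines.append(f"- mean({col}) = {statistics.mean(values):.2f}")
    report = "\n".join(lines)
    Path(report_path).write_text(report)
    return report
```

In a real OpenClaw skill, the model decides which columns matter and narrates the findings; a script like this only handles the mechanical I/O.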
Public benchmarks for the wider Qwen 3.5 lineup show impressive results, especially for the 9B and medium‑sized models.
While these headline numbers are mostly for 4B and 9B variants, they indicate that the underlying architecture is strong even when scaled down to 0.8B.
Exact tokens‑per‑second for Qwen3.5‑0.8B on CPU vary by hardware and quantization, but some reasonable expectations can be drawn from local LLM speed tests and small‑model behavior:
Small models running under llama.cpp and Ollama report tens to hundreds of tokens per second on modern CPUs, depending on quantization and thread count. For context, independent speed tests have shown that:
bare llama.cpp can reach around 161 tokens/s, versus 89 tokens/s for Ollama with the same model. Since Qwen3.5‑0.8B is optimized for edge and low‑resource devices, CPU‑only speed is one of its design targets.
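Rather than trusting published numbers, you can measure throughput on your own machine: Ollama's final chat/generate responses include `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), from which tokens per second follows directly:

```python
def tokens_per_second(response):
    """Generation speed from Ollama response metadata (eval_duration is in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Worked example: 120 tokens generated in 2.5 s of eval time -> 48.0 tokens/s
sample = {"eval_count": 120, "eval_duration": 2_500_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Note that `eval_duration` covers generation only; prompt processing is reported separately, so this figure reflects pure decoding speed.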
A qualitative comparison for Qwen 3.5 models on CPU‑only setups:
| Model | Latency on mid‑range CPU (subjective) | Quality vs. 0.8B |
|---|---|---|
| 0.8B | Very fast, almost instant for short replies | Baseline |
| 2B | Fast, minor delays on longer generations | Noticeable improvement on reasoning |
| 4B | Moderate; acceptable but slower on long outputs | Much stronger reasoning and knowledge |
| 9B | Slower on CPU‑only; better with GPU | Approaches older large models and some smaller proprietary models |
For pure CPU usage, 0.8B and 2B offer the best balance between speed and usability, while 4B and 9B become more comfortable if at least partial GPU offload is available.
A number of 2025–2026 comparisons and guides evaluate the main local LLM runtimes.
| Runtime | Interface | Best for | Performance notes | Ease of setup |
|---|---|---|---|---|
| Ollama | CLI + REST API | Developers, scripts, integrations | Often 10–20% faster than LM Studio in some tests; in others slightly slower than hand‑tuned llama.cpp. | Very easy; one‑line install and ollama run |
| LM Studio | Desktop GUI | Non‑technical users, quick experiments | Sometimes slower than Ollama; in some Mac tests, LM Studio outperformed Ollama (e.g., 237 t/s vs 149 t/s on Gemma 3 1B). | Extremely easy; full GUI, no terminal needed |
| llama.cpp | CLI / library | Power users, maximum speed | Can be up to 1.8× faster than Ollama in certain benchmarks; exposes low‑level tuning. | Harder; requires manual builds and model management |
| GPT4All | Desktop GUI | Beginners wanting a ChatGPT‑like app | Emphasizes local RAG and document chat; simple configuration. | Easy; installer + built‑in model download |
| Jan | Desktop GUI + local backend | Privacy‑focused local chat | 100% offline by default; supports multiple runtimes under the hood. | Easy; but less focused on dev workflows |
USP of Ollama in this stack: it strikes a middle ground between raw performance and usability, offers a consistent curated model library, and exposes a simple HTTP API that OpenClaw can call without extra adapters.
OpenClaw competes with other open‑source or commercial agent frameworks and gateways.
USP of OpenClaw: combines a general‑purpose local gateway, rich tool orchestration, and persistent Markdown memory in one coherent platform, making it particularly suited for personal machines and privacy‑sensitive use cases.
Direct head‑to‑head benchmarks for the 0.8B model are still emerging, but several trends are clear from coverage of the Qwen 3.5 lineup and community comparisons of small models.
USP of Qwen3.5‑0.8B: unlike many tiny models that strip out modalities or compromise instruction following, it remains natively multimodal and trained with scaled RL, giving it better instruction compliance and image‑aware capabilities in a sub‑1B footprint.
Alibaba has released Qwen 3.5 models, including the small series, as open‑source, open‑weight models under Apache‑2.0, meaning the weights can be downloaded, self‑hosted, fine‑tuned, and used commercially without license fees.
Ollama is distributed as free software for personal and commercial use. The runtime itself is open source, and the primary "cost" is local compute and storage.
Some users may later choose paid cloud infrastructure (e.g., remote GPUs or hosting providers), but the tool itself and local usage are free.
OpenClaw is an open‑source gateway, and self‑hosting on your own machine incurs no license fees.
The main cost levers are hardware, electricity, and the time you invest in setup.
For a pure local CPU‑only deployment on a single machine, there are no recurring software or per‑token fees.
To systematically evaluate your Qwen3.5‑0.8B + OpenClaw + Ollama stack on CPU, run a few structured tests covering chat latency, generation throughput, and tool‑calling workflows.
If and when Qwen3.5‑0.8B’s vision features are exposed via Ollama and your configuration, test them with simple image‑description and visual question‑answering prompts.
Coverage confirms that even the small Qwen 3.5 models are natively multimodal, but specific runtime support may lag behind.
| Aspect | Qwen3.5‑0.8B + OpenClaw + Ollama (CPU‑only) | Typical cloud LLM API |
|---|---|---|
| Cost | Free beyond hardware and electricity | Per‑token or monthly fees |
| Privacy | Data stays on local machine | Data sent to external servers |
| Latency | Local; may be slower on weak CPUs | Often fast, backed by large GPU clusters |
| Control | Full control over models, tools, and workflows | Limited to provider’s models and features |
| Setup difficulty | Medium – requires installing three tools | Low – call a REST API |
| Offline use | Yes, once models are downloaded | Usually no |
For many developers and privacy‑sensitive users, the trade‑off of slightly more setup and possibly lower raw model quality is acceptable given the privacy, cost, and control benefits.
To swap in a different model, change the `model` field in the Ollama API call (for example, from `qwen3.5:0.8b` to a Llama or Gemma model) and keep the rest of the workflow identical.