Qwen3.5 is Alibaba’s latest open‑weight multimodal model family, released under the Apache 2.0 license and designed to run efficiently from phones to high‑end GPUs while still competing with frontier cloud models on many language, coding, and agent benchmarks.
Claude Code is Anthropic’s agentic coding tool that runs in the terminal, understands your codebase, and automates edits, refactors, and git workflows via natural‑language instructions.
By pointing Claude Code at a locally served Qwen3.5 instance (via llama.cpp or a similar OpenAI‑compatible server), developers can create a free, private, local AI coding agent that behaves like a Claude‑style co‑worker but runs entirely on their own hardware.
This report explains what Qwen3.5 and Claude Code are, how to install and connect them, how to benchmark and test the setup, and how it compares with popular alternatives like cloud Claude Code, Qwen Code CLI, Codeium, and Aider.
Qwen3.5 is the newest generation in Alibaba’s Qwen series, offered as a family of dense and Mixture‑of‑Experts (MoE) models ranging from 0.8B parameters up to a 397B‑parameter flagship that activates 17B parameters per token (the “A17B” suffix). The models are open‑weighted under the Apache 2.0 license, allowing commercial and on‑prem deployments without usage‑based licensing.
On Ollama, the Qwen3.5 library exposes small to very large variants (0.8B, 2B, 4B, 9B, 27B, 35B, 122B, plus cloud variants) with a unified 256K‑token context window, suitable for large codebases and long agentic sessions.
Unsloth and Hugging Face host quantized GGUF builds such as Qwen3.5‑4B‑IQ4 and Qwen3.5‑4B‑Q4_K_M, which shrink model size to roughly 2.5–3 GB on disk and make local inference on consumer hardware practical.
Although Qwen3.5 is a general multimodal model, its large variants score competitively on code and agent benchmarks against GPT‑5‑class and Gemini‑3‑class models.
For the flagship Qwen3.5‑397B‑A17B, Qwen reports results that are competitive with frontier models on coding and agent benchmarks such as LiveCodeBench and SWE‑bench.
While these results are for the largest model, the 4B and 9B “Small” variants are designed to preserve strong instruction‑following and coding performance at much lower compute, and external reviews place Qwen3.5‑4B near or above peers like Llama 3.2 3B and Gemma 3 4B on coding tasks.
Qwen3.5 Small models target on‑device and edge deployment; MindStudio’s hardware guide places the 4B variants on modern laptops, desktops without a dedicated GPU, and recent mobile devices, with 9B and larger variants benefiting from a discrete GPU. Quantized GGUF builds from Unsloth expose Q4 and Q8 variants; for Qwen3.5‑4B, the IQ4_NL and Q4_K_M files are around 2.5–3 GB, small enough for SSD‑only setups and single‑GPU cards with 8–12 GB VRAM.
All Qwen3.x and Qwen3.5 models are released under Apache 2.0, explicitly allowing commercial use, redistribution, and modification as long as attribution and license terms are respected. The ecosystem supports common inference engines like llama.cpp, vLLM, SGLang, and tools such as Ollama, LM Studio, and Qwen Code CLI.
This permissive licensing plus broad tooling support is what makes Qwen3.5 especially attractive for a free, local coding agent.
Claude Code is an “agentic coding” tool from Anthropic that runs in the terminal and connects to Claude models in the cloud. It scans your repository, reads and writes files, runs tests and commands, and uses natural‑language prompts to drive incremental changes.
According to Anthropic’s docs and the npm package description, Claude Code:
- Installs globally via npm (npm install -g @anthropic-ai/claude-code) and runs as the claude command in your terminal.
- Is bundled into Claude’s Pro and Max subscription tiers and some Team/Enterprise seats.
Out of the box, Claude Code expects a paid Claude plan or API; however, its architecture can be pointed at compatible backends.
Normally, Claude Code sends your prompts, repository context, and tool calls to Anthropic’s servers, which then run Claude models in the cloud. For many teams this is fine, but it raises privacy and cost concerns for sensitive or large‑volume projects.
The key idea behind “Qwen3.5 + Claude Code as a free local AI coding agent” is:
YouTube demos and community tutorials show this pattern by combining a quantized Qwen3.5 4B GGUF model served with llama-server and wiring Claude Code or similar CLIs to that local endpoint.
A typical setup looks like this:
- Serve Qwen3.5 locally behind an OpenAI‑compatible /v1/chat/completions endpoint on localhost (for example, port 8080).
- Give the agent a dummy API key (OPENAI_API_KEY="EMPTY").
- Point its base URL at the local server (OPENAI_BASE_URL=http://localhost:8080/v1).
- Let the agent’s requests flow to llama-server instead of a cloud API.

This pattern mirrors how Qwen’s own Qwen Code CLI and other tools (like OpenClaw, OpenCode, or Gemini‑based CLIs) integrate with local or remote models via OpenAI‑compatible endpoints.
Operating system and tooling
Node.js or native installer for Claude Code
- macOS/Linux: curl -fsSL https://claude.ai/install.sh | bash
- Windows: irm https://claude.ai/install.ps1 | iex in PowerShell
- npm: npm install -g @anthropic-ai/claude-code

The llama.cpp project provides high‑performance local inference for GGUF models and includes a built‑in HTTP server (llama-server).
A typical build sequence is:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # enable CUDA for NVIDIA GPUs (optional)
cmake --build build --config Release -j
```
Guides from Arm, Datacamp, and others show similar commands; enabling LLAMA_BUILD_SERVER via CMake or targeting llama-server explicitly ensures the HTTP server binary is built.
After compilation, key binaries such as llama-cli, llama-server, and llama-quantize are available in the build directory.
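A quick way to confirm the build produced those binaries (the build/bin path assumes the default CMake layout above; adjust if yours differs):

```bash
# Check that the expected llama.cpp binaries exist after the build.
for bin in llama-cli llama-server llama-quantize; do
  if [ -x "./llama.cpp/build/bin/$bin" ]; then
    echo "$bin: ok"
  else
    echo "$bin: missing (re-run the build, e.g. with the server target enabled)"
  fi
done
```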
Qwen3.5 GGUF models suitable for llama.cpp are available from Unsloth’s Qwen3.5‑4B‑GGUF repository, among others.
General GGUF download instructions from Qwen and Hugging Face are:
```bash
pip install huggingface_hub
huggingface-cli download unsloth/Qwen3.5-4B-GGUF \
  Qwen3.5-4B-IQ4_NL.gguf \
  --local-dir models
```

The exact filename may differ (for example, Qwen3.5-4B-Q4_K_M.gguf or Qwen3.5-4B-IQ4_XS.gguf); Unsloth’s model card lists available quantizations and disk sizes, such as roughly 2.5–3 GB for Q4 variants.
llama.cpp can serve GGUF models through an OpenAI‑style chat completions API by running llama-server with appropriate parameters.
A Datacamp tutorial for Qwen3.5 shows a full‑featured example (here simplified):
```bash
./llama.cpp/llama-server \
  --model models/Qwen3.5-4B-IQ4_NL.gguf \
  --alias "Qwen3.5-4B" \
  --host 0.0.0.0 \
  --port 8080 \
  --fit on \
  --ctx-size 16384 \
  --jinja
```
Hugging Face’s GGUF/llama.cpp guide shows that you can also launch the server directly from a Hugging Face repo with -hf shorthand; llama.cpp will fetch and cache the model automatically.
Once running, the server exposes an OpenAI‑compatible /v1/chat/completions endpoint, which can be tested with a simple curl command.
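For example, a minimal smoke test might look like this; the model name is an assumption that must match the --alias passed to llama-server, and the trailing echo covers the case where the server is not yet running:

```bash
# POST a one-shot chat request to the local OpenAI-compatible endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
        "model": "Qwen3.5-4B",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }' || echo "llama-server is not reachable on :8080"
```

A successful response is a JSON object whose choices[0].message.content field holds the model’s reply.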
With Qwen3.5 served locally, configure Claude Code (or an equivalent coding agent CLI) to talk to the local endpoint.
- Run claude once in a project directory to let it initialize configuration and prompt you for connection details.
- Export OpenAI‑compatible environment variables before launching it:

```bash
export OPENAI_API_KEY="EMPTY"                      # llama.cpp ignores this
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_MODEL="Qwen3.5-4B"                   # must match --alias
```

YouTube tutorials that combine Qwen3.5 with Claude Code or OpenClaw use a similar pattern: an alias in llama-server and a configuration pointing the agent to localhost:8080 instead of a cloud provider.
To confirm that the setup is working:
- Check that llama-server is up and responding (localhost:8080).
- In a small test repo, run claude and ask:

“Scan this project and create a new script hello_agent.py that prints the current time every second.”

Some community demos show exactly this kind of workflow, where Claude Code or a Qwen‑based CLI creates files, updates tests, and refactors code while backed entirely by a local Qwen model.
llama.cpp and vendor guides provide several approaches to benchmarking Qwen3.5 on your hardware.
Key metrics include:
- Prompt‑processing and generation speed in tokens per second, taken from llama.cpp’s llama_print_timings output or from benchmark tools like llama-bench.

AMD and NVIDIA guides for llama.cpp show how to run benchmark commands that output average tokens per second across multiple runs, including on GPUs like RTX 40‑series or MI300‑class accelerators.
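As a sketch (the binary path assumes the CMake build from the installation section, and the flag values are illustrative), llama-bench can report averaged throughput like this:

```bash
# -p 512 benchmarks prompt processing on a 512-token prompt,
# -n 128 benchmarks generation of 128 tokens, -r 3 averages three runs.
BENCH=./llama.cpp/build/bin/llama-bench
if [ -x "$BENCH" ]; then
  "$BENCH" -m models/Qwen3.5-4B-IQ4_NL.gguf -p 512 -n 128 -r 3
else
  echo "llama-bench not found; build llama.cpp first"
fi
```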
For a local coding agent, practical evaluation matters more than synthetic scores. Useful real‑world tests include:
Qwen’s own benchmarks (LiveCodeBench, SWE‑bench) plus user reports from the LocalLLaMA community suggest that Qwen‑family coder models already deliver high‑quality code for many languages and scenarios, often rivaling earlier Claude or GPT‑4‑class models when unthrottled.
To validate the agent behavior rather than just raw model quality, consider scripted scenarios like:
You can automate these tests using the same approaches Anthropic and others use for evaluating coding agents, such as re‑running tests after each patch and scoring success based on pass/fail.
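A minimal pass/fail harness in this spirit might look like the following sketch; the test command is an assumption to be replaced with your project’s real suite (for example, pytest -q):

```bash
# Score one scenario: run the project's test command after an agent patch
# and report PASS/FAIL based on its exit code.
score_scenario() {
  # $1 = shell command that runs the project's tests
  if sh -c "$1" >/dev/null 2>&1; then
    echo PASS
  else
    echo FAIL
  fi
}

score_scenario "exit 0"   # stands in for a passing test suite
score_scenario "exit 1"   # stands in for a failing test suite
```

Running the two example calls prints PASS and then FAIL, respectively; in a real evaluation you would call score_scenario once per agent patch and tally the results.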
| Tool / Setup | Model location | Recurring cost | Offline? | Best for | Notes |
|---|---|---|---|---|---|
| Qwen3.5 + Claude Code (local) | Local Qwen3.5 via llama.cpp | None beyond hardware & power | Yes | Developers who want Claude‑style terminal agent with full local control | Requires manual setup; quality depends on chosen Qwen3.5 size and quantization. |
| Claude Code (cloud default) | Anthropic cloud (Claude Sonnet/Opus) | ≈17–20 USD/month Pro; 100–200 USD/month Max; team seats 25–150 USD/user/month | No | Teams wanting managed, highest‑quality Claude models | Best quality & support, but source and usage go to Anthropic servers. |
| Qwen Code CLI + Qwen3‑Coder | Qwen3‑Coder models (cloud via OpenRouter/Alibaba, or local) | Model/API costs or local GPU | Optional | Power users focused on Qwen ecosystem | First‑party CLI for Qwen models with strong Terminal‑Bench performance. |
| Codeium + VS Code | Codeium cloud models | Free for individuals; paid tiers for enterprises | No | Fast autocomplete and chat with almost no setup | Great developer UX but code goes to vendor backend. |
| Aider (terminal pair programmer) | Connects to Claude, OpenAI, DeepSeek, or local LLMs | Depends on chosen models | Optional | Terminal‑first workflow and git‑aware pairing | Strong git integration; works with Claude 3.5 and local models as backends. |
Consider a simple but realistic scenario: migrating a small Flask API to FastAPI and adding tests.
- llama-server is already running Qwen3.5‑4B‑IQ4 on port 8080.
- claude is launched in the project root with environment variables pointing to the local endpoint.
- The agent reads the repository and rewrites the main.py, models.py, and test_api.py files for FastAPI.
- It runs pytest and, if tests fail, loops back with the failure output as additional context.

This is functionally similar to using cloud Claude Code, but users retain full control over data, versioning, and underlying model selection.
Running Qwen3.5 locally incurs no per‑token charges, only one‑time or amortized costs:
Apache 2.0 licensing means there is no vendor‑imposed metering or seat‑based pricing, even for commercial applications.
Claude Code is included in several Claude subscription tiers, from Pro (roughly 17–20 USD/month) through Max (100–200 USD/month) to per‑seat Team and Enterprise plans.
While these plans are attractive for individuals and small teams, large organizations or heavy users can face substantial recurring costs, especially when combined with API usage.
In this landscape, Qwen3.5 + Claude Code stands out as a way to combine a polished agent UX with truly unmetered local inference.
Guides for llama.cpp and Qwen3‑Coder show several useful flags and habits:

- --threads and --threads-batch to match CPU cores.
- --ctx-size to balance long‑context needs and memory.
- --fit on to auto‑balance VRAM and RAM when the model does not fully fit on GPU.
- --flash-attn on (where supported) to reduce latency.
- Re‑running llama-bench after updates or re‑quantization to confirm performance has not regressed.

Despite the manual setup and tuning involved, the combination of open‑weight Qwen3.5 models and an agentic terminal UX like Claude Code offers one of the most compelling ways in 2026 to run a free, local AI coding agent with strong real‑world performance and no per‑token bill.
1. What is Qwen3.5 and why use it for coding?
Qwen3.5 is Alibaba’s open‑weight multimodal model family with strong coding and agent performance, available from 0.8B to 397B parameters under Apache 2.0. Smaller quantized variants like Qwen3.5‑4B‑Q4 run comfortably on consumer hardware while still delivering reliable code generation and refactoring.
2. How does Claude Code fit into a local setup?
Claude Code is an agentic coding tool that lives in your terminal, understands your repo, and automates edits, tests, and git workflows. By pointing it at a local OpenAI‑compatible server running Qwen3.5, you keep the Claude‑style UX while all model inference stays on your machine.
3. What are the main benefits of Qwen3.5 + Claude Code vs cloud tools?
You get zero per‑token costs, stronger privacy (code never leaves your hardware), and flexibility to swap models or quantizations as your needs grow. Quality is competitive for many coding tasks, especially with 4B or 9B variants, though still below top‑tier cloud Claude on very complex problems.
4. What hardware do I need to run Qwen3.5 locally?
Quantized Qwen3.5‑4B‑Q4 models are around 2.5–3 GB and are designed to run on modern laptops, desktops without a GPU, and recent mobile devices at usable speeds. Larger models like the 9B benefit from dedicated GPUs or remote GPU VMs but offer significantly better reasoning and coding depth.
5. How does this setup compare to tools like Codeium, Qwen Code, or Aider?
Codeium offers fast, free cloud autocomplete and chat, while Qwen Code and Aider provide powerful terminal‑first or git‑aware agents backed by cloud or local models. Qwen3.5 + Claude Code is unique in combining a Claude‑style terminal UX with fully local, unmetered inference using an Apache‑licensed model family.