Last updated April 2026 — refreshed for current model/tool versions.
DeepScaleR-1.5B-Preview is a 1.5-billion-parameter math-reasoning model that outscores OpenAI's o1-Preview on AIME 2024 while running on a single consumer GPU with as little as 4 GB of VRAM. This guide covers every method for getting it running on Linux — from a bare llama.cpp binary to an Ollama one-liner to a production-grade vLLM server — with updated 2026 build flags, current quantization recommendations, and concrete troubleshooting steps.
What changed since February 2025 (original post)
- llama.cpp build system: The project migrated from make to CMake as the primary build path. The old LLAMA_CUDA=1 make flag is deprecated; the correct flag is now -DGGML_CUDA=ON (cmake). The AMD ROCm flag changed from LLAMA_ROCM to -DGGML_HIP=ON.
- llama-server is the canonical server binary: ./main was renamed; use llama-server and llama-cli. The server now exposes a full OpenAI-compatible /v1/chat/completions endpoint alongside the older /completion route.
- Ollama reached v0.21.0 (April 2026). DeepScaleR is in the official Ollama library; the correct command is ollama run deepscaler, not a path to a local .gguf file.
- Better GGUF quant selection: Q4_K_M (1.12 GB) is now the community-recommended default, not Q8_0. Q8_0 (1.89 GB) is reserved for maximum-quality use cases.
- vLLM is now a first-class option for GPU-accelerated server deployments, supporting the HuggingFace checkpoint directly via agentica-org/DeepScaleR-1.5B-Preview.
- CUDA unified memory flag: Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to swap to system RAM when VRAM is exhausted instead of crashing.
TL;DR
| Goal | Recommended approach | VRAM needed |
|---|---|---|
| Quickest test | ollama run deepscaler | ~4 GB (default tag is 3.6 GB, roughly F16) |
| CPU-only | llama.cpp + Q4_K_M GGUF | None (uses RAM) |
| NVIDIA GPU inference | llama.cpp -DGGML_CUDA=ON + Q4_K_M | ~4 GB |
| AMD GPU inference | llama.cpp -DGGML_HIP=ON + Q4_K_M | ~4 GB |
| Production API server | vLLM with HuggingFace checkpoint | ~8 GB |
| Max quality, max RAM | llama.cpp + Q8_0 (1.89 GB) | ~6 GB |
What Is DeepScaleR-1.5B-Preview?
DeepScaleR-1.5B-Preview was released in February 2025 by the Agentica project (Berkeley Sky Computing Lab / Berkeley AI Research). It is fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B using Distributed Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO), trained on roughly 40,000 competition math problems sourced from AIME (1984–2023), AMC (pre-2023), Omni-MATH, and the Still dataset.
The key training innovation is iterative context lengthening: training starts at 8K context, extends to 16K, then 24K. This forces the model to first learn efficient short-chain reasoning before learning to reason over longer chains — producing better generalisation than training at long context from the start.
The model is MIT-licensed and fully open: weights, training code, hyperparameters, and dataset are all public under agentica-project/rllm.
Performance and Benchmarks
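The following results are from the official model card on Hugging Face (agentica-org/DeepScaleR-1.5B-Preview). All scores are Pass@1 unless noted. The Avg. column is taken from the model card, where it also covers a Minerva Math score that is not reproduced here, so it is not the mean of the four benchmark columns shown.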
| Model | Params | AIME 2024 | MATH 500 | AMC 2023 | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| DeepScaleR-1.5B-Preview | 1.5B | 43.1 | 87.8 | 73.6 | 50.0 | 57.0 |
| OpenAI o1-Preview | n/a | 40.0 | 81.4 | — | — | — |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | 28.8 | 82.8 | 62.9 | 43.3 | 48.9 |
| Still-1.5B | 1.5B | 32.5 | 84.4 | 66.7 | 45.4 | 51.6 |
DeepScaleR's 43.1% AIME 2024 score is a 14.3 percentage-point improvement over its base model, achieved without any increase in model size. Note that these benchmarks reflect the state of the field as of early 2025; larger reasoning models released since then (including DeepSeek's V4 series and Meta's Llama 4 family) have pushed the frontier significantly further, but for on-device 1.5B inference DeepScaleR remains a strong baseline for math-heavy tasks.
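The gain quoted above can be reproduced from the table with a few lines of arithmetic. This is only a sketch: the scores are copied from the table, and the helper names point_gain and relative_gain are illustrative, not part of any benchmark tooling.

```python
# AIME 2024 Pass@1 scores copied from the benchmark table above.
AIME_2024 = {
    "DeepScaleR-1.5B-Preview": 43.1,
    "OpenAI o1-Preview": 40.0,
    "DeepSeek-R1-Distill-Qwen-1.5B": 28.8,
    "Still-1.5B": 32.5,
}


def point_gain(model: str, baseline: str, scores: dict = AIME_2024) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(scores[model] - scores[baseline], 1)


def relative_gain(model: str, baseline: str, scores: dict = AIME_2024) -> float:
    """Gain expressed as a percentage of the baseline score."""
    delta = scores[model] - scores[baseline]
    return round(100 * delta / scores[baseline], 1)


base = "DeepSeek-R1-Distill-Qwen-1.5B"
print(point_gain("DeepScaleR-1.5B-Preview", base))     # percentage points over the base model
print(relative_gain("DeepScaleR-1.5B-Preview", base))  # same gain relative to the base score
```

Reporting both numbers is useful because a percentage-point gain and a relative gain are easy to confuse when comparing small models.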
System Requirements
Hardware
- RAM (CPU-only): 4 GB minimum for Q4_K_M; 8 GB recommended for comfortable multitasking. Q8_0 needs ~6 GB RAM.
- GPU (optional): Any NVIDIA GPU with CUDA 12.x or AMD GPU with ROCm 7+. A 4 GB VRAM card runs Q4_K_M fully on-GPU.
- Storage: ~1.2 GB for Q4_K_M; ~1.9 GB for Q8_0; ~3.6 GB for the Ollama default (F16-based).
- CPU: x86_64 with AVX2 or AVX-512 for best CPU throughput. ARM64 is also supported via llama.cpp.
Software Dependencies
- Linux (Ubuntu 22.04+, Debian 12+, Fedora 39+, or any modern distro with glibc 2.31+)
- git, cmake 3.21+, build-essential (GCC 11+ or Clang 14+)
- Python 3.10+ with pip (for vLLM or Python API path)
- CUDA Toolkit 12.6+ (for NVIDIA GPU builds); CUDA 13.x has some known compatibility issues with certain quant types (see Troubleshooting)
- ROCm 7+ (for AMD GPU builds)
# Ubuntu 22.04 / 24.04 — install build essentials
sudo apt update
sudo apt install -y git cmake build-essential python3 python3-pip
Method 1: llama.cpp (CPU and GPU)
llama.cpp is the most flexible path: it runs on any Linux hardware and supports every quantization format. The project moved to CMake as its build system, and current releases no longer build through the legacy Makefile, so use CMake for CPU and GPU builds alike.
Clone and Build (CPU)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
Build with NVIDIA CUDA
Use -DGGML_CUDA=ON. The old LLAMA_CUDA=1 make invocation and the LLAMA_CUBLAS flag are deprecated and should not be used; if the GPU option is not picked up, you end up with a CPU-only build.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
To target a specific CUDA architecture (e.g., RTX 4090 = sm_89):
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89"
cmake --build build --config Release -j$(nproc)
Build with AMD ROCm/HIP
The flag changed from LLAMA_ROCM to GGML_HIP=ON. Replace gfx1030 with your GPU's target (e.g., gfx1100 for RX 7900):
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1030
cmake --build build --config Release -j$(nproc)
Build with Vulkan (generic GPU)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
Download the GGUF Model
Multiple quantization levels are available. Q4_K_M is the recommended starting point for most users — it delivers good quality at 1.12 GB. Q8_0 (1.89 GB) is the highest-quality quantized option if VRAM or RAM allows. The original post used a specific NikolayKozloff GGUF; the recommended source is now bartowski's GGUF collection, which is actively maintained.
| Quantization | File size | Recommended for |
|---|---|---|
| Q2_K | 0.75 GB | Absolute minimum RAM only |
| Q3_K_M | 0.92 GB | Very tight RAM budgets |
| Q4_K_M | 1.12 GB | Best default choice |
| Q5_K_M | 1.29 GB | Higher accuracy, moderate RAM |
| Q6_K | 1.46 GB | Near-lossless, ample RAM |
| Q8_0 | 1.89 GB | Maximum quality |
| F16 | 3.56 GB | Full precision / benchmarking only |
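As a rough aid for reading the table, the sketch below picks the largest quant whose file fits a memory budget. The sizes are copied from the table; the pick_quant helper and its 1 GB headroom default (for context cache and runtime overhead) are assumptions made for illustration, not a rule from llama.cpp or the GGUF publisher.

```python
from typing import Optional

# File sizes in GB, copied from the quantization table above.
QUANT_SIZES_GB = {
    "Q2_K": 0.75,
    "Q3_K_M": 0.92,
    "Q4_K_M": 1.12,
    "Q5_K_M": 1.29,
    "Q6_K": 1.46,
    "Q8_0": 1.89,
    "F16": 3.56,
}


def pick_quant(budget_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest quant that fits in budget_gb minus headroom_gb.

    headroom_gb is a hypothetical allowance for the KV cache and runtime
    overhead; tune it for your context size.
    """
    usable = budget_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= usable]
    return max(fitting)[1] if fitting else None


print(pick_quant(2.5))  # tight budget
print(pick_quant(4.0))  # typical 4 GB card
```

This deliberately maximizes quality for a given budget; if you prefer the guide's default, simply start from Q4_K_M and only move up when memory allows.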
Install the Hugging Face CLI and download your chosen quant:
pip install -U "huggingface_hub[cli]"
# Download Q4_K_M (recommended)
huggingface-cli download bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF \
--include "agentica-org_DeepScaleR-1.5B-Preview-Q4_K_M.gguf" \
--local-dir ./models/
# Or download Q8_0 for maximum quality
huggingface-cli download bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF \
--include "agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf" \
--local-dir ./models/
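If you would rather script the download, the same files can be fetched from Python with hf_hub_download from the huggingface_hub package. The sketch below assumes the repository and file naming shown in the commands above; the gguf_filename and download_quant helpers are illustrative names.

```python
from pathlib import Path

REPO_ID = "bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF"


def gguf_filename(quant: str = "Q4_K_M") -> str:
    """Build the file name used in the CLI commands above for a given quant."""
    return f"agentica-org_DeepScaleR-1.5B-Preview-{quant}.gguf"


def download_quant(quant: str = "Q4_K_M", local_dir: str = "./models") -> Path:
    """Download one GGUF file and return its local path."""
    # Imported lazily so gguf_filename() works without the package installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
    return Path(path)


# Example (performs a network download):
# print(download_quant("Q4_K_M"))
```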
CLI Inference with llama-cli
The binary was renamed from ./main to llama-cli in current builds:
./build/bin/llama-cli \
--model models/agentica-org_DeepScaleR-1.5B-Preview-Q4_K_M.gguf \
--ctx-size 8192 \
--threads $(nproc) \
-p "Derive the Taylor series expansion for sin(x) and prove convergence."
For GPU offloading, add --n-gpu-layers 99 (offloads all layers):
./build/bin/llama-cli \
--model models/agentica-org_DeepScaleR-1.5B-Preview-Q4_K_M.gguf \
--ctx-size 8192 \
--n-gpu-layers 99 \
-p "Solve: Find all integer solutions to x^2 + y^2 = z^2 where z ≤ 20."
Running llama-server (OpenAI-Compatible API)
The server binary is now llama-server. It exposes both the legacy /completion endpoint and the OpenAI-compatible /v1/chat/completions endpoint:
./build/bin/llama-server \
--model models/agentica-org_DeepScaleR-1.5B-Preview-Q4_K_M.gguf \
--ctx-size 16384 \
--n-gpu-layers 99 \
--host 0.0.0.0 \
--port 8080
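Model loading takes a moment, so scripts that start the server and query it immediately can fail. A minimal sketch of a readiness check follows, assuming the server's GET /health route returns HTTP 200 once the model is loaded; the wait_until_ready helper, its timeout, and its polling interval are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def probe_health(url: str) -> bool:
    """Return True if the health URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, or a non-2xx status while the model is loading.
        return False


def wait_until_ready(
    url: str = "http://localhost:8080/health",
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    probe=probe_health,
    sleep=time.sleep,
) -> bool:
    """Poll the server until it is ready or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example: call wait_until_ready() after launching llama-server, before sending requests.
```

The probe and sleep parameters are injectable only so the loop can be exercised without a running server.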
Query via the OpenAI-compatible endpoint (works with any OpenAI SDK):
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepscaler",
"messages": [
{"role": "user", "content": "Explain the laws of thermodynamics with examples."}
],
"max_tokens": 512
}'
Or use the legacy endpoint for simple completion:
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain the laws of thermodynamics", "n_predict": 256}'
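Reasoning models produce long outputs, so streaming tokens as they arrive is often more pleasant than waiting for the full reply. The sketch below assumes the OpenAI-style streaming format on /v1/chat/completions (server-sent "data:" lines ending with "[DONE]"); iter_stream_text and stream_chat are illustrative helper names.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent-event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def stream_chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Stream one reply from the server, printing fragments as they arrive."""
    import requests  # imported lazily; only needed for the live call

    body = {
        "model": "deepscaler",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    parts = []
    with requests.post(f"{base_url}/v1/chat/completions", json=body,
                       stream=True, timeout=600) as response:
        response.raise_for_status()
        decoded = (raw.decode("utf-8") for raw in response.iter_lines() if raw)
        for text in iter_stream_text(decoded):
            print(text, end="", flush=True)
            parts.append(text)
    return "".join(parts)
```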
Method 2: Ollama (Easiest Path)
Ollama reached v0.21.0 in April 2026. DeepScaleR is now in the official Ollama library, so no manual GGUF download is needed. On Linux, Ollama detects supported NVIDIA (CUDA) and AMD (ROCm) GPUs automatically.
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
On systemd-based distributions (Ubuntu, Debian, Fedora), enable the service:
sudo systemctl enable --now ollama
Run DeepScaleR
# Pull and run interactively
ollama run deepscaler
# Or pull only (for later use)
ollama pull deepscaler
The default Ollama tag (deepscaler:latest / deepscaler:1.5b) is a 3.6 GB model with a 128K context window, stored at roughly F16 precision rather than as a 4-bit quant. If you need a smaller footprint, use llama.cpp with Q4_K_M instead.
Querying via Ollama API
# REST API
curl http://localhost:11434/api/chat \
-d '{
"model": "deepscaler",
"messages": [{"role": "user", "content": "Prove that there are infinitely many prime numbers."}]
}'
# OpenAI-compatible endpoint (Ollama 0.6+)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepscaler",
"messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}]
}'
Python with Ollama
import ollama
response = ollama.chat(
model="deepscaler",
messages=[{"role": "user", "content": "Solve: integral of x*sin(x) dx"}]
)
print(response["message"]["content"])
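Ollama typically runs a model with a much smaller context than its advertised maximum unless told otherwise, which can truncate long chains of reasoning. A minimal sketch of raising it per request follows, using the options field of the /api/chat route with num_ctx; the helper names and the default values chosen here are assumptions for illustration.

```python
import json
import urllib.request


def build_chat_payload(prompt: str, num_ctx: int = 8192,
                       temperature: float = 0.6, model: str = "deepscaler") -> dict:
    """Build a non-streaming /api/chat request body with an explicit context size."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


def ollama_chat(prompt: str, host: str = "http://localhost:11434", **kwargs) -> str:
    """Send the request to a local Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["message"]["content"]


# Example (needs a running Ollama server):
# print(ollama_chat("Solve: integral of x*sin(x) dx", num_ctx=16384))
```

The same options dictionary can be passed to ollama.chat(..., options={...}) in the Python client shown above.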
Method 3: vLLM (Production Server)
For teams running DeepScaleR as a shared inference service, vLLM provides continuous batching, OpenAI API compatibility, and significantly higher throughput than llama.cpp's server at the cost of requiring a proper NVIDIA GPU setup. It loads directly from the HuggingFace checkpoint — no GGUF conversion needed.
Install vLLM
pip install vllm
Start the Server
vllm serve agentica-org/DeepScaleR-1.5B-Preview \
--dtype auto \
--api-key your-api-key
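Once the server is up, clients authenticate with the key passed to --api-key. The sketch below assumes vLLM's defaults: port 8000, a Bearer token in the Authorization header, and the served model name equal to the checkpoint ID. The helper names are illustrative.

```python
import json
import urllib.request

VLLM_MODEL = "agentica-org/DeepScaleR-1.5B-Preview"


def build_vllm_request(prompt: str, api_key: str,
                       base_url: str = "http://localhost:8000",
                       max_tokens: int = 1024) -> urllib.request.Request:
    """Build an authenticated chat-completions request for a vLLM server."""
    body = {
        "model": VLLM_MODEL,  # must match the name the server was started with
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def vllm_chat(prompt: str, api_key: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_vllm_request(prompt, api_key, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]


# Example (needs a running vLLM server):
# print(vllm_chat("What is 17 * 23?", api_key="your-api-key"))
```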
Python API
from vllm import LLM, SamplingParams
llm = LLM(model="agentica-org/DeepScaleR-1.5B-Preview")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)
prompts = ["Implement a binary search tree in Python with insert, delete, and search."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Practical Examples
Math Reasoning (Primary Use Case)
DeepScaleR excels at step-by-step mathematical problem solving. Use higher max_tokens values to allow the model to complete its chain-of-thought:
import requests
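def math_solve(problem: str, max_tokens: int = 1024) -> str:
    """Send one math problem to llama-server and return the model's reply."""
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "deepscaler",
            "messages": [{"role": "user", "content": problem}],
            "max_tokens": max_tokens,
            "temperature": 0.6
        },
        timeout=600,
    )
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()["choices"][0]["message"]["content"]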
result = math_solve("Find all real solutions to x^4 - 5x^2 + 4 = 0.")
print(result)
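For automated grading or pipelines you usually want just the final answer, not the whole chain of thought. The sketch below rests on two assumptions that you should verify against your own outputs: that the model, like other R1-style distills, may wrap its reasoning in <think>...</think> tags, and that it responds to the common instruction to put the final answer inside \boxed{}. All names here are illustrative.

```python
import re
from typing import Optional

# Prompt suffix commonly recommended for R1-style math models (assumed to apply here).
BOXED_INSTRUCTION = "Please reason step by step, and put your final answer within \\boxed{}."


def strip_think(text: str) -> str:
    """Remove a <think>...</think> block if the reply contains one."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    collected = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected)
        collected.append(ch)
    return None  # braces never closed, e.g. the reply was cut off by max_tokens


reply = "<think>factor as (x^2-1)(x^2-4)</think>\nThe solutions are \\boxed{\\pm 1, \\pm 2}."
print(extract_boxed(strip_think(reply)))
```

A None result is a useful signal in its own right: it often means max_tokens was too low for the model to finish its reasoning.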
Code Generation
import requests
def generate_code(task: str) -> str:
    """Ask the model for Python code that solves the given task."""
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "deepscaler",
            "messages": [
                {"role": "system", "content": "You are an expert Python programmer. Write clean, documented code."},
                {"role": "user", "content": task}
            ],
            "max_tokens": 800,
            "temperature": 0.3
        },
        timeout=600,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
print(generate_code("Implement a LRU cache using Python's collections module with O(1) get and put."))
Interactive Chat Session
import requests
def chat(history: list, user_input: str) -> str:
    # Append the user's turn, query the server, then record the reply.
    history.append({"role": "user", "content": user_input})
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={"model": "deepscaler", "messages": history, "max_tokens": 512}
    )
    reply = response.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
history = [{"role": "system", "content": "You are a helpful math tutor."}]
while True:
    user_input = input("You: ")
    if user_input.lower() in ("exit", "quit"):
        break
    print("Model:", chat(history, user_input))
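A loop like the one above grows the history without bound, so a long session will eventually exceed the server's context size. One simple mitigation is to drop the oldest turns before each request. The sketch below uses character count as a crude stand-in for tokens, which is an assumption; the trim_history name and the 24,000-character default are illustrative.

```python
def trim_history(history: list, max_chars: int = 24000) -> list:
    """Keep system messages plus the most recent turns within a character budget.

    Characters are only a rough proxy for tokens; pick max_chars well below
    what your --ctx-size can hold.
    """
    system = [m for m in history if m["role"] == "system"]
    turns = [m for m in history if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)

    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = len(message["content"])
        if kept and cost > budget:
            break  # older turns no longer fit
        kept.append(message)  # the newest turn is always kept
        budget -= cost
    return system + kept[::-1]  # restore chronological order


demo = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "a" * 10},
    {"role": "assistant", "content": "b" * 10},
    {"role": "user", "content": "c" * 10},
]
print(len(trim_history(demo, max_chars=25)))
```

In the chat loop, you would pass trim_history(history) as the messages value instead of the full history.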
Optimization Strategies
Choosing the Right Quantization
Quantization is the biggest lever for balancing quality vs. resource use. From the bartowski GGUF collection:
- Q4_K_M (1.12 GB): best default. Good quality, runs on 4 GB of VRAM or 4 GB of RAM (8 GB recommended).
- Q5_K_M (1.29 GB): step up in quality for 6 GB VRAM systems.
- Q6_K (1.46 GB): near-lossless; use when quality matters more than speed.
- Q8_0 (1.89 GB): highest quantized quality; needs ~6 GB of RAM or VRAM.
- Q3_K_M (0.92 GB): for very tight RAM budgets, noticeable quality loss on complex math.
- IQ4_XS (1.02 GB): good for cuBLAS/rocBLAS; not compatible with Vulkan backend.
GPU Settings
# Offload all layers to GPU
--n-gpu-layers 99
# Enable CUDA unified memory (avoids OOM crash; swaps to system RAM)
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
# Multi-GPU pipeline parallelism improvement
export CUDA_SCALE_LAUNCH_QUEUES=4x
Context Window and Batching
- For interactive chat, --ctx-size 8192 is sufficient. For long-chain math reasoning, use 16384 or higher (model supports up to 128K context).
- Increase batch size with --batch-size 512 (or higher) for higher throughput at the cost of latency.
- Set --threads $(nproc) for CPU inference; over-threading can hurt performance on NUMA systems, so test with half the core count if throughput drops.
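Context size costs memory on top of the model file, because the key/value cache grows linearly with --ctx-size. The sketch below estimates that cost. The architecture numbers (28 layers, 2 key/value heads, head dimension 128) are assumptions based on the Qwen 1.5B family and should be checked against the checkpoint's config.json; an F16 cache at 2 bytes per value is also assumed.

```python
def kv_cache_bytes(ctx_tokens: int, n_layers: int = 28, n_kv_heads: int = 2,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys and values for every layer and token.

    Defaults are assumed values for a Qwen-style 1.5B model with an F16 cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * ctx_tokens


def kv_cache_gib(ctx_tokens: int, **kwargs) -> float:
    """Same estimate expressed in GiB."""
    return kv_cache_bytes(ctx_tokens, **kwargs) / 2**30


for ctx in (8192, 16384, 131072):
    print(f"ctx={ctx:>6}: ~{kv_cache_gib(ctx):.2f} GiB of KV cache")
```

Under these assumptions a full 128K context needs several times more memory than the Q4_K_M file itself, which is why modest context sizes matter on 4 GB cards.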
CPU-Specific Optimizations
- Build with AVX-512 enabled (CMake detects this automatically via -DGGML_NATIVE=ON).
- On NUMA systems (multi-socket servers), bind the process with numactl --cpunodebind=0 --membind=0 to prevent cross-socket memory traffic.
- Ensure sufficient swap space if using large context windows on RAM-limited systems: sudo fallocate -l 8G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile.
How to Choose Your Deployment Method
- Just want to try it now? → ollama run deepscaler. Done.
- CPU-only server or laptop without GPU? → llama.cpp with Q4_K_M. Build without -DGGML_CUDA=ON.
- NVIDIA GPU (4–8 GB VRAM)? → llama.cpp with -DGGML_CUDA=ON, Q4_K_M, --n-gpu-layers 99.
- AMD GPU on Linux? → llama.cpp with -DGGML_HIP=ON, ROCm 7+.
- Serving multiple users / production? → vLLM with the HuggingFace checkpoint. Requires NVIDIA GPU with 8+ GB VRAM.
- Need OpenAI API drop-in replacement? → Either llama-server or vLLM; both expose /v1/chat/completions.
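The checklist above can also be written as a small decision function, which is handy in provisioning scripts. This is a sketch of one possible encoding; the choose_method name, its parameters, and the fallback to llama.cpp when vLLM's requirements are not met are assumptions, not an official recommendation.

```python
from typing import Optional


def choose_method(gpu_vendor: Optional[str] = None, vram_gb: float = 0.0,
                  multi_user: bool = False, quick_try: bool = False) -> str:
    """Map a machine description to one of the deployment paths in this guide.

    gpu_vendor is "nvidia", "amd", or None for CPU-only machines.
    """
    if quick_try:
        return "ollama run deepscaler"
    if multi_user and gpu_vendor == "nvidia" and vram_gb >= 8:
        return "vLLM with the HuggingFace checkpoint"
    if gpu_vendor == "nvidia":
        return "llama.cpp built with -DGGML_CUDA=ON, Q4_K_M, --n-gpu-layers 99"
    if gpu_vendor == "amd":
        return "llama.cpp built with -DGGML_HIP=ON, Q4_K_M"
    return "llama.cpp CPU build with Q4_K_M"


print(choose_method(gpu_vendor="nvidia", vram_gb=4))
print(choose_method(gpu_vendor="nvidia", vram_gb=16, multi_user=True))
print(choose_method())
```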
Common Pitfalls and Troubleshooting
Deprecated Build Flags (Most Common Issue)
If you used the original post's LLAMA_CUDA=1 make or LLAMA_ROCM=1 make commands, they will not produce a working GPU build on current llama.cpp: the Makefile build was replaced by CMake and the LLAMA_-prefixed options are deprecated. Always use the current CMake flags:
- CUDA: cmake -B build -DGGML_CUDA=ON
- ROCm: cmake -B build -DGGML_HIP=ON
- OpenCL: cmake -B build -DGGML_OPENCL=ON
- Vulkan: cmake -B build -DGGML_VULKAN=ON
Out-of-Memory Crashes
- Reduce context length: --ctx-size 4096 instead of 16384.
- Switch to a smaller quant (Q4_K_M → Q3_K_M).
- Enable CUDA unified memory: export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 before running.
- Reduce GPU layers: --n-gpu-layers 20 (partial offload, rest on CPU).
CUDA 13.x Compatibility Issues
As of early 2026, CUDA 13.2 has known issues with certain quantization types (IQ3_S) in llama.cpp. If you hit quantization-related errors on CUDA 13.x, use CUDA 12.6 or switch to Q4_K_M/Q8_0 which are unaffected. Track the issue at ggml-org/llama.cpp#21255.
Ollama: GPU Not Detected
- Verify CUDA: nvidia-smi should show your GPU and driver version.
- Verify ROCm: rocminfo should list your GPU. Requires ROCm 7+.
- Check Ollama logs: journalctl -u ollama -f.
- Reinstall Ollama after installing/updating CUDA drivers; the Ollama installer bakes in library paths.
Compilation Failures
- Missing cmake: sudo apt install cmake (Ubuntu) or sudo dnf install cmake (Fedora).
- GCC too old: current llama.cpp needs a modern C++ compiler to build (this guide assumes GCC 11+); the requirement comes from the build, not from the GGUF file. Check with gcc --version.
- CUDA not found: set CMAKE_CUDA_COMPILER explicitly: cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc.
Slow CPU Inference
- Verify the CPU supports AVX2/AVX-512: grep avx2 /proc/cpuinfo | head -1.
- Build with -DGGML_NATIVE=ON to auto-detect and enable CPU features.
- Reduce --threads if performance degrades (over-threading hurts on some CPUs).
- For pure CPU inference, Q4_K_M at 3–8 tokens/sec is normal on a modern 8-core laptop CPU.
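The grep check above can be turned into a small script that reports which SIMD extensions the CPU advertises. This sketch assumes the x86 layout of /proc/cpuinfo, where features appear on a "flags" line (ARM kernels use a different field); the function names and the set of flags checked are illustrative.

```python
SIMD_FLAGS_OF_INTEREST = {"avx", "avx2", "avx512f", "fma", "f16c"}


def simd_flags(cpuinfo_text: str) -> set:
    """Return the SIMD-related flags found on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            advertised = set(line.split(":", 1)[1].split())
            return advertised & SIMD_FLAGS_OF_INTEREST
    return set()  # no flags line, e.g. on a non-x86 kernel


def read_simd_flags(path: str = "/proc/cpuinfo") -> set:
    """Read the flags from the running Linux system."""
    with open(path, encoding="utf-8") as handle:
        return simd_flags(handle.read())


sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 fma\n"
print(sorted(simd_flags(sample)))
```

If avx2 is missing from the result, slow CPU inference is expected regardless of build flags.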
FAQ
Is DeepScaleR-1.5B still relevant in 2026?
Yes, for on-device math reasoning. Larger models (DeepSeek V4, Llama 4 Scout) have higher absolute scores, but they require far more VRAM and compute. DeepScaleR-1.5B remains one of the strongest open math-reasoning models you can run on a consumer GPU or CPU-only machine.
Can I run it without a GPU?
Yes. llama.cpp with Q4_K_M runs fully on CPU using RAM. Expect 3–8 tokens/second on a modern 8-core CPU, which is usable for interactive queries but slow for batch inference.
What is the context window?
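The model supports up to 128K context. In practice, llama.cpp only allocates what you ask for, and its default context size has changed between releases (older builds used 512 tokens), so set --ctx-size explicitly rather than relying on the default. For math problem solving, 8K–16K is the sweet spot. The 128K limit comes from the Qwen base architecture.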
What quantization should I use?
Q4_K_M is the community-recommended default (1.12 GB, good quality). Use Q8_0 if you have extra RAM/VRAM and want maximum quality. Avoid Q2_K for math reasoning — the quality degradation on complex problems is significant.
Can I fine-tune DeepScaleR further?
Yes. The training code is available at agentica-project/rllm under the deepscaler branch. The training uses GRPO on top of the Verl framework. You'll need 32× A100-80GB GPUs to replicate the full training, but fine-tuning on domain-specific problems with fewer resources is feasible.
Does it work with Open WebUI?
Yes. Both llama-server (via /v1/chat/completions) and Ollama expose OpenAI-compatible endpoints that Open WebUI connects to out of the box. Set the API base URL in Open WebUI to http://localhost:8080/v1 (llama-server) or http://localhost:11434/v1 (Ollama).
Is there a Docker image?
Ollama ships its own Docker image: docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama. For llama.cpp, the community maintains CUDA-enabled Docker images at ai-dock/llama.cpp-cuda. For vLLM, use the official vLLM Docker image: vllm/vllm-openai.
What is the model's license?
MIT License. Commercial use is permitted.
References and Further Reading
- DeepScaleR-1.5B-Preview — Official Hugging Face Model Card (benchmarks, training details, citation)
- bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF — All quantized GGUF variants
- llama.cpp Official Build Documentation (current CMake flags and GPU backend instructions)
- agentica-project/rllm — Official Training Code and DeepScaleR Branch
- DeepScaleR on Ollama Library (available tags and run command)
- llama.cpp GitHub Releases (changelog and release notes)
- Running DeepScaleR-1.5B with vLLM on DigitalOcean (production deployment tutorial)
- DeepScaleR: Effective RL Scaling of Reasoning Models via Iterative Context Lengthening (OpenReview paper)