Run DeepCoder on Ubuntu: Step-by-Step Install Guide

Last updated April 2026 — refreshed for current model versions, Ollama GPU support, and 2026 ecosystem context.

DeepCoder-14B-Preview is a fully open-source code reasoning model released in April 2025 by Agentica and Together AI. It matches OpenAI's o3-mini on LiveCodeBench while running entirely on local hardware. This guide walks you through every step to install and run DeepCoder on Ubuntu — from GPU drivers to a working chat interface — with accurate 2025/2026 commands and realistic hardware expectations.

What changed since this post was first published (April 2025)

Ollama now includes its own CUDA runtime, so you no longer need to install the full CUDA Toolkit separately. Only a compatible NVIDIA driver (531+) is required on the host.
Ubuntu 24.04 LTS is now the recommended OS (22.04 is still supported; 20.04 reached end of standard support in April 2025).
The correct Ollama tag is deepcoder:14b (9.0 GB download). A lighter 1.5B variant (deepcoder:1.5b, 1.1 GB) is also available for machines with less VRAM.
Context window on Ollama is 128K tokens. The model was trained at 16K to 32K context and generalizes to 64K, but inference servers expose up to 128K.
Open WebUI has become the standard browser-based frontend for Ollama; installation now takes one Docker command.
Community consensus (April 2026): For pure coding workloads, Qwen3-Coder-Next and DeepSeek V3.2 now outperform DeepCoder-14B on raw leaderboards, but DeepCoder remains one of the easiest ways to get o3-mini-level code reasoning at 14B parameters with no license restrictions.

Want the full picture? Read our continuously updated Self-Hosting LLMs Complete Guide (2026), which covers hardware, Ollama and vLLM, cost per token, and when to self-host.

Overview of DeepCoder-14B

DeepCoder-14B-Preview is a code reasoning large language model (LLM) fine-tuned from DeepSeek-R1-Distill-Qwen-14B using distributed reinforcement learning (RL) with a custom algorithm called GRPO+. It was trained jointly by the Agentica team and Together AI, published April 8, 2025, under an MIT license.

Key facts you should know before running it:

14 billion parameters (plus a smaller 1.5B variant).
60.6% Pass@1 on LiveCodeBench v5, about 8 percentage points above its base model (53.0%) and on par with o3-mini (Low) at 60.9%.
1936 Codeforces rating (95.3rd percentile of human competitors).
92.6% on HumanEval+ and 73.8% on AIME 2024 (cross-domain reasoning generalizes to math).
64K context at inference (trained on 16K→32K via iterative context lengthening; generalizes to 64K).
Fully open source: weights, training data (~24K problem–test pairs), training code, and logs are all public on Hugging Face.

Performance at a Glance

Model	LiveCodeBench v5	Codeforces Rating	HumanEval+	Parameters
DeepCoder-14B-Preview	60.6%	1936	92.6%	14B
DeepSeek-R1-Distill-Qwen-14B (base)	53.0%	1791	92.0%	14B
o3-mini-2025-01-31 (Low)	60.9%	1918	92.6%	Closed
DeepSeek-R1	62.8%	1948	92.6%	671B

Source: Hugging Face model card, April 2025. LiveCodeBench v5 window: 2024-08-01 to 2025-02-01.

System Requirements

Component	Minimum	Recommended
OS	Ubuntu 22.04 LTS	Ubuntu 24.04 LTS
CPU	4-core x86-64	8-core modern CPU
RAM	16 GB	32 GB
GPU VRAM (14B model)	10 GB (Q4 quantized)	16–24 GB (higher-precision quantization, longer context)
GPU VRAM (1.5B model)	4 GB	8 GB
NVIDIA Driver	531 (compute capability 5.0+)	550+ (latest stable)
Disk Space	20 GB free	50 GB free
Python	3.10	3.12

CPU-only note: You can run DeepCoder on CPU alone, but inference will be slow (roughly 1–5 tokens per second for the 14B model on a modern 8-core CPU). The 1.5B variant is the more practical choice for CPU-only machines.

Step-by-Step Installation Guide

Step 1 — Update System Packages

sudo apt update && sudo apt upgrade -y
sudo apt install -y curl git pciutils lshw python3 python3-pip python3-venv

Step 2 — Install NVIDIA Drivers (GPU users only)

Ollama bundles its own CUDA runtime, so you only need the NVIDIA driver — not the full CUDA Toolkit. The easiest method on Ubuntu 22.04/24.04 is the automatic installer:

# Let Ubuntu detect and install the recommended driver
sudo ubuntu-drivers autoinstall

# Or install a specific version (e.g. 570 for RTX 30xx/40xx):
sudo apt install nvidia-driver-570

sudo reboot

After reboot, confirm the driver loaded correctly:

nvidia-smi

You should see your GPU name, driver version, and a CUDA version line. Ollama requires driver version 531 or newer and a GPU with compute capability 5.0 or higher (GTX 750 Ti era and newer).
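
The same check can be scripted. The sketch below is a minimal example that assumes nvidia-smi is on the PATH; the helper names (driver_major, driver_is_supported, installed_driver_versions) are illustrative, and the threshold of 531 is simply the minimum quoted above.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 531  # minimum driver version quoted in this guide


def driver_major(version: str) -> int:
    """Return the major component of a driver string such as '570.86.15'."""
    return int(version.strip().split(".")[0])


def driver_is_supported(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True when the driver's major version meets the minimum."""
    return driver_major(version) >= minimum


def installed_driver_versions() -> list[str]:
    """Ask nvidia-smi for each GPU's driver version (empty list if unavailable)."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    versions = installed_driver_versions()
    if not versions:
        print("No NVIDIA driver detected (nvidia-smi missing or failing).")
    for version in versions:
        status = "OK" if driver_is_supported(version) else "too old"
        print(f"Driver {version}: {status}")
```

Only the major version is compared, which is enough for a 531-or-newer check.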

If you need the full CUDA Toolkit (only required if you plan to use vLLM or compile from source):

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit

Step 3 — Install Ollama

Ollama is the easiest way to serve DeepCoder locally. It handles model download, quantization selection, CUDA detection, and a local OpenAI-compatible API automatically.

curl -fsSL https://ollama.com/install.sh | sh

The installer automatically:

Downloads the Ollama binary for your architecture (amd64 or arm64).
Installs it as a systemd service so it starts on boot.
Detects your NVIDIA GPU and links its bundled CUDA runtime.

Verify Ollama is running:

ollama --version
sudo systemctl status ollama

Alternatively, install via Snap: sudo snap install ollama

Step 4 — Pull and Run DeepCoder

Pull the model (9.0 GB download for 14B; 1.1 GB for 1.5B):

# Full 14B model (recommended — requires ~10 GB VRAM or CPU fallback)
ollama pull deepcoder:14b

# Lightweight 1.5B model (works on machines with 4 GB VRAM or CPU)
ollama pull deepcoder:1.5b

Run an interactive session:

ollama run deepcoder:14b

Test with a coding prompt:

ollama run deepcoder:14b "Write a Python function that merges two sorted arrays in O(n) time."
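
To confirm from a script that the pull succeeded, you can ask the Ollama server which models it has. This is a rough sketch using only the standard library; it assumes the server is on its default address and relies on Ollama's GET /api/tags endpoint. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def is_installed(tag: str, tags_payload: dict) -> bool:
    """True when the given tag appears in the server's model list."""
    return tag in model_names(tags_payload)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally available models from the Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        payload = fetch_tags()
    except OSError as exc:  # covers connection errors and timeouts
        print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
    else:
        for tag in ("deepcoder:14b", "deepcoder:1.5b"):
            print(tag, "installed" if is_installed(tag, payload) else "missing")
```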

Step 5 — Add a Browser Interface with Open WebUI (Optional)

Open WebUI provides a full ChatGPT-style interface for your local models, including conversation history, model switching, and file upload. It connects to Ollama automatically.

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Select deepcoder:14b from the model dropdown and start chatting. No API key required.

Step 6 — Use DeepCoder via Python (Ollama API)

Ollama exposes a native REST API at http://localhost:11434 (an OpenAI-compatible variant is also available under /v1). The example below calls the native /api/chat endpoint directly from Python:

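import requests

def ask_deepcoder(prompt: str, model: str = "deepcoder:14b") -> str:
    """Send a single-turn chat request to the local Ollama server."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one JSON object instead of a stream
            "options": {
                "temperature": 0.6,
                "top_p": 0.95,
                # Large context for complex problems; the KV cache grows with
                # num_ctx, so lower this value if VRAM is tight.
                "num_ctx": 65536,
            },
        },
        timeout=600,  # reasoning models can take minutes to answer
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["message"]["content"]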

# Example usage
result = ask_deepcoder("Write a Python class for a thread-safe LRU cache.")
print(result)

Or use the official ollama Python package:

pip install ollama

import ollama

response = ollama.chat(
    model="deepcoder:14b",
    messages=[{"role": "user", "content": "Write a recursive binary search in Python."}],
    options={"temperature": 0.6, "top_p": 0.95}
)
print(response["message"]["content"])
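
Because DeepCoder can reason for a long time before answering, streaming the reply is often more pleasant than waiting for the full response. The following is a minimal sketch, assuming Ollama's default behaviour of returning one JSON object per line when "stream" is true; it uses only the standard library, and the function names are illustrative.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_content(ndjson_lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from the lines of a streamed /api/chat response."""
    for raw in ndjson_lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break


def stream_chat(prompt: str, model: str = "deepcoder:14b",
                url: str = "http://localhost:11434/api/chat") -> Iterator[str]:
    """POST a chat request with streaming enabled and yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "options": {"temperature": 0.6, "top_p": 0.95},
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=600) as resp:
        yield from iter_content(resp)  # the response iterates line by line


if __name__ == "__main__":
    try:
        for fragment in stream_chat("Write a recursive binary search in Python."):
            print(fragment, end="", flush=True)
        print()
    except OSError as exc:
        print(f"Could not reach Ollama: {exc}")
```

Keeping the line parsing in iter_content separate from the HTTP call makes it easy to test without a running server.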

Alternative: Run DeepCoder with vLLM (Production Deployments)

If you need higher throughput or multi-user serving, vLLM is the recommended production inference engine. It supports tensor parallelism across multiple GPUs and serves an OpenAI-compatible API.

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model agentica-org/DeepCoder-14B-Preview \
  --max-model-len 65536 \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 8000

The model card recommends these inference parameters for best results:

temperature: 0.6
top_p: 0.95
max_tokens: at least 64,000 (the model reasons at length for hard problems)
No system prompt — include all instructions in the user message.
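
As an illustration, a client for the vLLM server started above could apply these settings as follows. This is a sketch rather than an official client: it assumes the server is listening on port 8000 as in the command above and uses the standard OpenAI-style /v1/chat/completions route. The names build_request and complete are illustrative.

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # matches --port 8000


def build_request(prompt: str, max_tokens: int = 64000) -> dict:
    """Build a chat-completions payload that follows the recommended settings."""
    return {
        "model": "agentica-org/DeepCoder-14B-Preview",
        # No system message: all instructions live in the user turn.
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def complete(prompt: str, url: str = VLLM_URL, timeout: float = 1800.0) -> str:
    """Send the request and return the assistant's reply text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


if __name__ == "__main__":
    try:
        print(complete("Write a Python function that reverses a linked list."))
    except OSError as exc:
        print(f"Could not reach the vLLM server: {exc}")
```

Note that the prompt and max_tokens together must fit inside the server's --max-model-len, so with a 65,536-token limit and max_tokens of 64,000 the prompt has to stay short; lower max_tokens for longer prompts.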

Alternative: Run via Hugging Face Transformers

If you prefer a Python-native workflow without a separate server:

pip install torch transformers accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "agentica-org/DeepCoder-14B-Preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # distributes across available GPUs automatically
)

messages = [{"role": "user", "content": "Write a Python function to detect cycles in a linked list."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=4096,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

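Note: Loading the 14B model in bfloat16 requires approximately 28 GB of VRAM. To reduce this to ~8 GB at a small quality cost, load it in 4-bit by passing quantization_config=BitsAndBytesConfig(load_in_4bit=True) to from_pretrained (requires the bitsandbytes package).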
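
The 28 GB figure follows from simple arithmetic: parameter count times bytes per weight. The sketch below shows that calculation for the weights alone, treating the model as exactly 14 billion parameters (an approximation) and ignoring the KV cache, activations, and quantization metadata, which is why real-world numbers come out somewhat higher.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS_14B = 14e9  # nominal parameter count; an approximation

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(PARAMS_14B, bits):.0f} GB")
```

This prints roughly 28 GB, 14 GB, and 7 GB. The gap between 7 GB and the ~8 GB quoted for 4-bit loading comes from the overheads the estimate leaves out.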

How to Choose: Ollama vs vLLM vs Transformers

Method	Best for	VRAM needed (14B)	Setup complexity
Ollama	Personal use, quick start, CPU fallback	~10 GB (Q4 quantized)	Low — one command
Open WebUI + Ollama	Team use, multi-model chat UI	~10 GB	Low — Docker
vLLM	Production, high throughput, multi-GPU	~30 GB (bfloat16)	Medium — Python server
Transformers	Research, fine-tuning pipelines	28 GB (bf16) / ~8 GB (4-bit)	Medium — Python code

Common Pitfalls and Troubleshooting

GPU Not Detected by Ollama

Run nvidia-smi. If it fails, the driver is not loaded. Reinstall with:

sudo apt remove --purge 'nvidia-*'
sudo ubuntu-drivers autoinstall
sudo reboot

If nvidia-smi works but Ollama still uses CPU, check the Ollama logs:

journalctl -u ollama -n 50

You need driver version 531 or newer. Check with nvidia-smi | grep "Driver Version".

Out of VRAM / Model Falls Back to CPU

Ollama automatically offloads part or all of the model to the CPU if VRAM is insufficient. To see how a loaded model is split between GPU and CPU, check the PROCESSOR column while the model is loaded:

ollama ps

If VRAM is tight, switch to the 1.5B model (deepcoder:1.5b) or use a GGUF quantization from bartowski's GGUF repo (Q3_K_S fits in 8 GB VRAM).

Generation Is Very Slow

DeepCoder is a reasoning model: it generates long chains of thought before producing the final answer. For hard competitive programming problems, 2,000–8,000 output tokens is normal. If speed is the priority, cap the output length (num_predict in Ollama's options, max_tokens on OpenAI-compatible endpoints) or use the 1.5B variant.

Python ImportError for torch or transformers

Always use a virtual environment to isolate dependencies:

python3 -m venv deepcoder_env
source deepcoder_env/bin/activate
pip install torch transformers accelerate bitsandbytes

Ollama Port Already in Use

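Ollama listens on 127.0.0.1:11434 by default. If that port is taken, start the server on another port. The command below keeps the API bound to localhost; use 0.0.0.0 only if you intend to expose it on your network. Set the same OLLAMA_HOST value when running ollama pull or ollama run so the client can find the server.

OLLAMA_HOST=127.0.0.1:11435 ollama serve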

Model Download Interrupted

Ollama supports resumable downloads. Re-run ollama pull deepcoder:14b and it will resume from where it stopped.

What Was Removed from the Original Post and Why

Removed: git clone https://github.com/deepcode-ai/deepcoder.git — The deepcode-ai GitHub repository cited in the original post is a separate unrelated project. The actual DeepCoder-14B model by Agentica is at agentica-labs/agentica on GitHub and distributed via Hugging Face. Do not confuse the two.
Removed: from deepcoder import generate_code Python API — No such Python package exists for DeepCoder-14B. The model is accessed via Ollama's API, vLLM, or Hugging Face Transformers.
Removed: docker pull deepcode-ai/deepcoder:latest — No official DeepCoder Docker image exists on Docker Hub under that name.
Removed: separate CUDA Toolkit installation as a hard requirement — Ollama bundles its own CUDA runtime. The original post's requirement to sudo apt install nvidia-cuda-toolkit is unnecessary for Ollama users.
Updated: NVIDIA driver version — The original recommended nvidia-driver-525. The current minimum for Ollama is 531; driver 570 is the current recommended stable release for RTX 30xx/40xx.

DeepCoder in the 2026 Local LLM Landscape

As of April 2026, the local coding model ecosystem has moved fast. Community consensus (per Latent.Space April 2026 top local models thread) places Qwen3-Coder-Next as the top local coding model for pure benchmark performance, with DeepSeek V3.2 leading for general-purpose open-weight use.

DeepCoder-14B remains relevant because:

It is one of the few sub-20B models that achieves o3-mini-level code reasoning.
The MIT license has no usage restrictions — you can embed it in commercial products.
It runs entirely on a single consumer GPU (RTX 3090 / 4090 class), whereas larger alternatives like DeepSeek V3.2 require multi-GPU setups for comfortable inference.
The 1.5B variant is genuinely useful for code completion and simple generation tasks on laptops.

If your team needs to scale beyond a single GPU or serve many users concurrently, consider working with a team of vetted developers who have hands-on experience deploying inference infrastructure — Codersera's AI engineers specialize in exactly this kind of production LLM deployment.

FAQ

Can I run DeepCoder-14B without a GPU?

Yes. Ollama automatically falls back to CPU inference when no compatible GPU is found. Expect ~1–5 tokens per second on a modern 8-core CPU, which is slow but functional for low-frequency use. The 1.5B model is more practical for CPU-only machines.

What is the difference between DeepCoder and DeepSeek-Coder?

They are different models from different organizations. DeepCoder-14B is by Agentica/Together AI, fine-tuned from DeepSeek-R1-Distill-Qwen-14B using RL for code reasoning. DeepSeek-Coder and DeepSeek-Coder-V2 are base code models from DeepSeek AI. DeepCoder is focused on competitive programming-style reasoning; DeepSeek-Coder-V2 is a broader code completion and generation model.

Does DeepCoder work with VS Code or other IDEs?

Yes. Since Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1, any IDE extension that supports custom OpenAI endpoints works. Popular choices include Continue.dev, Cody (Sourcegraph), and Cursor (custom base URL). Set the model to deepcoder:14b and the base URL to your Ollama endpoint.

How much disk space does DeepCoder take?

The 14B GGUF model pulled by Ollama is 9.0 GB. The 1.5B model is 1.1 GB. Full-precision Hugging Face weights (bfloat16) for the 14B model are approximately 28 GB. Plan your disk budget accordingly.

What Ubuntu version should I use?

Ubuntu 24.04 LTS is recommended for new setups. Ubuntu 22.04 LTS is fully supported. Ubuntu 20.04 LTS reached end of standard support in April 2025; it will still work but is not recommended for new deployments.

Can I use DeepCoder in a commercial product?

Yes. The model is released under the MIT license with no restrictions on commercial use. Verify the license at the Hugging Face model card before deployment.

Why does the model output so much text before giving an answer?

DeepCoder uses chain-of-thought reasoning (inherited from DeepSeek-R1's training). It thinks through the problem step by step inside <think> tags before producing the final answer. This is intentional — the lengthy reasoning is what enables the high benchmark scores. You can strip the thinking section from the output in production if you only need the code.
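
Stripping the reasoning can be done with a small post-processing step. The helper below is an illustrative sketch (strip_reasoning is not part of any library). It assumes the reasoning is delimited by <think> and </think>, and it also handles the case where only the closing tag appears in the output, which can happen when the chat template supplies the opening tag as part of the prompt.

```python
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> reasoning and return only the final answer."""
    cleaned = _THINK_BLOCK.sub("", text)
    # If only a closing tag is present, keep what follows the last one.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[1]
    return cleaned.strip()


raw = "<think>Compare the heads, advance the smaller one.</think>\ndef merge(a, b): ..."
print(strip_reasoning(raw))  # prints: def merge(a, b): ...
```

If the model is cut off before it finishes thinking, no closing tag is present and the text is returned unchanged apart from whitespace trimming, so check for that case before treating the result as a final answer.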

What is the best quantization for a 16 GB VRAM GPU?

Use Q5_K_M from bartowski's GGUF builds — it fits in 11–12 GB VRAM with minimal quality loss. Ollama's default pull uses Q4 quantization (approximately 9 GB loaded), which also fits comfortably.
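
The VRAM guidance scattered through this guide can be condensed into a small lookup. The sketch below uses only the figures quoted in this article (30 GB for bfloat16 on vLLM, 12 GB for Q5_K_M, 10 GB for the default Ollama pull, 8 GB for Q3_K_S, 4 GB for the 1.5B model); the thresholds are rough and the function name is illustrative, so treat the result as a starting point rather than a guarantee.

```python
# (minimum VRAM in GB, suggestion) pairs, largest first.
# Thresholds are the approximate figures quoted in this guide.
OPTIONS = [
    (30.0, "bfloat16 weights (vLLM)"),
    (12.0, "Q5_K_M GGUF"),
    (10.0, "deepcoder:14b (Ollama default, Q4)"),
    (8.0, "Q3_K_S GGUF"),
    (4.0, "deepcoder:1.5b"),
]


def suggest_option(vram_gb: float) -> str:
    """Return the highest-precision option whose threshold fits in vram_gb."""
    for minimum, option in OPTIONS:
        if vram_gb >= minimum:
            return option
    return "CPU inference with deepcoder:1.5b"


for vram in (6, 8, 10, 16, 24, 48):
    print(f"{vram:>2} GB -> {suggest_option(vram)}")
```

Long context windows need extra memory for the KV cache, so leave headroom beyond these thresholds if you plan to raise num_ctx.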
