Run DeepHermes 3 on Linux: Complete Installation Guide (2026)

Last updated April 2026 — refreshed for current model versions, Ollama v0.22.0, and the full DeepHermes 3 model family.

DeepHermes 3 is Nous Research's hybrid reasoning model that lets you toggle between fast conversational responses and deep chain-of-thought reasoning using a single system prompt. This guide covers every path to running DeepHermes 3 on Linux — from a one-command Ollama install to manual GGUF loading with llama.cpp — with verified hardware requirements, real benchmark numbers, and troubleshooting for the most common failures.

What changed in 2026 (read this if you followed the 2025 version of this guide)

  • Model family expanded: DeepHermes 3 now has three variants: 3B (March 2025), 8B (February 2025), and 24B (March 2025, built on Mistral Small 3). The original guide only covered the 8B.
  • Correct Ollama model names: The original guide mistakenly used ollama run deepseek-r1:70b, which is a completely different model. The correct community Ollama name for the 8B is rjmalagon/deephermes-3-llama-3:8b-bf16.
  • Arch Linux section removed: The original guide opened with a full Arch Linux installation tutorial that has nothing to do with DeepHermes. That section has been replaced with distro-agnostic Ollama installation that works on Ubuntu, Debian, Fedora, and Arch.
  • Hermes 4 and 4.3 are now available: If you need production-grade reasoning at scale, Nous Research shipped Hermes 4 (2025, 14B / 70B / 405B) and Hermes 4.3 (August 2025, 36B, trained on the Psyche decentralized network). DeepHermes 3 remains a strong choice for local consumer hardware.
  • Ollama is now at v0.22.0 (April 28, 2026), with native "thinking" model support, built-in web search via OpenClaw, and Flash Attention v2.7 acceleration.
  • AMD GPU support requires ROCm v7 on Linux (was v6 in early 2025 guides).


TL;DR — which variant should you run?

| Variant | Parameters | Disk (Q4_K_M) | Minimum VRAM / RAM | Best for |
| --- | --- | --- | --- | --- |
| DeepHermes 3 3B | 3B | ~2 GB | 4 GB | Laptops, Raspberry Pi 5, edge devices |
| DeepHermes 3 8B | 8B | ~5 GB | 8 GB | Most desktops and laptops with a mid-range GPU |
| DeepHermes 3 24B | 24B (Mistral base) | ~14 GB | 24 GB VRAM or 32 GB RAM | High-quality reasoning on a workstation GPU (RTX 3090 / 4090) |

If you are unsure, start with the 8B. It runs entirely on GPU with 8 GB VRAM and delivers strong reasoning performance without the storage overhead of the 24B.
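The memory thresholds in the table can be expressed as a simple selection rule. The sketch below is illustrative only: the function name pick_variant is hypothetical, and the thresholds are copied from the table rather than measured.

```python
from typing import Optional

# Minimum memory per variant at Q4_K_M, copied from the table above.
# Each entry: (name, minimum VRAM in GB, minimum system RAM in GB for CPU-only use).
VARIANTS = [
    ("24B", 24, 32),
    ("8B", 8, 8),
    ("3B", 4, 4),
]


def pick_variant(vram_gb: float = 0, ram_gb: float = 0) -> Optional[str]:
    """Return the largest variant that fits in VRAM or, failing that, in RAM."""
    for name, min_vram, min_ram in VARIANTS:
        if vram_gb >= min_vram or ram_gb >= min_ram:
            return name
    return None  # not enough memory for any variant


print(pick_variant(vram_gb=8, ram_gb=16))  # 8B
```

The list is ordered from largest to smallest, so the first match is the most capable variant the machine can hold.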

What is DeepHermes 3?

DeepHermes 3 Preview was released by Nous Research on February 14, 2025 (8B) and expanded to 3B and 24B on March 13, 2025. It is one of the first open-weight models to unify two distinct response modes in a single checkpoint:

  • Intuitive mode — standard fast response, like a normal chat model.
  • Reasoning mode — the model produces a long internal monologue inside <think></think> tags before answering, spending up to 13,000 tokens on deliberation. This improves accuracy on math, logic, and multi-step planning problems.

The switch between modes is controlled entirely by the system prompt — no separate model download, no fine-tuning required. The 8B variant is built on Meta's Llama 3.1 8B. The 24B variant is built on Mistral Small 3. All variants support function calling, JSON-structured output, and 128K token context windows (via the Ollama Hermes 3 library).

Nous Research positioned DeepHermes 3 as a stepping stone toward Hermes 4, which shipped with 14B, 70B, and 405B variants in late 2025 — but DeepHermes 3 remains the most practical choice for running locally on consumer hardware.

Prerequisites

Before installing, verify your system meets these requirements:

  • OS: Any 64-bit Linux distribution — Ubuntu 22.04+, Debian 12+, Fedora 38+, Arch Linux, or RHEL/CentOS 9+.
  • RAM: 8 GB minimum for the 8B model. 16 GB recommended for comfortable multitasking. 32 GB for the 24B without a dedicated GPU.
  • Storage: 10 GB free for the 8B GGUF (Q4_K_M). 20 GB for the 24B. Allow headroom for Ollama itself (~1 GB).
  • GPU (optional but strongly recommended):
    • NVIDIA: Compute Capability 5.0 or higher, driver version 531 or newer. Run nvidia-smi to confirm.
    • AMD: ROCm v7 or higher on Linux. Install via amdgpu-install from AMD's ROCm docs.
    • Intel Arc: Experimental Vulkan support via OLLAMA_VULKAN=1.
  • Internet: Required to download models (5–14 GB depending on variant). The model runs offline once downloaded.

Step 1 — Install Ollama

Ollama (current stable: v0.22.0, released April 28, 2026) is the easiest way to run DeepHermes 3 locally. It handles model downloads, GGUF quantization selection, GPU detection, and an OpenAI-compatible REST API out of the box.

One-line install (all distros)

curl -fsSL https://ollama.com/install.sh | sh

The installer detects your GPU automatically and installs the appropriate CUDA or ROCm libraries. After installation completes, verify it worked:

ollama --version

Expected output: ollama version is 0.22.0 (or newer).
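If you script your setup, the version check can be automated. This is a minimal sketch: the helper names are hypothetical, and the minimum version is simply the release this guide was written against.

```python
import re
import subprocess
from typing import Optional, Tuple

MINIMUM = (0, 22, 0)  # the release this guide targets


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract the first x.y.z version number from the CLI output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_ollama_version() -> Optional[Tuple[int, ...]]:
    """Run `ollama --version`; return None if the binary cannot be executed."""
    try:
        result = subprocess.run(
            ["ollama", "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    return parse_version(result.stdout + result.stderr)


version = installed_ollama_version()
if version is None:
    print("ollama not found or version not recognized")
elif version >= MINIMUM:
    print("ollama version is recent enough")
else:
    print(f"ollama {version} is older than {MINIMUM}")
```

Versions are compared as integer tuples, so 0.9.0 correctly sorts below 0.22.0, which a plain string comparison would get wrong.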

Enable Ollama as a systemd service

The installer creates a systemd service automatically on systemd-based distros (Ubuntu, Debian, Fedora, RHEL). Verify it is running:

sudo systemctl status ollama

To start it manually if needed:

sudo systemctl enable --now ollama

Ollama listens on http://localhost:11434 by default. To expose it on your local network (for use with Open WebUI or other frontends), add the environment variable to the service:

sudo systemctl edit ollama

Add the following block and save:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Then restart the service:

sudo systemctl restart ollama
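To confirm that the service is reachable after changing OLLAMA_HOST, you can query the API's version endpoint. This is a sketch using only the Python standard library; the function name is hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def ollama_version(host: str = "localhost", port: int = 11434,
                   timeout: float = 2.0) -> Optional[str]:
    """Return the server's version string, or None if it cannot be reached."""
    url = f"http://{host}:{port}/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        return None  # connection refused, timeout, or an unexpected body
```

Call ollama_version() on the server itself, then ollama_version("<server-ip>") from another machine on the network. A None result from the second call usually points at a firewall rule or at the service not having been restarted.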

Verify GPU acceleration

After starting Ollama, run a small model to confirm GPU is active:

ollama run hermes3:3b "What is 2+2?"

In a second terminal, watch GPU utilization:

# NVIDIA
watch -n 1 nvidia-smi

# AMD
watch -n 1 rocm-smi

You should see GPU utilization spike during inference. If it stays at 0%, your GPU drivers are not configured correctly — see the Troubleshooting section.
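For NVIDIA cards, the same check can be scripted instead of watched by eye. The sketch below assumes nvidia-smi is on the PATH; the 10% threshold and the function names are arbitrary choices for illustration.

```python
import shutil
import subprocess
from typing import List, Tuple

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(output: str) -> List[Tuple[int, int]]:
    """Parse lines like '87, 5120' into (utilization %, memory used in MiB)."""
    stats = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            stats.append((int(fields[0]), int(fields[1])))
        except (ValueError, IndexError):
            continue  # skip lines the driver reports as not available
    return stats


def gpu_is_busy(threshold: int = 10) -> bool:
    """True if any NVIDIA GPU reports utilization above the threshold."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return False
    return any(util > threshold for util, _ in parse_gpu_stats(result.stdout))
```

Run gpu_is_busy() while a prompt is generating. A False result during inference suggests the model is running on the CPU.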

Step 2 — Download and Run DeepHermes 3

Via Ollama (community model)

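DeepHermes 3 8B is available on the Ollama registry as a community upload of the Nous Research weights. Run it directly; Ollama downloads it automatically on first use. Note that this tag ships unquantized BF16 weights, so it needs roughly 16 GB of VRAM or RAM. On an 8 GB GPU, use the Q4_K_M GGUF described under "Via GGUF and llama.cpp" below.

# 8B variant, BF16 weights (roughly 16 GB download)
ollama run rjmalagon/deephermes-3-llama-3:8b-bf16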

To remove a model when you no longer need it:

ollama rm rjmalagon/deephermes-3-llama-3:8b-bf16

Via the official Hermes 3 Ollama library

The official hermes3 library on Ollama includes Hermes 3 (a close relative of DeepHermes 3, sharing the same base dataset and chat format) in four sizes. Use this if you want a reliably-updated, officially-maintained model:

| Command | Size | VRAM needed |
| --- | --- | --- |
| ollama run hermes3:3b | 2.0 GB | 4 GB |
| ollama run hermes3:8b | 4.7 GB | 8 GB |
| ollama run hermes3:70b | 40 GB | 48 GB (multi-GPU) |
| ollama run hermes3:405b | 229 GB | Server-grade only |

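All hermes3 variants use the same 128K context window, chat format, and function-calling schema. Note that Hermes 3 itself was not trained with the reasoning toggle (see the FAQ), so the deep-thinking system prompt is not guaranteed to produce <think> blocks with these models.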

Via GGUF and llama.cpp (advanced)

If you prefer to manage model files yourself or need a specific quantization level, download the GGUF files directly from Hugging Face and run them with llama.cpp:

# Install build dependencies (Ubuntu/Debian example)
sudo apt install build-essential cmake git libcurl4-openssl-dev libssl-dev

# Build llama.cpp from source (the project builds with CMake; the old Makefile build is gone)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # needs the CUDA toolkit; omit -DGGML_CUDA=ON for CPU-only
cmake --build build --config Release -j$(nproc)

# Download Q4_K_M quantization (~5 GB for 8B)
wget https://huggingface.co/NousResearch/DeepHermes-3-Llama-3-8B-Preview-GGUF/resolve/main/DeepHermes-3-Llama-3-8B-Preview.Q4_K_M.gguf

# Run (binaries are written to build/bin)
./build/bin/llama-cli -m DeepHermes-3-Llama-3-8B-Preview.Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 35 \
  -p "You are Hermes, an AI assistant"

Via vLLM (high-throughput serving)

For production use — running DeepHermes 3 as an API endpoint for multiple users or applications — vLLM provides significantly higher throughput than Ollama:

pip install vllm
vllm serve NousResearch/DeepHermes-3-Llama-3-8B-Preview \
  --tensor-parallel-size 1 \
  --max-model-len 8192

This starts an OpenAI-compatible server on port 8000.
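Because the server speaks the OpenAI chat-completions protocol, any HTTP client can talk to it. The sketch below uses only the standard library and assumes the default address http://localhost:8000; the helper names and the max_tokens value are illustrative choices.

```python
import json
import urllib.request

MODEL = "NousResearch/DeepHermes-3-Llama-3-8B-Preview"


def build_chat_request(prompt: str, system: str,
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": 512,  # arbitrary cap for this example
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, send(build_chat_request("Hello", "You are Hermes, an AI assistant.")) returns the reply as a string.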

Step 3 — Toggle Reasoning Mode On and Off

This is DeepHermes 3's defining feature. The model behavior changes entirely based on the system prompt you provide.

Reasoning mode ON (deep thinking)

Use this system prompt when you need the model to work through a problem carefully:

You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.

With this prompt active, DeepHermes 3 will produce output like:

<think>
Let me carefully consider the problem...
[up to 13,000 tokens of internal deliberation]
</think>

[final answer]
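Applications usually want to show only the final answer and keep the reasoning for logging. A minimal way to separate the two, assuming the output contains at most one <think> block, might look like this (the function name is hypothetical):

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' if no <think> block exists."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after the closing tag
    return reasoning, answer


sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print(answer)  # The answer is 4.
```

If generation stops at the token limit before the closing tag is produced, no block is matched and the whole text is returned as the answer, so callers may want to check for an unclosed <think> separately.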

Reasoning mode OFF (fast response)

Use this system prompt for conversational queries, simple lookups, or when you want speed over depth:

You are Hermes, an AI assistant. Be concise and helpful.
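In application code the toggle reduces to choosing one of the two system prompts per request. A small sketch (the helper name build_messages is hypothetical; the prompt texts are the ones given above):

```python
REASONING_PROMPT = (
    "You are a deep thinking AI, you may use extremely long chains of thought "
    "to deeply consider the problem and deliberate with yourself via systematic "
    "reasoning processes to help come to a correct solution prior to answering. "
    "You should enclose your thoughts and internal monologue inside <think> "
    "</think> tags, and then provide your solution or response to the problem."
)
FAST_PROMPT = "You are Hermes, an AI assistant. Be concise and helpful."


def build_messages(user_text: str, reasoning: bool) -> list:
    """Return a chat message list with the system prompt for the chosen mode."""
    system = REASONING_PROMPT if reasoning else FAST_PROMPT
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_text},
    ]


messages = build_messages("Is 221 a prime number?", reasoning=True)
print(messages[0]["content"][:28])  # You are a deep thinking AI, 
```

The resulting list can be passed as the messages field to any chat API that accepts role-tagged messages.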

Using the reasoning toggle in Python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "NousResearch/DeepHermes-3-Llama-3-8B-Preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Reasoning mode ON
messages = [
    {
        "role": "system",
        "content": "You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem."
    },
    {"role": "user", "content": "Explain the trolley problem and analyze its ethical implications."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=4096,  # increase to 13000 for hard problems
    temperature=0.8,
    repetition_penalty=1.1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Using the reasoning toggle via Ollama API

curl http://localhost:11434/api/chat -d '{
  "model": "rjmalagon/deephermes-3-llama-3:8b-bf16",
  "messages": [
    {
      "role": "system",
      "content": "You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem."
    },
    {
      "role": "user",
      "content": "What is the optimal algorithm for sorting a nearly-sorted array?"
    }
  ]
}'
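The same request can be made from Python with the standard library. This sketch sets "stream" to false so the reply arrives as a single JSON object rather than a stream of chunks; the helper names and the timeout are illustrative choices.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "rjmalagon/deephermes-3-llama-3:8b-bf16"


def chat_payload(system: str, user: str) -> dict:
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": MODEL,
        "stream": False,  # return one JSON object instead of a chunk stream
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }


def chat(system: str, user: str) -> str:
    """Send the request to the local Ollama server and return the reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(chat_payload(system, user)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Reasoning mode can take minutes on slow hardware, hence the long timeout.
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)["message"]["content"]
```

Pass the reasoning-mode system prompt as the first argument to chat() to get a <think> block in the returned text.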

The Full DeepHermes 3 Model Family

As of April 2026, Nous Research has released three DeepHermes 3 variants:

| Model | Base | Release | Availability | Context |
| --- | --- | --- | --- | --- |
| DeepHermes-3-Llama-3-8B-Preview | Meta Llama 3.1 8B | Feb 14, 2025 | Hugging Face | 131,072 tokens |
| DeepHermes-3-Llama-3-3B-Preview | Meta Llama 3.2 3B | Mar 13, 2025 | Hugging Face | 131,072 tokens |
| DeepHermes-3-Mistral-24B-Preview | Mistral Small 3 24B | Mar 13, 2025 | OpenRouter | 128K tokens |

All three share the same hybrid-reasoning design: the same system prompt for toggling, the same function-calling schema, and the same Llama-Chat format for multi-turn conversations. The 24B is the highest-capability variant but requires a workstation GPU or CPU-only inference with 32+ GB RAM (expect ~3–5 tokens/second CPU-only).

Performance and Benchmarks

DeepHermes 3 8B was evaluated by Nous Research against Llama 3.1 8B Instruct and Hermes 3 across standard benchmarks. Key findings as of the February 2025 model card:

  • With reasoning mode ON: significant improvement on math and logical reasoning benchmarks compared to Llama 3.1 8B Instruct, measured using HuggingFace's open-r1 evaluation suite.
  • With reasoning mode OFF: matches or outperforms Llama 3.1 8B Instruct on standard conversational benchmarks, indicating the reasoning training did not degrade normal-mode quality.
  • MATH benchmark: DeepHermes 3 8B scores approximately 67% — lower than DeepSeek R1-distilled models (89.1%) but competitive for an 8B model with toggleable reasoning.
  • Token speed (real-world): MacBook Pro M4 Max achieves ~29 tokens/second. RTX 4090 achieves up to 90+ tokens/second. RTX 3080 (10 GB) achieves 35–45 tokens/second for the Q4 8B variant.
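To measure token speed on your own hardware, you can use the timing fields that Ollama includes in its final response object: eval_count (tokens generated) and eval_duration (in nanoseconds). A minimal sketch, with made-up sample numbers:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration fields."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)


# Illustrative values only: 290 tokens generated in 10 seconds.
sample = {"eval_count": 290, "eval_duration": 10_000_000_000}
print(tokens_per_second(sample))  # 29.0
```

This measures generation only. Prompt processing is reported separately in prompt_eval_count and prompt_eval_duration.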

For current MMLU-Pro scores across all local models, see the Artificial Analysis MMLU-Pro leaderboard.

How to Choose: DeepHermes 3 vs Alternatives

DeepHermes 3 is not always the right tool. Here is an honest comparison:

| Use case | Best choice | Why |
| --- | --- | --- |
| Consumer laptop, 8 GB RAM, no GPU | DeepHermes 3 3B (or hermes3:3b if you do not need the reasoning toggle) | Smallest footprint; DeepHermes 3 3B still supports the reasoning toggle |
| Desktop GPU 8–12 GB VRAM | DeepHermes 3 8B (Q4_K_M) | Best quality/speed tradeoff on mid-range hardware |
| Coding and coding only | Qwen 3 Coder or DeepSeek V3 | Code-specialized models outperform general models on HumanEval |
| Maximum reasoning quality, no hardware limit | Hermes 4.3 36B or Llama 4 Scout | Newer models with larger post-training corpus |
| Function calling at scale (production API) | Hermes 3 70B via vLLM | Better function-calling reliability at larger scale |
| Privacy-sensitive, offline use | DeepHermes 3 8B (any quantization) | 100% local, no data leaves your machine |

Step 4 — Managing Models

List all downloaded models:

ollama list

Remove a model to free disk space:

ollama rm rjmalagon/deephermes-3-llama-3:8b-bf16

Pull a model without running it (useful for pre-staging):

ollama pull rjmalagon/deephermes-3-llama-3:8b-bf16

Show model details including quantization and parameter count:

ollama show rjmalagon/deephermes-3-llama-3:8b-bf16
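The same inventory is available programmatically through the /api/tags endpoint, which returns a JSON object with a models array. The sketch below assumes the default local address; the helper names and the sample JSON are illustrative.

```python
import json
import urllib.request


def summarize_models(tags_json: str) -> list:
    """Return (name, size in GB) pairs from an /api/tags response body."""
    models = json.loads(tags_json).get("models", [])
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in models]


def fetch_models(base_url: str = "http://localhost:11434") -> list:
    """Query a running Ollama server for its downloaded models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return summarize_models(response.read().decode("utf-8"))


# Illustrative response body, not captured from a real server.
sample = '{"models": [{"name": "hermes3:8b", "size": 4700000000}]}'
print(summarize_models(sample))  # [('hermes3:8b', 4.7)]
```

This is handy for a script that checks whether a model is already present before calling ollama pull.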

Optional — Open WebUI (Browser Interface)

If you prefer a ChatGPT-like browser interface rather than terminal interaction, install Open WebUI alongside Ollama:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Select DeepHermes 3 from the model dropdown, and paste the reasoning-mode system prompt into the system message field to enable deep thinking.

Troubleshooting

GPU not detected / running on CPU only

  • NVIDIA: Confirm nvidia-smi works. If Ollama was installed before your NVIDIA drivers, reinstall Ollama: curl -fsSL https://ollama.com/install.sh | sh. Verify with ollama ps while a model is loaded: the PROCESSOR column should read 100% GPU.
  • AMD: Ensure ROCm v7 is installed. Run rocm-smi to check. On GPUs that ROCm does not officially support, set HSA_OVERRIDE_GFX_VERSION in the Ollama service environment to the nearest supported architecture (for example, 10.3.0 for RDNA 2 cards) before restarting.
  • After driver install: Always reboot before testing GPU detection. Kernel module changes require a full restart.

Out of memory (OOM) / model crashes

  • Switch to a lower quantization. For the 8B model, try Q4_K_M instead of BF16 — this reduces VRAM from ~16 GB to ~5 GB with minimal quality loss.
  • Set OLLAMA_NUM_PARALLEL=1 to limit concurrent requests and reduce peak VRAM usage.
  • Set OLLAMA_MAX_LOADED_MODELS=1 so that only one model is kept in memory at a time. This matters most for CPU-only inference, where every loaded model competes for system RAM.
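The size difference between quantization levels follows from bits per weight. The estimate below is a rough sketch: the bits-per-weight values for the quantized formats are approximate averages assumed for illustration, and the result covers the weights only, not the KV cache or runtime overhead.

```python
# Approximate bits per weight. The quantized values are rough averages and an
# assumption of this sketch, not figures taken from the guide.
BITS_PER_WEIGHT = {"BF16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the model weights alone, in GB."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in ("BF16", "Q8_0", "Q4_K_M"):
    print(f"8B at {quant}: about {weights_gb(8, quant):.1f} GB")
```

For the 8B model this gives about 16 GB at BF16 and just under 5 GB at Q4_K_M, in line with the figures quoted above. Leave extra room for the context: longer context windows need a larger KV cache.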

Reasoning mode not activating

  • The reasoning toggle is activated by the system prompt only, not by a user message. Ensure your API call or chat client is setting the system role message correctly.
  • Some chat UIs ignore system prompts. If the model never produces <think> tags, check that the system prompt is being sent in the API payload.
  • Reasoning may fail to persist in very long conversations if the system prompt rolls out of the context window. For long sessions, periodically re-inject the system prompt or use a shorter context.
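One way to keep the system prompt from rolling out of context is to trim old turns yourself and always pin the system message at the front. A minimal sketch, assuming the conversation is a list of role-tagged messages (the function name and the turn limit are hypothetical):

```python
def trim_history(messages: list, max_turns: int) -> list:
    """Keep the first system message plus the most recent max_turns messages."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    if max_turns <= 0:
        return system
    return system + rest[-max_turns:]


history = [{"role": "system", "content": "reasoning prompt"}]
for i in range(6):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

trimmed = trim_history(history, max_turns=2)
print([m["content"] for m in trimmed])
# ['reasoning prompt', 'message 4', 'message 5']
```

Counting messages is a crude proxy for token count. A production version would measure tokens with the model's tokenizer and trim to fit the context size.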

Slow inference (<5 tokens/second)

  • Confirm GPU is active (see above). CPU-only inference is expected to be slow.
  • Enable Flash Attention if your GPU supports it: add OLLAMA_FLASH_ATTENTION=1 to the Ollama service environment.
  • Use Q4_K_M quantization — it is typically faster than Q8 while producing only marginally lower quality output.
  • For NVIDIA RTX 4090, expect 90+ tokens/second with the 8B Q4 model. RTX 3080 (10 GB): 35–45 tokens/second.

Model download fails / network errors

  • Ollama model downloads are resumable. If a download fails, re-run ollama pull — it picks up where it left off.
  • For air-gapped environments, download the GGUF file on a connected machine and use ollama create with a Modelfile to import it locally.
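For the air-gapped import, the Modelfile only needs a FROM line pointing at the local GGUF. SYSTEM and PARAMETER lines are optional. The sketch below generates one; the function name, the file paths, and the model name deephermes3-local are placeholders chosen for illustration.

```python
from pathlib import Path


def render_modelfile(gguf_path: str, system_prompt: str, num_ctx: int = 8192) -> str:
    """Return Modelfile text that imports a local GGUF with a default system prompt."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )


text = render_modelfile(
    "./DeepHermes-3-Llama-3-8B-Preview.Q4_K_M.gguf",
    "You are Hermes, an AI assistant. Be concise and helpful.",
)
print(text)
# To write it to disk: Path("Modelfile").write_text(text)
```

After writing the file, run ollama create deephermes3-local -f Modelfile in the same directory, then ollama run deephermes3-local. If the chat format is not picked up from the GGUF metadata, you may also need a TEMPLATE block.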

What was removed from the original guide and why

The original version of this post included a 600-line Arch Linux installation walkthrough (disk partitioning with gdisk, GRUB bootloader setup, locale configuration, etc.). This was removed because:

  • Arch Linux is not required to run DeepHermes 3. Ollama works identically on Ubuntu, Debian, Fedora, and every other major Linux distribution.
  • The Arch installation steps were generic and predated the current Arch wiki, which is a better reference for Arch-specific setup.
  • The original guide also incorrectly used ollama run deepseek-r1:70b as the DeepHermes 3 run command — DeepSeek R1 is an entirely different model from a different organization.

FAQ

Is DeepHermes 3 free to use?

Yes. All DeepHermes 3 variants are open-weight models. The 8B and 3B are released under Meta's Llama community license, and the 24B is built on Mistral Small 3, which is Apache 2.0 licensed. You can download, run, and modify them locally at no cost, subject to those license terms. The Nous Research Portal API has its own pricing if you prefer API access over local inference.

Does DeepHermes 3 require an internet connection after setup?

No. Once the model is downloaded via ollama pull, it runs entirely offline. No data is sent to any external server during inference.

What is the difference between DeepHermes 3 and Hermes 3?

Hermes 3 (August 2024) is a pure instruction-following model without a reasoning mode. DeepHermes 3 (February 2025) adds toggleable chain-of-thought reasoning via a system prompt. Both share the same base training data (the Hermes datamix), but DeepHermes 3 was additionally trained on approximately 150,000 chain-of-thought reasoning examples.

Can I use DeepHermes 3 for function calling?

Yes. All three DeepHermes 3 variants support function calling (tool use) using the same JSON schema format as Hermes 3. See the Ollama Hermes integration docs for the schema format. Note that combining reasoning mode with function calling can improve accuracy but results may be inconsistent — reasoning mode may produce extra output that needs to be parsed out of the tool call response.
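When you work with raw model output (for example through llama.cpp or transformers), Hermes-style tool calls appear as JSON wrapped in <tool_call> tags. The sketch below strips any <think> block first and then extracts the calls. The tool name get_weather is hypothetical, and the code assumes each call is a single JSON object.

```python
import json
import re

THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Drop reasoning blocks, then parse every <tool_call> JSON object."""
    visible = THINK.sub("", text)
    calls = []
    for raw in TOOL_CALL.findall(visible):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of failing the whole turn
    return calls


sample = (
    "<think>I should call the weather tool.</think>\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Oslo"}}\n</tool_call>'
)
print(extract_tool_calls(sample))
# [{'name': 'get_weather', 'arguments': {'city': 'Oslo'}}]
```

Servers that parse tool calls for you typically return them in a structured field instead, in which case this step is unnecessary.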

What happened to the 70B DeepHermes 3 model I saw mentioned?

No 70B DeepHermes 3 variant was officially released. The original guide incorrectly referenced a 70B model — the 70B mentioned was actually deepseek-r1:70b, a completely separate model. The largest official DeepHermes 3 variant is the 24B. If you need 70B+ parameter reasoning on consumer hardware, consider Hermes 3 70B or Llama 4 Scout.

How do I run DeepHermes 3 24B if I only have 16 GB RAM?

You cannot run the 24B model comfortably with 16 GB RAM. Even at Q4_K_M quantization it requires approximately 14 GB, leaving little headroom for the OS. Use the 8B model instead, which runs at Q4_K_M with ~5 GB, or upgrade RAM before attempting the 24B. CPU-only inference of the 24B with 32 GB RAM is possible, but expect only a few tokens per second (roughly 2–5).

Can I use DeepHermes 3 with Open WebUI?

Yes. Once Ollama is running and the model is pulled, Open WebUI automatically lists it in the model selector. To use reasoning mode in the UI, set the system prompt in Open WebUI's system prompt settings (gear icon → System Prompt) before starting a conversation.

Is DeepHermes 3 still the best local reasoning model in 2026?

For an 8B model on consumer hardware, DeepHermes 3 8B is still a strong choice, but the landscape has evolved. Hermes 4.3 (36B, August 2025), Llama 4 Scout, and Qwen 3.5 all offer stronger reasoning at various size points. For maximum reasoning quality that still runs locally, Hermes 4.3 is Nous Research's current recommendation. DeepHermes 3 remains the best option at the sub-10B parameter range for hardware-constrained setups.

References & Further Reading

  1. NousResearch/DeepHermes-3-Llama-3-8B-Preview — Hugging Face model card
  2. NousResearch/DeepHermes-3-Llama-3-3B-Preview — Hugging Face model card
  3. DeepHermes 3 Mistral 24B Preview — OpenRouter
  4. Ollama GitHub Releases — official changelog
  5. Ollama GPU Hardware Support Documentation
  6. MMLU-Pro Benchmark Leaderboard — Artificial Analysis
  7. VentureBeat: Nous Research launches DeepHermes-3
  8. Hermes 4 — Nous Research
