Run DeepHermes 3 on macOS: Step-by-Step Installation Guide (2026)

Last updated April 2026 — refreshed for current model versions, Ollama v0.22, and macOS Sequoia compatibility.

DeepHermes 3 is NousResearch's hybrid reasoning model that lets you toggle between fast intuitive responses and extended chain-of-thought reasoning within a single model. This guide covers every practical method for running it locally on macOS in 2026 — from a two-command Ollama install to a direct Hugging Face setup using Python — with real hardware numbers and working code throughout.

What changed in 2026 (read this if you used an earlier guide)

  • Ollama is now the fastest path on macOS. ollama run hermes3 works in under five minutes with no Python setup; the reasoning-capable DeepHermes 3 build takes one extra Modelfile step, covered in Step 2 of the Ollama install below. Ollama v0.22 (April 2026) requires macOS 14 Sonoma or later.
  • macOS 12 (Monterey) is no longer a valid target for Ollama. Ollama v0.19+ requires macOS 14+. If you are on Monterey or Ventura, upgrade before following the Ollama steps.
  • Python 3.8 is end-of-life and unsupported by current PyTorch and Transformers builds. Python 3.14 is the current stable release (October 2025), but Python 3.11 or 3.12 gives the best compatibility with transformers and torch.
  • A 24B model variant now exists. DeepHermes 3 Mistral 24B Preview (Apache 2.0 license) is available alongside the original 8B and the smaller 3B. At Q4_K_M quantization its weights are about 14 GB, so it fits in 16 GB of unified memory only barely; 32 GB is the comfortable target.
  • Ollama now has an MLX backend in preview (since v0.19), delivering roughly 2× faster generation on M3/M4/M5 chips versus the older llama.cpp Metal path for some models.
  • The "System Preferences" path is gone. macOS 13 Ventura renamed it to "System Settings"; all Privacy & Security controls are in System Settings → Privacy & Security on macOS 14+.

TL;DR — Which Method to Use

Goal | Recommended method | Time to first response
Just try the model now | Ollama CLI | ~5 min
GUI, no command line | LM Studio | ~10 min
Integrate into a Python project | Hugging Face Transformers | 20–40 min (model download)
OpenAI-compatible REST API locally | Ollama serve + any HTTP client | ~5 min
Maximum control over throughput, M3/M4/M5 | llama.cpp with Metal offload | 15–30 min

About DeepHermes 3

DeepHermes 3 Preview is Nous Research's flagship model series, first released in February 2025. It is distilled from DeepSeek R1, which gives it genuine chain-of-thought reasoning alongside standard conversational capabilities — all in one model, no separate model swap required.

Model Variants

Model | Parameters | Base | License | Min RAM (Q4_K_M) | Hugging Face ID
DeepHermes 3 Llama 3B Preview | 3B | Llama 3.2 | Llama 3 | 4 GB | NousResearch/DeepHermes-3-Llama-3-3B-Preview
DeepHermes 3 Llama 8B Preview | 8B | Llama 3.1 | Llama 3 | 8 GB | NousResearch/DeepHermes-3-Llama-3-8B-Preview
DeepHermes 3 Mistral 24B Preview | 24B | Mistral Small 24B | Apache 2.0 | 16 GB | NousResearch/DeepHermes-3-Mistral-24B-Preview

GGUF-quantized versions of all three variants are available on Hugging Face directly from NousResearch, as well as community re-quantizations from bartowski and DevQuasar. Ollama's hermes3 library entry additionally offers 70B and 405B sizes (based on the Hermes 3 non-reasoning line).

The Reasoning Toggle

DeepHermes 3's signature feature is mode switching via system prompt. Sending the "deep thinking AI" system prompt activates chains of thought enclosed in <think> tags, with up to 13,000 thinking tokens available for complex problems. Sending a plain system prompt (or none) reverts to fast conversational mode. No model reload required.
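
The toggle can be sketched in a few lines of Python. This is a minimal illustration, assuming the OpenAI-style message list used by the API examples in this guide; the helper name build_messages is hypothetical, and the two prompts are the ones quoted in the guide's own examples.

```python
# Reasoning prompt as quoted in this guide's reasoning-mode examples.
REASONING_SYSTEM_PROMPT = (
    "You are a deep thinking AI, you may use extremely long chains of thought to deeply consider "
    "the problem and deliberate with yourself via systematic reasoning processes to help come to a "
    "correct solution prior to answering. You should enclose your thoughts and internal monologue "
    "inside <think> </think> tags, and then provide your solution or response to the problem."
)
PLAIN_SYSTEM_PROMPT = "You are Hermes, an AI assistant"


def build_messages(user_prompt: str, reasoning: bool = False) -> list[dict]:
    """Return a chat message list; only the system prompt differs between modes."""
    system = REASONING_SYSTEM_PROMPT if reasoning else PLAIN_SYSTEM_PROMPT
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


fast = build_messages("What is 17 * 23?")
deep = build_messages("What is 17 * 23?", reasoning=True)
print(fast[0]["content"])
print(deep[0]["content"][:30])
```

Because the switch lives entirely in the message list, the same loaded model can serve both kinds of request back to back.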


Prerequisites and System Requirements

Hardware

  • Apple Silicon (M1 and later): All models benefit from Metal GPU acceleration. Unified memory means the GPU and CPU share RAM, so larger memory configs directly translate to larger runnable models. An M3 Pro with 18 GB handles the 8B model at full speed; 36 GB handles the 24B model.
  • Intel Mac: Supported by Ollama and llama.cpp (CPU-only on Intel; no Metal GPU acceleration). Expect 3–8 tokens/s on an 8B Q4_K_M model, versus 30–50 tokens/s on M2/M3 Apple Silicon.
  • Recommended RAM: 8 GB for the 3B model, 16 GB for the 8B model (8 GB is technically possible but leaves no headroom), 32 GB for the 24B model. These figures leave working room above the bare minimums listed in the Model Variants table.
  • Disk space: Allow 5–15 GB per model depending on variant and quantization. GGUF Q4_K_M at 8B occupies roughly 4.7 GB.
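
The RAM and disk figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate, not a measurement. The function name is hypothetical, and the 4.7 bits-per-weight value for Q4_K_M is an assumption back-derived from the ~4.7 GB file size quoted for the 8B model; real GGUF files vary slightly.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


BF16_BITS = 16.0    # exact: two bytes per weight
Q4_K_M_BITS = 4.7   # assumption: average for Q4_K_M, derived from the 8B file size

for name, params in [("3B", 3.0), ("8B", 8.0), ("24B", 24.0)]:
    bf16 = weights_size_gb(params, BF16_BITS)
    q4 = weights_size_gb(params, Q4_K_M_BITS)
    print(f"{name}: BF16 ~{bf16:.1f} GB, Q4_K_M ~{q4:.1f} GB")
```

The result covers weights only. The KV cache (which grows with context length), the runtime, and macOS itself all need memory on top, which is why the recommended RAM is well above the raw file size.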

Operating System

  • Ollama path: macOS 14 Sonoma or later (required as of Ollama v0.19). macOS 15 Sequoia fully supported.
  • Python/Transformers path: macOS 12 Monterey or later technically works, but Python 3.11+ packages are better tested on Ventura and Sonoma. Use Sonoma if possible.

Software

  • Homebrew — the package manager used throughout this guide. Install from brew.sh if not already present.
  • Python 3.11 or 3.12 — for the Transformers path. Python 3.8 is end-of-life; do not use it.
  • Xcode Command Line Tools — required by Homebrew and by llama.cpp builds. Install with xcode-select --install.

Method 1: Ollama (Fastest Path)

Ollama wraps llama.cpp (and the new MLX backend for Apple Silicon) in a Docker-like CLI. It manages model downloads, quantization selection, and the inference server. This is the fastest path from zero to a running DeepHermes 3 session.

Step 1: Install Ollama

brew install ollama

Alternatively, download the Ollama.dmg from ollama.com/download/mac and drag it to Applications. The GUI app and the CLI are bundled together.

Verify the install:

ollama --version
# ollama version 0.22.0 (or current)

Step 2: Pull and Run DeepHermes 3

Ollama's library hosts Hermes 3 (the non-reasoning predecessor), which shares the same prompt format, so the library tags below are the quickest way to get a Hermes-family model answering. The reasoning-capable DeepHermes 3 8B is not one of these tags; it needs a custom Modelfile pointing at the GGUF, shown after them:

# Easiest: run the Hermes 3 8B (standard Hermes line)
ollama run hermes3

# Run the smaller 3B variant
ollama run hermes3:3b

# Largest size that is practical locally in the standard Hermes 3 line
# (this is Hermes 3 70B, not the DeepHermes 3 24B variant)
ollama run hermes3:70b

To run DeepHermes 3 Llama 8B Preview specifically (the reasoning model), create a Modelfile pointing at the GGUF:

# Download the GGUF from Hugging Face first
huggingface-cli download NousResearch/DeepHermes-3-Llama-3-8B-Preview-GGUF \
  DeepHermes-3-Llama-3-8B-Preview-Q4_K_M.gguf --local-dir ~/models/deephermes3

# Create a Modelfile
cat > ~/models/deephermes3/Modelfile <<'EOF'
FROM ./DeepHermes-3-Llama-3-8B-Preview-Q4_K_M.gguf
PARAMETER temperature 0.8
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM "You are Hermes, an AI assistant"
EOF

# Register and run it
ollama create deephermes3-8b -f ~/models/deephermes3/Modelfile
ollama run deephermes3-8b

Install the Hugging Face CLI if needed: pip install huggingface_hub.
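
If you would rather have a model that always starts in reasoning mode, the Modelfile can carry the deep-thinking prompt instead of the plain one. The sketch below only renders the file text in Python; the function names and the Modelfile.reasoning filename are hypothetical, and it assumes the same GGUF filename and parameters as the Modelfile above. Triple-quoted SYSTEM strings are standard Modelfile syntax for multi-line or punctuation-heavy prompts.

```python
from pathlib import Path

REASONING_SYSTEM_PROMPT = (
    "You are a deep thinking AI, you may use extremely long chains of thought to deeply consider "
    "the problem and deliberate with yourself via systematic reasoning processes to help come to a "
    "correct solution prior to answering. You should enclose your thoughts and internal monologue "
    "inside <think> </think> tags, and then provide your solution or response to the problem."
)


def render_modelfile(gguf_name: str, system_prompt: str, num_ctx: int = 8192) -> str:
    """Render Modelfile text that mirrors the parameters used in this guide."""
    return (
        f"FROM ./{gguf_name}\n"
        "PARAMETER temperature 0.8\n"
        "PARAMETER top_p 0.9\n"
        f"PARAMETER num_ctx {num_ctx}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )


def save_modelfile(text: str, path: str) -> Path:
    """Write the Modelfile next to the GGUF so the relative FROM path resolves."""
    target = Path(path).expanduser()
    target.write_text(text, encoding="utf-8")
    return target


text = render_modelfile("DeepHermes-3-Llama-3-8B-Preview-Q4_K_M.gguf", REASONING_SYSTEM_PROMPT)
print(text)
# Example (not run here): save_modelfile(text, "~/models/deephermes3/Modelfile.reasoning")
```

After saving, register it with ollama create under a name of your choosing and the -f flag pointing at the new file, exactly as in the commands above. A system message sent through the API still overrides the baked-in prompt for that request.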

Step 3: Start the API Server

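Ollama exposes an OpenAI-compatible REST API under http://localhost:11434/v1. The desktop app starts this server automatically; if you installed only the CLI through Homebrew, start it in the background: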

ollama serve &

Then make a request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deephermes3-8b",
    "messages": [
      {"role": "system", "content": "You are Hermes, an AI assistant"},
      {"role": "user", "content": "Explain the difference between mutex and semaphore"}
    ]
  }'
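
The same request can be made from Python with only the standard library. This is a sketch under the assumptions already used above: the server is on its default port and the model was registered as deephermes3-8b. The helper names build_request and chat are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_request(model: str, messages: list[dict], url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def chat(model: str, messages: list[dict], timeout: float = 300.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(model, messages), timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


messages = [
    {"role": "system", "content": "You are Hermes, an AI assistant"},
    {"role": "user", "content": "Explain the difference between mutex and semaphore"},
]
request = build_request("deephermes3-8b", messages)
print(request.get_method(), request.full_url)
# With the server running: print(chat("deephermes3-8b", messages))
```

The generous timeout is deliberate: a cold model load plus a long reasoning-mode answer can take well over a minute on smaller machines.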

Activating Reasoning Mode via Ollama

Pass the deep-thinking system prompt to switch the model into extended reasoning mode:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deephermes3-8b",
    "messages": [
      {
        "role": "system",
        "content": "You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem."
      },
      {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ]
  }'

Method 2 — LM Studio (GUI, No Command Line)

LM Studio is a desktop application for macOS that provides a model browser connected directly to Hugging Face. It is the right choice if you prefer not to use Terminal.

  1. Download LM Studio from lmstudio.ai. Requires macOS 13 Ventura or later.
  2. Open the app and click Search. Type DeepHermes-3 in the search bar.
  3. Select NousResearch/DeepHermes-3-Llama-3-8B-Preview-GGUF and choose a quantization. Q4_K_M is a good default (4.7 GB, good quality/speed balance). Q5_K_M is preferable if you have 16 GB of unified memory and want higher fidelity output.
  4. Click Download and wait for the download to complete.
  5. Click Load Model, then switch to the Chat tab to begin a session.

LM Studio also exposes a local server at http://localhost:1234 with an OpenAI-compatible API — enable it in the Local Server tab if you want to connect other applications.


Method 3 — Hugging Face Transformers (Python)

Use this path when you need to call DeepHermes 3 from Python code — for fine-tuning, evaluation pipelines, or integration into a larger application. This method downloads the full BF16 weights from Hugging Face, which is larger (the 8B model is ~16 GB) than the quantized GGUF version.

Step 1: Set Up Python

# Install Python 3.12 (recommended; 3.11 also works)
brew install python@3.12

# Verify
python3.12 --version
# Python 3.12.x

# Create a virtual environment
python3.12 -m venv ~/.venvs/deephermes
source ~/.venvs/deephermes/bin/activate

Step 2: Install Dependencies

pip install --upgrade pip
pip install torch transformers accelerate huggingface_hub

On Apple Silicon, PyTorch uses the Metal Performance Shaders (MPS) backend. No separate CUDA or ROCm installation is needed.

Step 3: Download the Model

from huggingface_hub import snapshot_download

# Download the 8B model (approx 16 GB in BF16)
snapshot_download(
    repo_id="NousResearch/DeepHermes-3-Llama-3-8B-Preview",
    local_dir="./deephermes3-8b"
)

For the 24B model, replace the repo ID with NousResearch/DeepHermes-3-Mistral-24B-Preview. Expect approximately 48 GB for the full BF16 weights.

Step 4: Run the Model — Intuitive Mode

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./deephermes3-8b"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="mps",          # Use Apple Silicon GPU via MPS
)

messages = [
    {"role": "system", "content": "You are Hermes, an AI assistant"},
    {"role": "user", "content": "Summarise the CAP theorem in three bullet points."},
]

input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("mps")

generated_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.8,
    repetition_penalty=1.1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(response)

Step 5: Run the Model — Deep Reasoning Mode

REASONING_SYSTEM_PROMPT = (
    "You are a deep thinking AI, you may use extremely long chains of thought to deeply consider "
    "the problem and deliberate with yourself via systematic reasoning processes to help come to a "
    "correct solution prior to answering. You should enclose your thoughts and internal monologue "
    "inside <think> </think> tags, and then provide your solution or response to the problem."
)

messages = [
    {"role": "system", "content": REASONING_SYSTEM_PROMPT},
    {"role": "user", "content": "What is y if y = 2*2 - 4 + (3*2)?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("mps")

generated_ids = model.generate(
    input_ids,
    max_new_tokens=2500,   # Allow ample space for reasoning chains
    temperature=0.8,
    repetition_penalty=1.1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(response)

The model's internal deliberation appears inside <think>...</think> tags before the final answer. You can strip these programmatically if you only want the answer.
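
One way to strip the reasoning is a small regular expression. This is a minimal sketch: the function name split_reasoning is hypothetical, and the sample string stands in for real model output.

```python
import re

# Non-greedy match so several <think> blocks are handled separately;
# DOTALL lets the reasoning span multiple lines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no <think> block is present."""
    thoughts = [block.strip() for block in THINK_BLOCK.findall(text)]
    answer = THINK_BLOCK.sub("", text).strip()
    return "\n".join(thoughts), answer


sample = "<think>2*2 = 4, 4 - 4 = 0, 3*2 = 6, 0 + 6 = 6.</think>\ny = 6"
reasoning, answer = split_reasoning(sample)
print(answer)  # prints: y = 6
```

If generation stops at max_new_tokens before the closing tag, the pattern does not match and the unfinished reasoning stays in the answer. Treat a leftover <think> in the result as a sign that the token budget was too small.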


Method 4 — llama.cpp (Maximum Control)

llama.cpp is the underlying inference engine that Ollama uses. Running it directly gives you fine-grained control over quantization, context size, batch parameters, and Metal layer offloading.

# Install llama.cpp with Metal support
brew install llama.cpp

# Download a GGUF (Q4_K_M recommended for 8B)
huggingface-cli download NousResearch/DeepHermes-3-Llama-3-8B-Preview-GGUF \
  DeepHermes-3-Llama-3-8B-Preview-Q4_K_M.gguf --local-dir ~/models/deephermes3

# Run with all layers on the GPU (Metal)
llama-cli \
  --model ~/models/deephermes3/DeepHermes-3-Llama-3-8B-Preview-Q4_K_M.gguf \
  --n-gpu-layers -1 \
  --ctx-size 8192 \
  --prompt "You are Hermes, an AI assistant\n\nUser: What is the difference between TCP and UDP?\nAssistant:"

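Setting --n-gpu-layers -1 offloads all layers to Metal. Whether offloading happens without the flag depends on the llama.cpp build and version (Metal-enabled builds generally offload by default, while a value of 0 forces CPU-only inference), so set it explicitly on Apple Silicon.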

For a persistent interactive session:

llama-cli \
  --model ~/models/deephermes3/DeepHermes-3-Llama-3-8B-Preview-Q4_K_M.gguf \
  --n-gpu-layers -1 \
  --ctx-size 8192 \
  --interactive \
  --chat-template llama3

Performance and Benchmarks

Reasoning Mode Improvement

NousResearch reported that DeepHermes 3 shows up to 50% improvement in math reasoning tasks compared to Hermes 3 (the non-reasoning predecessor) when reasoning mode is active. The upper bound is set by the R1 teacher model from which it was distilled.

Approximate Throughput on Apple Silicon (8B Q4_K_M)

Chip | Unified memory | Estimated tokens/s (Ollama)
M1 MacBook Air | 8 GB | 10–15 t/s
M2 MacBook Pro | 16 GB | 20–30 t/s
M3 Pro | 18–36 GB | 30–50 t/s
M4 Pro | 24–64 GB | 45–65 t/s
Intel Mac | 16 GB DDR4 | 3–8 t/s (CPU only)

These are indicative figures based on community benchmarks for 8B-class Q4_K_M GGUF models on Apple Silicon. Actual throughput varies with context length, prompt complexity, and concurrent system load. The Ollama MLX backend (introduced in v0.19, for M5 chips with NVFP4 quantization) can deliver significantly higher prefill speeds — up to 1,810 tokens/s prefill — but that applies to supported model/quantization combinations on the latest hardware.

Quantization Trade-offs

Quantization | 8B disk size | Quality vs BF16 | Use case
Q3_K_S | ~3.1 GB | Noticeable degradation | RAM-constrained 8 GB Macs
Q4_K_M | ~4.7 GB | Good (recommended default) | Most users
Q5_K_M | ~5.7 GB | Very close to full precision | 16 GB+ Macs, reasoning tasks
Q8_0 | ~8.5 GB | Near-lossless | 32 GB+ Macs, maximum fidelity
BF16 (full) | ~16 GB | Reference quality | Python/Transformers path

For reasoning tasks, Q5_K_M or higher is worth the extra disk space — the extended chain-of-thought generation benefits more from higher fidelity weights than simple chat does.


Choosing a Method — Decision Guide

  • 16 GB Mac, first time with local LLMs: Use Ollama. Two commands and you are running. Start with the 8B Q4_K_M model.
  • 8 GB Mac: Use the 3B model via Ollama (ollama run hermes3:3b). The 8B model can technically fit at Q3_K_S but leaves almost no room for the OS, causing swapping and very low throughput.
  • You want a chat UI without Terminal: Use LM Studio. It handles discovery, download, and conversation in one app.
  • You are building a Python application or running evaluation pipelines: Use the Hugging Face Transformers path. The OpenAI-compatible wrapper from Ollama also works and avoids the 16 GB download.
  • You need the OpenAI-compatible REST endpoint for an existing tool (Continue.dev, Open WebUI, etc.): Use ollama serve — it exposes http://localhost:11434/v1 which matches the OpenAI API surface.
  • You need maximum throughput for batch processing: Use llama.cpp directly with --n-gpu-layers -1 and tune --batch-size and --ubatch-size for your workload.

macOS Privacy and Permission Notes

macOS 14 Sonoma and 15 Sequoia enforce stricter privacy controls than earlier releases. When you first launch Ollama or run a Python script that accesses your home directory:

  • If prompted about file access, click Allow. Models are stored in ~/.ollama/models by default, which is within your home directory.
  • If you store models in a custom directory (e.g., an external drive), grant Terminal or your application access via System Settings → Privacy & Security → Files and Folders.
  • Full Disk Access is not required for running Ollama or llama.cpp. Only grant it if a specific error message asks for it.
  • App Nap: macOS may throttle background processes. If you are running a long inference task in the background via Terminal, ensure Energy Saver (System Settings → Battery) is not set to aggressively sleep the machine. On a plugged-in MacBook or a Mac desktop, this is rarely an issue.

Common Pitfalls and Troubleshooting

Ollama fails with "model not found" or times out on download

Run ollama pull hermes3 explicitly before ollama run hermes3. The pull command shows download progress and errors clearly. Check your disk space — a partial download will fail silently on retry if the disk is full.

Python MPS errors on Apple Silicon

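If you see RuntimeError: MPS backend out of memory, the model does not fit in unified memory at the loaded precision. Options: confirm the weights are loaded in 16-bit (torch_dtype=torch.float16 or torch.bfloat16, which use the same amount of memory as each other) rather than 32-bit; drop to the 3B variant; or switch to a GGUF-based method (Ollama or llama.cpp), which uses quantized weights by default. bitsandbytes 8-bit loading (load_in_8bit=True) is built around CUDA GPUs, so do not count on it working on MPS.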

Very slow inference on Apple Silicon

Check that Metal is being used. For Ollama, run ollama ps; its PROCESSOR column shows whether the loaded model is on the GPU or the CPU. For llama.cpp, pass --n-gpu-layers -1 explicitly so offloading does not depend on the build's default. For Python, confirm device_map="mps" is set and that your PyTorch version includes MPS support (torch.backends.mps.is_available() should return True).
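
For the Python path, the MPS check can be wrapped in a small helper that falls back to the CPU instead of failing. This is a sketch: the function name pick_torch_device is hypothetical, and the import is guarded so the snippet also runs outside the virtual environment, where PyTorch may be missing.

```python
def pick_torch_device() -> str:
    """Return "mps" when PyTorch can use the Apple GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


device = pick_torch_device()
print(f"Using device: {device}")
```

If this prints cpu on an Apple Silicon Mac, the usual causes are a wrong interpreter (not the virtual environment) or a PyTorch build without MPS support.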

ModuleNotFoundError for transformers, torch, or accelerate

You are likely outside your virtual environment. Re-activate it: source ~/.venvs/deephermes/bin/activate. Then verify: pip list | grep -E "torch|transformers".

huggingface_hub rate limit or authentication errors

Create a free Hugging Face account and generate a read token at https://huggingface.co/settings/tokens. Run huggingface-cli login and paste the token. The DeepHermes 3 models do not require a license acceptance gate, but a token prevents rate limiting on large downloads.

Intel Mac: very low throughput

Intel Macs have no Metal GPU acceleration for LLM inference. The 8B model at Q4_K_M will run at 3–8 tokens/s, which is usable but slow for interactive chat. Consider the 3B model (hermes3:3b in Ollama, ~2 GB) for a more interactive experience on older hardware.

Reasoning mode produces very long outputs

This is expected. The model may generate thousands of tokens inside <think> tags before reaching the final answer. Set max_new_tokens to at least 2,500–4,000 when reasoning mode is active. Responses will be longer but more accurate for complex tasks.
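
To see what a larger token budget costs in wall-clock time, divide it by your machine's generation speed. The sketch below uses 25 tokens/s, the midpoint of the 20–30 t/s range this guide quotes for an M2 MacBook Pro; the function name is hypothetical, and prompt processing time is ignored.

```python
def generation_seconds(new_tokens: int, tokens_per_second: float) -> float:
    """Worst-case generation time if the model uses the whole token budget."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return new_tokens / tokens_per_second


ASSUMED_TOKENS_PER_SECOND = 25.0  # assumption: midpoint of the M2 range in this guide

for budget in (512, 2500, 4000):
    seconds = generation_seconds(budget, ASSUMED_TOKENS_PER_SECOND)
    print(f"max_new_tokens={budget}: up to ~{seconds:.0f} s")
```

At that speed a 2,500-token budget can take around 100 seconds and a 4,000-token budget around 160 seconds, so set client timeouts accordingly.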


Function Calling and Structured Output

DeepHermes 3 supports structured tool use. NousResearch maintains a reference implementation at github.com/NousResearch/Hermes-Function-Calling. The model accepts JSON schemas for tool definitions and responds with valid JSON when a tool call is appropriate.
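
A rough sketch of the client side follows. It assumes the Hermes convention of wrapping each call in <tool_call> tags around a JSON object with name and arguments keys; check the reference repository for the exact prompt format before relying on it. The get_weather tool, the sample output, and the function name parse_tool_calls are all hypothetical.

```python
import json
import re

# Hypothetical tool definition in JSON-schema form.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Assumed output format: <tool_call>{...json...}</tool_call>
TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract well-formed tool calls; malformed JSON is skipped, not raised."""
    calls = []
    for raw in TOOL_CALL.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


sample = (
    "Checking the forecast.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Oslo"}}\n</tool_call>'
)
calls = parse_tool_calls(sample)
print(calls[0]["name"], calls[0]["arguments"])
```

Skipping malformed calls rather than raising keeps an agent loop alive; in practice you would log the raw text and ask the model to retry.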



Keeping Everything Up to Date

# Update Ollama
brew upgrade ollama

# Update Python packages
source ~/.venvs/deephermes/bin/activate
pip install --upgrade transformers torch accelerate huggingface_hub

# Check when the model repo was last updated on Hugging Face
python -c "from huggingface_hub import model_info; print(model_info('NousResearch/DeepHermes-3-Llama-3-8B-Preview').last_modified)"

NousResearch releases model updates without versioned slugs on Hugging Face (the repo is updated in-place for Preview releases). Watch the NousResearch Hugging Face organization or their X account for announcements of new variants or a DeepHermes 4 release.


FAQ

Does DeepHermes 3 run on an M1 MacBook Air with 8 GB RAM?

The 3B model runs comfortably. The 8B model at Q4_K_M needs approximately 6–7 GB of RAM for the weights plus OS overhead — it fits on 8 GB but leaves very little headroom, which can cause swapping and slow performance. The 3B model at Q4_K_M uses ~2 GB and runs at full speed.

What is the difference between DeepHermes 3 and Hermes 3?

Hermes 3 is a general-purpose instruction model (up to 405B, Llama 3.1 base). DeepHermes 3 is a separate preview series distilled from DeepSeek R1, adding toggleable chain-of-thought reasoning on top of Hermes 3 capabilities. DeepHermes 3 currently tops out at 24B parameters; Hermes 3 goes to 405B.

Is DeepHermes 3 free to use?

The 8B and 3B variants are under the Llama 3 Community License, which allows research and commercial use with restrictions (see Meta's license terms). The 24B Mistral variant is under Apache 2.0, which is fully permissive. Check the relevant Hugging Face model card for full license terms before commercial deployment.

Does it work with Open WebUI or other chat frontends?

Yes. Run ollama serve, then point Open WebUI (or any OpenAI-compatible frontend) at http://localhost:11434/v1. The model name in the UI should match what you registered — e.g., deephermes3-8b if you used the Modelfile approach above.

How do I switch between reasoning and non-reasoning mode in an API call?

Change the system message in your request. The deep-thinking prompt (with the "enclose your thoughts inside <think> tags" instruction) activates reasoning. Any other system prompt (or omitting it entirely) uses standard conversational mode. No model reload or parameter change is needed.

How much disk space do I need for all three model variants?

At Q4_K_M: 3B (~2 GB) + 8B (~4.7 GB) + 24B (~14 GB) = approximately 21 GB total. If you are only interested in one variant, the 8B at Q4_K_M is the best starting point for most tasks.

Can I fine-tune DeepHermes 3 on my Mac?

LoRA fine-tuning of the 8B model on Apple Silicon is possible with mlx-lm (Apple's MLX framework) on Macs with 32 GB+ of unified memory. The Transformers + PEFT path also works but is slower on MPS than on CUDA. Full fine-tuning needs several times more memory than LoRA and is impractical on most Mac configurations.

What happened to the "System Preferences" path from older macOS guides?

Apple renamed System Preferences to System Settings in macOS 13 Ventura. On macOS 14 Sonoma and 15 Sequoia, all Privacy and Security controls are under System Settings → Privacy & Security. Any guide that references "System Preferences → Security & Privacy" is out of date.
