Install and Run Orpheus 3B TTS on macOS (April 2026): The Apple Silicon Guide

Install and Run Orpheus 3B TTS on macOS (April 2026): The Apple Silicon Guide
Quick answer. Canopy Labs' Orpheus 3B TTS does not run natively on Apple Silicon because the official Python package needs CUDA-built PyTorch and vLLM. The working 2026 path on M-series Macs is the GGUF build loaded in LM Studio with the Metal backend, driven by the isaiahbjork/orpheus-tts-local Python client that decodes SNAC tokens for voice cloning and emotion tags.

Last updated April 2026 — refreshed for current model versions, the multilingual research preview, and the Apple Silicon-friendly install path via LM Studio.

Orpheus 3B is Canopy Labs' open-source, Llama-3.2-based speech model that emits expressive, near-human audio with zero-shot voice cloning and inline emotion tags. The original Canopy Python package ships a CUDA-only path that does not run natively on Apple Silicon — so on an M-series Mac the practical install route in 2026 is the GGUF build through LM Studio (Metal-accelerated) plus a small Python client. This guide walks through that path end-to-end with verified version numbers, RAM/disk numbers, and the exact gotchas reported in GitHub issue #178.

What changed in 2026Apple Silicon path is GGUF-only. The upstream orpheus-speech pip package depends on vllm and a CUDA-built PyTorch; on M-series it raises "Torch not compiled with CUDA enabled." Use the GGUF client (isaiahbjork/orpheus-tts-local) on top of LM Studio with Metal instead — that is what Canopy Labs themselves point Mac users to in the model-card discussion thread.Multilingual research preview shipped (April 2025). Canopy released a multilingual family in English, French, Spanish, Italian, German, Mandarin, Korean, and Hindi — eight languages, twenty-four voices total.The tts["audio"] snippet from the 2025 version of this post was wrong. Orpheus is not a Hugging Face pipeline("text-to-speech") model. It emits SNAC tokens that have to be decoded through hubertsiuzdak/snac_24khz. The corrected code path is below.ElevenLabs v3 went GA on 14 March 2026, raising the closed-source bar for expressiveness. Orpheus is still the strongest fully-local option that ships voice cloning + emotion tags under Apache 2.0.Compatibility caveat — the upstream Canopy repo pinned vllm==0.7.3 after a March 2025 vllm regression; newer Linux/CUDA users hit it too. Mac users sidestep this entirely on the GGUF path.

Want the full picture? Read our continuously-updated Self-Hosting LLMs Complete Guide (2026) — hardware, ollama and vllm, cost-per-token, and when to self-host.

TL;DR — the fastest working path on a Mac

QuestionAnswer
Does Canopy's official Python package run on Apple Silicon?No. It needs CUDA-built PyTorch and vLLM. Issue #178 is still open.
What works on Mac in 2026?LM Studio (Metal backend) + isaiahbjork/orpheus-tts-local Python client.
Recommended quantorpheus-3b-0.1-ft-Q4_K_M-GGUF (~2.5 GB on disk, fits 8 GB unified memory).
Preferred Mac hardwareM2 Pro / M3 / M4 with 16 GB unified memory; 8 GB works for Q4_K_M but slower.
Audio output24 kHz mono WAV via the SNAC decoder.
Built-in English voicestara, leah, jess, leo, dan, mia, zac, zoe (8).
LicenseApache 2.0 — production use allowed.

What Orpheus actually is (architecture, in 60 seconds)

Orpheus is a fine-tune of Meta's Llama-3.2-3B-Instruct (so it carries roughly 4B parameters once you count the speech-token vocabulary expansion) that emits SNAC audio tokens instead of text. SNAC is a hierarchical neural audio codec at 24 kHz; the decoder is hubertsiuzdak/snac_24khz. Generation produces 7 SNAC tokens per audio frame, which a CNN detokenizer turns into a PCM waveform. That is why a naive pipeline("text-to-speech") call returns the wrong shape — Orpheus is structurally a speech-LLM, not a Tacotron-style HF TTS pipeline.

Useful consequences of this design:

  • Streaming latency is ~200 ms (down to ~100 ms with input streaming) — competitive with closed real-time APIs.
  • Because it is "just" a Llama with extra tokens, every Llama-class quantization (GGUF Q4_K_M, Q5_K_M, Q8_0) and every Llama runtime (llama.cpp, LM Studio, MLX-LM) can host the weights.
  • Inline emotion tags (<laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>) are first-class — they were trained, not post-hoc prompt tricks.

System requirements (verified, April 2026)

ComponentMinimumRecommended
macOS14 Sonoma15 Sequoia or 26 Tahoe
CPU/GPUApple Silicon M1M2 Pro / M3 / M4 (Metal acceleration)
Unified memory (Q4_K_M)8 GB16 GB
Unified memory (Q8_0)16 GB24 GB
Disk~3 GB for Q4_K_M, ~5 GB for Q8_0, plus ~600 MB for SNAC + dependencies
Python3.103.11 or 3.12
LM Studio0.3.x or newerlatest stable

For reference, the per-quant footprints on the GGUF model card are: Q4_K_S 2.40 GB, Q4_K_M 2.49 GB, Q8_0 4.03 GB. Add ~500 MB at runtime for the SNAC decoder and Python overhead.

This is the path Canopy Labs themselves recommend in the model-card discussion thread for Mac users. It avoids vLLM and CUDA-built PyTorch entirely.

Step 1 — Install Homebrew, Python, and Git

Skip the ones you already have.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python@3.11 git git-lfs
git lfs install

Step 2 — Install LM Studio and load the Orpheus GGUF

  1. Download LM Studio from lmstudio.ai (universal Apple Silicon build).
  2. Open it, hit the search icon, and search for isaiahbjork/orpheus-3b-0.1-ft-Q4_K_M-GGUF. Download.
  3. Switch to the Developer tab. Load the model. Confirm Metal is the active backend (not CPU).
  4. Start the local server. Default endpoint is http://127.0.0.1:1234/v1.

If you prefer raw llama.cpp or llama-server, the same GGUF will load. Pass --rope-scaling=linear and a --ctx-size matching your ORPHEUS_MAX_TOKENS so long passages don't get truncated. llama.cpp upstream tracks Orpheus support in issue #12476.

Step 3 — Clone the Python client

git clone https://github.com/isaiahbjork/orpheus-tts-local.git
cd orpheus-tts-local
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The client is small — it speaks to LM Studio's OpenAI-compatible endpoint, parses the SNAC-token stream, decodes through hubertsiuzdak/snac_24khz on the CPU/Metal, and writes a 24 kHz mono WAV.

Step 4 — Generate speech

python gguf_orpheus.py \
  --text "Hello from a Mac. <laugh> This is Orpheus running fully offline." \
  --voice tara \
  --output hello.wav

Open hello.wav in QuickTime or Music. Expect 1.5–4× real-time on an M3 with 16 GB unified memory using Q4_K_M; closer to real-time on an 8 GB M1.

Step 5 — If you must call it from your own Python

The 2025 version of this guide shipped a pipeline("text-to-speech") snippet that returns {"audio": ...}. That code does not work for Orpheus and never did — Orpheus is not a registered HF TTS pipeline. Here is the corrected path, talking to the LM Studio server directly:

import requests, json, wave, numpy as np
from snac import SNAC  # pip install snac

LM_STUDIO = "http://127.0.0.1:1234/v1/completions"
MODEL = "orpheus-3b-0.1-ft"
PROMPT = "tara: Hello from a Mac. <laugh> This is Orpheus running offline."

# 1. Stream audio tokens from LM Studio
r = requests.post(LM_STUDIO, json={
    "model": MODEL,
    "prompt": PROMPT,
    "max_tokens": 1200,
    "temperature": 0.6,
    "top_p": 0.9,
    "stream": False,
})
token_text = r.json()["choices"][0]["text"]

# 2. Parse SNAC token IDs (the client repo has a robust parser; this is a sketch)
ids = [int(t) for t in token_text.split() if t.isdigit()]
frames = [ids[i:i+7] for i in range(0, len(ids) - 6, 7)]

# 3. Decode through SNAC at 24 kHz
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
audio = snac_model.decode(frames)  # shape: (1, 1, samples)

# 4. Write a 24 kHz mono WAV
pcm = (audio.squeeze().cpu().numpy() * 32767).astype(np.int16)
with wave.open("output.wav", "wb") as f:
    f.setnchannels(1); f.setsampwidth(2); f.setframerate(24000)
    f.writeframes(pcm.tobytes())

If this looks like more wiring than you wanted, that is exactly why the orpheus-tts-local client exists — it bundles the SNAC parsing and the WAV write so you only call gguf_orpheus.py. Use the snippet above only when you need to embed Orpheus inside a larger Python service, e.g. a voice agent that already orchestrates an LLM through the same OpenAI-compatible interface — the same pattern we describe in the OpenClaw + Ollama setup guide for running local AI agents.

What was removed from the 2025 version (and why)

  • tts = pipeline("text-to-speech", model="canopylabs/orpheus-3b-0.1-pretrained") — Orpheus does not register a TTS pipeline. The pipeline returns text-completion logits, not audio. Removed.
  • git lfs pull on the Canopy repo to "download the model" — the canopy repo doesn't ship weights via LFS; weights live on Hugging Face. Removed.
  • "At least 8 GB RAM, GPU recommended" with no per-quant numbers — replaced with the per-quant table above.
  • The pretrained base as a usage default — the -ft (production fine-tune) is what every consumer integration ships; -pretrained is for downstream fine-tuning only. Updated.

Voices, emotion tags, and prompt format

The fine-tuned English model exposes 8 voices. Pick one by prefixing the prompt with the voice name and a colon:

VoiceGenderCharacter
taraFDefault — conversational, clear (most demos use this)
leahFWarm, gentle
jessFEnergetic, youthful
leoMAuthoritative, deep
danMFriendly, casual
miaFProfessional, articulate
zacMEnthusiastic, dynamic
zoeFCalm, soothing

Emotion tags are inline. The model was trained on these — they are not post-hoc filters:

  • <laugh>, <chuckle>, <giggle>
  • <sigh>, <groan>, <yawn>, <gasp>
  • <cough>, <sniffle>

Example prompt:

leo: I cannot believe you said that. <sigh> Let's start over.

Multilingual: French, German, Spanish, Italian, Mandarin, Korean, Hindi

Canopy's April 2025 multilingual research preview ships eight languages and 24 voices in total. The models are listed under the canopylabs/orpheus-tts Hugging Face collection, e.g. canopylabs/3b-fr-ft-research_release for French and canopylabs/3b-hi-pretrain-research_release for Hindi. Multilingual quants are not as well-curated as English; expect to use the upstream BF16 weights on Mac through MLX or full-precision llama.cpp if you need a non-English voice. Treat this tier as research-preview, not production.

How to choose: should you use Orpheus, Kokoro, Sesame CSM, or ElevenLabs v3?

If you need…PickWhy
Fully local, voice cloning + emotion tagsOrpheus 3BOnly fully-local Apache-2.0 model that ships both
Smallest footprint, fastest CPU TTSKokoro 82M~30–45 s for a 1,500-word passage on M1 8 GB
Most natural conversational tone, paralinguisticsSesame CSM 1BBest at non-verbal cues; weaker voice cloning OOTB
Maximum expressiveness, no latency concernElevenLabs v3 (cloud)GA 14 March 2026; 70+ languages; not real-time
Real-time conversational agent (cloud OK)ElevenLabs Flash v2.5~75 ms latency; lower quality than v3

If you want a head-to-head with concrete numbers, our Orpheus 3B vs Kokoro and Orpheus vs Sesame CSM 1B comparisons go deeper on the trade-offs.

Performance — concrete 2026 numbers

From CodeSOTA's 2026 speech leaderboard and Inferless's 2025 12-model comparison (links in References):

  • Quality (MOS): Orpheus-3b-0.1-ft scores ~4.2 — close to Sesame CSM 1B and within striking distance of ElevenLabs v2 on conversational text. ElevenLabs v3 (closed) sits at the top of the leaderboard.
  • Word-error rate (CER): ~21% in the open-source comparison; Kokoro is lower at 17% but lacks voice cloning.
  • Streaming latency: ~200 ms time-to-first-audio with batched generation, ~100 ms with input streaming, on RTX-class hardware. On M3/M4 expect 1.5–4× real-time generation with Q4_K_M, depending on context length.
  • Audio: 24 kHz, 16-bit mono — fine for voice agents and audiobooks; not 48 kHz studio quality.

If you need cite-able benchmark detail, the CodeSOTA leaderboard is the most current public ranking that includes Orpheus.

Common pitfalls and troubleshooting

  • RuntimeError: Torch not compiled with CUDA enabled — you're running the upstream orpheus-speech package on Apple Silicon. Switch to the GGUF + LM Studio path. This is the issue #178 case.
  • vLLM regression on Linux — pin vllm==0.7.3. The Canopy README still calls this out. (Not relevant for Mac users on the GGUF path.)
  • Garbled or robotic audio — your prompt is being treated as text completion, not as a voice prompt. Make sure the prompt starts with voicename: (e.g. tara:) and that the LM Studio server is loading the Orpheus GGUF, not a base Llama-3.2.
  • WAV is silent or 0 bytes — the SNAC frame parsing stopped early. Increase --max_tokens (the Lex-au server defaults to ORPHEUS_MAX_TOKENS, often 1024 — bump to 4096 for paragraph-length output).
  • Hugging Face login prompts — only required for gated multilingual research-release weights. The English Q4_K_M GGUF is not gated; you do not need a token for it.
  • "Connection refused" from the Python client — LM Studio's local server is off. Re-open the Developer tab and click Start server. Confirm with curl http://127.0.0.1:1234/v1/models.
  • Output is too fast / too slow — temperature and top_p matter for SNAC sampling. Canopy recommends temperature=0.6, top_p=0.9, repetition_penalty=1.1. Defaults of 0.0 produce monotone deliveries.
  • Long generations cut off mid-word — increase --ctx-size in llama.cpp or LM Studio. Each second of audio uses ~150 SNAC tokens; a 30-second clip needs ~4500 tokens of headroom.

Production notes

  • License: Apache 2.0 on both code and weights. The base Llama-3.2 license still governs the underlying weights; if you ship a commercial product, read both. Canopy's -ft models are research-friendly and explicitly cleared for commercial use under Apache 2.0.
  • Throughput: for batch inference, run on a Linux box with vLLM and fp8 (Baseten partnership, May 2025). Apple Silicon is the right tool for one-user-at-a-time interactive use, not for scale-out batch synthesis.
  • Voice cloning ethics: Orpheus does zero-shot cloning from ~10 seconds of reference audio. Get explicit consent from the speaker; in the EU, recorded voice is biometric data under the AI Act.
  • If you're stitching Orpheus into a voice agent (Whisper-cpp for STT, a local LLM for reasoning, Orpheus for TTS), the architecture mirrors the one we describe in the OpenClaw vs LM Studio vs Ollama comparison. Hiring a Codersera engineer who has shipped this exact stack before usually saves the week of vLLM/Metal yak-shaving.

FAQ

Does Orpheus 3B run natively on Apple Silicon as of April 2026?

Not via the official Canopy Python package — it requires CUDA-built PyTorch and vLLM. The community path is GGUF through LM Studio (Metal) plus the orpheus-tts-local client. Issue #178 in the Canopy repo tracks the native ask but is unresolved.

How much RAM do I really need?

8 GB unified memory is enough for Q4_K_M Orpheus-3B-FT, but you'll see Metal eviction during long generations. 16 GB is the comfortable target. 24 GB or more lets you run Q8_0 plus a small reasoning LLM in parallel.

Can I clone my own voice?

Yes. Provide ~5–10 seconds of clean 24 kHz reference audio and prepend it to the prompt as a voice exemplar. The Canopy README has an explicit zero-shot-cloning recipe; quality scales with sample length and recording cleanliness.

What's the difference between orpheus-3b-0.1-pretrained and orpheus-3b-0.1-ft?

-pretrained is the 100k-hour base — useful as a starting point for fine-tunes and downstream tasks. -ft is the production fine-tune with the 8 named voices and emotion-tag training. For TTS, always use -ft.

Does it support languages other than English?

Yes — the April 2025 multilingual research preview adds French, German, Spanish, Italian, Mandarin, Korean, and Hindi (24 voices total). Treat them as research-grade; English is the only model that hit production maturity.

How does Orpheus compare to ElevenLabs v3?

ElevenLabs v3 (GA 14 March 2026) is more expressive and supports 70+ languages, but is closed-source, costs $0.17–$0.30 per 1,000 characters, and explicitly trades latency for quality (not real-time). Orpheus is the leading fully-local Apache-2.0 alternative with voice cloning and emotion tags.

Can I use Orpheus commercially?

Yes — both code and weights are Apache 2.0. The base Llama-3.2 license still applies to the underlying weights, so include attribution. Voice cloning specifically is your responsibility re: consent and biometrics regulation.

Why is my output choppy or popping at frame boundaries?

SNAC's standard decoder doesn't smooth across frame boundaries; Canopy ships a sliding-window CNN decoder for streaming. If you're rolling your own pipeline (not using orpheus-tts-local), make sure you're using their detokenizer, not the vanilla SNAC one.

References & further reading