Install and Run Orpheus 3B TTS on macOS (April 2026): The Apple Silicon Guide
Last updated April 2026 — refreshed for current model versions, the multilingual research preview, and the Apple Silicon-friendly install path via LM Studio.
Orpheus 3B is Canopy Labs' open-source, Llama-3.2-based speech model that emits expressive, near-human audio with zero-shot voice cloning and inline emotion tags. The original Canopy Python package ships a CUDA-only path that does not run natively on Apple Silicon — so on an M-series Mac the practical install route in 2026 is the GGUF build through LM Studio (Metal-accelerated) plus a small Python client. This guide walks through that path end-to-end with verified version numbers, RAM/disk numbers, and the exact gotchas reported in GitHub issue #178.
What changed in 2026
- Apple Silicon path is GGUF-only. The upstream orpheus-speech pip package depends on vllm and a CUDA-built PyTorch; on M-series it raises "Torch not compiled with CUDA enabled." Use the GGUF client (isaiahbjork/orpheus-tts-local) on top of LM Studio with Metal instead; that is what Canopy Labs themselves point Mac users to in the model-card discussion thread.
- Multilingual research preview shipped (April 2025). Canopy released a multilingual family in English, French, Spanish, Italian, German, Mandarin, Korean, and Hindi: eight languages, twenty-four voices total.
- The tts["audio"] snippet from the 2025 version of this post was wrong. Orpheus is not a Hugging Face pipeline("text-to-speech") model. It emits SNAC tokens that have to be decoded through hubertsiuzdak/snac_24khz. The corrected code path is below.
- ElevenLabs v3 went GA on 14 March 2026, raising the closed-source bar for expressiveness. Orpheus is still the strongest fully-local option that ships voice cloning + emotion tags under Apache 2.0.
- Compatibility caveat: the upstream Canopy repo pinned vllm==0.7.3 after a March 2025 vllm regression; newer Linux/CUDA users hit it too. Mac users sidestep this entirely on the GGUF path.
TL;DR — the fastest working path on a Mac
| Question | Answer |
|---|---|
| Does Canopy's official Python package run on Apple Silicon? | No. It needs CUDA-built PyTorch and vLLM. Issue #178 is still open. |
| What works on Mac in 2026? | LM Studio (Metal backend) + isaiahbjork/orpheus-tts-local Python client. |
| Recommended quant | orpheus-3b-0.1-ft-Q4_K_M-GGUF (~2.5 GB on disk, fits 8 GB unified memory). |
| Preferred Mac hardware | M2 Pro / M3 / M4 with 16 GB unified memory; 8 GB works for Q4_K_M but slower. |
| Audio output | 24 kHz mono WAV via the SNAC decoder. |
| Built-in English voices | tara, leah, jess, leo, dan, mia, zac, zoe (8). |
| License | Apache 2.0 — production use allowed. |
What Orpheus actually is (architecture, in 60 seconds)
Orpheus is a fine-tune of Meta's Llama-3.2-3B-Instruct (so it carries roughly 4B parameters once you count the speech-token vocabulary expansion) that emits SNAC audio tokens instead of text. SNAC is a hierarchical neural audio codec at 24 kHz; the decoder is hubertsiuzdak/snac_24khz. Generation produces 7 SNAC tokens per audio frame, which a CNN detokenizer turns into a PCM waveform. That is why a naive pipeline("text-to-speech") call returns the wrong shape — Orpheus is structurally a speech-LLM, not a Tacotron-style HF TTS pipeline.
Useful consequences of this design:
- Streaming latency is ~200 ms (down to ~100 ms with input streaming) — competitive with closed real-time APIs.
- Because it is "just" a Llama with extra tokens, every Llama-class quantization (GGUF Q4_K_M, Q5_K_M, Q8_0) and every Llama runtime (llama.cpp, LM Studio, MLX-LM) can host the weights.
- Inline emotion tags (<laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>) are first-class: they were trained, not post-hoc prompt tricks.
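To make the "7 SNAC tokens per audio frame" layout concrete, here is a minimal sketch of how a flat token stream could be grouped into frames and spread across SNAC's three hierarchical layers. The 1 + 2 + 4 split and the exact position mapping are assumptions based on how the orpheus-tts-local client appears to order tokens; the function names are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

TOKENS_PER_FRAME = 7  # stated in the architecture description above


def split_frames(ids: List[int]) -> List[List[int]]:
    """Group a flat token stream into complete 7-token frames, dropping any tail."""
    usable = len(ids) - len(ids) % TOKENS_PER_FRAME
    return [ids[i:i + TOKENS_PER_FRAME] for i in range(0, usable, TOKENS_PER_FRAME)]


def to_layers(frames: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Redistribute frames into three layers holding 1, 2 and 4 codes per frame.

    ASSUMPTION: positions (0), (1, 4) and (2, 3, 5, 6) mirror the ordering used
    by the orpheus-tts-local client; verify against its decoder before use.
    """
    coarse: List[int] = []
    mid: List[int] = []
    fine: List[int] = []
    for f in frames:
        coarse.append(f[0])
        mid.extend((f[1], f[4]))
        fine.extend((f[2], f[3], f[5], f[6]))
    return coarse, mid, fine


# 16 tokens give 2 complete frames; the 2 leftover tokens are dropped.
frames = split_frames(list(range(16)))
coarse, mid, fine = to_layers(frames)
```

Each layer then becomes one tensor in the list handed to the SNAC decoder, which is why a plain list of 7-token frames is not a valid decoder input on its own.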
System requirements (verified, April 2026)
| Component | Minimum | Recommended |
|---|---|---|
| macOS | 14 Sonoma | 15 Sequoia or 26 Tahoe |
| CPU/GPU | Apple Silicon M1 | M2 Pro / M3 / M4 (Metal acceleration) |
| Unified memory (Q4_K_M) | 8 GB | 16 GB |
| Unified memory (Q8_0) | 16 GB | 24 GB |
| Disk | ~3 GB for Q4_K_M, ~5 GB for Q8_0, plus ~600 MB for SNAC + dependencies | — |
| Python | 3.10 | 3.11 or 3.12 |
| LM Studio | 0.3.x or newer | latest stable |
For reference, the per-quant footprints on the GGUF model card are: Q4_K_S 2.40 GB, Q4_K_M 2.49 GB, Q8_0 4.03 GB. Add ~500 MB at runtime for the SNAC decoder and Python overhead.
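As a quick sanity check before downloading, the footprint arithmetic above can be sketched in a few lines. The quant sizes and the ~500 MB runtime overhead come from the text; the reserved_gb allowance for macOS and other apps is a hypothetical placeholder you should adjust for your own machine.

```python
# Sizes in GB, copied from the GGUF model card figures quoted above.
QUANT_SIZE_GB = {"Q4_K_S": 2.40, "Q4_K_M": 2.49, "Q8_0": 4.03}
RUNTIME_OVERHEAD_GB = 0.5  # SNAC decoder + Python overhead, per the text


def runtime_footprint_gb(quant: str) -> float:
    """Model weights plus the runtime overhead, in GB."""
    return round(QUANT_SIZE_GB[quant] + RUNTIME_OVERHEAD_GB, 2)


def headroom_gb(quant: str, unified_memory_gb: float, reserved_gb: float = 4.0) -> float:
    """Memory left after the model and a reserved slice for the OS.

    ASSUMPTION: reserved_gb=4.0 is an illustrative guess, not a measured value.
    A negative result suggests the quant will not fit comfortably.
    """
    return round(unified_memory_gb - reserved_gb - runtime_footprint_gb(quant), 2)
```

For example, headroom_gb("Q4_K_M", 8) stays positive under the default allowance while headroom_gb("Q8_0", 8) goes negative, which lines up with the 8 GB / 16 GB minimums in the table.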
Recommended install path: LM Studio + Python client
This is the path Canopy Labs themselves recommend in the model-card discussion thread for Mac users. It avoids vLLM and CUDA-built PyTorch entirely.
Step 1 — Install Homebrew, Python, and Git
Skip the ones you already have.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python@3.11 git git-lfs
git lfs install
Step 2 — Install LM Studio and load the Orpheus GGUF
- Download LM Studio from lmstudio.ai (universal Apple Silicon build).
- Open it, hit the search icon, and search for isaiahbjork/orpheus-3b-0.1-ft-Q4_K_M-GGUF. Download.
- Switch to the Developer tab. Load the model. Confirm Metal is the active backend (not CPU).
- Start the local server. Default endpoint is http://127.0.0.1:1234/v1.
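Before moving on, it can help to confirm from Python that the server is listening and that an Orpheus model is loaded. The sketch below assumes the server returns the usual OpenAI-style body for /models (a "data" list of objects with an "id" field); the helper names are hypothetical.

```python
import json
import urllib.request
from typing import List

BASE_URL = "http://127.0.0.1:1234/v1"  # LM Studio default endpoint from this step


def model_ids(payload: dict) -> List[str]:
    """Extract model ids from an OpenAI-style /models response body."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def list_models(base_url: str = BASE_URL, timeout: float = 5.0) -> List[str]:
    """Query the running server; raises URLError if nothing is listening."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


def has_orpheus(ids: List[str]) -> bool:
    """True if any loaded model id mentions Orpheus (a loose name check)."""
    return any("orpheus" in model_id.lower() for model_id in ids)
```

With the server running, has_orpheus(list_models()) should return True; a connection error here means the server was never started.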
If you prefer raw llama.cpp or llama-server, the same GGUF will load. Pass --rope-scaling=linear and a --ctx-size matching your ORPHEUS_MAX_TOKENS so long passages don't get truncated. llama.cpp upstream tracks Orpheus support in issue #12476.
Step 3 — Clone the Python client
git clone https://github.com/isaiahbjork/orpheus-tts-local.git
cd orpheus-tts-local
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The client is small — it speaks to LM Studio's OpenAI-compatible endpoint, parses the SNAC-token stream, decodes through hubertsiuzdak/snac_24khz on the CPU/Metal, and writes a 24 kHz mono WAV.
Step 4 — Generate speech
python gguf_orpheus.py \
--text "Hello from a Mac. <laugh> This is Orpheus running fully offline." \
--voice tara \
--output hello.wav
Open hello.wav in QuickTime or Music. Expect 1.5–4× real-time on an M3 with 16 GB unified memory using Q4_K_M; closer to real-time on an 8 GB M1.
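To check the speed figure on your own machine, you can time the generation command and compare it with the length of the resulting file. This sketch assumes "N× real-time" means seconds of audio produced per second of wall-clock generation; the helper names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file, from its frame count and sample rate."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()


def realtime_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio per second of generation; above 1.0 is faster than playback."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds
```

Measure the wall-clock time of the gguf_orpheus.py run yourself (for example with the shell's time command) and pass it in as generation_seconds along with wav_duration_seconds("hello.wav").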
Step 5 — If you must call it from your own Python
The 2025 version of this guide shipped a pipeline("text-to-speech") snippet that returns {"audio": ...}. That code does not work for Orpheus and never did — Orpheus is not a registered HF TTS pipeline. Here is the corrected path, talking to the LM Studio server directly:
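import re
import wave

import numpy as np
import requests
import torch
from snac import SNAC  # pip install snac

LM_STUDIO = "http://127.0.0.1:1234/v1/completions"
MODEL = "orpheus-3b-0.1-ft"
PROMPT = "tara: Hello from a Mac. <laugh> This is Orpheus running offline."

# 1. Request audio tokens from LM Studio (non-streaming, for simplicity)
r = requests.post(LM_STUDIO, json={
    "model": MODEL,
    "prompt": PROMPT,
    "max_tokens": 1200,
    "temperature": 0.6,
    "top_p": 0.9,
    "stream": False,
}, timeout=300)
r.raise_for_status()
token_text = r.json()["choices"][0]["text"]

# 2. Parse SNAC token IDs (the client repo has a robust parser; this is a sketch).
#    ASSUMPTION: tokens arrive as <custom_token_N>, and the token at stream
#    position i carries an offset of 10 + (i % 7) * 4096, as in orpheus-tts-local.
ids = []
for n in re.findall(r"<custom_token_(\d+)>", token_text):
    code = int(n) - 10 - (len(ids) % 7) * 4096
    if 0 <= code < 4096:  # skip anything that is not a valid audio code
        ids.append(code)
frames = [ids[i:i + 7] for i in range(0, len(ids) - 6, 7)]
if not frames:
    raise SystemExit("No audio tokens returned; check the model and prompt format.")

# 3. Split each 7-token frame into SNAC's three layers (1 + 2 + 4 codes), then decode
codes = [
    torch.tensor([[f[0] for f in frames]]),
    torch.tensor([[c for f in frames for c in (f[1], f[4])]]),
    torch.tensor([[c for f in frames for c in (f[2], f[3], f[5], f[6])]]),
]
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
with torch.no_grad():
    audio = snac_model.decode(codes)  # shape: (1, 1, samples)

# 4. Write a 24 kHz mono WAV
pcm = (audio.squeeze().clamp(-1, 1).cpu().numpy() * 32767).astype(np.int16)
with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm.tobytes())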
If this looks like more wiring than you wanted, that is exactly why the orpheus-tts-local client exists — it bundles the SNAC parsing and the WAV write so you only call gguf_orpheus.py. Use the snippet above only when you need to embed Orpheus inside a larger Python service, e.g. a voice agent that already orchestrates an LLM through the same OpenAI-compatible interface — the same pattern we describe in the OpenClaw + Ollama setup guide for running local AI agents.
What was removed from the 2025 version (and why)
- tts = pipeline("text-to-speech", model="canopylabs/orpheus-3b-0.1-pretrained"): Orpheus does not register a TTS pipeline. The pipeline returns text-completion logits, not audio. Removed.
- git lfs pull on the Canopy repo to "download the model": the Canopy repo doesn't ship weights via LFS; weights live on Hugging Face. Removed.
- "At least 8 GB RAM, GPU recommended" with no per-quant numbers: replaced with the per-quant table above.
- The pretrained base as a usage default: the -ft (production fine-tune) is what every consumer integration ships; -pretrained is for downstream fine-tuning only. Updated.
Voices, emotion tags, and prompt format
The fine-tuned English model exposes 8 voices. Pick one by prefixing the prompt with the voice name and a colon:
| Voice | Gender | Character |
|---|---|---|
| tara | F | Default — conversational, clear (most demos use this) |
| leah | F | Warm, gentle |
| jess | F | Energetic, youthful |
| leo | M | Authoritative, deep |
| dan | M | Friendly, casual |
| mia | F | Professional, articulate |
| zac | M | Enthusiastic, dynamic |
| zoe | F | Calm, soothing |
Emotion tags are inline. The model was trained on these — they are not post-hoc filters:
- <laugh>, <chuckle>, <giggle>
- <sigh>, <groan>, <yawn>, <gasp>
- <cough>, <sniffle>
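If you build prompts programmatically, a small validator catches misspelled voices and unsupported tags before they reach the model. The sketch below uses only the voice names and tags listed in this section; build_prompt is a hypothetical helper, not part of any Orpheus package.

```python
import re

VOICES = {"tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe"}
EMOTION_TAGS = {
    "laugh", "chuckle", "giggle", "sigh", "groan",
    "yawn", "gasp", "cough", "sniffle",
}


def build_prompt(voice: str, text: str) -> str:
    """Return 'voice: text' after checking the voice and any inline <tag> names."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    unknown = set(re.findall(r"<(\w+)>", text)) - EMOTION_TAGS
    if unknown:
        raise ValueError(f"unsupported emotion tags: {sorted(unknown)}")
    return f"{voice}: {text.strip()}"
```

Rejecting unknown tags is a deliberate choice here: it surfaces typos such as <laughs> instead of sending them to the model as ordinary text.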
Example prompt:
leo: I cannot believe you said that. <sigh> Let's start over.
Multilingual: French, German, Spanish, Italian, Mandarin, Korean, Hindi
Canopy's April 2025 multilingual research preview ships eight languages and 24 voices in total. The models are listed under the canopylabs/orpheus-tts Hugging Face collection, e.g. canopylabs/3b-fr-ft-research_release for French and canopylabs/3b-hi-pretrain-research_release for Hindi. Multilingual quants are not as well-curated as English; expect to use the upstream BF16 weights on Mac through MLX or full-precision llama.cpp if you need a non-English voice. Treat this tier as research-preview, not production.
How to choose: should you use Orpheus, Kokoro, Sesame CSM, or ElevenLabs v3?
| If you need… | Pick | Why |
|---|---|---|
| Fully local, voice cloning + emotion tags | Orpheus 3B | Only fully-local Apache-2.0 model that ships both |
| Smallest footprint, fastest CPU TTS | Kokoro 82M | ~30–45 s for a 1,500-word passage on M1 8 GB |
| Most natural conversational tone, paralinguistics | Sesame CSM 1B | Best at non-verbal cues; weaker voice cloning OOTB |
| Maximum expressiveness, no latency concern | ElevenLabs v3 (cloud) | GA 14 March 2026; 70+ languages; not real-time |
| Real-time conversational agent (cloud OK) | ElevenLabs Flash v2.5 | ~75 ms latency; lower quality than v3 |
If you want a head-to-head with concrete numbers, our Orpheus 3B vs Kokoro and Orpheus vs Sesame CSM 1B comparisons go deeper on the trade-offs.
Performance — concrete 2026 numbers
From CodeSOTA's 2026 speech leaderboard and Inferless's 2025 12-model comparison (links in References):
- Quality (MOS): Orpheus-3b-0.1-ft scores ~4.2 — close to Sesame CSM 1B and within striking distance of ElevenLabs v2 on conversational text. ElevenLabs v3 (closed) sits at the top of the leaderboard.
- Character error rate (CER): ~21% in the open-source comparison; Kokoro is lower at 17% but lacks voice cloning.
- Streaming latency: ~200 ms time-to-first-audio with batched generation, ~100 ms with input streaming, on RTX-class hardware. On M3/M4 expect 1.5–4× real-time generation with Q4_K_M, depending on context length.
- Audio: 24 kHz, 16-bit mono — fine for voice agents and audiobooks; not 48 kHz studio quality.
If you need cite-able benchmark detail, the CodeSOTA leaderboard is the most current public ranking that includes Orpheus.
Common pitfalls and troubleshooting
- "RuntimeError: Torch not compiled with CUDA enabled": you're running the upstream orpheus-speech package on Apple Silicon. Switch to the GGUF + LM Studio path. This is the issue #178 case.
- vLLM regression on Linux: pin vllm==0.7.3. The Canopy README still calls this out. (Not relevant for Mac users on the GGUF path.)
- Garbled or robotic audio: your prompt is being treated as text completion, not as a voice prompt. Make sure the prompt starts with voicename: (e.g. tara:) and that the LM Studio server is loading the Orpheus GGUF, not a base Llama-3.2.
- WAV is silent or 0 bytes: the SNAC frame parsing stopped early. Increase --max_tokens (the Lex-au server defaults to ORPHEUS_MAX_TOKENS, often 1024; bump to 4096 for paragraph-length output).
- Hugging Face login prompts: only required for gated multilingual research-release weights. The English Q4_K_M GGUF is not gated; you do not need a token for it.
- "Connection refused" from the Python client: LM Studio's local server is off. Re-open the Developer tab and click Start server. Confirm with curl http://127.0.0.1:1234/v1/models.
- Output is too fast / too slow: temperature and top_p matter for SNAC sampling. Canopy recommends temperature=0.6, top_p=0.9, repetition_penalty=1.1. Defaults of 0.0 produce monotone deliveries.
- Long generations cut off mid-word: increase --ctx-size in llama.cpp or LM Studio. Each second of audio uses ~150 SNAC tokens; a 30-second clip needs ~4500 tokens of headroom.
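The token-headroom rule of thumb in the last item is easy to turn into a quick calculation when sizing max tokens or context. This sketch takes the ~150 tokens per second figure from the list above at face value; treat it as an approximation and measure your own runs if the budget is tight.

```python
import math

TOKENS_PER_SECOND = 150  # approximate figure quoted in the troubleshooting list


def token_budget(audio_seconds: float, prompt_tokens: int = 0) -> int:
    """Tokens of headroom needed for a clip of the given length, plus the prompt."""
    return math.ceil(audio_seconds * TOKENS_PER_SECOND) + prompt_tokens


def max_audio_seconds(max_tokens: int) -> float:
    """Longest clip that fits in a given generation limit."""
    return max_tokens / TOKENS_PER_SECOND
```

By this estimate a 1024-token limit covers under seven seconds of audio, which is consistent with short clips finishing cleanly while paragraph-length output gets cut off.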
Production notes
- License: Apache 2.0 on both code and weights. The base Llama-3.2 license still governs the underlying weights; if you ship a commercial product, read both. Canopy's -ft models are research-friendly and explicitly cleared for commercial use under Apache 2.0.
- Throughput: for batch inference, run on a Linux box with vLLM and fp8 (Baseten partnership, May 2025). Apple Silicon is the right tool for one-user-at-a-time interactive use, not for scale-out batch synthesis.
- Voice cloning ethics: Orpheus does zero-shot cloning from ~10 seconds of reference audio. Get explicit consent from the speaker; in the EU, recorded voice is biometric data under the AI Act.
- If you're stitching Orpheus into a voice agent (Whisper-cpp for STT, a local LLM for reasoning, Orpheus for TTS), the architecture mirrors the one we describe in the OpenClaw vs LM Studio vs Ollama comparison. Hiring a Codersera engineer who has shipped this exact stack before usually saves the week of vLLM/Metal yak-shaving.
FAQ
Does Orpheus 3B run natively on Apple Silicon as of April 2026?
Not via the official Canopy Python package — it requires CUDA-built PyTorch and vLLM. The community path is GGUF through LM Studio (Metal) plus the orpheus-tts-local client. Issue #178 in the Canopy repo tracks the native ask but is unresolved.
How much RAM do I really need?
8 GB unified memory is enough for Q4_K_M Orpheus-3B-FT, but you'll see Metal eviction during long generations. 16 GB is the comfortable target. 24 GB or more lets you run Q8_0 plus a small reasoning LLM in parallel.
Can I clone my own voice?
Yes. Provide ~5–10 seconds of clean 24 kHz reference audio and prepend it to the prompt as a voice exemplar. The Canopy README has an explicit zero-shot-cloning recipe; quality scales with sample length and recording cleanliness.
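Before feeding a clip into a cloning recipe, a quick check of its sample rate and length can save a confusing failure later. The sketch below only inspects the WAV header against the 24 kHz and ~5 second figures mentioned in this answer; it cannot judge how clean the recording is, and check_reference_wav is a hypothetical helper.

```python
import wave
from typing import List


def check_reference_wav(path: str, expected_rate: int = 24000,
                        min_seconds: float = 5.0) -> List[str]:
    """Return a list of problems found; an empty list means the basics look fine."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        seconds = f.getnframes() / rate
    problems: List[str] = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if seconds < min_seconds:
        problems.append(f"clip is {seconds:.1f} s, shorter than {min_seconds:.1f} s")
    return problems
```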
What's the difference between orpheus-3b-0.1-pretrained and orpheus-3b-0.1-ft?
-pretrained is the 100k-hour base — useful as a starting point for fine-tunes and downstream tasks. -ft is the production fine-tune with the 8 named voices and emotion-tag training. For TTS, always use -ft.
Does it support languages other than English?
Yes — the April 2025 multilingual research preview adds French, German, Spanish, Italian, Mandarin, Korean, and Hindi (24 voices total). Treat them as research-grade; English is the only model that hit production maturity.
How does Orpheus compare to ElevenLabs v3?
ElevenLabs v3 (GA 14 March 2026) is more expressive and supports 70+ languages, but is closed-source, costs $0.17–$0.30 per 1,000 characters, and explicitly trades latency for quality (not real-time). Orpheus is the leading fully-local Apache-2.0 alternative with voice cloning and emotion tags.
Can I use Orpheus commercially?
Yes — both code and weights are Apache 2.0. The base Llama-3.2 license still applies to the underlying weights, so include attribution. Voice cloning specifically is your responsibility re: consent and biometrics regulation.
Why is my output choppy or popping at frame boundaries?
SNAC's standard decoder doesn't smooth across frame boundaries; Canopy ships a sliding-window CNN decoder for streaming. If you're rolling your own pipeline (not using orpheus-tts-local), make sure you're using their detokenizer, not the vanilla SNAC one.
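The sliding-window idea can be sketched without any audio library: rather than decoding each frame in isolation, decode an overlapping window of recent frames so every chunk has context on both sides. The window length of 4 frames is an assumption for illustration, and which slice of each decoded window to keep is left to the decoder you use; check Canopy's detokenizer for the real values.

```python
from typing import Iterable, Iterator, List

TOKENS_PER_FRAME = 7
WINDOW_FRAMES = 4  # ASSUMPTION: illustrative window length, not a confirmed value


def sliding_windows(tokens: Iterable[int],
                    window_frames: int = WINDOW_FRAMES) -> Iterator[List[int]]:
    """Yield the most recent window_frames frames each time a new frame completes."""
    size = window_frames * TOKENS_PER_FRAME
    buffer: List[int] = []
    for tok in tokens:
        buffer.append(tok)
        if len(buffer) % TOKENS_PER_FRAME == 0 and len(buffer) >= size:
            yield buffer[-size:]


# Five frames of dummy tokens produce two overlapping windows.
windows = list(sliding_windows(range(35)))
```

Consecutive windows share three of their four frames, so a decoder that keeps only the middle portion of each window never has to emit audio from an edge that lacks context.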
References & further reading
- canopyai/Orpheus-TTS — official GitHub (architecture, training notes, Baseten/vLLM partnership)
- canopylabs/orpheus-3b-0.1-ft — Hugging Face model card (fine-tuned production model)
- canopylabs/orpheus-tts — Hugging Face collection (all variants including multilingual research releases)
- isaiahbjork/orpheus-tts-local — LM Studio Mac client (the path this guide recommends)
- canopyai/Orpheus-TTS issue #178 — Apple Silicon support request (open as of April 2026)
- llama.cpp issue #12476 — Orpheus support tracking
- CodeSOTA Speech Leaderboard 2026 (current TTS/STT benchmarks including Orpheus)
- Inferless: 12 best open-source TTS models compared (CER/MOS numbers cited above)
- ElevenLabs v3 product page (closed-source benchmark for context)
- Canopy Labs — multilingual release announcement