Last updated April 2026 — refreshed for current model/tool versions.
Orpheus 3B TTS and Sesame CSM 1B remain the two most-discussed open-source speech synthesis models for developers who need either expressive emotional control or contextual conversational realism. This post compares their architectures, benchmark data, hardware requirements, and integration patterns — with 2026-current numbers replacing the original estimates, and new context on the expanded Orpheus variant family and the competitive landscape that has shifted around both models.
What changed in 2026 (if you read the original March 2025 post):
- Orpheus variant family expanded: Seven multilingual model pairs (English, French, Spanish, German, Italian, Portuguese, Chinese) were released in the April 2025 research preview. The core model is now canopylabs/orpheus-3b-0.1-ft on Hugging Face, based on Llama-3.2-3B (not Llama-3B as originally stated).
- CSM-1B license changed to Apache 2.0: The model now uses Apache 2.0, not MIT as originally reported. It was also integrated natively into Hugging Face Transformers v4.52.1 (May 2025), making deployment significantly easier.
- MOS scores updated: Independent 2026 leaderboard data (CodeSOTA) places Sesame CSM at 4.7 MOS and Orpheus at 4.6 MOS; the original 4.8 / 4.2 split was based on internal or unverified evaluations.
- New major competitors: Dia 1.6B (Nari Labs, April 2025), Kokoro-82M, and Chatterbox-Turbo have entered the open-source TTS field and affect the decision between Orpheus and CSM in some use cases.
- AWS pricing revised: The g5.8xlarge now costs approximately $2.45/hr on-demand, not $0.12/hr as originally stated; the original figure likely referenced a different instance or was incorrect. Spot pricing varies; verify on AWS directly.
- ElevenLabs v3 is GA (March 2026): The commercial baseline has advanced significantly, with audio-tag emotional control, 70+ language support, and a new eleven_v3_conversational model, which is relevant context for teams weighing open-source against paid APIs.
TL;DR Comparison
| Dimension | Orpheus 3B TTS | Sesame CSM 1B |
|---|---|---|
| Architecture | Llama-3.2-3B backbone + SNAC tokenizer (24kHz) | 1B Llama backbone + 100M Mimi audio decoder |
| MOS (2026 leaderboard) | 4.6 | 4.7 |
| Streaming latency | ~100–200ms | ~50–150ms |
| VRAM requirement | ~12GB+ GPU (fp16), quantized runs on 8GB | ~2–4GB GPU; CPU fallback supported |
| Emotional control | Explicit XML/tag directives (<laugh>, <sigh>, etc.) | Implicit, derived from conversation context |
| Voice cloning | Zero-shot (text+speech pair conditioning) | Context-prompt approach (reference utterances) |
| Multi-turn dialogue | Single-turn primary design | Native multi-turn context window (4096 tokens) |
| Languages (2026) | 7 (English plus multilingual research release) | English primary; limited other language support |
| License | Apache 2.0 | Apache 2.0 |
| Best for | Audiobooks, gaming NPCs, expressive narration | Conversational agents, call centers, IoT/edge |
Architectural Foundations
Orpheus 3B TTS
Released on March 18, 2025 by Canopy Labs, Orpheus 3B uses Meta's Llama-3.2-3B-Instruct as its backbone — not the original Llama-3B as widely reported. The model is finetuned for high-quality, empathetic text-to-speech generation via the canopylabs/orpheus-3b-0.1-ft checkpoint. Key architectural facts:
- Token sequences of 8192 tokens during training, allowing longer utterance generation without degradation
- SNAC audio tokenizer at 24kHz for waveform generation (not 48kHz as previously reported)
- Explicit emotion tag system: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>. These are the eight supported emotion markers as of the current release, not the "32 defined states" in the original post
- Eight built-in English voices: Tara, Leah, Jess, Leo, Dan, Mia, Zac, Zoe
- Trained on 100,000+ hours of English speech combined with billions of textual QA tokens
For local AI inference pipelines, Orpheus integrates cleanly with Ollama via the community-maintained legraphista/Orpheus model — making it compatible with setups described in the OpenClaw + Ollama setup guide for running local AI agents if you want a unified local orchestration layer.
Sesame CSM 1B
Released on February 27, 2025 by Sesame AI Labs, CSM-1B is architecturally more conservative and deployment-friendly. The model employs:
- 1B Llama-based backbone paired with a 100M transformer decoder that produces Mimi split-RVQ audio tokens
- 4096-token audio context window, enabling the model to track roughly 8 minutes of prior conversation and modulate output accordingly
- No pre-trained speaker voices: Voice identity is established by passing reference audio utterances as context — this is the intended design, not a limitation
- Native Hugging Face Transformers support from v4.52.1 (May 2025), enabling standard from_pretrained() loading, Trainer-based finetuning, and CUDA graph compilation
Sesame trained three model sizes internally (CSM-1B, CSM-3B, CSM-8B) but has only released the 1B publicly. The larger variants power Sesame's commercial interactive voice demo.
Performance and Benchmark Data (2026)
The CodeSOTA Speech AI Benchmark leaderboard (April 2026) offers the most systematic cross-model MOS evaluation currently available. Key findings relevant to this comparison:
| Model | MOS (April 2026) | Type | Notes |
|---|---|---|---|
| ElevenLabs Turbo v2.5 | 4.8 | Cloud API | Current SOTA, not self-hostable |
| Sesame CSM 1B | 4.7 | Open source | Best open-source conversational quality |
| OpenAI TTS HD | 4.7 | Cloud API | — |
| Cartesia Sonic 2 | 4.7 | Cloud API | ~90ms TTFB |
| Orpheus TTS 3B | 4.6 | Open source | Best open-source for expressive control |
| Kokoro-82M | ~4.5 | Open source | Fastest; 36× real-time on free Colab GPU |
Important caveat: The CodeSOTA leaderboard notes that "MOS is subjective. Vendors publish different listener panels and reference tracks; direct comparison below 0.1 MOS should be treated as noise." The meaningful takeaway is that Sesame CSM and Orpheus are now in the same quality band as most commercial APIs — not substantially behind.
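To make the noise caveat concrete, here is a minimal sketch that applies the 0.1 MOS threshold to the scores in the table above. The helper names (`mos_gap`, `within_noise`) are illustrative, and reading "below 0.1" as a strict comparison is an assumption about how the leaderboard intends the rule.

```python
# MOS values copied from the leaderboard table above (Kokoro's approximate score is omitted).
MOS = {
    "ElevenLabs Turbo v2.5": 4.8,
    "Sesame CSM 1B": 4.7,
    "OpenAI TTS HD": 4.7,
    "Cartesia Sonic 2": 4.7,
    "Orpheus TTS 3B": 4.6,
}

# Assumption: gaps strictly below this value are treated as noise.
NOISE_THRESHOLD = 0.1


def mos_gap(model_a: str, model_b: str) -> float:
    """Absolute MOS difference, rounded to hide float artifacts such as 0.1000000001."""
    return round(abs(MOS[model_a] - MOS[model_b]), 2)


def within_noise(model_a: str, model_b: str) -> bool:
    """True when the gap between two models is strictly below the noise threshold."""
    return mos_gap(model_a, model_b) < NOISE_THRESHOLD
```

Under this strict reading, CSM and the 4.7-rated commercial APIs are indistinguishable, while the 0.1 gap between CSM and Orpheus sits exactly on the threshold, so it is too small to support a strong ranking claim either way.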
Hardware and Inference Requirements
| Metric | Orpheus 3B | Sesame CSM 1B |
|---|---|---|
| Streaming latency | ~200ms (reducible to ~100ms with input streaming) | ~50–150ms |
| GPU VRAM (fp16) | ~12GB (fits on RTX 3090/4090) | ~2–4GB GPU; CPU supported |
| Quantized options | 4-bit BnB, GGUF Q8 (community releases) | 2 quantizations available on HF |
| Cloud inference cost | ~$2.45/hr on AWS g5.8xlarge (on-demand A10G) | Much lower — runs on CPU or small GPU VMs |
| Edge deployment | Not ideal; minimum 8GB VRAM for quantized | Raspberry Pi 5 (8GB) viable for CPU path |
| Together AI API | Available (serverless, pay-per-token) | Available via DeepInfra |
AWS pricing note: The original post cited $0.12/hr for Orpheus inference on AWS, which appears to have been incorrect. The AWS EC2 g5.8xlarge (NVIDIA A10G, 24GB VRAM) is priced at approximately $2.45/hr on-demand as of early 2026. Spot pricing can be 60–70% lower. For production deployments, Together AI's serverless endpoint removes the fixed-cost overhead entirely.
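As a rough budgeting aid, the sketch below turns the figures in the pricing note into an estimated spot price range and a monthly on-demand bill. It uses only the numbers quoted above; the function names are illustrative, and actual spot prices fluctuate by region and time, so treat the output as an estimate.

```python
# Figures taken from the pricing note above.
ON_DEMAND_USD_PER_HOUR = 2.45        # g5.8xlarge, on-demand
SPOT_DISCOUNT_RANGE = (0.60, 0.70)   # "60-70% lower" than on-demand


def spot_price_range(on_demand: float, discounts: tuple[float, float]) -> tuple[float, float]:
    """Return the (cheapest, priciest) estimated spot price per hour."""
    smaller_discount, larger_discount = discounts
    cheapest = round(on_demand * (1 - larger_discount), 3)
    priciest = round(on_demand * (1 - smaller_discount), 3)
    return cheapest, priciest


def monthly_cost(usd_per_hour: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Cost of keeping one instance running for the given schedule."""
    return round(usd_per_hour * hours_per_day * days, 2)
```

With the quoted numbers, `spot_price_range(ON_DEMAND_USD_PER_HOUR, SPOT_DISCOUNT_RANGE)` gives roughly $0.74 to $0.98 per hour, and an always-on on-demand instance comes to about $1,764 over a 30-day month.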
Feature Differentiation
Emotion Control
This is where the two models diverge most sharply in their design philosophy:
Orpheus — explicit emotion tags: The model accepts inline markup within the text prompt. For example:
text = "I can't believe that just happened! <gasp> Are you serious right now?"
The supported tags are: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>. Note that the original post claimed "32 defined emotional states" — this is not accurate per the current model card. Canopy provides exactly eight emotion tags at this release.
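Because only eight tags are documented, it can help to check prompts before sending them to the model. The following is a minimal sketch of such a check; the function name and the regular expression are illustrative, and it assumes tags are matched case-sensitively and written without attributes.

```python
import re

# The eight emotion tags listed above.
SUPPORTED_TAGS = frozenset(
    {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}
)

# Matches simple angle-bracket tags such as <gasp>; closing tags and attributes are not handled.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>")


def find_unsupported_tags(prompt: str) -> list[str]:
    """Return, in order of appearance, every tag that is not one of the eight documented ones."""
    return [tag for tag in TAG_PATTERN.findall(prompt) if tag not in SUPPORTED_TAGS]
```

An empty result means every tag in the prompt is one of the documented eight; anything else is worth fixing before synthesis.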
Sesame CSM — context-driven modulation: CSM does not take explicit emotion directives. Instead, it reads tonal cues from the prior conversation turns passed as context. The result is subtler, more naturalistic variation — appropriate for genuine conversational flow but less useful when you need a specific emotion on demand.
Voice Cloning
- Orpheus: Zero-shot cloning using text+speech pair conditioning in the prompt. Reliability improves with more reference pairs. The 98.7% similarity score cited in the original post comes from an unverified internal evaluation and should not be taken as a benchmark figure.
- Sesame CSM: Context-prompt approach — you pass prior utterances (with associated audio) per speaker as the model's context. CSM has no pre-built voices by design; voice identity emerges from what you provide. A minimum of 30 seconds of reference audio produces noticeably better consistency.
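The context-prompt approach for CSM can be sketched as a small helper that assembles reference utterances and the target text into a conversation list. The `Turn` class and `build_conversation` function are hypothetical names, and the `{"type": "audio", "path": ...}` entry shape is an assumption based on the Transformers chat-template format; verify it against the CSM documentation for your installed version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    speaker_id: int
    text: str
    audio_path: Optional[str] = None  # reference recording of this utterance, if available


def build_conversation(context: list[Turn], speaker_id: int, text: str) -> list[dict]:
    """Build a conversation list: reference turns first, the text to synthesize last."""
    conversation = []
    for turn in context:
        content = [{"type": "text", "text": turn.text}]
        if turn.audio_path is not None:
            # Assumed entry shape for reference audio; check the Transformers CSM docs.
            content.append({"type": "audio", "path": turn.audio_path})
        conversation.append({"role": str(turn.speaker_id), "content": content})
    # The final turn carries text only: it is the utterance to be generated.
    conversation.append(
        {"role": str(speaker_id), "content": [{"type": "text", "text": text}]}
    )
    return conversation
```

Keeping each speaker's reference turns paired with their transcripts is the point of the structure: the model infers voice identity from those pairs rather than from a named preset.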
Conversational Persistence
- Orpheus: Primarily single-turn architecture. While multi-turn use is possible by chaining calls, there is no built-in cross-turn state.
- Sesame CSM: Natively multi-turn. The 4096-token context window (approximately 8 minutes of audio) allows the model to adapt intonation, pace, and emotional register across a long conversation. This is CSM's defining architectural advantage.
Language Support (April 2026)
- Orpheus: The April 2025 research preview released seven multilingual model pairs: English, French, Spanish, German, Italian, Portuguese, and Chinese. Pretrained and finetuned variants exist for each language. Production quality varies by language; English remains best-supported.
- Sesame CSM: English-primary. The model's Llama backbone supports other languages at a token level, but training data was predominantly English speech. Community finetunes for other languages exist (see Speechmatics' guide on finetuning CSM).
Technical Implementation
Orpheus 3B — Current Setup
Install via the official GitHub repository:
git clone https://github.com/canopyai/Orpheus-TTS
cd Orpheus-TTS
pip install -r requirements.txt
Basic generation with emotion tags:
from orpheus_tts import OrpheusModel
model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")
audio = model.generate_speech(
    prompt="That's incredible! <gasp> I had no idea you could do that.",
    voice="tara",  # one of 8 built-in voices
    repetition_penalty=1.1,
)
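To save the result to disk, a small WAV writer is usually enough. The sketch below assumes that generate_speech yields raw 16-bit mono PCM byte chunks at 24kHz (the SNAC rate mentioned earlier) rather than a single buffer; check this against the version of the repository you have installed. The helper name `write_pcm_chunks` is illustrative and uses only the Python standard library.

```python
import wave
from typing import Iterable


def write_pcm_chunks(chunks: Iterable[bytes], path: str, sample_rate: int = 24000) -> float:
    """Write 16-bit mono PCM chunks to a WAV file and return the duration in seconds."""
    total_frames = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            wav_file.writeframes(chunk)     # chunks are written as they arrive
            total_frames += len(chunk) // 2
    return total_frames / sample_rate
```

Under that assumption, passing the return value of generate_speech as `chunks` writes audio incrementally, which also works when the output is streamed.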
For quantized local inference (lower VRAM), use the community GGUF variant:
ollama pull legraphista/Orpheus
For serverless production use without managing GPU infrastructure:
import together
client = together.Together()
response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="Your text here with <laugh> emotion tags",
    voice="leo",
)
Sesame CSM 1B — Current Setup
As of Transformers v4.52.1, CSM loads natively:
from transformers import CsmForConditionalGeneration, AutoProcessor
import torch
# Load the processor and the model weights from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("sesame/csm-1b")
model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b", torch_dtype=torch.bfloat16)
model = model.to("cuda")
# Build conversation context (the role is the speaker ID)
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "Hello! How can I help you today?"}]},
    {"role": "1", "content": [{"type": "text", "text": "I need some advice about my project."}]},
    {"role": "0", "content": [{"type": "text", "text": "Of course. Tell me more about what you're working on."}]},
]
# Tokenize the conversation with the model's chat template
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")
# output_audio=True returns decoded audio instead of raw codec tokens
audio = model.generate(**inputs, output_audio=True)
The CsmForConditionalGeneration API is the recommended path as of 2025. The older standalone ConversationEngine pattern from the original post reflected pre-Transformers integration code.
Optimal Use Cases
Choose Orpheus 3B When:
- Audiobook production: Named-character narration with fine-grained emotional delivery per scene
- Game NPC voices: Procedurally varied dialogue with explicit emotional beats synchronized to game state
- Assistive technology: Screen readers that need affective modulation (grief counseling apps, accessibility tools)
- Podcast synthesis: Single-narrator, scripted content where you want exact tone control
- Multilingual content: Projects spanning English, French, Spanish, German, Italian, Portuguese, or Chinese
Choose Sesame CSM 1B When:
- Live conversational agents: Call center automation, customer support bots, where each turn should reflect the prior exchange
- Smart home / IoT: Edge deployment on Raspberry Pi 5 (8GB) or similar ARM hardware via the CPU path
- E-learning tutors: Session-length conversations where emotional register must evolve naturally
- Real-time voice APIs: Sub-150ms latency requirements with limited GPU budget
- Research / finetuning: The HF Trainer integration makes domain-specific voice adaptation straightforward
New Competitors to Know in 2026
The open-source TTS landscape has expanded significantly since early 2025. Two models now affect the Orpheus-vs-CSM decision:
- Dia 1.6B (Nari Labs, April 2025): Targets the same multi-speaker dialogue space as CSM but generates expressive nonverbal sounds (laughs, coughs, throat-clearing) from written cues like (laughs), a capability Orpheus covers with tags and CSM only approximates through context. Dia is the better choice for scripted multi-character audio drama or podcast dialogue, but lacks CPU fallback and has no multilingual research release. License: Apache 2.0.
- Kokoro-82M (Apache 2.0): At 82 million parameters, Kokoro runs at 36× real-time on a free Google Colab GPU and achieves MOS ~4.5 in independent testing. It has no voice cloning and limited emotion control, but for simple single-voice narration at very low cost, it outperforms both Orpheus and CSM on speed-to-first-audio.
Teams building production voice pipelines that need both fast iteration and quality should evaluate Kokoro for a quick baseline before investing in the larger models' infrastructure requirements.
Where ElevenLabs v3 Fits
ElevenLabs v3 reached general availability on March 14, 2026. It introduces audio tags for emotional control (similar to Orpheus's tag system), 70+ language support, and a new eleven_v3_conversational model for agent applications. The GA version costs approximately $0.17–0.30 per 1,000 characters depending on plan tier and is not suited to real-time streaming (ElevenLabs recommends Flash v2.5 for sub-200ms use cases).
For teams that need commercial SLA, managed infrastructure, and 70+ languages, ElevenLabs v3 is competitive. For teams with GPU budget, data-privacy requirements, or on-device deployment needs, Orpheus and CSM remain the best self-hosted alternatives — and they are now close enough in MOS (4.6–4.7 vs. ElevenLabs Turbo's 4.8) that the quality gap is no longer a disqualifier.
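One way to weigh the API against self-hosting is a break-even estimate: how many characters per GPU-hour you need to synthesize before the hourly GPU bill is lower than the per-character fee. The sketch below uses only figures quoted in this post ($0.17–0.30 per 1,000 characters and the ~$2.45/hr on-demand g5.8xlarge rate); the function names are illustrative, and the model ignores engineering time, idle GPU hours, storage, and egress.

```python
import math

# Figures quoted in this post.
GPU_USD_PER_HOUR = 2.45               # on-demand g5.8xlarge
API_USD_PER_1K_CHARS = (0.17, 0.30)   # ElevenLabs v3 range, depending on plan tier


def api_cost(characters: int, usd_per_1k_chars: float) -> float:
    """API cost in USD for synthesizing the given number of characters."""
    return round(characters / 1000 * usd_per_1k_chars, 2)


def break_even_chars_per_hour(gpu_usd_per_hour: float, usd_per_1k_chars: float) -> int:
    """Characters per GPU-hour above which self-hosting costs less than the API."""
    return math.ceil(gpu_usd_per_hour / usd_per_1k_chars * 1000)
```

At the quoted rates the break-even falls between roughly 8,200 and 14,400 characters per GPU-hour, so sustained workloads favor self-hosting while occasional, bursty use favors the API.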
How to Choose — Decision Guide
1. Do you need real-time multi-turn dialogue? Yes → Sesame CSM 1B. No → proceed to 2.
2. Do you need explicit, per-word emotional control? Yes → Orpheus 3B. No → proceed to 3.
3. Is VRAM your hard constraint (under 8GB GPU)? Yes → CSM 1B (CPU path) or Kokoro-82M. No → proceed to 4.
4. Do you need scripted multi-speaker dialogue with nonverbal sounds? Yes → consider Dia 1.6B alongside Orpheus. No → proceed to 5.
5. Is multilingual support (beyond English) required? Yes → Orpheus (7-language research release). For CSM, check community finetunes for your specific target language. No → proceed to 6.
6. Is a managed API with 70+ languages acceptable at cost? Yes → ElevenLabs Flash v2.5 for real-time, v3 for quality.
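The decision guide can also be expressed as a short function, which is handy when the same questions come up across several projects. This is a sketch that mirrors the order of the questions above; the `Requirements` fields, the default values, and the fallback message at the end are illustrative additions, not part of the guide itself.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    realtime_multi_turn: bool = False
    explicit_emotion_control: bool = False
    gpu_vram_gb: float = 24.0
    scripted_multi_speaker_nonverbal: bool = False
    needs_non_english: bool = False
    managed_api_ok: bool = False


def recommend(req: Requirements) -> str:
    """Walk the questions in order and return the first matching recommendation."""
    if req.realtime_multi_turn:
        return "Sesame CSM 1B"
    if req.explicit_emotion_control:
        return "Orpheus 3B"
    if req.gpu_vram_gb < 8:
        return "Sesame CSM 1B (CPU path) or Kokoro-82M"
    if req.scripted_multi_speaker_nonverbal:
        return "Dia 1.6B alongside Orpheus 3B"
    if req.needs_non_english:
        return "Orpheus 3B (7-language research release)"
    if req.managed_api_ok:
        return "ElevenLabs managed API"
    # Fallback added for this sketch; the guide does not define a default.
    return "No single pick: benchmark Orpheus 3B and Sesame CSM 1B on your own data"
```

Because the checks run in order, earlier questions take priority: a project that needs both real-time multi-turn dialogue and explicit emotion tags resolves to CSM first, which matches how the guide is written.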
Common Pitfalls and Troubleshooting
Orpheus 3B
- Output quality varies with temperature: The default sampling temperature can produce disfluencies on longer texts. Start with temperature=0.6 and repetition_penalty=1.1.
- Unsupported emotion tags generate noise: Tags outside the eight documented ones are passed through as literal text or ignored silently. Validate your markup before production.
- GGUF quantization reduces emotion reliability: 4-bit quantized variants (BnB, GGUF) can flatten emotional differentiation. Use fp16 for expressive output if VRAM allows.
- Voice cloning requires paired references: Passing a single audio clip without a matching transcript reduces cloning fidelity. Always provide text+audio pairs.
Sesame CSM 1B
- Empty context → generic voice: Running CSM without context utterances produces a flat, undifferentiated output. Always seed the model with at least one speaker reference segment.
- CUDA 12.4+ required: The native Transformers implementation was tested on CUDA 12.4 and 12.6. Older CUDA versions may require the standalone repository path.
- Transformers version pin: Requires transformers>=4.52.1. Older versions fail silently or fall back to an incompatible code path.
- Non-English output is weak without finetuning: The base CSM-1B was trained primarily on English speech. For non-English use, apply a language-specific finetune (see Speechmatics' finetuning guide).
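Since older Transformers versions can fail without a clear error, an explicit version check at startup is a cheap safeguard. The sketch below compares version strings numerically using only the standard library; the function names are illustrative, and pre-release suffixes such as `rc1` or `.dev0` are ignored, which is a simplification.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '4.53.0.dev0' -> (4, 53, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = "4.52.1") -> bool:
    """True when the installed version is at least the minimum (numeric comparison)."""
    return parse_version(installed) >= parse_version(minimum)
```

In an application you would pass the installed version string, for example `transformers.__version__`, and raise a descriptive error when the check fails. Comparing parsed tuples instead of raw strings matters here: as plain strings, "4.9.0" would sort after "4.52.1".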
FAQ
Can I run Orpheus 3B without a GPU?
Yes, but performance degrades. The official repository supports Llama.cpp-based CPU inference via the GGUF community weights (lex-au/Orpheus-3b-FT-Q8_0.gguf). Expect 5–10× slower generation than GPU, making real-time streaming impractical on most consumer CPUs. For CPU-only deployments, Kokoro-82M or Sesame CSM 1B's CPU path are better fits.
What is Sesame's larger model (CSM-3B / CSM-8B)? Will they release it?
Sesame trained CSM-1B, CSM-3B, and CSM-8B. As of April 2026, only CSM-1B has been open-sourced. The 3B and 8B models power Sesame's commercial voice product. There is no announced timeline for their release.
Does Orpheus 3B support voice cloning from arbitrary reference speakers?
The pretrained checkpoint (canopylabs/orpheus-3b-0.1-pretrained) supports zero-shot cloning via text+speech pair conditioning. The finetuned checkpoint (canopylabs/orpheus-3b-0.1-ft) is optimized for its eight built-in voices. For custom speaker cloning, use the pretrained model and pass multiple reference pairs.
How does Dia 1.6B relate to Sesame CSM?
Both target conversational speech. Dia 1.6B (Nari Labs) generates nonverbal sounds from parenthetical cues and is better for scripted multi-character audio. CSM-1B adapts naturally to live conversational context and supports real-time inference with CPU fallback. They are complementary, not direct substitutes.
Is ElevenLabs v3 better than Orpheus/CSM for most production use?
On raw MOS (4.8 vs. 4.6–4.7), ElevenLabs Turbo v2.5 leads. But v3 GA cannot stream in real-time, costs $0.17–0.30/1,000 characters, and requires sending your text and any reference audio to a third-party API. Self-hosted Orpheus or CSM gives you privacy, no per-character cost, and now a comparable quality range for most applications.
What are the best cloud providers for running these models in 2026?
For Orpheus 3B: Together AI offers a serverless endpoint with no infrastructure overhead. For custom hosting, AWS g5.xlarge (~$1.006/hr, A10G, 24GB VRAM) handles full fp16 inference. For CSM 1B: DigitalOcean GPU Droplets (NVIDIA L4, starting ~$0.50/hr) are well-documented and cheap for the model's smaller footprint. DeepInfra also provides a hosted CSM-1B endpoint.
Can I finetune CSM-1B on my own voice data?
Yes. Native HF Transformers Trainer support (from v4.52.1) makes finetuning straightforward. Speechmatics published a practical guide for finetuning CSM on new languages and voice styles. Unsloth also supports Orpheus 3B finetuning with 4-bit quantization for memory-efficient fine-tuning on consumer GPUs.
Are these models suitable for medical or accessibility applications?
Both are Apache 2.0 licensed and technically suitable for commercial use including medical. Key considerations: neither model has been clinically validated; both carry standard AI ethics caveats against non-consensual voice cloning; Orpheus's emotion tags make it more appropriate for applications like ALS voice banking where specific affective qualities are required. Review Canopy's and Sesame's ethical use guidelines before deploying in sensitive contexts.
References and Further Reading
- Orpheus TTS — Official GitHub Repository (Canopy Labs)
- canopylabs/orpheus-3b-0.1-ft — Hugging Face Model Card
- sesame/csm-1b — Hugging Face Model Card
- Sesame CSM — Official GitHub Repository (Sesame AI Labs)
- CSM in Hugging Face Transformers — Official Documentation
- CodeSOTA Speech AI Benchmarks 2026 — TTS and STT Leaderboard
- 12 Best Open-Source TTS Models Compared (Inferless, 2025)
- How to Finetune Sesame CSM on New Languages and Voices (Speechmatics)