Orpheus 3B TTS vs. Sesame CSM 1B: AI Speech Synthesis Compared (2026)

Published 24 Mar 2025 • Updated 30 Apr 2026 • 11 min read

Last updated April 2026 — refreshed for current model/tool versions.

Orpheus 3B TTS and Sesame CSM 1B remain the two most-discussed open-source speech synthesis models for developers who need either expressive emotional control or contextual conversational realism. This post compares their architectures, benchmark data, hardware requirements, and integration patterns — with 2026-current numbers replacing the original estimates, and new context on the expanded Orpheus variant family and the competitive landscape that has shifted around both models.

What changed in 2026, if you read the original March 2025 post:

Orpheus variant family expanded: Seven multilingual model pairs (English, French, Spanish, German, Italian, Portuguese, Chinese) were released in an April 2025 research preview. The core model is now canopylabs/orpheus-3b-0.1-ft on Hugging Face, based on Llama-3.2-3B (not Llama-3B as originally stated).
CSM-1B license changed to Apache 2.0: The model now uses Apache 2.0, not MIT as originally reported. It was also integrated natively into Hugging Face Transformers v4.52.1 (May 2025), making deployment significantly easier.
MOS scores updated: Independent 2026 leaderboard data (CodeSOTA) places Sesame CSM at 4.7 MOS and Orpheus at 4.6 MOS. The original 4.8 / 4.2 split was based on internal or unverified evaluations.
New major competitors: Dia 1.6B (Nari Labs, April 2025), Kokoro-82M, and Chatterbox-Turbo have entered the open-source TTS field and affect the decision between Orpheus and CSM in some use cases.
AWS pricing revised: The g5.8xlarge now costs approximately $2.45/hr on-demand, not $0.12/hr as originally stated; the original figure likely referenced a different instance or was incorrect. Spot pricing varies; verify on AWS directly.
ElevenLabs v3 is GA (March 2026): The commercial baseline has advanced significantly, with audio-tag emotional control, 70+ language support, and a new eleven_v3_conversational model. This is relevant context for teams weighing open-source against paid APIs.

TL;DR Comparison

Dimension	Orpheus 3B TTS	Sesame CSM 1B
Architecture	Llama-3.2-3B backbone + SNAC tokenizer (24kHz)	1B Llama backbone + 100M Mimi audio decoder
MOS (2026 leaderboard)	4.6	4.7
Streaming latency	~100–200ms	~50–150ms
VRAM requirement	~12GB+ GPU (fp16), quantized runs on 8GB	~2–4GB GPU; CPU fallback supported
Emotional control	Explicit XML/tag directives (`<laugh>`, `<sigh>`, etc.)	Implicit — derived from conversation context
Voice cloning	Zero-shot (text+speech pair conditioning)	Context-prompt approach (reference utterances)
Multi-turn dialogue	Single-turn primary design	Native multi-turn context window (4096 tokens)
Languages (2026)	7 (English plus multilingual research release)	English primary; limited other language support
License	Apache 2.0	Apache 2.0
Best for	Audiobooks, gaming NPCs, expressive narration	Conversational agents, call centers, IoT/edge

Architectural Foundations

Orpheus 3B TTS

Released on March 18, 2025 by Canopy Labs, Orpheus 3B uses Meta's Llama-3.2-3B-Instruct as its backbone — not the original Llama-3B as widely reported. The model is finetuned for high-quality, empathetic text-to-speech generation via the canopylabs/orpheus-3b-0.1-ft checkpoint. Key architectural facts:

Token sequences of 8192 tokens during training, allowing longer utterance generation without degradation
SNAC audio tokenizer at 24kHz for waveform generation (not 48kHz as previously reported)
Explicit emotion tag system: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp> — these are the eight supported emotion markers as of the current release, not the "32 defined states" in the original post
Eight built-in English voices: Tara, Leah, Jess, Leo, Dan, Mia, Zac, Zoe
Trained on 100,000+ hours of English speech combined with billions of textual QA tokens

For local AI inference pipelines, Orpheus integrates with Ollama via the community-maintained legraphista/Orpheus model. That makes it compatible with setups like the one in the OpenClaw + Ollama setup guide for running local AI agents, if you want a unified local orchestration layer.

Sesame CSM 1B

Released on February 27, 2025 by Sesame AI Labs, CSM-1B is architecturally more conservative and deployment-friendly. The model employs:

1B Llama-based backbone paired with a 100M transformer decoder that produces Mimi split-RVQ audio tokens
4096-token audio context window, enabling the model to track roughly 8 minutes of prior conversation and modulate output accordingly
No pre-trained speaker voices: Voice identity is established by passing reference audio utterances as context — this is the intended design, not a limitation
Native Hugging Face Transformers support from v4.52.1 (May 2025), enabling standard from_pretrained() loading, Trainer-based finetuning, and CUDA graph compilation

Sesame trained three model sizes internally (CSM-1B, CSM-3B, CSM-8B) but has only released the 1B publicly. The larger variants power Sesame's commercial interactive voice demo.

Performance and Benchmark Data (2026)

The CodeSOTA Speech AI Benchmark leaderboard (April 2026) offers the most systematic cross-model MOS evaluation currently available. Key findings relevant to this comparison:

Model	MOS (April 2026)	Type	Notes
ElevenLabs Turbo v2.5	4.8	Cloud API	Current SOTA, not self-hostable
Sesame CSM 1B	4.7	Open source	Best open-source conversational quality
OpenAI TTS HD	4.7	Cloud API	—
Cartesia Sonic 2	4.7	Cloud API	~90ms TTFB
Orpheus TTS 3B	4.6	Open source	Best open-source for expressive control
Kokoro-82M	~4.5	Open source	Fastest; 36× real-time on free Colab GPU

Important caveat: The CodeSOTA leaderboard notes that "MOS is subjective. Vendors publish different listener panels and reference tracks; direct comparison below 0.1 MOS should be treated as noise." The meaningful takeaway is that Sesame CSM and Orpheus are now in the same quality band as most commercial APIs — not substantially behind.

Hardware and Inference Requirements

Metric	Orpheus 3B	Sesame CSM 1B
Streaming latency	~200ms (reducible to ~100ms with input streaming)	~50–150ms
GPU VRAM (fp16)	~12GB (fits on RTX 3090/4090)	~2–4GB GPU; CPU supported
Quantized options	4-bit BnB, GGUF Q8 (community releases)	2 quantizations available on HF
Cloud inference cost	~$2.45/hr on AWS g5.8xlarge (on-demand A10G)	Much lower — runs on CPU or small GPU VMs
Edge deployment	Not ideal; minimum 8GB VRAM for quantized	Raspberry Pi 5 (8GB) viable for CPU path
Together AI API	Available (serverless, pay-per-token)	Available via DeepInfra

AWS pricing note: The original post cited $0.12/hr for Orpheus inference on AWS, which appears to have been incorrect. The AWS EC2 g5.8xlarge (NVIDIA A10G, 24GB VRAM) is priced at approximately $2.45/hr on-demand as of early 2026. Spot pricing can be 60–70% lower. For production deployments, Together AI's serverless endpoint removes the fixed-cost overhead entirely.
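To make the spot discount concrete, the sketch below applies the 60–70% range to the $2.45/hr on-demand figure quoted above. The function name and the 730-hour month are illustrative assumptions, and real spot prices fluctuate, so treat the output as a rough estimate rather than a quote.

```python
ON_DEMAND_USD_PER_HOUR = 2.45  # g5.8xlarge on-demand figure quoted in this post
HOURS_PER_MONTH = 730          # assumption: average hours in a calendar month


def spot_price_range(on_demand: float, min_discount: float = 0.60,
                     max_discount: float = 0.70) -> tuple[float, float]:
    """Return (lowest, highest) hourly spot price for a discount range."""
    lowest = round(on_demand * (1 - max_discount), 3)
    highest = round(on_demand * (1 - min_discount), 3)
    return lowest, highest


low, high = spot_price_range(ON_DEMAND_USD_PER_HOUR)
print(f"Spot estimate: ${low:.3f} to ${high:.3f} per hour")
print(f"On-demand, always on: ${ON_DEMAND_USD_PER_HOUR * HOURS_PER_MONTH:,.2f} per month")
```

At these figures an always-on on-demand instance lands near $1,800 per month, which is why the serverless option is worth pricing out for bursty workloads.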

Feature Differentiation

Emotion Control

This is where the two models diverge most sharply in their design philosophy:

Orpheus — explicit emotion tags: The model accepts inline markup within the text prompt. For example:

text = "I can't believe that just happened! <gasp> Are you serious right now?"

The supported tags are: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>. Note that the original post claimed "32 defined emotional states" — this is not accurate per the current model card. Canopy provides exactly eight emotion tags at this release.
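Because the tag set is small and fixed, it is cheap to check a prompt before sending it to the model. The following is a minimal sketch using only the standard library; the helper name is hypothetical, and it assumes tags must match the lowercase spellings listed above exactly.

```python
import re

# The eight tags listed in this post; anything else is reported as unsupported.
SUPPORTED_TAGS = {"laugh", "chuckle", "sigh", "cough",
                  "sniffle", "groan", "yawn", "gasp"}

# Matches simple inline tags such as <gasp>; attributes are not handled.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>")


def unsupported_tags(prompt: str) -> list[str]:
    """Return tag names found in the prompt that are not in SUPPORTED_TAGS."""
    return [tag for tag in TAG_PATTERN.findall(prompt)
            if tag not in SUPPORTED_TAGS]


print(unsupported_tags("Well <sigh> that was a surprise <whisper>"))
```

An empty list means every tag in the prompt is one of the eight documented markers.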

Sesame CSM — context-driven modulation: CSM does not take explicit emotion directives. Instead, it reads tonal cues from the prior conversation turns passed as context. The result is subtler, more naturalistic variation — appropriate for genuine conversational flow but less useful when you need a specific emotion on demand.

Voice Cloning

Orpheus: Zero-shot cloning using text+speech pair conditioning in the prompt. Reliability improves with more reference pairs. The 98.7% similarity score cited in the original post comes from an unverified internal evaluation and should not be taken as a benchmark figure.
Sesame CSM: Context-prompt approach — you pass prior utterances (with associated audio) per speaker as the model's context. CSM has no pre-built voices by design; voice identity emerges from what you provide. A minimum of 30 seconds of reference audio produces noticeably better consistency.
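The 30-second guideline is easy to check before building a context. Here is a minimal sketch that sums the duration of reference WAV clips with Python's standard wave module. The helper names are hypothetical, and the silent clips exist only so the example runs without real recordings.

```python
import io
import wave

MIN_REFERENCE_SECONDS = 30.0  # guideline from this post, not an enforced limit


def wav_duration_seconds(source) -> float:
    """Duration of a WAV given a file path or a binary file-like object."""
    with wave.open(source, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def has_enough_reference(sources) -> bool:
    """True if the clips together reach the 30-second guideline."""
    total = sum(wav_duration_seconds(src) for src in sources)
    return total >= MIN_REFERENCE_SECONDS


def make_silent_wav(seconds: float, rate: int = 24000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (demo data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)  # rewind so the buffer can be read back
    return buf


clips = [make_silent_wav(12), make_silent_wav(20)]
print(has_enough_reference(clips))
```

In practice you would pass file paths to your own recordings instead of the generated buffers.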

Conversational Persistence

Orpheus: Primarily single-turn architecture. While multi-turn use is possible by chaining calls, there is no built-in cross-turn state.
Sesame CSM: Natively multi-turn. The 4096-token context window (approximately 8 minutes of audio) allows the model to adapt intonation, pace, and emotional register across a long conversation. This is CSM's defining architectural advantage.
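A long conversation will eventually exceed the context window, so the caller has to decide which turns to keep. One common approach, sketched below, is to keep the most recent turns that fit the 4096-token budget. The Turn type and the per-turn token counts are assumptions for illustration; in a real pipeline the counts would come from the model's processor.

```python
from dataclasses import dataclass

CONTEXT_BUDGET_TOKENS = 4096  # context window size cited in this post


@dataclass
class Turn:
    speaker: int
    text: str
    token_count: int  # assumption: measured by the caller with the real tokenizer


def trim_to_budget(turns: list[Turn],
                   budget: int = CONTEXT_BUDGET_TOKENS) -> list[Turn]:
    """Keep the most recent turns whose combined token count fits the budget."""
    kept: list[Turn] = []
    used = 0
    for turn in reversed(turns):           # walk from newest to oldest
        if used + turn.token_count > budget:
            break                          # this turn would overflow the window
        kept.append(turn)
        used += turn.token_count
    return list(reversed(kept))            # restore chronological order


history = [
    Turn(0, "Opening question", 2000),
    Turn(1, "Long answer", 1500),
    Turn(0, "Follow-up", 1800),
]
print([t.text for t in trim_to_budget(history)])
```

With this history the oldest turn is dropped, because keeping all three would need 5300 tokens.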

Language Support (April 2026)

Orpheus: The April 2025 research preview released seven multilingual model pairs: English, French, Spanish, German, Italian, Portuguese, and Chinese. Pretrained and finetuned variants exist for each language. Production quality varies by language; English remains best-supported.
Sesame CSM: English-primary. The model's Llama backbone supports other languages at a token level, but training data was predominantly English speech. Community finetunes for other languages exist (see Speechmatics' guide on finetuning CSM).

Technical Implementation

Orpheus 3B — Current Setup

Install via the official GitHub repository:

git clone https://github.com/canopyai/Orpheus-TTS
cd Orpheus-TTS
pip install orpheus-speech  # PyPI package that provides the orpheus_tts module

Basic generation with emotion tags:

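import wave
from orpheus_tts import OrpheusModel

model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")
# generate_speech returns a generator of raw 16-bit PCM chunks
audio_chunks = model.generate_speech(
    prompt="That's incredible! <gasp> I had no idea you could do that.",
    voice="tara",                     # one of 8 built-in voices
    repetition_penalty=1.1,
)
with wave.open("output.wav", "wb") as wf:  # mono, 16-bit, 24 kHz
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    for chunk in audio_chunks:
        wf.writeframes(chunk)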

For quantized local inference (lower VRAM), use the community GGUF variant:

ollama pull legraphista/Orpheus

For serverless production use without managing GPU infrastructure:

import together

client = together.Together()
response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="Your text here with <laugh> emotion tags",
    voice="leo",
)

Sesame CSM 1B — Current Setup

As of Transformers v4.52.1, CSM loads natively:

from transformers import CsmForConditionalGeneration, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("sesame/csm-1b")
model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b", torch_dtype=torch.bfloat16)
model = model.to("cuda")

# Build conversation context
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "Hello! How can I help you today?"}]},
    {"role": "1", "content": [{"type": "text", "text": "I need some advice about my project."}]},
    {"role": "0", "content": [{"type": "text", "text": "Of course. Tell me more about what you're working on."}]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "conversation_reply.wav")  # write the generated waveform to disk

The CsmForConditionalGeneration API is the recommended path as of 2025. The older standalone ConversationEngine pattern from the original post reflected pre-Transformers integration code.
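Writing the nested message dictionaries by hand gets tedious once a conversation has more than a few turns. The helper below is a small sketch that builds the same structure as the conversation list in the example above from (speaker_id, text) pairs. Its name is hypothetical and it covers text-only turns; reference audio entries would need to be added separately.

```python
def build_conversation(turns: list[tuple[int, str]]) -> list[dict]:
    """Convert (speaker_id, text) pairs into chat-template message dicts."""
    return [
        # The role is the speaker id as a string, matching the example above.
        {"role": str(speaker), "content": [{"type": "text", "text": text}]}
        for speaker, text in turns
    ]


conversation = build_conversation([
    (0, "Hello! How can I help you today?"),
    (1, "I need some advice about my project."),
])
print(conversation[1]["role"], conversation[1]["content"][0]["text"])
```

The resulting list can be passed to processor.apply_chat_template in place of the hand-written one.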

Optimal Use Cases

Choose Orpheus 3B When:

Audiobook production: Named-character narration with fine-grained emotional delivery per scene
Game NPC voices: Procedurally varied dialogue with explicit emotional beats synchronized to game state
Assistive technology: Screen readers that need affective modulation (grief counseling apps, accessibility tools)
Podcast synthesis: Single-narrator, scripted content where you want exact tone control
Multilingual content: Projects spanning English, French, Spanish, German, Italian, Portuguese, or Chinese

Choose Sesame CSM 1B When:

Live conversational agents: Call center automation, customer support bots, where each turn should reflect the prior exchange
Smart home / IoT: Edge deployment on Raspberry Pi 5 (8GB) or similar ARM hardware via the CPU path
E-learning tutors: Session-length conversations where emotional register must evolve naturally
Real-time voice APIs: Sub-150ms latency requirements with limited GPU budget
Research / finetuning: The HF Trainer integration makes domain-specific voice adaptation straightforward

New Competitors to Know in 2026

The open-source TTS landscape has expanded significantly since early 2025. Two models now affect the Orpheus-vs-CSM decision:

Dia 1.6B (Nari Labs, April 2025): Targets the same multi-speaker dialogue space as CSM but generates expressive nonverbal sounds (laughs, coughs, throat-clearing) from written cues like (laughs) — a capability Orpheus covers with tags and CSM only approximates through context. Dia is the better choice for scripted multi-character audio drama or podcast dialogue, but lacks CPU fallback and has no multilingual research release. License: Apache 2.0.
Kokoro-82M (Apache 2.0): At 82 million parameters, Kokoro runs at 36× real-time on a free Google Colab GPU and achieves MOS ~4.5 in independent testing. It has no voice cloning and limited emotion control, but for simple single-voice narration at very low cost, it outperforms both Orpheus and CSM on speed-to-first-audio.
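A real-time factor is easier to reason about as wall-clock time. The sketch below converts the 36× figure quoted for Kokoro into synthesis time for a given amount of audio; the function name is illustrative, and the result ignores model load time and I/O.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to synthesize audio at a real-time factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


one_hour_of_audio = 60 * 60
print(synthesis_seconds(one_hour_of_audio, 36))  # 100.0
```

At 36× real-time, an hour of narration takes roughly 100 seconds of compute.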

Teams building production voice pipelines that need both fast iteration and quality should evaluate Kokoro for a quick baseline before investing in the larger models' infrastructure requirements.

Where ElevenLabs v3 Fits

ElevenLabs v3 reached general availability on March 14, 2026. It introduces audio tags for emotional control (similar to Orpheus's tag system), 70+ language support, and a new eleven_v3_conversational model for agent applications. The GA version costs approximately $0.17–0.30 per 1,000 characters depending on plan tier and does not support real-time streaming (ElevenLabs recommends Flash v2.5 for sub-200ms use cases).

For teams that need commercial SLA, managed infrastructure, and 70+ languages, ElevenLabs v3 is competitive. For teams with GPU budget, data-privacy requirements, or on-device deployment needs, Orpheus and CSM remain the best self-hosted alternatives — and they are now close enough in MOS (4.6–4.7 vs. ElevenLabs Turbo's 4.8) that the quality gap is no longer a disqualifier.
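One way to frame the cost side of this trade-off is a break-even volume: the number of characters per hour at which API spend matches the hourly cost of a dedicated GPU. The sketch below uses only figures quoted in this post. It is a simplification that ignores engineering time, GPU throughput limits, and idle capacity.

```python
GPU_USD_PER_HOUR = 2.45               # g5.8xlarge on-demand figure from this post
API_USD_PER_1K_CHARS = (0.17, 0.30)   # ElevenLabs v3 range quoted in this post


def break_even_chars_per_hour(gpu_usd_per_hour: float,
                              api_usd_per_1k: float) -> int:
    """Characters per hour at which API spend equals the hourly GPU cost."""
    return round(gpu_usd_per_hour / api_usd_per_1k * 1000)


for rate in API_USD_PER_1K_CHARS:
    chars = break_even_chars_per_hour(GPU_USD_PER_HOUR, rate)
    print(f"At ${rate:.2f} per 1,000 characters: {chars:,} characters/hour")
```

Below roughly 8,000 to 14,000 characters per hour of sustained traffic, the API is likely the cheaper option on raw compute cost; above that, self-hosting starts to pay for itself.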

How to Choose — Decision Guide

1. Do you need real-time multi-turn dialogue? Yes → Sesame CSM 1B. No → proceed to 2.
2. Do you need explicit, per-word emotional control? Yes → Orpheus 3B. No → proceed to 3.
3. Is VRAM your hard constraint (under 8GB GPU)? Yes → CSM 1B (CPU path) or Kokoro-82M. No → proceed to 4.
4. Do you need scripted multi-speaker dialogue with nonverbal sounds? Yes → consider Dia 1.6B alongside Orpheus. No → proceed to 5.
5. Is multilingual support (beyond English) required? Yes → Orpheus (7-language research release); for CSM, check community finetunes for your specific target language. No → proceed to 6.
6. Is a managed API with 70+ languages acceptable at cost? Yes → ElevenLabs Flash v2.5 for real-time, v3 for quality.
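The guide above is a first-match decision list, which maps directly to a chain of early returns. The sketch below encodes the six questions in order. The Requirements fields and the final fallback message are illustrative assumptions, not part of the guide.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    realtime_multi_turn: bool = False
    explicit_emotion_control: bool = False
    under_8gb_vram: bool = False
    scripted_multi_speaker_nonverbal: bool = False
    multilingual: bool = False
    managed_api_ok: bool = False


def recommend(req: Requirements) -> str:
    """Walk the six questions in order and return the first match."""
    if req.realtime_multi_turn:
        return "Sesame CSM 1B"
    if req.explicit_emotion_control:
        return "Orpheus 3B"
    if req.under_8gb_vram:
        return "Sesame CSM 1B (CPU path) or Kokoro-82M"
    if req.scripted_multi_speaker_nonverbal:
        return "Dia 1.6B alongside Orpheus 3B"
    if req.multilingual:
        return "Orpheus 3B (7-language research release)"
    if req.managed_api_ok:
        return "ElevenLabs (managed API)"
    # Assumption: the guide does not say what to do when nothing matches.
    return "No single match; test Orpheus 3B and Sesame CSM 1B on your own data"


print(recommend(Requirements(under_8gb_vram=True, multilingual=True)))
```

Because the questions are ordered, earlier answers win: a project that is both VRAM-constrained and multilingual is routed by the VRAM question first.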

Common Pitfalls and Troubleshooting

Orpheus 3B

Output quality varies with temperature: The default sampling temperature can produce disfluencies on longer texts. Start with temperature=0.6 and repetition_penalty=1.1.
Unsupported emotion tags are not rendered: Tags outside the eight documented ones are passed through as literal text or silently ignored, so they never produce the intended sound. Validate your markup before production.
Quantization reduces emotion reliability: Quantized variants (4-bit BnB, GGUF) can flatten emotional differentiation. Use fp16 for expressive output if VRAM allows.
Voice cloning requires paired references: Passing a single audio clip without a matching transcript reduces cloning fidelity. Always provide text+audio pairs.

Sesame CSM 1B

Empty context → generic voice: Running CSM without context utterances produces a flat, undifferentiated output. Always seed the model with at least one speaker reference segment.
CUDA 12.4+ required: The native Transformers implementation was tested on CUDA 12.4 and 12.6. Older CUDA versions may require the standalone repository path.
Transformers version pin: Requires transformers>=4.52.1. Older versions fail silently or fall back to an incompatible code path.
Non-English output is weak without finetuning: The base CSM-1B was trained primarily on English speech. For non-English use, apply a language-specific finetune (see Speechmatics' finetuning guide).
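Since older Transformers versions can fail without a clear error, it can help to check the installed version at startup. The sketch below is a standard-library approach with hypothetical helper names. It compares only the leading numeric components; for full PEP 440 handling, the packaging library is the more robust choice.

```python
import re
from importlib import metadata

MINIMUM_TRANSFORMERS = "4.52.1"  # minimum version stated in this post


def parse_version(version: str) -> tuple[int, ...]:
    """Parse leading numeric components, e.g. '4.53.0.dev0' -> (4, 53, 0)."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break                      # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str = MINIMUM_TRANSFORMERS) -> bool:
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = metadata.version("transformers")
    status = "OK" if meets_minimum(installed) else "upgrade needed"
    print(f"transformers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

Running this before loading the model turns a silent failure into an explicit message.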

FAQ

Can I run Orpheus 3B without a GPU?

Yes, but performance degrades. The official repository supports llama.cpp-based CPU inference via the GGUF community weights (lex-au/Orpheus-3b-FT-Q8_0.gguf). Expect 5–10× slower generation than GPU, making real-time streaming impractical on most consumer CPUs. For CPU-only deployments, Kokoro-82M or Sesame CSM 1B's CPU path are better fits.

What is Sesame's larger model (CSM-3B / CSM-8B)? Will they release it?

Sesame trained CSM-1B, CSM-3B, and CSM-8B. As of April 2026, only CSM-1B has been open-sourced. The 3B and 8B models power Sesame's commercial voice product. There is no announced timeline for their release.

Does Orpheus 3B support voice cloning from arbitrary reference speakers?

The pretrained checkpoint (canopylabs/orpheus-3b-0.1-pretrained) supports zero-shot cloning via text+speech pair conditioning. The finetuned checkpoint (canopylabs/orpheus-3b-0.1-ft) is optimized for its eight built-in voices. For custom speaker cloning, use the pretrained model and pass multiple reference pairs.

How does Dia 1.6B relate to Sesame CSM?

Both target conversational speech. Dia 1.6B (Nari Labs) generates nonverbal sounds from parenthetical cues and is better for scripted multi-character audio. CSM-1B adapts naturally to live conversational context and supports real-time inference with CPU fallback. They are complementary, not direct substitutes.

Is ElevenLabs v3 better than Orpheus/CSM for most production use?

On raw MOS (4.8 vs. 4.6–4.7), ElevenLabs Turbo v2.5 leads. But v3 cannot stream in real time, costs $0.17–0.30 per 1,000 characters, and requires sending your text and any reference audio to a third-party API. Self-hosted Orpheus or CSM gives you privacy, no per-character cost, and quality that is now in a comparable range for most applications.

What are the best cloud providers for running these models in 2026?

For Orpheus 3B: Together AI offers a serverless endpoint with no infrastructure overhead. For custom hosting, AWS g5.xlarge (~$1.006/hr, A10G, 24GB VRAM) handles full fp16 inference. For CSM 1B: DigitalOcean GPU Droplets (NVIDIA L4, starting ~$0.50/hr) are well-documented and cheap for the model's smaller footprint. DeepInfra also provides a hosted CSM-1B endpoint.

Can I finetune CSM-1B on my own voice data?

Yes. Native HF Transformers Trainer support (from v4.52.1) makes finetuning straightforward. Speechmatics published a practical guide for finetuning CSM on new languages and voice styles. Unsloth also supports Orpheus 3B finetuning with 4-bit quantization for memory-efficient fine-tuning on consumer GPUs.

Are these models suitable for medical or accessibility applications?

Both are Apache 2.0 licensed, so commercial use, including medical applications, is permitted. Key considerations: neither model has been clinically validated; both carry standard AI ethics caveats against non-consensual voice cloning; and Orpheus's emotion tags make it the more appropriate choice for applications such as ALS voice banking, where specific affective qualities are required. Review Canopy's and Sesame's ethical use guidelines before deploying in sensitive contexts.
